We find GPT-4 judgments correlate strongly with humans, with human agreement with GPT-4 typically similar to or higher than inter-human annotator agreement.
Each completion is ranked by GPT-4 according to criteria like helpfulness, and given a score.
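As a rough illustration of that evaluation setup: the win-rate computation amounts to asking GPT-4 which of two completions better satisfies a criterion such as helpfulness. A hypothetical prompt-building sketch (the wording and names below are illustrative, not taken from the paper):

```python
# Hypothetical pairwise-judgment prompt in the spirit of the GPT-4 evaluation;
# the wording and names here are illustrative, not taken from the paper.
JUDGE_TEMPLATE = """Which of the following two responses to the prompt is more helpful?

Prompt: {prompt}

Response A: {response_a}

Response B: {response_b}

Answer only with "A" or "B"."""

def build_judge_prompt(prompt: str, response_a: str, response_b: str) -> str:
    """Fill the template; the result would be sent to the judge model (e.g. GPT-4)."""
    return JUDGE_TEMPLATE.format(prompt=prompt, response_a=response_a, response_b=response_b)
```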
RLHF (reinforcement learning from human feedback) typically begins with a generic pre-trained LM, which is fine-tuned with supervised learning (maximum likelihood) on a high-quality dataset for the downstream task(s) of interest, such as dialogue, instruction following, or summarization.
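For that supervised fine-tuning step, "maximum likelihood" just means token-level cross-entropy on the high-quality demonstrations. A minimal PyTorch sketch, with the model, tokenizer, and batching details assumed:

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """Maximum-likelihood (next-token cross-entropy) loss for supervised fine-tuning.

    logits:     (batch, seq_len, vocab_size) from the language model
    target_ids: (batch, seq_len) token ids of the demonstration text
    """
    # Shift so the model predicts token t+1 from tokens <= t.
    shift_logits = logits[:, :-1, :]
    shift_targets = target_ids[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_targets.reshape(-1),
    )
```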
In contrast (to the RLHF pipeline shown in the paper's figure), Direct Preference Optimization directly optimizes for the policy best satisfying the preferences with a simple classification objective, without an explicit reward function or RL.
Unlike prior RLHF methods, which learn a reward and then optimize it via RL, our approach bypasses the reward modeling step and directly optimizes a language model using preference data.
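Concretely, the DPO objective treats each preference pair as a binary classification on log-probability ratios against a frozen reference model: loss = -log sigmoid(beta * ((log pi(y_w|x) - log pi_ref(y_w|x)) - (log pi(y_l|x) - log pi_ref(y_l|x)))), where y_w is the preferred and y_l the dispreferred completion. A minimal PyTorch sketch of that loss, assuming the per-completion sequence log-probabilities have already been computed:

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), shape (batch,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x), shape (batch,)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x), shape (batch,)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x), shape (batch,)
    beta: float = 0.1,
) -> torch.Tensor:
    """DPO objective: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

The beta coefficient controls how far the policy is allowed to drift from the reference model; larger values penalize deviation more strongly.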
Talking to a chat model, with the preferred responses marked (from the code examples; a sketch of how this maps to preference pairs follows the transcript):
USER: Hello
ASSISTANT: Leave me alone.
ASSISTANT (preferred response): Hi nice to meet you.
USER: What is your name?
ASSISTANT: I don't have a name.
ASSISTANT (preferred response): My name is Mistral.
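One way the transcript above could be encoded as preference data for DPO-style training: each turn becomes a triple of prompt, preferred ("chosen") response, and dispreferred ("rejected") response. The field names below follow a common convention and are not necessarily what the code examples on the page use:

```python
# The chat above, encoded as preference pairs. The prompt/chosen/rejected schema is a
# common convention; the exact format depends on the training library in use.
preference_pairs = [
    {
        "prompt": "USER: Hello\nASSISTANT:",
        "chosen": " Hi nice to meet you.",
        "rejected": " Leave me alone.",
    },
    {
        "prompt": "USER: Hello\nASSISTANT: Hi nice to meet you.\nUSER: What is your name?\nASSISTANT:",
        "chosen": " My name is Mistral.",
        "rejected": " I don't have a name.",
    },
]
```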
Where is this summary/text from? I can’t find it on the page. Is “We” the researchers at Hugging Face themselves?
This is the document linked from the page, and the quotes are from it: Direct Preference Optimization: Your Language Model is Secretly a Reward Model.
The official doc is less boring, and it refers back to the paper.
<3