We find that GPT-4 judgments correlate strongly with human judgments: human agreement with GPT-4 is typically similar to, or higher than, inter-human annotator agreement.
Each completion is scored by GPT-4 against criteria such as helpfulness.
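As a rough sketch of how such a GPT-4 scoring call could look (the prompt wording and helper name here are illustrative assumptions, not the exact evaluation prompt used in the paper; uses the openai Python client):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_TEMPLATE = """Rate how helpful the following response to the prompt is,
on a scale from 1 (not helpful) to 10 (very helpful). Reply with the number only.

Prompt: {prompt}

Response: {response}"""

def gpt4_helpfulness_score(prompt: str, response: str) -> int:
    """Ask GPT-4 for a 1-10 helpfulness score for one completion."""
    reply = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_TEMPLATE.format(prompt=prompt, response=response),
        }],
    )
    return int(reply.choices[0].message.content.strip())
```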
RLHF (reinforcement learning from human feedback) typically begins with a generic pre-trained LM, which is fine-tuned with supervised learning (maximum likelihood) on a high-quality dataset for the downstream task(s) of interest, such as dialogue, instruction following, or summarization.
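A minimal sketch of that supervised fine-tuning step, assuming a Hugging Face causal LM and a pre-tokenized batch of demonstration data (names here are illustrative):

```python
def sft_step(model, batch, optimizer):
    """One maximum-likelihood update on high-quality demonstration data.

    `model` is assumed to be a Hugging Face causal LM: passing labels=input_ids
    makes it compute the standard shifted next-token cross-entropy internally.
    (In practice, padding positions in the labels would be masked to -100.)
    """
    outputs = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        labels=batch["input_ids"],
    )
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```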
In contrast to that pipeline (shown in the image), Direct Preference Optimization directly optimizes for the policy that best satisfies the preferences, using a simple classification objective with no explicit reward function and no RL.
Unlike prior RLHF methods, which learn a reward and then optimize it via RL, our approach bypasses the reward modeling step and directly optimizes a language model using preference data.
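Concretely, the DPO objective is a logistic loss over preference pairs. A minimal PyTorch sketch, assuming the summed log-probabilities of each completion under the policy and the frozen reference model are computed elsewhere:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             reference_chosen_logps, reference_rejected_logps, beta=0.1):
    """DPO objective on a batch of (chosen, rejected) completion pairs.

    Each argument is a tensor of summed log-probabilities of the chosen /
    rejected completion under the policy being trained or the frozen
    reference model. beta controls how far the policy may drift from the
    reference.
    """
    chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps)
    # Binary-classification-style loss: push the implicit reward of the
    # chosen completion above that of the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```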
An example chat annotated with preferences (from the code examples):
USER: Hello
ASSISTANT: Leave me alone.
ASSISTANT (preferred response): Hi nice to meet you.
USER: What is your name?
ASSISTANT: I don't have a name.
ASSISTANT (preferred response): My name is Mistral.
But all my friends call me Mistral.
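In preference-data form, a conversation like this typically reduces to (prompt, chosen, rejected) triples. The field names below follow a common DPO-trainer convention and are illustrative rather than taken verbatim from the code examples:

```python
# Each turn contributes one preference pair: the preferred response is
# "chosen", the other response is "rejected", and the prompt carries the
# conversation so far.
preference_pairs = [
    {
        "prompt": "USER: Hello\nASSISTANT:",
        "chosen": " Hi nice to meet you.",
        "rejected": " Leave me alone.",
    },
    {
        "prompt": (
            "USER: Hello\nASSISTANT: Hi nice to meet you.\n"
            "USER: What is your name?\nASSISTANT:"
        ),
        "chosen": " My name is Mistral. But all my friends call me Mistral.",
        "rejected": " I don't have a name.",
    },
]
```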