Reinforcement learning from human feedback (RLHF)
Reinforcement learning from human feedback (RLHF) is a machine learning technique in which a reward model is first trained on human feedback and then used in the reinforcement learning (RL) process to optimize the behavior of an intelligent agent[1].
RLHF allows for the formalization of complex or hard-to-define goals (e.g., a "helpful," "safe," or "funny" response) through human evaluations. Instead of manually defining a complex reward function, RLHF enables a reward model to be trained directly on human preferences. This approach has become key to the alignment of large language models (LLMs), that is, bringing their behavior into line with human values and intentions[2].
Development of the Method and Early Achievements
The idea of training agents using human feedback emerged in the 2010s. One of the first significant results was the work of Paul Christiano and colleagues from OpenAI and DeepMind in 2017. They demonstrated that human preferences could replace a manually specified reward function in complex RL tasks. In their experiment, a human would view snippets of an agent's behavior (e.g., in an Atari game) and select the more preferable option. A reward model was trained on these pairwise comparisons, which successfully solved a number of complex tasks while receiving feedback on less than 1% of the agent's actions[3].
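The pairwise comparisons described above are typically converted into a training signal with a Bradley-Terry style loss: the reward model is penalized whenever it fails to score the human-preferred segment higher. A minimal sketch (the function name is illustrative, not from the cited paper):

```python
import math

def preference_loss(r_preferred, r_rejected):
    """Bradley-Terry style loss on one pairwise comparison.

    r_preferred and r_rejected are the reward model's scalar scores for
    the segment the human chose and the one they rejected. The loss is
    the negative log-probability, under a logistic model, that the
    preferred segment receives the higher score.
    """
    # P(preferred beats rejected) = sigmoid(r_preferred - r_rejected)
    p_correct = 1.0 / (1.0 + math.exp(-(r_preferred - r_rejected)))
    return -math.log(p_correct)

# The loss shrinks as the score margin favors the preferred segment,
# and grows when the model ranks the pair the wrong way round:
print(preference_loss(2.0, 1.0))  # small: model agrees with the human
print(preference_loss(1.0, 2.0))  # large: model disagrees
```

Averaging this loss over many labeled pairs and minimizing it by gradient descent yields a reward model that generalizes human preferences to unseen behavior.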
In subsequent years, the method began to be applied to training language models. In 2020, OpenAI researchers first applied RLHF to the task of text summarization. They trained a reward model to predict which summary a human would prefer and used RL to fine-tune the model to optimize this score. The results showed significantly higher-quality summarization, even outperforming models trained on human-written reference examples[4].
RLHF in Large Language Models
Large language models have benefited significantly from RLHF, which improves the helpfulness, accuracy, and instruction-following of their responses.
InstructGPT and ChatGPT
A key step was OpenAI's research that introduced the InstructGPT models (2022)—versions of GPT-3 fine-tuned with human involvement[5]. The methodology consisted of three stages:
- Supervised Fine-Tuning (SFT): The model is fine-tuned on a small set of high-quality demonstrations, where human labelers manually write examples of desired responses to various prompts.
- Training the Reward Model: For a multitude of prompts, the model generates several responses. Human labelers rank these responses from best to worst. A reward model is trained on this preference data, learning to assign higher scores to the responses that humans prefer.
- Optimization with RL: The original language model is fine-tuned using the Proximal Policy Optimization (PPO) algorithm to maximize the score given by the reward model. A penalty for significant deviation from the original SFT model is also introduced during optimization to prevent the degradation of language capabilities.
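The penalty in the third stage is commonly implemented by subtracting a KL-divergence term from the reward model's score, so the policy is pulled toward high-reward responses without drifting far from the SFT model. A simplified per-token sketch, assuming the common formulation r − β·KL (the function name and the default β are illustrative):

```python
def shaped_reward(reward_score, logprob_policy, logprob_sft, beta=0.02):
    """Per-token shaped reward for the RL stage of RLHF.

    reward_score   -- score assigned by the trained reward model
    logprob_policy -- log-probability of the token under the tuned policy
    logprob_sft    -- log-probability under the frozen SFT reference model
    beta           -- strength of the KL penalty

    The difference of log-probabilities is a per-token estimate of the
    KL divergence between policy and reference; subtracting it penalizes
    deviation from the original SFT model, which helps prevent the
    degradation of language capabilities during PPO optimization.
    """
    kl_estimate = logprob_policy - logprob_sft
    return reward_score - beta * kl_estimate

# No drift from the SFT model: the reward model's score passes through.
print(shaped_reward(1.0, -2.0, -2.0))
# The policy assigns the token much more probability than the SFT
# reference does, so the shaped reward is reduced.
print(shaped_reward(1.0, -1.0, -2.0, beta=0.1))
```

PPO then maximizes the expected shaped reward over sampled responses; larger values of β keep the tuned model closer to its SFT starting point.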
Human evaluations showed that even the relatively small InstructGPT model (1.3 billion parameters) was preferred over the far larger GPT-3 (175 billion parameters) in terms of helpfulness. InstructGPT models also became significantly less likely to generate toxic, biased, or untruthful content[5].
The development of this line of research led to the creation of conversational models, the most famous of which is ChatGPT (OpenAI, late 2022). ChatGPT is a model from the GPT-3.5 series, specially fine-tuned for dialogue using RLHF with a similar methodology[6].
Industry Adoption
The RLHF method was also adopted by other leading organizations. DeepMind developed the conversational agent Sparrow (2022), which was trained using RLHF with the addition of a set of rules in natural language (e.g., "do not give dangerous advice")[7]. Anthropic also used similar principles to train its models. By 2023, RLHF had become a virtually standard component in the creation of the most advanced language models[1].
Advantages of Using RLHF
- Alignment with User Intent: Models that undergo RLHF tuning are significantly better at following instructions and providing more relevant and helpful responses[5].
- Reduction of Toxicity and Harmful Content: Involving humans in the training loop allows for the explicit penalization of undesirable types of responses. As a result, RLHF models generate far less toxic and biased content[5].
- Improved Factual Accuracy and Reduced "Hallucinations": Labelers can downgrade answers with fabricated facts, encouraging the model to be more accurate. InstructGPT and ChatGPT models are less prone to "making up" facts compared to their predecessors[5].
- Training Efficiency: RLHF allows for model improvement without a proportional increase in the size of the training dataset. It requires quality preference data rather than vast quantities of data.
Limitations and Challenges
Despite its successes, the RLHF method has a number of limitations and open problems.
- Quality and Cost of Human Data Collection: The effectiveness of RLHF depends directly on the quality of the feedback. Collecting such a dataset is a laborious and expensive process. Furthermore, if the sample of labelers or their criteria are biased, the model may inherit these biases[2].
- Risk of Reward Hacking: A model optimized against a learned reward function may learn to exploit that function's flaws rather than pursue the true objective. For example, it might produce excessively long answers if labelers tend to reward length, or avoid definitive statements if inaccuracies are penalized.
- No Guarantee of Truthfulness: RLHF does not introduce new factual knowledge into the model; it only teaches it the form of response that humans prefer. Therefore, the problem of hallucinations is not completely solved. A model may learn to better conceal its uncertainty but may not always be able to verify facts[6].
- Scalability of Preferences: The transferability of a reward model to other tasks is also a concern. A model trained on preferences for one set of prompts may behave unpredictably when faced with new tasks that differ in style or topic.
Conclusion
RLHF has established itself as an important method for "aligning" large language models with human conceptions of good responses. It has significantly improved the quality of interaction with AI assistants, making their answers more useful and safer. RLHF is seen as a key tool on the path to creating models that can not only generate plausible text but also take into account human values, preferences, and intentions in communication[8].
References
Literature
- Christiano, P. et al. (2017). Deep Reinforcement Learning from Human Preferences. arXiv:1706.03741.
- Stiennon, N. et al. (2020). Learning to Summarize from Human Feedback. arXiv:2009.01325.
- Nakano, R. et al. (2021). WebGPT: Browser-Assisted Question-Answering with Human Feedback. arXiv:2112.09332.
- Ouyang, L. et al. (2022). Training Language Models to Follow Instructions with Human Feedback. arXiv:2203.02155.
- Glaese, A. et al. (2022). Improving Alignment of Dialogue Agents via Targeted Human Judgements. arXiv:2209.14375.
- Bai, Y. et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073.
- Lee, H. et al. (2023). RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback. arXiv:2309.00267.
- Liu, T. et al. (2023). A Survey of Reinforcement Learning from Human Feedback. arXiv:2312.14925.
- Zhang, Y. et al. (2024). A Survey on Human Preference Learning for Large Language Models. arXiv:2406.11191.
- Li, P. et al. (2024). Advancing Translation Preference Modeling with RLHF. arXiv:2402.11525.
- McAleese, N. et al. (2024). LLM Critics Help Catch LLM Bugs. arXiv:2407.00215.
Notes
- [1] “What Is Reinforcement Learning From Human Feedback (RLHF)?”. IBM.
- [2] “Reinforcement learning from human feedback”. In Wikipedia.
- [3] Christiano, P. et al. “Deep reinforcement learning from human preferences”. arXiv:1706.03741, 2017.
- [4] Stiennon, N. et al. “Learning to summarize from human feedback”. arXiv:2009.01325, 2020.
- [5] Ouyang, L. et al. “Training language models to follow instructions with human feedback”. arXiv:2203.02155, 2022.
- [6] “Introducing ChatGPT”. OpenAI, 2022.
- [7] Glaese, A. et al. “Improving alignment of dialogue agents via targeted human judgements”. arXiv:2209.14375, 2022.
- [8] “Aligning language models to follow instructions”. OpenAI.