The popular Artificial Intelligence (AI) chatbot ChatGPT, built on top of the GPT transformer architecture, uses Reinforcement Learning from Human Feedback (RLHF). RLHF is an increasingly important method for harnessing pre-trained Large Language Models (LLMs) to generate answers that are more useful, truthful, and consistent with human preferences.
In RLHF, a reward model is first trained on human preferences over responses to given prompts; the language model is then trained with reinforcement learning to produce responses that maximize the learned reward. Since gathering human ratings is usually less complex than collecting demonstrations for supervised fine-tuning, this approach simplifies data collection.
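The reward-model step described above is commonly trained with a pairwise (Bradley-Terry-style) objective on preference data. The minimal sketch below illustrates that objective with plain scalars; the function names and example values are illustrative, not taken from any particular codebase.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_loss(r_chosen, r_rejected):
    """Bradley-Terry-style loss: -log sigmoid(r_chosen - r_rejected).

    Minimizing it pushes the reward of the human-preferred response
    above the reward of the rejected one.
    """
    return -math.log(sigmoid(r_chosen - r_rejected))

# The loss shrinks as the reward model ranks the preferred response higher.
confident = pairwise_loss(2.0, 0.0)  # chosen already well above rejected: small loss
wrong = pairwise_loss(0.0, 2.0)      # rejected scored higher: large loss
```

In practice the scalar rewards come from a learned model head over response representations, but the ranking loss itself takes exactly this form.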
However, RLHF suffers from a subtle problem known as reward hacking, in which the policy obtains a high reward without fulfilling the real objectives. This results from the reward model's limited generalizability to out-of-distribution (OOD) inputs and possible imperfections in its representation of human preferences. Being a powerful LLM itself, the policy can generate OOD examples that exploit flaws in the reward model.
The situation is further complicated by human preference data, which is often skewed and inconsistent due to the complexity and subjectivity of tasks, flaws in rating standards, and the varying quality of raters. Verbosity is a common example of reward hacking: models produce more tokens to make responses appear more thorough or better formatted, with no real improvement in quality.
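The verbosity failure mode can be made concrete with a toy model: suppose a learned reward leaks a small per-token bonus (a spurious length signal). The `length_bias` term and the numbers below are purely illustrative assumptions, not measurements from the paper.

```python
def hacked_reward(quality, n_tokens, length_bias=0.01):
    """A toy reward whose score leaks a spurious per-token bonus.

    `quality` stands in for the true usefulness of the answer;
    `length_bias` models the length signal absorbed by the reward model.
    """
    return quality + length_bias * n_tokens

# Two answers of identical underlying quality:
concise_r = hacked_reward(quality=0.8, n_tokens=50)   # short, to the point
padded_r = hacked_reward(quality=0.8, n_tokens=400)   # padded with filler
```

Under such a reward, the padded answer scores higher despite identical quality, so an RL policy maximizing it learns to pad, which is exactly the length-hacking behavior the article describes.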
To address these issues, recent research from NVIDIA and the University of Maryland aimed to mitigate reward hacking by examining how RL algorithms and reward models affect verbosity and performance. The team presented an evaluation technique that compares different training setups and accounts for biases in model-based evaluations. The technique provides a comprehensive view across response lengths by evaluating performance on the Pareto front of evaluation score versus length.
This procedure analyzes the trade-off between LLM evaluation score and response length, allowing a systematic comparison of different training setups. By varying the training hyperparameters, one can assess how these modifications shift the balance between verbosity and response quality.
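A Pareto front over (length, score) pairs can be computed directly: a training run is kept only if no other run achieves an equal-or-higher score at an equal-or-shorter length. The sketch below assumes each run is summarized by one such pair; the sample numbers are invented for illustration.

```python
def pareto_front(points):
    """Return the (length, score) points not dominated by any other point.

    A point is dominated if some other point has an equal-or-shorter
    length and an equal-or-higher score (and is not the point itself).
    """
    front = []
    for length, score in points:
        dominated = any(
            l2 <= length and s2 >= score and (l2, s2) != (length, score)
            for l2, s2 in points
        )
        if not dominated:
            front.append((length, score))
    return sorted(front)

# One (mean response length, evaluation score) pair per training setup:
runs = [(120, 0.62), (300, 0.70), (310, 0.64), (500, 0.71), (150, 0.66)]
front = pareto_front(runs)
# (310, 0.64) falls off the front: (300, 0.70) is both shorter and better.
```

Comparing fronts rather than single scores is what lets the evaluation separate genuine quality gains from gains obtained merely by producing longer responses.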
The study examines hyperparameters and RL techniques, such as reward clipping and length penalty, to reduce length-based reward hacking. The primary goal is to remove the spurious length signal from the reward, though careful tuning alone can already yield better results. To achieve this, the team proposed a two-headed reward model that disentangles length representations from actual preferences. The length head is discarded during RL.
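The two-headed idea can be sketched as follows: shared features feed two scalar heads, one intended to capture quality and one to absorb the length signal, and only the quality head is kept at RL time. This is a minimal sketch assuming simple linear heads; the weight names are hypothetical, and the paper's actual training objective for keeping the heads disentangled is not reproduced here.

```python
def two_head_reward(features, w_quality, w_length):
    """Two linear heads over a shared feature vector.

    During reward-model training, both heads are fit; the length head is
    meant to soak up the spurious length correlation so the quality head
    does not have to.
    """
    r_quality = sum(f * w for f, w in zip(features, w_quality))
    r_length = sum(f * w for f, w in zip(features, w_length))
    return r_quality, r_length

def rl_reward(features, w_quality, w_length):
    """At RL time the length head is discarded: only quality drives the policy."""
    r_quality, _ = two_head_reward(features, w_quality, w_length)
    return r_quality

feats = [1.0, 2.0, 3.0]            # hypothetical shared representation
wq = [0.5, -0.2, 0.1]              # quality-head weights (illustrative)
wl = [0.3, 0.3, 0.3]               # length-head weights (illustrative)
policy_reward = rl_reward(feats, wq, wl)
```

The key property is that the reward seen by the policy no longer depends on the length head at all, which is what removes the incentive to pad responses.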
With the proposed reward-disentangling technique, ODIN, the policy achieved a larger Pareto front than previous results, even under a more expensive tuning budget. Both Proximal Policy Optimization (PPO) and ReMax benefit from ODIN, indicating that it can be applied to improve other RL tuning methods and reduce length hacking.
In conclusion, experiments with this method showed a marked reduction in the reward model's correlation with response length. The resulting policy performs significantly better when information quality is prioritized over verbosity. The method successfully mitigates the reward hacking problem associated with response length, improving the reliability and utility of LLMs trained under the RLHF paradigm.
Check out the Paper. All credit for this research goes to the researchers of this project.
Tanya Malhotra is a senior at the University of Petroleum & Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with strong analytical and critical thinking skills, along with a keen interest in acquiring new skills, leading teams, and managing work in an organized manner.