Researchers from NVIDIA and the University of Maryland Propose ODIN: A Reward Decomposition Technique for Mitigating Reward Hacking in Reinforcement Learning from Human Feedback (RLHF)

Written By Adarsh Shankar Jha

ChatGPT, the popular Artificial Intelligence (AI) chatbot built on top of the GPT transformer architecture, relies on the Reinforcement Learning from Human Feedback (RLHF) technique. RLHF is an increasingly important method for harnessing pre-trained Large Language Models (LLMs) to generate more helpful, truthful answers that are consistent with human preferences.

In RLHF, a reward model is first trained on human preferences over responses to specific prompts, after which the language model is trained with reinforcement learning to produce responses that maximize this learned reward. Since gathering human ratings is usually easier than collecting demonstrations for supervised fine-tuning, this approach simplifies the data collection process.
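As a rough illustration of the preference-modeling step, the sketch below shows the standard pairwise (Bradley-Terry style) loss commonly used to train RLHF reward models. The `reward_model` interface and function name are assumptions made for the example, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_model, chosen_ids, rejected_ids):
    """Standard Bradley-Terry style preference loss: the reward of the
    human-preferred (chosen) response should exceed that of the rejected one."""
    r_chosen = reward_model(chosen_ids)      # assumed: scalar reward per chosen response
    r_rejected = reward_model(rejected_ids)  # assumed: scalar reward per rejected response
    # Maximize the log-sigmoid of the reward margin between chosen and rejected.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```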

However, reward hacking is a subtle problem with RLHF, in which the policy receives a large reward without fulfilling the real objectives. It arises because the reward model generalizes poorly to out-of-distribution (OOD) examples and because human preferences may be imperfectly represented. Being a powerful LLM itself, the policy can generate OOD examples that exploit flaws in the reward model.

The scenario is further complicated by human preference data, which is often skewed and inconsistent due to the complexity and subjectivity of tasks, flaws in rating standards, and the low caliber of raters. Verbosity is a common example of reward hacking, in which models produce more tokens to make their responses appear more thorough or better formatted, without any real improvement in quality.

To address these issues, recent research from NVIDIA and the University of Maryland aimed to mitigate reward hacking by examining how RL algorithms and reward models affect verbosity and performance. The team presented an evaluation protocol to compare different training settings and account for biases in model-based evaluations. By evaluating performance on the Pareto front of evaluation score versus response length, the protocol gives a comprehensive picture of behavior across different response lengths.

This procedure analyzes the trade-off between LLM evaluation score and response length, allowing for a systematic comparison of different training settings. By varying the training hyperparameters, one can assess how these modifications affect the balance between verbosity and response quality.
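As a hedged illustration of this kind of analysis, the snippet below sketches how one might extract the Pareto front from (average response length, evaluation score) pairs collected across training runs. The data format and numbers are hypothetical, not results from the paper.

```python
def pareto_front(points):
    """Given (avg_length, avg_score) pairs for different training runs, keep only
    the non-dominated points: runs for which no other run achieves an equal or
    higher score with an equal or shorter average response length."""
    front = []
    for length, score in points:
        dominated = any(
            other_len <= length and other_score >= score
            and (other_len, other_score) != (length, score)
            for other_len, other_score in points
        )
        if not dominated:
            front.append((length, score))
    return sorted(front)

# Hypothetical (avg_length, LLM-judge score) results for several runs.
runs = [(180, 7.1), (240, 7.3), (320, 7.2), (150, 6.8)]
print(pareto_front(runs))  # [(150, 6.8), (180, 7.1), (240, 7.3)]
```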

The study examines hyperparameters and RL techniques, such as reward clipping and length penalty, to reduce length-based reward hacking. The primary goal is to remove the spurious length signal from the reward, although various tuning tricks can also yield some improvement. To achieve this, the team proposed a two-headed reward model that disentangles length representations from actual preferences: one head absorbs the length-correlated part of the reward, while the other captures length-independent quality. The length head is discarded during RL.
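The sketch below illustrates the general idea of such a two-headed reward model in PyTorch. It is a simplified, assumption-laden stand-in rather than the paper's implementation: ODIN's actual objective uses correlation-based terms and an orthogonality constraint between the heads, whereas here a single simplified disentanglement term takes their place, and the backbone interface is assumed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadRewardModel(nn.Module):
    """Reward model with a shared backbone and two linear heads: one meant to
    absorb the length signal, one meant to capture length-independent quality."""

    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone            # assumed: returns pooled (batch, hidden) features
        self.quality_head = nn.Linear(hidden_size, 1)
        self.length_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids):
        h = self.backbone(input_ids)
        return self.quality_head(h).squeeze(-1), self.length_head(h).squeeze(-1)

def two_head_loss(model, chosen, rejected, chosen_len, rejected_len, lam=1.0):
    q_c, l_c = model(chosen)
    q_r, l_r = model(rejected)
    # Ranking loss on the summed reward, as in standard preference training.
    rank = -F.logsigmoid((q_c + l_c) - (q_r + l_r)).mean()
    # Push the length head to track response length and the quality head not to
    # (a simplified stand-in for ODIN's correlation-based objectives).
    lengths = torch.cat([chosen_len, rejected_len]).float()
    q = torch.cat([q_c, q_r])
    l = torch.cat([l_c, l_r])
    corr = lambda a, b: torch.corrcoef(torch.stack([a, b]))[0, 1]
    disentangle = corr(q, lengths).abs() - corr(l, lengths)
    return rank + lam * disentangle
```

At RL time, only the quality head's output would be used as the reward, mirroring the paper's step of discarding the length head.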

Using the proposed reward-decoupling technique, ODIN, the policy achieved a larger Pareto front than previous results, even against baselines given a more expensive tuning budget. Both Proximal Policy Optimization (PPO) and ReMax benefit from ODIN, indicating that it can be used to improve other RL tuning methods and reduce length hacking.

In conclusion, the experimental results showed a marked reduction in the correlation between the reward model and response length. The resulting policy performs significantly better when information quality is prioritized over verbosity. The method successfully mitigates the reward-hacking problem associated with response length, improving the reliability and utility of LLMs trained with the RLHF paradigm.


Check out the Paper. All credit for this research goes to the researchers of this project.


Tanya Malhotra is a senior at the University of Petroleum & Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with strong analytical and critical-thinking skills, along with a keen interest in acquiring new skills, leading teams, and managing work in an organized manner.

