Authors:
(1) Corby Rosset, Microsoft Research and Correspondence to [email protected];
(2) Ching-An Cheng, Microsoft Research;
(3) Arindam Mitra, Microsoft Research;
(4) Michael Santacroce, Microsoft Research;
(5) Ahmed Awadallah, Microsoft Research and Correspondence to [email protected];
(6) Tengyang Xie, Microsoft Research and Correspondence to [email protected].
Table of Links
Abstract and 1 Introduction
2 Preliminaries
2.1 RLHF Based on Reward Models
2.2 RLHF with General Preferences
3 Direct Nash Optimization and 3.1 Derivation of Algorithm 1
3.2 Theoretical Analysis
4 Practical Algorithm – Iterative Contrastive Self-Improvement
5 Experiments and 5.1 Experimental Setup
5.2 Results and Analysis
6 Related Work
7 Conclusion and References
Appendix
A Extension to Regularized Preferences
B Detailed Proofs
C Additional Experimental Details
Abstract
This paper studies post-training large language models (LLMs) using preference feedback from a powerful oracle to help a model iteratively improve over itself. The typical approach for post-training LLMs involves Reinforcement Learning from Human Feedback (RLHF), which traditionally separates reward learning and subsequent policy optimization. However, such a reward maximization approach is limited by the nature of “point-wise” rewards (such as that of the Bradley-Terry model), which fail to express complex intransitive or cyclic preference relations. While advances in RLHF show that reward learning and policy optimization can be merged into a single contrastive objective for stability, they still remain tethered to the reward maximization framework. Recently, a new wave of research sidesteps the reward maximization presumptions in favor of directly optimizing over “pair-wise” or general preferences. In this paper, we introduce Direct Nash Optimization (DNO), a provable and scalable algorithm that marries the simplicity and stability of contrastive learning with the theoretical generality of optimizing general preferences. Because DNO is a batched on-policy algorithm using a regression-based objective, its implementation is straightforward and efficient. Moreover, DNO enjoys monotonic improvement across iterations, which helps it improve even over a strong teacher (such as GPT-4). In our experiments, a resulting 7B parameter Orca-2.5 model aligned by DNO achieves a state-of-the-art win-rate against GPT-4-Turbo of 33% on AlpacaEval 2.0 (even after controlling for response length), an absolute gain of 26% (7% → 33%) over the initializing model. It outperforms models with far more parameters, including Mistral Large, Self-Rewarding LM (70B parameters), and older versions of GPT-4. Our ablation studies analyze critical design decisions surrounding the choice of preference pairs and the use of LLMs as preference annotators. These results underscore the promise of DNO for post-training LLMs, and offer actionable insights for the AI research community.
1 Introduction
The field of artificial intelligence is evolving towards advanced models that can understand, reason, follow complex instructions, and create nuanced content, while aligning with human values and preferences. Large Language Models (LLMs) (e.g., Brown et al., 2020; Ouyang et al., 2022; Touvron et al., 2023; OpenAI et al., 2023) have demonstrated remarkable capabilities in generating human-like text, answering questions, and coding, yet they still face challenges in tasks that require a high degree of reliability, safety, and ethical alignment. To address these challenges, fine-tuning LLMs using Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., 2017; Bai et al., 2022a; Ouyang et al., 2022) has demonstrated strong potential for making LLMs more helpful by aligning them with human values.
The RLHF framework has been long studied in the context of preference-based reinforcement learning (RL) or RL from human preferences (e.g., Knox and Stone, 2008; Akrour et al., 2012; Griffith et al., 2013; Wirth et al., 2017; Christiano et al., 2017). The conventional methods for RLHF typically assume that the preference is determined by a scalar reward function through some model, such as the frequently used Bradley-Terry (BT) model (Bradley and Terry, 1952).[1] RLHF then optimizes toward the preference in a two-step procedure: reward learning, and policy optimization (through RL) to maximize the learned reward. Under certain conditions, the two-step procedure can be streamlined into a single-step contrastive learning approach (Rafailov et al., 2023), eliminating the need for explicit reward learning. Algorithms of this kind (e.g., Rafailov et al., 2023, DPO) leverage the insight that a policy can be expressed equivalently by an “internal reward function” that the policy is optimal to, so they reduce the RLHF problem to regressing the policy’s internal reward function to that of the preference model. These algorithms are originally offline, and boast enhanced stability and ease of optimization. Nonetheless, two-step RLHF algorithms and their single-step contrastive variants still fundamentally rely on the reward maximization framework, wherein reward-based preferences are governed by, e.g., the BT model.
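For reference, the single-step contrastive objective alluded to here is commonly written in the following standard form (our notation, following Rafailov et al. (2023): σ is the logistic function, π_ref is the reference policy, and β is a scaling hyperparameter):

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\left[ \log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
\;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right) \right]
```

Here the β-scaled log-ratio plays the role of the policy’s “internal reward”; regressing it against preference labels is what removes the explicit reward-learning step.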
The reward maximization framing poses a major limitation. Reward functions, defined to output a scalar score r(x, y) for a single response y to input x, cannot express general preferences y ≻ y′ | x between a pair of outputs in all cases, e.g., intransitive or cyclic preferences (Elo, 1978). Hence, LLMs trained under reward maximization cannot always align with human preference. Furthermore, recent works show that even in settings where preferences can be perfectly expressed under the reward-based BT models, optimizing towards rewards yields problematic behaviors; we refer the reader to Bertrand et al. (2023); Azar et al. (2023); Munos et al. (2023) for more details. Lastly, reward functions in practice can quickly become “stale” as the distribution of the policy shifts under training (Ross et al., 2011; Cheng et al., 2023; Azar et al., 2023; Munos et al., 2023), leaving them vulnerable to “reward hacking” (Amodei et al., 2016).
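As a minimal illustration of this limitation (our own example, not from the paper): under a Bradley-Terry-style model, the preference probability is a monotone function of a reward difference,

```latex
\mathcal{P}(y \succ y' \mid x) \;=\; \sigma\big(r(x, y) - r(x, y')\big),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}},
```

so y is preferred to y′ with probability above 1/2 if and only if r(x, y) > r(x, y′). A cyclic preference y₁ ≻ y₂, y₂ ≻ y₃, y₃ ≻ y₁ would therefore require r(x, y₁) > r(x, y₂) > r(x, y₃) > r(x, y₁), which is impossible for any scalar-valued r.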
We are motivated to overcome two separate challenges: the limited expressivity of reward-based RLHF, and the lack of clarity on how to scale up optimizing with respect to general preferences. Recent advances in reward-based optimization, e.g., DPO, already have efficient and scalable implementations – we seek a similarly efficient solution under the framework of general preferences.
We propose a provable and scalable RLHF algorithm – Direct Nash Optimization (DNO) (Algorithm 1) that achieves the best of both worlds, combining the scalability of contrastive objectives with the theoretical soundness of general preference optimization. DNO is designed as a batched on-policy algorithm with a regression-based learning objective; this design choice makes DNO stable and scalable, striking a balance between deployment efficiency and adaptability.
We summarize at a high level the key ingredients and insights of DNO below.
- To address the issue that reward functions cannot express general preferences, we leverage recent insights that the notion of reward ought to be expressed as expected win-rates with regard to a general preference function.[2]
- To address the issue found in previous work that optimizing this more general objective with online algorithms is sample-inefficient or unstable, we decompose the learning procedure into a sequence of “batched on-policy” iterations, wherein each step instead optimizes a simple regression objective.
- The regression objective (we choose binary cross-entropy) aligns the “internal reward function” of the policy to the expected win-rate compared with itself (as defined in Line 3 of Algorithm 1). By sampling outputs from the current policy to use for training (i.e., “self-play”), this procedure incentivizes self-improving behavior.
- Our framework is general enough to admit off-policy samples into training, importantly, those from a more powerful teacher (see choice of µ1 and µ2 in Algorithm 1).
- Furthermore, to ensure stability and computational efficiency, we propose a filtering scheme such that the reward regression is only performed on preference pairs with a sufficiently large margin (for theoretical explanation, see Section 4; in practice, see Section 5.2).
- DNO repeats this procedure for multiple iterations to let the policy optimize toward the general preference. Since each step involves a regression problem, it can be easily implemented at scale (a minimal sketch of one such iteration follows this list).
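To make these ingredients concrete, below is a minimal sketch, in PyTorch-style Python, of what one batched on-policy iteration could look like. The helper names (sample_fn, logp_fn, ref_logp_fn, annotate_win_rate) and the margin threshold are hypothetical placeholders, not the paper’s API; Algorithms 1 and 2 remain the authoritative specification.

```python
# Minimal sketch of one batched on-policy DNO-style iteration.
# All helper names are hypothetical placeholders; Algorithms 1-2 in the paper
# define the exact procedure and notation.
import torch
import torch.nn.functional as F


def annotate_win_rate(prompt, response, rivals):
    """Stand-in for the preference oracle (e.g., GPT-4 as annotator): returns the
    fraction of rival responses that `response` is preferred over. Stubbed with a
    random value here so the sketch runs."""
    return torch.rand(1).item()


def contrastive_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Binary cross-entropy style contrastive objective: pushes the policy's
    'internal reward' (beta-scaled log-ratio against a reference policy) to
    prefer the winning response y_w over the losing response y_l."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()


def dno_iteration(prompts, sample_fn, logp_fn, ref_logp_fn,
                  num_samples=4, min_margin=0.3):
    """One iteration: sample on-policy responses, score them with the preference
    oracle, keep only pairs with a sufficiently large win-rate margin, and
    regress with the contrastive objective."""
    pairs = []
    for x in prompts:
        ys = [sample_fn(x) for _ in range(num_samples)]        # self-play samples
        rates = [annotate_win_rate(x, y, ys) for y in ys]       # expected win-rates
        ranked = sorted(zip(rates, ys), key=lambda t: t[0], reverse=True)
        (r_w, y_w), (r_l, y_l) = ranked[0], ranked[-1]
        if r_w - r_l >= min_margin:                             # large-margin filter
            pairs.append((x, y_w, y_l))
    if not pairs:
        return None
    logp_w = torch.stack([logp_fn(x, y_w) for x, y_w, _ in pairs])
    logp_l = torch.stack([logp_fn(x, y_l) for x, _, y_l in pairs])
    ref_w = torch.stack([ref_logp_fn(x, y_w) for x, y_w, _ in pairs])
    ref_l = torch.stack([ref_logp_fn(x, y_l) for x, _, y_l in pairs])
    return contrastive_loss(logp_w, logp_l, ref_w, ref_l)
```

Two points of flexibility in this sketch mirror the list above: the sampled candidates `ys` could be augmented with off-policy (teacher) outputs, and `min_margin` implements the large-margin filtering discussed in Sections 4 and 5.2.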
Theoretically, we prove DNO converges to the intended Nash equilibrium on average, and that it can improve monotonically across iterations (see Section 3.1). Furthermore, our finite-sample analysis shows that approximation error at any iteration between the learned policy and the target is tightly bounded (Theorem 1).
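For context, the Nash equilibrium in question is that of the symmetric two-player game over the general preference function, as studied in this line of work (Munos et al., 2023; Swamy et al., 2024); written in our notation (the paper’s precise setup is in Section 2.2, with the regularized variant in Appendix A):

```latex
\pi^{\star} \;=\; \arg\max_{\pi}\; \min_{\pi'}\;
\mathbb{E}_{x \sim \rho,\; y \sim \pi(\cdot \mid x),\; y' \sim \pi'(\cdot \mid x)}
\big[ \mathcal{P}(y \succ y' \mid x) \big]
```

That is, a policy whose responses are preferred at least half the time against any competing policy.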
On the practical side, we provide a scalable implementation of DNO (Algorithm 2): an iterative self-improving algorithm with contrastive updates, which approximates Algorithm 1 under several critical design choices. Those choices include: sampling multiple online outputs from the policy being trained, using GPT-4 as the preference oracle, comparing on-policy samples to GPT-4’s own (teacher) outputs, and training only on pairs with a “large margin” (for theoretical explanation, see Section 4; in practice, see Section 5.2).
The primary distinction of our work over the related works of Nash-MD (Munos et al., 2023) and SPO (Swamy et al., 2024) is that both exhibit sample-efficiency issues (two-timescale updates or sample-inefficient RL steps), and both use purely on-policy samples. We resolve the efficiency issue with a sample-efficient objective that works in practice, and DNO is more flexible in incorporating off-policy samples from, e.g., a powerful teacher.
Most importantly, DNO works in practice – we provide comprehensive empirical evaluations, resulting in state-of-the-art performance:
• The resulting 7B parameter Orca-2.5 model, aligned using the practical implementation of DNO (Algorithm 2), achieves the state-of-the-art win-rate of any 7B model, exceeding 33% against GPT-4-Turbo on AlpacaEval 2.0, even after controlling for length. This is an absolute gain of over 26% (7%→33%) compared to the initializing model. It outperforms several recent advanced closed-source models, including Mistral Large and GPT-4-0613, as well as open-source models with far more (10×) parameters, such as Self-Rewarding LM (Yuan et al., 2024), which has 70B parameters.
• Our thorough ablation studies in Section 5.2 examine critical design touchpoints surrounding the choice of loss function (supervised fine-tuning or contrastive), training paradigm (with or without on-policy samples), preference annotator quality (large margin or not), and training pair construction (self-play, teacher-vs-student, etc.). Our findings highlight that the carefully-crafted methods encoded in Algorithm 2 lead to substantial gains.
• We show examples of outputs across iterations that demonstrate qualitative improvements, such as better addressing nuanced issues and presumptuous questions (Table 5), better organization and clarity while refraining from making misleading statements (Table 6), and higher information density in answers (Table 7).
We hope that the results presented herein will provide clarity to the community regarding the use of AI feedback for post-training LLMs.
This paper is available on arxiv under CC BY 4.0 DEED license.
[1] We use “reward model” to denote a framework that translates preferences into rewards, e.g., Bradley-Terry, while “reward function” is a (possibly learned) function that outputs reward scalars.