Authors:
(1) Corby Rosset, Microsoft Research and Correspondence to [email protected];
(2) Ching-An Cheng, Microsoft Research;
(3) Arindam Mitra, Microsoft Research;
(4) Michael Santacroce, Microsoft Research;
(5) Ahmed Awadallah, Microsoft Research and Correspondence to [email protected];
(6) Tengyang Xie, Microsoft Research and Correspondence to [email protected].
Table of Links
Abstract and 1 Introduction
2 Preliminaries
2.1 RLHF Based on Reward Models
2.2 RLHF with General Preferences
3 Direct Nash Optimization and 3.1 Derivation of Algorithm 1
3.2 Theoretical Analysis
4 Practical Algorithm – Iterative Contrastive Self-Improvement
5 Experiments and 5.1 Experimental Setup
5.2 Results and Analysis
6 Related Work
7 Conclusion and References
Appendix
A Extension to Regularized Preferences
B Detailed Proofs
C Additional Experimental Details
Abstract
This paper studies post-training large language models (LLMs) with preference feedback from a powerful oracle. Rather than maximizing a scalar reward, which cannot express general (e.g., intransitive) preferences, we propose Direct Nash Optimization (DNO), a provable and scalable algorithm that combines the scalability and stability of contrastive objectives with the theoretical soundness of optimizing general preferences. DNO is a batched on-policy algorithm in which each iteration solves a simple regression objective, making it straightforward to implement at scale; we prove that it converges to the intended Nash equilibrium on average and can improve monotonically across iterations. Empirically, a 7B parameter Orca-2.5 model aligned with our practical implementation of DNO achieves a state-of-the-art win rate of over 33% against GPT-4-Turbo on AlpacaEval 2.0, even after controlling for length, an absolute gain of more than 26% (7%→33%) over the initialized model.
1 Introduction
The field of artificial intelligence is evolving towards advanced models that can understand, reason, follow complex instructions, and create nuanced content, while aligning with human values and preferences. Large Language Models (LLMs) (e.g., Brown et al., 2020; Ouyang et al., 2022; Touvron et al., 2023; OpenAI et al., 2023) have demonstrated remarkable capabilities in generating human-like text, answering questions, and coding, yet they still face challenges in tasks that require a high degree of reliability, safety, and ethical alignment. To address these challenges, fine-tuning LLMs using Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., 2017; Bai et al., 2022a; Ouyang et al., 2022) has demonstrated strong potential for making LLMs more helpful by aligning them with human values.
The RLHF framework has long been studied in the context of preference-based reinforcement learning (RL) or RL from human preferences (e.g., Knox and Stone, 2008; Akrour et al., 2012; Griffith et al., 2013; Wirth et al., 2017; Christiano et al., 2017). The conventional methods for RLHF typically assume that the preference is determined by a scalar reward function through some model, such as the frequently used Bradley-Terry (BT) model (Bradley and Terry, 1952).[1] RLHF then optimizes toward the preference in a two-step procedure: reward learning, and policy optimization (through RL) to maximize the learned reward. Under certain conditions, the two-step procedure can be streamlined into a single-step contrastive learning approach (Rafailov et al., 2023), eliminating the need for explicit reward learning. Algorithms of this kind (e.g., DPO; Rafailov et al., 2023) leverage the insight that a policy can be expressed equivalently by an “internal reward function” to which the policy is optimal, so they reduce the RLHF problem to regressing the policy’s internal reward function to that of the preference model. These algorithms are originally offline, and boast enhanced stability and ease of optimization. Nonetheless, two-step RLHF algorithms and their single-step contrastive variants still fundamentally rely on the reward maximization framework, wherein reward-based preferences are governed by, e.g., the BT model.
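To make the “internal reward function” connection concrete, the following is the standard identity for KL-regularized reward maximization that underlies the DPO line of work (restated here for context; β is the regularization coefficient and π_ref the reference policy):

```latex
% Optimal policy of the KL-regularized objective and its inversion:
\[
\pi^{*}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\big(r(x, y)/\beta\big)
\quad\Longleftrightarrow\quad
r(x, y) \;=\; \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
  \;+\; \beta \log Z(x),
\]
```

where Z(x) is a prompt-dependent normalizer. Because Z(x) cancels in pairwise comparisons, the Bradley-Terry likelihood can be written purely in terms of the policy, which is exactly the reduction these contrastive algorithms exploit.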
The reward maximization framing poses a major limitation. Reward functions, defined to output a scalar score r(x, y) for a single response y to input x, cannot express general preferences y ≻ y′ | x between a pair of outputs in all cases, e.g., intransitive or cyclic preferences (Elo, 1978). Hence, LLMs trained under reward maximization cannot always align with human preferences. Furthermore, recent works show that even in settings where preferences can be perfectly expressed under the reward-based BT models, optimizing towards rewards yields problematic behaviors; we refer the reader to Bertrand et al. (2023); Azar et al. (2023); Munos et al. (2023) for more details. Lastly, reward functions in practice can quickly become “stale” as the distribution of the policy shifts under training (Ross et al., 2011; Cheng et al., 2023; Azar et al., 2023; Munos et al., 2023), leaving them vulnerable to “reward hacking” (Amodei et al., 2016).
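As an illustrative example (ours, not drawn from the paper) of why a scalar reward cannot capture every preference, consider three responses with cyclic pairwise preferences; a Bradley-Terry model P(y ≻ y′ | x) = σ(r(x, y) − r(x, y′)) forces a transitive ordering and therefore cannot realize this pattern:

```latex
% An intransitive (cyclic) preference pattern, with illustrative probabilities:
\[
\mathcal{P}(y_1 \succ y_2 \mid x) = 0.7, \qquad
\mathcal{P}(y_2 \succ y_3 \mid x) = 0.7, \qquad
\mathcal{P}(y_3 \succ y_1 \mid x) = 0.7.
\]
% Under Bradley-Terry, each probability above 1/2 would require
% r(x, y_1) > r(x, y_2) > r(x, y_3) > r(x, y_1), a contradiction.
```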
We are motivated to overcome two separate challenges: the limited expressivity of reward-based RLHF, and the lack of clarity on how to scale up optimization with respect to general preferences. Recent advances in reward-based optimization, e.g., DPO, already have efficient and scalable implementations; we seek a similarly efficient solution under the framework of general preferences.
We propose a provable and scalable RLHF algorithm, Direct Nash Optimization (DNO) (Algorithm 1), that achieves the best of both worlds, combining the scalability of contrastive objectives with the theoretical soundness of optimizing general preferences.
At a high level, we summarize the key ingredients and insights of DNO below.
- To address the issue that reward functions cannot express general preferences, we leverage recent insights that the notion of reward ought to be expressed as expected win-rates with regard to a general preference function.[2]
- To address the issue found in previous work that optimizing this more general objective with online algorithms is sample-inefficient or unstable, we decompose the learning procedure into a sequence of “batched on-policy” iterations, wherein each step instead optimizes a simple regression objective.
- The regression objective (we choose binary cross-entropy) aligns the “internal reward function” of the policy to the expected win-rate compared with itself (as defined in Line 3 of Algorithm 1). By sampling outputs from the current policy to use for training (i.e., “self-play”), this procedure incentivizes self-improving behavior (see the sketch after this list).
- Our framework is general enough to admit off-policy samples into training, importantly, those from a more powerful teacher (see choice of µ1 and µ2 in Algorithm 1).
- Furthermore, to ensure stability and computational efficiency, we propose a filtering scheme such that the reward regression is only performed on preference pairs with a sufficiently large margin (for theoretical explanation, see Section 4; in practice, see Section 5.2).
- DNO repeats this procedure for multiple iterations to let the policy optimize toward the general preference. Since each step involves a regression problem, it can be easily implemented at scale.
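The sketch below illustrates the kind of contrastive binary cross-entropy step described above, assuming a DPO-style internal reward (β · log-ratio against a reference policy). Function names and the β value are illustrative assumptions, not the paper's exact Algorithm 1.

```python
import torch
import torch.nn.functional as F

def internal_reward(policy_logprob: torch.Tensor,
                    ref_logprob: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    """DPO-style 'internal reward' of a response: beta * log(pi(y|x) / pi_ref(y|x))."""
    return beta * (policy_logprob - ref_logprob)

def contrastive_bce_loss(logp_preferred: torch.Tensor,
                         logp_dispreferred: torch.Tensor,
                         ref_logp_preferred: torch.Tensor,
                         ref_logp_dispreferred: torch.Tensor,
                         beta: float = 0.1) -> torch.Tensor:
    """One batched regression step on (preferred, dispreferred) pairs sampled
    from the current policy: binary cross-entropy with target 1 for the
    preferred response, i.e. -log sigmoid(internal reward gap)."""
    gap = (internal_reward(logp_preferred, ref_logp_preferred, beta)
           - internal_reward(logp_dispreferred, ref_logp_dispreferred, beta))
    return -F.logsigmoid(gap).mean()
```

In each outer iteration, a fresh batch of on-policy samples is annotated by the preference function (and, in this sketch, the reference is taken to be the previous iterate), which is what makes the procedure “batched on-policy” rather than fully offline.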
Theoretically, we prove DNO converges to the intended Nash equilibrium on average, and that it can improve monotonically across iterations (see Section 3.1). Furthermore, our finite-sample analysis shows that approximation error at any iteration between the learned policy and the target is tightly bounded (Theorem 1).
On the practical side, we provide a scalable implementation of DNO (Algorithm 2): an iterative self-improving algorithm with contrastive updates, which approximates Algorithm 1 under several critical design choices. Those choices include: sampling multiple online outputs from the policy being trained, using GPT-4 as the preference oracle, comparing on-policy samples to GPT-4’s own (teacher) outputs, and training only on pairs with a “large margin” (for theoretical explanation, see Section 4; in practice, see Section 5.2).
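A rough sketch of how such preference pairs might be assembled for a single prompt is shown below; `preference_oracle`, `margin_threshold`, and the win-rate scoring scheme are illustrative assumptions rather than the exact construction in Algorithm 2.

```python
from itertools import combinations

def build_preference_pairs(prompt, policy_samples, teacher_output,
                           preference_oracle, margin_threshold=0.4):
    """Assemble (prompt, preferred, dispreferred) triples for one prompt:
    score on-policy samples and the teacher output by their average win-rate
    under the preference oracle, then keep only large-margin pairs.
    `preference_oracle(prompt, a, b)` is assumed to return P(a beats b | prompt)."""
    candidates = list(policy_samples) + [teacher_output]

    # Estimate each candidate's expected win-rate against the other candidates.
    win_rates = []
    for i, y in enumerate(candidates):
        probs = [preference_oracle(prompt, y, other)
                 for j, other in enumerate(candidates) if j != i]
        win_rates.append(sum(probs) / len(probs))

    pairs = []
    for i, j in combinations(range(len(candidates)), 2):
        gap = win_rates[i] - win_rates[j]
        if abs(gap) >= margin_threshold:  # keep only confidently separated pairs
            winner, loser = (i, j) if gap > 0 else (j, i)
            pairs.append((prompt, candidates[winner], candidates[loser]))
    return pairs
```

Filtering by margin discards noisy comparisons where the annotator is nearly indifferent, which is the stability and efficiency consideration highlighted above.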
The primary distinction between our work and the related Nash-MD (Munos et al., 2023) and SPO (Swamy et al., 2024) is that both exhibit sample-efficiency issues (two-timescale updates or sample-inefficient RL steps), and both use purely on-policy samples. We resolve the efficiency issue with a sample-efficient objective that works in practice, and DNO is more flexible in incorporating off-policy samples from, e.g., a powerful teacher.
Most importantly, DNO works in practice – we provide comprehensive empirical evaluations, resulting in state-of-the-art performance:
• The resulting 7B parameter Orca-2.5 model, aligned with the practical implementation of DNO (Algorithm 2), achieves the state-of-the-art win rate of any 7B model, exceeding 33% against GPT-4-Turbo on AlpacaEval 2.0, even after controlling for length. This is more than a 26% absolute gain (7%→33%) compared to the initialized model.
• Our thorough ablation studies in Section 5.2 examine critical design touchpoints surrounding the choice of loss function (supervised fine-tuning or contrastive), training paradigm (with or without on-policy samples), preference annotator quality (large margin or not), and training pair construction (self-play, teacher-vs-student, etc.). Our findings highlight that the carefully crafted methods codified in Algorithm 2 lead to substantial gains.
• We show some examples of outputs across iterations which demonstrate qualitative improvements such as better addressing nuanced issues and presumptuous questions (Table 5), better organization and clarity while refraining from making misleading statements (Table 6), and higher information density in answers (Table 7).
We hope that the results presented herein will provide clarity to the community regarding the use of AI feedback for post-training LLMs.
This paper is available on arxiv under CC BY 4.0 DEED license.
[1] We use “reward model” to denote a framework that translates preferences into rewards, e.g., Bradley-Terry, while “reward function” is a (possibly learned) function that outputs reward scalars.