Authors:
(1) Corby Rosset, Microsoft Research and Correspondence to [email protected];
(2) Ching-An Cheng, Microsoft Research;
(3) Arindam Mitra, Microsoft Research;
(4) Michael Santacroce, Microsoft Research;
(5) Ahmed Awadallah, Microsoft Research and Correspondence to [email protected];
(6) Tengyang Xie, Microsoft Research and Correspondence to [email protected].
Table of Links
2.1 RLHF Based on Reward Models
2.2 RLHF with General Preferences
3 Direct Nash Optimization and 3.1 Derivation of Algorithm 1
4 Practical Algorithm – Iterative Contrastive Self-Improvement
5 Experiments and 5.1 Experimental Setup
Appendix
A Extension to Regularized Preferences
C Additional Experimental Details
4 Practical Algorithm – Iterative Contrastive Self-Improvement
In this section, we shift our focus to the algorithmic design of the practically scalable version of DNO, following the principles discussed in the last section. A primary challenge encountered in the implementation of the conceptual algorithm DNO (Algorithm 1) stems from the necessity to compute the expectation with respect to the preference function P under the current policy πt. Perhaps surprisingly, as we will show, all we need is a properly implemented iterative DPO-like contrastive learning algorithm.
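To make this concrete, the snippet below sketches one way to estimate that expectation by Monte Carlo sampling from the current policy. The helper names and signatures (preference_fn, policy_samples) are illustrative assumptions, not code from the paper.

```python
import random

def estimate_win_rate(prompt, response, policy_samples, preference_fn, num_comparisons=8):
    """Monte Carlo estimate of the expected win rate of `response` against
    responses drawn from the current policy pi_t, i.e. an estimate of
    E_{y' ~ pi_t}[ P(response > y' | prompt) ].

    `policy_samples` holds responses y' sampled from pi_t for the same prompt,
    and `preference_fn(prompt, y, y_prime)` returns the probability (or a 0/1
    judgment) that y is preferred over y_prime. Both are placeholders.
    """
    opponents = random.sample(policy_samples, min(num_comparisons, len(policy_samples)))
    wins = sum(preference_fn(prompt, response, y_prime) for y_prime in opponents)
    return wins / len(opponents)
```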
We present the practical implementation of DNO in Algorithm 2 (DNO-Prct), a batched on-policy algorithm that conducts self-improvement iteratively via contrastive learning. One key consideration in our algorithmic design is that we only need to use the reward function rt implicitly, through specifically designed on-policy sampling, data filtering, and pair construction. While these design choices make DNO-Prct appear similar to simply performing DPO iteratively, there are significant reasons behind them, as we discuss below.
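As a rough illustration of this batched on-policy structure, the following sketch strings these pieces together. All helper functions (sample_responses, construct_pairs, dpo_update) are hypothetical placeholders, and the loop reflects the textual description above rather than the exact pseudocode of Algorithm 2.

```python
def dno_prct(policy, prompts, preference_fn, num_iterations=3, samples_per_prompt=8):
    """A minimal sketch of iterative contrastive self-improvement in the spirit
    of DNO-Prct. Each iteration: (1) sample responses on-policy, (2) score them
    with the general preference function via estimated win rates, (3) filter and
    build chosen/rejected pairs, (4) run a DPO-style contrastive update with the
    current policy as the reference. Helpers are illustrative placeholders.
    """
    for _ in range(num_iterations):
        pair_batch = []
        for x in prompts:
            # (1) on-policy sampling from the current policy pi_t
            candidates = sample_responses(policy, x, k=samples_per_prompt)
            # (2) implicit reward r_t: expected win rate under the preference P
            scores = [estimate_win_rate(x, y, candidates, preference_fn) for y in candidates]
            # (3) data filtering and pair construction from the win-rate ranking
            pair_batch.extend(construct_pairs(x, candidates, scores))
        # (4) contrastive (DPO-style) update, using pi_t as the reference policy
        policy = dpo_update(policy, reference=policy, pairs=pair_batch)
    return policy
```

In this sketch, the reward rt never needs to be represented explicitly: it enters only through the win-rate scores used to rank and filter candidate pairs.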
Relationship between DNO-Prct and DPO. The reader may discern that DNO-Prct (Algorithm 2)—the practical implementation of DNO—can be described as an iterative version of the DPO algorithm. This similarity is by design, intended to harness the simplicity and effectiveness of DPO (Rafailov et al., 2023) and to build on empirical advances from recent work that applies DPO iteratively (e.g., Yuan et al., 2024; Tran et al., 2024). Our experiments point to the importance of several design choices that accommodate general preferences, such as rankings derived from pairwise win rates. More interestingly, our findings point to a surprising connection—that "a meticulously designed iterative DPO algorithm" could approach the Nash equilibrium of any given general preference.
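For instance, one plausible pair-construction rule based on pairwise win rates might look like the sketch below; the margin threshold and the all-pairs selection rule are assumptions made for illustration, not the paper's exact filtering criteria.

```python
def construct_pairs(prompt, candidates, scores, margin=0.3):
    """Build (chosen, rejected) training pairs from candidates ranked by their
    estimated win rates, keeping only pairs where the chosen response beats the
    rejected one by at least `margin`. The threshold and pairing rule are
    illustrative assumptions, not the paper's exact criteria.
    """
    ranked = sorted(zip(candidates, scores), key=lambda cs: cs[1], reverse=True)
    return [
        {"prompt": prompt, "chosen": c, "rejected": r}
        for c, score_c in ranked
        for r, score_r in ranked
        if score_c - score_r >= margin
    ]
```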
Our general algorithmic framework—DNO (Algorithm 1)—is broader than, and fundamentally different from, iterative DPO. For example, the DNO framework can be directly extended to the regularized-preference case (as discussed in Appendix A) or equipped with other advanced sampling techniques (e.g., Liu et al., 2024b, RSO), as suggested by Theorem 1, for sample efficiency. On the other hand, although soft policy iteration (i.e., KL-regularized reward optimization) is used in both DNO and DPO, it arises for fundamentally different reasons. For DNO, the KL-regularization originates from online learning, specifically no-regret learning via mirror descent (Nemirovskij and Yudin, 1983) or follow-the-regularized-leader (FTRL) (Kalai and Vempala, 2005; Cesa-Bianchi and Lugosi, 2006; Shalev-Shwartz et al., 2012; Hazan et al., 2016). For DPO and PPO, the KL-regularization is an approximation of the total-variation penalty used to ensure monotonic policy improvement (Kakade and Langford, 2002; Schulman et al., 2015); this approach was later simplified by Schulman et al. (2017, PPO) and has recently been used for post-training LLMs (Ouyang et al., 2022).
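For reference, the soft-policy-iteration update and a DPO-style contrastive loss that implements it implicitly can be written as below. This is a sketch using standard DPO notation, where β plays the role of the KL-regularization coefficient, πt is the reference (current) policy, rt is the implicit win-rate reward, and (y_w, y_l) denotes a chosen/rejected pair; the paper's own notation may differ.

```latex
% Soft policy iteration (KL-regularized reward maximization) against \pi_t:
\pi_{t+1}(y \mid x) \;\propto\; \pi_t(y \mid x)\,\exp\!\big(r_t(x, y)/\beta\big)

% A DPO-style contrastive loss whose minimizer, under a Bradley-Terry model
% on the pairs, recovers the same update:
\mathcal{L}_t(\pi) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[\log \sigma\!\left(
  \beta \log \frac{\pi(y_w \mid x)}{\pi_t(y_w \mid x)}
  - \beta \log \frac{\pi(y_l \mid x)}{\pi_t(y_l \mid x)}
\right)\right]
```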
This paper is available on arXiv under a CC BY 4.0 DEED license.