Authors:
(1) Corby Rosset, Microsoft Research and Correspondence to [email protected];
(2) Ching-An Cheng, Microsoft Research;
(3) Arindam Mitra, Microsoft Research;
(4) Michael Santacroce, Microsoft Research;
(5) Ahmed Awadallah, Microsoft Research and Correspondence to [email protected];
(6) Tengyang Xie, Microsoft Research and Correspondence to [email protected].
Table of Links
2.1 RLHF Based on Reward Models
2.2 RLHF with General Preferences
3 Direct Nash Optimization and 3.1 Derivation of Algorithm 1
4 Practical Algorithm – Iterative Contrastive Self-Improvement
5 Experiments and 5.1 Experimental Setup
Appendix
A Extension to Regularized Preferences
C Additional Experimental Details
4 Practical Algorithm – Iterative Contrastive Self-Improvement
In this section, we shift our focus to the algorithmic design of the practically scalable version of DNO, following the principles discussed in the last section. A primary challenge encountered in the implementation of the conceptual algorithm DNO (Algorithm 1) stems from the necessity to compute the expectation with respect to the preference function P under the current policy πt. Perhaps surprisingly, as we will show, all we need is a properly implemented iterative DPO-like contrastive learning algorithm.
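To make this concrete, the snippet below sketches one way to estimate that expectation by Monte Carlo sampling from the current policy. The helper names and signatures (preference_fn, policy_samples) are illustrative assumptions, not code from the paper.

```python
import random

def estimate_win_rate(prompt, response, policy_samples, preference_fn, num_comparisons=8):
    """Monte Carlo estimate of the expected win rate of `response` against
    responses drawn from the current policy pi_t, i.e. an estimate of
    E_{y' ~ pi_t}[ P(response > y' | prompt) ].

    `policy_samples` holds responses y' sampled from pi_t for the same prompt,
    and `preference_fn(prompt, y, y_prime)` returns the probability (or a 0/1
    judgment) that y is preferred over y_prime. Both are placeholders.
    """
    opponents = random.sample(policy_samples, min(num_comparisons, len(policy_samples)))
    wins = sum(preference_fn(prompt, response, y_prime) for y_prime in opponents)
    return wins / len(opponents)
```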
We present the practical implementation of DNO in Algorithm 2 (DNO-Prct), a batched on-policy algorithm that conducts self-improvement iteratively via contrastive learning. One key consideration in our algorithmic design is that we only need to use the reward function rt implicitly, through specifically designed on-policy sampling, data filtering, and pair construction. While these design choices make DNO-Prct appear similar to simply performing DPO iteratively, there are significant reasons behind them, as we discuss below.
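As a rough illustration of this batched on-policy structure, the following sketch strings these pieces together. All helper functions (sample_responses, construct_pairs, dpo_update) are hypothetical placeholders, and the loop reflects the textual description above rather than the exact pseudocode of Algorithm 2.

```python
def dno_prct(policy, prompts, preference_fn, num_iterations=3, samples_per_prompt=8):
    """A minimal sketch of iterative contrastive self-improvement in the spirit
    of DNO-Prct. Each iteration: (1) sample responses on-policy, (2) score them
    with the general preference function via estimated win rates, (3) filter and
    build chosen/rejected pairs, (4) run a DPO-style contrastive update with the
    current policy as the reference. Helpers are illustrative placeholders.
    """
    for _ in range(num_iterations):
        pair_batch = []
        for x in prompts:
            # (1) on-policy sampling from the current policy pi_t
            candidates = sample_responses(policy, x, k=samples_per_prompt)
            # (2) implicit reward r_t: expected win rate under the preference P
            scores = [estimate_win_rate(x, y, candidates, preference_fn) for y in candidates]
            # (3) data filtering and pair construction from the win-rate ranking
            pair_batch.extend(construct_pairs(x, candidates, scores))
        # (4) contrastive (DPO-style) update, using pi_t as the reference policy
        policy = dpo_update(policy, reference=policy, pairs=pair_batch)
    return policy
```

In this sketch, the reward rt never needs to be represented explicitly: it enters only through the win-rate scores used to rank and filter candidate pairs.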
Relationship between DNO-Prct and DPO. The reader may discern that DNO-Prct (Algorithm 2)—the practical implementation of DNO—can be described as an iterative version of the DPO algorithm. This similarity is by design, intended to harness the simplicity and effectiveness of DPO (Rafailov et al., 2023) and to build on empirical advances from recent work that applies DPO iteratively (e.g., Yuan et al., 2024; Tran et al., 2024). Our experiments point to the importance of several design choices that accommodate general preferences, such as rankings derived from pairwise win rates. More interestingly, our findings point to a surprising connection—that "a meticulously designed iterative DPO algorithm" could approach the Nash equilibrium of any given general preference.
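For instance, one plausible pair-construction rule based on pairwise win rates might look like the sketch below; the margin threshold and the all-pairs selection rule are assumptions made for illustration, not the paper's exact filtering criteria.

```python
def construct_pairs(prompt, candidates, scores, margin=0.3):
    """Build (chosen, rejected) training pairs from candidates ranked by their
    estimated win rates, keeping only pairs where the chosen response beats the
    rejected one by at least `margin`. The threshold and pairing rule are
    illustrative assumptions, not the paper's exact criteria.
    """
    ranked = sorted(zip(candidates, scores), key=lambda cs: cs[1], reverse=True)
    return [
        {"prompt": prompt, "chosen": c, "rejected": r}
        for c, score_c in ranked
        for r, score_r in ranked
        if score_c - score_r >= margin
    ]
```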
Our general algorithmic framework—DNO (Algorithm 1)—is broader than, and fundamentally different from, iterative DPO. For example, the DNO framework can be directly extended to the regularized-preference case (as discussed in Appendix A) or equipped with other advanced sampling techniques (e.g., Liu et al., 2024b, RSO), as suggested by Theorem 1, for sample efficiency. On the other hand, although soft policy iteration (i.e., KL-regularized reward optimization) is used in both DNO and DPO, it arises for fundamentally different reasons. For DNO, the KL-regularization originates from online learning, specifically no-regret learning via mirror descent (Nemirovskij and Yudin, 1983) or follow-the-regularized-leader (FTRL) (Kalai and Vempala, 2005; Cesa-Bianchi and Lugosi, 2006; Shalev-Shwartz et al., 2012; Hazan et al., 2016). For DPO and PPO, the KL-regularization is an approximation of the total-variation penalty used to ensure monotonic policy improvement (Kakade and Langford, 2002; Schulman et al., 2015); this approach was later simplified by Schulman et al. (2017, PPO) and has recently been used for post-training LLMs (Ouyang et al., 2022).
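For reference, the soft-policy-iteration update and a DPO-style contrastive loss that implements it implicitly can be written as below. This is a sketch using standard DPO notation, where β plays the role of the KL-regularization coefficient, πt is the reference (current) policy, rt is the implicit win-rate reward, and (y_w, y_l) denotes a chosen/rejected pair; the paper's own notation may differ.

```latex
% Soft policy iteration (KL-regularized reward maximization) against \pi_t:
\pi_{t+1}(y \mid x) \;\propto\; \pi_t(y \mid x)\,\exp\!\big(r_t(x, y)/\beta\big)

% A DPO-style contrastive loss whose minimizer, under a Bradley-Terry model
% on the pairs, recovers the same update:
\mathcal{L}_t(\pi) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[\log \sigma\!\left(
  \beta \log \frac{\pi(y_w \mid x)}{\pi_t(y_w \mid x)}
  - \beta \log \frac{\pi(y_l \mid x)}{\pi_t(y_l \mid x)}
\right)\right]
```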
This paper is available on arXiv under a CC BY 4.0 DEED license.