How Contrastive Learning Helps AI Self-Improve

by Language Models (dot tech), April 16th, 2025

Too Long; Didn't Read

This section introduces DNO-Prct, a practical and scalable implementation of Direct Nash Optimization. It relies on iterative contrastive learning—similar to DPO—but is designed for batched on-policy training with general preferences. By using reward signals only implicitly and carefully structuring pairwise comparisons, DNO-Prct enables efficient self-improvement and can approach the Nash equilibrium of a general preference function.

Authors:

(1) Corby Rosset, Microsoft Research and Correspondence to [email protected];

(2) Ching-An Cheng, Microsoft Research;

(3) Arindam Mitra, Microsoft Research;

(4) Michael Santacroce, Microsoft Research;

(5) Ahmed Awadallah, Microsoft Research and Correspondence to [email protected];

(6) Tengyang Xie, Microsoft Research and Correspondence to [email protected].

Abstract and 1 Introduction

2 Preliminaries

2.1 RLHF Based on Reward Models

2.2 RLHF with General Preferences

3 Direct Nash Optimization and 3.1 Derivation of Algorithm 1

3.2 Theoretical Analysis

4 Practical Algorithm – Iterative Contrastive Self-Improvement

5 Experiments and 5.1 Experimental Setup

5.2 Results and Analysis

6 Related Work

7 Conclusion and References


Appendix

A Extension to Regularized Preferences

B Detailed Proofs

C Additional Experimental Details

4 Practical Algorithm – Iterative Contrastive Self-Improvement

In this section, we shift our focus to the algorithmic design of a practically scalable version of DNO, following the principles discussed in the previous section. A primary challenge in implementing the conceptual algorithm DNO (Algorithm 1) stems from the need to compute the expectation with respect to the preference function P under the current policy πt. Perhaps surprisingly, as we will show, all we need is a properly implemented iterative, DPO-like contrastive learning algorithm.
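For reference, the per-iteration contrastive objective in such an iterative scheme is the standard DPO loss of Rafailov et al. (2023), with the current iterate πt serving as the reference policy. The rendering below is our paraphrase rather than a verbatim equation from the paper (β denotes the inverse-temperature coefficient and Dt the preference-pair dataset built at iteration t):

\mathcal{L}_{\mathrm{DPO}}(\theta;\,\pi_t) \;=\; -\,\mathbb{E}_{(x,\,y^{+},\,y^{-}) \sim \mathcal{D}_t}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y^{+} \mid x)}{\pi_t(y^{+} \mid x)} \;-\; \beta \log \frac{\pi_\theta(y^{-} \mid x)}{\pi_t(y^{-} \mid x)}\right)\right]

where σ is the logistic function; the minimizer πθ of this loss is taken as the next iterate πt+1.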


We present the practical implementation of DNO in Algorithm 2 (DNO-Prct), a batched on-policy algorithm that self-improves iteratively via contrastive learning. A key consideration in our algorithmic design is that we only need to use the reward function rt implicitly, through specifically designed on-policy sampling, data filtering, and pair construction. While these choices make DNO-Prct look similar to simply running DPO iteratively, there are significant reasons behind them, as we discuss below.
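Since Algorithm 2 itself is not reproduced in this excerpt, the following is a minimal sketch of the shape of one batched on-policy iteration as described above (on-policy sampling, preference annotation, filtering, pair construction, then a DPO-style update). The callables sample_fn, rank_fn, and dpo_update_fn, and the margin threshold, are illustrative placeholders, not APIs or settings from the paper's implementation.

```python
# Illustrative sketch of one batched on-policy DNO-Prct iteration
# (placeholder callables; not the authors' actual implementation).

def dno_prct_iteration(policy_t, prompts, sample_fn, rank_fn, dpo_update_fn,
                       num_samples=5, margin=0.3):
    """Sample on-policy, rank with the preference function, filter,
    build contrastive pairs, then run a DPO-style update against pi_t."""
    pairs = []
    for x in prompts:
        # 1) On-policy sampling: draw several candidate responses from pi_t.
        candidates = sample_fn(policy_t, x, num_samples)

        # 2) Rank candidates under the general preference function,
        #    e.g., by pairwise win rate (see the ranking sketch below).
        ranked = rank_fn(x, candidates)  # list of (response, score), best first

        # 3) Data filtering and pair construction: keep best-vs-other pairs
        #    whose score gap exceeds a margin, so the reward signal enters
        #    only implicitly through which pairs survive.
        best, best_score = ranked[0]
        pairs += [{"prompt": x, "chosen": best, "rejected": y}
                  for (y, s) in ranked[1:] if best_score - s >= margin]

    # 4) Contrastive (DPO-style) update with pi_t as the reference policy,
    #    producing the next iterate pi_{t+1}.
    return dpo_update_fn(policy_t, pairs, reference=policy_t)
```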





Relationship between DNO-Prct and DPO. The reader may discern that DNO-Prct (Algorithm 2)—the practical implementation of DNO—can be described as an iterative version of the DPO algorithm. Such similarity is by design, intended to harness the simplicity and effectiveness of DPO (Rafailov et al., 2023) and to build on empirical advances from recent work that applies DPO iteratively (e.g., Yuan et al., 2024; Tran et al., 2024). Our experiments point to the importance of several design choices that help accommodate general preferences, such as rankings derived from pair-wise win rates. More interestingly, our findings point to a surprising connection—that "a meticulously designed iterative DPO algorithm" could approach the Nash equilibrium of any given general preference.
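To make "rankings derived from pair-wise win rates" concrete, here is a small self-contained sketch (our illustration, not code from the paper) that scores each candidate response by its average win rate against the other candidates under a pairwise preference function; it could serve as the rank_fn placeholder in the iteration sketch above.

```python
from typing import Callable, List, Tuple

def rank_by_win_rate(
    prompt: str,
    candidates: List[str],
    prefers: Callable[[str, str, str], float],
) -> List[Tuple[str, float]]:
    """Rank candidate responses by average pairwise win rate.

    `prefers(prompt, a, b)` is assumed to return the probability (in [0, 1])
    that response `a` is preferred over response `b` for `prompt`,
    e.g., as judged by a strong pairwise annotator.
    """
    scored = []
    for i, a in enumerate(candidates):
        others = [b for j, b in enumerate(candidates) if j != i]
        # Average probability that `a` beats each of the other candidates.
        win_rate = sum(prefers(prompt, a, b) for b in others) / max(len(others), 1)
        scored.append((a, win_rate))
    # Highest average win rate first.
    return sorted(scored, key=lambda item: item[1], reverse=True)
```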


Figure 2: Comparison of various post-training techniques, showing that Direct Nash Optimization (DNO) is the most effective. All methods with colorful error bands are 1) implemented by ourselves, 2) initialized with a 7B-parameter Orca-2.5 LLM, and 3) "batched on-policy" (except SFT and Offline DPO, which are trained in epochs), all else being equal.


Our general algorithmic framework—DNO (Algorithm 1)—is broader than and fundamentally different from iterative DPO. For example, the DNO framework can also be directly extended to the regularized-preference case (as discussed in Appendix A) or equipped with other advanced sampling techniques (e.g., Liu et al., 2024b, RSO), as suggested by Theorem 1, for sample efficiency. On the other hand, although soft policy iteration (i.e., KL-regularized reward optimization) appears in both DNO and DPO, it arises for fundamentally different reasons. For DNO, the KL-regularization originates from online learning: no-regret learning through mirror descent (Nemirovskij and Yudin, 1983) or follow-the-regularized-leader (FTRL) (Kalai and Vempala, 2005; Cesa-Bianchi and Lugosi, 2006; Shalev-Shwartz et al., 2012; Hazan et al., 2016). For DPO and PPO, the KL-regularization is an approximation of the total variation penalty used to ensure monotonic improvement of the policy (Kakade and Langford, 2002; Schulman et al., 2015); this approach was later simplified by Schulman et al. (2017, PPO) and recently applied to post-training LLMs (Ouyang et al., 2022).
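For concreteness, the soft policy iteration (KL-regularized reward optimization) update that both viewpoints refer to can be written as follows; this is our rendering, with ρ the prompt distribution, rt the iteration-t reward induced by the preference function, and η the regularization coefficient (notation may differ slightly from Section 3):

\pi_{t+1} \;=\; \arg\max_{\pi}\; \mathbb{E}_{x \sim \rho,\, y \sim \pi(\cdot \mid x)}\big[ r_t(x, y) \big] \;-\; \tfrac{1}{\eta}\, \mathbb{E}_{x \sim \rho}\big[ \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_t(\cdot \mid x) \big) \big],
\qquad
\pi_{t+1}(y \mid x) \;\propto\; \pi_t(y \mid x)\, \exp\big( \eta\, r_t(x, y) \big).

In DNO this update is motivated as a no-regret (mirror descent / FTRL) step, whereas in DPO and PPO the same KL term plays the role of a trust-region penalty approximating total variation.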


This paper is available on arXiv under the CC BY 4.0 DEED license.

