Understanding Concentrability in Direct Nash Optimization

Written by languagemodels | Published 2025/04/17
Tech Story Tags: llm-fine-tuning | direct-nash-optimization | contrastive-learning-ai | ai-feedback-loops | ai-preference-optimization | how-to-train-ai | rlhf-optimization | dno-algorithm

TLDR In this section, we provide detailed theoretical proofs supporting the Direct Nash Optimization (DNO) framework. The proof of Theorem 2 proceeds in two steps, starting from regression with the logarithmic loss and deriving a squared error bound from it. The definitions and assumptions draw heavily on concentrability from reinforcement learning theory (specifically the works of Xie et al., 2021, 2023). The section intentionally simplifies some concepts for clarity; a full theoretical analysis is beyond the paper's scope. The proofs also leverage standard results from regression theory, with additional references provided for deeper understanding.

Authors:

(1) Corby Rosset, Microsoft Research and Correspondence to [email protected];

(2) Ching-An Cheng, Microsoft Research;

(3) Arindam Mitra, Microsoft Research;

(4) Michael Santacroce, Microsoft Research;

(5) Ahmed Awadallah, Microsoft Research and Correspondence to [email protected];

(6) Tengyang Xie, Microsoft Research and Correspondence to [email protected].

Table of Links

Abstract and 1 Introduction

2 Preliminaries

2.1 RLHF Based on Reward Models

2.2 RLHF with General Preferences

3 Direct Nash Optimization and 3.1 Derivation of Algorithm 1

3.2 Theoretical Analysis

4 Practical Algorithm – Iterative Contrastive Self-Improvement

5 Experiments and 5.1 Experimental Setup

5.2 Results and Analysis

6 Related Work

7 Conclusion and References

Appendix

A Extension to Regularized Preferences

B Detailed Proofs

C Additional Experimental Details

B Detailed Proofs

In this section, we provide detailed proofs for our theoretical results. Note that the definitions and assumptions presented here draw heavily on ideas related to version space and concentrability from the reinforcement learning theory literature (especially Xie et al., 2021, 2023). Nevertheless, the descriptions provided herein are intentionally simplified to elucidate the core insights behind the algorithmic design; a full and exhaustive theoretical analysis falls outside the primary scope of this paper. We now make the following definitions and assumptions.
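The definitions and assumptions themselves are stated formally in the paper. As an illustrative LaTeX sketch of their typical shape in this literature (the notation below is ours, not necessarily the paper's: Π is the preference-function class, 𝒫* the true preference function, and D_t the preference data collected at iteration t), a realizability assumption and a log-loss version space would read:

    % Illustrative sketch only; notation is ours, not necessarily the paper's.
    % Realizability: the true preference function lies in the class.
    \mathcal{P}^\star \in \Pi.
    % Version space: all functions whose empirical log loss on D_t is
    % within a statistical tolerance \beta of the best in class.
    \Pi_t := \Big\{ \mathcal{P} \in \Pi :
        \widehat{L}_{D_t}(\mathcal{P}) \le \min_{\mathcal{P}' \in \Pi} \widehat{L}_{D_t}(\mathcal{P}') + \beta \Big\},
    \qquad
    \widehat{L}_{D_t}(\mathcal{P}) := -\sum_{(x, y, y', z) \in D_t}
        \log\big[ z\, \mathcal{P}(y \succ y' \mid x)
                  + (1 - z)\big(1 - \mathcal{P}(y \succ y' \mid x)\big) \big].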

Definition 2 can be viewed as a natural extension of concentrability from the (offline) reinforcement learning literature to our setup.
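For intuition, a concentrability coefficient bounds how much a comparator distribution can amplify errors that were measured under the data-collection distribution. A hedged sketch of such an extension to the preference setting, following the style of Xie et al. (2021) (again in our illustrative notation: ρ is the prompt distribution, μ_t the sampling policy at iteration t, and π an arbitrary comparator policy):

    % Hedged sketch of a concentrability coefficient for preferences;
    % not necessarily the paper's exact Definition 2.
    C_\pi := \max_{\mathcal{P} \in \Pi}
        \frac{\mathbb{E}_{x \sim \rho,\, y \sim \pi,\, y' \sim \mu_t}
              \big[ \big( \mathcal{P}(y \succ y' \mid x) - \mathcal{P}^\star(y \succ y' \mid x) \big)^2 \big]}
             {\mathbb{E}_{x \sim \rho,\, y, y' \sim \mu_t}
              \big[ \big( \mathcal{P}(y \succ y' \mid x) - \mathcal{P}^\star(y \succ y' \mid x) \big)^2 \big]}.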

Proof of Theorem 2. We present the proof via the following two-step procedure.

Step 1: From regression with the logarithmic loss to a squared error bound. By standard results on regression with the logarithmic loss, we know:
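As a hedged sketch of the standard guarantee being invoked here (for a finite class Π under realizability; constants omitted; see Foster and Krishnamurthy, 2021): with probability at least 1 − δ, the log-loss regression solution 𝒫̂ over n samples satisfies a bound of the form

    % Standard log-loss regression guarantee (sketch; constants omitted,
    % notation ours).
    \mathbb{E}_{x \sim \rho,\, y, y' \sim \mu_t}
        \Big[ \big( \widehat{\mathcal{P}}(y \succ y' \mid x)
                    - \mathcal{P}^\star(y \succ y' \mid x) \big)^2 \Big]
    \lesssim \frac{\log(|\Pi| / \delta)}{n}.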

Note that similar results could also apply beyond finite Π; for simplicity, we omit the detailed discussion in our paper. For more in-depth discussion of regression with the logarithmic loss, the reader can refer to, e.g., Foster and Krishnamurthy (2021).
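To make "regression with the logarithmic loss" concrete, below is a minimal, self-contained Python sketch on synthetic pairwise-preference data. Everything here (the sigmoid-linear model, the feature map, the training loop) is hypothetical scaffolding for illustration, not the paper's implementation:

    import numpy as np

    # Hypothetical setup: each example is a feature vector phi(x, y, y')
    # summarizing a prompt x and a response pair (y, y'), with a binary
    # label z = 1 if y was preferred over y'.
    rng = np.random.default_rng(0)
    n, d = 1000, 5
    Phi = rng.normal(size=(n, d))          # features phi(x, y, y')
    w_star = rng.normal(size=d)            # "true" preference parameters
    p_star = 1 / (1 + np.exp(-Phi @ w_star))
    z = rng.binomial(1, p_star)            # sampled preference labels

    def log_loss(w):
        """Logarithmic (cross-entropy) loss of P_w(y > y' | x) = sigmoid(phi @ w)."""
        p = 1 / (1 + np.exp(-Phi @ w))
        eps = 1e-12                        # avoid log(0)
        return -np.mean(z * np.log(p + eps) + (1 - z) * np.log(1 - p + eps))

    # Plain gradient descent on the log loss (the "regression" step).
    w = np.zeros(d)
    lr = 0.5
    for _ in range(500):
        p = 1 / (1 + np.exp(-Phi @ w))
        grad = Phi.T @ (p - z) / n         # gradient of the mean log loss
        w -= lr * grad

    # Minimizing the log loss also drives the fitted probabilities close
    # to the true ones in squared error, mirroring Step 1 of the proof.
    p_hat = 1 / (1 + np.exp(-Phi @ w))
    print("final log loss:", log_loss(w))
    print("mean squared error to true probabilities:", np.mean((p_hat - p_star) ** 2))

Note how minimizing the log loss drives down the squared error to the true preference probabilities; that conversion is exactly what Step 1 formalizes.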

On the other hand, we have
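Assuming this step follows the standard concentrability argument, its shape would be to transfer the on-distribution squared error from Step 1 to the comparator distribution via Definition 2. A hedged reconstruction in our illustrative notation:

    % Hedged sketch: transferring the error via concentrability;
    % not necessarily the paper's exact display.
    \mathbb{E}_{x \sim \rho,\, y \sim \pi,\, y' \sim \mu_t}
        \Big[ \big( \widehat{\mathcal{P}}(y \succ y' \mid x)
                    - \mathcal{P}^\star(y \succ y' \mid x) \big)^2 \Big]
    \le C_\pi \cdot \mathbb{E}_{x \sim \rho,\, y, y' \sim \mu_t}
        \Big[ \big( \widehat{\mathcal{P}} - \mathcal{P}^\star \big)^2 \Big]
    \lesssim \frac{C_\pi \log(|\Pi| / \delta)}{n}.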

This paper is available on arxiv under CC BY 4.0 DEED license.

