Exploring Cutting-Edge Approaches to Iterative LLM Fine Tuning

Written by languagemodels | Published 2025/04/16

TLDR: Section 6 (Related Work) surveys related AI training approaches, categorizing them by whether they use SFT or contrastive losses and whether they operate in offline or online settings.

Authors:

(1) Corby Rosset, Microsoft Research and Correspondence to [email protected];

(2) Ching-An Cheng, Microsoft Research;

(3) Arindam Mitra, Microsoft Research;

(4) Michael Santacroce, Microsoft Research;

(5) Ahmed Awadallah, Microsoft Research and Correspondence to [email protected];

(6) Tengyang Xie, Microsoft Research and Correspondence to [email protected].

Table of Links

Abstract and 1 Introduction

2 Preliminaries

2.1 RLHF Based on Reward Models

2.2 RLHF with General Preferences

3 Direct Nash Optimization and 3.1 Derivation of Algorithm 1

3.2 Theoretical Analysis

4 Practical Algorithm – Iterative Contrastive Self-Improvement

5 Experiments and 5.1 Experimental Setup

5.2 Results and Analysis

6 Related Work

7 Conclusion and References

Appendix

A Extension to Regularized Preferences

B Detailed Proofs

C Additional Experimental Details

6 Related Work

We organize related work by whether the techniques use SFT or contrastive losses, and by whether they operate in offline or online update settings.

Online RLHF algorithms: RLHF pioneered the alignment of language models with human preferences (Christiano et al., 2017; Stiennon et al., 2020), but it is unstable to train and memory-intensive, requiring all three of the parameterized policy model, the reward model, and the advantage model to reside on device during training.
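
For intuition about that footprint, here is a schematic (not from the paper) of a single PPO-style online RLHF step. Every helper below (generate, reward, estimate_value, ppo_update) is a hypothetical placeholder; the point is simply that the policy, reward model, and value/advantage model are all queried within one update and must therefore be resident together.

# Schematic of one online RLHF (PPO-style) update. All helpers are hypothetical
# placeholders; the sketch only illustrates why the policy, reward model, and
# value/advantage model all need to be on device at the same time.
def online_rlhf_step(policy, reward_model, value_model, prompts):
    responses = [generate(policy, x) for x in prompts]                           # on-policy sampling
    rewards = [reward(reward_model, x, y) for x, y in zip(prompts, responses)]   # scalar scores
    values = [estimate_value(value_model, x, y) for x, y in zip(prompts, responses)]
    advantages = [r - v for r, v in zip(rewards, values)]                        # advantage estimates
    ppo_update(policy, prompts, responses, advantages)                           # clipped policy-gradient step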

Reward-model Augmented SFT: Since the introduction of RLHF, several techniques have emerged that apply reward models in various ways, such as filtering training data or ranking responses. Reward rAnked FineTuning (RAFT) (Dong et al., 2023) and RRHF (Yuan et al., 2023b) offer the conceptually simplest solution for offline preference learning: sample multiple outputs from a policy, rank them with a reward model, and then finetune on the best sampled output using SFT. This resembles the iterative behavior-cloning technique DAgger (Ross et al., 2011).
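
As a rough sketch of that recipe (a simplification, not the exact RAFT or RRHF implementation), one round looks like the following; generate, reward, and sft_step are assumed placeholder helpers.

# Simplified reward-ranked finetuning loop. `generate`, `reward`, and `sft_step`
# are hypothetical helpers standing in for sampling, reward scoring, and a
# supervised finetuning update on a single (prompt, response) pair.
def reward_ranked_finetune(policy, reward_model, prompts, k=8, iterations=3):
    for _ in range(iterations):
        for x in prompts:
            candidates = [generate(policy, x) for _ in range(k)]      # sample k responses
            scores = [reward(reward_model, x, y) for y in candidates]
            best = candidates[scores.index(max(scores))]              # keep the highest-reward sample
            sft_step(policy, x, best)                                 # finetune on the best response only
    return policy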

Offline Contrastive Preference Learning: Several loss functions for contrastive preference learning were first introduced in the offline setting, namely Direct Preference Optimization (DPO) (Rafailov et al., 2023) and Calibrated Sequence Likelihood Estimation, a.k.a. SLiC (Zhao et al., 2023). Azar et al. (2023) make it clear that point-wise reward estimates are no substitute for pair-wise preferences, and that a policy can easily overfit to deterministic preferences without proper regularization. They derive a more general objective for RLHF, IPO, which directly optimizes offline preference probabilities.
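
For reference, a minimal PyTorch-style rendering of the standard DPO objective and the IPO variant is shown below; it operates on per-sequence log-probabilities under the policy and the frozen reference model, with beta and tau as the respective regularization strengths. This is a sketch of the published losses, not code released with this paper.

import torch
import torch.nn.functional as F

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Log-ratio "implicit rewards" for the preferred (w) and dispreferred (l) responses.
    margin = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

def ipo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    # IPO regresses the log-ratio gap toward 1/(2*tau) instead of saturating a sigmoid,
    # which guards against overfitting to deterministic preferences.
    gap = (pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l)
    return ((gap - 1.0 / (2.0 * tau)) ** 2).mean()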

Statistical Rejection Sampling Optimization (RSO) generates multiple samples from an initial model, ranks them to create training pairs, and optimizes them under a unified framework encompassing DPO and SLiC (Liu et al., 2024b). Inspired by the learning-to-rank literature, Listwise Preference Optimization (LiPO) extends pair-wise preference learning to the list-wise setting (Liu et al., 2024a). Preference Ranking Optimization (PRO) also learns from list-wise preferences (Song et al., 2024). The KTO algorithm takes a different approach from DPO: rather than assuming a pair of good-vs-bad outputs for the same input, it assumes only a pool of good outputs and a pool of bad outputs across inputs and optimizes an "unpaired" loss (Ethayarajh et al., 2024).
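
The data-format distinction behind KTO's "unpaired" loss is easy to see in a small illustration: paired methods such as DPO, SLiC, and IPO need a chosen and a rejected response for the same prompt, whereas KTO only needs outputs labeled desirable or undesirable, which need not share prompts. The records below are toy, illustrative examples, not data from any of these papers.

# Paired preference data, as required by DPO/SLiC/IPO-style contrastive losses.
paired_example = {
    "prompt": "Explain gradient descent.",
    "chosen": "Gradient descent iteratively moves parameters against the gradient ...",
    "rejected": "Gradient descent is a type of database index ...",
}

# Unpaired data, as consumed by KTO: each record is (prompt, output, label),
# so the pools of good and bad outputs need not share the same prompts.
unpaired_examples = [
    {"prompt": "Explain gradient descent.", "output": "It follows the negative gradient ...", "label": "desirable"},
    {"prompt": "Summarize the article.", "output": "The article is probably about cats.", "label": "undesirable"},
]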

Iterative Reward-based Finetuning: Reinforced Self-Training (ReST) is one of the first methods to explore iterative self-improving training strategies, framed as a two-stage procedure: a "Grow" step that samples from the current policy, and an "Improve" step that uses a reward model to filter ever-higher-quality samples, which are then used to improve the policy with offline RL (Gulcehre et al., 2023). A follow-up work explores the use of AI feedback rather than reward ranking (Singh et al., 2023).
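
A minimal sketch of that Grow/Improve pattern, with a reward threshold that rises each iteration, is below; generate, reward, and offline_update are hypothetical placeholders, and details such as the thresholding schedule and the offline RL objective differ from the actual ReST procedure.

# Illustrative Grow/Improve loop in the spirit of ReST. `generate`, `reward`, and
# `offline_update` are hypothetical helpers.
def grow_improve(policy, reward_model, prompts, iterations=3, k=16, start_threshold=0.0, step=0.2):
    threshold = start_threshold
    for _ in range(iterations):
        # Grow: sample a dataset from the current policy.
        grown = [(x, generate(policy, x)) for x in prompts for _ in range(k)]
        # Improve: keep only samples whose reward clears an increasingly strict threshold.
        kept = [(x, y) for (x, y) in grown if reward(reward_model, x, y) >= threshold]
        offline_update(policy, kept)   # e.g., weighted SFT / offline RL on the filtered data
        threshold += step
    return policy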

On-policy Contrastive Learning: Self-Rewarding Language Models (Yuan et al., 2024) is, in practice, very similar to DNO. That work studies the benefits of batched, iterative training on preferences derived from a recent policy's sampled outputs; however, it uses the policy itself as the annotator, which initially provides only weak preference signals. Self-Play Fine-Tuning (SPIN) (Chen et al., 2024) and Adversarial Preference Optimization (APO) (Cheng et al., 2023) are both iterative LLM training techniques compatible with contrastive losses, but they make the very limiting assumption that the teacher is better than the student, without regard to any annotator feedback.
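
A condensed sketch of that self-annotating loop, in which the same model both generates candidates and judges them, is shown below; generate, llm_judge_score, and dpo_update are hypothetical placeholders, not the authors' implementation.

# Illustrative self-annotating iteration in the spirit of Self-Rewarding LMs.
# `generate`, `llm_judge_score`, and `dpo_update` are hypothetical helpers.
def self_rewarding_iteration(policy, prompts, k=4):
    pairs = []
    for x in prompts:
        candidates = [generate(policy, x) for _ in range(k)]
        # The policy itself scores its own candidates via an LLM-as-a-judge prompt,
        # which is a weak preference signal early in training.
        ranked = sorted(candidates, key=lambda y: llm_judge_score(policy, x, y), reverse=True)
        pairs.append({"prompt": x, "chosen": ranked[0], "rejected": ranked[-1]})
    dpo_update(policy, pairs)   # contrastive update on the self-annotated pairs
    return policy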

The Cringe Loss (Adolphs et al., 2022) is a token-level loss function that contrasts the correct next token with a hard-negative token from the vocabulary that has high logit weight but is still incorrect. The Pairwise Cringe Loss (Xu et al., 2023b) applies the Cringe loss to an iterative, self-improving style of training.
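
Following the description above, a simplified token-level rendering of this idea might look like the sketch below: standard cross-entropy on the correct token plus a contrastive term against the highest-logit incorrect token at each position. The exact Cringe loss differs in details (for example, how negatives are selected), so treat this as an approximation rather than the published loss.

import torch
import torch.nn.functional as F

def cringe_style_loss(logits, target_ids, alpha=1.0):
    # logits: (seq_len, vocab_size); target_ids: (seq_len,) correct next tokens.
    # Standard next-token cross-entropy on the correct tokens.
    ce = F.cross_entropy(logits, target_ids)
    # Hard negative per position: the highest-logit token that is NOT the correct one.
    masked = logits.clone()
    masked[torch.arange(logits.size(0)), target_ids] = float("-inf")
    hard_neg = masked.argmax(dim=-1)
    # Contrastive term: prefer the correct token over its hard negative.
    pos = logits.gather(1, target_ids.unsqueeze(1)).squeeze(1)
    neg = logits.gather(1, hard_neg.unsqueeze(1)).squeeze(1)
    contrast = -F.logsigmoid(pos - neg).mean()
    return ce + alpha * contrast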

On-Policy General Preference Optimization: Wang et al. (2023) consider finding the von Neumann winner of general preferences via multi-agent RL from a theoretical perspective. Nash-MD optimizes a policy towards the Nash equilibrium of a general preference model using policy gradients, showing that by sampling from a mixture of policies one can converge to the Nash equilibrium in the last iterate (Munos et al., 2023). Self-play Preference Optimization (SPO) is another online two-player minimax game that converges to a Nash equilibrium with no-regret guarantees (Swamy et al., 2024). However, these techniques are not as data-efficient as contrastive losses and are difficult to implement faithfully without cumbersome two-timescale updates (Munos et al., 2023). A concurrent improvement, IPO-MD, mitigates these difficulties by using purely on-policy IPO updates and is evaluated empirically on an article-summarization task (Calandriello et al., 2024). Guo et al. (2024) also propose to eliminate reward models in online AI feedback (OAIF) by using another LLM to annotate which of two online-sampled outputs from the current policy is preferred. However, all the above studies only consider training pairs constructed from self-play "student vs. student" samples or from comparisons between the student and the initial policy πref; that is, there is no concept of a more powerful "teacher" to compare against in their training pairs. We showed in Table 2 that omitting these "student vs. teacher" preferences may hinder performance.
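
Concretely, the object these general-preference methods approximate is the von Neumann winner of a preference function P, i.e., the solution of a symmetric two-player game (written here informally, following the setup in Section 2.2; each paper adds its own regularization):

\pi^{\star} \;=\; \arg\max_{\pi}\; \min_{\pi'}\; \mathbb{E}_{x \sim \rho,\; y \sim \pi(\cdot \mid x),\; y' \sim \pi'(\cdot \mid x)} \big[\, \mathcal{P}(y \succ y' \mid x) \,\big]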

This paper is available on arXiv under a CC BY 4.0 DEED license.

