Authors:
(1) Corby Rosset, Microsoft Research and Correspondence to [email protected];
(2) Ching-An Cheng, Microsoft Research;
(3) Arindam Mitra, Microsoft Research;
(4) Michael Santacroce, Microsoft Research;
(5) Ahmed Awadallah, Microsoft Research and Correspondence to [email protected];
(6) Tengyang Xie, Microsoft Research and Correspondence to [email protected].
Table of Links
2.1 RLHF Based on Reward Models
2.2 RLHF with General Preferences
3 Direct Nash Optimization and 3.1 Derivation of Algorithm 1
4 Practical Algorithm – Iterative Contrastive Self-Improvement
5 Experiments and 5.1 Experimental Setup
Appendix
A Extension to Regularized Preferences
C Additional Experimental Details
5 Experiments
Algorithm 2 is chosen for its efficiency and simplicity from an implementation standpoint (in this section, we use DNO to denote Algorithm 2, i.e., DNO-Prct, for simplicity). Once the input dataset {xᵢ ∈ X} is chosen, each iteration of DNO proceeds in three phases: sampling outputs from the current policy, annotating outputs for preference-pair generation, and training the next policy on the new training pairs. Iteration 0 is defined to start by sampling from the initial SFT model to produce training data for iteration 1.
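The loop below is a minimal sketch of one such iteration; the four callables (sampling, judge annotation, pair construction, and contrastive training) are hypothetical placeholders standing in for the components described in the rest of this section, not the paper's implementation.

```python
def dno_iteration(policy, prompts, gold, sample_fn, judge_fn, pair_fn, train_fn):
    """One batched DNO-Prct iteration (illustrative only). The callables are
    hypothetical stand-ins for the components detailed in this section."""
    # Phase 1: sample several outputs from the current policy for each prompt.
    samples = {x: sample_fn(policy, x, n=5) for x in prompts}
    # Phase 2: annotate policy samples plus the gold response with a judge.
    scores = {x: judge_fn(x, samples[x] + [gold[x]]) for x in prompts}
    # Phase 3: construct preference pairs and train the next policy on them.
    pairs = pair_fn(samples, gold, scores)
    return train_fn(policy, pairs)
```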
5.1 Experimental Setup
Every experiment except one in this study uses UltraFeedback alone. The exception is one "scaled-up" experiment with about 10x more data sourced from a mixture of datasets including Anthropic HH (Bai et al., 2022a), UltraChat (Ding et al., 2023), MetaMathQA (Yu et al., 2023), EvolInstruct (Xu et al., 2023a), UltraFeedback (Cui et al., 2023), and Orca-2 (Mitra et al., 2023). Note that we only use the input prompts from these datasets and collect GPT-4-Turbo responses for all 600k of them.
Sampling from the Policy: At the end of training, we sample 5 outputs from the resulting student policy using top-p sampling with p = 0.95 and temperature 0.7. Several works have shown the benefit of sampling and comparing multiple diverse outputs from the policy (Yuan et al., 2023a; Mitra et al., 2024; Liu et al., 2024b; Dong et al., 2023; Wang et al., 2022). We implement a simple defect-detection system that flags any sample with a high proportion of repeated n-grams as an automatic negative.
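As a concrete illustration of the defect detector, the sketch below flags a sample when a large fraction of its n-grams repeat; the value of n, the threshold, and whitespace tokenization are assumptions, since they are not specified above.

```python
from collections import Counter

def has_repetition_defect(text: str, n: int = 4, max_repeat_frac: float = 0.2) -> bool:
    """Flag a sample as an automatic negative when many of its n-grams repeat.
    The n-gram size, threshold, and tokenizer here are illustrative choices."""
    tokens = text.split()
    if len(tokens) < n:
        return False
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    # Count how many n-gram occurrences belong to an n-gram seen more than once.
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams) > max_repeat_frac
```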
Preference Annotation: We use GPT-4-Turbo "as a judge" to label preferences among the 5 policy samples and 1 gold sample (which is also from GPT-4-Turbo), as shown in Figure 3. This prompt contains a few minor modifications from the one used in (Yuan et al., 2024). It implements an additive scoring framework on a 6-point scale, where a score of 6 represents the highest-quality answer along dimensions such as "correctness", "expert knowledge", and "conciseness". By following this rubric, GPT-4 acting as an annotator serves as a best-effort general preference model because it compares multiple candidate responses side-by-side in the context window and stratifies them along meaningful dimensions of quality.
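The actual judge prompt appears in Figure 3; the helper below is only a rough, hypothetical stand-in showing the side-by-side, rubric-based format, not the prompt used in the experiments.

```python
def build_judge_prompt(instruction: str, candidates: list[str]) -> str:
    """Illustrative judge prompt: present all candidates side-by-side and ask
    for an additive score on a 6-point scale along quality dimensions."""
    numbered = "\n\n".join(
        f"[Response {i + 1}]\n{c}" for i, c in enumerate(candidates)
    )
    return (
        "Rate each response to the instruction below on an additive 6-point "
        "scale, considering correctness, expert knowledge, and conciseness.\n\n"
        f"[Instruction]\n{instruction}\n\n{numbered}\n\n"
        "Output one integer score (1-6) per response."
    )
```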
Training Pair Construction: Adhering to Line 6 in Algorithm 2 implies that not all pairs are suitable for training. Firstly, the positives must be high quality in an absolute sense, and secondly, the negatives must be directionally worse by a large margin. On the 6-point annotation scale, only samples that score a 5 or 6 are allowed to be positives. From the positives that meet this criterion, if any, we then construct all pairs such that the negative scores at least 2 points lower. If the positive happens to be from the student, we relax this constraint to a 1-point margin, since the GPT-4-Turbo teacher outputs rarely receive a score less than 5 (as shown by the average teacher score in Table 2).
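These filtering rules can be summarized by the following sketch; the tuple layout and function name are hypothetical, but the thresholds (positives score at least 5, a 2-point margin relaxed to 1 when the positive is a student sample) follow the description above.

```python
def build_training_pairs(scored_samples, min_positive=5,
                         teacher_margin=2, student_margin=1):
    """scored_samples: list of (text, score, is_student) tuples for one prompt,
    with scores on the 6-point judge scale. Returns (positive, negative) pairs."""
    pairs = []
    for pos_text, pos_score, pos_is_student in scored_samples:
        if pos_score < min_positive:
            continue  # positives must score 5 or 6 in an absolute sense
        # Relax the margin when the positive comes from the student policy.
        margin = student_margin if pos_is_student else teacher_margin
        for neg_text, neg_score, _ in scored_samples:
            if pos_score - neg_score >= margin:
                pairs.append((pos_text, neg_text))
    return pairs
```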
Additionally, we are motivated to preserve the preference behavior from previous iterations so that new policies do not inadvertently regress to past bad behavior. To enforce this, we incorporate an exponentially decaying proportion of prior iterations' training pairs into the current iteration, i.e., we sample at most 30% of training pairs from iteration t − 1, 15% from t − 2, and so on. We do not re-run inference on those inputs with the most recent policy. Recall that each iteration's inputs are non-overlapping with the splits used in other iterations.
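A sketch of this replay mixing is shown below; the helper name, the choice to cap each past iteration at a fraction of the current pair count, and the fixed random seed are assumptions made for illustration.

```python
import random

def mix_in_replay_pairs(current_pairs, pair_history, current_iter,
                        base_frac=0.3, seed=0):
    """Add an exponentially decaying proportion of earlier iterations' pairs:
    at most 30% from iteration t-1, 15% from t-2, and so on. `pair_history`
    maps iteration index -> list of pairs; details here are illustrative."""
    rng = random.Random(seed)
    mixed, frac = list(current_pairs), base_frac
    for t in range(current_iter - 1, -1, -1):
        past = pair_history.get(t, [])
        # Cap the replayed pairs at a decaying fraction of the current set size.
        k = min(len(past), int(frac * len(current_pairs)))
        mixed.extend(rng.sample(past, k))
        frac /= 2
    return mixed
```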
Training: To prevent overfitting, we train our batched on-policy methods for at most one epoch on the newly constructed pairs. Our effective batch size is fixed to 64 for all experiments. The learning rate, beta, and alpha are found with brief hyperparameter searches. For most experiments, the learning rate is 5e-5, beta is either 0.1 or 0.05, and alpha is 0.005. We found that at higher iterations, the learning rate needs to be lowered. In SFT (supervised fine-tuning) experiments, the learning rate is 5e-6 and we mask out the loss on the input tokens. We use the open-source TRL library's implementation to run our experiments.
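For reference, the reported hyperparameters can be collected into a plain configuration sketch like the one below; this is not the exact TRL trainer configuration, and the per-run choices (beta of 0.1 vs. 0.05, lowered learning rates at later iterations) are noted in comments.

```python
# Hedged summary of the hyperparameters reported above; values marked "per run"
# vary across experiments, and this dict is not the exact trainer config.
TRAINING_CONFIG = {
    "effective_batch_size": 64,
    "num_epochs": 1,             # at most one epoch on newly constructed pairs
    "learning_rate": 5e-5,       # lowered at higher iterations (per run)
    "beta": 0.1,                 # 0.1 or 0.05 depending on the run
    "alpha": 0.005,
    "sft_learning_rate": 5e-6,   # SFT runs also mask out loss on input tokens
}
```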
Evaluation: Our primary goal is to train a policy that is comparable to the most powerful state-of-the-art language models. Hence, AlpacaEval 2.0 (Dubois et al., 2023) is an appropriate benchmark because it computes win-rate against GPT-4-Turbo in a head-to-head fashion on a dataset of 805 input prompts that is shown to correlate with human preferences (0.93 Spearman correlation with Chatbot Arena). While it is known that auto-eval methods also correlate with spurious features such as length, a newer version of AlpacaEval 2.0 corrects for this with a length-controlled win-rate that has an even higher Spearman correlation (0.98) with Chatbot Arena [4].
We also evaluate on MT-Bench (Zheng et al., 2023), which has the LLM-as-a-judge first explain its reasoning before providing a scalar score from 1 to 10 for the candidate response to each question in a bank of 80 questions. One crucial difference between AlpacaEval 2.0 and MT-Bench is that the former asks GPT-4-Turbo to predict which of two side-by-side responses humans would prefer, weighted by the logits to represent its uncertainty, whereas MT-Bench asks the model to first generate a justification and then output a score from 1 to 10, but it neither defines the ratings (e.g., how a 7 differs from a 5) nor accounts for uncertainty in the logits of the score.
We also evaluate on the OpenLLM leaderboard (Beeching et al., 2023), which measures reasoning ability on downstream NLP tasks like coding and question answering by checking the accuracy of the multiple-choice answer option with the highest logit. Since our training data is primarily instruction-following and our models are not trained to output only the single answer option, this benchmark is not the primary target of this study; nonetheless, DNO on instruction-tuning tasks ought to show no regression on reasoning tasks.
This paper is available on arxiv under CC BY 4.0 DEED license.
[4] https://github.com/tatsu-lab/alpaca_eval