Authors:
(1) Corby Rosset, Microsoft Research and Correspondence to [email protected];
(2) Ching-An Cheng, Microsoft Research;
(3) Arindam Mitra, Microsoft Research;
(4) Michael Santacroce, Microsoft Research;
(5) Ahmed Awadallah, Microsoft Research and Correspondence to [email protected];
(6) Tengyang Xie, Microsoft Research and Correspondence to [email protected].
Table of Links
2 Preliminaries
2.1 RLHF Based on Reward Models
2.2 RLHF with General Preferences
3 Direct Nash Optimization and 3.1 Derivation of Algorithm 1
4 Practical Algorithm – Iterative Contrastive Self-Improvement
5 Experiments and 5.1 Experimental Setup
Appendix
A Extension to Regularized Preferences
C Additional Experimental Details
C Additional Experimental Details
Batched Prompting: We also show in Figure 3 the prompt that we send to GPT-4 to annotate preferences. For the sake of efficiency, we “batch” requests to GPT-4, meaning that instead of sending every pair of candidate responses to be annotated separately, we show all candidates side-by-side and ask GPT-4 to apply the scoring rubric to each one within the same context window.
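For concreteness, below is a minimal sketch of how such a batched annotation request might be assembled. The client call, the placeholder rubric, and the `annotate_batch` helper are all hypothetical illustrations; the actual annotation prompt is the one shown in Figure 3.

```python
# Hypothetical sketch of batched preference annotation; the real prompt is in Figure 3.
# All candidate responses for one input are scored in a single GPT-4 request,
# instead of issuing one request per pair of candidates.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = "Score each response from 1 to 10 for helpfulness, correctness, and clarity."  # placeholder rubric

def annotate_batch(question: str, candidates: list[str], model: str = "gpt-4") -> str:
    # Show all candidates side-by-side so the rubric is applied to each one
    # within the same context window.
    numbered = "\n\n".join(f"Response {i + 1}:\n{c}" for i, c in enumerate(candidates))
    prompt = f"{RUBRIC}\n\nQuestion:\n{question}\n\n{numbered}\n\nReturn one score per response."
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return completion.choices[0].message.content
```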
Cost Analysis: We also provide a brief cost analysis of the scaled-up experiment on 600k training inputs. The major line items are the cost of sampling outputs, annotating them with GPT-4 to construct training pairs, and then training the next iteration on those pairs. For each of the six iterations:
- Sampling: it took about 18-24 hours to sample 5 outputs for each of the 100k examples on 10 8xA100 80GB pods, depending on the average length, costing about $6,000 based on spot pricing.
- Annotation: the average number of prompt tokens sent to GPT-4 for annotation across iterations was about 450M, with an average of about 60M completion tokens, amounting to about $34,000 based on the version of the endpoint we were using (a back-of-the-envelope check of this figure follows the list).
- Training: ironically, training was the cheapest step, taking only 12-24 hours on two 8xA100 80GB nodes.
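As a rough sanity check on the annotation figure, the sketch below recomputes the per-iteration cost from the token counts quoted above. The per-1K-token rates are illustrative assumptions only, not the actual pricing of the endpoint used here.

```python
# Back-of-the-envelope check of the per-iteration GPT-4 annotation cost.
# The per-token rates below are illustrative assumptions; the actual endpoint
# pricing may differ.
PROMPT_TOKENS = 450e6          # ~450M prompt tokens per iteration (from the text)
COMPLETION_TOKENS = 60e6       # ~60M completion tokens per iteration (from the text)
PROMPT_RATE = 0.06 / 1000      # assumed $ per prompt token
COMPLETION_RATE = 0.12 / 1000  # assumed $ per completion token

cost = PROMPT_TOKENS * PROMPT_RATE + COMPLETION_TOKENS * COMPLETION_RATE
print(f"Estimated annotation cost per iteration: ${cost:,.0f}")  # roughly $34,000
```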
Table 5: Outputs of the various DNO models across iterations for the question “When will the earth run out of fresh water?”. The output for Iter-1 is a bit too long, as shown in Table 2. We believe this could be addressed with better hyperparameter tuning or preference data. We find the initial SFT model’s response to be missing important points about how the premise of the question is best addressed by highlighting access to fresh water. The last response, for Iter 3, is more informative and specific than the initial SFT response.
Table 6: Outputs of the various DNO models across iterations for an interview question relating to the design of a URL lookup system. The last response, from Iter 3, is more informative and clearer, and does not contain misleading information (searching a trie runs in time linear in the length of the string, not constant). The response for Iter 1 contains an implementation of a Trie, which may be unnecessary because the user did not ask for one.
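To make the complexity point concrete, here is a minimal, hypothetical trie lookup, not taken from any of the model outputs in Table 6: the lookup walks one node per character, so it runs in time linear in the length of the query string rather than constant time.

```python
# Minimal trie lookup, included only to illustrate the complexity point above;
# it is not excerpted from any model output in Table 6.
class TrieNode:
    def __init__(self):
        self.children = {}   # maps a character to the child TrieNode
        self.is_end = False  # True if a stored string ends at this node

def lookup(root: TrieNode, url: str) -> bool:
    # One step per character: O(len(url)) time, i.e., linear in the string
    # length, not O(1).
    node = root
    for ch in url:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_end
```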
Table 7: Outputs of the first and last DNO-More-Data iterations for the third example, “What factors led to the outbreak of WW1?”. The last response, from Iter 6, has a higher information density; it recalls more key facts and entities, including the specific date of the start of WW1, the Triple Entente, the Triple Alliance, Gavrilo Princip, and the Black Hand. Iteration 1 also contains some of this information but is too wordy. The initial SFT model’s response seems much more superficial.
This paper is available on arxiv under CC BY 4.0 DEED license.