How Contrastive Learning Helps AI Improve Itself

by Language Models (dot tech), 3 min read, 2025/04/16

Too Long; Didn't Read

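This section introduces DNO-Prct, a practical and scalable implementation of Direct Nash Optimization. DNO-Prct uses contrastive learning similar to DPO, but is designed for batched on-policy training with general preferences. By using the reward signal only implicitly and structuring pairwise comparisons carefully, DNO-Prct enables efficient self-improvement and approaches the Nash equilibrium of complex AI preference models.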

Authors:

(1) Corby Rosset, Microsoft Research and Correspondence to [email protected];

(2) Ching-An Cheng, Microsoft Research;

(3) Arindam Mitra, Microsoft Research;

(4) Michael Santacroce, Microsoft Research;

(5) Ahmed Awadallah, Microsoft Research and Correspondence to [email protected];

(6) Tengyang Xie, Microsoft Research and Correspondence to [email protected].

Table of Links

Abstract and 1 Introduction
2 Preliminaries
2.1 RLHF Based on Reward Models
2.2 RLHF with General Preferences
3 Direct Nash Optimization and 3.1 Derivation of Algorithm 1
3.2 Theoretical Analysis
4 Practical Algorithm – Iterative Contrastive Self-Improvement
5 Experiments and 5.1 Experimental Setup
5.2 Results and Analysis
6 Related Work
7 Conclusion and References

Appendix
A Extension to Regularized Preferences
B Detailed Proofs
C Additional Experimental Details

4 Practical Algorithm – Iterative Contrastive Self-Improvement

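In this section, we shift our focus to the algorithmic design of a practically scalable version of DNO, following the principles discussed in the previous section. A primary challenge in implementing the conceptual algorithm DNO (Algorithm 1) stems from the need to compute the expectation of the preference function P under the current policy πt.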
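To make the challenge concrete, the expectation of P under the current policy can be approximated by sampling. The following is a minimal sketch, not the paper's implementation: the response set, the policy probabilities, the `QUALITY` table, and every function name are hypothetical, chosen only so that the Monte Carlo estimate can be compared with the exact value on a toy problem.

```python
import random

def estimate_expected_win_rate(y, sample_from_policy, preference,
                               num_samples=1000, seed=0):
    """Monte Carlo estimate of E_{y' ~ pi_t}[P(y is preferred over y')]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        y_prime = sample_from_policy(rng)   # draw an opponent from pi_t
        total += preference(y, y_prime)     # probability that y wins
    return total / num_samples

# Toy setup (all values are illustrative assumptions).
RESPONSES = ["a", "b", "c"]
POLICY_PROBS = [0.5, 0.3, 0.2]              # stands in for pi_t(. | x)
QUALITY = {"a": 0.0, "b": 1.0, "c": 2.0}    # hidden quality used by the toy P

def toy_sampler(rng):
    """Sample one response from the toy policy."""
    return rng.choices(RESPONSES, weights=POLICY_PROBS, k=1)[0]

def toy_preference(y, y_prime):
    """Toy preference function: higher quality wins, ties count as 0.5."""
    if QUALITY[y] > QUALITY[y_prime]:
        return 1.0
    if QUALITY[y] < QUALITY[y_prime]:
        return 0.0
    return 0.5

def exact_expected_win_rate(y):
    """Exact expectation, feasible only because the toy response set is tiny."""
    return sum(p * toy_preference(y, y_prime)
               for y_prime, p in zip(RESPONSES, POLICY_PROBS))
```

For a real language model the response space is far too large to enumerate, which is why only the sampled estimate is available in practice and why each estimate costs many calls to the preference function.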

We present the practical implementation of DNO in Algorithm 2 (DNO-Prct), a batched on-policy algorithm that conducts self-improvement iteratively via contrastive learning. One key consideration in our algorithmic design is that we only need to use the reward function rt implicitly. This follows from the specifically designed on-policy sampling, data filtering, and pair construction. While these design choices make DNO-Prct seem similar to simply performing DPO iteratively, there are significant reasons behind them, as we discuss below.
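The filtering and pair-construction steps can be illustrated with a small sketch. It assumes that each sampled response already carries a scalar score (for example, an estimated win rate) and that pairs are kept only when the score gap exceeds a margin. Both the scoring and the `margin` threshold are assumptions made for illustration; the text above names only the three steps and does not give these details.

```python
from itertools import combinations

def build_preference_pairs(candidates, margin=0.3):
    """Turn scored on-policy samples into (chosen, rejected) pairs.

    candidates: list of (response_text, score) tuples for one prompt.
    margin: hypothetical minimum score gap for a pair to be kept.
    """
    pairs = []
    for (resp_a, score_a), (resp_b, score_b) in combinations(candidates, 2):
        if abs(score_a - score_b) < margin:
            continue  # filter out pairs whose preference is too close to call
        if score_a > score_b:
            pairs.append((resp_a, resp_b))
        else:
            pairs.append((resp_b, resp_a))
    return pairs

# Example with made-up scores for three sampled responses.
samples = [("r1", 0.9), ("r2", 0.5), ("r3", 0.45)]
print(build_preference_pairs(samples))  # [('r1', 'r2'), ('r1', 'r3')]
```

Only the ordering of each kept pair reaches the contrastive loss, so the numeric scores are never regressed on directly. That is one way to read the statement that the reward function is used implicitly.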

Relationship between DNO-Prct and DPO. The reader may recognize that DNO-Prct (Algorithm 2), the practical implementation of DNO, can be described as an iterative version of the DPO algorithm. DNO-Prct is designed to harness the simplicity and efficiency of DPO (Rafailov et al., 2023) and to build on empirical advances from recent work that applies DPO iteratively (e.g., Yuan et al., 2024; Tran et al., 2024). Our experiments point to the importance of several design choices that help accommodate general preferences. More interestingly, our findings suggest that "a meticulously designed iterative DPO algorithm" can approach the Nash equilibrium of any given general preference.
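For reference, the per-pair contrastive objective of DPO (Rafailov et al., 2023) can be written in a few lines. This is a sketch of the standard DPO loss on scalar log-probabilities, not the training code used in the paper. The function name and the default `beta` are illustrative. Treating the current policy πt as the reference model at each iteration is an assumption about how an iterative variant would be arranged.

```python
import math

def dpo_pair_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the trained policy and under
    the reference policy. Returns -log(sigmoid(beta * log-ratio gap)).
    """
    logits = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically stable evaluation of log(1 + exp(-logits)).
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy equals the reference, the loss is log 2. It decreases as the policy raises the log-probability of the chosen response relative to the rejected one.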

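Our general algorithmic framework, DNO (Algorithm 1), is broader than and fundamentally different from iterative DPO. For example, the DNO framework can also be extended directly to the regularized preference case (as discussed in Appendix A) or equipped with other advanced sampling techniques (e.g., Liu et al., 2024b, RSO).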

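For DNO, KL regularization originates from online learning: no-regret learning through mirror descent (Nemirovskij and Yudin, 1983) or follow-the-regularized-leader (FTRL) (Kalai and Vempala, 2005; Cesa-Bianchi and Lugosi, 2006; Shalev-Shwartz et al., 2012; Hazan et al., 2016). For DPO and PPO, KL regularization is an approximation of the total variation penalty used to ensure monotonic improvement of the policy (Kakade and Langford, 2002; Schulman et al., 2015). Later, this approach was simplified by Schulman et al. (2017, PPO) and recently used for post-training LLMs (Ouyang et al., ...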
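The online-learning view of KL regularization can be illustrated with the closed-form mirror descent step over a finite set of responses: with a KL (entropic) regularizer, the update reweights the current distribution by exponentiated rewards and renormalizes. This is a sketch of that standard update on a toy distribution. The names `probs`, `rewards`, and `eta` are illustrative and not taken from the text.

```python
import math

def kl_regularized_update(probs, rewards, eta=1.0):
    """One mirror descent step with a KL regularizer (exponential weights).

    Computes new_probs[i] proportional to probs[i] * exp(rewards[i] / eta).
    A larger eta keeps the new distribution closer to the old one.
    """
    weights = [p * math.exp(r / eta) for p, r in zip(probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

# Toy example: two responses, the first has the higher reward.
print(kl_regularized_update([0.5, 0.5], [1.0, 0.0]))
```

The step moves probability mass toward the higher-reward response while staying anchored to the previous distribution, which is the role the KL term plays in this family of updates.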

This paper is available on arxiv under CC BY 4.0 DEED license.
