\newcommand{B}{\mathbb B}
\newcommand{C}{\mathbb C}
\newcommand{I}{\mathbb I}
\newcommand{N}{\mathbb N}
\newcommand{Q}{\mathbb Q}
\newcommand{R}{\mathbb R}
\newcommand{Z}{\mathbb Z}
\newcommand{eR}{\overline {\mathbb R}}
\newcommand{cD}{ {\mathbb D}}
\newcommand{dD}{ {\part \mathbb D}}
\newcommand{dH}{ {\part \mathbb H}}
\newcommand{eC}{\overline {\mathbb C}}
\newcommand{A}{\mathcal A}
\newcommand{D}{\mathcal D}
\newcommand{E}{\mathcal E}
\newcommand{F}{\mathcal F}
\newcommand{G}{\mathcal G}
\newcommand{H}{\mathcal H}
\newcommand{J}{\mathcal J}
\newcommand{L}{\mathcal L}
\newcommand{U}{\mathcal U}
\newcommand{M}{\mathcal M}
\newcommand{O}{\mathcal O}
\newcommand{P}{\mathcal P}
\newcommand{S}{\mathcal S}
\newcommand{T}{\mathcal T}
\newcommand{V}{\mathcal V}
\newcommand{W}{\mathcal W}
\newcommand{X}{\mathcal X}
\newcommand{Y}{\mathcal Y}
\newcommand{bE}{\symbf E}
\newcommand{bF}{\symbf F}
\newcommand{bD}{\symbf D}
\newcommand{bI}{\symbf I}
\newcommand{bX}{\symbf X}
\newcommand{bY}{\symbf Y}
\newcommand{nz}{\mathcal Z}
\newcommand{bT}{\mathbb T}
\newcommand{bB}{\mathbb B}
\newcommand{bS}{\mathbb S}
\newcommand{bA}{\mathbb A}
\newcommand{bL}{\mathbb L}
\newcommand{bP}{\symbf P}
\newcommand{bM}{\symbf M}
\newcommand{bH}{\mathbb H}
\newcommand{dd}{\mathrm d}
\newcommand{Mu}{\mathup M}
\newcommand{Tau}{\mathup T}
\newcommand{e}{\mathbb E}
\newcommand{loc}{ {\operatorname{loc}}}
\newcommand{abs}[1]{\left| {#1}\right|}
\newcommand{d}[2]{D_{\text{KL}}\left (#1\middle\| #2\right)}
\newcommand{pd}[2]{\left \langle {#1},{#2} \right \rangle}
\newcommand{c}[1]{\left \{ {#1}\right\}}
\newcommand{s}[1]{\left [{#1}\right]}
\newcommand{a}[1]{\left \langle{#1}\right\rangle}
\newcommand{cc}[2]{\left(\begin{array}{c} #1 \\ #2 \end{array}\right)}
\newcommand{f}{\mathfrak F}
\newcommand{fi}{\mathfrak F^{-1}}
\newcommand{Fi}{\mathcal F^{-1}}
\newcommand{l}{\mathfrak L}
\newcommand{li}{\mathfrak L^{-1}}
\newcommand{Li}{\mathcal L^{-1}}
\newcommand{di}[2]{\frac{\part}{\part {#1}^{#2}}}
Self-Play Fine-Tuning
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
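Suppose $f(x, y)$ is some score model, and let $Y \sim \pi _ \theta(\cdot | X)$ be a response sampled from the policy $\pi _ \theta$. With a reference policy $\pi _ \phi$ and regularization weight $\gamma > 0$, consider the KL-regularized loss: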
$$
\L = -E\s{f(X, Y)} + \gamma E\s{\d{\pi _ \theta(\cdot | X)}{\pi _ {\phi}(\cdot | X)}}
$$
Now transform the loss function:
$$
\begin{aligned}
-\L & = E \s{f(X, Y) - \gamma \log \frac{\pi _ \theta(Y|X)}{\pi _ \phi(Y | X)}}\\
& = -\gamma E \s{-\frac{1}{\gamma}f(X, Y) + \log \pi _ \theta(Y | X) - \log \pi _ \phi(Y | X)}\\
& = - \gamma E \s{\log \pi _ \theta (Y | X) - \left(\log \pi _ \phi(Y | X) + \gamma^{-1}f(X, Y) \right)}
\end{aligned}
$$
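The minimizer can be read off by completing the KL divergence directly from the definition of the loss. The following is a sketch under two assumptions not stated above: $y$ ranges over a discrete set, so the normalizer is a sum, and $Z(x)$ is finite. The symbols $\pi _ *$ and $Z(x)$ are introduced here to name the tilted reference policy and its normalizing constant.

$$
\begin{aligned}
\pi _ * (y | x) & = \frac{\pi _ \phi(y | x) \exp\left(\gamma^{-1} f(x, y)\right)}{Z(x)}, \qquad Z(x) = \sum _ y \pi _ \phi(y | x) \exp\left(\gamma^{-1} f(x, y)\right)\\
\L & = \gamma E \s{\log \pi _ \theta(Y | X) - \log \pi _ \phi(Y | X) - \gamma^{-1} f(X, Y)}\\
& = \gamma E \s{\log \pi _ \theta(Y | X) - \log \pi _ * (Y | X)} - \gamma E \s{\log Z(X)}\\
& = \gamma E \s{\d{\pi _ \theta(\cdot | X)}{\pi _ * (\cdot | X)}} - \gamma E \s{\log Z(X)}
\end{aligned}
$$

Since $Z(X)$ does not depend on $\theta$ and the KL divergence is nonnegative, the loss is minimized when $\pi _ \theta(\cdot | x) = \pi _ * (\cdot | x)$.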
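Again, the optimal policy $\pi _ * (y | x)$ and the optimal score $f(x, y)$ determine each other, up to an additive term $\gamma \log Z(x)$ that depends only on $x$, where $Z(x)$ is the normalizing constant of $\pi _ * (\cdot | x)$: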
$$
\pi _ * (y | x) = \frac{\pi _ \phi(y | x) \exp\left(\gamma^{-1} f(x, y)\right)}{Z(x)} \quad \implies \quad f(x, y) = \gamma \log\frac{\pi _ * (y | x)}{\pi _ \phi(y | x)} + \gamma \log Z(x)
$$
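Because the term $\gamma \log Z(X)$ cancels in the difference $f(X, Y _ w) - f(X, Y)$, we can learn the optimal score by minimizing a pairwise loss $\ell$, where $Y _ w$ denotes the preferred response: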
$$
E\s{\ell \left(f(X, Y _ w) - f(X, Y)\right)} = E \s{\ell \c{\gamma \log \frac{\pi _ * (Y _ w | X)}{\pi _ \phi(Y _ w|X)} - \gamma \log \frac{\pi _ * (Y | X)}{\pi _ \phi(Y | X)}}}
$$
This is analogous to DPO, but the derivation starts from optimizing an integral probability metric (IPM).
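To make the comparison concrete, here is a sketch that assumes the logistic loss $\ell(t) = \log\left(1 + e^{-t}\right) = -\log \sigma(t)$, with $\sigma$ the sigmoid function. This choice of $\ell$ is an assumption for illustration and is not fixed by the derivation above.

$$
E\s{\ell \left(f(X, Y _ w) - f(X, Y)\right)} = -E \s{\log \sigma \left(\gamma \log \frac{\pi _ * (Y _ w | X)}{\pi _ \phi(Y _ w|X)} - \gamma \log \frac{\pi _ * (Y | X)}{\pi _ \phi(Y | X)}\right)}
$$

Under this assumption the objective has the same form as the DPO loss, with $\gamma$ in the role of the DPO temperature, $Y _ w$ as the chosen response, and $Y$ as the rejected response.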