0f Divergence

$$ \nonumber \newcommand{aster}{*} \newcommand{exist}{\exists} \newcommand{B}{\mathbb B} \newcommand{C}{\mathbb C} \newcommand{I}{\mathbb I} \newcommand{N}{\mathbb N} \newcommand{Q}{\mathbb Q} \newcommand{R}{\mathbb R} \newcommand{Z}{\mathbb Z} \newcommand{eR}{\overline {\mathbb R}} \newcommand{cD}{ {\mathbb D}} \newcommand{dD}{ {\part \mathbb D}} \newcommand{dH}{ {\part \mathbb H}} \newcommand{eC}{\overline {\mathbb C}} \newcommand{A}{\mathcal A} \newcommand{D}{\mathcal D} \newcommand{E}{\mathcal E} \newcommand{F}{\mathcal F} \newcommand{G}{\mathcal G} \newcommand{H}{\mathcal H} \newcommand{J}{\mathcal J} \newcommand{L}{\mathcal L} \newcommand{U}{\mathcal U} \newcommand{M}{\mathcal M} \newcommand{O}{\mathcal O} \newcommand{P}{\mathcal P} \newcommand{S}{\mathcal S} \newcommand{T}{\mathcal T} \newcommand{V}{\mathcal V} \newcommand{W}{\mathcal W} \newcommand{X}{\mathcal X} \newcommand{Y}{\mathcal Y} \newcommand{bE}{\symbf E} \newcommand{bF}{\symbf F} \newcommand{bD}{\symbf D} \newcommand{bI}{\symbf I} \newcommand{bX}{\symbf X} \newcommand{bY}{\symbf Y} \newcommand{nz}{\mathcal Z} \newcommand{bT}{\mathbb T} \newcommand{bB}{\mathbb B} \newcommand{bS}{\mathbb S} \newcommand{bA}{\mathbb A} \newcommand{bL}{\mathbb L} \newcommand{bP}{\symbf P} \newcommand{bM}{\symbf M} \newcommand{bH}{\mathbb H} \newcommand{dd}{\mathrm d} \newcommand{Mu}{\mathup M} \newcommand{Tau}{\mathup T} \newcommand{ae}{\operatorname{a.e.}} \newcommand{aut}{\operatorname{aut}} \newcommand{adj}{\operatorname{adj}} \newcommand{char}{\operatorname{char}} \newcommand{cov}{\operatorname{Cov}} \newcommand{cl}{\operatorname{cl}} \newcommand{cont}{\operatorname{cont}} \newcommand{e}{\mathbb E} \newcommand{pp}{\operatorname{primitive}} \newcommand{dist}{\operatorname{dist}} \newcommand{diam}{\operatorname{diam}} \newcommand{fp}{\operatorname{Fp}} \newcommand{from}{\leftarrow} \newcommand{Gal}{\operatorname{Gal}} \newcommand{GCD}{\operatorname{GCD}} \newcommand{LCM}{\operatorname{LCM}} 
\newcommand{fg}{\mathrm{fg}} \newcommand{gf}{\mathrm{gf}} \newcommand{im}{\operatorname{Im}} \newcommand{image}{\operatorname{image}} \newcommand{inj}{\hookrightarrow} \newcommand{irr}{\operatorname{irr}} \newcommand{lcm}{\operatorname{lcm}} \newcommand{ltrieq}{\mathrel{\unlhd}} \newcommand{ltri}{\mathrel{\lhd}} \newcommand{loc}{ {\operatorname{loc}}} \newcommand{null}{\operatorname{null}} \newcommand{part}{\partial} \newcommand{pf}{\operatorname{Pf}} \newcommand{pv}{\operatorname{Pv}} \newcommand{rank}{\operatorname{rank}} \newcommand{range}{\operatorname{range}} \newcommand{re}{\operatorname{Re}} \newcommand{span}{\operatorname{span}} \newcommand{su}{\operatorname{supp}} \newcommand{sgn}{\operatorname{sgn}} \newcommand{syn}{\operatorname{syn}} \newcommand{var}{\operatorname{Var}} \newcommand{res}{\operatorname{Res}} \newcommand{data}{\operatorname{data}} \newcommand{erfc}{\operatorname{erfc}} \newcommand{erfcx}{\operatorname{erfcx}} \newcommand{tr}{\operatorname{tr}} \newcommand{col}{\operatorname{Col}} \newcommand{row}{\operatorname{Row}} \newcommand{sol}{\operatorname{Sol}} \newcommand{lub}{\operatorname{lub}} \newcommand{glb}{\operatorname{glb}} \newcommand{ltrieq}{\mathrel{\unlhd}} \newcommand{ltri}{\mathrel{\lhd}} \newcommand{lr}{\leftrightarrow} \newcommand{phat}{^\widehat{\,\,\,}} \newcommand{what}{\widehat} \newcommand{wbar}{\overline} \newcommand{wtilde}{\widetilde} \newcommand{iid}{\operatorname{i.i.d.}} \newcommand{Exp}{\operatorname{Exp}} \newcommand{abs}[1]{\left| {#1}\right|} \newcommand{d}[2]{D_{\text{KL}}\left (#1\middle\| #2\right)} \newcommand{n}[1]{\|#1\|} \newcommand{norm}[1]{\left\|{#1}\right\|} \newcommand{pd}[2]{\left \langle {#1},{#2} \right \rangle} \newcommand{argmax}[1]{\underset{#1}{\operatorname{argmax}}} \newcommand{argmin}[1]{\underset{#1}{\operatorname{argmin}}} \newcommand{p}[1]{\left({#1}\right)} \newcommand{c}[1]{\left \{ {#1}\right\}} \newcommand{s}[1]{\left [{#1}\right]} \newcommand{a}[1]{\left \langle{#1}\right\rangle} 
\newcommand{cc}[2]{\left(\begin{array}{c} #1 \\ #2 \end{array}\right)} \newcommand{f}{\mathfrak F} \newcommand{fi}{\mathfrak F^{-1}} \newcommand{Fi}{\mathcal F^{-1}} \newcommand{l}{\mathfrak L} \newcommand{li}{\mathfrak L^{-1}} \newcommand{Li}{\mathcal L^{-1}} \newcommand{const}{\text{const.}} \newcommand{Int}{\operatorname{Int}} \newcommand{Ext}{\operatorname{Ext}} \newcommand{Bd}{\operatorname{Bd}} \newcommand{Cl}{\operatorname{Cl}} \newcommand{Iso}{\operatorname{Iso}} \newcommand{Lim}{\operatorname{Lim}} \newcommand{src}{\operatorname{src}} \newcommand{tgt}{\operatorname{tgt}} \newcommand{input}{\operatorname{input}} \newcommand{output}{\operatorname{output}} \newcommand{weight}{\operatorname{weight}} \newcommand{paths}{\operatorname{paths}} \newcommand{init}{\bullet} \newcommand{fin}{\circledcirc} \newcommand{advance}{\operatorname{advance}} \newcommand{di}[2]{\frac{\part}{\part {#1}^{#2}}} $$

Divergences #

Suppose $P, Q$ are two probability measures on a measurable space $(\Omega, \mathcal A)$.

Integral probability metrics #

Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Schölkopf, B., & Lanckriet, G. R. (2009). On integral probability metrics, φ-divergences and binary classification. arXiv: Information Theory.

Suppose $\F \subseteq \L(\Omega \to \R)$ is a set of real-valued bounded measurable functions on $\Omega$.

The integral probability metric (IPM) between $P$ and $Q$ defined by $\F$ is $$ D _ \F (P \Vert Q) := \sup _ {f \in \F} \abs{\int _ {\Omega} f(\omega) \dd P(\omega) - \int _ {\Omega} f(\omega) \dd Q(\omega)} $$
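For intuition, the supremum can be evaluated exactly when $\Omega$ is finite and $\F$ is a finite class. A minimal sketch; the distributions and function class below are made-up illustrations, not from the reference above.

```python
import numpy as np

# Two made-up discrete distributions on a 4-point sample space.
p = np.array([0.1, 0.4, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.25, 0.25])

def ipm(p, q, function_class):
    """IPM: sup over f in the class of |E_P[f] - E_Q[f]|."""
    return max(abs(np.dot(f, p) - np.dot(f, q)) for f in function_class)

# A small finite class of bounded functions on {0, 1, 2, 3} (hypothetical).
F = [np.array(fv) for fv in [(0, 1, 0, 1), (1, 1, 0, 0), (0.5, -0.5, 0.5, -0.5)]]
print(ipm(p, q, F))
```

Enlarging $\F$ can only increase $D _ \F$, since the supremum is taken over a larger set.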

Total variation distance #

Let $\F \subseteq \L(\Omega \to \R)$ be the set of all indicator functions $1 _ A$ of measurable sets $A \in \A$. The integral probability metric defined by this $\F$ is called the total variation distance.
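On a finite $\Omega$ the supremum over indicator functions can be brute-forced over all $2^{|\Omega|}$ events, which recovers the familiar closed form $\frac{1}{2}\Vert p - q \Vert _ 1$. A sketch with made-up distributions:

```python
import numpy as np
from itertools import product

p = np.array([0.1, 0.4, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.25, 0.25])

# Sup over all indicator functions = sup over all events A of |P(A) - Q(A)|.
tv = max(abs(np.dot(ind, p - q)) for ind in product([0, 1], repeat=len(p)))

# Equivalent closed form: half the L1 distance between the mass functions.
print(tv, 0.5 * np.abs(p - q).sum())
```

The maximizing event is $A = \{x : p(x) > q(x)\}$, which is why the two expressions agree.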

f-divergence #

https://en.wikipedia.org/wiki/F-divergence

To define the $f$-divergence, we require $P \ll Q$. By the Radon-Nikodym theorem, there then exists a ($Q$-almost-everywhere unique) density $\dd P / \dd Q \in \L(\Omega \to [0, \infty))$.

Suppose $f: [0, \infty] \to (-\infty, +\infty]$ is a convex function that is finite on $(0, \infty)$ and satisfies $f(1) = 0$.

Define the $f$-divergence $D _ f$ between $P$ and $Q$ as follows. $$ D _ f(P \Vert Q):= \int _ \Omega f\p{\frac{\dd P}{ \dd Q}} \dd Q $$

  • By Jensen's inequality, all $f$-divergences are non-negative, using $E _ Q\s{\frac{\dd P}{\dd Q}} = P(\Omega) = 1$. $$ D _ f(P \Vert Q) = E _ Q\s{f\p{\frac{\dd P}{\dd Q}}} \ge f\p{E _ Q \s{\frac{\dd P}{\dd Q}}} = f(1) = 0 $$

Suppose $(\Omega, \A, \mu)$ is a reference measure space, and $P, Q$ have densities $\dd P = p \dd \mu$ and $\dd Q = q \dd \mu$. We also write $$ D _ f(p \Vert q) := \int _ {\Omega} f\p{\frac{p(x)}{q(x)}} q(x) \dd \mu(x) $$

  • The two definitions agree, since $\dd P / \dd Q = p / q$ holds $Q$-almost everywhere.
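
A minimal sketch of the density form for finite $\Omega$ (with $q > 0$ everywhere, so $P \ll Q$ holds automatically). Choosing $f(x) = x \ln x$ recovers KL and $f(x) = \frac{1}{2}|x - 1|$ recovers total variation; the distributions are made-up examples.

```python
import numpy as np

def f_divergence(f, p, q):
    """D_f(p || q) = sum_x q(x) * f(p(x)/q(x)) for discrete densities with q > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(q * f(p / q)))

p = np.array([0.1, 0.4, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.25, 0.25])

kl = f_divergence(lambda x: x * np.log(x), p, q)        # f(x) = x ln x    -> KL(p || q)
tv = f_divergence(lambda x: 0.5 * np.abs(x - 1), p, q)  # f(x) = |x - 1|/2 -> total variation
print(kl, tv)
```

Both values are non-negative, as guaranteed by the Jensen argument above.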

Suppose $X, Y$ are two random variables into the same measurable space $(\Omega, \A)$, with distributions $P _ X, P _ Y$. We write $D _ f(X \Vert Y) := D _ f(P _ X \Vert P _ Y)$.

KL-divergence #

The KL-divergence is a special $f$-divergence.

$f(x) = x\ln (x)$ gives the KL-divergence, and $g(x) = -\ln(x)$ gives the reverse KL-divergence. $$ D _ {\mathup{KL}}(p \Vert q) = \int p(x) \ln \frac{p(x)}{q(x)} \dd \mu(x) = - \int \ln \frac{q(x)}{p(x)} p(x) \dd \mu (x) = D _ {-\mathup{KL}}(q \Vert p) $$ Suppose $p _ * (x)$ is a data density on $\Omega$, and $p _ \theta(x)$ is a density generated by a statistical model.

  • $\d{p _ * (x)}{p _ \theta(x)}$ is known as the forward KL.
    • Minimizing the forward KL is equivalent to maximizing the log-likelihood, since $$ \d{p _ * }{p _ \theta} = \int p _ * (x) \log \frac{p _ * (x)}{p _ \theta(x)} \dd x = -H(p _ * ) - \int p _ * (x) \log p _ \theta(x) \dd x $$ and the entropy $H(p _ * )$ does not depend on $\theta$.
  • $\d{p _ \theta(x)}{p _ * (x)}$ is known as the backward KL.
    • Minimizing the backward KL is known to make $p _ \theta$ concentrate on a particular mode of $p _ * $ (mode-seeking behavior).
    • Suppose we know $p _ * (x)$ is of the form $p _ * (x) = e^{-U(x)} / Z$. Then for $X \sim p _ \theta(x)$, $$ \d{p _ \theta}{p _ * } = E\s{\log p _ \theta(X)} + E [U(X)] + \log Z $$
      • The backward KL can therefore be optimized with only the unnormalized density $e^{-U(x)}$, since $\log Z$ does not depend on $\theta$.
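
A numeric check of this decomposition on a finite sample space, with a made-up energy $U$ and model $p _ \theta$: the direct backward KL and the three-term expression agree, and only the first two terms involve $p _ \theta$.

```python
import numpy as np

# Made-up discrete target, known only up to normalization: p_*(x) ∝ exp(-U(x)).
U = np.array([0.0, 1.0, 2.0, 3.0])
Z = np.exp(-U).sum()
p_star = np.exp(-U) / Z

# A made-up model distribution p_theta on the same 4 points.
p_theta = np.array([0.4, 0.3, 0.2, 0.1])

# Direct backward KL: KL(p_theta || p_star).
kl = float(np.sum(p_theta * np.log(p_theta / p_star)))

# Decomposition E[log p_theta(X)] + E[U(X)] + log Z with X ~ p_theta;
# log Z is a constant in theta, so it can be dropped during optimization.
decomposed = float(np.sum(p_theta * np.log(p_theta)) + np.sum(p_theta * U) + np.log(Z))
print(kl, decomposed)
```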

Log-sum inequality #

Suppose $I$ is countable and $(a _ k) _ {k \in I}$, $(b _ k) _ {k \in I}$ are nonnegative real numbers.

Suppose $\sum _ {k \in I} a _ k = a \in (0, \infty)$ and $\sum _ {k \in I} b _ k = b \in (0, \infty)$. Then $$ \sum _ {k \in I} a _ k \log \frac{a _ k}{b _ k} \ge a \log \frac{a} {b} $$ Define $p _ k := a _ k / a$ and $q _ k := b _ k / b$. Then the inequality is equivalent to $\d{p}{q} \ge 0$.
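A quick numeric check of the inequality with made-up coefficients:

```python
import numpy as np

a = np.array([0.3, 1.2, 0.5])
b = np.array([0.7, 0.4, 0.9])

# Left side: sum_k a_k log(a_k / b_k); right side: a log(a / b).
lhs = float(np.sum(a * np.log(a / b)))
rhs = float(a.sum() * np.log(a.sum() / b.sum()))
print(lhs, rhs)  # lhs >= rhs
```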