\newcommand{B}{\mathbb B}
\newcommand{C}{\mathbb C}
\newcommand{I}{\mathbb I}
\newcommand{N}{\mathbb N}
\newcommand{Q}{\mathbb Q}
\newcommand{R}{\mathbb R}
\newcommand{Z}{\mathbb Z}
\newcommand{eR}{\overline {\mathbb R}}
\newcommand{cD}{ {\mathbb D}}
\newcommand{dD}{ {\part \mathbb D}}
\newcommand{dH}{ {\part \mathbb H}}
\newcommand{eC}{\overline {\mathbb C}}
\newcommand{A}{\mathcal A}
\newcommand{D}{\mathcal D}
\newcommand{E}{\mathcal E}
\newcommand{F}{\mathcal F}
\newcommand{G}{\mathcal G}
\newcommand{H}{\mathcal H}
\newcommand{J}{\mathcal J}
\newcommand{L}{\mathcal L}
\newcommand{U}{\mathcal U}
\newcommand{M}{\mathcal M}
\newcommand{O}{\mathcal O}
\newcommand{P}{\mathcal P}
\newcommand{S}{\mathcal S}
\newcommand{T}{\mathcal T}
\newcommand{V}{\mathcal V}
\newcommand{W}{\mathcal W}
\newcommand{X}{\mathcal X}
\newcommand{Y}{\mathcal Y}
\newcommand{bE}{\symbf E}
\newcommand{bF}{\symbf F}
\newcommand{bD}{\symbf D}
\newcommand{bI}{\symbf I}
\newcommand{bX}{\symbf X}
\newcommand{bY}{\symbf Y}
\newcommand{nz}{\mathcal Z}
\newcommand{bT}{\mathbb T}
\newcommand{bB}{\mathbb B}
\newcommand{bS}{\mathbb S}
\newcommand{bA}{\mathbb A}
\newcommand{bL}{\mathbb L}
\newcommand{bP}{\symbf P}
\newcommand{bM}{\symbf M}
\newcommand{bH}{\mathbb H}
\newcommand{dd}{\mathrm d}
\newcommand{Mu}{\mathup M}
\newcommand{Tau}{\mathup T}
\newcommand{e}{\mathbb E}
\newcommand{loc}{ {\operatorname{loc}}}
\newcommand{abs}[1]{\left| {#1}\right|}
\newcommand{d}[2]{D_{\text{KL}}\left (#1\middle\| #2\right)}
\newcommand{pd}[2]{\left \langle {#1},{#2} \right \rangle}
\newcommand{c}[1]{\left \{ {#1}\right\}}
\newcommand{s}[1]{\left [{#1}\right]}
\newcommand{a}[1]{\left \langle{#1}\right\rangle}
\newcommand{cc}[2]{\left(\begin{array}{c} #1 \\ #2 \end{array}\right)}
\newcommand{f}{\mathfrak F}
\newcommand{fi}{\mathfrak F^{-1}}
\newcommand{Fi}{\mathcal F^{-1}}
\newcommand{l}{\mathfrak L}
\newcommand{li}{\mathfrak L^{-1}}
\newcommand{Li}{\mathcal L^{-1}}
\newcommand{di}[2]{\frac{\part}{\part {#1}^{#2}}}
Let $P, Q$ be two probability measures on a measurable space $(\Omega, \mathcal A)$.
Integral probability metrics
Sriperumbudur, B.K., Fukumizu, K., Gretton, A., Schölkopf, B., & Lanckriet, G.R. (2009). On integral probability metrics, φ-divergences and binary classification. arXiv: Information Theory.
Suppose $\F \subseteq \L(\Omega \to \R)$ is a set of real-valued bounded measurable functions on $\Omega$.
The integral probability metric (IPM) $D _ \F(P \Vert Q)$ between $P$ and $Q$ defined by $\F$ is
D _ \F (P \Vert Q) := \sup _ {f \in \F} \abs{\int _ {\Omega} f(\omega) \dd P(\omega) - \int _ {\Omega} f(\omega) \dd Q(\omega)}
Total variation distance
Let $\F \subseteq \L(\Omega \to \R)$ be the set of all indicator functions $1 _ A$ of measurable sets $A \in \mathcal A$. The integral probability metric defined by $\F$ is called the total variation distance, $\sup _ {A \in \mathcal A} \abs{P(A) - Q(A)}$.
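As a minimal sketch of this definition, assume a finite sample space so that every subset is measurable. The supremum over indicator functions can then be taken by brute force and compared with the well-known closed form $\frac{1}{2} \sum _ \omega \abs{P(\omega) - Q(\omega)}$. The function names and the example distributions are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def tv_by_indicators(p, q):
    """Sup over indicator functions 1_A of |P(A) - Q(A)|, enumerating every subset A."""
    n = len(p)
    best = 0.0
    for r in range(n + 1):
        for subset in combinations(range(n), r):
            gap = abs(sum(p[i] for i in subset) - sum(q[i] for i in subset))
            best = max(best, gap)
    return best

def tv_closed_form(p, q):
    """Half the L1 distance between the two probability vectors."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Illustrative distributions on a 3-point space (assumed values).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
print(tv_by_indicators(p, q), tv_closed_form(p, q))
```

The brute-force search is exponential in the size of the space, so it only serves as a check of the definition on tiny examples.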
To define the $f$-divergence, we require $P \ll Q$. By the Radon-Nikodym theorem, there exists a $Q$-almost everywhere unique density (the Radon-Nikodym derivative) $\dd P / \dd Q \in \L(\Omega \to [0, \infty])$.
Suppose $f: [0, \infty] \to (-\infty, +\infty]$ is a convex (smile-shaped) function such that $f\big((0, \infty)\big) \subset \R$ and $f(1) = 0$.
Define the $f$-divergence $D _ f$ between the distributions as follows.
D _ f(P \Vert Q):= \int _ \Omega f\p{\frac{\dd P}{ \dd Q}} \dd Q
- By Jensen's inequality, all $f$-divergences are non-negative.
D _ f(P \Vert Q) = E _ Q\s{f\p{\frac{\dd P}{\dd Q}}} \ge f\p{E _ Q \s{\frac{\dd P}{\dd Q}}} = f(1) = 0
Suppose $(\Omega, \A, \mu)$ is a reference measure space, and $P, Q$ have densities $p, q$ with respect to $\mu$, that is, $\dd P = p \dd \mu$ and $\dd Q = q \dd \mu$. We also write
D _ f(p \Vert q) := \int _ {\Omega} f\p{\frac{p(x)}{q(x)}} q(x) \dd \mu(x)
- The two definitions are equivalent, since $\frac{\dd P}{\dd Q} = \frac{p}{q}$ holds $Q$-almost everywhere.
Suppose $X, Y$ are two random variables taking values in the same space $(\Omega, \A)$. We write $D _ f(X \Vert Y) := D _ f(P _ X \Vert P _ Y)$.
The KL-divergence is a special $f$-divergence.
Taking $f(x) = x\ln (x)$ gives the KL-divergence, and taking $g(x) = -\ln(x)$ gives the reverse KL-divergence.
D _ {\mathup{KL}}(p \Vert q) = \int p(x) \ln \frac{p(x)}{q(x)} \dd \mu(x) = - \int \ln \frac{q(x)}{p(x)} p(x) \dd \mu (x) = D _ {-\mathup{KL}}(q \Vert p)
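The identity above can be checked numerically. The following is a minimal sketch on a finite space with counting measure, assuming $q(x) > 0$ and $p(x) > 0$ everywhere; the helper names and the example distributions are hypothetical.

```python
import math

def f_divergence(f, p, q):
    """D_f(p || q) = sum_x f(p(x) / q(x)) q(x), assuming q(x) > 0 for all x."""
    return sum(f(pi / qi) * qi for pi, qi in zip(p, q))

def kl_generator(x):
    # f(x) = x ln x, with the convention 0 ln 0 = 0
    return x * math.log(x) if x > 0 else 0.0

def reverse_kl_generator(x):
    # g(x) = -ln x, finite only for x > 0
    return -math.log(x)

# Illustrative distributions (assumed values).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

kl_pq = f_divergence(kl_generator, p, q)             # D_f(p || q)
rev_qp = f_divergence(reverse_kl_generator, q, p)    # D_g(q || p)
direct = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
print(kl_pq, rev_qp, direct)
```

All three printed values should agree up to floating-point error, and they are non-negative as Jensen's inequality predicts.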
Suppose $p _ * (x)$ is a data density on $\Omega$, and $p _ \theta(x)$ is a density generated by a statistical model.
- $\d{p _ * (x)}{p _ \theta(x)}$ is known as the forward KL.
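- Minimizing the forward KL over $\theta$ is equivalent to maximizing the expected log-likelihood $E _ {p _ * }\s{\log p _ \theta(X)}$, since the entropy $H(p _ * )$ does not depend on $\theta$: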
\d{p _ * }{p _ \theta} = \int p _ * (x) \log \frac{p _ * (x)}{p _ \theta(x)} \dd x = -H(p _ * ) - \int p _ * (x) \log p _ \theta(x) \dd x
- $\d{p _ \theta(x)}{p _ * (x)}$ is known as the backward KL.
- Minimizing the backward KL is known to make $p _ \theta$ concentrate on a particular mode of $p _ * $ (mode-seeking behavior).
- Suppose we know $p _ * (x)$ is of the form $p _ * (x) = e^{-U(x)} / Z$. Then for $X \sim p _ \theta(x)$,
\d{p _ \theta}{p _ * } = E\s{\log p _ \theta(X)} + E [U(X)] + \log Z
- Since $\log Z$ does not depend on $\theta$, the backward KL can be optimized using only the unnormalized density $e^{-U(x)}$.
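Both decompositions in the list above can be verified on a small example. This is a sketch under stated assumptions: the space is finite, the energies `U` are made-up values, and `p_theta` is an arbitrary candidate distribution rather than the output of any real model.

```python
import math

# Hypothetical energies U(x) on a 4-point space; p_*(x) = exp(-U(x)) / Z.
U = [0.0, 1.0, 2.5, 0.7]
Z = sum(math.exp(-u) for u in U)
p_star = [math.exp(-u) / Z for u in U]

def kl(p, q):
    """KL(p || q) on a finite space, with the convention 0 ln 0 = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def expected_log_likelihood(p_theta):
    # E_{p_*}[log p_theta(X)], the quantity maximized by maximum likelihood
    return sum(ps * math.log(pt) for ps, pt in zip(p_star, p_theta))

def unnormalized_backward_objective(p_theta):
    # E[log p_theta(X)] + E[U(X)] for X ~ p_theta; Z is never used here
    return sum(pt * (math.log(pt) + u) for pt, u in zip(p_theta, U) if pt > 0)

p_theta = [0.4, 0.3, 0.1, 0.2]  # assumed candidate model distribution

# Forward KL = -H(p_*) - E_{p_*}[log p_theta]
forward_gap = kl(p_star, p_theta) - (-entropy(p_star) - expected_log_likelihood(p_theta))
# Backward KL = E[log p_theta] + E[U] + log Z
backward_gap = kl(p_theta, p_star) - (unnormalized_backward_objective(p_theta) + math.log(Z))
print(forward_gap, backward_gap)
```

Both gaps should vanish up to floating-point error. Because `unnormalized_backward_objective` differs from the backward KL only by the constant `log Z`, ranking candidate distributions by either quantity gives the same order.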
Log-sum inequality
Suppose $I$ is a countable index set, and $(a _ k) _ {k \in I}$ and $(b _ k) _ {k \in I}$ are nonnegative real numbers.
Suppose $\sum _ {k \in I} a _ k = a \in (0, \infty)$ and $\sum _ {k \in I} b _ k = b \in (0, \infty)$. Then
\sum _ {k \in I} a _ k \log \frac{a _ k}{b _ k} \ge a \log \frac{a} {b}
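Define $p _ k = a _ k / a$ and $q _ k = b _ k / b$. Since $\sum _ {k \in I} a _ k \log \frac{a _ k}{b _ k} = a \, \d{p}{q} + a \log \frac{a}{b}$, the inequality is equivalent to $\d{p}{q} \ge 0$.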
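A quick numerical check of the log-sum inequality and of its reduction to the non-negativity of the KL-divergence is sketched below. The sequences `a` and `b` are arbitrary positive numbers chosen for illustration, and the index set is assumed finite.

```python
import math

# Illustrative positive sequences (assumed values), not normalized.
a = [1.0, 2.0, 0.5]
b = [0.5, 1.0, 3.0]

total_a, total_b = sum(a), sum(b)

# Left-hand side: sum_k a_k log(a_k / b_k)
lhs = sum(ak * math.log(ak / bk) for ak, bk in zip(a, b))
# Right-hand side: a log(a / b)
rhs = total_a * math.log(total_a / total_b)

# Normalize to probability vectors and compute KL(p || q).
p = [ak / total_a for ak in a]
q = [bk / total_b for bk in b]
kl_pq = sum(pk * math.log(pk / qk) for pk, qk in zip(p, q))

print(lhs, rhs, kl_pq)
```

The difference `lhs - rhs` equals `total_a * kl_pq`, so the inequality holds exactly when the KL-divergence is non-negative.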