Gaussian Density Cheatsheet #
Gaussian density function #
A parameterized probability density for $x \in \R^d$, with mean $\mu \in \R^d$ and positive definite covariance matrix $\Sigma\in \R^{d \times d}$. $$ p _ {\mu, \Sigma}(x) = \mathcal N(x ; \mu, \Sigma) =\frac{1}{(2 \pi)^{d / 2}(\operatorname{det} \Sigma)^{1 / 2}} \exp\left[{-\frac{1}{2}{(x-\mu)}^T{\Sigma^{-1}(x-\mu)}}\right] $$
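A minimal numeric sanity check of this formula against `scipy.stats.multivariate_normal` (a sketch; `gaussian_pdf` and all other names are illustrative, not from the text):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    """Evaluate N(x; mu, Sigma) directly from the formula above."""
    d = mu.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x-mu)^T Sigma^{-1} (x-mu)
    norm = (2 * np.pi) ** (d / 2) * np.linalg.det(Sigma) ** 0.5
    return np.exp(-0.5 * quad) / norm

rng = np.random.default_rng(0)
d = 3
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)                  # positive definite by construction
mu, x = rng.standard_normal(d), rng.standard_normal(d)
assert np.isclose(gaussian_pdf(x, mu, Sigma),
                  multivariate_normal(mu, Sigma).pdf(x))
```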
Splitting joint Normal distribution #
Suppose $X _ a \in \L(\Omega \to \R^n)$, $X _ b \in \L(\Omega \to \R^m)$ and $X = (X _ a, X _ b) \sim \mathcal N(\mu, \Sigma)$. Partition $\mu=\left(\begin{array}{l}\mu _ {a} \\ \mu _ {b}\end{array}\right)$ and $\Sigma =\left(\begin{array}{ll} \Sigma _ {a a} & \Sigma _ {a b} \\ \Sigma _ {b a} & \Sigma _ {b b}\end{array}\right)$, and partition the precision matrix $\Lambda = \Sigma^{-1} =\left(\begin{array}{cc} \Lambda _ {a a} & \Lambda _ {a b} \\ \Lambda _ {b a} & \Lambda _ {b b}\end{array}\right)$ the same way.
We can immediately see that $p(x _ a|x _ b)$ is Normal by inspecting the unnormalized conditional density; we can solve for its mean and covariance as follows.
Recall that: $$ p(x _ a, x _ b) = \frac{1}{(2\pi)^{(n+m)/2}|\Sigma|^{1/2}} \exp \left (-\frac{1}{2} (x^T \Lambda x - 2x^T \Lambda \mu + \text{const.})\right) $$
Define $\Delta _ a = (x _ a - \mu _ a)$ and $\Delta _ b = (x _ b - \mu _ b)$. Since $\Lambda _ {ba} = \Lambda _ {ab}^T$, we have $$ p(x _ a, x _ b) = \frac{1}{(2\pi)^{(n+m)/2}|\Sigma|^{1/2}} \exp \left (-\frac{1}{2}(\Delta _ a^T \Lambda _ {aa}\Delta _ a + 2 \Delta _ a^T \Lambda _ {ab} \Delta _ b + \Delta _ b^T \Lambda _ {bb} \Delta _ b)\right) $$
Now treat $x _ b$ as a constant; collecting terms by powers of $x _ a$ gives: $$ \frac{1}{(2\pi)^{(n+m)/2}|\Sigma|^{1/2}} \exp\left ( -\frac{1}{2}\left(x _ {a}^T \Lambda _ {aa}x _ a + 2 x _ a^T(\Lambda _ {ab}(x _ b - \mu _ b) - \Lambda _ {aa}\mu _ a) + \text{const.}\right) \right) $$
Completing the square in $x _ a$, the conditional density $p(x _ a|x _ b)$ is $\mathcal N(x _ a;\mu _ a - \Lambda _ {aa}^{-1} \Lambda _ {ab}(x _ b - \mu _ b), \Lambda _ {aa}^{-1})$.
Recall the partitioned matrix inverse identity: $$ \left(\begin{array}{ll} A & B \\ C & D \end{array}\right)^{-1}=\left(\begin{array}{cc} M & - M B D ^{-1} \\ - D ^{-1} C M & D ^{-1}+ D ^{-1} C M B D ^{-1}\end{array}\right);\quad M =\left( A - B D ^{-1} C \right)^{-1} $$
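A quick numeric spot-check of this identity (a sketch; shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
p, q = 2, 3
T = rng.standard_normal((p + q, p + q))
S = T @ T.T + (p + q) * np.eye(p + q)      # positive definite => invertible blocks
A, B = S[:p, :p], S[:p, p:]
C, D = S[p:, :p], S[p:, p:]

Dinv = np.linalg.inv(D)
M = np.linalg.inv(A - B @ Dinv @ C)        # inverse of the Schur complement
inv_blocks = np.block([[M,             -M @ B @ Dinv],
                       [-Dinv @ C @ M, Dinv + Dinv @ C @ M @ B @ Dinv]])
assert np.allclose(inv_blocks, np.linalg.inv(S))
```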
By definition, we have $\left(\begin{array}{ll} \Sigma _ {a a} & \Sigma _ {a b} \\ \Sigma _ {b a} & \Sigma _ {b b}\end{array}\right)^{-1}=\left(\begin{array}{ll} \Lambda _ {a a} & \Lambda _ {a b} \\ \Lambda _ {b a} & \Lambda _ {b b}\end{array}\right)$.
Therefore we can express the blocks of $\Lambda$ in terms of the blocks of $\Sigma$:
- $\Lambda _ {a a}=\left( \Sigma _ {a a}- \Sigma _ {a b} \Sigma _ {b b}^{-1} \Sigma _ {b a}\right)^{-1}$.
- $\Lambda _ {a b}=-\left( \Sigma _ {a a}- \Sigma _ {a b} \Sigma _ {b b}^{-1} \Sigma _ {b a}\right)^{-1} \Sigma _ {a b} \Sigma _ {b b}^{-1}$.
Substituting these back in gives $p(x _ a|x _ b) = \mathcal N(x _ a; \mu _ {a}+ \Sigma _ {a b} \Sigma _ {b b}^{-1}\left( x _ {b}- \mu _ {b}\right), \Sigma _ {a a}- \Sigma _ {a b} \Sigma _ {b b}^{-1} \Sigma _ {b a})$.
The marginal density is $p(x _ a) = \mathcal N(x _ a; \mu _ a, \Sigma _ {aa})$, which follows from evaluating the characteristic function of $X$ at $(t _ a, 0)$.
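The two forms of the conditional (precision blocks vs. Schur complement) can be cross-checked numerically; a sketch with illustrative names:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 2, 3
T = rng.standard_normal((n + m, n + m))
Sigma = T @ T.T + (n + m) * np.eye(n + m)  # joint covariance, positive definite
mu = rng.standard_normal(n + m)
mu_a, mu_b = mu[:n], mu[n:]
Saa, Sab = Sigma[:n, :n], Sigma[:n, n:]
Sba, Sbb = Sigma[n:, :n], Sigma[n:, n:]
Lam = np.linalg.inv(Sigma)                 # precision matrix
Laa, Lab = Lam[:n, :n], Lam[:n, n:]

x_b = rng.standard_normal(m)
# Precision form: mu_a - Laa^{-1} Lab (x_b - mu_b), covariance Laa^{-1}.
mean_prec = mu_a - np.linalg.solve(Laa, Lab @ (x_b - mu_b))
cov_prec = np.linalg.inv(Laa)
# Covariance form: mu_a + Sab Sbb^{-1} (x_b - mu_b), Schur-complement covariance.
mean_cov = mu_a + Sab @ np.linalg.solve(Sbb, x_b - mu_b)
cov_cov = Saa - Sab @ np.linalg.solve(Sbb, Sba)
assert np.allclose(mean_prec, mean_cov) and np.allclose(cov_prec, cov_cov)
```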
Linear Gaussian models #
Suppose $X \in \L(\Omega \to \R^m)$ and $Z \in \L(\Omega \to \R^n)$ with $X \sim \mathcal N(\mu, \Lambda^{-1})$, $Z \sim \mathcal N(b, L^{-1})$, and $X \perp Z$.
Given $A \in \R^{n \times m}$, define $Y = AX + Z \in \L(\Omega \to \R^n)$. Then the joint distribution of $(X, Y)$ is still Gaussian.
- Notice that $\left(\begin{array}{c} X \\ Y\end{array}\right) = \left(\begin{array}{cc}I & O \\ A & I\end{array}\right)\left(\begin{array}{l} X \\ Z\end{array}\right)$.
- Therefore $(X, Y)$ is jointly normally distributed, being a linear image of the Gaussian vector $(X, Z)$.
- $E\left(\begin{array}{c} X\\ Y\end{array}\right)= \left(\begin{array}{cc}I & O \\ A & I\end{array}\right)\left(\begin{array}{c}\mu\\ b\end{array}\right) = \left(\begin{array}{c}\mu\\ A\mu + b\end{array}\right)$.
- $\cov\left(\begin{array}{c}X \\ Y\end{array}\right) = \left(\begin{array}{cc}I & O \\ A & I\end{array}\right)\left(\begin{array}{cc}\Lambda^{-1} & O \\ O & L^{-1}\end{array}\right)\left(\begin{array}{cc}I & A^T \\ O & I\end{array}\right) = \left(\begin{array}{cc}\Lambda^{-1} & \Lambda^{-1}A^T \\ A\Lambda^{-1} & A\Lambda^{-1}A^T + L^{-1}\end{array}\right)$
- So far we have the density functions:
- $p(x) = \mathcal N(x; \mu, \Lambda^{-1})$.
- $p(y|x) = \mathcal N(y; Ax + b, L^{-1})$.
- $p(y) = \mathcal N(y; A\mu + b, A \Lambda^{-1}A^T + L^{-1})$.
- Applying the conditioning result from the previous section, $p(x|y) = \mathcal N\left(x; \Sigma\left\{ A ^{ T } L ( y - b )+ \Lambda \mu \right\}, \Sigma\right)$, where $\Sigma =\left( \Lambda + A ^{ T } L A \right)^{-1}$ (see the sketch below).
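This posterior can be cross-checked against generic conditioning on the joint covariance derived above; a hedged numeric sketch (all variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 2
A = rng.standard_normal((n, m))
P = rng.standard_normal((m, m)); Lam = P @ P.T + m * np.eye(m)   # prior precision
Q = rng.standard_normal((n, n)); L = Q @ Q.T + n * np.eye(n)     # noise precision
mu, b = rng.standard_normal(m), rng.standard_normal(n)
y = rng.standard_normal(n)

Lam_inv, L_inv = np.linalg.inv(Lam), np.linalg.inv(L)
# Joint covariance blocks from the bullet above.
Sxx, Sxy = Lam_inv, Lam_inv @ A.T
Syy = A @ Lam_inv @ A.T + L_inv
# Generic conditioning of (X, Y) on Y = y.
mean_cond = mu + Sxy @ np.linalg.solve(Syy, y - (A @ mu + b))
cov_cond = Sxx - Sxy @ np.linalg.solve(Syy, Sxy.T)
# Closed-form posterior.
S = np.linalg.inv(Lam + A.T @ L @ A)
mean_post = S @ (A.T @ L @ (y - b) + Lam @ mu)
assert np.allclose(mean_cond, mean_post) and np.allclose(cov_cond, S)
```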
KL-divergence of Gaussian densities #
Suppose $p _ 1(x) = \mathcal N(x; \mu _ 1, \Sigma _ 1)$ and $p _ 2(x) = \mathcal N(x; \mu _ 2, \Sigma _ 2)$ are two densities on $\R^d$. $$ \d{p _ 1(x)}{p _ 2(x)} = \frac{1}{2} \left [ \log \det\Sigma _ 2 - \log \det \Sigma _ 1 - d + \operatorname{tr}(\Sigma _ 2^{-1}\Sigma _ 1) + (\mu _ 2 - \mu _ 1)^T \Sigma _ 2^{-1}(\mu _ 2 - \mu _ 1) \right] $$ Further suppose that $p _ 1$ and $p _ 2$ have diagonal covariances, $\Sigma _ k = \operatorname{diag}(\sigma _ {k,1}^2, \dots, \sigma _ {k,d}^2)$; then $$ \d{p _ 1(x)}{p _ 2(x)} = \frac{1}{2} \left [ 2 \sum _ {i} \log \sigma _ {2,i} - 2 \sum _ i \log \sigma _ {1, i} - d + \sum _ i{\frac{\sigma^2 _ {1,i}}{\sigma^2 _ {2,i}}} + \sum _ i\frac{(\mu _ 2 - \mu _ 1)^2 _ i}{\sigma^2 _ {2,i}} \right] $$
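The closed form can be spot-checked against a (noisy) Monte Carlo estimate of $E _ {p _ 1}[\log p _ 1 - \log p _ 2]$; a sketch assuming scipy, with loose tolerances because the estimator is stochastic:

```python
import numpy as np
from scipy.stats import multivariate_normal

def kl_gauss(mu1, S1, mu2, S2):
    """Closed-form KL divergence D(N(mu1, S1) || N(mu2, S2))."""
    d = mu1.shape[0]
    _, ld1 = np.linalg.slogdet(S1)
    _, ld2 = np.linalg.slogdet(S2)
    diff = mu2 - mu1
    return 0.5 * (ld2 - ld1 - d + np.trace(np.linalg.solve(S2, S1))
                  + diff @ np.linalg.solve(S2, diff))

rng = np.random.default_rng(4)
d = 2
mu1, mu2 = rng.standard_normal(d), rng.standard_normal(d)
A1 = rng.standard_normal((d, d)); S1 = A1 @ A1.T + d * np.eye(d)
A2 = rng.standard_normal((d, d)); S2 = A2 @ A2.T + d * np.eye(d)

p1, p2 = multivariate_normal(mu1, S1), multivariate_normal(mu2, S2)
xs = p1.rvs(size=200_000, random_state=rng)
kl_mc = np.mean(p1.logpdf(xs) - p2.logpdf(xs))   # noisy Monte Carlo estimate
assert np.isclose(kl_gauss(mu1, S1, mu2, S2), kl_mc, rtol=0.05, atol=0.02)
```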
Differential entropy of Gaussian density #
For a single Gaussian $X \sim \mathcal N (\mu, \Sigma)$ on $\R^d$, the differential entropy is: $$ h(X) = -E _ {X} \log p(X) = \frac{1}{2} \log [(2\pi e)^d|\Sigma|] = \frac{1}{2}[d \log(2\pi) + d + 2 \sum _ i \log \sigma _ {i}] $$ where the last equality assumes a diagonal covariance $\Sigma = \operatorname{diag}(\sigma _ 1^2, \dots, \sigma _ d^2)$.
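This matches scipy's built-in entropy in the diagonal case; a minimal illustrative sketch:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(5)
d = 4
sigma = rng.uniform(0.5, 2.0, size=d)            # diagonal standard deviations
Sigma = np.diag(sigma ** 2)
h = 0.5 * (d * np.log(2 * np.pi) + d + 2 * np.log(sigma).sum())
assert np.isclose(h, multivariate_normal(np.zeros(d), Sigma).entropy())
```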
Cross entropy of Gaussian densities #
$$ h(p,q) = -\int p(x) \log q(x) \dd x = -E _ {X\sim p(x)}\log q(X) = h(p) + D(p || q) $$
For diagonal Gaussians, the cross entropy is: $$ h(p _ 1, p _ 2) = \frac{1}{2}\left [ d \log(2\pi) + 2 \sum _ i \log \sigma _ {2, i} + \sum _ i \frac{\sigma _ {1, i}^2}{\sigma _ {2, i}^2} + \sum _ i \frac{(\mu _ 2 - \mu _ 1) _ i^2}{\sigma _ {2,i}^2} \right] $$
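As a consistency check, this diagonal formula should equal $h(p _ 1) + D(p _ 1 \| p _ 2)$ assembled from the earlier closed forms; an illustrative sketch:

```python
import numpy as np

rng = np.random.default_rng(6)
d = 3
mu1, mu2 = rng.standard_normal(d), rng.standard_normal(d)
s1, s2 = rng.uniform(0.5, 2.0, d), rng.uniform(0.5, 2.0, d)   # std deviations

h_p1 = 0.5 * (d * np.log(2 * np.pi) + d + 2 * np.log(s1).sum())
kl = 0.5 * (2 * np.log(s2).sum() - 2 * np.log(s1).sum() - d
            + (s1**2 / s2**2).sum() + ((mu2 - mu1)**2 / s2**2).sum())
cross = 0.5 * (d * np.log(2 * np.pi) + 2 * np.log(s2).sum()
               + (s1**2 / s2**2).sum() + ((mu2 - mu1)**2 / s2**2).sum())
assert np.isclose(cross, h_p1 + kl)
```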