# 1014: Generative Autoencoder (GAE)

CATSMILE-1014

## Preface

• Goal: Re-organize the framework around generative autoencoders, using a single optimization-centric view to cover a family of statistical learning methods ranging from MLR, PCA, and NMF to tSNE, LLE, NN-VAE, GAN, and so on, while thinking about the boundary between probabilistic modeling and optimization.

• Background and motivation: While writing 1013 I confused myself because the latent-variable framework was not clear. Having since formed a rough picture, this note distills the main framework and reorganizes it.

• Conclusion:

• Remarks: An old topic that keeps renewing itself. I have a vague feeling this note will become a milestone for CATSMILE.

• Update log:

• 20220629: added the marginal-distribution variational perspective

• 20220619: added the variational perspective

• 20220614

• Related models:

• PCA

• LLGAE

• Deep Image Prior

• Keywords:

• AutoEncoder

• Dimensionality Reduction

• Unsupervised Learning

• Compressed Sensing

• Future directions:

• Transformer Attention Variational Autoencoders

• Prior works cited in the VAE paper

## The Variational View of the Negative Loss: From the Self-Mapping Joint Distribution to the Marginal Distribution

\begin{split}\begin{align}
L(e,r) &= -D_{KL}( p_{data}(x) \delta(y|x) || p_{data}(x) \sum_z q_e(z|x) q_r(y|z) ) \\
&= \sum_{x=y} p_{data}(x) \delta(y|x) \left(\log { p_{data}(x) \sum_z q_e(z|x) q_r(y|z) \over p_{data}(x) \delta(y|x)} \right) \\
&= c + \sum_{x=y} p_{data}(x) \left(\log { \sum_z q_e(z|x) q_r(x|z) } \right) \\
&= c + \sum_{x=y} p_{data}(x) \left(\log { \sum_z q_e(z) q_r(x|z) } \right) \\
&= c - D_{KL} (p_{data}(x) || \sum_z q_e(z) q_r(x|z) )
\end{align}\end{split}

In the fourth line the conditional encoder $q_e(z|x)$ is replaced by an unconditional code distribution $q_e(z)$: this removes the encoder from the picture and turns the objective into fitting the marginal $\sum_z q_e(z) q_r(x|z)$ to $p_{data}(x)$, which is the form used below with the learnable code distribution $q_w(z)$.

\begin{split}\begin{align}
L(w,r) &= -D_{KL}(p_{data}(x) || \sum_z q_w(z) q_r(x|z)) \\
&= c + E_{p_{data}(x)}[ \log \sum_z q_w(z) q_r(x|z) ] \\
&= c + E_{p_{data}(x)}[ \log \sum_z \exp \log q_w(z) q_r(x|z) ] \\
&\geq c + E_{p_{data}(x)}[ \max_z \log q_w(z) q_r(x|z) ] \\
&\geq c + E_{p_{data}(x)}[ l_m(z_0,t) ] \\
z_t &= f_m(z_0,t) \\
l_m(z_0,t) &= \log q_w(z_t) q_r(x|z_t) \\
L_{GAE}(m) &= E_{p_{data}(x)}[ l_m(z_0,t) ] \leq E_{p_{data}(x)}[ \max_z \log q_w(z) q_r(x|z)]
\end{align}\end{split}
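Read operationally, the last three lines say: start from an initial code $z_0$, run a small inner optimizer $f_m$ for $t$ steps to obtain $z_t$, score it with $l_m(z_0,t) = \log q_w(z_t) q_r(x|z_t)$, and use that score as the (negative) training loss for the generator. Below is a minimal PyTorch sketch under illustrative assumptions: a standard-normal $q_w$, a Gaussian decoder `dec`, and plain gradient ascent as $f_m$; all names and sizes are placeholders.

```python
# Minimal sketch of L_GAE(m): inner gradient ascent on z (the map f_m(z_0, t)),
# outer gradient step on the decoder. q_w(z) is standard normal and q_r(x|z) is an
# isotropic Gaussian around dec(z); all names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

dec = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 32))  # mean of q_r(x|z)
opt_dec = torch.optim.Adam(dec.parameters(), lr=1e-3)

def log_q(z, x):
    # log q_w(z) + log q_r(x|z), up to additive constants
    return -0.5 * (z ** 2).sum(-1) - 0.5 * ((dec(z) - x) ** 2).sum(-1)

def f_m(x, z0, t=20, lr=0.1):
    # inner optimizer: t gradient-ascent steps on z, i.e. z_t = f_m(z_0, t)
    z = z0.clone().requires_grad_(True)
    opt_z = torch.optim.SGD([z], lr=lr)
    for _ in range(t):
        opt_z.zero_grad()
        (-log_q(z, x)).sum().backward()
        opt_z.step()
    return z.detach()

for _ in range(100):                      # outer loop over (batches of) data
    x = torch.randn(16, 32)               # stand-in for a batch from p_data(x)
    z_t = f_m(x, torch.zeros(16, 8))      # refine the latent code per sample
    loss = -log_q(z_t, x).mean()          # -E[l_m(z_0, t)]
    opt_dec.zero_grad()
    loss.backward()
    opt_dec.step()
```

Since $l_m(z_0,t) \leq \max_z \log q_w(z) q_r(x|z)$, running more inner steps tightens the bound at the cost of compute.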

## The Variational View of the Negative Loss: Self-Mapping

\begin{split}\begin{align}
L(e,r) &= -D_{KL}( p_{data}(x) \delta(y|x) || p_{data}(x) \sum_z q_e(z|x) q_r(y|z) ) \\
&= \sum_{x=y} p_{data}(x) \delta(y|x) \left(\log { p_{data}(x) \sum_z q_e(z|x) q_r(y|z) \over p_{data}(x) \delta(y|x)} \right) \\
&= \sum_{x=y} p_{data}(x) \delta(y|x) \left(\log { \sum_z q_e(z|x) q_r(y|z) \over \delta(y|x)} \right) \\
&= E_{p_{data}(x)}[-D_{KL}(\delta(y|x) || \sum_z q_e(z|x) q_r(y=x|z))] \\
&= \sum_x p_{data}(x) \log { \sum_z q_e(z|x) q_r(y=x|z) } \\
&= E_{p_{data}(x)}[\log { \sum_z q_e(z|x) q_r(y=x|z) }] \\
&\geq E_{p_{data}(x)}[ \max_z \log q_e(z|x) q_r(y=x|z) ] \\
&= E_{p_{data}(x)}[ \max_z \log q(y=x,z|x) ] \\
&= E_{p_{data}(x)}[ \max_z \log q(z|x) ] \\
&= E_{p_{data}(x)}[ \max_z \log { q(x|z) q(z) \over q(x) } ] \\
L(r) &= E_{p_{data}(x)}[ \max_z \log { q_r(x|z) q_g(z) } ] + c
\end{align}\end{split}

## Original Derivation of the Loss Function

\begin{split} \begin{align} z &= e(x) \\ \hat x &= g(z) \\ \hat x &=g(e(x)) \\ L(m) &= E_{p(x)}[ -d(\hat x(m),x) ] \end{align} \end{split}

\begin{split} \begin{align} L(m) &= E_{p(x)}[ -d(\hat x(m),x) ] \\ \text{or} \ L(m) &= \min_X [ -d(\hat x(m),x) ]\\ \text{or} \ L(m) &= -\log \sum_x \exp \left( \beta \cdot d\left(\hat x(m),x\right) \right) \\ \text{or} \ L(m) &= -\log E_{p(x)} [\exp \left( \beta \cdot d\left(\hat x(m),x\right) \right)] \end{align} \end{split}

\begin{split} \left \{ \begin{aligned}\sum f_k(z) &= 1 \\ -d(\hat x(m),x) &= \log p(x|m,z) \\ p(x) &= 1/|X|={1\over N_x} \end{aligned} \right. \end{split}
\begin{split} \begin{align} L(m) &= E[ -d(\hat x(m),x) ]\\ &= E [ \log p(x|m,z) ] \\ &= E[\log \sum_k f_k(z) p(x|k,m)] \\ &= E[l(x,m,z)] \end{align} \end{split}

\begin{split} \begin{align} L(m) &\leq E[\max_z l(x,m,z)] \\ &\leq \max_m E[\max_z l(x,m,z)]\\ &=\max L \end{align} \end{split}

\begin{split} \begin{align} \hat z &= e(x)\\ &=\text{argmax}_z l(x,m,z)\\ &= \text{argmax}_z [\log \sum_k f_k(z) p(x|k,m)] \\ &= \text{argmax}_z [\log p(x|m,z)] \end{align} \end{split}
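To make the argmax encoder concrete, consider the simplest discrete instance (an illustrative assumption, not the general construction): let $z$ be a one-hot selection over $K$ components with $f_k(z) = z_k$, and let $p(x|k,m)$ be an isotropic Gaussian around a centroid $\mu_k$. Then $\hat z = \text{argmax}_z\, l(x,m,z)$ is just nearest-centroid assignment, and alternating it with updates of $m$ gives a k-means-like loop:

```python
# Minimal numpy sketch: GAE with a one-hot code z over K components and Gaussian
# p(x|k,m). The argmax encoder reduces to nearest-centroid assignment and the
# decoder update is the assigned mean (k-means-like). Sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))            # stand-in for samples from p_data(x)
K = 4
mu = rng.normal(size=(K, 2))             # model m: component means mu_k

for _ in range(20):
    # encoder: z_hat = argmax_z l(x, m, z) = argmin_k ||x - mu_k||^2
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (N, K)
    k_hat = d2.argmin(axis=1)
    # model update: maximize E[l(x, m, z_hat)] over mu
    for k in range(K):
        if (k_hat == k).any():
            mu[k] = X[k_hat == k].mean(axis=0)

d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
L = -d2[np.arange(len(X)), d2.argmin(axis=1)].mean()   # E[l(x, m, z_hat)] up to constants
```

This is one sense in which k-means can be read as an autoencoder, a point revisited in the quotes at the end of this note.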

### Special Case: GAE Based on a Gaussian-Kernel Continuous Code-Space Mixture Model (GGAE)

In CATSMILE-1013 we constructed a regularized code space and used it to mix the candidate components into a generative distribution.

### Special Case: GAE Based on a Linear Continuous Code-Space Mixture Model (LGAE)

$\log p(x|m,z) \propto - ||x - \mu - W_m z ||^2$

$\hat z = \text{argmax}_z \left( - ||x - \mu - W_m z ||^2 \right)$

Writing $L = -||x - \mu - W_m z||^2$, the gradient with respect to the code is

${\partial \over \partial z} L = 2 W_m^T (x-\mu - W_m z)$

Assuming the columns of $W_m$ are orthonormal, so that $W_m^T W_m = I$,

$\begin{split} {\partial \over \partial z} L = 2( W_m^T (x-\mu ) - z) \\ z + {1\over 2}{\partial \over \partial z} L = W_m^T (x-\mu ) \end{split}$

Setting the gradient to zero gives the closed-form encoder

$\begin{split} {\partial \over \partial z} L =\vec{0} = 2( W_m^T (x-\mu ) - z) \\ \hat z = W_m^T (x-\mu ) \end{split}$
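A minimal numpy sketch of the closed-form LGAE encoder and decoder; the orthonormal $W$ (obtained here from a QR factorization) and the data are illustrative placeholders. With $W$ set to the leading principal directions this is exactly the PCA autoencoder.

```python
# Minimal numpy sketch of LGAE: z_hat = W^T (x - mu), x_hat = mu + W z_hat.
# Assumes W has orthonormal columns (W^T W = I); all sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                 # stand-in data
mu = X.mean(axis=0)
W, _ = np.linalg.qr(rng.normal(size=(10, 3)))  # orthonormal columns

Z = (X - mu) @ W          # encoder: z_hat = W^T (x - mu), applied row-wise
X_hat = mu + Z @ W.T      # decoder: x_hat = mu + W z_hat
neg_loss = -((X - X_hat) ** 2).sum(axis=1).mean()   # E[log p(x|m, z_hat)] up to constants
```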

### Special Case: Locally Linear Generative Autoencoder (LLGAE)

\begin{split} \begin{align} f_k(x) &\propto \exp\left(-(x-\mu_k)^2\right) \\ \log p(x|m,z) &\propto -||x - \mu(x) - W(x) z || ^2 \\ &= -\left\|x - \sum_k \text{softargmax}_k\left(-(x-\mu_k)^2\right) (\mu_k + W_k z)\right\|^2 \end{align} \end{split}

\begin{split} \begin{align} L(x,m,k,z) &= \log p(x|m,k, z) \\ &\propto - ||x - \mu_k - W_k z || ^2 \\ &\leq \max_{\mu_k,W_k}\max_k \max_z \left(- || x - \mu_k - W_k z||^2\right) \end{align} \end{split}
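A minimal numpy sketch of the LLGAE reconstruction under illustrative fixed components $(\mu_k, W_k)$: each component contributes a local linear code $z_k = W_k^T (x - \mu_k)$ (assuming orthonormal $W_k$), combined either softly with the $\text{softargmax}$ weights $f_k(x)$ or by hard selection of the best component as in the bound above.

```python
# Minimal numpy sketch of LLGAE: per-component linear codes z_k = W_k^T (x - mu_k),
# combined softly via softargmax over components or hard via the best component.
# Components and sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
K, D, H = 3, 5, 2
mu = rng.normal(size=(K, D))
W = np.stack([np.linalg.qr(rng.normal(size=(D, H)))[0] for _ in range(K)])  # (K, D, H)
x = rng.normal(size=D)

w_k = np.exp(-((x - mu) ** 2).sum(-1))            # f_k(x) ∝ exp(-(x - mu_k)^2)
w_k /= w_k.sum()                                   # softargmax weights
z_k = np.einsum('kdh,kd->kh', W, x - mu)           # z_k = W_k^T (x - mu_k)
x_hat_k = mu + np.einsum('kdh,kh->kd', W, z_k)     # per-component reconstructions

x_hat_soft = (w_k[:, None] * x_hat_k).sum(0)       # soft mixture reconstruction
k_best = ((x - x_hat_k) ** 2).sum(-1).argmin()     # hard: best k under max_z -||x - mu_k - W_k z||^2
x_hat_hard = x_hat_k[k_best]
```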

### Special Case: Deep Image Prior

https://arxiv.org/pdf/1711.10925.pdf

\begin{split}\begin{align} L(f) &= E_{p_{data}(x)} [ \max_\theta -||f_{\theta}(z) - x || ^2 ]\\ \log q_r(x | \theta ) &= -||f_{\theta}(z) - x || ^2 \end{align}\end{split}
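A minimal sketch of the Deep Image Prior objective $\max_\theta -||f_\theta(z) - x||^2$ for a single image: the code $z$ is fixed random noise and only the network parameters $\theta$ are optimized. The architecture, sizes, and step counts below are illustrative assumptions, not those of the original paper.

```python
# Minimal sketch of the Deep Image Prior objective max_theta -||f_theta(z) - x||^2
# for a single target image x; z is fixed and never optimized.
import torch
import torch.nn as nn

x = torch.rand(1, 3, 64, 64)          # the (possibly degraded) target image
z = torch.randn(1, 32, 64, 64)        # fixed random code

f = nn.Sequential(                     # a small convolutional decoder f_theta
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1),
)
opt = torch.optim.Adam(f.parameters(), lr=1e-3)

for step in range(500):               # early stopping supplies the implicit prior
    opt.zero_grad()
    loss = ((f(z) - x) ** 2).sum()    # -log q_r(x|theta) up to a constant
    loss.backward()
    opt.step()
```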

### Special Case: Maximally Activated Input

\begin{split}\begin{align} \max_x f_i(x) &= \max_x \log q(z_i=1|x) \\ \end{align}\end{split}
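A minimal sketch of maximizing a unit's activation over the input, reading $f_i(x)$ as $\log q(z_i=1|x)$: gradient ascent on $x$ with the model held fixed. The classifier `model` (assumed to return class logits) and the sizes are placeholders.

```python
# Minimal sketch of max_x f_i(x) = max_x log q(z_i=1|x) by gradient ascent on the
# input; `model` is an assumed differentiable classifier returning class logits.
import torch

def maximally_activating_input(model, i, shape=(1, 3, 32, 32), steps=200, lr=0.1):
    x = torch.zeros(shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        log_q = model(x).log_softmax(dim=-1)[0, i]  # log q(z_i=1|x)
        (-log_q).backward()                          # ascend on log q
        opt.step()
    return x.detach()
```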

### Supplementary Material: Neural Network Autoencoders

NN-VAE celebA https://github.com/AntixK/PyTorch-VAE

Comparative thesis on AE https://github.com/mainak-ghosh/AutoEncoder

Empirical comparison between autoencoders and traditional dimensionality reduction methods https://arxiv.org/pdf/2103.04874.pdf

The development lineage of autoencoders

Quote from the 2013 paper on DAEs. Strangely, no one seems to mention that k-means is itself an autoencoding algorithm.

Auto-encoders learn an encoder function from input to representation and a decoder function back from representation to input space, such that the reconstruction (composition of encoder and decoder) is good for training examples. Regularized auto-encoders also involve some form of regularization that prevents the auto-encoder from simply learning the identity function, so that reconstruction error will be low at training examples (and hopefully at test examples) but high in general. Different variants of auto-encoders and sparse coding have been, along with RBMs, among the most successful building blocks in recent research in deep learning (Bengio et al., 2013b). Whereas the usefulness of auto-encoder variants as feature learners for supervised learning can directly be assessed by performing supervised learning experiments with unsupervised pre-training, what has remained until recently rather unclear is the interpretation of these algorithms in the context of pure unsupervised learning, as devices to capture the salient structure of the input data distribution. Whereas the answer is clear for RBMs, it is less obvious for regularized auto-encoders. Do they completely characterize the input distribution or only some aspect of it? For example, clustering algorithms such as k-means only capture the modes of the distribution, while manifold learning algorithms characterize the low-dimensional regions where the density concentrates.

Some of the first ideas about the probabilistic interpretation of auto-encoders were proposed by Ranzato et al. (2008): they were viewed as approximating an energy function through the reconstruction error, i.e., being trained to have low reconstruction error at the training examples and high reconstruction error elsewhere (through the regularizer, e.g., sparsity or otherwise, which prevents the auto-encoder from learning the identity function). An important breakthrough then came, yielding a first formal probabilistic interpretation of regularized auto-encoders as models of the input distribution, with the work of Vincent (2011). This work showed that some denoising auto-encoders (DAEs) correspond to a Gaussian RBM and that minimizing the denoising reconstruction error (as a squared error) estimates the energy function through a regularized form of score matching, with the regularization disappearing as the amount of corruption noise goes to 0, and then converging to the same solution as score matching (Hyvärinen, 2005). This connection and its generalization to other energy functions, giving rise to the general denoising score matching training criterion, is discussed in several other papers (Kingma and LeCun, 2010; Swersky et al., 2011; Alain and Bengio, 2013).

Another breakthrough has been the development of an empirically successful sampling algorithm for contractive auto-encoders (Rifai et al., 2012), which basically involves composing encoding, decoding, and noise addition steps. This algorithm is motivated by the observation that the Jacobian matrix (of derivatives) of the encoding function provides an estimator of a local Gaussian approximation of the density, i.e., the leading singular vectors of that matrix span the tangent plane of the manifold near which the data density concentrates. However, a formal justification for this algorithm remains an open problem.

The last step in this development (Alain and Bengio, 2013) generalized the result from Vincent (2011) by showing that when a DAE (or a contractive auto-encoder with the contraction on the whole encode/decode reconstruction function) is trained with small Gaussian corruption and squared error loss, it estimates the score (derivative of the log-density) of the underlying data-generating distribution, which is proportional to the difference between reconstruction and input. This result does not depend on the parametrization of the auto-encoder, but suffers from the following limitations: it applies to one kind of corruption (Gaussian), only to continuous-valued inputs, only for one kind of loss (squared error), and it becomes valid only in the limit of small noise (even though in practice, best results are obtained with large noise levels, comparable to the range of the input).

What we propose here is a different probabilistic interpretation of DAEs, which is valid for any data type, any corruption process (so long as it has broad enough support), and any reconstruction loss (so long as we can view it as a log-likelihood). The basic idea is that if we corrupt observed random variable X into X̃ using conditional distribution C(X̃|X), we are really training the DAE to estimate the reverse conditional P(X|X̃). Combining this estimator with the known C(X̃|X), we show that we can recover a consistent estimator of P(X) through a Markov chain that alternates between sampling from P(X|X̃) and sampling from C(X̃|X), i.e., encode/decode, sample from the reconstruction distribution model P(X|X̃), apply the stochastic corruption procedure C(X̃|X), and iterate. This theoretical result is validated through experiments on artificial data in a non-parametric setting and experiments on real data in a parametric setting (with neural net DAEs). We find that we can improve the sampling behavior by using the model itself to define the corruption process, yielding a training procedure that has some surface similarity to the contrastive divergence algorithm (Hinton, 1999; Hinton et al., 2006).