#! https://zhuanlan.zhihu.com/p/528519879

1014: 生成式自编码器 (GAE) Generative Autoencoder

CATSMILE-1014

前言

  • 目标: 重新梳理生成式自编码器的相关框架,用一个统一的优化视角看待 MLR,PCA,NMF到tSNE,LLE,NN-VAE,GAN,等等的一系列统计学习框架. 在概率建模和最优化之间的边界进行思考

  • 背景与动机: 写1013的时候由于隐变量框架不清晰,给自己绕晕了. 得出大致感觉后,把主要框架提炼出来,重新整理一下

  • 结论:

  • 备注: 历史悠久,历久弥新. 隐隐感觉这会是CATSMILE的一个里程碑

  • 更新时间:

    • 20220629:加入边际分布变分视角

    • 20220619:加入变分视角

    • 20220614

  • 相关模型:

    • PCA

    • LLGAE

    • Deep image Prior

  • 关键词:

    • 自编码器AutoEncoder

    • 降维Dimensionality Reduction,

    • 无监督学习Unsupervised Learning,

    • 压缩感知Compressed Sensing

  • 展望方向:

    • Transformer Attention Variational Autoencoders

    • Prior works quoted by VAE

自编码器

自编码器是一个可以方便地用来定量研究数据压缩的框架.和压缩感知是很相关 的思考角度,考虑流形上的所有样本 \(x\in \{X\}\), 自编码器需要对任意样本 进行(编码->解码)两步操作,来保持恢复流形结构所需要的信息.这种信息的保持 一般用一个度量函数来衡量. 一般地,流形可以用一个向量空间上的测度来表示 \(p(x)\),

变分视角负损失函数: 从自映射联合分布到边际分布

如果我们放弃一个显式的编码器,断开x->z之间的连接 \(q_e(z|x) = q_e(z) = q_w(z)\)

\[\begin{split}\begin{align} L(e,r) &= -D_{KL}( p_{data}(x) \delta(y|x) || p_{data}(x) \sum_z q_e(z|x) q_r(y|z) ) \\ &= \sum_{x=y} p_{data}(x) \delta(y|x) (\log { p_{data}(x) \sum_zq_e(z|x) q_r(y|z) \over p_{data}(x) \delta(y|x)} ) \\ &= c + \sum_{x=y} p_{data}(x) (\log { \sum_z q_e(z|x) q_r(x|z) } ) \\ &= c + \sum_{x=y} p_{data}(x) (\log { \sum_z q_e(z) q_r(x|z) } ) \\ &= c - D_{KL} (p_{data}(x) || \sum_z q_e(z) q_r(x|z) ) \end{align}\end{split}\]

我们可以发现,GAE可以避免引入显式的编码器,也就不需要用联合分布的KL逼近边际分布的KL. 其实GAE的思想非常简单, 就是给定一个生成模型后,用点估计来近似后验分布,从而估算边际分布. 考虑如下对KL损失的分解,使用均匀先验 \(q_w(z)=c\)

\[\begin{split}\begin{align} L(w,r) &= -D_{KL}(p_{data}(x) || \sum_z q_w(z) q_r(x|z)) \\ &= c + E_{p_{data}(x)}[ \log \sum_z q_w(z) q_r(x|z) ] \\ &= c + E_{p_{data}(x)}[ \log \sum_z \exp \log q_w(z) q_r(x|z) ] \\ &\geq c + E_{p_{data}(x)}[ \max_z \log q_w(z) q_r(x|z) ] \\ &\geq c + E_{p_{data}(x)}[ l_m(z_0,t) ] \\ z_t &= f_m(z_0,t) \\ l_m(z_0,t) &= \log q_w(z_t) q_r(x|z_t) \\ L_{GAE}(m) &= E_{p_{data}(x)}[ l_m(z_0,t) ] \leq E_{p_{data}(x)}[ \max_z q_w(z) q_r(x|z)] \end{align}\end{split}\]

实际上求解 \(\max_z\) 的过程,也可以使用梯度上升来确保连续变量可导性, 也就是把t步上升后的结果写成 \(z_t = f_m(z_0,t)\) ,对应位置的损失函数写成 \(l_m(z_0,t) = \log q_w(z_t) q_r(x|z_t)\). 这样做的好处是可以使用优化技巧来定义后验分布的点估计 \(z_t\). 实际上,我们也可以考虑用多个样本的 logsumexp 来近似原本的 logsumexp 积分, \(max_z\) 其实是 n = 1 的特殊情况.

于是我们就有了一个可导的目标函数 \(L_{GAE}\) ,注意我们通过定义 \(q_w,q_r\) 之后就可以诱导出一个后验分布的点估计近似 \(\delta(z=f_m(z_0,t))\), 并计算了该点的损失函数 \(l_m(z_0,t)\) ,然后用这个函数来为参数 m 提供梯度. 实际上, 带有梯度下降的 \(l_m\) 会包含 m 的一阶梯度,因此实际优化中会对 m 有一些二阶梯度的表达式.

变分视角负损失函数: 自映射

通常的自编码问题可以表示为对数据流形附近的条件概率分布的近似,生成式自编码器使用了 MAP下界对这个损失函数进行近似. 可以看到,忽略隐变量先验的情况下,我们可以用一个解码器 \(q_r(x|z)\) 唯一定义一个GAE

\[\begin{split}\begin{align} L(e,r) &= -D_{KL}( p_{data}(x) \delta(y|x) || p_{data}(x) \sum_z q_e(z|x) q_r(y|z) ) \\ &= \sum_{x=y} p_{data}(x) \delta(y|x) (\log { p_{data}(x) \sum_zq_e(z|x) q_r(y|z) \over p_{data}(x) \delta(y|x)} ) \\ &= \sum_{x=y} p_{data}(x) \delta(y|x) (\log { \sum_z q_e(z|x) q_r(y|z) \over \delta(y|x)} ) \\ &= E_{p_{data}(x)}[-D_{KL}(\delta(y|x) || \sum_z q_e(z|x) q_r(y=x|z))] \\ &= \sum_x p_{data}(x) \log { \sum_z q_e(z|x) q_r(y=x|z) } \\ &= E_{p_{data}(x)}[\log { \sum_z q_e(z|x) q_r(y=x|z) }] \\ &\geq E_{p_{data}(x)}[ \max_z \log q_e(z|x) q_r(y=x|z) ] \\ &= E_{p_{data}(x)}[ \max_z \log q(y=x,z|x) ] \\ &= E_{p_{data}(x)}[ \max_z \log q(z|x) ] \\ &= E_{p_{data}(x)}[ \max_z \log { q(x|z) q(z) \over q(x) } ] \\ L(r)&= E_{p_{data}(x)}[ \max_z \log { q_r(x|z) q_g(z) } ] + c \\ \end{align}\end{split}\]

这里一般对 \(z\) 取均匀先验 \(q_g(z)=c\) , 其中 \(\max_z\) 操作对应对每个数据进行编码. 这里的推断过程可以隐式地认为使用了狄拉克分布 \(\delta_w(z)\) 去近似了解码器 在贝叶斯意义下所对应的自然后验分布 \(q_r(x|z)q(z)/q(x)\)

损失函数原始推导

下文中用用 \(m\) 表示所有模型参数

\[\begin{split} \begin{align} z &= e(x) \\ \hat x &= g(z) \\ \hat x &=g(e(x)) \\ L(m) &= E_{p(x)}[ -d(\hat x(m),x) ] \end{align} \end{split}\]

注意我们可以对自编码负损失做均值要求,或者是下限要求. 鉴于均值要求对应着 期望函数,求导时可以更高效地利用所有样本,本文优先探讨这种设定. 从长远来看,对自编码负损失的下限做出限制,具有更强的工程学意义,但同时 也更加困难.注意到,下限可以平滑处理,于是就可以恢复出一个期望形式

\[\begin{split} \begin{align} L(m) &= E_{p(x)}[ -d(\hat x(m),x) ] \\ \text{or} \ L(m) &= \min_X [ -d(\hat x(m),x) ]\\ \text{or} \ L(m) &= -\log \sum_x \exp \left( \beta \cdot d\left(\hat x(m),x\right) \right) \\ \text{or} \ L(m) &= -\log E_{p(x)} [\exp \left( \beta \cdot d\left(\hat x(m),x\right) \right)] \end{align} \end{split}\]

下文中,我们先讨论最简单的情况. \(L(m) =E_{p(x)}[ -d(\hat x(m),x) ]\) 特别地,我们要求生成器在概率意义下重建 \(\hat x\),尽管这会引入不确定性, 但是允许简单地联系到混合模型, 且后续可以通过用EOL替换LEOP来对LOE下界做出约束. 借助变分思想,考虑引入一个混合函数,来构建混合模型

\[\begin{split} \left \{ \begin{aligned}\sum f_k(z) &= 1 \\ -d(\hat x(m),x) &= \log p(x|m,z) \\ p(x) &= 1/|X|={1\over N_x} \end{aligned} \right. \end{split}\]
\[\begin{split} \begin{align} L(m) &= E[ -d(\hat x(m),x) ]\\ &= E [ \log p(x|m,z) ] \\ &= E[\log \sum_k f_k(z) p(x|k,m)] \\ &= E[l(x,m,z)] \end{align} \end{split}\]

我们发现,阶段化地最大化这个负损失,对应着建模的不同阶段

\[\begin{split} \begin{align} L(m) &\leq E[\max_z l(x,m,z)] \\ &\leq \max_m E[\max_z l(x,m,z)]\\ &=\max L \end{align} \end{split}\]

第一个不等号,在最优编码时取到,第二个不等号,在最优模型时取到. 因此最优编码就是

\[\begin{split} \begin{align} \hat z &= e(x)\\ &=\text{argmax}_z l(x,m,z)\\ &= \text{argmax}_z [\log \sum_k f_k(z) p(x|k,m)] \\ &= \text{argmax}_z [\log p(x|m,z)] \end{align} \end{split}\]

所以我们可以认为,生成式自编码模型,其实意味着用一个可变参数 模型的对数似然作为损失度量,从而引导出一个由极大似然定义的自编码器. 这样做的好处是,编码器和解码器,都可以通过同一个分布 \(p(x|m,z)\) 来加以定义

特例: 基于离散编码空间和混合模型的GAE

如果我们使用混合模型分解解码器,并且不对 \(f_k(z)\) 加以限制,那么容易 得到最优编码 \(f_k(\hat z)=\delta_k( \text{argmax}_k p(x|k,m))\). KMEANS就可以认为是一种这样的自编码器.

特别的,如果组分数量等于或大于数据点的数量,那么总能够确保 \(L(X)=0\) 很难说这究竟算不算过拟合,趋近于KNN算法,模型逐渐收敛数据本身, 不再有额外的参数.但是这类模型的问题是要在大量数据里高效寻找KNN.

特例: 基于高斯核连续编码空间混合模型的GAE (GGAE)

CATSMILE-1013中,我们构建了一个正则化 的编码空间,用来对可选组分进行混合得出生成分布.

特例: 基于线性连续编码空间混合模型的GAE (LGAE)

如果让解码分布成为一个活动的固定噪声的高斯分布,并加入一个升维的参数 \(W_m\), 我们得到一个类似线性回归的损失函数形式.

\[ \log p(x|m,z)\propto - ||x - \mu -W_m z || ^2 \]

编码器

\[ z = \text{argmax}_z - ||x - \mu - W_m z || ^2 \]

在这里, \(W_m\) 定义了一个低维的线性子空间,用来对原始流形进行近似. 注意到,如果我们显式地对z进行梯度下降,你会得到这么一个式子, 可以看成一个带有自反馈的残差神经网络.

\[ {\partial \over \partial z} L = 2 W_m^T (x-\mu - W_m z) \]

如果类似PCA去进一步要求单位正交性 \(W_m^T W_m = I\) 梯度为

\[\begin{split} {\partial \over \partial z} L = 2( W_m^T (x-\mu ) - z) \\ z + {1\over 2}{\partial \over \partial z} L = W_m^T (x-\mu ) \end{split}\]

额这里怎么把z给消除了是啥意思?…也就是无论从哪里出发,都能一步到位到这个 特殊位置,这个位置也是梯度为零向量的位置

\[\begin{split} {\partial \over \partial z} L =\vec{0} = 2( W_m^T (x-\mu ) - z) \\ z = W_m^T (x-\mu ) \end{split}\]

注意这个式子恰好就是PCA的编码过程,也就是说原始的 损失函数在正交单位矩阵的约束下是可以用PCA/SVD直接解出的. 实际编程中,难点在于确保单位正交性 \(W_m^T W_m = I\)

特例: 局部线性的生成式自编码模型 LLGAE

注意到全局线性模型的一个问题, 就是如果实际的流形仅仅是局部线性,那全局线性 假设就是有问题的. 因此 \(W\) 需要也成为隐变量 \(z\) 的函数. 一个简单的办法 是建立一个插值映射 \(W(x) = f_k(x) W_k\).这个映射定义在\(X\)上比在\(z\)上要更加方便,因为\(z\)是线性空间内部的坐标,原则上不可能反过来定义线性空间本身.考虑高斯核形式的 \(f_k(x)\) ,并加入偏移量 \(\mu_k\) , 得到

\[\begin{split} \begin{align} f_k(x) &\propto \exp -(x-\mu_k)^2 \\ \log p(x|m,z) &\propto -||x - \mu_k(x) - W(x)_k z || ^2 \\ &= -||x - \sum_k \text{softargmax}(-(x-\mu_k)^2) (\mu_k + W_k z)||^2 \end{align} \end{split}\]

注意这里使用了平滑处理,实际上把一个离散隐变量给消去了.如果联立 两个隐藏变量 \(k,z\) 就会出现一个层级结构

\[\begin{split} \begin{align} L(x,m,k,z) &= \log p(x|m,k, z) \\ &\propto - ||x - \mu_k(x) - W(x)_k z || ^2 \\ &\leq \max_{\mu_k,W_k}\max_k \max_z - || x - \mu_k - W_k z|| \end{align} \end{split}\]

对于z上的优化,可以使用梯度下降.对于k上的优化,直接取最优值即可. 为了公平和其他降维算法比较,也可以考虑用高斯核嵌入空间参数化 \(f_k(w)\).

注意到LLGAE需要对每个可选组分同时进行优化,复杂度是 \(O(NKE)\),对每个样本, 需要做K*E次z上的梯度下降.这或许可以用一些采样或者嵌入算法进行加速,来减小\(K\)

特例: 深度图像先验 Deep Image Prior

https://arxiv.org/pdf/1711.10925.pdf

注意到深度图像先验在操作中,和GAE具有类似的损失形式,因此我们可以对应地写出其条件生成分布,注意, 这里的推断的隐变量是网络权重 \(\theta\) , 而z一般是固定的先验高斯噪音或者某种向量,不需要被推断, \(f_\theta\) 中的 \(f\) 代表网络除去权重的所有定义,如连接性,激活函数. 一般来讲,DIP有大量的隐变量 需要通过梯度推断,但是由于隐藏变量之间的互相作用是局域性的,所以仍然会有一些好的性质.

\[\begin{split}\begin{align} L(f) &= E_{p_{data}(x)} [ \max_\theta -||f_{\theta}(z) - x || ^2 ]\\ \log q_r(x | \theta ) &= -||f_{\theta}(z) - x || ^2 \end{align}\end{split}\]

特例: 最大激活输入 Maximal Activated Input

一种对于神经网络隐藏单元的测试就是观查如何输入图片来最大化某个神经元 \(i\) 的激活量,这相当于 GAE的逆形式. 让我们假设激活量对应于某个条件概率的对数,那么我们就是在推断符合这个概念 \(z_i\) 的输入. 不过这更多地像一个采样过程,而非推断过程. 也就是说,如果优化发生在数据向量,就是一种采样,如果优化 发生在隐变量层面,就是一种推断

\[\begin{split}\begin{align} \max_x f_i(x) &= \max_x \log q(z_i=1|x) \\ \end{align}\end{split}\]

扩展材料:神经网络自编码器

NN-VAE celebA https://github.com/AntixK/PyTorch-VAE

Comparative thesis on AE https://github.com/mainak-ghosh/AutoEncoder

Empirical comparison between autoencoders and traditional dimensionality reduction methods https://arxiv.org/pdf/2103.04874.pdf

SE post https://stats.stackexchange.com/questions/531706/autoencoders-as-dimensionality-reduction-tools

VAE https://arxiv.org/abs/1312.6114

Zhihu post on VI https://www.zhihu.com/question/31032863/answer/315311293

AE发展脉络

Quote from 2013 paper on DAE. 很奇怪的是,似乎没有人提到Kmeans是一种 自编码算法

Auto-encoders learn an encoder function from input to representation and a decoder function back from representation to input space, such that the reconstruction (composition of encoder and de- coder) is good for training examples. Regularized auto-encoders also involve some form of regu- larization that prevents the auto-encoder from simply learning the identity function, so that recon- struction error will be low at training examples (and hopefully at test examples) but high in general. Different variants of auto-encoders and sparse coding have been, along with RBMs, among the most successful building blocks in recent research in deep learning (Bengio et al., 2013b). Whereas the usefulness of auto-encoder variants as feature learners for supervised learning can directly be assessed by performing supervised learning experiments with unsupervised pre-training, what has remained until recently rather unclear is the interpretation of these algorithms in the context of pure unsupervised learning, as devices to capture the salient structure of the input data distribution. Whereas the answer is clear for RBMs, it is less obvious for regularized auto-encoders. Do they completely characterize the input distribution or only some aspect of it? For example, clustering algorithms such as k-means only capture the modes of the distribution, while manifold learning algorithms characterize the low-dimensional regions where the density concentrates.

Some of the first ideas about the probabilistic interpretation of auto-encoders were proposed by Ran- zato et al. (2008): they were viewed as approximating an energy function through the reconstruction error, i.e., being trained to have low reconstruction error at the training examples and high recon- struction error elsewhere (through the regularizer, e.g., sparsity or otherwise, which prevents the auto-encoder from learning the identity function). An important breakthrough then came, yielding a first formal probabilistic interpretation of regularized auto-encoders as models of the input dis- tribution, with the work of Vincent (2011). This work showed that some denoising auto-encoders (DAEs) correspond to a Gaussian RBM and that minimizing the denoising reconstruction error (as a squared error) estimates the energy function through a regularized form of score matching, with the regularization disappearing as the amount of corruption noise goes to 0, and then converging to the same solution as score matching (Hyv ̈arinen, 2005). This connection and its generalization to other 1 arXiv:1305.6663v4 [cs.LG] 11 Nov 2013 energy functions, giving rise to the general denoising score matching training criterion, is discussed in several other papers (Kingma and LeCun, 2010; Swersky et al., 2011; Alain and Bengio, 2013).

Another breakthrough has been the development of an empirically successful sampling algorithm for contractive auto-encoders (Rifai et al., 2012), which basically involves composing encoding, de- coding, and noise addition steps. This algorithm is motivated by the observation that the Jacobian matrix (of derivatives) of the encoding function provides an estimator of a local Gaussian approxi- mation of the density, i.e., the leading singular vectors of that matrix span the tangent plane of the manifold near which the data density concentrates. However, a formal justification for this algorithm remains an open problem.

The last step in this development (Alain and Bengio, 2013) generalized the result from Vincent (2011) by showing that when a DAE (or a contractive auto-encoder with the contraction on the whole encode/decode reconstruction function) is trained with small Gaussian corruption and squared error loss, it estimates the score (derivative of the log-density) of the underlying data-generating distri- bution, which is proportional to the difference between reconstruction and input. This result does not depend on the parametrization of the auto-encoder, but suffers from the following limitations: it applies to one kind of corruption (Gaussian), only to continuous-valued inputs, only for one kind of loss (squared error), and it becomes valid only in the limit of small noise (even though in practice, best results are obtained with large noise levels, comparable to the range of the input).

What we propose here is a different probabilistic interpretation of DAEs, which is valid for any data type, any corruption process (so long as it has broad enough support), and any reconstruction loss (so long as we can view it as a log-likelihood). The basic idea is that if we corrupt observed random variable X into ̃X using conditional distribution C( ̃X|X), we are really training the DAE to estimate the reverse conditional P (X| ̃X). Combining this estimator with the known C( ̃X|X), we show that we can recover a consistent estimator of P (X) through a Markov chain that alternates between sampling from P (X| ̃X) and sampling from C( ̃X|X), i.e., encode/decode, sample from the reconstruction distribution model P (X| ̃X), apply the stochastic corruption procedure C( ̃X|X), and iterate. This theoretical result is validated through experiments on artificial data in a non-parametric setting and experiments on real data in a parametric setting (with neural net DAEs). We find that we can improve the sampling behavior by using the model itself to define the corruption process, yielding a training procedure that has some surface similarity to the contrastive divergence algorithm (Hinton, 1999; Hinton et al., 2006)