# Q3 – Reparameterization Trick of Variational Autoencoder

Contributed by Chin-Wei Huang.

Consider a generative model that factorizes as $p(x,z) = p(x|z)p(z)$, where $p(x|z)$ is parameterized by a neural net, i.e. $p(x|z) = p(x;h_\theta(z))$, with $\theta$ being the set of parameters of the generative network (i.e. the decoder). Here $p(x|z)$ is a simple distribution whose parameters are given by $h_\theta(\cdot)$, such as a Gaussian or Bernoulli, fully factorized across dimensions (i.e. $p(x|z) = \prod_j p(x_j|z)$). In the Gaussian case, $h_\theta(z)$ outputs a mean and a variance per dimension. We have $z\in\mathbf{R}^K$, which implies a continuous latent space model, and $p(z)=\mathcal{N}(0,I_K)$. The framework of auto-encoding variational Bayes maximizes the variational lower bound on the log-likelihood, $\mathcal{L}(\theta,\phi)\leq \log p(x)$, which is expressed as

$\mathcal{L}(\theta,\phi) = \mathbf{E}_{q_\phi}[\log p(x|z)] - \mathbf{KL}(q_\phi(z|x)||p_\theta(z))$,

where $\phi$ is the set of parameters used for the inference network (i.e. encoder). The reparameterization trick used in the original work rewrites the random variable in the variational distribution as

$z = \mu(x) + \sigma(x)\odot\epsilon$              (1)

where $\epsilon\sim\mathcal{N}(\epsilon;0,I)$, so that gradients can be backpropagated through the stochastic bottleneck.
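
To make the trick concrete, here is a minimal PyTorch sketch (the linear encoder and decoder, the data dimensionality, and the Bernoulli likelihood are illustrative assumptions, not part of the question). It samples $z$ via (1) and backpropagates a one-sample estimate of $-\mathcal{L}(\theta,\phi)$ through the stochastic bottleneck:

```python
import torch

torch.manual_seed(0)

# Hypothetical dimensions: 784-dim data (e.g. flattened MNIST), K = 20 latents.
D, K = 784, 20

# Encoder: outputs mu(x) and log sigma^2(x) of the diagonal Gaussian q_phi(z|x).
encoder = torch.nn.Linear(D, 2 * K)
# Decoder: maps z to Bernoulli logits for p(x|z) = prod_j p(x_j|z).
decoder = torch.nn.Linear(K, D)

x = torch.rand(32, D).round()           # a fake binary batch, for illustration

mu, logvar = encoder(x).chunk(2, dim=-1)
eps = torch.randn_like(mu)              # eps ~ N(0, I)
z = mu + torch.exp(0.5 * logvar) * eps  # Eq. (1): z = mu(x) + sigma(x) ⊙ eps

# Reconstruction term E_q[log p(x|z)], one-sample Monte Carlo estimate.
logits = decoder(z)
recon = -torch.nn.functional.binary_cross_entropy_with_logits(
    logits, x, reduction="none").sum(-1)

# Closed-form KL(q_phi(z|x) || N(0, I)) for a diagonal Gaussian.
kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1)

elbo = (recon - kl).mean()
(-elbo).backward()  # gradients flow to both encoder and decoder through z
```

Without the reparameterization, $z$ would be drawn directly from $q_\phi(z|x)$ and the sampling node would block the gradient to $\mu$ and $\sigma$; writing the sample as a deterministic function of $(\mu, \sigma, \epsilon)$ is what makes the backward pass above possible.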

1. Prove that samples drawn from the linear transformation of Gaussian noise in (1) have the same mean and variance as $\mathcal{N}(z;\mu(x),\mathrm{diag}(\sigma^2(x)))$. What if we instead write $z=\mu(x)+S(x)\epsilon$, where $S(x)\in\mathbf{R}^{K\times K}$ could be a reshaped $K^2$-dimensional output of a neural net? Comment on the new distribution this reparameterization induces. (A numerical sanity check for both cases is sketched after this list.)
2. If the full-covariance variational distribution, i.e. with $z=\mu(x)+S(x)\epsilon$, is used, derive the second term of the lower bound, $\mathbf{KL}(q_\phi(z|x)\,||\,p_\theta(z))$. (A Monte Carlo estimator for checking your derivation is also sketched below.)
3. Suppose the traditional mean-field variational method is used instead, i.e. the variational distribution is factorized as a product of univariate Gaussians, $q^{mf}(z_i) = \prod_j \mathcal{N}(z_{i,j}|m_{i,j},\sigma^2_{i,j})$, and the lower bound is maximized with respect to the variational parameters and the model parameters iteratively. Can the inference network $q_\phi$ used in the variational autoencoder (1) outperform the mean-field method? What is the advantage of using an encoder as in the VAE?
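
For question 1, the claimed moments can be sanity-checked by Monte Carlo. A minimal NumPy sketch with hypothetical values of $\mu$, $\sigma$, and $S$: the empirical mean should match $\mu$ in both cases, while the empirical covariance should match $\mathrm{diag}(\sigma^2)$ in the elementwise case and $SS^\top$ in the full-matrix case.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 3, 500_000                 # small latent dim, many samples (illustrative)

mu = rng.normal(size=K)           # hypothetical mu(x)
sigma = rng.uniform(0.5, 2.0, K)  # hypothetical sigma(x), elementwise std
S = rng.normal(size=(K, K))       # hypothetical reshaped K^2 net output
eps = rng.normal(size=(N, K))     # eps ~ N(0, I)

# Case 1: z = mu + sigma ⊙ eps  ->  mean mu, covariance diag(sigma^2)
z = mu + sigma * eps
print(np.abs(z.mean(0) - mu).max())                   # should be ≈ 0
print(np.abs(np.cov(z.T) - np.diag(sigma**2)).max())  # should be ≈ 0

# Case 2: z = mu + S eps  ->  mean mu, full covariance S S^T
z = mu + eps @ S.T
print(np.abs(z.mean(0) - mu).max())                   # should be ≈ 0
print(np.abs(np.cov(z.T) - S @ S.T).max())            # should be ≈ 0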
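
For question 2, whatever closed form you derive can be checked against a naive Monte Carlo estimate, using $\mathbf{KL}(q\,||\,p) = \mathbf{E}_{z\sim q}[\log q(z) - \log p(z)]$. A sketch assuming SciPy is available (dimensions and parameter values are again hypothetical):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
K = 3

mu = rng.normal(size=K)
S = rng.normal(size=(K, K))
Sigma = S @ S.T                    # covariance induced by z = mu + S(x) eps

q = multivariate_normal(mean=mu, cov=Sigma)               # q_phi(z|x)
p = multivariate_normal(mean=np.zeros(K), cov=np.eye(K))  # prior N(0, I_K)

# KL(q || p) = E_{z~q}[log q(z) - log p(z)], estimated by sampling from q.
z = q.rvs(size=500_000, random_state=rng)
print(np.mean(q.logpdf(z) - p.logpdf(z)))  # compare with your derived formula
```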