MCB128: AI in Molecular Biology (Spring 2026)

(Under construction)


block 5: Autoencoders, Variational Autoencoders/Gene expression, scRNA-seq

Introduction

Autoencoders are a type of neural network mostly used to extract a compressed representation of the input data. Autoencoders are trained unsupervised, that is, during training the only goal is to reproduce at the output the same values that were put in. What can we learn from this simple exercise?

The input data to an autoencoder usually has high dimensionality, and the network introduces a number of layers that progressively reduce the dimensionality (the encoder) until it reaches a lowest dimension, called the bottleneck. From the bottleneck, a similar sequence of layers is applied in reverse (the decoder) in order to undo the process. The decoder layers keep increasing the dimensionality until the input dimension is restored.


Figure 1. Schematic description of an autoencoder

In an autoencoder, the bottleneck layer can then be interpreted as a low-dimensional representation (for feature extraction) of the high-dimensional data.

An autoencoder is unsupervised. The learning task is simply to achieve \(x^\prime \approx x\).

Learning as communication


Figure 2. From Shannon's 1948 paper "A mathematical theory of communication".

Autoencoders (and learning in general) have an important connection with the field of communications. As stated in Shannon’s famous 1948 paper, any communication system depends on a source, a transmitter and a receiver (Figure 2). Shannon’s description of a communication channel in Figure 2 looks a lot like an autoencoder as described in Figure 1.

In communications, we would like to determine the minimal size of the channel (for us, the latent variable z) that allows a faithful transmission of the information.


Figure 3. From MacKay's book, chapter 40 "Learning as communication".

Similarly, as described in MacKay’s book, chapter 40 (paraphrasing), autoencoders involve the production of a set of latent values (z) by a transmitter (or encoder) in response to an input data point (x). The decoder (or receiver) uses the latent values (z) to reconstruct the input data. Thus, the latent variable plays the role of a communication channel transporting the information contained in the input data, so that it can be reconstructed faithfully and economically.

Variational Autoencoders (VAE)

In this block we are also going to discuss variational autoencoders (VAE). A VAE is a particular type of autoencoder in which we do not just try to identify a latent value z for each x, but a probability distribution \(p_\theta(z\mid x)\) that depends on some parameters, here represented by \(\theta\).

A VAE decoder does not just produce one output \(x^\prime\) per input but a probability distribution \(p_\theta(x)\), the marginal probability of the data given the generative model, that we hope matches the data probability distribution \(p_{data}(x)\). The objective is to make \(p_\theta(x) = \int_z p_\theta(x\mid z)\, p_\theta(z)\, dz\) approximate \(p_{data}(x)\).


Figure 4. Schematic description of a variational autoencoder (VAE).

VAE bring us into the realm of generative models. The rest of the models we will discuss in our last block 6 will all be generative. There is a major division in the world of modeling and machine learning between discriminative and generative models. Discriminative models are defined by a conditional probability \(p(z\mid x)\) that expresses the latent variable only as a function of a given input \(x\). A generative model is defined by a joint distribution \(p_\theta(x,z)\) of all the variables (observed data x, and latent variables z). See Figure 4. A generative model is usually parameterized by two probability distributions, \(p_\theta(x\mid z)\) and \(p_\theta(z)\). A discriminative (inference) model can only be used to predict labels (z) for any given data point (x). A generative model can be turned into a discriminative model, and in addition can perform unique tasks such as generating new data points by sampling from the joint distribution \(p_\theta(x,z)\).

Our practical examples

As our practical case, we are going to use autoencoders to process gene expression data from many different cells. The goal is that the autoencoder will provide a lower-dimensional representation of the gene expression patterns that could allow us to identify, for instance, cell states (cancer or otherwise) that depend on the expression (or lack thereof) of distinct groups of genes.

As our practical case for a variational autoencoder (VAE) we will discuss the method single-cell variational inference (scVI), which is used to analyze gene expression in single cells for datasets with hundreds of thousands of cells.

Documentation

Some documentation: this is a good detailed reference on autoencoders. This is a foundational paper on autoencoders, “Reducing the Dimensionality of Data with Neural Networks”, by Hinton and Salakhutdinov, 2006. This paper describes AEs as a non-linear generalization of PCA; why? This is an excellent primer on VAEs.

Autoencoder definition


Figure 5. An autoencoder architecture. This is an example of an undercomplete symmetric encoder/decoder architecture. All the hidden layers include a non-linear activation, and optionally dropout as well.

The encoder

The encoder is a sequence of fully-connected neural network layers. The encoder layers progressively reduce the dimension of the features.

For instance, in the AE code described in Figure 5, the dimensions go from D = 2000 to d1 = 512, d2 = 256, and d = 32. The last layer of the encoder outputs the latent variable (z), also referred to as “the code” or the hidden variable.
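
A minimal sketch of such an encoder in PyTorch, following the dimensions above (the dropout rate and variable names are illustrative, not the course code):

import torch.nn as nn

# Encoder sketch matching Figure 5: D=2000 -> 512 -> 256 -> 32.
# Each hidden layer uses a ReLU non-linearity and (optionally) dropout.
encoder = nn.Sequential(
    nn.Linear(2000, 512),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(256, 32),   # the output is the latent variable z (the "code")
)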

The bottleneck

The bottleneck is the core of an AE. The bottleneck is the output of the last encoder layer \(z[d]\). This latent variable is expected to learn a compact representation of the data, and hopefully provide insights about the data that cannot be seen directly from the data due to its large dimensionality.

The decoder

The decoder architecture is usually the reverse of the encoder, although that is not necessary. The decoder just needs to produce a final output with the dimensions of the input data.

The activation of the last layer of the decoder is important because the output has to match the input.

In the code described in Figure 6, the input data is continuous and unbounded (generated as samples of a normal distribution with zero mean and variance one), thus the decoder does not use any non-linear activation for its last layer.
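
A matching decoder sketch that reverses the encoder dimensions and, because the targets are continuous and unbounded, applies no activation to its last layer (names and dropout rate are illustrative):

import torch.nn as nn

# Decoder sketch reversing the encoder: 32 -> 256 -> 512 -> 2000.
# No activation on the last layer because the targets are continuous and unbounded.
decoder = nn.Sequential(
    nn.Linear(32, 256),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(256, 512),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(512, 2000),  # x' has the same dimension as the input x
)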

The AE loss

The training objective of an AE is to recover the input values. The type of data determines the loss used. For instance, for continuous input data and no activation, the loss is often the Mean Squared Error (MSE),

\[MSE(x,x^\prime) = \frac{1}{D} \sum_{i=1}^D (x_i - x_i^\prime)^2.\]

If the data and reconstructions are categorical \(0\leq x_i,x^\prime_i \leq 1\) and \(\sum_{i=1}^D x_i = 1\), \(\sum_{i=1}^D x^\prime_i = 1\), one can use the cross-entropy loss

\[CrossEntropy(x,x^\prime) = - \frac{1}{D} \sum_{i=1}^D x_i \log{x^\prime_i}.\]
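
Both losses are straightforward to compute in PyTorch; here is a hedged sketch with illustrative tensor shapes:

import torch
import torch.nn.functional as F

x  = torch.randn(64, 2000)   # a batch of 64 inputs (illustrative shapes)
xp = torch.randn(64, 2000)   # the corresponding reconstructions x'

# Continuous data, no final activation: MSE averaged over dimensions and batch.
mse = F.mse_loss(xp, x)

# Categorical data: x and x' are probability vectors over D components.
x_cat  = torch.softmax(torch.randn(64, 10), dim=1)
xp_cat = torch.softmax(torch.randn(64, 10), dim=1)
# Matches the formula above, -(1/D) sum_i x_i log x'_i, averaged over the batch.
cross_entropy = -(x_cat * torch.log(xp_cat)).mean()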

The AE learning

In an autoencoder, the encoder and the decoder are trained together using the standard backpropagation method that we have introduced and described before. Both the encoder and decoder weights are adjusted together until reconstruction is achieved and \(x^\prime \approx x\).
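
A minimal training-loop sketch in PyTorch, reusing the encoder and decoder sketches above with a toy dataset (all names and hyperparameters are illustrative, not the course code):

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

autoencoder = nn.Sequential(encoder, decoder)   # the encoder/decoder sketches above
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)

# Toy data: samples of a standard normal distribution (illustrative).
data = torch.randn(512, 2000)
loader = DataLoader(TensorDataset(data), batch_size=64, shuffle=True)

for epoch in range(50):
    for (x,) in loader:
        x_prime = autoencoder(x)        # encode, then decode
        loss = F.mse_loss(x_prime, x)   # reconstruction objective x' ~ x
        optimizer.zero_grad()
        loss.backward()                 # backpropagation through both modules
        optimizer.step()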

As for the usefulness of the latent variable or code z, if the bottleneck dimension is insufficient for the complexity of the data, a faithful reconstruction will not be possible.

Undercomplete and overcomplete autoencoders

The autoencoders we have discussed so far are undercomplete, which means that the dimension of the code (or latent variable) d is smaller than that of the input data D (\(d<D\)).

There are also overcomplete autoencoders, for which the dimension of the latent variable is larger than the dimension of the input data, \(d > D\). Overcomplete autoencoders seem counterintuitive: what can the AE learn by passing the input to the output without any need for compression? It could just pass it “as is”. This is often referred to as the identity problem.

It turns out that overcomplete autoencoders, while not useful for extracting common features in the data, are useful for denoising. They can learn to reconstruct an original from a noisy version of it.
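
A sketch of one denoising training step, where the input is corrupted with noise but the reconstruction loss is computed against the clean original (the noise level and names are illustrative):

import torch
import torch.nn.functional as F

def denoising_step(autoencoder, x, optimizer, noise_std=0.1):
    # One training step of a denoising autoencoder: reconstruct the clean x
    # from a corrupted copy of it.
    x_noisy = x + noise_std * torch.randn_like(x)   # add Gaussian noise to the input
    x_prime = autoencoder(x_noisy)                  # reconstruct from the noisy version
    loss = F.mse_loss(x_prime, x)                   # compare against the clean original
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()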

Autoencoder application: Gene expression

As an example of an AE, we are going to discuss the GENEEX autoencoder introduced by the National Cancer Institute to model cancer biology data.

GENEEX is an autoencoder that takes as input the expression profiles of all the genes of individual cells and collapses them into latent variables of much lower dimensionality. GENEEX uses genomic expression data (RNA-seq) from cancer samples and was trained on 3,000 RNA-Seq gene expression profiles containing 60,484 genes from the Genomic Data Commons.

Inputs

The inputs for a gene expression autoencoder are counts of the number of RNA-seq reads associated with a given gene (g) in a given cell (c). All RNA-seq methods have their own way of estimating these counts per gene (or per gene transcript), counts[c,g].

Raw counts cannot be used to compare between different samples due to differences in sequencing depth: the total number of reads varies from cell to cell.

To avoid this confounding factor, counts are often given normalized, that is, scaled so that each cell has the same number of total counts,

\[norm\_counts[c,g] = \frac{counts[c,g]}{\sum_{g^\prime}counts[c,g^\prime]}\]

and then log-transformed,

\[log1p(norm\_counts[c,g]) = \log(1+ norm\_counts[c,g]) \geq 0.\]
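
A minimal sketch of this preprocessing for a counts matrix counts[c,g] of shape (cells, genes), using PyTorch (the toy counts are illustrative):

import torch

counts = torch.randint(0, 100, (6000, 1000)).float()   # toy counts[c, g]: cells x genes

# Scale each cell so its counts sum to 1 (removes the sequencing-depth differences).
norm_counts = counts / counts.sum(dim=1, keepdim=True)

# Log-transform: log(1 + norm_counts) >= 0.
log_counts = torch.log1p(norm_counts)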

Outputs

The outputs of this autoencoder are just real values resulting from the last layer of the decoder, which has no ReLU.

GENEEX autoencoder architecture

The GENEEX autoencoder architecture includes two hidden layers of dimensions \(h_1=2,048\), \(h_2=1,024\) and latent_dim\(=128\). The encoder and decoder have a symmetric configuration. All layers except the last one of the decoder use ReLU and dropout.
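
A hedged sketch of this configuration (the input dimension of 60,484 genes is taken from the training data described above; the dropout rate and names are illustrative):

import torch.nn as nn

D, h1, h2, latent_dim = 60484, 2048, 1024, 128   # D taken from the GDC data above

def block(n_in, n_out, dropout=0.1):
    # Fully-connected layer with ReLU and dropout, used by every layer
    # except the last one of the decoder.
    return nn.Sequential(nn.Linear(n_in, n_out), nn.ReLU(), nn.Dropout(dropout))

encoder = nn.Sequential(block(D, h1), block(h1, h2), block(h2, latent_dim))
decoder = nn.Sequential(block(latent_dim, h2), block(h2, h1), nn.Linear(h1, D))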


Figure 6. GENEEX AE results for synthetic data including three different gene profiles. Data include 6,000 cells each with expression data for 1,000 genes. (a) the dimension of the latent variable is 128. (b) latent variable dimension is 2.

Results

This code includes a toy example using log-transformed gene expression values for 6,000 cells, each including expression data for 1,000 genes. The data has been set up such that for each third of the cells (2,000 cells), a third of the genes have high expression (up-regulated) and all the other genes have only some background expression.

The autoencoder is easily trained to reconstruct the expression inputs (with no labels indicating which of the three classes each cell belongs to) using the MSE loss.

For an AE with latent dimension 128, we can extract the latent variables \(z[c,128]\) and, using PCA, plot the first two PCA dimensions (Figure 6a), which clearly separate the three kinds of cells. Even more simply, using a latent dimension of 2, we can directly plot the two hidden components (Figure 6b), which also allows us to infer the three different types of expression profiles present in the data.
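
A sketch of how the latent variables could be extracted and projected with PCA (the encoder, data, and plotting details are illustrative placeholders, not the course code):

import torch
import torch.nn as nn
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Illustrative: an encoder mapping 1,000 genes to a 128-dimensional code,
# assumed to be already trained, plus a stand-in for the log-transformed matrix.
encoder = nn.Sequential(nn.Linear(1000, 256), nn.ReLU(), nn.Linear(256, 128))
log_counts = torch.rand(6000, 1000)    # stand-in for log1p(norm_counts[c, g])

encoder.eval()
with torch.no_grad():
    z = encoder(log_counts)            # z[c, 128]: one latent vector per cell

# Project the 128-dimensional codes onto their first two principal components.
z_pca = PCA(n_components=2).fit_transform(z.numpy())

plt.scatter(z_pca[:, 0], z_pca[:, 1], s=2)
plt.xlabel("PC1"); plt.ylabel("PC2")
plt.show()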

Variational Autoencoders (VAE)

Here is the VAE code describing the methods for this section.

Variational autoencoders (VAE) are a particular flavor of autoencoder that uses probabilistic modeling. Figure 7 provides a first summary comparing the two. The goal of an AE is to reconstruct all data points in the input set by means of a latent variable associated with each data point. The VAE objective is to reconstruct the data probability distribution by means of a parametric generative model. The VAE generative model introduces a probability distribution that describes both the input and the latent variables.

Thus, the optimization objective of an AE is, for each data point x, that the output of the AE \(x^\prime\) satisfies

AE: \(x\approx x^\prime (z)\)

by means of the bottleneck latent variable z.

In contrast, the optimization objective for a VAE is

VAE: \(P_{data}(x) \approx P_\theta(x)\)

where the marginal probability \(P_\theta(x)\) is obtained by means of the latent variable and the generative model \(P_\theta(x,z)\) as

\[P_\theta(x) = \int_z P_\theta(x,z)\, dz = \int_z P_\theta(x\mid z)\, P_\theta(z)\, dz.\]

Unfortunately, it is normally intractable to compute and to optimize this marginal distribution. That is where the “variational” part of a VAE comes into play, as we will see shortly.


Figure 7. Summary comparison between an AE and a VAE. The "variational" component of the VAE is explained in Figure 8.

The VAE decoder is a generative model

Thus, VAE are generative models (Figure 7). That means that they introduce probability distributions to describe all variables involved, both observed (x) and latent (z). These probability distributions are chosen for each problem and depend on specific parameters that we designate as \(\theta\).

The probability distribution that describes all variables is referred to as the joint probability distribution \(P_\theta(x,z)\).

It is common for this joint probability distribution to be factorized using the product rule as

\[P_\theta(x,z) = P_\theta(x\mid z)\, P_\theta(z),\]

where \(P_\theta(x\mid z)\) is the decoder (likelihood) distribution, which gives us a probability distribution over x given a value of z, and \(P_\theta(z)\) is the prior distribution over the latent variable. See Figure 7.

Standard VAEs model these two distributions separately. The prior distribution informs us about the latent variable in the absence of any knowledge about the data. It is common in VAEs to model it as a standard Normal distribution

\[P_\theta(z) = N(0,1).\]

The VAE decoder probability \(P_\theta(x\mid z)\) can be any distribution, and is usually determined by the type of data. For instance, in Figure 9, I have assumed it is a categorical distribution,

\[P_\theta(x_i\mid z) = p_i \quad \text{for } 1\leq i\leq D, \quad \text{with } \sum_{i=1}^D p_i = 1,\]

thus, there is a softmax at the end of the last decoder layer. In the scVI model we are describing next, they use a negative binomial distribution.
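
A sketch of such a decoder head (the layer sizes and names are illustrative): the last linear layer is followed by a softmax, so the output defines the categorical probabilities \(p_i\). scVI, in contrast, parameterizes a negative binomial distribution.

import torch
import torch.nn as nn

class CategoricalDecoder(nn.Module):
    # Map a latent z to the parameters p_1..p_D of a categorical distribution.
    def __init__(self, latent_dim=32, D=2000):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, D))

    def forward(self, z):
        logits = self.net(z)
        return torch.softmax(logits, dim=-1)   # p_i >= 0 and sum_i p_i = 1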

The VAE encoder is an inference model (approximate inverse of the generative model)


Figure 8. General description of a VAE.

The Variational Inference approach

The main idea behind variational inference (VI) is this: when you would like to find an intractable distribution, such as the posterior of a generative model \(P_\theta(z\mid x)\), you approximate it with a tractable parametric distribution \(q_\varphi(z\mid x)\) and optimize its parameters \(\varphi\) so that it gets as close as possible to the true posterior.


Figure 9. Description of a VAE where we assume that the generative distribution (decoder) is a categorical distribution, and the variational discriminative distribution is a normal distribution whose parameters mu and log(sigma) are trained by the encoder.

Thus the VAE encoder is an inference (discriminative) model \(q_\varphi(z\mid x)\) that is the approximate inverse of the generative model: it approximates the true posterior \(P_\theta(z\mid x)\) derived from the generative model \(P_\theta(x,z)\), see Figure 8.

Where is the deep learning part of a VAE? VAE are Deep Latent Variable Models (DLVM)

Deep Latent Variable Models (DLVM) are generative models \(P_\theta(x,z)\) in which the distributions are parameterized by neural networks, so the parameters \(\theta\) are obtained by training those networks.

A VAE uses neural networks architectures to train all the parameters involved: the encoder parameters \(\varphi\) as well as the decoder parameters \(\theta\).

Two optimization objectives. One method to achieve them both: The Evidence Lower Bound (ELBO)

The two optimization objectives in a VAE are: (1) to fit the generative model so that its marginal \(P_\theta(x)\) approximates the data distribution \(P_{data}(x)\), and (2) to make the approximate posterior \(q_\varphi(z\mid x)\) as close as possible to the true posterior \(P_\theta(z\mid x)\).

A common way to measure how close two distributions are is to calculate the Kullback-Leibler (KL) divergence. The distribution that best fits the data is the one that minimizes the KL divergence

\[\min_\theta D_{KL}(P_{data}(x)\,||\,P_\theta(x)) = \min_\theta \int_x P_{data}(x) \log\frac{P_{data}(x)}{P_{\theta}(x)}\, dx = \max_\theta E_{x\sim P_{data}} \left[\log P_{\theta}(x)\right]\]

Thus, minimizing the KL divergence between the data distribution and the model’s marginal is equivalent to maximizing the log-marginal probability over the data.

For any inference model \(q_\varphi(z\mid x)\), we can write

\[\begin{aligned} \log P_{\theta}(x) &= E_{q_\varphi(z\mid x)}\left[ \log P_{\theta}(x) \right]\\ &= E_{q_\varphi(z\mid x)}\left[ \log \frac{P_{\theta}(x,z)}{P_{\theta}(z\mid x)}\right]\\ &= E_{q_\varphi(z\mid x)}\left[ \log \frac{P_{\theta}(x,z)}{q_\varphi(z\mid x)}\frac{q_\varphi(z\mid x)}{P_{\theta}(z\mid x)}\right]\\ &= E_{q_\varphi(z\mid x)}\left[ \log \frac{P_{\theta}(x,z)}{q_\varphi(z\mid x)}\right] + E_{q_\varphi(z\mid x)}\left[ \log \frac{q_\varphi(z\mid x)}{P_{\theta}(z\mid x)}\right]\\ \end{aligned}\]

Because the second term is the KL divergence between the approximate \(q_\varphi(z\mid x)\) and true \(P_{\theta}(z\mid x)\) inference distributions, and it is never negative,

\[E_{q_\varphi(z\mid x)}\left[ \log \frac{q_\varphi(z\mid x)}{P_{\theta}(z\mid x)}\right] = D_{KL}(q_\varphi(z\mid x)|| P_{\theta}(z\mid x)) \geq 0,\]

then

\[\log P_{\theta}(x) \geq ELBO_{\varphi\theta}(x) = E_{q_\varphi(z\mid x)}\left[ \log \frac{P_{\theta}(x,z)}{q_\varphi(z\mid x)}\right]\]

where the ELBO is the evidence lower bound

\[ELBO_{\varphi\theta}(x) = E_{q_\varphi(z\mid x)}\left[ \log P_{\theta}(x,z) - \log q_\varphi(z\mid x)\right].\]

Thus the \(ELBO_{\varphi\theta}(x)\) is a lower bound to the log-probability of the data \(\log P_\theta(x)\).

Maximizing the ELBO(x) with respect to both sets of parameters accomplishes the two things we want to do:

(1) will improve the generative model by increasing the marginal \(P_{\theta}(x)\)

(2) will improve the inference model by reducing the KL distance between the true posterior \(P_\theta(z\mid x)\) and the approximate posterior \(q_\varphi(z\mid x)\).
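
In practice we minimize the negative ELBO as a loss. Here is a sketch of this loss in PyTorch, assuming a Gaussian approximate posterior \(q_\varphi(z\mid x) = N(\mu, \sigma^2)\), the standard-normal prior \(N(0,1)\), and (as an illustrative stand-in) a squared-error reconstruction term; names are illustrative:

import torch
import torch.nn.functional as F

def negative_elbo(x, x_prime, mu, logvar):
    # -ELBO = reconstruction term + KL(q_phi(z|x) || p_theta(z)).
    # Assumes q_phi(z|x) = N(mu, exp(logvar)) and the prior p_theta(z) = N(0, 1),
    # for which the KL term has the closed form
    # -0.5 * sum(1 + logvar - mu^2 - exp(logvar)).
    recon = F.mse_loss(x_prime, x, reduction="sum")   # stands in for -log p_theta(x|z)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl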


Figure 10.

The reparameterization trick

The VAE encoder (unlike a regular AE) does not produce values of the latent variable z directly; instead, it outputs the values of the parameters of the probability distribution \(q_\varphi(z\mid x)\).

It is typical to describe

\[q_\varphi(z\mid x) = N(z; \mu(x), \sigma^2(x)),\]

thus one can sample values of z using pytorch as

\[z = torch.normal(mean, std, size=(d1,d2))\]

This is a “stochastic node”, not a deterministic value, and we cannot compute gradients through the sampling operation, which we need in order to do stochastic gradient descent and update the parameters.

A way to avoid this problem for VAE is called the reparameterization trick.

Because z is a sample of the Normal distribution

\[z\sim N(\mu,\sigma)\quad \text{then}\quad \frac{z-\mu}{\sigma} \sim N(0,1).\]

Then, instead of using a non-differentiable stochastic node to sample z, we use it to sample a noise value

\[\epsilon = torch.normal(0, 1, size=(d1,d2))\]

and then \(z\) calculated as

\[z = \mu + \epsilon \, \sigma\]

is just a deterministic value given \(\epsilon\) and can be used in backpropagation.
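
A sketch of the reparameterization step as it is commonly written in PyTorch, assuming the encoder outputs \(\mu\) and \(\log\sigma^2\) (the function name is illustrative):

import torch

def reparameterize(mu, logvar):
    # Sample z = mu + sigma * eps with eps ~ N(0, 1).
    # The randomness lives entirely in eps, so z is a differentiable function
    # of mu and logvar and gradients can flow back to the encoder.
    std = torch.exp(0.5 * logvar)   # sigma = exp(log sigma^2 / 2)
    eps = torch.randn_like(std)     # eps ~ N(0, 1), same shape as std
    return mu + eps * std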

VAE Pytorch code

Pytorch code implementing a VAE for gene expression profiles is here

This VAE assumes that the decoder (generative) distribution is categorical, with a softmax at the end of the last decoder layer, and that the approximate posterior is a normal distribution whose mean and log-variance are produced by the encoder, as described in Figure 9.
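
For orientation, here is a compact, illustrative sketch of how these pieces could fit together; it is not the linked course code, and it assumes a Gaussian approximate posterior, a standard-normal prior, a softmax (categorical) decoder, and inputs normalized to sum to 1 per cell:

import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, D=1000, hidden=256, latent_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(D, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)        # mean of q_phi(z|x)
        self.logvar = nn.Linear(hidden, latent_dim)    # log-variance of q_phi(z|x)
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, D))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        p = torch.softmax(self.dec(z), dim=-1)                    # categorical p_theta(x|z)
        return p, mu, logvar

def loss_fn(x, p, mu, logvar):
    # -ELBO: categorical reconstruction term plus KL to the N(0,1) prior.
    # Assumes each row of x is normalized to sum to 1 (see the normalization above).
    recon = -(x * torch.log(p + 1e-8)).sum()
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl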

scVI: VAE for single-cell transcriptomics


Figure 11. scVI variational autoencoder. "Deep generative modeling for single-cell transcriptomics", Lopez et al., 2018.