MCB128: AI in Molecular Biology (Spring 2026)

(Under construction)


block 2:

Convolutional Networks, Residual Networks & Recurrent Networks / DNA/RNA binding motifs

In this block we introduce convolutional neural networks (CNNs), residual networks (RNs), and recurrent neural networks (RNNs).

Convolutional Networks (CNNs)

For this section, we follow chapter 10 in Understanding Deep Learning. The molecular biology question associated with CNNs is that of the DNA- and RNA-binding sites of proteins. We will follow the implementation of the method DeepBind, “Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning”. We will also study how CNNs learn the sequence motifs with this work: “Representation learning of genomic sequence motifs with convolutional neural networks”.

Multilayer perceptrons process all the input data together as a whole. Convolutional networks instead process different parts of the input independently of each other, using parameters that are shared across the whole input. CNNs have been extensively used to extract information from 2D images, answering questions such as: where is the horse in the image? In molecular biology, CNNs have been used to recognize short sequence motifs found in genomes, such as transcription factor binding sites.

Convolutional Networks in 1-D

Convolutional networks are based on convolutional operations.

The convolution operation


Figure 1. 1D convolution with kernel of size 3.

For a vector \(x[D]\) and a set of weights of dimension \(K=3\), \(w[3] = (w_1,w_2,w_3)\), usually referred to as the kernel, a convolution \(z\) (Figure 1) is defined as a weighted sum of the nearest three inputs,

\[\begin{aligned} z_i &= x_{i-1} w_1 + x_{i} w_2 + x_{i+1} w_3\\ &= \sum_{j=1}^3 x_{i+j-2} w_j. \end{aligned}\]

Each element of the output vector \(z[D]\) is a linear combination of the nearest three inputs with the same set of weights, for instance

\[\begin{aligned} z_2 &= x_{1} w_1 + x_{2} w_2 + x_{3} w_3,\\ z_3 &= x_{2} w_1 + x_{3} w_2 + x_{4} w_3,\\ z_4 &= x_{3} w_1 + x_{4} w_2 + x_{5} w_3. \end{aligned}\]
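
As a concrete check of the indexing above, here is a minimal NumPy sketch (the values of `x` and `w` are illustrative) that computes \(z_i\) at the interior positions for a kernel of size 3:

```python
import numpy as np

x = np.array([1.0, 2.0, 0.0, -1.0, 3.0, 1.0])   # input vector x[D], here D = 6
w = np.array([0.5, 1.0, -0.5])                   # kernel w[3] = (w_1, w_2, w_3)

# z_i = x_{i-1} w_1 + x_i w_2 + x_{i+1} w_3, for the interior positions of x
z = np.array([x[i-1]*w[0] + x[i]*w[1] + x[i+1]*w[2] for i in range(1, len(x) - 1)])

# np.correlate performs the same weighted sum (no kernel flip, unlike np.convolve)
assert np.allclose(z, np.correlate(x, w, mode='valid'))
print(z)
```
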
Padding

Figure 2. Edge cases in 1D convolution with a kernel of size 3. Solutions: zero padding (left), where the dimensions of the input and output remain the same; no padding (right), where the dimension of the output is smaller than the input by \(K-1 = 2\).

At the edges, the convolution kernel extends beyond the inputs. Two typical solutions for these edge cases are given in Figure 2 and sketched in code below.
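
The same NumPy setup as above (again with illustrative values) reproduces the two strategies of Figure 2:

```python
import numpy as np

x = np.array([1.0, 2.0, 0.0, -1.0, 3.0, 1.0])   # D = 6
w = np.array([0.5, 1.0, -0.5])                   # K = 3

# No padding ("valid"): the kernel never extends beyond the input,
# so the output has D - (K - 1) = 4 elements.
z_valid = np.correlate(x, w, mode='valid')

# Zero padding ("same"): pad (K - 1)/2 zeros on each side,
# so the output keeps the input length D = 6.
x_padded = np.pad(x, pad_width=1)
z_same = np.correlate(x_padded, w, mode='valid')

print(z_valid.shape, z_same.shape)               # (4,) (6,)
```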

Kernel size, stride, dilation
Equivariant by translation

A convolution is equivariant to translation, which means that when the input is translated, the output is translated in the same way. Writing \(t[\cdot]\) for a translation (shift) of a sequence,

\[z\bigl(t[x]\bigr) = t\bigl[z(x)\bigr].\]

This property is important. Consider the case of detecting a subsequence motif in a genome. Translational equivariance implies that if the motif appears at a shifted position in a different genome, the CNN will detect it at its new location, and the output will shift accordingly.
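
A short NumPy check of this property, using an illustrative "motif" placed in the interior of the sequence so that edge effects do not interfere:

```python
import numpy as np

x = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0])   # a toy motif around position 3
w = np.array([1.0, 2.0, 1.0])                             # illustrative kernel

x_shifted = np.roll(x, 2)                                 # translate the motif by two positions

z = np.correlate(x, w, mode='same')
z_shifted = np.correlate(x_shifted, w, mode='same')

# Equivariance: convolving the shifted input equals shifting the convolved output.
assert np.allclose(z_shifted, np.roll(z, 2))
```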

Convolutional layer

A convolutional layer takes inputs \(x[L]\) and calculates its outputs \(y[L]\) by convolving the inputs with a convolutional kernel \(w\) of size \(K\), adding a bias \(\beta\), and passing the result through an activation function \(f\),

\[y_l = f\left( \beta + \sum_{k=1}^K x_{l+k-(K+1)/2}\, w_k\right) \quad \text{for odd } K,\]

which for the particular case \(K = 3\), results in (Figure 1)

\[y_l = f\left( \beta + x_{l-1} w_1 + x_{l} w_2 + x_{l+1} w_3\right).\]
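
A minimal PyTorch sketch of such a layer, with kernel size 3, a single input and output channel, and ReLU as an illustrative choice of activation:

```python
import torch
import torch.nn as nn

# One convolutional layer: kernel size K = 3, a single bias, then an activation.
# padding=1 zero-pads the input so that the output length equals the input length.
conv = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=3, padding=1)
activation = nn.ReLU()                 # plays the role of f in the equation above

x = torch.randn(1, 1, 10)              # shape (batch, channels, length L)
y = activation(conv(x))                # y_l = f(beta + x_{l-1} w_1 + x_l w_2 + x_{l+1} w_3)
print(y.shape)                         # torch.Size([1, 1, 10])
```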

Pooling

Most convolutional networks include an additional layer after the convolutional layer (with activation) in order to downsample its outputs. Downsampling is convenient because it increases the receptive field of the convolutions that follow.


Figure 4.

There are several forms of downsampling, described in Figure 4 with two examples for a 4x4 input: one downsampling by (2,2) to a 2x2 output, the other downsampling by (2,1) to a 2x4 output.
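
Max pooling is one common form of downsampling; a minimal PyTorch sketch reproducing the two shape changes just described:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 4, 4)                  # a 4x4 input with a single channel

pool_22 = nn.MaxPool2d(kernel_size=(2, 2))   # downsample by 2 in both directions
pool_21 = nn.MaxPool2d(kernel_size=(2, 1))   # downsample by 2 along the first axis only

print(pool_22(x).shape)                      # torch.Size([1, 1, 2, 2])
print(pool_21(x).shape)                      # torch.Size([1, 1, 2, 4])
```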

Channels


Figure 5.

Typically, multiple convolutions are applied to the input \(x\), and their outputs are stored in separate channels.

When two convolutions \(W^1[3]\) and \(W^2[3]\) are applied to the same input, the result is two output channels,

\[\begin{aligned} y^1_i &= f\left(\beta^1 + \sum_{k=1}^3 x_{i+k-2} W^1_k\right),\\ y^2_i &= f\left(\beta^2 + \sum_{k=1}^3 x_{i+k-2} W^2_k\right), \end{aligned}\]

which generalizes to \(C_O\) output channels as

\[y^{c_o}_i = f\left(\beta^{c_o} + \sum_{k=1}^3 x_{i+k-2} W^{c_o}_k\right)\quad \text{for } 1\leq c_o \leq C_O.\]

Moreover, the input can also have multiple channels \(C_I\); then the hidden units in each output channel are computed as a weighted sum over all \(C_I\) input channels and the \(K\) kernel entries, using a weight matrix \(W[K,C_I]\) per output channel,

\[y^{c_o}_i = f\left(\beta^{c_o} + \sum_{c_i=1}^{C_I} \sum_{k=1}^3 x^{c_i}_{i+k-2} W^{c_o,c_i}_k\right)\quad \text{for } 1\leq c_o \leq C_O.\]

In general, the input and hidden layers will all have multiple channels (Figure 5). If there are \(C_I\) input channels and \(C_O\) output channels, then we need \(W[K,C_I,C_O]\) weights and \(\beta[C_O]\) biases, one bias per output channel.

For instance, in the case of biological sequences like DNA or RNA, it is typical for the input to have four channels, one for each nucleotide. And if the model predicts sequence motifs, as is the case for the DeepBind model we describe later, each output channel corresponds to one of the motifs.
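
A PyTorch sketch of the weight and bias shapes for this multi-channel case (the choices \(C_O = 16\), \(K = 12\), and \(L = 100\) are illustrative):

```python
import torch
import torch.nn as nn

C_I, C_O, K, L = 4, 16, 12, 100    # 4 input channels: one-hot A, C, G, T

conv = nn.Conv1d(in_channels=C_I, out_channels=C_O, kernel_size=K)

print(conv.weight.shape)           # torch.Size([16, 4, 12])  -> W[C_O, C_I, K]
print(conv.bias.shape)             # torch.Size([16])         -> one bias per output channel

x = torch.randn(1, C_I, L)         # shape (batch, channels, length)
y = conv(x)
print(y.shape)                     # torch.Size([1, 16, 89])  -> L - K + 1 positions, no padding
```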

The receptive field


Figure 6.

CNN for 2D inputs


Figure 7.

Figure 8.

Going to extremes with a CNN


Figure 9.

Figure 10.

DeepBind, DNA/RNA binding motifs

The core of the DeepBind model is a convolutional layer applied to one-hot encoded DNA or RNA sequences, followed by rectification, max pooling over sequence positions, and a small fully connected network that outputs a binding score.
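
A minimal PyTorch sketch of this architecture follows; it is not the original DeepBind code, and the hyperparameters `num_motifs`, `motif_len`, and `hidden` are illustrative choices rather than the published settings:

```python
import torch
import torch.nn as nn

class DeepBindSketch(nn.Module):
    """DeepBind-style model: convolution over one-hot sequences, rectification,
    max pooling over positions, and a fully connected network giving one score."""

    def __init__(self, num_motifs=16, motif_len=24, hidden=32):
        super().__init__()
        # Each output channel of the convolution acts as one motif detector.
        self.conv = nn.Conv1d(in_channels=4, out_channels=num_motifs,
                              kernel_size=motif_len)
        self.relu = nn.ReLU()
        self.pool = nn.AdaptiveMaxPool1d(1)       # max over all sequence positions
        self.fc = nn.Sequential(
            nn.Linear(num_motifs, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),                 # binding score for the whole sequence
        )

    def forward(self, x):
        # x: (batch, 4, L) one-hot encoded DNA/RNA sequences
        h = self.relu(self.conv(x))               # (batch, num_motifs, L - motif_len + 1)
        h = self.pool(h).squeeze(-1)              # (batch, num_motifs): best match per motif
        return self.fc(h)                         # (batch, 1)

model = DeepBindSketch()
x = torch.randn(8, 4, 101)                        # a batch of 8 sequences of length 101
print(model(x).shape)                             # torch.Size([8, 1])
```

Because of the max pooling, each motif detector reports only its best match along the sequence, which makes the final score insensitive to where the motif occurs.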

How and where are the sequence motifs learned?

Residual Networks (RNs)

For this section, we follow chapter 11 in Understanding Deep Learning.

Recurrent Neural Networks (RNNs)