MCB128: AI in Molecular Biology (Spring 2026)

(Under construction)


The Book of (Deep Learning) Jargon:

Google machine-learning glossary

Inputs

Tokenization

Categorical variable

A variable that can take a fixed number of values. For example, DNA/RNA nucleotides can be represented as a categorical variable with four possible values A, C, G, T/U. Amino acids can be represented as a categorical variable with 21 possible values.

Embedding (or vector embedding)

An array of numbers (a vector) that represents an input. For instance, the categorical variable “RNA nucleotide” could be represented by four vectors of arbitrary dimension representing A, C, G, and U respectively.

One-hot embedding

A vector embedding representing a categorical variable such that each vector has one 1 value, and all the others are zero.

For instance, the one-hot embedding for the categorical variable “RNA nucleotide” can be given as,

A = [1,0,0,0]

C = [0,1,0,0]

G = [0,0,1,0]

U = [0,0,0,1]
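As a minimal sketch of the table above, the following Python function builds the one-hot embedding of an RNA sequence. The names `ALPHABET` and `one_hot_encode` are illustrative choices, not names used in the lectures.

```python
# The order of the nucleotides fixes which position holds the 1.
ALPHABET = "ACGU"  # assumed ordering, matching the table above

def one_hot_encode(sequence):
    """Return one one-hot vector (a list of 0s and a single 1) per nucleotide."""
    vectors = []
    for nucleotide in sequence:
        vector = [0] * len(ALPHABET)
        # str.index raises ValueError for symbols outside the alphabet
        vector[ALPHABET.index(nucleotide)] = 1
        vectors.append(vector)
    return vectors

print(one_hot_encode("GCAGUA"))
```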

Dataset

The collection of all data used to train the model. The dataset usually includes the inputs for a number of examples. For some subset of the examples, we may also have labels used in supervised training.

A dataset is usually divided into: training set, validation set and test set.
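A minimal sketch of such a split, assuming an 80/10/10 division after random shuffling. The fractions, the function name `split_dataset`, and the fixed seed are illustrative assumptions rather than choices made in the lectures.

```python
import random

def split_dataset(examples, frac_train=0.8, frac_valid=0.1, seed=0):
    """Shuffle the examples and split them into training, validation and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is reproducible
    n_train = int(frac_train * len(shuffled))
    n_valid = int(frac_valid * len(shuffled))
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever is left over
    return train, valid, test

train, valid, test = split_dataset(range(100))
print(len(train), len(valid), len(test))
```

Each example ends up in exactly one of the three sets, so no example is shared between training and testing.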

Training set

The collection of examples from the dataset that are used to train the model.

Validation set

Test set

Feature

Equivalent to inputs

Sparse Feature

A feature/input vector where most of the values are zero. For instance, the one-hot encoding is an example of a sparse feature.

Labels

First introduced in b0_lecture

Labels are associated with the examples in the training set, and usually correspond to the categories that the neural network is trying to predict. Labels are used whenever a neural network is trained by supervised learning.

Labels are different from inputs. For instance, in the case of separating apples and oranges, each of the examples is given a label \(t\), as to whether it is an apple (\(t=1\)) or an orange (\(t=0\)). For this example the inputs are the color of the fruit and the roughness of its skin.

For the Ribosomal Entry Sites perceptron, the inputs/features are the nucleotides of an RNA sequence, and each example RNA sequence is given a label \(+\) for being a ribosomal binding site (RBS) or \(-\) otherwise.

Labeled data

Labeled data provides, in addition to the inputs (or features) for a number of examples, a label for each example with the value that the neural network is trying to infer. These labels become the ground truth used in supervised training to guide the optimization of the weights.

Unlabeled data

Unlabeled data is data that does not include any labels. In our apples/oranges data from b0, Figure 9, it would mean getting the input values for “color” and “skin roughness” for all fruit examples without being told whether each fruit is an apple or an orange.

Unlabeled data can be used as test

Data leakage

Data augmentation

Outputs

Outputs are the end representation of a neural network. Outputs are usually provided as a probability distribution over a discrete variable.

Outputs are the result of a non-linear operation (the activation function) applied to a linear combination of the inputs and the weights.

Logits

softmax

Tensors

Broadcasting

Flattening

Given a tensor of arbitrary dimensions, the data is re-distributed as a one-dimensional vector. NumPy arrays have a flatten() method (for example, one_hot.flatten()) that flattens a tensor of any shape to one dimension.

We introduced flattening in the b0_lecture, when we flattened the one-hot representation of an RNA sequence of length 6 to a vector of length 4x6 = 24.

sequence = "GCAGUA"

one_hot = np.array([[0,0,1,0],[0,1,0,0],[1,0,0,0],[0,0,1,0],[0,0,0,1],[1,0,0,0]])

one_hot_flat = one_hot.flatten() = [0,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0]

one_hot.shape = (6,4)

one_hot_flat.shape = (24,)

one_hot_dim = len(one_hot.shape) = 2

one_hot_flat_dim = len(one_hot_flat.shape) = 1

Tensor vs Vector

Tensor dimensions/ Tensor shape

Einsum notation

Gradient

First introduced in b0_lecture

A gradient is the vector of the derivatives of a given function with respect to each of its variables.

For example, for a loss function \(L\) that depends on weights \(w_1,\ldots,w_n\), its gradient is given by

\[g = (\frac{\delta L}{\delta w_1},\ldots , \frac{\delta L}{\delta w_n})\]
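As an illustration, the gradient can be approximated numerically by nudging one weight at a time. The sketch below uses central finite differences on a toy loss \(L = w_1^2 + 3 w_2\); the function names, the toy loss, and the step size `eps` are assumptions made for this example only.

```python
def numerical_gradient(loss, weights, eps=1e-6):
    """Approximate the gradient of `loss` at `weights` by central differences."""
    gradient = []
    for i in range(len(weights)):
        w_plus = list(weights)
        w_minus = list(weights)
        w_plus[i] += eps    # nudge only weight i upwards
        w_minus[i] -= eps   # and downwards
        gradient.append((loss(w_plus) - loss(w_minus)) / (2 * eps))
    return gradient

def toy_loss(w):
    # L(w1, w2) = w1^2 + 3*w2, so the exact gradient is (2*w1, 3)
    return w[0] ** 2 + 3 * w[1]

print(numerical_gradient(toy_loss, [1.0, 2.0]))  # close to [2.0, 3.0]
```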

Models

Inputs/Input features

Inputs are the discrete values that represent the queries that a model gets to produce an output. The inputs are received by the input layer of the model.

For instance, in the case of separating apples and oranges, there are two inputs (or features): the color of the fruit and the roughness of its skin. For the Ribosomal Entry Sites perceptron, the inputs/features are the nucleotides of an RNA sequence.

Outputs

See the Outputs entry above.

Parameter/weight

The (float) values that constitute the elements of a model. Parameters/weights are the values trained from data.

The bias term

A special parameter (or weight) \(w_0\) that is not associated with any input, and represents the activation in the absence of inputs.

Hyperparameter

Activation

The activation \(a\) is the sum of the inputs weighted by the weights, plus the bias, \(a = x\cdot w + w_0\).

Activation function

The activation function \(f()\) is a function of the activation \(a\), which is a linear combination of inputs and weights plus the bias, \(a = x\cdot w + w_0\). The activation function introduces a non-linearity in the output \(y=f(a)\).

There are many different activation functions. The plots of the activation functions are never straight lines.

We introduced the linear logistic activation function in our b0_lecture.

Linear logistic function

\(f(a) = \frac{1}{1+e^{-a}}\)

The sigmoid (tanh) function

\(f(a) = \tanh(a)\)

The step function

\(f(a) = \left\{ \begin{matrix} 1& a > 0\\ 0& a \leq 0 \end{matrix} \right.\)

RELU

The RELU function
\(f(a) = \left\{ \begin{matrix} a& a > 0\\ 0& a \leq 0 \end{matrix} \right.\)
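The four activation functions listed above can be sketched in plain Python as follows. The function names are illustrative, and the step and ReLU functions follow the conventions of the formulas above (both return 0 at \(a = 0\)).

```python
import math

def logistic(a):
    """Linear logistic function, with values between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-a))

def tanh(a):
    """Sigmoid (tanh) function, with values between -1 and 1."""
    return math.tanh(a)

def step(a):
    """Step function: 1 for a > 0, otherwise 0."""
    return 1.0 if a > 0 else 0.0

def relu(a):
    """ReLU: a for a > 0, otherwise 0."""
    return a if a > 0 else 0.0

for a in (-2.0, 0.0, 2.0):
    print(a, logistic(a), tanh(a), step(a), relu(a))
```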

Activity

The activity \(y=f(a)\) is the result of applying the activation function \(f()\) to the activation \(a\) (the weighted sum of inputs and weights).
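A minimal sketch of how activation and activity relate for a single neuron, assuming a logistic activation function. The input, weight, and bias values are hypothetical numbers picked for the example.

```python
import math

def activation(x, w, w0):
    """a = x . w + w0: weighted sum of the inputs plus the bias."""
    return sum(x_i * w_i for x_i, w_i in zip(x, w)) + w0

def activity(x, w, w0):
    """y = f(a), here with f the logistic function."""
    return 1.0 / (1.0 + math.exp(-activation(x, w, w0)))

x = [0.5, -1.0]   # hypothetical inputs
w = [2.0, 1.0]    # hypothetical weights
w0 = 0.5          # hypothetical bias

print(activation(x, w, w0), activity(x, w, w0))
# With all inputs at zero, the activation reduces to the bias.
print(activation([0.0, 0.0], w, w0))
```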

Input layer

Output layer

Hidden layer

Normalization layer

Fully connected layer/dense layer

Embedding layer

Capacity

Resnet

Depth

The depth of a neural network is defined as the number of its layers, counting the hidden layers and the output layer, but not the input layer.

Pooling/sub-sampling

Model Training / Learning

Loss/Error function

First introduced in b0_lecture

A loss function quantifies the difference between an output predicted by the model and the ground truth value for each example in a training set. Losses are used in supervised learning where training set data includes labels for the training examples providing the values the model tries to predict.

Cross-Entropy loss

First introduced in b0_lecture for a binary classification example.

For a given example \(n\), cross-entropy loss treats the predictions \(y^{(n)}=(y^{(n)}_1,\ldots y^{(n)}_I)\) and the labels \(t^{(n)}=(t^{(n)}_1,\ldots t^{(n)}_I)\) as two probability distributions, and it calculates the cross entropy between the two, averaged over all \(N\) examples as

\[L = - \frac{1}{N} \sum_{n=1}^N \sum_{i=1}^I t_i^{(n)} \log y^{(n)}_i.\]

The cross-entropy loss is never negative. For one-hot labels, it becomes zero only when the predicted distribution is identical to the label distribution for every example.
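The formula above can be sketched in Python as follows. The function name and the example labels and predictions are hypothetical; terms with \(t_i^{(n)} = 0\) are skipped because they contribute nothing to the sum.

```python
import math

def cross_entropy(labels, predictions):
    """Cross-entropy loss averaged over N examples.

    labels[n][i] is t_i^(n) and predictions[n][i] is y_i^(n).
    """
    total = 0.0
    for t, y in zip(labels, predictions):
        for t_i, y_i in zip(t, y):
            if t_i > 0:  # terms with t_i = 0 contribute nothing
                total += t_i * math.log(y_i)
    return -total / len(labels)

labels = [[1, 0], [0, 1]]          # hypothetical one-hot labels
good = [[0.9, 0.1], [0.2, 0.8]]    # predictions close to the labels
bad = [[0.1, 0.9], [0.8, 0.2]]     # predictions far from the labels
print(cross_entropy(labels, good), cross_entropy(labels, bad))
```

Predictions that put more probability on the labeled class give the smaller loss.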

Softmax

Perplexity

epoch

masking

Gradient Descent

The gradient descent algorithm takes the gradient of a loss (error) function \(L\) with respect to the weights of a layer \(w\), \(g = \frac{\delta L}{\delta w}\), and moves the weights against that gradient,

\[w = w - \eta\, g\]

where \(\eta\) is called the learning rate.

Batch GD

First introduced in b0_lecture

Here the gradient is calculated summing the contributions of all the examples in the training set.

Stochastic GD (SGD)

The gradient is calculated considering a random subset (a batch) of the examples in the training set.

on-line GD

First introduced in b0_lecture

Here each parameter update is made after calculating the gradient with respect to one random example in the training set.

On-line GD is a type of SGD.
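The three variants differ only in how many examples contribute to each gradient. The sketch below trains a single logistic neuron with a cross-entropy loss on hypothetical one-feature data; the `batch_size` argument, the data, the number of steps, and the learning rate are all assumptions made for illustration.

```python
import math
import random

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

def gradient(w, batch):
    """Gradient of the summed cross-entropy loss of a logistic neuron over a batch.

    Each example is (x, t); w[0] is the bias, so x is extended with a constant 1.
    """
    g = [0.0] * len(w)
    for x, t in batch:
        x_ext = [1.0] + list(x)
        y = logistic(sum(w_i * x_i for w_i, x_i in zip(w, x_ext)))
        for i, x_i in enumerate(x_ext):
            g[i] += (y - t) * x_i
    return g

def train(data, batch_size, eta=0.1, steps=500, seed=0):
    rng = random.Random(seed)
    w = [0.0] * (len(data[0][0]) + 1)
    for _ in range(steps):
        # all examples (batch GD) or a random subset (SGD / on-line GD)
        batch = data if batch_size >= len(data) else rng.sample(data, batch_size)
        g = gradient(w, batch)
        w = [w_i - eta * g_i for w_i, g_i in zip(w, g)]
    return w

# Hypothetical data: label 1 for positive x, label 0 for negative x.
data = [([-2.0], 0), ([-1.0], 0), ([1.0], 1), ([2.0], 1)]

w_batch = train(data, batch_size=len(data))  # batch GD
w_sgd = train(data, batch_size=2)            # stochastic GD with batches of 2
w_online = train(data, batch_size=1)         # on-line GD
print(w_batch, w_sgd, w_online)
```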

Backpropagation

First introduced in b0_lecture

The backpropagation algorithm calculates the derivatives of a loss (or error) function \(L\) with respect to the weights \(w^n_i\) at a given layer \(n\),

\[\frac{\delta L}{\delta w^n_i}.\]

The loss depends explicitly only on the outputs (activities) \(y^N_k\) of the last layer \(N\). For the weights of an inner layer \(n < N\), these derivatives are calculated using the chain rule through the activities of the layers in between,

\[\frac{\delta L}{\delta w^n_i} = \sum_{k,j,\ldots,l,m}\frac{\delta L}{\delta y^N_k} \frac{\delta y^N_k}{\delta y^{N-1}_j}\ldots \frac{\delta y^{n+1}_l}{\delta y^{n}_m}\, \frac{\delta y^{n}_m}{\delta w^{n}_i}\]
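A minimal sketch of the chain rule on a hypothetical network with one logistic neuron per layer and no biases. The squared-error loss and all numerical values are assumptions of this sketch; in it, the chain passes through the activity of the hidden neuron. The derivatives obtained by backpropagation are compared against finite-difference estimates.

```python
import math

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(w1, w2, x):
    y1 = logistic(w1 * x)    # hidden layer activity
    y2 = logistic(w2 * y1)   # output layer activity
    return y1, y2

def loss(w1, w2, x, t):
    _, y2 = forward(w1, w2, x)
    return 0.5 * (y2 - t) ** 2   # squared error, an assumption of this sketch

def backprop(w1, w2, x, t):
    y1, y2 = forward(w1, w2, x)
    dL_dy2 = y2 - t
    dL_da2 = dL_dy2 * y2 * (1.0 - y2)   # through the output logistic
    dL_dw2 = dL_da2 * y1
    dL_dy1 = dL_da2 * w2                # pass the error back to the hidden layer
    dL_da1 = dL_dy1 * y1 * (1.0 - y1)   # through the hidden logistic
    dL_dw1 = dL_da1 * x
    return dL_dw1, dL_dw2

w1, w2, x, t = 0.5, -1.5, 2.0, 1.0   # hypothetical values
eps = 1e-6
numeric_dw1 = (loss(w1 + eps, w2, x, t) - loss(w1 - eps, w2, x, t)) / (2 * eps)
numeric_dw2 = (loss(w1, w2 + eps, x, t) - loss(w1, w2 - eps, x, t)) / (2 * eps)
print(backprop(w1, w2, x, t), (numeric_dw1, numeric_dw2))
```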

Vanishing gradient

Masking

Batch

Adam optimization

Forward pass

Backward pass

Pre-training vs Fine-tuning

Foundation model

Regularization

First introduced in b0_lecture

Regularization is a collection of techniques used in the process of training a neural network in order to avoid overfitting, that is, performing so well on the training data that it is to the detriment of the performance on other examples not seen in training (generalization).

There are different regularization techniques. The overall idea is to modify the objective function (the error or loss) so that it incorporates a bias against the solutions that favor the training set.
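One common way of modifying the objective is to add a penalty on the size of the weights. The sketch below assumes an \(L_2\) penalty \(\frac{1}{2}\sum_i w_i^2\) scaled by a hypothetical strength `alpha`; the function names and values are illustrative only.

```python
def l2_penalty(weights):
    """Half the sum of the squared weights."""
    return 0.5 * sum(w * w for w in weights)

def regularized_loss(loss_value, weights, alpha=0.01):
    """Objective = data loss + alpha * penalty on the size of the weights."""
    return loss_value + alpha * l2_penalty(weights)

def regularized_gradient(loss_gradient, weights, alpha=0.01):
    """The penalty adds alpha * w_i to each component of the loss gradient."""
    return [g + alpha * w for g, w in zip(loss_gradient, weights)]

print(regularized_loss(0.3, [3.0, -4.0], alpha=0.1))
```

With this penalty, gradient descent pulls each weight towards zero in addition to following the gradient of the data loss.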

Early stopping

Dropout

L_1 regularization

L_2 regularization

Ablation

overfitting/memorization

Generalization

Interpretability

Learning

First introduced in b0_lecture

Learning is equivalent to adjusting the parameters (weights) of the network such that the outputs \(y^{(n)}\) of the network are optimized for all \(N\) training examples.

Learning requires the existence of a learning rule that specifies the way in which the neural network updates the weights in training.

Learning rule

The learning rule specifies the objective by which the parameters of the network will be updated in the training process. The learning rule depends on the inputs and the model parameters. In supervised learning, it also depends on the labels provided with the training data.

For instance, in our b0_lecture apples and oranges example, the learning rule is the cross entropy loss.

Learning rate

First introduced in b0_lecture

The learning rate refers to the proportionality value \(\eta\) that measures how fast weights are changed against the gradient of the loss \(g = \frac{\delta L}{\delta w}\).

\[w = w - \eta\, g\]

If the learning rate \(\eta\) is very small, it may take too long to find the optimal parameters. If the learning rate is too big, then the system may bounce back and forth, never reaching the optimal region for the parameters.
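Both regimes can be seen on a toy problem. The sketch below assumes the one-parameter loss \(L(w) = w^2\), whose gradient is \(g = 2w\) and whose optimum is \(w = 0\); the starting point, the number of steps, and the three learning rates are hypothetical choices.

```python
def gradient_descent(eta, w=1.0, steps=50):
    """Minimize L(w) = w**2 by repeating the update w = w - eta * g."""
    for _ in range(steps):
        g = 2.0 * w
        w = w - eta * g
    return w

for eta in (0.001, 0.4, 1.0):
    print(eta, gradient_descent(eta))
```

With \(\eta = 0.001\) the weight has barely moved from its starting value after 50 steps, with \(\eta = 0.4\) it is very close to the optimum, and with \(\eta = 1.0\) every update flips the weight between 1 and -1 without ever approaching zero.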

The learning rate is an example of a hyperparameter.

Supervised Learning

First introduced in b0_lecture

In supervised learning, the objective is to adjust the parameters (weights) of the network such that the output \(y^{(n)}\) of the network is close to the label \(t^{(n)}\) for all \(N\) examples.

Unsupervised Learning

Semi-supervised Learning

Adaptive Learning

Adversarial Learning

Zero-shot learning

Denoising

Distillation

Cross validation

Model Architecture

Autoregressive model

Perceptron

The perceptron is the simplest neural network, consisting of only one neuron. We introduced the perceptron in the b0_lecture.

Here are two equivalent representations of a perceptron with 9 inputs \(\{x_i\}_{i=1}^9\), 9 weights \(\{W_i\}_{i=1}^9\) plus one bias \(W_0\), and using sigmoidal activation.

MLP

CNN

RNN

Attention

Transformer

Language model

Encoder

Decoder

Autoencoder

VAE

GAN

Generative model vs Discriminative model