MCB128: AI in Molecular Biology (Spring 2026)
(Under construction)
- Inputs
- Outputs
- Tensors
- Models
- Model Training / Learning
- Learning
- Model Architecture
The Book of (Deep Learning) Jargon:
Google machine-learning glossary
Inputs
Tokenization
Categorical variable
A variable that can take a fixed number of values. For example, DNA/RNA nucleotides can be represented as a categorical variable with four possible values A, C, G, T/U. Amino acids can be represented as a categorical variable with 21 possible values.
Embedding (or vector embedding)
An array of numbers (a vector) that represents an input. For instance, the categorical variable “RNA nucleotide” could be represented by four vectors of arbitrary dimension representing A, C, G, and U respectively.
One-hot embedding
A vector embedding representing a categorical variable such that each vector has one 1 value, and all the others are zero.
For instance, the one-hot embedding for the categorical variable “RNA nucleotide” can be given as,
A = [1,0,0,0]
C = [0,1,0,0]
G = [0,0,1,0]
U = [0,0,0,1]
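As a minimal sketch, the table above can be written as a lookup in Python with NumPy. The names `NUCLEOTIDES` and `one_hot_encode` are illustrative assumptions, not part of the course code.

```python
import numpy as np

# Order of the categories; matches the table above (A, C, G, U).
NUCLEOTIDES = "ACGU"

def one_hot_encode(sequence):
    """Return an array of shape (len(sequence), 4) with a single 1 per row."""
    index = {nt: i for i, nt in enumerate(NUCLEOTIDES)}
    encoded = np.zeros((len(sequence), len(NUCLEOTIDES)), dtype=int)
    for position, nt in enumerate(sequence):
        encoded[position, index[nt]] = 1
    return encoded

print(one_hot_encode("GCAGUA"))
```

Each row of the result is the one-hot vector of one nucleotide, so a sequence of length 6 gives an array of shape (6, 4).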
Dataset
The collection of all data used to train and evaluate the model. The data usually includes the inputs for a number of examples. For some subset of the examples, we may also have labels, which are used in supervised training.
A dataset is usually divided into: training set, validation set and test set.
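One possible way to make such a split is to shuffle the example indices and cut them into three disjoint groups. In this sketch the 80/10/10 proportions, the fixed random seed, and the function name `split_dataset` are all assumptions chosen for illustration.

```python
import numpy as np

def split_dataset(n_examples, frac_train=0.8, frac_val=0.1, seed=0):
    """Shuffle example indices and split them into three disjoint sets."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_examples)
    n_train = int(frac_train * n_examples)
    n_val = int(frac_val * n_examples)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]   # whatever is left
    return train, val, test

train_idx, val_idx, test_idx = split_dataset(100)
print(len(train_idx), len(val_idx), len(test_idx))  # 80 10 10
```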
Training set
The collection of examples from the dataset that are used to train the model.
Validation set
Test set
Feature
Equivalent to inputs
Sparse Feature
A feature/input vector where most of the values are zero. For instance, the one-hot encoding is an example of a sparse feature.
Labels
First introduced in b0_lecture
Labels are associated with the examples in the training set, and usually correspond to the categories that the neural network is trying to predict. Labels are used whenever a neural network is trained by supervised learning.
Labels are different from inputs. For instance, in the case of separating apples and oranges, each of the examples is given a label \(t\), as to whether it is an apple (\(t=1\)) or an orange (\(t=0\)). For this example the inputs are the color of the fruit and the roughness of its skin.
For the Ribosomal Entry Sites perceptron, the inputs/features are the nucleotides of an RNA sequence, and each example RNA sequence is given a label \(+\) for being a ribosomal binding site (RBS) or \(-\) otherwise.
Labeled data
Labeled data is data that, in addition to the inputs (or features) for a number of examples, also provides for each example the label values that the neural network is trying to infer. These labels become the ground truth used in supervised training to guide the optimization of the weights.
Unlabeled data
Unlabeled data is data that does not include any labels. In our apples/oranges data from b0, Figure 9, it would mean getting the input values for “color” and “skin roughness” for all fruit examples without being told whether each fruit is an apple or an orange.
Unlabeled data can be used as test
Data leakage
Data augmentation
Outputs
Outputs are the end representation of a neural network. Outputs are usually provided as a probability distribution over a discrete variable.
Outputs are the result of a non-linear operation (the activation function) applied to a linear combination of the inputs and the weights.
Logits
softmax
Tensors
Broadcasting
Flattening
Given a tensor of arbitrary dimensions, the data is re-distributed as a one-dimensional vector. NumPy arrays have a flatten() method (and NumPy also provides np.ravel()) to flatten any tensor to one dimension.
We introduced flattening in the b0_lecture, when we flattened the one-hot representation of an RNA sequence of length 6 to a vector of length 4x6 = 24.
import numpy as np
sequence = "GCAGUA"
one_hot = np.array([[0,0,1,0],[0,1,0,0],[1,0,0,0],[0,0,1,0],[0,0,0,1],[1,0,0,0]])
one_hot_flat = one_hot.flatten()  # [0,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0]
one_hot.shape  # (6, 4)
one_hot_flat.shape  # (24,)
one_hot_dim = len(one_hot.shape)  # 2
one_hot_flat_dim = len(one_hot_flat.shape)  # 1
Tensor vs Vector
Tensor dimensions/ Tensor shape
Einsum notation
Gradient
First introduced in b0_lecture
A gradient is the vector of the derivatives of a given function with respect to its variables.
For example, for a loss function \(L\) that depends on weights \(w_1,\ldots,w_n\), its gradient is given by
\[g = \left(\frac{\delta L}{\delta w_1},\ldots , \frac{\delta L}{\delta w_n}\right)\]
Models
Inputs/Input features
Inputs are the values that represent the queries a model receives to produce an output. The inputs are received by the input layer of the model.
For instance, in the case of separating apples and oranges, there are two inputs (or features): the color of the fruit and the roughness of its skin. For the Ribosomal Entry Sites perceptron, the inputs/features are the nucleotides of an RNA sequence.
Outputs
Parameter/weight
The (float) values that constitute the elements of a model. Parameters/weights are the values trained from data.
The bias term
A special parameter (or weight) \(w_0\) that is not associated with any input, and represents the activation in the absence of inputs.
Hyperparameter
Activation
The activation \(a\) is the weighted sum of the inputs (each input multiplied by its weight) plus the bias, \(a = x\cdot w + w_0\).
Activation function
The activation function \(f()\) is a function of the activation \(a\), which is a linear combination of inputs and weights plus the bias, \(a = x\cdot w + w_0\). The activation function introduces a non-linearity in the output \(y=f(a)\).
There are many different activation functions. The plots of the activation functions are never straight lines.
We introduced the linear logistic activation function in our b0_lecture.
Linear logistic function
\(f(a) = \frac{1}{1+e^{-a}}\)
The sigmoid (tanh) function
\(f(a) = \tanh(a)\)
The step function
\(f(a) = \left\{
\begin{matrix}
1& a > 0\\
0& a \leq 0
\end{matrix}
\right.\)
RELU
The RELU function
\(f(a) = \left\{
\begin{matrix}
a& a > 0\\
0& a \leq 0
\end{matrix}
\right.\)
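The four activation functions listed above can be sketched in NumPy as follows. The function names `logistic`, `step`, and `relu` are illustrative choices; `np.tanh` is NumPy's own hyperbolic tangent.

```python
import numpy as np

def logistic(a):
    """Linear logistic function: 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def step(a):
    """Step function: 1 where a > 0, otherwise 0."""
    return np.where(a > 0, 1.0, 0.0)

def relu(a):
    """ReLU: a where a > 0, otherwise 0."""
    return np.maximum(a, 0.0)

a = np.array([-2.0, 0.0, 2.0])   # a few example activations
print(logistic(a))
print(np.tanh(a))
print(step(a))   # [0. 0. 1.]
print(relu(a))   # [0. 0. 2.]
```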
Activity
The activity \(y=f(a)\) is the result of applying the activation function \(f()\) to the activation \(a\) (the weighted sum of inputs and weights).
Input layer
Output layer
Hidden layer
Normalization layer
Fully connected layer/dense layer
Embedding layer
Capacity
Resnet
Depth
The depth of a neural network is defined as the number of its layers, counting the hidden layers and the output layer but not the input layer.
Pooling/sub-sampling
Model Training / Learning
Loss/Error function
First introduced in b0_lecture
A loss function quantifies the difference between an output predicted by the model and the ground truth value for each example in a training set. Losses are used in supervised learning where training set data includes labels for the training examples providing the values the model tries to predict.
Cross-Entropy loss
First introduced in b0_lecture for a binary classification example.
For a given example \(n\), cross-entropy loss treats the predictions \(y^{(n)}=(y^{(n)}_1,\ldots y^{(n)}_I)\) and the labels \(t^{(n)}=(t^{(n)}_1,\ldots t^{(n)}_I)\) as two probability distributions, and calculates the cross entropy between the two, averaged over all \(N\) examples as
\[L = - \frac{1}{N} \sum_{n=1}^N \sum_{i=1}^I t_i^{(n)} \log y^{(n)}_i.\]
The cross-entropy loss is never negative. For one-hot labels, it becomes zero only when the predictions are identical to the labels.
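A minimal NumPy sketch of this formula follows. The small constant `eps`, which guards against taking the logarithm of zero, and the example predictions are assumptions added for illustration.

```python
import numpy as np

def cross_entropy(t, y, eps=1e-12):
    """Average cross-entropy over N examples.

    t, y: arrays of shape (N, I); each row is a probability distribution.
    eps is an assumed small constant that avoids log(0).
    """
    y = np.clip(y, eps, 1.0)
    return -np.mean(np.sum(t * np.log(y), axis=1))

# Two examples, two classes, one-hot labels.
t = np.array([[1.0, 0.0], [0.0, 1.0]])
y_good = np.array([[0.9, 0.1], [0.2, 0.8]])   # mostly agrees with the labels
y_bad = np.array([[0.1, 0.9], [0.8, 0.2]])    # mostly disagrees

print(cross_entropy(t, y_good))  # smaller loss
print(cross_entropy(t, y_bad))   # larger loss
```

Predictions that put more probability on the labeled class give a smaller loss.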
Softmax
Perplexity
epoch
masking
Gradient Descent
The gradient descent algorithm takes the gradient of a loss (error) function \(L\) with respect to the weights \(w\) of a layer, \(g = \frac{\delta L}{\delta w}\), and moves the parameters against that gradient
\[w = w - \eta\, g\]
where \(\eta\) is called the learning rate.
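As a sketch of the update rule, the loop below applies it to a simple quadratic loss. The loss, its hypothetical optimum `target`, the learning rate, and the number of steps are all assumptions chosen so that the gradient is easy to write by hand.

```python
import numpy as np

def loss(w, target):
    """A toy quadratic loss with its minimum at w = target."""
    return np.sum((w - target) ** 2)

def gradient(w, target):
    """Analytical gradient of the quadratic loss above."""
    return 2.0 * (w - target)

target = np.array([1.0, -2.0])   # hypothetical optimum
w = np.zeros(2)                  # initial weights
eta = 0.1                        # learning rate

for _ in range(100):
    g = gradient(w, target)
    w = w - eta * g              # the update rule w = w - eta * g

print(w, loss(w, target))
```

With these settings each update shrinks the distance to the optimum by a constant factor, so `w` ends up very close to `target`.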
Batch GD
First introduced in b0_lecture
Here the gradient is calculated by summing the contributions of all the examples in the training set.
Stochastic GD (SGD)
The gradient is calculated considering a random subset (a batch) of the examples in the training set.
on-line GD
First introduced in b0_lecture
Here the parameters are updated after calculating the gradient with respect to one random example in the training set.
On-line GD is a type of SGD.
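The three variants differ only in which examples contribute to each update. The sketch below makes that explicit with index arrays. The sizes, the random per-example gradients, and the helper `mean_gradient` are hypothetical. It averages rather than sums the contributions, which differs only by a constant factor that can be absorbed into the learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples, batch_size = 12, 4

# Batch GD: every example contributes to each update.
batch_gd = np.arange(n_examples)

# Stochastic GD: a random subset (a batch) per update.
sgd = rng.choice(n_examples, size=batch_size, replace=False)

# On-line GD: a single random example per update.
online = rng.choice(n_examples, size=1)

def mean_gradient(per_example_grads, indices):
    """Average the per-example gradients over the selected examples."""
    return per_example_grads[indices].mean(axis=0)

# Hypothetical per-example gradients for a model with 3 weights.
per_example_grads = rng.normal(size=(n_examples, 3))
for name, idx in [("batch", batch_gd), ("SGD", sgd), ("on-line", online)]:
    print(name, len(idx), mean_gradient(per_example_grads, idx))
```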
Backpropagation
First introduced in b0_lecture
The backpropagation algorithm calculates the derivatives of a loss (or error) function \({L}\) with respect to the weights at a given layer \(w_i\).
\[\frac{\delta {L}}{\delta w_i}.\]
The loss depends explicitly only on the weights of the last layer, \(w^N_i\). For inner layers \(n < N\), these derivatives are calculated using the chain rule
\[\frac{\delta {L}}{\delta w^n_i} = \sum_{j,k,l,\ldots}\frac{\delta {L}}{\delta w^N_k} \frac{\delta w^N_k}{\delta w^{N-1}_j}\ldots \frac{\delta w^{n+1}_l}{\delta w^{n}_i}\]
Vanishing gradient
Masking
Batch
Adam optimization
Forward pass
Backward pass
Pre-training vs Fine-tuning
Foundation model
Regularization
First introduced in b0_lecture
Regularization is a collection of techniques used when training a neural network to avoid overfitting, that is, performing so well on the training data that performance on examples not seen during training (generalization) suffers.
There are different regularization techniques. The overall idea is to modify the objective function (the error or loss) so that it incorporates a bias against the solutions that favor the training set.
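One common way to modify the objective is to add a penalty on the size of the weights. The sketch below uses an L2 (sum of squares) penalty. The penalty strength `lam`, the function name `regularized_loss`, and the example weights are assumptions for illustration.

```python
import numpy as np

def regularized_loss(data_loss, w, lam=0.01):
    """Data loss plus an L2 penalty on the weights.

    lam is an assumed penalty strength (a hyperparameter).
    """
    return data_loss + lam * np.sum(w ** 2)

w_small = np.array([0.1, -0.2, 0.1])
w_large = np.array([5.0, -8.0, 6.0])

# With the same data loss, larger weights are penalized more.
print(regularized_loss(0.30, w_small))
print(regularized_loss(0.30, w_large))
```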
Early stopping
Dropout
L_1 regularization
L_2 regularization
Ablation
overfitting/memorization
Generalization
Interpretability
Learning
First introduced in b0_lecture
Learning is equivalent to adjusting the parameters (weights) of the network such that the outputs \(y^{(n)}\) of the network are optimized for all training examples \(n\).
Learning requires the existence of a learning rule that specifies the way in which the neural network updates the weights in training.
Learning rule
The learning rule specifies the objective by which the parameters of the network will be updated in the training process. The learning rule depends on the inputs and the model parameters. In supervised learning, it also depends on the labels provided with the training data.
For instance, in our b0_lecture apples and oranges example, the learning rule is the cross entropy loss.
Learning rate
First introduced in b0_lecture
The learning rate refers to the proportionality value \(\eta\) that measures how fast weights are changed against the gradient of the loss \(g = \frac{\delta L}{\delta w}\).
\[w = w - \eta\, g\]
If the learning rate \(\eta\) is very small, it may take too long to find the optimal parameters. If the learning rate is too big, then the system may bounce back and forth, never reaching the optimal region for the parameters.
The learning rate is an example of a hyperparameter.
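The effect of the learning rate can be seen on a toy one-dimensional loss. In this sketch the loss \(L(w) = w^2\), the starting point, and the three values of \(\eta\) are assumptions chosen to show the slow, well-behaved, and bouncing regimes.

```python
def run_gd(eta, w=1.0, n_steps=20):
    """Gradient descent on L(w) = w**2, whose gradient is 2*w."""
    for _ in range(n_steps):
        w = w - eta * 2.0 * w
    return w

for eta in (0.001, 0.1, 1.0):
    print(eta, run_gd(eta))
```

With \(\eta = 0.001\) the weight has barely moved from its starting value after 20 steps. With \(\eta = 0.1\) it is close to the optimum at zero. With \(\eta = 1.0\) each update flips the sign of the weight, so it bounces between 1 and -1 without ever approaching zero.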
Supervised Learning
First introduced in b0_lecture
In supervised learning, the objective is to adjust the parameters (weights) of the network such that the output \(y^{(n)}\) of the network is close to the label \(t^{(n)}\) for all training examples \(n\).
Unsupervised Learning
Semi-supervised Learning
Adaptive Learning
Adversarial Learning
Zero-shot learning
Denoising
Distillation
Cross validation
Model Architecture
Autoregressive model
Perceptron
The perceptron is the simplest neural network, consisting of only one neuron. We introduced the perceptron in the b0_lecture.
Here are two equivalent representations of a perceptron with 9 inputs \(\{x_i\}_{i=1}^9\), 9 weights \(\{W_i\}_{i=1}^9\) plus one bias \(W_0\), and using sigmoidal activation.
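A numerical sketch of such a perceptron is given below. It assumes the sigmoidal activation is the linear logistic function, and it uses random, untrained weights and hypothetical binary inputs purely for illustration.

```python
import numpy as np

def perceptron(x, W, W0):
    """Single neuron: activation a = x.W + W0, activity y = logistic(a)."""
    a = np.dot(x, W) + W0
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=9).astype(float)  # 9 hypothetical inputs
W = rng.normal(size=9)                        # 9 weights (random, untrained)
W0 = 0.0                                      # bias

y = perceptron(x, W, W0)
print(y)  # a value between 0 and 1
```

With all inputs set to zero the activation equals the bias, so a bias of zero gives an output of exactly 0.5.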