MCB128: AI in Molecular Biology (Spring 2026)

(Under construction)


block 4: Large Language Models (LLMs)/ LLMs for genomics and proteins

Pytorch code to implement the main methods described in the block, that is transformer-based encoders, decoders and encoder-decoders, can be found here.

Language models

Language models were originally designed to process and understand human language and, as a result, to be able to generate new text as well.

The idea of a “language” has also been extended to biological sequences, and we now talk about the language of DNA, the language of RNA, or the language of genes, amongst others.

Noam Chomsky pioneered work on language models in the 1950s, introducing a hierarchy of formal grammars that describe languages of increasing complexity: regular grammars, context-free grammars (CFGs), context-sensitive grammars, and unrestricted grammars. The assumptions of most biological sequence algorithms correspond to those of regular grammars. For instance, hidden Markov models (HMMs), which we will briefly discuss next and which are used extensively for protein and DNA homology, are probabilistic regular grammars. Stochastic context-free grammars (SCFGs), which are probabilistic CFGs, are used to describe RNA structure.

LMs are probabilistic (generative) models

The term Language Model (LM) is used to describe probabilistic models that are able to generate new examples (sentences) of the language they represent. There is an underlying distribution \(P_{data}\) that the LM tries to approximate by a probability distribution \(P_{\theta}\), which depends on a set of parameters \(\theta\).

The LM gives a parametric representation of the data, where the data could be an actual natural language, or images of a particular object (dogs), or, in our case of interest, biological sequences with different characteristics, such as protein structural families or RNA structural families, amongst others.

Learning

Learning is nothing but finding the set of parameters \(\theta^\ast\) such that \(P_{\theta^\ast}\) best approximates the data distribution \(P_{data}\).

Some of the most important aspects in learning the parameters are

Inference

A trained generative model can be used for many purposes, some of the most obvious and important are

Generative vs discriminative models

The task of a discriminative model is, given a data point, to assign it a label. Generative models, on the other hand, describe a joint probability distribution over both the data and their labels.

For instance, a discriminative model for RNA structure will give you a structure “str” for an input sequence “seq”. A generative model of RNA structure (such as an SCFG) defines a distribution over sequences and structures. From an SCFG, you can sample RNA sequences with structures; in general, a generative model allows you to sample data examples together with their labels.

It is fair to say that a discriminative model also describes a probability, namely the probability of all the possible labels (structures) given one data point (sequence). However, the term generative model is usually reserved for models that can sample the whole data (sequences with structures).

The importance of sampling from a probability distribution

You may have noticed that the terms “generative (probabilistic) model” and “sampling” seem to go hand in hand.

Once you have a probability distribution over a set of random variables \(P(X,Y,Z)\), you have the opportunity to take samples from that distribution. Sampling values according to a probability distribution implies that values with higher probability will be sampled more often than values with very low probability. Any value is a possible sample, but for a large enough sample, the histogram of sampled values will resemble the shape of the probability distribution from which they were sampled. See Figure 1 for an example of the effect of the sample size.


Figure 1. Samples of size 10, 100, 1000, and 10000 from the exponential distribution with lambda = 1.0.

Sampling is very often used to represent the whole distribution. Learning a joint probability distribution, especially when many high-dimensional variables are involved, may be an almost impossible task, but if by some approximate method we can take a representative sample of the distribution \(\{(x_n, y_n, z_n)\}_{n=1}^{N}\),

then arbitrary quantities \(f\) over the probability distribution can be estimated using that sample as

\[\langle f(X,Y,Z)\rangle := \int_{x,y,z} f(x,y,z)\, P(x,y,z) \approx \frac{1}{N}\, \sum_{n=1}^{N} f(x_n, y_n, z_n)\]

This kind of approximation is usually referred to as Monte Carlo sampling.
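As a minimal illustration (a sketch assuming numpy; the choice of \(f(x)=x^2\) is made up for the example), the code below draws samples of increasing size from the exponential distribution of Figure 1 and uses them to estimate an expectation by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)

# draw samples from the exponential distribution with lambda = 1.0 (Figure 1)
for n_samples in (10, 100, 1000, 10000):
    x = rng.exponential(scale=1.0, size=n_samples)
    # Monte Carlo estimate of <f(X)> with f(x) = x^2; the exact value is 2/lambda^2 = 2
    estimate = np.mean(x**2)
    print(f"N = {n_samples:6d}   <x^2> ~= {estimate:.3f}   (exact 2.0)")
```

As the sample size grows, the estimate converges to the exact expectation, mirroring how the histograms in Figure 1 converge to the shape of the distribution.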

Grammar-based LMs

Before we get into transformer-based large language models (LLMs), we are going to consider two cases of grammar-based language models that have been very successful in molecular biology.

Because these LMs are based on a grammar, the grammar already proposes an underlying structure for the language they are trying to describe. In contrast, LLMs are much looser, and one expects them to discover an underlying structure when trained on data from a particular language.

Grammar-based LMs can be trained by maximum likelihood (ML), as there is a direct association between data labels and parameters; with transformer LMs, on the other hand, you need to propose an iterative optimization procedure.

Example1: hidden Markov models (to generate protein sequences)

Hidden Markov models (HMMs) are probabilistic generative methods that generate a language that follows a particular type of grammar.

Hidden Markov models (HMMs) are even simpler than autoregressive models.
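To make the generative view concrete, here is a toy sketch that samples protein-like sequences from a small two-state HMM. The states, transition probabilities, and emission probabilities are invented for illustration; the profile HMMs actually used for homology search have a more structured architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

amino_acids = list("ACDEFGHIKLMNPQRSTVWY")       # 20-letter protein alphabet
# two hypothetical hidden states with made-up transition probabilities
# (rows: from-state, columns: to-state)
trans = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
# made-up emission probabilities over the 20 amino acids, one row per state
emit = np.stack([rng.dirichlet(np.ones(20)),
                 rng.dirichlet(np.ones(20))])

def sample_sequence(length=30):
    """Generate one protein sequence by walking through the hidden states."""
    seq, state = [], 0
    for _ in range(length):
        seq.append(rng.choice(amino_acids, p=emit[state]))   # emit a residue
        state = rng.choice(2, p=trans[state])                # move to the next state
    return "".join(seq)

print(sample_sequence())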

Example2: stochastic context-free grammars (to generate RNA sequences+structures)

Stochastic context-free grammars (SCFGs) are generative models that describe sentences in which there are palindromic relationships; in particular, they are used in molecular biology to describe RNA structure.
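To make the palindromic (nested base-pairing) idea concrete, here is a minimal sketch that samples sequence/structure pairs from a toy SCFG with rules S → aSâ | SS | aS | ε. The rule probabilities and the recursion cutoff are invented for the example; real RNA SCFGs are considerably richer.

```python
import random

random.seed(2)

PAIRS = [("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")]  # allowed base pairs
BASES = "ACGU"

def sample(depth=0):
    """Sample (sequence, dot-bracket structure) from a toy SCFG:
       S -> a S a' (pair) | S S (bifurcation) | a S (unpaired) | empty."""
    r = random.random()
    if depth > 20 or r < 0.20:                 # terminate: empty string
        return "", ""
    if r < 0.55:                               # paired rule: a S a'
        left, right = random.choice(PAIRS)
        seq, ss = sample(depth + 1)
        return left + seq + right, "(" + ss + ")"
    if r < 0.75:                               # bifurcation: S S
        s1, ss1 = sample(depth + 1)
        s2, ss2 = sample(depth + 1)
        return s1 + s2, ss1 + ss2
    seq, ss = sample(depth + 1)                # unpaired base: a S
    return random.choice(BASES) + seq, "." + ss

seq, ss = sample()
print(seq)
print(ss)
```

Each sampled sequence comes with its nested structure, illustrating how a generative model produces data together with labels.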

Large Language models (LLMs)

Useful LLMs are autoregressive

Predicting the joint probability distribution of all the data \(\{x_i\}_{i=1}^{L}\) can be very complicated. However, any joint probability distribution can be expressed with total generality as a product of conditional probabilities for any arbitrary ordering of the data \((x_1, x_2,\ldots, x_L)\)

\[\begin{aligned} P(x_1,\ldots x_L) &= P(x_1)\, P(x_2|x_1)\, P(x_3|x_1 x_2)\ldots P(x_L|x_1 x_2\ldots x_{L-1})\\ &=P(x_1)\prod_{i=2}^L P(x_i|x_1\ldots x_{i-1}) \end{aligned}\]

Autoregressive generative models use that general property to their advantage to simplify the model by specifying each of those conditionals as a parameterized function with a fixed number of parameters.

For instance, we could assume that each of the conditionals describes Gaussian noise as

\[P(x_i|x_1\ldots x_{i-1}) = N(\mu = x_1a_1 + \ldots + x_{i-1}a_{i-1},\, \sigma^2=\epsilon^2).\]
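A small numerical sketch of sampling from this Gaussian autoregressive model (numpy; the coefficients \(a_i\) and the noise level \(\epsilon\) are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

L = 10
a = rng.normal(scale=0.3, size=L)      # made-up autoregressive coefficients a_1 ... a_L
eps = 0.1                              # noise standard deviation

x = np.zeros(L)
for i in range(L):
    mu = np.dot(a[:i], x[:i])          # mean of P(x_i | x_1 ... x_{i-1})
    x[i] = rng.normal(loc=mu, scale=eps)
print(x)
```

Each value is drawn conditioned only on the values generated before it, which is exactly the autoregressive factorization above.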

Transformer-based Large Language models

Transformer-based models come mostly in three types: decoder-only, encoder-only, and encoder-decoder models.

Because they function as text generators (decoders), text classifiers (encoders), and text translators (encoder-decoders), and because of the large number of parameters they involve, these types of models are referred to as large language models or LLMs. These days, transformers are at the heart of almost all LLMs.

We will also study several LLMs used in molecular biology, such as DNABERT, ESM-1b, ProGen2, and Evo.

decoder-only (autoregressive models)

A decoder model is autoregressive. That is, a decoder does not calculate the joint probability distribution \(P(x_1 \ldots x_L)\) directly; its goal is to calculate the probability of the next token \(x_i\) given the tokens before it,

\[p(x_i\mid x_1\ldots x_{i-1}).\]

In Figure 2A we describe a decoder. Pytorch code to implement a transformer-based decoder can be found here.

In a decoder, the goal at training is to find parameters that maximize the log probability of all the residues under the autoregressive parameterization for the sequences in the training set. This objective is just a reformulation of the cross-entropy loss,

\[\mbox{max} \prod_i P(x_i\mid x_{1:i-1}) = \mbox{max} \left(\sum_i \log P(x_i\mid x_{1:i-1})\right) \Leftrightarrow \mbox{min} \left(-\sum_i \log P(x_i\mid x_{1:i-1})\right) = \mbox{min}\, Loss(x_{1:L})\]
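In PyTorch this objective is just the standard cross-entropy between the decoder's per-position logits and the observed tokens. A minimal sketch (the tensor shapes and names are assumptions for illustration, not the course code linked above):

```python
import torch
import torch.nn.functional as F

# logits: (batch, L, V) where logits[:, i, :] parameterizes P(x_i | x_1 ... x_{i-1})
# tokens: (batch, L) integer token ids
def autoregressive_loss(logits, tokens):
    # cross-entropy = -(1/N) sum_i log P(x_i | x_{1:i-1}); minimizing it is the
    # same as maximizing the log probability of the sequences under the model
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), tokens.reshape(-1))

# toy usage with random numbers standing in for a decoder's output
B, L, V = 2, 8, 50
print(autoregressive_loss(torch.randn(B, L, V), torch.randint(0, V, (B, L))).item())
```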

Figure 2. (A) A decoder network. Input is an embedding for a DNA sequence, output is a probability for each residue based on all residues before. (B) An encoder network. Input is an embedding for a DNA sequence with masked residues, output is a probability for the masked residues.

masked self-attention

The objective when training a decoder is to maximize the log probability of each token given only the tokens before it. In a transformer, the self-attention layer is the one responsible for the attention between tokens. In vanilla self-attention, all tokens interact with all other tokens. For the decoding task, we instead force the attention to the tokens on the right (the future tokens) to be zero. This is referred to as masked self-attention.

Masked self-attention is very simply implemented by adding a mask to the attention scores (before the softmax) such that

\[M(i,j) = \begin{cases} 0 & \mbox{if}\ i > j\\ -\infty & \mbox{if}\ i \leq j \end{cases}\]

then

\[\mbox{masked-Attn}(i,j) = \mbox{softmax}_j\left(\mbox{score}(q_i\cdot k_j) + M_{ij}\right) = \begin{cases} \mbox{softmax}_{j<i}\left(\mbox{score}(q_i\cdot k_j)\right) & \mbox{if}\ i > j\\ 0 & \mbox{if}\ i \leq j\\ \end{cases}\]

Thus, masked attention is unchanged for \(i>j\), but there is no attention to the tokens ahead (\(i\leq j\)).
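A minimal PyTorch sketch of adding such a mask to the attention scores, following the convention above in which position \(i\) attends only to strictly earlier positions \(j < i\):

```python
import torch

def causal_mask(L):
    """M(i, j) = 0 if i > j, -inf if i <= j (attention only to strictly earlier tokens)."""
    return torch.triu(torch.full((L, L), float("-inf")), diagonal=0)

# toy scores for a sequence of length 5; in practice these are q_i . k_j / sqrt(d)
scores = torch.randn(5, 5)
weights = torch.softmax(scores + causal_mask(5), dim=-1)

# the first position has no tokens to its left, so its row is all -inf and the
# softmax returns NaN there; in practice a beginning-of-sequence token takes that slot
print(weights[1:])   # remaining rows: proper distributions over strictly earlier positions
```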

The decoder autoregressive task guarantees that each residue contributes to the loss, but it has the limitation that it only considers the left context of each residue (or word). See Figure 2A.

generating more tokens

Since the decoder provides as final output the probability distribution of the next token after a sequence has been seen, it allows us to sample that next token. The next token could be chosen as the one with the maximum probability, or it could be sampled from that probability distribution. The new extended sequence can then be fed back to the decoder to generate another token. This is the fundamental mechanism of any chatbot.
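A minimal sketch of this generation loop, assuming a hypothetical decoder `model` that returns next-token logits of shape (1, L, V) (this interface is an assumption for illustration, not the linked course code):

```python
import torch

def generate(model, prompt_ids, n_new_tokens=20, temperature=1.0, greedy=False):
    """Extend a token sequence one token at a time with a decoder model."""
    tokens = prompt_ids.clone()                       # shape (1, L)
    for _ in range(n_new_tokens):
        logits = model(tokens)[0, -1, :]              # logits for the next token
        if greedy:
            next_id = logits.argmax()                 # pick the max-probability token
        else:
            probs = torch.softmax(logits / temperature, dim=-1)
            next_id = torch.multinomial(probs, 1)[0]  # sample from the distribution
        tokens = torch.cat([tokens, next_id.view(1, 1)], dim=1)  # feed back extended seq
    return tokens

# toy usage with a stand-in "model" that returns random logits over a vocabulary of 4
V = 4
dummy_model = lambda toks: torch.randn(1, toks.size(1), V)
print(generate(dummy_model, torch.tensor([[0, 1, 2]]), n_new_tokens=5))
```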

decoder example: GPT3

GPT3 is an LLM that applies the decoder mechanism at large scale. Here are some of its numbers:

vocabulary V 50,257 tokens
embedding D 12,288
Transformer layers K 96
Transformer heads per layer h 96
query, key, value dimension per head D_h 128
FF hidden dimension d_ff 49,152
max length of input Lmax 2,048 tokens
training words 300 billion tokens

encoder-only

The goal of an encoder is to learn some general information about the statistics of the language it is describing. We are going to consider encoders based on a transformer architecture, described in Figure 2B.

Pytorch code to implement a transformer-based encoder can be found here.

There are two phases in training an encoder: pre-training and fine-tuning.

encoder pre-training

In Figure 2B we describe an encoder. In an encoder, the goal during pre-training is to find parameters that maximize the log probability of the masked residues in the sequences of the training set. This objective is just a reformulation of the cross-entropy loss,

\[\mbox{max} \prod_{i\in mask} P(x_i\mid x_{\notin mask}) = \mbox{max} \left(\sum_{i\in mask} \log P(x_i\mid x_{\notin mask})\right) \Leftrightarrow \mbox{min} \left(-\sum_{i\in mask} \log P(x_i\mid x_{\notin mask})\right) = \mbox{min}\, Loss(x_{masked})\]

Because encoder pre-training does not require labeled data, just the masking of tokens (residues), it is usually done on very large datasets.

The encoder considers both left and right context for each residue (word), but it has the limitation that it does not make very efficient use of the data as only masked residues contribute to the loss.
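A minimal PyTorch sketch of the masked pre-training loss, in which only the masked positions contribute (the tensor shapes and the 15% masking rate are assumptions following the BERT convention):

```python
import torch
import torch.nn.functional as F

def masked_lm_loss(logits, tokens, mask):
    """logits: (B, L, V) encoder outputs; tokens: (B, L) original token ids;
    mask: (B, L) boolean, True at the positions that were masked out."""
    # cross-entropy over the masked positions only
    return F.cross_entropy(logits[mask], tokens[mask])

# toy usage with random numbers standing in for an encoder's output
B, L, V = 2, 16, 30000
tokens = torch.randint(0, V, (B, L))
mask = torch.rand(B, L) < 0.15          # ~15% of positions masked
print(masked_lm_loss(torch.randn(B, L, V), tokens, mask).item())
```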

encoder fine tuning for specific tasks

A pre-trained encoder produces an output embedding in some arbitrary dimension that hopefully has captured statistical information about the language being learned (English or DNA). A pre-trained encoder can later be further trained (fine-tuned) for particular supervised tasks using much smaller datasets of labeled data.
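A minimal sketch of such a fine-tuning setup: a small task-specific head placed on top of the pre-trained encoder's embeddings. The class name, the mean-pooling choice, and the stand-in encoder are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    """Pre-trained encoder + a small linear head for a supervised task."""
    def __init__(self, pretrained_encoder, embed_dim, n_classes):
        super().__init__()
        self.encoder = pretrained_encoder          # outputs (B, L, embed_dim)
        self.head = nn.Linear(embed_dim, n_classes)

    def forward(self, tokens):
        h = self.encoder(tokens)                   # contextual embeddings
        pooled = h.mean(dim=1)                     # simple mean-pooling over positions
        return self.head(pooled)                   # logits for the sequence-level label

# toy usage with a stand-in encoder that returns random embeddings
dummy_encoder = lambda toks: torch.randn(toks.size(0), toks.size(1), 64)
clf = SequenceClassifier(dummy_encoder, embed_dim=64, n_classes=2)
print(clf(torch.randint(0, 100, (4, 32))).shape)   # -> torch.Size([4, 2])
```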

encoder example: BERT

BERT stands for Bidirectional Encoder Representations from Transformers. BERT is an encoder-only model based on transformer layers, as described in Figure 2B. BERT is meant to be trained on natural languages, mostly English, but there are many follow-ups trained for different languages (or multilingual).

“Bidirectional” indicates that it is using a full attention block which explores attention from both left and right context (unlike decoders which by using masked attention only explore the left context).

The main specifications of BERT are

vocabulary V 30,000 tokens
embedding D 1,024
Transformer layers K 24
Transformer heads per layer h 16
query, key, value dimension per head D_h 64
FF hidden dimension d_ff 4,096
max length of input Lmax 512 tokens
pre-training steps 1,000,000
pre-training epochs 50
pre-training words 3.3 billion

Some of the fine-tuning tasks explored by BERT are: sentence classification (positive, negative, informative…) or word classification (a place, a person, a verb…).

DNABERT is an almost direct reimplementation of BERT for DNA sequences that we discuss below.


Figure 3. Vaswani et al., 2017 encoder-decoder for sequence translation.

encoder-decoder models (translation)

The 2017 Vaswani et al. paper “Attention is all you need” introduces an encoder-decoder for sequence translation using transformers. In fact, this is the manuscript that introduced the concept of a “transformer block” based on the attention mechanism. See Figure 3.

Pytorch code to implement a transformer-based encoder-decoder can be found here.

The Vaswani encoder is the same as the transformer encoder described in Figure 2B. The Vaswani decoder has all the elements of the transformer decoder described in Figure 2A, using masked self-attention. In addition, the Vaswani decoder includes one more multi-head attention block in which the keys and values come from the output of the encoder, and the queries come from the decoder. This additional layer is referred to as cross-attention.
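A minimal single-head sketch of the cross-attention layer, where the queries come from the decoder states and the keys and values from the encoder output (the class and variable names are assumptions for illustration):

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head cross-attention: queries from the decoder, keys/values from the encoder."""
    def __init__(self, d_model):
        super().__init__()
        self.Wq = nn.Linear(d_model, d_model)
        self.Wk = nn.Linear(d_model, d_model)
        self.Wv = nn.Linear(d_model, d_model)

    def forward(self, dec_h, enc_h):
        q = self.Wq(dec_h)                               # (B, L_dec, d) from the decoder
        k = self.Wk(enc_h)                               # (B, L_enc, d) from the encoder
        v = self.Wv(enc_h)                               # (B, L_enc, d) from the encoder
        scores = q @ k.transpose(1, 2) / k.size(-1)**0.5 # (B, L_dec, L_enc)
        return torch.softmax(scores, dim=-1) @ v         # (B, L_dec, d)

# toy usage: decoder of length 5 attending to an encoder output of length 9
xattn = CrossAttention(d_model=32)
print(xattn(torch.randn(2, 5, 32), torch.randn(2, 9, 32)).shape)   # -> torch.Size([2, 5, 32])
```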

Consider the case of an English-Spanish encoder-decoder translating the sentence “I am the boss” from English to Spanish (Figure 4).


Figure 4. (Top, training) The source sequence in English "I am the boss" and the target sequence in Spanish "yo soy la jefa" are used to train an encoder-decoder for sequence-to-sequence translation. (Bottom, inference) The sequence "I am the boss" is passed to the encoder-decoder to get a translation in Spanish. MHA = multi-head attention. FF=Feed Forward network. A&N = add and normalize.

LLM evaluation:

Decoders: perplexity

It is expected that the smaller the loss, the better the model. However, losses cannot be directly compared between different methods.

Perplexity offers a way to compare the losses of two different models, provided that they use the same vocabulary (or number of tokens).

Perplexity is defined as

\[PPL(x_{1:L}) = 2^{\frac{1}{L} Loss(x_{1:L})} = 2^{-\frac{1}{L}\sum_i \log P(x_i\mid x_{1:i-1})} = \left( \prod_i P(x_i\mid x_{1:i-1}) \right)^{-1/L},\]

The lower the perplexity, the better. Intuitively, perplexity measures the number of tokens that you are hesitating between.
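A minimal sketch of computing perplexity from the per-token loss (the tensor shapes are assumptions; PyTorch's cross-entropy uses the natural logarithm, so the exponential must use base e, which gives the same value as the base-2 definition above):

```python
import torch
import torch.nn.functional as F

def perplexity(logits, tokens):
    """logits: (B, L, V); tokens: (B, L). Returns exp of the mean negative
    log-likelihood per token (natural log, matching torch's cross_entropy)."""
    nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)), tokens.reshape(-1))
    return torch.exp(nll)

# toy usage: a random model over a vocabulary of 50 tokens has perplexity close to 50
B, L, V = 2, 8, 50
print(perplexity(torch.randn(B, L, V), torch.randint(0, V, (B, L))).item())
```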

Encoders: evaluate downstream tasks

The evaluation of an encoder’s performance is usually done by evaluating the accuracy of the downstream tasks built on top of the encoder. When evaluating performance on any of the encoder’s downstream tasks, it is important to separate the data into a training set and a test set that are not similar to each other; otherwise, performance on the test set will not be representative of the general performance on data that is not similar to the training set. This form of overfitting is also referred to as data leakage.

For biological sequences, data leakage can occur not just because the sequences in the training set share sequence similarity with those in the test set, but also when they share other characteristic properties of the molecules, such as structure for RNAs or proteins. When fine-tuning an encoder to perform structure prediction, either for proteins or RNA, it is important that the sequences in the training and test sets are not just sequence-dissimilar but also structurally dissimilar. Protein sequences can be as low as 20% similar to each other in sequence but still share the same 3D fold.

LLMs in genomics

DNABERT, an encoder to model DNA in genomes

The DNABERT method, Ji et al., 2021, is a quite direct adaptation of the BERT encoder for natural languages to describe the language of DNA in genomes.


Figure 5. DNABERT model with k=3 k-mers. Adapted from Figure 1b from Ji et al. 2021.

Tokenization

Until now, we have assumed that the tokenization of a biological sequence would be by residue: nucleotides or amino acids, so that a DNA or RNA alphabet would have dimension 4, and a protein alphabet would have dimension 20. DNABERT introduces a different approach: the sequence is tokenized into overlapping k-mers.

The total size of the k-mer alphabet is \(4^k+5\): the \(4^k\) possible k-mers plus 5 special tokens.

Then, the sequence “AGCTGA” according to the 3-mer alphabet includes the tokens: [CLS], AGC, GCT, CTG, TGA, [SEP].

Pytorch code to implement a k-mer tokenization alphabet can be found here.
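As an illustration (a simple sketch, not the linked code, with the [CLS]/[SEP] special tokens following BERT's convention), overlapping k-mer tokenization of a DNA sequence can be written as:

```python
def kmer_tokenize(seq, k=3):
    """Split a DNA sequence into overlapping k-mers, flanked by special tokens."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return ["[CLS]"] + kmers + ["[SEP]"]

print(kmer_tokenize("AGCTGA", k=3))   # ['[CLS]', 'AGC', 'GCT', 'CTG', 'TGA', '[SEP]']
```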

Pre-training

DNABERT pre-training is very similar to that of BERT: self-supervised training after masking \(15\%\) of the tokens in each sequence.

DNABERT uses the same model architecture as BERT (described in Figure 5). Below we give a comparison of the values of the parameters.

model comparison                              BERT              DNABERT-k
vocabulary V                                  30,000 tokens     \(4^k + 5\) tokens
token size k                                  -                 k-mers: 3,4,5,6
input embedding D                             1,024             768
Transformer layers K                          24                12
Transformer heads per layer h                 16                12
query, key, value dimension per head D_h      64                64
FF hidden dimension d_ff                      4,096             3,072
max length of input Lmax                      512 tokens        512 tokens
pre-training steps                            1,000,000         -
pre-training epochs                           50                -
pre-training words                            3.3 billion       -

Fine tuning

For each of the downstream tasks, DNABERT starts from the pre-trained parameters, and uses some task-specific data for further training (fine tuning).

Some of the tasks that DNABERT investigates are:

ESM-1b (ESM-2), an encoder to model proteins

ESM-1b and ESM-2 are models that explore self-supervised language modeling applied to unlabeled amino acid sequences. They are transformer-based encoders trained on up to 250 million protein sequences. Similar to DNABERT for DNA/RNA, ESM-1b and its second-generation version ESM-2 train standard transformer blocks by masked self-supervision. They are trained to predict the identity of amino acids that have been randomly masked out of protein sequences.


Figure 6. ESMFold architecture. Adapted from Figure 2A, from Lin et al., 2023.

fine-tuning task: protein structure prediction

ESM-2 embeddings are used as inputs to the method ESMFold, described in Figure 6. ESMFold, similar to AlphaFold2, predicts the 3D structure of an input protein. The input protein sequence is processed through the ESM-2 language model, and the final ESM-2 representation is passed to the folding head.

ProGen2, an autoregressive decoder for modeling proteins

ProGen2 (Figure 7) is an example of a decoder for generating novel protein sequences.


Figure 7. ProGen2 decoder for protein modeling. Adapted from Figure 1A, 1E from Nijkamp et al. 2023.

ProGen2 uses the same autoregressive decoder-only paradigm introduced by GPT3 that we have just described before (Figure 2).

Evo: a decoder (beyond transformers) for genomic sequences

Evo is a decoder that, given a genomic sequence \(x_1\ldots x_N\), predicts the probability of the next nucleotide \(P(x_{N+1}\mid x_1\ldots x_N)\) by means of a neural network. The Evo architecture does not consist solely of transformers. In Evo, transformer layers are interspersed with a different architecture named StripedHyena, as depicted in Figure 8.

The difference between a transformer block and a StripedHyena block resides in replacing the final feed-forward layer of the transformer with a Hyena gated block, which includes convolutions, gating, and residual connections.

Transformer block:


Figure 8. EVO. Adapted from Figure 1b from Nguyen et al. 2024.

Striped Hyena block:


Figure 9. EVO. Adapted from Figure 1F from Nguyen et al., 2024.
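As a very loose illustration of the "convolution + gating + residual" idea, here is a toy block combining a causal depthwise convolution with multiplicative gating and a residual connection. This is an invented simplification for intuition only, not the actual Hyena operator or the StripedHyena block.

```python
import torch
import torch.nn as nn

class ToyGatedConvBlock(nn.Module):
    """Toy block mixing a causal 1D convolution with gating and a residual connection.
    A loose illustration only; the real Hyena operator is more elaborate."""
    def __init__(self, d_model, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size - 1, groups=d_model)
        self.gate = nn.Linear(d_model, d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):                                   # x: (B, L, d_model)
        L = x.size(1)
        # causal depthwise convolution: extra padding plus trimming so that
        # output position t only mixes input positions <= t
        h = self.conv(x.transpose(1, 2))[:, :, :L].transpose(1, 2)
        h = h * torch.sigmoid(self.gate(x))                 # element-wise gating
        return x + self.proj(h)                             # residual connection

block = ToyGatedConvBlock(d_model=32)
print(block(torch.randn(2, 50, 32)).shape)   # -> torch.Size([2, 50, 32])
```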

Evo is trained on millions of prokaryotic and phage genomes. Thus, Evo should learn foundational properties of DNA, RNA and also proteins, through the information contained in the mRNA (protein-coding) genomic sequences in the form of the triplets of nucleotides forming the codons from which the protein sequences are obtained. Evo reaches 7 billion parameters, trained with a context length of up to 131,072 tokens, using single-nucleotide, byte-level tokenization.

The objective is for Evo to learn functional properties of regulatory DNA, non-coding RNAs, and proteins. The paper describes several examples of “zero-shot function prediction across DNA, RNA, and protein modalities”. What does this mean?