Blog 1: Protein Modelling Pt.1 - From Similarity to Structure Prediction
Introduction
Interdisciplinary science is beautiful science
Biology has typically been the least quantitative of the sciences. However, over recent decades the surge in sequencing data, made possible by next-generation sequencing, has enabled the application of statistical and machine learning methods to biology. In this blog I will describe what first got me into computational biology and what helped power the first drastic improvements in protein structure prediction introduced by AlphaFold2[ref]. Specifically, I will discuss what is now called evolutionary coupling analysis or Direct Coupling Analysis (DCA) and focus on several key papers that formulated this fascinating application of statistical mechanical principles to biology.
Contents
What is Protein Modelling?
Starting from the very top, what are proteins? Proteins are molecular machines which, at the simplest level, can be characterised by their molecular composition via their amino acid sequence - a one-dimensional string of single-letter codes, one for each amino acid molecule (residue) making up the complete molecular structure of the protein. In some ways, this is the most primitive “model” of a protein we have. It tells us the exact chemical composition of that protein, yet we can’t even infer its exact genetic sequence from it without further information (due to the redundancy of the amino acid codons).
Let’s level up our model. Commonly across biology we are given protein sequences without any knowledge of the function or properties of the protein. To resolve this we have developed various methods for comparing protein sequences in an effort to determine their degree of similarity, and thus infer function from similarity to proteins whose function is known. Naturally this requires a quantitative approach, which has come in the form of sequence alignments. At the simplest level there’s the pairwise alignment, which simply counts the number of mismatches between two sequences of the same length. For sequences of different lengths things get a bit more complicated, but we can save that for another blog. Further sophistication comes from scoring mismatches differently depending on how chemically related the amino acid pair is, using a substitution matrix (a popular one being the BLOSUM62 matrix), as sketched below.
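To make this concrete, here is a minimal sketch of scoring two toy, already-aligned sequences of equal length: first by counting mismatches, then with the BLOSUM62 substitution matrix loaded through Biopython (assumed to be installed); the sequences themselves are made up for illustration.

```python
from Bio.Align import substitution_matrices

def mismatch_count(seq_a: str, seq_b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    assert len(seq_a) == len(seq_b)
    return sum(a != b for a, b in zip(seq_a, seq_b))

def blosum62_score(seq_a: str, seq_b: str) -> float:
    """Sum of BLOSUM62 scores over the aligned (ungapped) positions."""
    blosum62 = substitution_matrices.load("BLOSUM62")
    return sum(blosum62[a, b] for a, b in zip(seq_a, seq_b))

seq1 = "HEAGAWGHEE"
seq2 = "HEAGAWGHEQ"
print(mismatch_count(seq1, seq2))  # 1 mismatch: E vs Q at the final position
print(blosum62_score(seq1, seq2))  # a chemically conservative mismatch is penalised only mildly
```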
At all levels these are models of a protein, in this case providing functionality via the ability to quantify relatedness between sequences. Taking a step further we encounter Hidden Markov Models, which are a particularly powerful model of proteins, again describing similarity but this time probabilistically, allowing us to determine the likelihood that a given protein sequence is related to a collection of others (a family of homologs) - this is called homology modelling. What I’m going to describe in this blog is what I see as the next step up in protein modelling, the Potts Model. I’ll describe its emergence from a beautiful application of statistical mechanics by way of the Ising Model, and how its functionality extends beyond just similarity/homology modelling to generative protein design and 3D structure prediction.
Pairwise Modelling
The models we discussed earlier consider each residue in the protein sequence independently of the others. In reality, every residue in the protein is likely influenced by (and influences) every other residue. This is particularly evident in the case of epistasis - a phenomenon in which the effect of mutating a given residue (on stability, fitness and other protein properties) differs drastically depending on whether or not a second residue is also mutated.
This implies a level of inter-residue dependence in protein sequences. Drawing from probability theory we get a nice mathematical formalisation for this. If two variables are independent:
$$ P(A,B) = P(A)P(B) $$
Epistasis is exactly a violation of this kind of independence, which we can write as:
$$ E(\text{seq}^{mut1,mut2}) \neq E(\text{seq}^{mut1})\,E(\text{seq}^{mut2}) $$
So in the case of proteins we can quantify the degree of dependence between two positions $i$ and $j$ by their Mutual Information (MI), which compares the observed joint distribution of residues at the two positions against the product of the single-position distributions (what independence would predict):
$$ \text{MI}_{ij} = \sum_{a,b} P(\text{seq}_{i} = a, \text{seq}_{j} = b) \log \frac{P(\text{seq}_{i} = a, \text{seq}_{j} = b)}{P(\text{seq}_{i} = a)\,P(\text{seq}_{j} = b)} $$
This is zero exactly when the two positions are independent and grows as they become more strongly coupled.
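As a rough sketch of how this is estimated in practice, the snippet below (assuming numpy) computes the mutual information between two columns of a tiny, made-up alignment of related sequences; the alignment, the alphabet ordering and the pseudocount are all illustrative choices.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids + gap, q = 21

def column_mi(msa: list[str], i: int, j: int, pseudocount: float = 1e-6) -> float:
    """Mutual information between alignment columns i and j."""
    q = len(ALPHABET)
    idx = {a: k for k, a in enumerate(ALPHABET)}
    fij = np.full((q, q), pseudocount)
    for seq in msa:
        fij[idx[seq[i]], idx[seq[j]]] += 1.0
    fij /= fij.sum()                 # joint frequencies f_ij(a, b)
    fi = fij.sum(axis=1)             # marginal f_i(a)
    fj = fij.sum(axis=0)             # marginal f_j(b)
    return float(np.sum(fij * np.log(fij / np.outer(fi, fj))))

msa = ["AC-DA", "ACEDA", "GC-DG", "GCEDG"]   # toy alignment: columns 0 and 4 co-vary
print(column_mi(msa, 0, 4))  # high MI: column 4 mirrors column 0
print(column_mi(msa, 0, 3))  # ~0: column 3 is constant
```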
This is useful because, just like HMMs, we want a probabilistic formalism for our protein model: one whose parameters we can determine statistically from data, that provides a degree of certainty, and that is generative (but we will discuss that later). Starting at a basic level, since we have touched on the dependency between residues, we can look at modelling a protein as a multi-dimensional distribution over the residues in its sequence, which we write as a joint distribution over the whole sequence where each position is a variable $x_i$ taking on one of the 20 amino acids or a gap character: $$ P(\text{seq}) = P(x_1, \dots, x_n) $$ where $n$ is the length of the sequence. So we now have a complex joint probability distribution over a sequence composed of a number of dependent variables. Following the chain rule of probability we can decompose this joint distribution into a product of conditional probabilities: $$ P(\boldsymbol{x}) = \prod_{i=1}^{n} P(x_i \mid x_1, \dots, x_{i-1}) $$ We can also marginalise over the rest of the sequence to get the probability of any single residue: $$ P(x_i = a) = \sum_{\boldsymbol{x}_{-i}} P(\boldsymbol{x}) $$ where the sum runs over all configurations $\boldsymbol{x}_{-i}$ of the remaining positions.
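As a tiny illustration of these manipulations, the sketch below (numpy assumed) builds an arbitrary joint distribution over a toy “sequence” of three positions with a two-letter alphabet and marginalises out all but one position; the point is that the explicit joint table has $q^n$ entries, which is why this representation is hopeless for real proteins.

```python
import numpy as np

q, n = 2, 3
rng = np.random.default_rng(0)

# An arbitrary joint distribution P(x_1, x_2, x_3), stored as an explicit q x q x q table.
P = rng.random((q, q, q))
P /= P.sum()

# P(x_2 = a): marginalise out positions 1 and 3 by summing over their axes.
P_x2 = P.sum(axis=(0, 2))
print(P_x2, P_x2.sum())  # a valid distribution over the 2 letters, summing to 1

# The explicit table has q**n entries; for a real protein (q = 21, n in the hundreds)
# that is astronomically many, so we need a parameterised model instead.
print(q ** n)  # 8 here, but 21**200 has over 260 digits
```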
This probabilistic formalism can be represented as an undirected graph and is known in machine learning as a Markov Random Field (MRF). Every variable (in our case residue position) is represented as a node in a graph, with edges connecting residues that are dependent. For reasons that will become clearer later, every node in this graph is connected to every other, meaning that each residue exhibits some non-zero dependence on every other residue in the sequence. This also makes the graph particularly difficult to solve exactly for any realistically sized protein. The problem of solving (or approximating) this so-called Potts model for different proteins is the challenge addressed by the papers I will discuss in this blog series.
Learning the Potts Model
To keep this first blog post to a reasonable size I will briefly discuss the first attempt to approximate this model for proteins, then follow up on more established methods in the next.
How do we parameterise a probabilistic model? Mathematically, parameterisation means defining a functional form for our probability distribution from which we can adjust the parameters controlling the characteristics of the distribution to best fit some data.
We previously mentioned the vast amount of sequence data available since the explosion in high-throughput sequencing - this data is already used to parameterise HMMs. HMMs are typically built from alignments of similar sequences (multiple sequence alignments) by considering each individual column in the alignment (which corresponds to a distinct position in the query sequence unless gapped) and computing empirical frequencies from which we infer the HMM parameters. I will hopefully have a separate post describing HMMs in more detail later - the point is that we parameterise our models with empirical data.
Of course, this means that the data we use has a significant impact on the parameterisation and therefore the accuracy of our model. We want our model to capture the rules that define our sequences and their relationships with each other, so the more data we can obtain the more accurately our distribution can be parameterised. Although standard sequence alignments are quite capable of collecting up to thousands of sequence homologs for a given query sequence, for the Potts model we really want on the order of tens of thousands. One of the benefits of HMMs over standard sequence alignment is that they are better at identifying sequences that are likely functionally or structurally similar (homologues) without those sequences necessarily being similar in exact composition (the basis of typical alignment methods), meaning we can capture much more information on the query protein by collecting more of its related (often evolutionarily distant) sequences together. So the first step of learning our model is collecting all the related sequences returned for the query and stacking them into a large multiple sequence alignment.
Parameterising the Potts model
Let’s look at this from a frequentist approach. We have tens of thousands of homologous protein sequences, where the alignment algorithm has inserted gaps to represent insertions/deletions so that all sequences (rows) are the same length. This is a matrix with $m$ rows (the sequences in the alignment) and $n$ columns (the positions of the gapped query sequence). The classic frequentist approach would be to compute the per-residue statistics at each column, $f_i(a)$, where $i$ is the column index and $a$ is the residue, taking values $a \in \{1,\dots,q\}$ with $q$ being the size of the alphabet, typically 21 - the expectation being that proteins from the same family have similar alignments and therefore similar per-column single-site frequencies. The more basic Position-Specific Scoring Matrix (PSSM) uses these to compute a log-odds score by comparing against the background frequency of each amino acid across all proteins.
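A rough sketch (numpy assumed) of what this looks like: the toy alignment, the pseudocount and the uniform background below are all illustrative, not taken from any real protein family.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # q = 21: the 20 amino acids plus the gap character
IDX = {a: k for k, a in enumerate(ALPHABET)}

def single_site_frequencies(msa: list[str], pseudocount: float = 0.5) -> np.ndarray:
    """Return an (n, q) array of per-column residue frequencies f_i(a)."""
    n, q = len(msa[0]), len(ALPHABET)
    counts = np.full((n, q), pseudocount)
    for seq in msa:
        for i, a in enumerate(seq):
            counts[i, IDX[a]] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

msa = ["AC-DA", "ACEDA", "GC-DG", "GCEDG"]  # toy alignment: 4 sequences, 5 columns
f = single_site_frequencies(msa)

# PSSM-style log-odds score against a (here uniform) background distribution.
background = np.full(len(ALPHABET), 1.0 / len(ALPHABET))
pssm = np.log(f / background)
print(pssm[1, IDX["C"]])  # column 1 is all C, so C scores well above background there
```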
This is a valid model but it doesn’t capture any of the pairwise dependencies we are interested in. So how about generating a second matrix (or in this case tensor) which captures the pairwise statistics $f_{ij}(a,b)$ - the frequency with which residue $a$ appears at position $x_i$ whilst residue $b$ is present at position $x_j$ (sketched below). Graphical models provide a powerful way of handling such large pairwise probabilistic models by employing a system of nodes and edges, where an edge links two nodes that are conditionally dependent (as in $P(A,B) \neq P(A)P(B)$). We can then configure the graph so that independent variables are not linked by edges, and design algorithms capable of computing the joint distribution without marginalising over the entire graph.
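A short continuation of the previous sketch (reusing the ALPHABET, IDX and toy msa defined above) that builds the pairwise frequency tensor:

```python
import numpy as np

def pairwise_frequencies(msa: list[str], pseudocount: float = 0.5) -> np.ndarray:
    """Return an (n, n, q, q) tensor of pairwise frequencies f_ij(a, b)."""
    n, q = len(msa[0]), len(ALPHABET)
    counts = np.full((n, n, q, q), pseudocount)
    for seq in msa:
        ks = [IDX[a] for a in seq]
        for i in range(n):
            for j in range(n):
                counts[i, j, ks[i], ks[j]] += 1.0
    return counts / counts.sum(axis=(2, 3), keepdims=True)

fij = pairwise_frequencies(msa)
print(fij.shape)                      # (5, 5, 21, 21) for the toy alignment
print(fij[0, 4, IDX["A"], IDX["A"]])  # how often A at column 0 co-occurs with A at column 4
```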
However our protein model cannot make assumptions about the independence of residue pairs, since these inter-residue dependencies are exactly what we are trying to determine. One approach is the Maximum Entropy approach, where we take the simplest possible parameterisation and fit our parameters to satisfy the empirical data in the multiple sequence alignment (the single-site and pairwise frequencies): $$ P(x) = \frac{1}{Z} \exp \left(\sum_{i=1}^n h_i(x_i) + \sum_{1\leq i<j \leq n} J_{ij}(x_i,x_j)\right) $$ This is the formal Potts model; let’s take it apart bit by bit. Inside the brackets there are two summation terms. The first contains our single-site parameters $h_i(x_i)$: for each column $i$, a function that takes a possible amino acid and returns a value. The second contains our pairwise parameters $J_{ij}(x_i,x_j)$: a function for each unique pair of positions that takes an amino acid for position $i$ and one for position $j$. Notice the subscript of the second summation: the pairwise parameters are symmetric, $J_{ij}(a,b) = J_{ji}(b,a)$, so we sum only over unique pairs, which is what the subscript dictates. $Z$ is just a normalising factor that makes sure the resultant model is a valid probability distribution summing to 1. In reality, $Z$ is our biggest problem with this approach, as it represents a summation over all $q^n$ possible sequences.
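As a concrete (if toy) illustration, the sketch below (numpy assumed) evaluates the Potts model for one encoded sequence using random placeholder parameters; in practice $h$ and $J$ would be learned from an alignment. Note that only the unnormalised probability is computed, since $Z$ would require summing over all $q^n$ sequences.

```python
import numpy as np

q, n = 21, 5
rng = np.random.default_rng(0)
h = rng.normal(size=(n, q))            # single-site fields h_i(a)
J = rng.normal(size=(n, n, q, q))
J = (J + J.transpose(1, 0, 3, 2)) / 2  # enforce the symmetry J_ij(a, b) = J_ji(b, a)

def negative_energy(x: np.ndarray) -> float:
    """Sum of h_i(x_i) plus J_ij(x_i, x_j) over unique pairs i < j."""
    fields = sum(h[i, x[i]] for i in range(n))
    couplings = sum(J[i, j, x[i], x[j]] for i in range(n) for j in range(i + 1, n))
    return fields + couplings

x = np.array([0, 3, 20, 7, 1])     # a sequence encoded as indices into the alphabet
print(np.exp(negative_energy(x)))  # unnormalised probability, i.e. Z * P(x)
```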
This form is particularly interesting because it is exactly the form of something called the Ising model (pronounced ees-ing model, as I’m told). The Ising model emerged from condensed matter physics, where it was developed to statistically model lattices of particles exhibiting either up or down spins, and it is capable of capturing non-local interactions between those particles. The Potts model is essentially an extension of the Ising model beyond binary variables. In other words, we have a model that specialises in capturing long-range dependencies between residues in a protein sequence: epistasis!
Without any constraints the maximum entropy distribution is simply uniform, so that every residue appears with probability $1/q$ at each column and every residue pair with probability $1/q^2$ at each column pair. We then “learn” the optimal values of $h_i$ and $J_{ij}$ from our sequence alignments so that the parameters satisfy: $$ \sum_x P(x)\, \delta(x_i = a) = f_i(a) $$ $$ \sum_x P(x)\, \delta(x_i = a, x_j = b) = f_{ij}(a,b) $$ where the Kronecker delta function $\delta(\cdot)$ is equal to 1 if the condition in its argument is true and 0 otherwise. In other words, we are free to choose parameters so long as the respective single-site and pairwise marginals are reproduced. If we used our model to generate a number of homologous sequences (discussed later), the empirical frequencies of the newly generated sequence alignment would match those of the original training data. This is possible because there is a certain degree of overparameterisation in this model. Specifically, we have $nq$ single-site parameters and $\binom{n}{2} q^2 \approx \frac{q^2 n^2}{2}$ pairwise parameters. For a protein of 200 residues this is 4,200 + 8,775,900 ≈ 8.8 million parameters, whereas we typically work with multiple sequence alignments of order $10^4$ sequences, providing far fewer constraints than parameters.
Realistically, many distinct parameterisations give the same probability distribution - imagine shifting all $h_i$ values by some constant, say 0.1, in the same direction: the distribution stays the same but the parameter values differ. To remove this redundancy we exploit the gauge invariance of the model and fix a gauge. A common choice is the zero-sum (or Ising) gauge:
$$ \sum_{a=1}^q h_i(a) = 0 \quad \forall i $$ $$ \sum_{a=1}^q J_{ij}(a,b) = 0 \quad \forall i,j,b, \qquad \sum_{b=1}^q J_{ij}(a,b) = 0 \quad \forall i,j,a $$
This enforces that the single-site fields sum to 0 at each position, and that each row and column of every coupling matrix $J_{ij}$ sums to 0, so that parameter redundancies of the kind above (e.g. shifting all pairwise values at a site by 0.1) are removed. Another common choice is the reference-state gauge, which sets a chosen residue’s parameters (typically the 21st, or gap, state) to 0 so that all fields and pairwise parameters are measured relative to it: $$ h_i(q) = 0 \quad \forall i $$ $$ J_{ij}(a,q) = J_{ij}(q,b) = 0 \quad \forall i,j,a,b $$
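As a rough sketch (numpy assumed), the function below shifts Potts parameters into the zero-sum gauge described above, reusing the placeholder h and J arrays from the earlier snippet; the transformation changes the parameter values but leaves the probability distribution itself untouched, since any constant offset is absorbed by Z.

```python
import numpy as np

def to_zero_sum_gauge(h: np.ndarray, J: np.ndarray):
    """Return (h', J') with each field and each row/column of every J'_ij summing to 0."""
    n, q = h.shape
    h_new, J_new = h.copy(), J.copy()
    for i in range(n):
        for j in range(n):
            if i == j:
                J_new[i, i] = 0.0  # self-couplings are never used; zero them for tidiness
                continue
            block = J[i, j]                                # the q x q coupling matrix J_ij
            row_mean = block.mean(axis=1, keepdims=True)   # mean over b, for each a
            col_mean = block.mean(axis=0, keepdims=True)   # mean over a, for each b
            total_mean = block.mean()
            J_new[i, j] = block - row_mean - col_mean + total_mean
            # push the removed means into the fields so the distribution is unchanged
            h_new[i] += row_mean.ravel() - total_mean
    h_new -= h_new.mean(axis=1, keepdims=True)             # finally centre each field
    return h_new, J_new

h_zs, J_zs = to_zero_sum_gauge(h, J)
print(np.allclose(h_zs.sum(axis=1), 0.0))   # every field sums to zero
print(np.allclose(J_zs.sum(axis=2), 0.0), np.allclose(J_zs.sum(axis=3), 0.0))
```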
We now have a functional form for modelling both single-site and pairwise interactions between residues for any given protein: the Potts model. The next blog will address how we go about learning the parameters of this model, of which there are numerous approaches. I hope to see you there!