Blog 1: Protein Modelling Pt.1 - From Similarity to Structure Prediction
Introduction
Interdisciplinary science is beautiful science
Biology has typically been the least quantitative of the sciences. However, over recent decades the surge in sequencing data, made possible by next-generation sequencing, has enabled the application of statistical and machine learning methods to biology. In this blog I will describe what first got me into computational biology and what helped power the first drastic improvements in protein structure prediction introduced by AlphaFold2[ref]. Specifically, I will discuss what is now called evolutionary coupling analysis or Direct Coupling Analysis (DCA) and focus on several key papers that formulated this fascinating application of statistical mechanical principles to biology.
Contents
What is Protein Modelling?
Starting from the very top, what are proteins? Proteins are molecular machines which, at the simplest level, can be characterised by their molecular composition via their amino acid sequence - a one-dimensional string of single-letter codes, one for each amino acid molecule (residue) making up the complete molecular structure of the protein. In some ways, this is the most primitive “model” of a protein we have. It tells us the exact chemical composition of that protein, yet we can’t even infer its exact genetic sequence from it without further information (due to the redundancy of the amino acid codons).
Let’s level up our model. Commonly across biology we are given protein sequences without any knowledge of the function or properties of the protein. To resolve this we have developed various methods for comparing protein sequences in an effort to determine their degree of similarity, and thus infer function from similarity to proteins whose function is known. Naturally this requires a quantitative approach, which has come in the form of sequence alignments. At the simplest level there’s the pairwise alignment, which simply counts the number of mismatches between two sequences of the same length. For sequences of different lengths things get a bit more complicated, but we can save that for another blog. Further sophistication comes from scoring mismatches differently depending on how chemically related the amino acid pair is, using a substitution matrix (a popular one being the BLOSUM62 matrix), as sketched below.
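To make this concrete, here is a minimal sketch of scoring two toy, already-aligned sequences of equal length: first by counting mismatches, then with the BLOSUM62 substitution matrix loaded through Biopython (assumed to be installed); the sequences themselves are made up for illustration.

```python
from Bio.Align import substitution_matrices

def mismatch_count(seq_a: str, seq_b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    assert len(seq_a) == len(seq_b)
    return sum(a != b for a, b in zip(seq_a, seq_b))

def blosum62_score(seq_a: str, seq_b: str) -> float:
    """Sum of BLOSUM62 scores over the aligned (ungapped) positions."""
    blosum62 = substitution_matrices.load("BLOSUM62")
    return sum(blosum62[a, b] for a, b in zip(seq_a, seq_b))

seq1 = "HEAGAWGHEE"
seq2 = "HEAGAWGHEQ"
print(mismatch_count(seq1, seq2))  # 1 mismatch: E vs Q at the final position
print(blosum62_score(seq1, seq2))  # a chemically conservative mismatch is penalised only mildly
```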
At all levels these are models of a protein, in this case providing functionality via the ability to quantify relatedness between sequences. Taking a step further we encounter Hidden Markov Models, which are a particularly powerful model of proteins, again describing similarity but this time probabilistically, allowing us to determine the likelihood that a given protein sequence is related to a collection of others (a family of homologs) - this is called homology modelling. What I’m going to describe in this blog is what I see as the next step up in protein modelling, the Potts Model. I’ll describe its emergence from a beautiful application of statistical mechanics by way of the Ising Model, and how its functionality extends beyond just similarity/homology modelling to generative protein design and 3D structure prediction.
Pairwise Modelling
The models we discussed earlier consider each residue in the protein sequence independently of the others. In reality, every residue in the protein is likely influenced by (and influences) every other residue. This is particularly evident in the case of epistasis - a phenomenon in which the effect of mutating a given residue (on stability, fitness and other protein properties) differs drastically depending on whether or not a second residue is also mutated.
This implies a level of inter-residue dependence in protein sequences. Drawing from probability theory we get a nice mathematical formalisation for this. If two variables are independent:
$$ P(A,B) = P(A)P(B) $$
Epistasis is exactly a violation of this kind of independence, which we can write as:
$$ E(\text{seq}^{mut1,mut2}) \neq E(\text{seq}^{mut1})\,E(\text{seq}^{mut2}) $$
So in the case of proteins we can quantify the degree of dependence between two positions $i$ and $j$ by their Mutual Information (MI), which compares the observed joint distribution of residues at the two positions against the product of the single-position distributions (what independence would predict):
$$ \text{MI}_{ij} = \sum_{a,b} P(\text{seq}_{i} = a, \text{seq}_{j} = b) \log \frac{P(\text{seq}_{i} = a, \text{seq}_{j} = b)}{P(\text{seq}_{i} = a)\,P(\text{seq}_{j} = b)} $$
This is zero exactly when the two positions are independent and grows as they become more strongly coupled.
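As a rough sketch of how this is estimated in practice, the snippet below (assuming numpy) computes the mutual information between two columns of a tiny, made-up alignment of related sequences; the alignment, the alphabet ordering and the pseudocount are all illustrative choices.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids + gap, q = 21

def column_mi(msa: list[str], i: int, j: int, pseudocount: float = 1e-6) -> float:
    """Mutual information between alignment columns i and j."""
    q = len(ALPHABET)
    idx = {a: k for k, a in enumerate(ALPHABET)}
    fij = np.full((q, q), pseudocount)
    for seq in msa:
        fij[idx[seq[i]], idx[seq[j]]] += 1.0
    fij /= fij.sum()                 # joint frequencies f_ij(a, b)
    fi = fij.sum(axis=1)             # marginal f_i(a)
    fj = fij.sum(axis=0)             # marginal f_j(b)
    return float(np.sum(fij * np.log(fij / np.outer(fi, fj))))

msa = ["AC-DA", "ACEDA", "GC-DG", "GCEDG"]   # toy alignment: columns 0 and 4 co-vary
print(column_mi(msa, 0, 4))  # high MI: column 4 mirrors column 0
print(column_mi(msa, 0, 3))  # ~0: column 3 is constant
```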
This is useful because, just like HMMs, we want a probabilistic formalism for our protein model: one whose parameters we can determine statistically from data, that provides a degree of certainty, and that is generative (but we will discuss that later). Starting at a basic level, since we have touched on the dependency between residues, we can look at modelling a protein as a multi-dimensional distribution over the residues in its sequence, which we write as a joint distribution over the whole sequence where each position is a variable $x_i$ taking on one of the 20 amino acids or a gap character: $$ P(\text{seq}) = P(x_1, \dots, x_n) $$ where $n$ is the length of the sequence. So we now have a complex joint probability distribution over a sequence composed of a number of dependent variables. Following the chain rule of probability we can decompose this joint distribution into a product of conditional probabilities: $$ P(\boldsymbol{x}) = \prod_{i=1}^{n} P(x_i \mid x_1, \dots, x_{i-1}) $$ We can also marginalise over the rest of the sequence to get the probability of any single residue: $$ P(x_i = a) = \sum_{\boldsymbol{x}_{-i}} P(\boldsymbol{x}) $$ where the sum runs over all configurations $\boldsymbol{x}_{-i}$ of the remaining positions.
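As a tiny illustration of these manipulations, the sketch below (numpy assumed) builds an arbitrary joint distribution over a toy “sequence” of three positions with a two-letter alphabet and marginalises out all but one position; the point is that the explicit joint table has $q^n$ entries, which is why this representation is hopeless for real proteins.

```python
import numpy as np

q, n = 2, 3
rng = np.random.default_rng(0)

# An arbitrary joint distribution P(x_1, x_2, x_3), stored as an explicit q x q x q table.
P = rng.random((q, q, q))
P /= P.sum()

# P(x_2 = a): marginalise out positions 1 and 3 by summing over their axes.
P_x2 = P.sum(axis=(0, 2))
print(P_x2, P_x2.sum())  # a valid distribution over the 2 letters, summing to 1

# The explicit table has q**n entries; for a real protein (q = 21, n in the hundreds)
# that is astronomically many, so we need a parameterised model instead.
print(q ** n)  # 8 here, but 21**200 has over 260 digits
```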
This probabilistic formalism can be represented as an undirected graph and is known in machine learning as a Markov Random Field (MRF). Every variable (in our case residue position) is represented as a node in a graph, with edges connecting residues that are dependent. For reasons that will become clearer later, every node in this graph is connected to every other, meaning that each residue exhibits some non-zero dependence on every other residue in the sequence. This also makes the graph particularly difficult to solve exactly for any realistically sized protein. The problem of solving (or approximating) this so-called Potts model for different proteins is the challenge addressed by the papers I will discuss in this blog series.
Learning the Potts Model
To keep this first blog post to a reasonable size I will briefly discuss the first attempt to approximate this model for proteins, then follow up on more established methods in the next.
How do we parameterise a probabilistic model? Mathematically, parameterisation means defining a functional form for our probability distribution from which we can adjust the parameters controlling the characteristics of the distribution to best fit some data.
We previously mentioned the vast amount of sequence data available since the explosion in high-throughput sequencing - this data is already used to parameterise HMMs. HMMs are typically built from alignments of similar sequences (multiple sequence alignments) by considering each individual column in the alignment (which corresponds to a distinct position in the query sequence unless gapped) and computing empirical frequencies from which we infer the HMM parameters. I will hopefully have a separate post describing HMMs in more detail later - the point is that we parameterise our models with empirical data.
Of course, this means that the data we use has a significant impact on the parameterisation and therefore the accuracy of our model. We want our model to capture the rules that define our sequences and their relationships with each other, so the more data we can obtain the more accurately our distribution can be parameterised. Although standard sequence alignments are quite capable of collecting up to thousands of sequence homologs for a given query sequence, for the Potts model we really want on the order of tens of thousands. One of the benefits of HMMs over standard sequence alignment is that they are better at identifying sequences that are likely functionally or structurally similar (homologues) without those sequences necessarily being similar in exact composition (the basis of typical alignment methods), meaning we can capture much more information on the query protein by collecting more of its related (often evolutionarily distant) sequences together. So the first step of learning our model is collecting all the related sequences returned for the query and stacking them into a large multiple sequence alignment.
Parameterising the Potts model
Let’s look at this from a frequentist approach. We have tens of thousands of homologous protein sequences, where the alignment algorithm has inserted gaps to represent insertions/deletions so that all sequences (rows) are the same length. This is a matrix with $m$ rows (the sequences in the alignment) and $n$ columns (the positions of the gapped query sequence). The classic frequentist approach would be to compute the per-residue statistics at each column, $f_i(a)$, where $i$ is the column index and $a$ is the residue, taking values $a \in \{1,\dots,q\}$ with $q$ being the size of the alphabet, typically 21 - the expectation being that proteins from the same family have similar alignments and therefore similar per-column single-site frequencies. The more basic Position-Specific Scoring Matrix (PSSM) uses these to compute a log-odds score by comparing against the background frequency of each amino acid across all proteins.
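A rough sketch (numpy assumed) of what this looks like: the toy alignment, the pseudocount and the uniform background below are all illustrative, not taken from any real protein family.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # q = 21: the 20 amino acids plus the gap character
IDX = {a: k for k, a in enumerate(ALPHABET)}

def single_site_frequencies(msa: list[str], pseudocount: float = 0.5) -> np.ndarray:
    """Return an (n, q) array of per-column residue frequencies f_i(a)."""
    n, q = len(msa[0]), len(ALPHABET)
    counts = np.full((n, q), pseudocount)
    for seq in msa:
        for i, a in enumerate(seq):
            counts[i, IDX[a]] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

msa = ["AC-DA", "ACEDA", "GC-DG", "GCEDG"]  # toy alignment: 4 sequences, 5 columns
f = single_site_frequencies(msa)

# PSSM-style log-odds score against a (here uniform) background distribution.
background = np.full(len(ALPHABET), 1.0 / len(ALPHABET))
pssm = np.log(f / background)
print(pssm[1, IDX["C"]])  # column 1 is all C, so C scores well above background there
```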
This is a valid model but it doesn’t capture any of the pairwise dependencies we are interested in. So how about generating a second matrix (or in this case tensor) which captures the pairwise statistics $f_{ij}(a,b)$ - the frequency with which residue $a$ appears at position $x_i$ whilst residue $b$ is present at position $x_j$ (sketched below). Graphical models provide a powerful way of handling such large pairwise probabilistic models by employing a system of nodes and edges, where an edge links two nodes that are conditionally dependent (as in $P(A,B) \neq P(A)P(B)$). We can then configure the graph so that independent variables are not linked by edges, and design algorithms capable of computing the joint distribution without marginalising over the entire graph.
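A short continuation of the previous sketch (reusing the ALPHABET, IDX and toy msa defined above) that builds the pairwise frequency tensor:

```python
import numpy as np

def pairwise_frequencies(msa: list[str], pseudocount: float = 0.5) -> np.ndarray:
    """Return an (n, n, q, q) tensor of pairwise frequencies f_ij(a, b)."""
    n, q = len(msa[0]), len(ALPHABET)
    counts = np.full((n, n, q, q), pseudocount)
    for seq in msa:
        ks = [IDX[a] for a in seq]
        for i in range(n):
            for j in range(n):
                counts[i, j, ks[i], ks[j]] += 1.0
    return counts / counts.sum(axis=(2, 3), keepdims=True)

fij = pairwise_frequencies(msa)
print(fij.shape)                      # (5, 5, 21, 21) for the toy alignment
print(fij[0, 4, IDX["A"], IDX["A"]])  # how often A at column 0 co-occurs with A at column 4
```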
However our protein model cannot make assumptions about the independence of residue pairs, since these inter-residue dependencies are exactly what we are trying to determine. One approach is the Maximum Entropy approach, where we take the simplest possible parameterisation and fit our parameters to satisfy the empirical data in the multiple sequence alignment (the single-site and pairwise frequencies): $$ P(x) = \frac{1}{Z} \exp \left(\sum_{i=1}^n h_i(x_i) + \sum_{1\leq i<j \leq n} J_{ij}(x_i,x_j)\right) $$ This is the formal Potts model; let’s take it apart bit by bit. Inside the brackets there are two summation terms. The first contains our single-site parameters $h_i(x_i)$: for each column $i$, a function that takes a possible amino acid and returns a value. The second contains our pairwise parameters $J_{ij}(x_i,x_j)$: a function for each unique pair of positions that takes an amino acid for position $i$ and one for position $j$. Notice the subscript of the second summation: the pairwise parameters are symmetric, $J_{ij}(a,b) = J_{ji}(b,a)$, so we sum only over unique pairs, which is what the subscript dictates. $Z$ is just a normalising factor that makes sure the resultant model is a valid probability distribution summing to 1. In reality, $Z$ is our biggest problem with this approach, as it represents a summation over all $q^n$ possible sequences.
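As a concrete (if toy) illustration, the sketch below (numpy assumed) evaluates the Potts model for one encoded sequence using random placeholder parameters; in practice $h$ and $J$ would be learned from an alignment. Note that only the unnormalised probability is computed, since $Z$ would require summing over all $q^n$ sequences.

```python
import numpy as np

q, n = 21, 5
rng = np.random.default_rng(0)
h = rng.normal(size=(n, q))            # single-site fields h_i(a)
J = rng.normal(size=(n, n, q, q))
J = (J + J.transpose(1, 0, 3, 2)) / 2  # enforce the symmetry J_ij(a, b) = J_ji(b, a)

def negative_energy(x: np.ndarray) -> float:
    """Sum of h_i(x_i) plus J_ij(x_i, x_j) over unique pairs i < j."""
    fields = sum(h[i, x[i]] for i in range(n))
    couplings = sum(J[i, j, x[i], x[j]] for i in range(n) for j in range(i + 1, n))
    return fields + couplings

x = np.array([0, 3, 20, 7, 1])     # a sequence encoded as indices into the alphabet
print(np.exp(negative_energy(x)))  # unnormalised probability, i.e. Z * P(x)
```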
This form is particularly interesting because it is exactly the form of something called the Ising model (pronounced ees-ing model, as I’m told). The Ising model emerged from condensed matter physics, where it was developed to statistically model lattices of particles exhibiting either up or down spins, and it is capable of capturing non-local interactions between those particles. The Potts model is essentially an extension of the Ising model beyond binary variables. In other words, we have a model that specialises in capturing long-range dependencies between residues in a protein sequence: epistasis!
Without any constraints the maximum entropy distribution is simply uniform, so that every residue appears with probability $1/q$ at each column and every residue pair with probability $1/q^2$ at each column pair. We then “learn” the optimal values of $h_i$ and $J_{ij}$ from our sequence alignments so that the parameters satisfy: $$ \sum_x P(x)\, \delta(x_i = a) = f_i(a) $$ $$ \sum_x P(x)\, \delta(x_i = a, x_j = b) = f_{ij}(a,b) $$ where the Kronecker delta function $\delta(\cdot)$ is equal to 1 if the condition in its argument is true and 0 otherwise. In other words, we are free to choose parameters so long as the respective single-site and pairwise marginals are reproduced. If we used our model to generate a number of homologous sequences (discussed later), the empirical frequencies of the newly generated sequence alignment would match those of the original training data. This is possible because there is a certain degree of overparameterisation in this model. Specifically, we have $nq$ single-site parameters and $\binom{n}{2} q^2 \approx \frac{q^2 n^2}{2}$ pairwise parameters. For a protein of 200 residues this is 4,200 + 8,775,900 ≈ 8.8 million parameters, whereas we typically work with multiple sequence alignments of order $10^4$ sequences, providing far fewer constraints than parameters.
Realistically, many distinct parameterisations give the same probability distribution - imagine shifting all $h_i$ values by some constant, say 0.1, in the same direction: the distribution stays the same but the parameter values differ. To remove this redundancy we exploit the gauge invariance of the model and fix a gauge. A common choice is the zero-sum (or Ising) gauge:
$$ \sum_{a=1}^q h_i(a) = 0 \quad \forall i $$ $$ \sum_{a=1}^q J_{ij}(a,b) = 0 \quad \forall i,j,b, \qquad \sum_{b=1}^q J_{ij}(a,b) = 0 \quad \forall i,j,a $$
This enforces that the single-site fields sum to 0 at each position, and that each row and column of every coupling matrix $J_{ij}$ sums to 0, so that parameter redundancies of the kind above (e.g. shifting all pairwise values at a site by 0.1) are removed. Another common choice is the reference-state gauge, which sets a chosen residue’s parameters (typically the 21st, or gap, state) to 0 so that all fields and pairwise parameters are measured relative to it: $$ h_i(q) = 0 \quad \forall i $$ $$ J_{ij}(a,q) = J_{ij}(q,b) = 0 \quad \forall i,j,a,b $$
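As a rough sketch (numpy assumed), the function below shifts Potts parameters into the zero-sum gauge described above, reusing the placeholder h and J arrays from the earlier snippet; the transformation changes the parameter values but leaves the probability distribution itself untouched, since any constant offset is absorbed by Z.

```python
import numpy as np

def to_zero_sum_gauge(h: np.ndarray, J: np.ndarray):
    """Return (h', J') with each field and each row/column of every J'_ij summing to 0."""
    n, q = h.shape
    h_new, J_new = h.copy(), J.copy()
    for i in range(n):
        for j in range(n):
            if i == j:
                J_new[i, i] = 0.0  # self-couplings are never used; zero them for tidiness
                continue
            block = J[i, j]                                # the q x q coupling matrix J_ij
            row_mean = block.mean(axis=1, keepdims=True)   # mean over b, for each a
            col_mean = block.mean(axis=0, keepdims=True)   # mean over a, for each b
            total_mean = block.mean()
            J_new[i, j] = block - row_mean - col_mean + total_mean
            # push the removed means into the fields so the distribution is unchanged
            h_new[i] += row_mean.ravel() - total_mean
    h_new -= h_new.mean(axis=1, keepdims=True)             # finally centre each field
    return h_new, J_new

h_zs, J_zs = to_zero_sum_gauge(h, J)
print(np.allclose(h_zs.sum(axis=1), 0.0))   # every field sums to zero
print(np.allclose(J_zs.sum(axis=2), 0.0), np.allclose(J_zs.sum(axis=3), 0.0))
```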
We now have a functional form for modelling both single-site and pairwise interactions between residues for any given protein: the Potts model. The next blog will address how we go about learning the parameters of this model, of which there are numerous approaches. I hope to see you there!