Blog 2: Protein Modelling Pt.2 - Graph Network Messaging
Introduction
In the previous blog we outlined the statistical basis of the Potts model. The resulting Ising-like form is intractable for any reasonably sized protein and multiple sequence alignment. Although no closed-form solution exists, methods that numerically approximate the energy function have shown remarkable success when tested on their ability to predict protein residue contacts. In this blog we’ll discuss the earliest of these approaches: message passing.
As a reminder, we start with the canonical Potts protein model. We are given a sequence $\boldsymbol{\sigma} = \{\sigma_1, ..., \sigma_N \}$ of length $N$ and a multiple sequence alignment of $M$ homologous sequences arranged in a matrix $\mathcal{D}$ of size $M \times N$, where each element $\sigma_i^{(m)}$ (for $i \in \{1,...,N\}$ and $m \in \{1,...,M\}$) is one of the $20$ canonical amino acids or the gap character, $a \in \{1,...,q\}$ with $q=21$. We want to use this data to approximate the following Potts model by estimating the parameters $\boldsymbol{h}$ and $\boldsymbol{J}$, representing the single-site fields and pairwise couplings respectively:
$$ P(\boldsymbol{\sigma}) = \frac{1}{Z} \exp\left\{\sum_{i}^N \boldsymbol{h}_i(\sigma_i) + \sum_{i,j > i}^N \boldsymbol{J}_{ij}(\sigma_i,\sigma_j) \right\} $$Where $Z$ is the partition function that normalises the exponentiated Hamiltonian so that the probabilities sum to one.
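To make the notation concrete, here is a minimal sketch (my own illustration, not code from any paper) of evaluating the exponent of the Potts model for a single integer-encoded sequence, assuming candidate parameters stored as numpy arrays `h` of shape $(N, q)$ and `J` of shape $(N, N, q, q)$:

```python
import numpy as np

def potts_log_score(seq, h, J):
    """Unnormalised log-probability (negative energy) of one sequence.

    seq : (N,) integer array, each entry in {0, ..., q-1}
    h   : (N, q) single-site fields
    J   : (N, N, q, q) pairwise couplings (only i < j is used)
    """
    N = len(seq)
    score = sum(h[i, seq[i]] for i in range(N))   # field terms
    for i in range(N):
        for j in range(i + 1, N):                 # coupling terms, i < j
            score += J[i, j, seq[i], seq[j]]
    return score  # P(seq) is proportional to exp(score); Z is the normaliser
```

Computing $Z$ itself would require summing $\exp$ of this score over all $q^N$ possible sequences, which is exactly the intractability the rest of this post is about.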
Solving the above provides a sequence-based model that encodes structural information via estimation of a protein’s contact map, and also enables other functionality such as generative sampling. Specifically, by explicitly modelling a global joint distribution over the sequence we are able to reliably identify true 3D contacts between residue pairs, whereas standard covariance analysis and Mutual Information approaches are confounded by indirect residue correlations (Figure 1).
Figure 1: (left) A multiple sequence alignment showing three highly correlated positions (2, 4 and 8), suggesting they are all in mutual contact, when in reality (right) positions 2 and 8 are only indirectly connected via position 4 in 3D space. Direct coupling analysis is able to separate this indirect correlation from the real couplings (2-4, 4-8).
One can imagine three residues in a protein sequence co-evolving where pairs 1-2 and 2-3 are close in 3D space but 1-3 are not, yet the indirect correlation of 1 and 3 through residue 2 suggests structural proximity. This false positive contact is successfully disentangled by the global probability distribution of the Potts model which, unlike Mutual Information, encodes each residue’s relationships with every other residue simultaneously and can therefore attribute the 1-3 correlation to their shared coupling to residue 2 rather than to a direct interaction between the two of them.
Message Passing
The first attempt at approximating the Potts model for small proteins was a message passing approach by Weigt et al., who used a graph network model to iteratively update the edges between nodes (residues), eventually converging on the true contact map and successfully disentangling direct contacts from indirect interactions. Weigt et al. specifically do this for the sensor histidine kinase (SK) and response regulator (RR) protein pair of bacterial two-component signalling systems, meaning their analysis is on interacting or coupled contact pairs between two proteins rather than within a single monomer. In their work they concatenate the sequences of the two proteins for each sequence pair, essentially treating them as a single monomer. Without going too deep into the nitty gritty of message passing, I’ll go through how they used it to learn the Potts model.
In their paper they consider specifically two interacting proteins with the aim of identifying their interacting residues, using the same logic as for intradomain residues in contact. They do this by taking paired MSAs for both proteins and concatenating the two paired sequences along each row. The resultant combined MSA has length $N = N_1 + N_2$ and contains $M$ protein pairs, with each position of a sequence $\boldsymbol{\sigma}$ capable of taking any value from the canonical amino acid alphabet plus a gap character, $\sigma_i \in \{1,...,q=21\}$.
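As a rough sketch of the data set-up (the array names, shapes and integer encoding are my assumptions, not taken from the paper), the paired alignment can be held as two integer-encoded matrices whose rows correspond to the same organism and which are concatenated along the sequence axis:

```python
import numpy as np

q = 21  # 20 amino acids + gap, encoded as integers 0..20

# Hypothetical integer-encoded MSAs: row m of each matrix comes from the
# same organism, so the rows are already paired.
M, N1, N2 = 2500, 120, 110
msa_sk = np.random.randint(0, q, size=(M, N1))   # sensor kinase alignment
msa_rr = np.random.randint(0, q, size=(M, N2))   # response regulator alignment

# Concatenate along each row to treat the protein pair as one "monomer"
msa = np.concatenate([msa_sk, msa_rr], axis=1)   # shape (M, N1 + N2)
```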
Empirical Frequencies
In the previous blog we noted how the solution to the Potts model comes from a maximum entropy approach constrained by the frequency counts for each amino acid at single sites $f_i(\sigma_i)$ and pairs of sites $f_{ij}(\sigma_i,\sigma_j)$. Weigt et al. and other subsequent papers utilise variations of weighted and regularised frequency counts instead of just plain frequencies. Specifically, in the Weigt et al. paper:
$$ f_i(a) = \frac{1}{\lambda q+ M}\left(\lambda + \sum_{k=1}^M \delta_i(\sigma_i^k=a) \right) $$$$ f_{ij}(a,b) = \frac{1}{\lambda q + M}\left(\frac{\lambda}{q} + \sum_{k=1}^M \delta_{ij}(\sigma_i^k=a,\sigma_j^k=b) \right) $$
Where $\delta(\cdot)$ is the indicator function, equal to 1 if the condition inside holds and 0 otherwise. The above give regularised frequency counts which compensate for a major issue with learning a model from multiple sequence alignments. Specifically, regularisation is employed by adding a pseudocount $\lambda$, a small number (an adjustable hyperparameter) added to the frequency counts so that even an amino acid never observed in a column still has a non-zero frequency. This prevents overfitting, as it is almost never the case that an amino acid is truly inaccessible to evolution at a given position of a protein; any zero frequency observed is generally just an artifact of the MSA sampling.
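A minimal sketch of these regularised counts (assuming an integer-encoded MSA array like the one above; `lam` is the pseudocount $\lambda$):

```python
import numpy as np

def regularised_frequencies(msa, q=21, lam=1.0):
    """Pseudocount-regularised frequencies from an (M, N) integer-encoded MSA."""
    M, N = msa.shape
    one_hot = np.eye(q)[msa]                                   # shape (M, N, q)
    # Single-site frequencies f_i(a), matching the first equation above
    f_i = (lam + one_hot.sum(axis=0)) / (lam * q + M)          # (N, q)
    # Pairwise frequencies f_ij(a, b), matching the second equation above
    counts_ij = np.einsum('mia,mjb->ijab', one_hot, one_hot)   # (N, N, q, q)
    f_ij = (lam / q + counts_ij) / (lam * q + M)
    return f_i, f_ij
```

Note that the pairwise counts use a pseudocount of $\lambda/q$ so that both the single and pairwise frequencies normalise to one.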
Returning to the paper: in the previous blog we discussed gauge fixing, which removes the freedom for different parameter sets to produce the same probability (or Hamiltonian). In this approach the authors use a zero-sum gauge. Specifically:
$$ \sum_{\sigma_i=1}^q \boldsymbol{J}_{ij}(\sigma_i,\sigma_j) = \sum_{\sigma_j=1}^q\boldsymbol{J}_{ij}(\sigma_i,\sigma_j) = 0 , \quad \forall i,j $$$$ \sum_{\sigma_i=1}^q(\boldsymbol{h}_i(\sigma_i)) = 0 $$
Meaning that the solution to our maximum entropy approach becomes unique. The zero-sum constraints also reduce the number of free parameters: for each field $\boldsymbol{h}_i$ only $q-1$ values need to be determined, since the last one is fixed by the sum, and likewise each coupling matrix $\boldsymbol{J}_{ij}$ has $(q-1)^2$ rather than $q^2$ free entries. Having established these prerequisites we can begin to consider how we learn these $\boldsymbol{h}$ and $\boldsymbol{J}$ parameters.
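Before moving on, here is a sketch of how an arbitrary parameter set can be shifted into the zero-sum gauge without changing the probabilities; the row and column means removed from each $\boldsymbol{J}_{ij}$ are pushed into the fields, and constant offsets are absorbed by $Z$ (the function and array layout are my own assumptions):

```python
import numpy as np

def to_zero_sum_gauge(h, J):
    """Shift (h, J) into the zero-sum gauge, leaving P(sigma) unchanged.

    h : (N, q) fields; J : (N, N, q, q) couplings with J[j, i] = J[i, j].T,
    where only pairs i < j contribute to the Hamiltonian.
    """
    N, q = h.shape
    h_new = h.astype(float).copy()
    J_new = np.zeros_like(J, dtype=float)
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            row = J[i, j].mean(axis=1)   # mean over sigma_j, shape (q,)
            col = J[i, j].mean(axis=0)   # mean over sigma_i, shape (q,)
            tot = J[i, j].mean()
            # rows and columns of the new coupling matrix now sum to zero
            J_new[i, j] = J[i, j] - row[:, None] - col[None, :] + tot
            if i < j:
                # the removed means are absorbed into the fields
                # (remaining constants shift log Z, leaving P unchanged)
                h_new[i] += row
                h_new[j] += col
    h_new -= h_new.mean(axis=1, keepdims=True)   # enforce sum_a h_i(a) = 0
    return h_new, J_new
```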
Inference
Inference is achieved in the paper through gradient descent following this 2-step algorithm:
- For a trial Hamiltonian, marginal distributions for single and pairwise positions are calculated.
- Parameters are updated with gradient descent by the difference between the above marginals and the empirical frequencies.
We start by setting the couplings to 0 and the single-site fields such that the model’s single-site marginals equal the empirical frequencies (the independent-site model). The parameter update scheme is then defined as:
$$ \Delta J_{ij}(\sigma_i,\sigma_j) = \epsilon \left[f_{ij}(\sigma_i,\sigma_j) - P_{ij}(\sigma_i,\sigma_j) - \frac{f_i(\sigma_i) + f_j(\sigma_j) - P_i(\sigma_i) - P_j(\sigma_j)}{q} \right] $$$$ \Delta h_i(\sigma_i) = \epsilon [ f_i(\sigma_i) - P_i(\sigma_i)] $$
Where $\epsilon$ controls the step size. Even though we initialise the single-site fields so that their marginals equal the empirical frequencies, we must still include an update rule for them at each step: updating the couplings to match the pairwise marginals shifts the single-site marginals, so we re-evaluate the fields to bring the single-site marginals back into agreement.
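A minimal sketch of one sweep of this update scheme (my own illustration; it assumes the model marginals `P_i` and `P_ij` under the current parameters have already been estimated, e.g. by the message passing described next):

```python
import numpy as np

def update_parameters(h, J, f_i, f_ij, P_i, P_ij, eps=0.1, q=21):
    """One gradient-style sweep matching model marginals to empirical ones."""
    N = h.shape[0]
    for i in range(N):
        for j in range(i + 1, N):
            # coupling update: difference of pairwise marginals, minus the
            # single-site mismatch term divided by q
            single_term = (f_i[i][:, None] + f_i[j][None, :]
                           - P_i[i][:, None] - P_i[j][None, :]) / q
            J[i, j] += eps * (f_ij[i, j] - P_ij[i, j] - single_term)
    # field update: match the single-site marginals
    h += eps * (f_i - P_i)
    return h, J
```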
The problem is that the single and two-site marginals are computationally intractable for any reasonably sized protein as the marginals are computed by summation over all positions except those in question:
$$ P_i(\sigma_i) = \sum_{\boldsymbol{\sigma}_{-i}}P(\boldsymbol{\sigma}) $$Where $\boldsymbol{\sigma}_{-i}$ denotes all positions other than $i$, and similarly for the pairwise marginals. For a protein of even modest length this sum runs over $q^{N-1}$ terms, e.g. $21^{99}$ for $N=100$. The authors resolve this with message passing; in fact, their neat solution replaces the gradient descent approach altogether.
Belief Propagation
The single-site marginals are evaluated by standard belief propagation (BP), a method for approximating marginals on a probabilistic graph with pairwise interactions without summing over the entire combinatorial space. The approach works by the exchange of messages, or beliefs, between sites. Nodes (residue sites) pass messages about what they believe their own marginal would be if the neighbour they are passing to were removed. Specifically, we denote the message passed from site $i$ to neighbour $j$ about $i$’s marginal as $P_{i \rightarrow j}(\sigma_i)$. The reason we exclude the neighbour $j$ is that we don’t want $i$ to send $j$ a message about itself that already incorporates $j$’s own belief about $i$, which would be a form of double counting leading to runaway recursive updates.
To reiterate, if $i$’s belief about its own marginal comes from all other nodes, we want to explicitly exclude the influence of $j$ when passing a message to $j$ about $i$. This recursive thinking extends down the tree of messages, where the computation of $i$’s marginal aggregates the incoming messages from all other nodes, which we multiply together, giving the product in the BP update equation below:
$$ P_{i \rightarrow j}(\sigma_i) \sim \exp \left( h_i(\sigma_i)\right) \prod_{k \neq i,j}\left[ \sum_{\sigma_k}\exp (J_{ki}(\sigma_k,\sigma_i))P_{k\rightarrow i}(\sigma_k) \right] $$All messages are first initialised randomly, then iteratively updated using the above until no message changes by more than $10^{-5}$. The true marginals are then computed by:
$$ P_{i}(\sigma_i) \sim \exp \left( h_i(\sigma_i)\right) \prod_{k \neq i}\left[ \sum_{\sigma_k}\exp (J_{ki}(\sigma_k,\sigma_i))P_{k\rightarrow i}(\sigma_k) \right] $$Where we no longer exclude $j$. You can read up more on BP through this blog by Andy Jones. The core innovation of this paper, however, is the inversion of this approach. Consider that we already know the marginal distributions we want: they are our empirical frequencies, $P_i(\sigma_i) = f_i(\sigma_i)$. So if we can reformulate the message passing so that it does not rely on $h_i(\sigma_i)$, we can run BP directly and then solve for $h_i(\sigma_i)$ afterwards. A neat way to eliminate $h_i(\sigma_i)$ from the above two equations is to simply take their ratio and rearrange:
$$ \frac{P_{i \rightarrow j}(\sigma_i)}{P_{i}(\sigma_i)} \propto \frac{\exp \left( h_i(\sigma_i)\right) \prod_{k \neq i,j}\left[ \sum_{\sigma_k}\exp (J_{ki}(\sigma_k,\sigma_i))P_{k\rightarrow i}(\sigma_k) \right]} {\exp \left( h_i(\sigma_i)\right) \prod_{k \neq i}\left[ \sum_{\sigma_k}\exp (J_{ki}(\sigma_k,\sigma_i))P_{k\rightarrow i}(\sigma_k) \right]} $$Cancelling the common terms, only the denominator’s $k=j$ factor survives and we simplify to:
$$ \frac{P_{i \rightarrow j}(\sigma_i)}{P_{i}(\sigma_i)} \propto \frac{1} {\sum_{\sigma_j}\exp (J_{ij}(\sigma_i,\sigma_j))P_{j\rightarrow i}(\sigma_j)} $$then rearrange:
$$ P_{i \rightarrow j}(\sigma_i) \propto \frac{P_{i}(\sigma_i)} {\sum_{\sigma_j}\exp (J_{ij}(\sigma_i,\sigma_j))P_{j\rightarrow i}(\sigma_j)} $$Which gives a message update reliant on the marginals rather than on $h_i(\sigma_i)$, successfully inverting the problem. After convergence we can then compute $h_i(\sigma_i)$ by simply rearranging the earlier marginal equation and replacing $P_i(\sigma_i)$ with $f_i(\sigma_i)$:
$$ \exp(h_i(\sigma_i)) \propto \frac{f_i(\sigma_i)}{\prod_{j\neq i}\left[ \sum_{\sigma_j}\exp(J_{ij}(\sigma_i,\sigma_j))P_{j \rightarrow i}(\sigma_j)\right]} $$Where we have replaced index $k$ with $j$ for notational consistency.
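Putting the inverted scheme together, here is a minimal sketch (my own reading of the equations above, not the authors' code) that iterates the inverted message update with the empirical frequencies standing in for the single-site marginals, then reads off the fields at convergence:

```python
import numpy as np

def inverted_bp(J, f_i, n_iter=500, tol=1e-5):
    """Infer the fields h given couplings J and empirical marginals f_i.

    J   : (N, N, q, q) couplings with J[j, i] = J[i, j].T
    f_i : (N, q) empirical single-site frequencies (the target marginals)
    """
    N, q = f_i.shape
    W = np.exp(J)                                   # Boltzmann factors exp(J_ij)
    # Messages P_{i->j}(sigma_i): random initialisation, normalised over sigma_i
    msg = np.random.rand(N, N, q)
    msg /= msg.sum(axis=-1, keepdims=True)

    for _ in range(n_iter):
        new_msg = msg.copy()
        for i in range(N):
            for j in range(N):
                if i == j:
                    continue
                # P_{i->j}(s_i) ~ f_i(s_i) / sum_{s_j} exp(J_ij(s_i,s_j)) P_{j->i}(s_j)
                denom = W[i, j] @ msg[j, i]
                m = f_i[i] / denom
                new_msg[i, j] = m / m.sum()
        if np.max(np.abs(new_msg - msg)) < tol:     # convergence check
            msg = new_msg
            break
        msg = new_msg

    # exp(h_i(s_i)) ~ f_i(s_i) / prod_{j != i} sum_{s_j} exp(J_ij(s_i,s_j)) P_{j->i}(s_j)
    h = np.log(f_i)
    for i in range(N):
        for j in range(N):
            if i != j:
                h[i] -= np.log(W[i, j] @ msg[j, i])
    h -= h.mean(axis=1, keepdims=True)              # fix the zero-sum gauge
    return h, msg
```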
Having now formulated a solution for inferring the single-site fields, we need to consider the couplings. This is somewhat more complicated than a satisfying rearrangement of BP: the authors employ a technique called susceptibility propagation, which I will defer to the paper (or a future blog).
Unfortunately, although exhibiting a significant speed-up compared to an MCMC approach, this method still suffers from being a largely iterative procedure. In the next part of this series we will address this issue with a powerful alternative formulation developed soon after this approach - Mean Field Approximation. So stay tuned for the next blog, where I’ll take us through this beautiful convergence of statistical mechanics and thermodynamics with biology.