Blog 2: Protein Modelling Pt.2 - Graph Network Messaging
Introduction
In the previous blog we outlined the statistical basis of the Potts model. The resultant Ising-like form is largely intractable for any reasonably sized protein and multiple sequence alignment. Although no closed-form solution exists, numerical approaches that approximate the energy function have shown remarkable success in predicting protein residue contacts. In this blog we’ll discuss the two earliest such approaches.
As a reminder, we start with the canonical Potts protein model where we are given a sequence $\boldsymbol{\sigma} = \{\sigma_1, ...,\sigma_N \}$ of length $N$, and a multiple sequence alignment of $M$ homologous sequences arranged in a matrix $\mathcal{D}$ of size $M \times N$. Each element $\sigma_i^{(m)}$, for $i \in \{1,...,N\}$ and $m \in \{1,...,M\}$, is one of the $20$ canonical amino acids or the gap character, encoded as $a \in \{1,...,q\}$ with $q=21$. We want to use this data to approximate the following Potts model by estimating the parameters $\boldsymbol{h}$ and $\boldsymbol{J}$, representing the single-site fields and the couplings respectively:
$$ P(\boldsymbol{\sigma}) = \frac{1}{Z} \exp\left\{\sum_{i}^N \boldsymbol{h}_i(\sigma_i) + \sum_{i,j > i}^N \boldsymbol{J}_{ij}(\sigma_i,\sigma_j) \right\} $$Where $Z$ is the partition function that normalises the Hamiltonian to a probability value between 0 and 1.
Solving the above provides a sequence-based model that encodes structural information via estimation of a protein’s contact map, as well as other functionality such as generative sampling. Specifically, by explicitly modelling a global joint distribution over the sequence we can reliably disentangle true 3D contacts between residue pairs, whereas standard covariance and Mutual Information approaches are confounded by indirect residue correlations (Figure 1).
Figure 1: One can imagine three residues in a protein sequence co-evolving, where residue pairs 1-2 and 2-3 are close in 3D space but 1-3 is not, yet the indirect correlation of 1 and 3 through residue 2 suggests structural proximity. This false-positive contact is successfully disentangled by the global probability distribution of the Potts model which, unlike Mutual Information, encodes each residue’s relationships with every other residue and thus accounts for shared relationships with a third residue rather than only the direct correlation between the pair.
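To make these objects concrete, here is a minimal numpy sketch (toy shapes, random parameters, nothing fitted, and all names hypothetical) of how a sequence, the fields $\boldsymbol{h}$ and the couplings $\boldsymbol{J}$ combine in the exponent of the Potts model:

```python
import numpy as np

q = 21   # 20 amino acids + gap
N = 50   # toy sequence length

rng = np.random.default_rng(0)
h = rng.normal(size=(N, q))                 # single-site fields h_i(a)
J = 0.01 * rng.normal(size=(N, N, q, q))    # couplings J_ij(a, b)

def potts_energy(seq, h, J):
    """Exponent of the Potts model: sum_i h_i(s_i) + sum_{i<j} J_ij(s_i, s_j)."""
    n = len(seq)
    e = h[np.arange(n), seq].sum()
    for i in range(n):
        for j in range(i + 1, n):
            e += J[i, j, seq[i], seq[j]]
    return e

seq = rng.integers(0, q, size=N)   # a random toy sequence
energy = potts_energy(seq, h, J)
# P(seq) = exp(energy) / Z, but Z requires summing over all q**N sequences
print(energy)
```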
Message Passing
The first attempt at approximating the Potts model for small proteins was a message passing approach by Weigt et al., who used a graph network model to iteratively update the edges between nodes (residues), eventually converging on the true contact map and successfully disentangling direct contacts from indirect interactions. Weigt et al. did this specifically for the sensor histidine kinase (SK) and response regulator (RR) protein pair of bacterial two-component signalling systems, meaning their analysis is of interacting contact pairs between two proteins rather than within a single monomer. In their work they concatenate the sequences of the two proteins for each sequence pair, essentially treating them as a single monomer. Without going too deep into the nitty gritty of message passing I’ll go through how they used it to learn the Potts model.
In their paper they specifically consider two interacting proteins with the aim of identifying their interacting residues, using the same logic as for intradomain residues in contact. They do this by taking paired MSAs for both proteins and concatenating the two paired sequences along each row (see the sketch below). The resultant combined MSA has length $N=(N_1 + N_2)$ and contains $M$ protein pairs, with each sequence element taking any value from the canonical amino acid alphabet plus a gap character, $\sigma_i \in \{1,...,q=21\}$.
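As a rough illustration of this data setup (hypothetical array names and sizes, assuming integer-encoded alignments), the concatenation is just a column-wise stack of the two paired alignments:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical paired alignments, integer-encoded, one row per interacting SK/RR pair
msa_sk = rng.integers(0, 21, size=(500, 120))   # M x N1
msa_rr = rng.integers(0, 21, size=(500, 110))   # M x N2

# Concatenate along the sequence axis so each pair is treated as one "monomer"
msa = np.hstack([msa_sk, msa_rr])               # M x (N1 + N2)
print(msa.shape)                                # (500, 230)
```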
Empirical Frequencies
In the previous blog we noted how the solution to the Potts model comes from a maximum entropy approach constrained by the frequency counts for each amino acid at single sites, $f_i(a)$, and pairs of sites, $f_{ij}(a,b)$. Weigt et al. and other subsequent papers utilise variations of weighted and regularised frequency counts instead of plain frequencies. Specifically, in the Weigt et al. paper:
$$ f_i(a) = \frac{1}{\lambda q+ M}\left(\lambda + \sum_{k=1}^M \delta_i(\sigma_i^k=a) \right) $$$$ f_{ij}(a,b) = \frac{1}{\lambda q + M}\left(\frac{\lambda}{q} + \sum_{k=1}^M \delta_{ij}(\sigma_i^k=a,\sigma_j^k=b) \right) $$
Where $\delta(\cdot)$ is the indicator function, equal to 1 if the condition inside holds and 0 otherwise. The above give regularised frequency counts which compensate for a major issue with learning a model from multiple sequence alignments. Specifically, regularisation is employed by adding a pseudocount $\lambda$, a small adjustable hyperparameter added to the counts so that even an amino acid never observed in a column still has a non-zero frequency. This prevents overfitting: it is almost never the case that a residue is absolutely inaccessible to evolution at a given position, and any observed zero frequency is generally just an artifact of MSA sampling.
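A minimal sketch of these pseudocount-regularised counts, assuming an integer-encoded MSA and following the two formulas above:

```python
import numpy as np

def regularised_frequencies(msa, q=21, lam=1.0):
    """Pseudocount-regularised single-site and pairwise frequency counts.
    msa : (M, N) integer array with values in {0, ..., q-1}."""
    M, N = msa.shape
    norm = 1.0 / (lam * q + M)

    # single-site counts with pseudocount lambda
    f_i = np.zeros((N, q))
    for a in range(q):
        f_i[:, a] = (msa == a).sum(axis=0)
    f_i = norm * (lam + f_i)

    # pairwise counts with pseudocount lambda / q
    f_ij = np.zeros((N, N, q, q))
    for seq in msa:
        for i in range(N):
            # add one observation at [i, j, seq[i], seq[j]] for every j
            f_ij[i, np.arange(N), seq[i], seq] += 1.0
    f_ij = norm * (lam / q + f_ij)
    return f_i, f_ij
```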
Returning to the paper: in the previous blog we discussed gauge fixing to prevent different parameter sets giving the same probability (or Hamiltonian). In this approach the authors use a zero-sum gauge. Specifically:
$$ \sum_{\sigma_i=1}^q \boldsymbol{J}_{ij}(\sigma_i,\sigma_j) = \sum_{\sigma_j=1}^q\boldsymbol{J}_{ij}(\sigma_i,\sigma_j) = 0 , \quad \forall i,j $$$$ \sum_{\sigma_i=1}^q(\boldsymbol{h}_i(\sigma_i)) = 0 $$
Meaning that the solution to our maximum entropy approach becomes unique. The zero-sum constraint also reduces the number of free parameters, since the $q$-th value at each site can be inferred from the other $q-1$ values via the sum. Having established these prerequisites we can begin to consider how we learn the $\boldsymbol{h}$ and $\boldsymbol{J}$ parameters.
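For illustration, here is one way to shift an arbitrary parameter set into the zero-sum gauge (my own helper and array shapes, not code from the paper); the shift changes the Hamiltonian only by an additive constant that is absorbed into $Z$:

```python
import numpy as np

def to_zero_sum_gauge(h, J):
    """Shift (h, J) into the zero-sum gauge, leaving the Hamiltonian unchanged
    up to an additive constant. h: (N, q); J: (N, N, q, q) with J[i, j] = J[j, i].T."""
    N, q = h.shape
    J_new = np.array(J, dtype=float)
    h_new = np.array(h, dtype=float)

    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            Jij = J[i, j]                    # rows = states of i, columns = states of j
            row_mean = Jij.mean(axis=1)      # mean over states of j, shape (q,)
            col_mean = Jij.mean(axis=0)      # mean over states of i, shape (q,)
            tot_mean = Jij.mean()
            J_new[i, j] = Jij - row_mean[:, None] - col_mean[None, :] + tot_mean
            # absorb the removed row means into the field at site i; each unordered
            # pair is visited twice in this loop, so only site i is updated here
            h_new[i] += row_mean - tot_mean

    # finally centre the fields themselves so they also sum to zero
    h_new -= h_new.mean(axis=1, keepdims=True)
    return h_new, J_new
```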
Inference
Inference is achieved in the paper through gradient descent following this 2-step algorithm:
- For a trial Hamiltonian, marginal distributions for single and pairwise positions are calculated.
- Parameters are updated with gradient descent by the difference between the above marginals and the empirical frequencies.
We start by setting the couplings to 0 and the single-site fields to the independent-site solution, so that the model’s single-site marginals initially equal the empirical frequencies. The parameter update scheme is then defined as:
$$ \Delta J_{ij}(\sigma_i,\sigma_j) = \epsilon \left[f_{ij}(\sigma_i,\sigma_j) - P_{ij}(\sigma_i,\sigma_j) - \frac{f_i(\sigma_i) + f_j(\sigma_j) - P_i(\sigma_i) - P_j(\sigma_j)}{q} \right] $$$$ \Delta h_i(\sigma_i) = \epsilon [ f_i(\sigma_i) - P_i(\sigma_i)] $$
Where $\epsilon$ controls the step size. Even though we initialise the single-site fields so that their marginals equal the empirical frequencies, we must still include an update rule for them: updating the couplings shifts the single-site marginals, so we re-evaluate the fields at each step to keep the single-site marginals fixed.
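A sketch of a single update step under this scheme; names and shapes are assumptions, and the model marginals $P$ would come from the message passing described next:

```python
import numpy as np

def update_step(h, J, f_i, f_ij, P_i, P_ij, eps=0.1):
    """One gradient step matching model marginals (P) to empirical frequencies (f).
    Shapes: h, f_i, P_i -> (N, q); J, f_ij, P_ij -> (N, N, q, q)."""
    q = h.shape[1]
    # gauge-preserving coupling update from the text
    corr = (f_i[:, None, :, None] + f_i[None, :, None, :]
            - P_i[:, None, :, None] - P_i[None, :, None, :]) / q
    J_new = J + eps * (f_ij - P_ij - corr)
    # field update
    h_new = h + eps * (f_i - P_i)
    return h_new, J_new
```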
The problem is that the single and two-site marginals are computationally intractable for any reasonably sized protein as the marginals are computed by summation over all positions except those in question:
$$ P_i(\sigma_i) = \sum_{\sigma_{-i}}P(\boldsymbol{\sigma}) $$and the same for the pairwise sites. The authors resolve this with message passing; in fact, the neat solution they develop replaces the gradient descent approach altogether.
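To see why this fails at protein scale, here is a brute-force marginal computation for a toy alphabet; the outer sum runs over $q^N$ sequences, which explodes immediately for real proteins ($21^N$ terms):

```python
from itertools import product
import numpy as np

def exact_single_site_marginals(h, J):
    """Brute-force P_i(a) by summing over every possible sequence: q**N terms."""
    N, q = h.shape
    P = np.zeros((N, q))
    Z = 0.0
    for seq in product(range(q), repeat=N):      # combinatorial explosion in N
        s = np.array(seq)
        e = h[np.arange(N), s].sum()
        for i in range(N):
            for j in range(i + 1, N):
                e += J[i, j, s[i], s[j]]
        w = np.exp(e)
        Z += w
        P[np.arange(N), s] += w
    return P / Z

# only feasible for toy sizes; q=21 and N=100 would mean 21**100 terms
rng = np.random.default_rng(2)
N, q = 6, 3
P = exact_single_site_marginals(rng.normal(size=(N, q)),
                                0.1 * rng.normal(size=(N, N, q, q)))
print(P.sum(axis=1))   # each row sums to 1
```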
Message Passing
The single-site marginals are evaluated by standard belief propagation (BP), a method for approximating marginals on a probabilistic graph with pairwise interactions without summing over the entire combinatorial space. BP works by the exchange of messages, or beliefs, between sites. Nodes (residue sites) pass messages about what they believe their marginal would be if the neighbour they are passing to were removed. We denote the message passed from site $i$ to neighbour $j$ as $P_{i \rightarrow j}(\sigma_i)$. The neighbour $j$ is excluded because we don’t want the message $i$ sends to $j$ to already include $j$’s own belief about $i$, which would be a form of double counting leading to runaway recursive updates.
To reiterate, if $i$’s belief about its own marginal comes from all other nodes, we explicitly exclude the influence of $j$ when passing a message to $j$ about $i$. This recursive thinking extends down the tree of messages: the computation of $i$’s marginal aggregates the messages from all the other nodes, which we multiply together, giving the product in the BP update equation below:
$$ P_{i \rightarrow j}(\sigma_i) \propto \exp \left( h_i(\sigma_i)\right) \prod_{k \neq i,j}\left[ \sum_{\sigma_k}\exp (J_{ki}(\sigma_k,\sigma_i))P_{k\rightarrow i}(\sigma_k) \right] $$All messages are first initialised randomly, then iteratively updated using the above until no message changes by more than $10^{-5}$. The true marginals are then computed by:
$$ P_{i}(\sigma_i) \propto \exp \left( h_i(\sigma_i)\right) \prod_{k \neq i}\left[ \sum_{\sigma_k}\exp (J_{ki}(\sigma_k,\sigma_i))P_{k\rightarrow i}(\sigma_k) \right] $$Where we no longer exclude $j$. You can read up more on BP in this blog by Andy Jones. The core innovation of this paper, however, is the inversion of this approach. Consider that we already know the marginal distributions we want: they are our empirical frequencies, $P_i(\sigma_i) = f_i(\sigma_i)$. So if we can formulate our message passing so that it does not rely on $h_i(\sigma_i)$, we can solve for the fields directly after running BP. A neat way to eliminate $h_i(\sigma_i)$ from the above two equations is simply to take their ratio and rearrange:
$$ \frac{P_{i \rightarrow j}(\sigma_i)}{P_{i}(\sigma_i)} \propto \frac{\exp \left( h_i(\sigma_i)\right) \prod_{k \neq i,j}\left[ \sum_{\sigma_k}\exp (J_{ki}(\sigma_k,\sigma_i))P_{k\rightarrow i}(\sigma_k) \right]} {\exp \left( h_i(\sigma_i)\right) \prod_{k \neq i}\left[ \sum_{\sigma_k}\exp (J_{ki}(\sigma_k,\sigma_i))P_{k\rightarrow i}(\sigma_k) \right]} $$Cancelling all common terms, leaving only the denominator’s $k=j$ factor, we simplify to:
$$ \frac{P_{i \rightarrow j}(\sigma_i)}{P_{i}(\sigma_i)} \propto \frac{1} {\sum_{\sigma_j}\exp (J_{ij}(\sigma_i,\sigma_j))P_{j\rightarrow i}(\sigma_j)} $$then rearrange:
$$ P_{i \rightarrow j}(\sigma_i) \propto \frac{P_{i}(\sigma_i)} {\sum_{\sigma_j}\exp (J_{ij}(\sigma_i,\sigma_j))P_{j\rightarrow i}(\sigma_j)} $$Which gives a message update reliant on the marginals rather than $h_i(\sigma_i)$, successfully inverting the problem. After convergence we can then compute $h_i(\sigma_i)$ by rearranging the earlier marginal equation and replacing $P_i(\sigma_i)$ with $f_i(\sigma_i)$:
$$ \exp(h_i(\sigma_i)) \propto \frac{f_i(\sigma_i)}{\prod_{j\neq i}\left[ \sum_{\sigma_j}\exp(J_{ij}(\sigma_i,\sigma_j))P_{j \rightarrow i}(\sigma_j)\right]} $$Where we have replaced index $k$ with $j$ for notational consistency.
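Putting the inverted scheme together, here is a rough sketch of the message update and the final field read-off (my own naming, normalisation and convergence choices; couplings assumed symmetric, $J_{ij}(\sigma_i,\sigma_j) = J_{ji}(\sigma_j,\sigma_i)$):

```python
import numpy as np

def inverted_bp(f_i, J, max_iter=1000, tol=1e-5):
    """Gradient-free message passing sketch: messages are updated from the *known*
    empirical marginals f_i, then the fields h are read off at convergence.
    f_i : (N, q) empirical single-site frequencies
    J   : (N, N, q, q) current couplings, with J[i, j] = J[j, i].T"""
    N, q = f_i.shape
    msg = np.full((N, N, q), 1.0 / q)   # messages P_{i->j}(sigma_i), start uniform

    for _ in range(max_iter):
        new_msg = msg.copy()
        for i in range(N):
            for j in range(N):
                if i == j:
                    continue
                # P_{i->j}(a) ∝ f_i(a) / sum_b exp(J_ij(a, b)) P_{j->i}(b)
                denom = np.exp(J[i, j]) @ msg[j, i]
                m = f_i[i] / denom
                new_msg[i, j] = m / m.sum()
        if np.max(np.abs(new_msg - msg)) < tol:
            msg = new_msg
            break
        msg = new_msg

    # exp(h_i(a)) ∝ f_i(a) / prod_{j != i} sum_b exp(J_ij(a, b)) P_{j->i}(b)
    h = np.zeros((N, q))
    for i in range(N):
        prod = np.ones(q)
        for j in range(N):
            if j != i:
                prod *= np.exp(J[i, j]) @ msg[j, i]
        h[i] = np.log(f_i[i]) - np.log(prod)
    h -= h.mean(axis=1, keepdims=True)   # re-impose the zero-sum gauge
    return h, msg
```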
Having now formulated a solution for inferring the single-site fields we need to consider the couplings. This is somewhat more complicated than a satisfying rearrangement of BP; the authors employ a technique called susceptibility propagation, which I will defer to the paper (or a future blog).
Unfortunately, although it exhibits a significant speed up compared to an MCMC approach, this method still suffers from being a largely iterative procedure.
Mean Field Approximation
Soon after, the same group introduced a far more computationally efficient approach, with inference up to $10^4$ times faster than via message passing. Morcos et al. achieve this through a beautiful marriage of statistical mechanics and a biological problem, which I will attempt to describe below.
We start by obtaining single-site and pairwise empirical frequencies as sample estimates of the single-site and pairwise marginals, as with the previous approach, although with a slightly different formulation:
$$ f_i(a) = \frac{1}{\lambda + M_{eff}}\left(\frac{\lambda}{q} + \sum_{k=1}^M \frac{1}{m_k}\delta_i(\sigma_i^k=a) \right) $$$$ f_{ij}(a,b) = \frac{1}{\lambda + M_{eff}}\left(\frac{\lambda}{q^2} + \sum_{k=1}^M \frac{1}{m_k}\delta_{ij}(\sigma_i^k=a,\sigma_j^k=b) \right) $$
Notice the two new terms, $M_{eff}$ and $m_k$. These are included, alongside regularisation, to counter a further problem with MSA data: sampling bias along phylogenetic trees, which over-represents closely related sequences relative to distantly related homologues (as well as implicit biases in which organisms humans choose to sequence). To counter this, each sequence is reweighted by a factor $1/m_k$, balanced by an effective sequence count $M_{eff}$. Typically sequences above a threshold sequence identity, usually 0.8, are down-weighted, reducing the effective number of sequences to $M_{eff} < M$; $M_{eff}$ is thus the sum of the per-sequence weights $1/m_k$ rather than the total sequence count. The per-sequence count $m_k$ is calculated as follows:
$$ m_k = |\{ l \in \{1,...,M\} \mid \operatorname{seqid}(\boldsymbol{\sigma}^{(k)}, \boldsymbol{\sigma}^{(l)}) > 0.8 \}| $$Following this the authors introduce the connected correlations as a measure of pairwise interdependence, in a similar vein to Mutual Information (it is in fact the covariance of the single-site indicator variables):
$$ C_{ij}(a,b) = P_{ij}(a,b) - P_i(a)P_j(b) $$Where the empirical approximation to this follows:
$$ C_{ij}(a,b) = f_{ij}(a,b) - f_i(a)f_j(b) $$The computational efficiency and appeal of the mean-field approximation as a method for approximating the Potts model arise from the simple relationship:
$$ (C^{-1})_{ij} = - \boldsymbol{J}_{ij} $$So the largely iterative procedure of the message passing approach is reduced to a matrix inversion which, given a fixed gauge and regularisation, is solvable. The actual derivation of this requires some key tools from physics which I will not attempt to explain in depth (I’m not a physicist!) but I will walk you through the process regardless.
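Before the derivation, here is an end-to-end sketch of the mean-field recipe as described so far: sequence reweighting, regularised frequencies, connected correlations restricted to $q-1$ states, and one matrix inversion. The variable names, pseudocount parametrisation and handling of the diagonal blocks are my own choices rather than the paper’s exact prescription:

```python
import numpy as np

def mean_field_dca(msa, q=21, theta=0.8, pc_weight=0.5):
    """Minimal mean-field DCA sketch: reweighting, regularised frequencies,
    connected correlations and couplings from one inversion (J = -C^{-1}).
    msa : (M, N) integer-encoded alignment with values in {0, ..., q-1}."""
    M, N = msa.shape

    # --- sequence reweighting ----------------------------------------------
    sim = (msa[:, None, :] == msa[None, :, :]).mean(axis=2)   # pairwise identity, (M, M)
    m_k = (sim > theta).sum(axis=1)        # size of each sequence's similarity cluster
    w = 1.0 / m_k                          # per-sequence weights
    M_eff = w.sum()
    lam = pc_weight * M_eff                # pseudocount strength (one common choice)

    # --- reweighted, regularised frequencies --------------------------------
    f_i = np.zeros((N, q))
    for a in range(q):
        f_i[:, a] = (w[:, None] * (msa == a)).sum(axis=0)
    f_i = (lam / q + f_i) / (lam + M_eff)

    f_ij = np.zeros((N, N, q, q))
    for k in range(M):
        s = msa[k]
        f_ij[np.arange(N)[:, None], np.arange(N)[None, :], s[:, None], s[None, :]] += w[k]
    f_ij = (lam / q**2 + f_ij) / (lam + M_eff)
    for i in range(N):                     # diagonal blocks: f_ii(a, b) = f_i(a) if a == b
        f_ij[i, i] = np.diag(f_i[i])

    # --- connected correlations on q-1 states (the gauge fixes the last state)
    C = (f_ij[:, :, :q - 1, :q - 1]
         - f_i[:, None, :q - 1, None] * f_i[None, :, None, :q - 1])
    C = C.transpose(0, 2, 1, 3).reshape(N * (q - 1), N * (q - 1))

    # --- mean-field inversion ------------------------------------------------
    J = -np.linalg.inv(C)
    J = J.reshape(N, q - 1, N, q - 1).transpose(0, 2, 1, 3)   # (N, N, q-1, q-1)
    return f_i, J, M_eff
```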
Derivation of the Mean Field Approach
Essentially, the authors apply something called a Legendre transform to change the variables of our Potts model from the Hamiltonian formulation’s fields (single-site and couplings) to a Gibbs potential expressed in terms of “spin” distributions (in our case amino acid probability distributions), allowing inference from the statistics of our multiple sequence alignment. They first introduce the (perturbed) Hamiltonian as follows:
$$ \mathcal{H} = -\alpha \sum_{1 \leq i < j \leq N} \boldsymbol{J}_{ij}(\sigma_i,\sigma_j) - \sum_{i=1}^N h_i(\sigma_i) $$Where the parameter $\alpha$ is introduced to allow interpolation between an independent model ($\alpha = 0$) and the full model ($\alpha = 1$). In physics the Hamiltonian describes the energy of a given configuration of the system (for us, a protein sequence in the space of protein sequences) and is related to the (Helmholtz) free energy $F$ by:
$$ F = U - TS $$This is classic thermodynamics, where $U$ is the average energy, $T$ the temperature and $S$ the entropy. In statistical mechanics the same free energy is expressed via the partition function:
$$ F = -k_B T \ln Z $$Where $k_B$ is the Boltzmann constant and $Z$ the partition function, which we know is a sum over all possible configurations of the system, each weighted by its Boltzmann factor $e^{-\beta \mathcal{H}}$:
$$ Z = \sum_{\{\boldsymbol{\sigma_i} \}} e^{-\beta \mathcal{H} (\{\boldsymbol{\sigma_i}\})} $$Combining this with the above we get the formal free energy (note how the Boltzmann constant and temperature combine into the constant $\frac{1}{\beta} = k_B T$):
$$ F = - \frac{1}{\beta} \ln \left(\sum_{\{\boldsymbol{\sigma_i}\}}e^{-\beta \mathcal{H} (\{\boldsymbol{\sigma_i} \})} \right) $$This is all parameterised by energy terms which we do not have access to but want to infer from the statistical distribution of sequences in a multiple sequence alignment. We can use a Legendre transform of $F$ to change variables, switching from the free energy to the Gibbs potential $\mathcal{G}$, which swaps the dependence on the fields $\boldsymbol{h}_i$ for a dependence on the so-called magnetizations $m_i$, here the probability marginals $P_i$. The Legendre transform of $F[\boldsymbol{h}]$ to $\mathcal{G}[P]$ is as follows:
$$ \mathcal{G}[P] = F[\boldsymbol{h}] + \sum_i\boldsymbol{h}_iP_i $$Where $P_i = -\frac{\partial F}{\partial \boldsymbol{h}_i}$ is the negative partial derivative of $F$ with respect to the single-site fields. This holds because, after some differentiation of the above free energy equations:
$$ \frac{\partial F}{\partial \boldsymbol{h}_i(\sigma_i)} = -\frac{1}{\beta}(\beta \langle \delta_{\sigma_i,a}\rangle) = - \langle \delta_{\sigma_i,a}\rangle $$and the expectation of the Kronecker delta, the indicator that is 1 when $\sigma_i = a$, is exactly the single-site marginal $P_i(a)$.
Now $\mathcal{G}$ depends on the system’s state (probabilities) not the external forces (fields), allowing inference from our statistical distribution. So for the Potts model the Gibbs potential is defined as:
$$ -\mathcal{G}(\alpha) = \ln \left[\sum_{\{\sigma_i | i = 1, ..., N\}} e^{-H(\alpha)} \right] - \sum_{i=1}^N \sum_{b=1}^{q-1} \boldsymbol{h}_i(b) P_i(b) $$Where the first term is $\ln Z = -F$ (the statistical free energy with $\beta = 1$) and the second term (with signs rearranged) is the added marginal term of the Legendre transform. Taking derivatives of $\mathcal{G}(\alpha, \{P_i\})$ with respect to $P_i(a)$ gives back the fields:
$$ \boldsymbol{h}_i(\sigma_i) = \frac{\partial \mathcal{G}(\alpha)}{\partial P_i(\sigma_i)} $$Here’s the trick: if we take the derivative of these single-site fields $\boldsymbol{h}_i$ with respect to $P_j(\sigma_j)$ we get:
$$ \frac{\partial \boldsymbol{h}_i(\sigma_i)}{\partial P_j(\sigma_j)} = \frac{\partial^2 \mathcal{G}(\alpha)}{\partial P_i(\sigma_i)\partial P_j(\sigma_j)} $$Remember the connected correlations $C_{ij}$? In statistical physics it is well established that the second derivative of the free energy with respect to the fields gives (minus) the connected correlations. Given our first derivative of the free energy from earlier:
$$ \frac{\partial F}{\partial \boldsymbol{h}_i(\sigma_i)}= - \langle \delta_{\sigma_i,a}\rangle = -P_i(\sigma_i) $$Taking a second derivative, now with respect to $\boldsymbol{h}_j(\sigma_j)$, gives:
$$ -\frac{\partial^2 F}{\partial \boldsymbol{h}_i(\sigma_i)\partial \boldsymbol{h}_j(\sigma_j)} = \frac{\partial P_i(\sigma_i)}{\partial \boldsymbol{h}_j(\sigma_j)} $$Which equals the connected correlations:
$$ C_{ij}(\sigma_i,\sigma_j) = \langle \delta_{\sigma_i,a}\delta_{\sigma_j,b}\rangle - P_i(\sigma_i)P_j(\sigma_j) = \frac{\partial P_i(\sigma_i)}{\partial \boldsymbol{h}_j(\sigma_j)} $$Where $\langle \delta_{\sigma_i,a}\delta_{\sigma_j,b}\rangle = P_{ij}(\sigma_i,\sigma_j)$. Notice that this matrix, $\partial P_i / \partial \boldsymbol{h}_j$, is exactly the inverse of the matrix $\partial \boldsymbol{h}_i / \partial P_j$ we derived from the Gibbs potential. In other words, the inverse of the connected correlations describes the change in the fields in response to the marginals:
$$ (C^{-1})_{ij}(\sigma_i,\sigma_j) = \frac{\partial \boldsymbol{h}_i(\sigma_i)}{\partial P_j(\sigma_j)} $$The authors emphasise that this relationship holds for any value of $\alpha$ and that the $q-1$ gauge makes this matrix invertible by removing “trivial linear dependencies resulting from the normalisation of $P_{ij}$”. Given that the connected correlations hold the information on our pairwise marginals, $C_{ij}(\sigma_i,\sigma_j) = P_{ij}(\sigma_i,\sigma_j) - P_i(\sigma_i)P_j(\sigma_j)$, we can use them to obtain the pairwise marginals via a simple rearrangement:
$$ P_{ij}(\sigma_i,\sigma_j) = P_i(\sigma_i)P_j(\sigma_j) + C_{ij}(\sigma_i,\sigma_j) $$However thanks to the earlier proof we can obtain $C_{ij}$ by inverting the Hessian of the Gibbs potential $\mathcal{G}$:
$$ C = \left(\frac{\partial^2 \mathcal{G(\alpha)}}{\partial P_i(\sigma_i)\partial P_j(\sigma_j)}\right)^{-1} $$So to compute the pairwise marginals we must evaluate the inverse of the Hessian of $\mathcal{G}(\alpha)$ and add the product of the single-site marginals. To do so we need an approximation of $\mathcal{G}(\alpha)$. The authors achieve this via a two-term Taylor expansion of $\mathcal{G}$ around $\alpha = 0$ (the independent Hamiltonian), also known as a first-order mean-field expansion. In other words, the Gibbs potential is approximated as the value of the independent system, $\mathcal{G}(0)$, plus an interaction correction acting through the first-order (second) term, the mean-field correction:
$$ \mathcal{G}(\alpha) \approx \mathcal{G}(0) + \alpha \frac{d \mathcal{G}}{d \alpha}\bigg|_{\alpha = 0} $$This works easily for the independent single-site case (no couplings), and in the paper the pairwise coupling interaction energy is handled with a small-coupling expansion (a Taylor series around 0); I refer the reader to Morcos et al. for a more rigorous description than anything I could attempt.
Consider the Gibbs potential in the independent case, $\alpha = 0$, where it is equivalent to the negative entropy of an ensemble of $N$ uncoupled Potts spins with marginals $P_i(\sigma_i)$: the free energy equals the average energy minus the entropy, and for $\alpha = 0$ the Legendre transform removes the entire average energy, leaving minus the entropy of uncoupled spins:
$$ \mathcal{G}(0) = \sum_{i=1}^N \sum_{a=1}^q P_i(a) \ln P_i(a) $$Due to the gauge the authors modify this to eliminate $P_i(q)$ reducing the expression to independent variables:
$$ \mathcal{G}(0) = \sum_{i=1}^N \sum_{a=1}^{q-1} P_i(a) \ln P_i(a) + \sum_{i=1}^N\left[1 - \sum_{a=1}^{q-1} P_i(a)\right] \ln \left[1 - \sum_{a=1}^{q-1} P_i(a) \right] $$For the first-order term we need $\frac{d\mathcal{G}(\alpha)}{d\alpha}$ evaluated at $\alpha = 0$, for which we get:
$$ \frac{d\mathcal{G}(\alpha)}{d\alpha} = -\left\langle \sum_{i<j}\boldsymbol{J}_{ij}(\sigma_i,\sigma_j) \right\rangle_\alpha $$Or the (negative) average of the coupling energy. At $\alpha = 0$ this is trivial to evaluate, as the joint distribution factorises over single sites:
$$ \frac{d\mathcal{G}(\alpha)}{d\alpha} |_{\alpha=0} = -\sum_{i<j}\sum_{a,b} \boldsymbol{J}_{ij}(a,b)P_i(a)P_j(b) $$Inserting our equations for $\mathcal{G}(0)$ and $\frac{d\mathcal{G}(\alpha)}{d\alpha} |_{\alpha=0}$ into our expansion of the Gibbs potential returns its first-order approximation:
$$ \mathcal{G}(\alpha=1) \approx \sum_{i=1}^N \sum_{a=1}^{q-1} P_i(a) \ln P_i(a) + \sum_{i=1}^N\left[1 - \sum_{a=1}^{q-1} P_i(a)\right] \ln \left[1 - \sum_{a=1}^{q-1} P_i(a) \right] -\sum_{i<j}\sum_{a,b} \boldsymbol{J}_{ij}(a,b)P_i(a)P_j(b) $$Or its mean-field approximation.
Now we’re nearly there! The final step of this procedure is to obtain the so-called “self-consistent mean-field equations” for the single-site fields. Drawing from physics once more, we use the variational principle of thermodynamics, minimising the free energy (here the Gibbs potential) with respect to the marginal distributions $P_i(a)$, ignoring all correlations except those captured by the mean fields, to converge on the (approximate) underlying Boltzmann distribution $P(\boldsymbol{\sigma})$. This requires differentiating the Gibbs potential with respect to $P_i(a)$ subject to the normalisation $\sum_{a=1}^q P_i(a) = 1$. As with any minimisation we look for the point where the gradient equals 0, which, through the magic of the authors’ hard work, leaves the single-site mean-field equation:
$$ \frac{P_i(\sigma_i)}{P_i(q)} = \exp \left\{\boldsymbol{h}_i(\sigma_i) + \sum_{j\neq i}\sum_{\sigma_j} \boldsymbol{J_{ij}}(\sigma_i,\sigma_j)P_j(\sigma_j) \right\} $$Meaning each site’s marginal depends on the average state of all the others.
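Assuming the gauge in which the $q$-th state carries zero field and zero couplings, the fields can then be read off from this self-consistency relation with the empirical frequencies standing in for $P_i$; here is a sketch reusing the coupling layout from the earlier mean-field snippet:

```python
import numpy as np

def fields_from_mean_field(f_i, J):
    """Read the fields off the self-consistent mean-field equation, with the empirical
    marginals in place of P_i and the gauge h_i(q) = 0, J(., q) = J(q, .) = 0.
    f_i : (N, q) marginals; J : (N, N, q-1, q-1) couplings from the inversion step."""
    N, q = f_i.shape
    h = np.zeros((N, q - 1))
    for i in range(N):
        # sum over j != i and b of J_ij(a, b) * f_j(b), restricted to the first q-1 states
        mean_field = np.einsum('jab,jb->a',
                               np.delete(J[i], i, axis=0),
                               np.delete(f_i[:, :q - 1], i, axis=0))
        h[i] = np.log(f_i[i, :q - 1] / f_i[i, q - 1]) - mean_field
    return h
```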
Finally, returning to the connected correlations, where we required an approximation of the Gibbs potential: we take the Hessian of the first-order mean-field Gibbs potential above, differentiating with respect to $P_i(\sigma_i)$ and then $P_j(\sigma_j)$, which gives the inverse of the connected correlations. Again this is an involved differentiation which, if you’re really keen, you can try yourself (taking into account the normalisation of the marginals to 1), but I will just provide the end result. The differentiation yields two cases; for the case $i \neq j$ we explicitly obtain the mean-field inference solution to the Potts model:
$$ (C^{-1})_{ij}(\sigma_i, \sigma_j) = -\boldsymbol{J}_{ij}(\sigma_i,\sigma_j) $$And that’s it! This inspiring medley of thermodynamics, statistical mechanics and evolutionary sequence analysis, starting from a 21-state generalised Ising (Potts) model, leads to rapid inference through a single inversion of the connected correlation matrix, for which we have an empirical estimate.