Modelling Protein Folding With Graph Networks Bridges Geometry and Kinetics
Introduction
Protein folding is still a difficult problem. Although AlphaFold made great strides in advancing static structure prediction, the dynamic process that takes a protein from random coil to its equilibrium structure remains difficult to model. Gō-based Ising-like energy models such as the WSME model were among the first successful mathematical models of protein folding as a kinetic process. The approach is as follows: given a native-like folded conformation, we can classify residues in any possible alternative conformation as either native-like or not. One way to define this is via the residue's formation of native-like contacts ($C_\alpha - C_\alpha$ distance $< 8\,\text{Å}$). In folding this could be the difference between a random-coil unfolded state, where a given residue makes few native-like contacting pairs with spatially proximal residues, and the folded state, where that residue may sit in an alpha helix.
The Ising-like approach is to simply formulate the protein with a binary state for each residue, taking 1 if that residue is in the native-like conformation and 0 otherwise. The WSME model itself is specifically formulated to enforce sequence contiguity, and the associated Ising-like Hamiltonian is combined with an entropic parameter for each residue, resulting in a canonical free energy functional. The model is therefore kinetically meaningful and approximates a protein's folding free energy landscape. For a more detailed description of the WSME model see Ooka, Liu & Arai (2022).
What I'm hoping to illustrate with reference to the WSME model is that modelling protein folding as a dynamical process of state transitions is well established. This blog explores how we might exploit graph networks, which lend themselves well to such state-based formulations, for some slightly different approaches to modelling protein folding. The analysis I undertake is a more interpretative representation of folding, prioritising mechanistic intuition whilst drawing on key concepts from the more established framework of Markov State Models (MSMs). By tracking observed transitions between states we construct heuristic estimates of relative contact flip rates for individual residue pairs. While these are not true kinetics or free energies directly comparable to the typically uniform WSME contact energies, they serve as an exploratory proxy for illustrating how kinetic heterogeneity can emerge in contact formation and breaking. It also has the perk of cool visualisations, which in this case actually portray a meaningful picture of protein folding.
Throughout this blog I will specifically consider molecular dynamics (MD) simulations of protein folding, namely of the 10 residue artificial fast folder Chignolin. Naturally this means we are modelling numerical physics-based approximations of protein folding rather than experimental data such as SAXS and HX. The reason is that MD trajectories are trivial to interpret and prepare, and for a number of well-described fast-folding proteins they are accessible and reproducible. However, as a TL;DR: the intractability of proteins of more than ~30 residues with this method makes my later manifold-edge-based kinetic barrier analysis largely useless in the face of the clustering required to even build these graphs. I thought this was interesting enough to warrant a blog regardless, and I hope that it can at least serve as a somewhat intuitive introduction to elements of MSM theory.
Contents
- Methods - Building a Graph
- Identifying the Transition State Ensemble
- Bridging Geometry with Energy
- Take Homes
Methods - Building a Graph
Like the WSME model I'm going to rely on residue contacts to define a protein's state in folding. For this we'll define residue contacts as any pair of residues whose $C_\alpha - C_\alpha$ distance is $< 8\,\text{Å}$ and which are at least 4 positions apart in sequence ($|i - j| \geq 4$). We will consider every unique contact map (an $N_{\text{res}} \times N_{\text{res}}$ boolean matrix) of a protein as a possible topological state that the protein can take during folding. For a graph $G = (V, E)$ of nodes $v \in V$ and edges $e \in E$, these will be the nodes of our graph.
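This contact definition is straightforward to compute from $C_\alpha$ coordinates. Below is a minimal sketch, assuming coordinates arrive as an `(N_res, 3)` array; the function name is my own, not from the analysis code:

```python
# Sketch of the contact-map state definition from Cα coordinates.
import numpy as np

def contact_map(ca_coords, cutoff=8.0, min_seq_sep=4):
    """Boolean contact map: Cα-Cα distance < cutoff and |i - j| >= min_seq_sep."""
    coords = np.asarray(ca_coords, dtype=float)
    n = len(coords)
    # Pairwise Cα-Cα distance matrix
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    # Mask out pairs too close in sequence
    idx = np.arange(n)
    sep = np.abs(idx[:, None] - idx[None, :]) >= min_seq_sep
    return (dist < cutoff) & sep
```

The returned matrix is symmetric by construction, so either the full matrix or its upper triangle can serve as the state label.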
Where edges $E = \{(v_i, v_j)\}$ represent relationships between states, we can define them as follows: given an MD trajectory of protein folding, we draw an edge between two nodes if their respective contact maps are at any point adjacent in the trajectory. In other words, if node $v_i$ is the contact map of frame $t$ in the trajectory, we draw an edge from $v_i$ to $v_j$ if the contact map of frame $t-1$ or $t+1$ belongs to node $v_j$. The image below shows one such graph for the artificial fast-folding protein Chignolin (10 residues), using the trajectory data from Majewski et al. (2022), who repeated the original simulations from Lindorff-Larsen et al.'s How Fast-Folding Proteins Fold:
Figure 1: Graph where nodes represent unique contact maps and edges connect temporally adjacent nodes. Node size corresponds to its respective contact map's count in the folding trajectory. Nodes are coloured by the index in the trajectory of the first occurrence of their respective contact map. The outlined black node indicates the starting frame node (the unfolded conformation). Blue edge colouring highlights the shortest path to the yellow folded node whilst red edge highlighting indicates post-folded state transitions. The red star sits in the centre of the folded node.
Immediately we can spot something interesting. The largest node (yellow) represents the most commonly visited unique contact map in the trajectory which, given this is a folding simulation, we might assume is the folded conformation. This state is surrounded by a number of nodes that are visited only after the folded state is reached. Indeed by colouring edges appearing after reaching that state with red, we can see that the protein, upon reaching its folded state, seems to bounce around exclusively within this small basin. On the other hand we see the comparatively large space explored by the protein before reaching this ensemble, depicting the protein’s search over the folding landscape.
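Sketching this construction in code: each frame's contact map becomes a hashable state (e.g. a frozenset of contact pairs), unique states become nodes, and temporally adjacent distinct states become edges. A minimal illustration under those assumptions (names are my own, not the code behind the figures):

```python
# Build unique-contact-map nodes and temporal edges from a trajectory.
from collections import Counter

def build_temporal_graph(contact_maps):
    """contact_maps: sequence of hashable states (e.g. frozensets of contacts).

    Returns (node_ids, node_counts, edge_counts):
    node_ids maps each unique state to an integer node id,
    node_counts counts frames per node (node size in Figure 1),
    edge_counts counts undirected temporal transitions between distinct nodes."""
    node_ids, node_counts, edge_counts = {}, Counter(), Counter()
    prev = None
    for cm in contact_maps:
        nid = node_ids.setdefault(cm, len(node_ids))
        node_counts[nid] += 1
        if prev is not None and prev != nid:
            edge_counts[frozenset((prev, nid))] += 1  # undirected edge
        prev = nid
    return node_ids, node_counts, edge_counts
```

Self-transitions (consecutive frames in the same state) only grow the node count; they become edges to nowhere new and are skipped.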
However, as this graph is constructed purely from temporal edge information there is no physical meaning to its coordinate system - if nodes are close together, this is just a product of the graph layout algorithm. Let's keep our temporal edges visible but overlay them on a graph whose real edges are defined by adjacency in the space of all contact maps. In other words, nodes $i$ and $j$ are connected if they differ by exactly one contact. This forms a $\binom{N_\text{res}}{2}$-dimensional discrete hypercube manifold* which we can project to 2D coordinates by embedding with UMAP to get the following:
Figure 2: Unlike the previous graph, node connectivity is defined by adjacency on the discrete contact map configuration space manifold, projected down to a 2D representation to approximate topological difference in structure between nodes. Violet edge colouring represents the shortest path adhering strictly to the underlying manifold edges (the manifold geodesic).
*Actually the manifold is heavily constrained down to a $\mathcal{M} = \sum_{i=1}^{N_\text{res}-4}(N_\text{res} - (i+3)) = 21$ dimensional hypercube because of the minimum sequence separation in our contact definition. This means the total number of unique possible states is $2^{21}$.
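Manifold adjacency is then just a Hamming-distance-1 check between contact maps. A naive sketch over the observed states (quadratic in the number of states, so only suitable for small graphs like Chignolin's):

```python
# Connect states whose contact maps differ by exactly one contact,
# i.e. Hamming distance 1 on the discrete hypercube.
def manifold_edges(states):
    """states: list of frozensets of contact pairs. Returns index pairs."""
    states = list(states)
    edges = []
    for i in range(len(states)):
        for j in range(i + 1, len(states)):
            if len(states[i] ^ states[j]) == 1:  # exactly one contact flipped
                edges.append((i, j))
    return edges
```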
Again we observe the folding process as an apparently somewhat inefficient search over the contact map space bottlenecked by a number of key nodes. This raises the question of why all of the space to the right of the starting state is explored before reaching the first transition, considering that the progression to the folded state is clearly left-directional (although I must emphasise the inequivalence between geometry and reaction coordinates in this embedded space). So, is this due to inefficiencies in nature's folding machine, inaccuracies in the MD approximate physics, or perhaps an apt visualisation of the initial random search theory of protein folding posited by Dill & Chan (1997)?
One thing we can reveal by taking the intersection of manifold and temporal edges (edges that appear in both sets) is that the path from the unfolded to the folded state is discontinuous along the manifold - the two states are disconnected if we restrict ourselves to single-contact-flip edges. This isn't anything surprising; it just means some flips are hidden inside temporal transitions consisting of multiple contact flips. Specifically, the third edge on the manifold geodesic (formation of the ASP2-THR5 contact) is never observed in the set of temporal edges; the protein navigates around it rather than making the direct link. (ASP2, THR5) is also not a native contact.
Naturally we might think that applying shortest-path analysis to these graphs would give insight into optimal folding paths. But clearly Chignolin does not take the shortest or manifold geodesic paths. Why? It's likely that some energetically favourable conformation must be reached before the large jump into the folded conformation can be achieved. But are these graphs just pretty, or could they help us identify these transition states and ways to engineer proteins to reach them more efficiently without all the random search? Perhaps we can formalise these sentiments into a quantitative framework by drawing on the properties of these graphs.
Identifying the Transition State Ensemble
One property we can easily exploit is the betweenness centrality $C_B(v)$, a metric that balances how connected a given node is with how much interconnectivity it provides - in other words, how often it appears on the shortest paths between all other pairs of nodes.
$$ C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}} $$
Where $\sigma_{st}$ is the total number of shortest paths from node $s$ to node $t$ and $\sigma_{st}(v)$ is the number of those shortest paths that pass through $v$. This is a good descriptor of topological bottleneck states, which are likely key to successful folding, but may be naively confused with the Transition State Ensemble (TSE). Let's consider the top 10 betweenness centrality nodes:
Figure 3: Chignolin protein folding graph on the contact space manifold (PCA and UMAP embedded). This time crosses indicate bottleneck nodes with a high betweenness centrality. The size of the crosses and their colour represents the magnitude of the betweenness centrality score.
Notice how many of the most significant bottlenecks lie on the shortest paths but there are still some distributed over the unfolded space. A more appropriate explanation of what high betweenness centrality nodes represent is structural hubs that are easily accessible to lots of other states. These nodes are likely vital for the protein to find before branching off into attempts at folding as they increase accessibility to perhaps more energetically rugged regions of the landscape.
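For completeness, the betweenness scores behind Figure 3 are typically computed with Brandes' algorithm (this is what networkx's `betweenness_centrality` implements). A compact pure-Python sketch for an unweighted, undirected graph, with illustrative names:

```python
# Brandes' algorithm for unweighted betweenness centrality.
from collections import deque

def betweenness_centrality(adj):
    """adj: dict node -> list of neighbours (undirected, unweighted)."""
    cb = {v: 0.0 for v in adj}
    for s in adj:
        stack, preds = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}; sigma[s] = 1   # shortest-path counts
        dist = {v: -1 for v in adj}; dist[s] = 0
        queue = deque([s])
        while queue:                                # BFS from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:          # w lies on a shortest path via v
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                                # back-propagate dependencies
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                cb[w] += delta[w]
    for v in cb:
        cb[v] /= 2.0                                # undirected: each pair counted twice
    return cb
```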
But structural accessibility does not necessarily mean transition state. Instead we can take a more rigorous approach to identifying the TSE, specifically by drawing from MSM theory, which utilises another key graph property called the committor probability. Committor probabilities exploit the fact that in graphs we know the possible paths ahead via node connectivity, so we can compute, for each node $i$, the probability that the system reaches the folded state before returning to the unfolded state (start node).
Imagine we have a weighted undirected graph $G = (V,E,W)$ with nodes $V = \{v_1,...,v_N\}$, each representing a unique state (contact map), and edges $E = \{(v_i,v_j)\}$ with weights $w_{ij} \geq 0$ given by transition counts between nodes $v_i$ and $v_j$. In standard MSM modelling we would typically employ a clustered graph, with states representing conformational ensembles, validated for Markovianity at a chosen lag time. In our case we preserve interpretability and geometric meaning with unique contact state nodes.
To compute the committor on this undirected graph we treat it as a reversible flux-weighted graph. We first regularise the transition counts by adding a small regularisation factor $\alpha$:
$$ \hat{w}_{ij} = w_{ij} + \alpha $$
This allows us to symmetrise the matrix of counts $C$ to enforce reversibility:
$$ c_{ij} = \frac{\hat{w}_{ij} + \hat{w}_{ji}}{2} $$
This enforces detailed balance by construction without having to assume ergodicity, meaning the transition counts $C = \{c_{ij}\}$ now define the probabilities of a random walk over the graph:
$$ P_{ij} = \frac{c_{ij}}{\sum_k c_{ik}} = \frac{c_{ij}}{D_{ii}} $$
where $D$ is the degree matrix, with $D_{ii} = \sum_k c_{ik}$ the degree of node $i$. We can now exploit something called the Laplacian of the graph, which measures how a signal on a node differs from its neighbours and is given by:
$$ L = D - C $$
From the symmetric graph Laplacian we can read off each node's diagonal and off-diagonal entries:
$$ L_{ii} = \sum_{j \neq i} c_{ij}, \quad L_{ij} = -c_{ij} \quad (i \neq j) $$
The committor probability $q_i$ can be computed exactly for small graphs as the probability that state $i$ reaches the folded boundary node $B$ before the unfolded node $A$. Applying the Dirichlet boundary conditions $q(A) = 0$, $q(B) = 1$ results in the discrete harmonic equation:
$$ \sum_j c_{ij} (q_i - q_j) = 0, \quad \forall i \notin A \cup B $$
In practice, this construction yields a discrete, symmetric committor defined directly on the contact-map graph. Because the graph is undirected and regularised, the resulting committor reflects both kinetic connectivity and entropic structure in contact space. While this object is only an approximation of the true committor of the underlying MD, it provides a geometry-aware reaction coordinate consistent with our undirected treatment of contact-flip kinetics.
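For a small graph this boundary-value problem is a single linear solve: restrict the Laplacian to the interior nodes and move the boundary terms to the right-hand side. A minimal sketch, assuming a dense symmetrised count matrix and single source (unfolded) and sink (folded) nodes:

```python
# Discrete committor via the graph Laplacian with Dirichlet boundaries.
import numpy as np

def committor(C, source, sink):
    """Solve sum_j c_ij (q_i - q_j) = 0 for interior nodes, with
    q[source] = 0 and q[sink] = 1. C is the symmetrised count matrix."""
    C = np.asarray(C, dtype=float)
    n = C.shape[0]
    L = np.diag(C.sum(axis=1)) - C               # graph Laplacian L = D - C
    q = np.zeros(n)
    q[sink] = 1.0
    interior = [i for i in range(n) if i not in (source, sink)]
    # Restrict to interior nodes; only the sink contributes to the
    # right-hand side since q[source] = 0.
    A = L[np.ix_(interior, interior)]
    b = -L[interior, sink]
    q[interior] = np.linalg.solve(A, b)
    return q
```

On a uniform chain of states this recovers the textbook gambler's-ruin committor, increasing linearly from source to sink.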
As a proxy for a true committor probability on our non-MSM graph, $q_i$ gives us a reaction-coordinate-like ordering of states along the folding pathway. With such a reaction coordinate we can more explicitly identify the transition state, which is canonically defined as the bottleneck of the reaction. If $q = 0$ means unfolding before folding is certain and $q = 1$ means folding before unfolding (the folded state), then any node with $q \approx 0.5$ is sitting on the precipice of falling towards either state - imagine it perched on a ridge tilting into both the unfolded and folded energy basins. This is how we will define the transition state ensemble.
Using this approach we can take the 10 nodes with $q$ values closest to 0.5 as the transition state ensemble and compare to our previous estimations:
Figure 4: Chignolin protein folding graph on the contact space manifold (PCA and UMAP embedded). This time all edges except those between nodes with $q > 0.5$ are removed, and edges are coloured by $q$. Nodes are sized by $q$. Crosses label the top 10 nodes with committor probabilities closest to $0.5$. The size and colour of the crosses represent their committor probabilities. Note the yellow star, which is the only node shared between this q-based TSE and the top 10 betweenness centrality nodes. The blue star represents the folded node.
The discrepancy between these two sets is informative. For our graph, the high betweenness centrality nodes are more often energetic traps than true transition states; they do not necessarily lead to the folded state. In other words, geometric connectivity does not align with propensity to fold. Indeed there is no correlation between the values across all nodes:
Figure 5: Comparison of the committor probabilities and betweenness centralities of the q-based TSE (top 10 nodes closest to q=0.5) and the top 10 betweenness centrality nodes.
The lack of an apparent relationship suggests that, for Chignolin, the transition states largely sit away from the geometric hubs - the protein effectively bypasses highly connected geometric intersections on its way to folding. This is a strong indicator that the folding process in this case may be inefficient in its initial random search.
This difference between our geometric measure and kinetic measure exposes the robustness of more established methods in MSMs. However, this does not mean the geometric representation we gain from our graph is not useful. While committors provide kinetic insight, our graph’s geometric structure offers a unique opportunity to estimate energetic parameters directly.
Bridging Geometry with Energy
Some geometry that is meaningful is the vector between the unfolded and folded states: the geodesic shortest path adhering to the manifold. Even if the shortest path isn't energetically optimal, it serves as a useful directional bias in contact-map space - a vector that points in the direction of folding. Further, distances on our graph (specifically the number of edges, i.e. the Hamming distance) are a geometric measure of the number of contact flips needed to move from one state to another.
What's powerful about our manifold + temporal edges graph is that temporal transitions can have Hamming distances > 1 (more than one contact flip), indicating more than one unique contact change happened within the 200 ps timesteps of this trajectory data. This gives two useful properties I will use to attempt to infer individual contact flip kinetics:
- Higher Hamming distances in one timestep mean either that the barrier for the combined event is low relative to the timestep, or that intermediates are short-lived and unresolved. Such edges represent cooperative events.
- We can infer the sequence of manifold jumps taken in a temporal edge by the collection of shortest paths on the manifold to achieve it.
A simple histogram of Hamming distances for all temporal edges (where, if the Hamming distance is > 1, we infer the distance by shortest path on the manifold) nicely shows the appropriateness of a 200 ps timestep at capturing single-flip events:
Figure 6: Histogram showing the distribution of temporal transitions' inferred Hamming distances. Hamming distances are inferred by shortest path on the manifold edges between the two nodes of each temporal edge.
Now here's the really unique part: we can use these properties to estimate the kinetic cost of flipping a contact (either forming or breaking a contact between a pair of residues). This is particularly interesting for the likes of the WSME model, which uses contact energies $\epsilon_{ij}$ but canonically defines them with a uniform value - although to be explicit, what I describe here is a kinetic rate barrier, which is not equivalent to an energetic cost such as those used by the WSME model. However, in future work I'm hoping to extend this to a real energetic cost using a thermodynamic approach.
To estimate these contact flip rates I developed a relatively crude algorithm inspired by standards in MSM kinetics analysis (apologies for any cringe this causes experienced MSM modellers), namely:
- Observed temporal transition counts can be converted into relative rate-like quantities (under the fixed lag time of our 200ps timestep for this Chignolin simulation data).
- We can apply an Arrhenius-like transform to these rates to give relative kinetic barriers.
I say crude because, as I mentioned, these are not real energies, even if we transform them Arrhenius-style. The Arrhenius equation gives a rate $k$ from an energy barrier $\Delta G$:
$$ k = A \exp(-\Delta G / k_B T) $$
In our case we rearrange for the energy given the rate of the contact flip between residues, $k(a,b)$ (note this is not directional as we are currently only working with an undirected graph):
$$ \Delta G(a,b) = -k_B T \ln \left(\frac{k(a,b)}{A} \right) $$
The problem with this is that the attempt frequency $A$ is not arbitrary; it is specific to each system. For this work, instead of determining an exact $A$, we replace it with a selected reference rate, with the consequence that $\Delta G^\dagger$ becomes a relative kinetic barrier as opposed to an absolute activation energy:
$$ \Delta G^\dagger(a,b) = -k_B T \ln \left(\frac{k(a,b)}{k_\text{ref}} \right) $$
In practice, $k_{\text{ref}}$ can be taken as the median observed flip rate, yielding a measure of how difficult each contact flip is relative to a typical event. Alternatively, choosing $k_\text{ref} = \max_{a,b} k(a,b)$ assigns the fastest observed flip the lowest barrier, with all other contact flips measured relative to it. Given this formulation for the relative kinetic barrier of a contact pair, marginalised over all contact map configurations, I use my crude algorithm to estimate the non-directional flip rate $k(a,b)$ for any given contact pair on the graph.
Algorithm
For every temporal edge between nodes ($v_l$,$v_k$) with observed transition counts $C_{lk}^{\text{obs}}$:
- Find all shortest paths between $v_l$ and $v_k$ on the manifold graph, where $N_\text{paths}$ is the number of degenerate shortest paths.
- Assign each path a flux* $C_{lk}^{\text{obs}}/N_\text{paths}$
- For each path, iterate through its constituent manifold edges $(v_i,v_j)$, adding the flux to each edge's inferred count $C_{ij}^{\text{inferred}}$. Note this algorithm recovers the original temporal transition counts for temporal edges of Hamming distance 1.
*As our graph actually lives on a 21-dimensional hypercube, there are $d!$ possible shortest paths for any temporal transition of Hamming distance $d$. To reconcile this we can simply use a distributed-flux maximum entropy approach (the least assumptive choice), $C_{lk}^{\text{obs}}/N_\text{paths}$, where we uniformly split the transition counts between the paths. This naturally captures heterogeneity in inferred manifold edge counts, despite arising from uniform splits, because of differential usage of these edges within different temporal edge paths. E.g. if two temporal edges of Hamming distance 3 share some manifold edges in their shortest manifold paths, their flux is additive, meaning unshared manifold edges receive a different flux; you can imagine how this stacks over hundreds of temporal edges and their constituent manifold edge combinations.
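The flux-splitting step above can be sketched directly: on a hypercube, the shortest paths between two states are exactly the orderings of the flipped contacts, so they can be enumerated with `itertools.permutations`. This is a naive illustration under that assumption (factorial in the Hamming distance, fine for the small $d$ seen in Figure 6):

```python
# Distribute each temporal transition's counts uniformly over all
# degenerate shortest manifold paths (maximum entropy split).
from collections import Counter
from itertools import permutations

def distribute_flux(temporal_edges, counts):
    """temporal_edges: list of (state_l, state_k) pairs, each state a
    frozenset of contact pairs. counts: observed count per temporal edge.
    Returns inferred counts per manifold edge (pairs of adjacent states)."""
    inferred = Counter()
    for (sl, sk), c in zip(temporal_edges, counts):
        flips = sorted(sl ^ sk)             # contacts that differ
        paths = list(permutations(flips))   # d! degenerate geodesics
        flux = c / len(paths)
        for path in paths:
            state = set(sl)
            for contact in path:            # walk one contact flip at a time
                nxt = state ^ {contact}
                edge = frozenset((frozenset(state), frozenset(nxt)))
                inferred[edge] += flux
                state = nxt
    return inferred
```

For a Hamming-distance-1 temporal edge there is a single path, so the observed count passes through unchanged, matching the note in the algorithm above.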
Given that the same contact flip can be made by different nodes with different contact map contexts we must aggregate counts across all manifold edges involving a given contact flip $a,b$ to get a marginal score for that contact:
$$ C^\text{inf}(a,b) = \sum_{(v_i,v_j) \in \mathcal{E}_{ab}} C^\text{inf}_{ij} $$
where $\mathcal{E}_{ab}$ is the set of all manifold edges that represent flipping contact $(a,b)$:
$$ \mathcal{E}_{ab} = \{(v_i \rightarrow v_j) : \text{nodes } v_i \text{ and } v_j \text{ differ only in contact } (a,b)\} $$
Now we can infer the transition probability that contact $(a,b)$ flips within a timestep by normalising over the number of opportunities for that flip:
$$ P(a,b) = \frac{C^\text{inf}(a,b)}{N_\text{adj}(a,b)} $$
where $N_\text{adj}(a,b)$ is the combined state population count over all nodes involved in an $(a,b)$ contact flip:
$$ N_\text{adj}(a,b) = \sum_{(v_i \rightarrow v_j ) \in \mathcal{E}_{ab}} N_i $$
Note we only sum over $i$ because, as the graph is undirected, $\mathcal{E}_{ab}$ implicitly iterates over the reverse $(v_j, v_i)$. Given we know our timestep (lag time) is 200 ps, we can approximate the flip rate as the inverse mean waiting time, $k \approx P/\tau$, which assumes exponential kinetics. More precisely, under the approximation that contact flips are memoryless over the chosen lag time, we can map the discrete-time flip probability to an effective continuous-time rate using the standard exponential waiting-time relationship:
$$ k(a,b) = -\frac{1}{\tau}\ln(1-P(a,b)) $$
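Putting the aggregation, normalisation, and rate mapping together, a minimal sketch of the counts-to-barriers pipeline might look like this (using the median rate as $k_\text{ref}$; the dict layout and names are my own assumptions, not the analysis code):

```python
# Map aggregated contact-flip counts to relative kinetic barriers.
import math

def flip_barriers(contact_counts, adj_populations, tau=200e-12, kBT=1.0):
    """contact_counts: dict (a, b) -> inferred flip count C_inf(a, b).
    adj_populations: dict (a, b) -> N_adj(a, b), population of adjacent nodes.
    tau: lag time (here the 200 ps timestep).
    Returns dict (a, b) -> relative barrier ΔG† in units of kBT."""
    rates = {}
    for pair, c in contact_counts.items():
        p = c / adj_populations[pair]           # flip probability per step
        rates[pair] = -math.log(1.0 - p) / tau  # exponential waiting time
    ks = sorted(rates.values())
    k_ref = ks[len(ks) // 2]                    # (upper) median reference rate
    return {pair: -kBT * math.log(k / k_ref) for pair, k in rates.items()}
```

Because $k_\text{ref}$ only shifts the barriers by a constant, the choice of reference changes the zero point but not the ordering of contacts.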
Using this rate with our Arrhenius-like equation we get the kinetic barriers of the contact pairs found in our graph which we can visualise nicely with a contact map matrix:
Figure 7: (left) Chignolin contact matrix coloured by inferred contact flip relative kinetic barrier $\Delta G^\dagger(a,b)$ for 12 of the 14 native state contacts (right). The native state is taken as the most occupied node rather than the crystal structure PDB.
This gives us a clear indication of which contacts flip more readily in the observed dynamics. For example, residue pair $(1,6)$ flips far slower than the rest, suggesting there is a high kinetic barrier to forming (or breaking) this contact. Again I emphasise that as our graph is undirected we cannot resolve separate energies for forming versus breaking a contact; the kinetic barrier we have inferred is joint over both. Perhaps in a future blog I'll decompose this into formation and breaking barriers.
Either way, we can perform one last brief analysis to characterise the folding profile of Chignolin. As is typical in MSMs we can plot energy against a reaction coordinate - however, rather unorthodoxly, in this case we will use our contact pair kinetic barrier values in place of the energy. By taking the sum of kinetic barriers over every node's contact map we can get an idea of how slowly the trajectory enters and leaves that state. Plotted against committor probability as a reaction coordinate, we get a clear visualisation of the well-established two-state folding nature of Chignolin:
Figure 8: Chignolin two-state folding. Total kinetic barrier taken as the sum of all native-like contacts’ kinetic barriers is calculated for each unique contact map and plotted against the respective states’ reaction coordinate given by the committor probability q. Background shading is a density function of the scatter points. Central funnelling is characteristic of two-state folding.
This is all rather theoretical and unvalidated. One of the reasons for this is that, as I highlight later with the poor scaling of the contact map choice of discrete states, reasonably sized proteins such as Chymotrypsin inhibitor 2 (CI2), for which there is experimental $\phi$-value data to compare against directly, are largely intractable (at least on my laptop). Hence why I've left this work to a blog.
Take Homes
In this blog I've introduced modelling protein folding (specifically MD trajectories of folding) with discrete states as contact maps using graph networks. We've explored some useful graph properties and dipped our toes into the mature field of MSMs, for which I refer you to more dedicated work for a more accredited introduction. Although very much exploratory, I've outlined a somewhat physics-based method for inferring the kinetic barriers associated with residue contact pair flips. As mentioned, this is unfortunately not decomposable into formation or breaking costs and is also not an absolute free energy. I would like to emphasise that it is 100% hypothetical and I have not rigorously tested it against experimental data, but I hope it might have piqued your interest or inspired you.
Referring back to our opening statements, could any of this be useful in protein engineering applications? The answer is, quite possibly. For one our inferred contact flip kinetic rates present a clear picture of which contacts have high barriers, either slowing folding significantly or representing heavy structural constraints. Targeting such residues with rational modifications may well lead to intended changes in folding rates.
However, as I've mentioned, the most troublesome limitation is scaling with protein size. The problem is that the number of unique contact maps scales as $2^{\binom{N_\text{res}}{2}}$ with protein size $N_{\text{res}}$, and so for any reasonably sized protein that is not an artificial fast folder like the 10 residue Chignolin, node sizes converge to 1 as the number of possible contact maps vastly exceeds the number of samples.
There is a quick but dirty solution. We can cluster contact maps that are proximal on the contact-space manifold, reducing the number of nodes to levels amenable to data sizes and visualisation requirements. For example, let's consider the largest protein from the Lindorff-Larsen et al. folding dataset, the lambda repressor. Here's its folding graph embedded with PCA and UMAP onto the contact map manifold, then clustered with HDBSCAN:
Figure 9: Lambda repressor folding graph in the embedded contact map space, with nodes representing clusters of unique contact maps close in embedded PCA -> UMAP space. Grey edges connect temporally adjacent nodes (cluster labels adjacent in the trajectory, given each frame in the trajectory is assigned a cluster label). The blue shortest path operates on the temporal edges.
Also, although we used unique contact maps as our nodes, one could easily employ other state representations such as secondary structure content, which would have $4^{N_{\text{res}}}$ configurations as opposed to the contact map's $2^{\binom{N_\text{res}}{2}}$. For MSMs it's much more common to cluster on RMSD-to-native values, or even torsional angles.
In the end it comes down to scale vs mechanistic interpretability. Clustering is the only way we can even construct the graph for larger proteins, but it does not lend itself well to our kinetic contact flip analysis because it sacrifices manifold edge fidelity inside clusters. However, for more MSM-like free energy and committor analysis it remains functional, subject to the subjectivity of clustering choices.
Regardless, I’m going to keep looking into the contact flip energies on the side so stay tuned for anything useful I might find whether through blogs or publication (wish me luck!).
The code used in this analysis and links to the MD simulation data is publicly available in the following GitHub repo.