Introduction

Much of today's biological research is motivated by the desire to understand higher-order processes more fully. For example, our desire to understand human development and disease drives research on cell biology. In turn, our desire to understand cells has motivated us to understand signaling pathways and gene networks. At each of these levels, networks — entities connected by arrows based on their relationships to one another — have proven to be a valuable way of representing knowledge (Figure 1).

Figure 1

Example networks. (A) A signaling pathway with a cell (large green rectangle), proteins (colored ovals), a nucleus (dashed green circle), repression as blunted arrows, movement into the nucleus as a dashed arrow, and transcription as a solid black arrow. (B) The cell cycle with stages G1, S, G2, and M.

Now our desire to understand the molecular underpinnings of biology and disease is motivating research into molecular kinetics. For example, it has recently been discovered that small oligomers of just a few Aβ peptides may be the toxic elements in Alzheimer's disease 1. Determining their structures could aid in designing drugs to prevent their formation; however, the structural heterogeneity of these oligomers makes accurate structural characterization with conventional methods difficult. Fortunately, computational modeling can capture both the dominant structures and dynamics of these molecules 2. Another example is proteomics. Now that genomics has given us the sequence of the human genome, there is a push for high-throughput structure prediction to obtain the structures of all the proteins encoded therein. Information-based methods have proved useful for small globular proteins, but physical models (which capture both thermodynamics and kinetics) are likely required to push to systems like membrane proteins for which less structural data is available. Since the majority of drugs on the market target membrane proteins, such information would again be valuable for designing therapeutics.

As in many other fields, networks are a valuable framework for representing knowledge of molecular kinetics. In particular, networks called Markov state models (MSMs) are proving to be a powerful means of understanding processes like protein folding and conformational changes 3, 4, 5, 6, 7, 8, 9, 10, 11, 12. The power of MSMs derives from the fact that they are essentially maps of the conformational space accessible to a system. That is, like a road map with roads labeled with speed limits and cities labeled with populations, MSMs give the probability that a protein or other molecule will be in a certain set of conformations (called a state or node) and describe where it can go next and how quickly. Figure 2 shows a portion of an MSM as an example. Examining these maps can give tremendous insight into processes like protein folding and could even suggest how to manipulate these processes with small molecules, mutations or other perturbations. Models with sufficient resolution can also yield quantitative agreement with, or even prediction of, experimental observables like folding rates and structures 4, 5, 6.

Figure 2

An example MSM for the villin headpiece. Shown here are four clusters of conformations automatically identified by MSMBuilder. Each cluster represents a state of the villin protein. Arrows indicate transitions between states, also identified by MSMBuilder. The group or cluster representing the native state (right-most) was accurately identified. Its members match the crystal structure (shown in darker blue and magenta) with an average root mean square deviation (RMSD) of 1.8 Å. Courtesy of Joy Ku and Gregory R. Bowman, reproduced with permission from Biomedical Computation Review.

MSMs are typically constructed from simulation trajectories (i.e. series of conformations that were visited one after another in a physical simulation of a system, like a protein) 4, 5, 10, 12, 13, 14. Because of the temporal relationship between conformations in a trajectory, it is possible to group conformations that can interconvert rapidly into states and then determine the connectivity between states by counting the number of times a simulation went from one state to another. By employing these kinetic definitions, one ensures that the system's dynamics can be modeled reasonably well by assuming stochastic transitions between states 3, 4, 5, 6, 7, 10, 11, 13, 15. Thus, it is possible to perform analyses, such as identifying the most probable conformations at equilibrium. In addition, one can naturally vary the temporal and spatial resolution of an MSM by changing the definition of what it means to interconvert rapidly or slowly 4, 5, 15, 16, 17. By choosing a long timescale cutoff, one can obtain humanly comprehensible models with just a few metastable (or long-lived) states that capture large conformational changes, like folding. Such coarse-grained models are useful for gaining an intuition for a system. With a short timescale cutoff, on the other hand, one can obtain a model with many states. By using such high-resolution models, one sacrifices ease of comprehension for more quantitative agreement with experiments 4, 5, 18.

To date, MSMs have mostly been used to understand phenomena like peptide and protein folding 4, 5, 13, 17, 19, 20, 21, 22, 23, RNA folding 6, 16, and conformational changes 8, 9, 24. Having been validated by these studies, they are now being applied to important topics in human health. Examples include protein aggregation in Alzheimer's disease 2 and vesicle fusion 25, an important step in influenza infection.

The remainder of this review will be divided into four major sections. First, we review the application of MSMs to biomolecular folding, one of the driving problems behind the development of this technology. The next section focuses on the application of MSMs to human health, particularly protein misfolding diseases and influenza. This is followed by a review of MSM methodology and a discussion of efficiently capturing long timescales with short simulations and MSMs in the last two sections.

Protein and RNA folding

Protein folding is one of the biological problems that has driven the development of MSMs. From a biophysical point of view, it is simply amazing how proteins collapse to specific structures so quickly given the astronomical number of possible conformations they can adopt. From a biomedical point of view, understanding folding is an important first step towards understanding and preventing misfolding, as in Alzheimer's and Huntington's diseases 26. Better models for protein folding could also allow more efficient structure prediction 27, 28, aiding high-throughput proteomics initiatives.

Unfortunately, understanding protein folding is notoriously difficult. Much experimental work has been done to characterize protein folding, but no experiment can capture the dynamics of an entire system with atomic resolution in order to provide a complete picture of the process. Computer simulation has the potential to fill this void, but reaching biologically relevant timescales with sufficient (atomic) resolution is extremely challenging 29, and even with an infinitely long simulation one would still need a way to extract meaningful information from it. MSMs can overcome these issues by performing a kinetic clustering of conformations to provide maps of a system's conformational space and extracting long timescale dynamics from many short simulations 4, 5, 10, 13, 30, similar to the way relay runners can cover large distances despite the fact that each runner only goes a little way. As explained in the section 'Reaching biologically relevant timescales with MSMs', MSMs may even be used to direct simulations to where they are needed most (via adaptive sampling) to efficiently explore the conformations that a molecule can adopt 6, 31, 32, 33.

Many of the initial applications of MSMs have been to small peptides due to the challenges of simulating larger proteins 12, 13, 17, 20, 21, 22, 34, 35. For example, some of the fundamental methods were laid out by Singhal et al. 36 in an early work on a tryptophan zipper β hairpin. In another noteworthy example, Chodera et al. 13 developed and applied one of the first automated methods for building MSMs to a number of small systems. Buchete and Hummer 21 have also demonstrated how to tune MSMs to obtain models with varying degrees of coarse graining.

In one of the first forays into full protein systems, Jayachandran et al. 37 constructed an MSM for the villin headpiece. Villin was selected because it has a hydrophobic core and tertiary contacts, which make it a real protein rather than just a peptide, but is still extremely small and fast folding 38, 39. Villin only has ∼35 residues and folds on the microsecond timescale, whereas most proteins have a hundred or more residues and fold on millisecond to second timescales; thus, villin is more amenable to computer simulations, which are generally no longer than tens of microseconds.

Villin was also the first target to be studied by automated methods 4, mainly the MSMBuilder package 15, which is now freely available on the web (https://simtk.org/home/msmbuilder/). This new MSM, a portion of which is depicted in Figure 2, was validated by showing that it is capable of reproducing the raw simulation data. Moreover, it also yields reasonable agreement with experimental measurements. For example, based on free energies calculated from the model, it was possible to predict the native structure to within 2 Å of the X-ray crystal structure (Figure 2). This was a significant advance since the native state of a system is determined by the lowest free energy state, yet most structure prediction algorithms are still based on identifying low-energy conformations due to the difficulties inherent in calculating the entropic component of free energies 40, 41. Making predictions based on free energies will be especially important for more complex systems, like inherently disordered proteins, where the lowest energy state may not be the lowest free energy state. The longest timescales in our villin MSM were also consistent with experimentally measured folding rates 4. Capturing these rates is critical because most experimental observables can be derived given complete knowledge of the relevant rates and a few conformations from each state.

Similar methods have also been used to study the PinWW domain 5. This small protein consists of a three-stranded β-sheet and is another common model system for protein folding studies 42. One important contribution of this work was a method for probing the equilibrium ensemble of folding pathways, which is based on transition path theory 43 and related methods 44. This study also revealed numerous parallel folding pathways, as well as misregistered trapped states which slow the folding process.

MSMs have also been applied to the related problem of RNA folding 6, 16. This research has been driven by recent developments demonstrating that RNA does more than convey information from DNA to proteins and, like proteins, often must fold into specific three-dimensional shapes to accomplish these roles 45. The ribosome is a noteworthy example; however, it is far too large to sample exhaustively. Thus, simulations have tended to focus on small motifs, like RNA hairpins 46, 47. RNA hairpins are composed of a double helix stem capped by a short loop. Despite their small size, they play many important roles. They are ubiquitous in known RNA structures, may serve as nucleation sites for the folding of larger molecules and play important roles in RNA-RNA and RNA-protein interactions 46, 48. In the past, generalized ensemble simulations have been used to try to understand the thermodynamics of RNA hairpins 46, 49, 50. More recent work has begun to apply MSMs to capture both their thermodynamics and kinetics, showing that current force fields are at least sufficient to identify a hairpin as the native state and get the order of magnitude of the folding rate correct 6, 16.

Other studies have also used MSMs to try to understand phenomena like downhill folding 51, conformational changes in membrane proteins 9, the effects of mutations 52, proton pumping 53 and human disease (to be discussed in the next section). However, most of the work on protein and RNA folding has focused on methods development and validation. In the coming years, it will be exciting to identify common features in protein free energy landscapes, explain exactly how proteins are able to fold so quickly and elucidate the origins of experimental observables. Capturing longer timescales (i.e. milliseconds to seconds and beyond) will be crucial for these future developments. An important step in this direction is a recent work that was able to capture the folding of NTL9 54, an α/β protein that folds on the millisecond timescale 55. Further application to other biomolecules should also prove fruitful.

MSMs and human health

Protein folding is not the only process that is difficult to characterize experimentally. Inherent disorder in processes like aggregation and vesicle fusion also makes them good candidates for study with MSMs. Again, experimental studies of these systems have limited resolution. Computer simulations can provide atomic-level descriptions of their dynamics and, with MSMs, reach the relevant timescales.

One particularly noteworthy example of applying MSMs to human disease is a recent work on Aβ aggregation, which plays an important role in Alzheimer's disease 1, 2. Aβ is extremely difficult to work with experimentally; however, with computer simulations and MSMs, Kelley et al. 2 were able to obtain models with atomic resolution capable of capturing dynamics on timescales of tens of seconds. To accomplish this feat, they first ran atomistic simulations of encounter complexes with varying numbers of monomers. An MSM was then constructed from this data using state definitions based on physical intuition (e.g. based on the number of monomers, dimers, trimers and tetramers). To achieve experimentally relevant concentrations of Aβ, they then added a diffuse state, and the rates of transitioning between this state and the encounter complexes were calculated using an analytic diffusion theory. This model, a schematic of which is shown in Figure 3, was found to give reasonable agreement with experimentally measured rates, a tremendous achievement given that standard simulations are about six orders of magnitude shorter than the relevant timescales. The MSM also gave important mechanistic insight into Aβ aggregation. In particular, the authors identified a reasonably populated C-terminal β-hairpin that was the main source of interactions between aggregated peptides, leaving the N-termini exposed to the solvent. The exposure of the N-termini to solvent makes them more accessible for binding with other molecules, explaining why antibodies targeting the N-terminus tend to have higher binding affinities than those targeting the C-terminus.

Figure 3

Markovian model for Aβ oligomerization. Our model was built using the different aggregation states as the Markov states; in a system with four chains, there are five such states: four monomers (MMMM), two monomers and one dimer (MMD), two dimers (DD), one monomer and one trimer (MT), and finally, one tetramer (Q). In addition, to include the effects of the low concentrations found experimentally, we discriminate encounter complex (EC) states (in which the chains are in close proximity) from separated states. The rate-limiting steps in the aggregation process are shown as dotted lines. The numbers associated with the transitions are transition probabilities. The significant figures were determined from the uncertainties in the transition probabilities. Some transitions with very low probability have not been shown for the sake of clarity. Reprinted with permission from Kelley NW, Vishal V, Krafft GA, & Pande VS, J Chem Phys, 129, 214707, 2009. Copyright 2009, American Institute of Physics.

In another example, MSMs were used to study vesicle fusion 25. Vesicle fusion is important for processes like infection by enveloped viruses (e.g. influenza); however, the large size of vesicles and the slow timescales involved make fusion inaccessible to standard computer simulations. Typical simulations have ∼10 000 atoms and reach tens of nanoseconds timescales, but understanding vesicle fusion requires reaching hundreds of microseconds timescales for systems with over 500 000 atoms. Once again, a combination of simulations with MSMs was able to overcome these limitations, leading to important mechanistic insights. First, vesicles were found to fuse via two pathways: a fast pathway wherein two vesicles become connected by a short stalk and then rapidly fuse, and a slow pathway in which the stalk state transitions to a strongly metastable hemi-fused state before finally fusing. In a later work, the authors were even able to probe the dependence of the relative probabilities and rates of these pathways on the lipid composition of the vesicles involved 56.

Other human diseases are also excellent candidates for study with MSMs. For example, thousands of short simulations have already been performed for the Huntingtin protein 57 and the influenza hemagglutinin fusion peptide 58. Thus, these systems are prime candidates for study with adaptive methods 6, 31, 32, which allow one to build an initial MSM with available simulation data and then refine it by running new simulations from each state.

MSM methodology

The dynamics of proteins and other molecules are governed by the system's underlying free energy landscape. Much like hikers on a natural landscape, proteins prefer to stay in the valleys of their landscape and only rarely (and generally slowly) cross over the barriers and peaks between valleys. MSMs are essentially maps of such landscapes 3, 4, 5, 6, 7, 10, 13, 15, 19. However, whereas road maps have cities connected by roads labeled with speed limits, MSMs have conformational states (sets of conformations in the same valley) connected by edges labeled by the probabilities of transitioning between them.

While MSMs may be visualized as networks, as in Figure 2, computers represent them as transition probability matrices (P) 3, 7, where the entry in row i and column j (Pij) gives the probability of transitioning from state i to state j in an interval called the lag time of the model (τ). The probability of being in any state at a particular time can be represented as a vector, v(t). The time evolution of v(t) can then be calculated using

v(t + nτ) = v(t) P(τ)^n

where P(τ) is the transition matrix with lag time τ and each multiplication by P(τ) advances the model through time by one lag time.
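
As a concrete illustration, the sketch below propagates a probability vector with a small, hypothetical transition matrix; the states and numbers are invented for illustration and this is not MSMBuilder code.

```python
import numpy as np

# Minimal sketch (not MSMBuilder code): propagate a state-probability vector with a
# row-stochastic transition matrix P(tau). The three states and their transition
# probabilities are hypothetical.
P = np.array([[0.90, 0.08, 0.02],   # row i, column j: probability of moving from
              [0.05, 0.90, 0.05],   # state i to state j within one lag time tau
              [0.01, 0.09, 0.90]])

v = np.array([1.0, 0.0, 0.0])       # all probability initially in state 0

n_steps = 50                        # advance the model by n_steps lag times
for _ in range(n_steps):
    v = v @ P                       # v(t + tau) = v(t) P(tau)

print(v)                            # approaches the equilibrium distribution
```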

The eigenvalue/eigenvector spectrum of a transition probability matrix gives important thermodynamic and kinetic information 3, 7. For example, the first eigenvalue always has a value of one and the corresponding eigenvector gives the equilibrium probabilities of all the states. Subsequent eigenvector/eigenvalue pairs give information about sets of states that interconvert on the same timescale and the rates at which these transitions occur. For example, a rate can be calculated from an eigenvalue using

k = −ln(μ)/τ

where μ is an eigenvalue, τ is the lag time, and k is a rate. This equation comes from the equivalence between discrete time MSMs and continuous time master equations (see Refs 10 and 13 for details). Using these properties and a few representative conformations from each state, it is possible to compute experimental observables 4, 5. One can even put error bars on these properties using Bayesian statistics 59, 60.
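
The following sketch shows how these quantities could be extracted numerically; it reuses the hypothetical matrix from above and is a generic NumPy calculation, not the specific implementation used in the cited work.

```python
import numpy as np

# Sketch (generic numpy, not the cited implementations): equilibrium populations and
# rates from the eigen-spectrum of a row-stochastic transition matrix.
P = np.array([[0.90, 0.08, 0.02],
              [0.05, 0.90, 0.05],
              [0.01, 0.09, 0.90]])
tau = 10.0                                    # lag time, in arbitrary units (e.g. ns)

# The stationary distribution is the left eigenvector of P with eigenvalue 1,
# i.e. an eigenvector of P transposed.
eigvals, left_vecs = np.linalg.eig(P.T)
order = np.argsort(-eigvals.real)             # sort eigenvalues from largest to smallest
eigvals = eigvals[order].real
left_vecs = left_vecs[:, order].real

equilibrium = left_vecs[:, 0] / left_vecs[:, 0].sum()
rates = -np.log(eigvals[1:]) / tau            # k = -ln(mu) / tau for the slower modes
timescales = 1.0 / rates                      # implied timescales of those modes

print("equilibrium populations:", equilibrium)
print("implied timescales:", timescales)
```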

MSMs are straightforward to work with once one has a valid state definition (one yielding Markovian behavior). One simply has to assign conformations to these states and represent trajectories as a series of state assignments rather than as a series of conformations. One can then count the number of transitions between pairs of states and store these values as a transition count matrix (C), where the entry in row i and column j (Cij) gives the number of observed transitions from state i to state j. A transition probability matrix (P) is obtained by normalizing each row of the transition count matrix. From P, one can calculate all the eigenvalues, eigenvectors and observables one desires.
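
A minimal sketch of this counting-and-normalizing step is shown below; the function name and the toy trajectories are hypothetical, and MSMBuilder provides its own implementation of these operations.

```python
import numpy as np

# Sketch of the counting-and-normalizing step (assumed workflow; the function name and
# toy trajectories are hypothetical).
def build_msm(discrete_trajs, n_states, lag=1):
    C = np.zeros((n_states, n_states))
    for traj in discrete_trajs:
        for t in range(len(traj) - lag):
            C[traj[t], traj[t + lag]] += 1        # count observed i -> j transitions
    row_sums = C.sum(axis=1, keepdims=True)
    # Row-normalize to get transition probabilities; guard states with no outgoing counts.
    P = np.divide(C, row_sums, out=np.zeros_like(C), where=row_sums > 0)
    return C, P

# Two short, hypothetical discrete trajectories over three states.
trajs = [np.array([0, 0, 1, 1, 2, 2, 1]), np.array([2, 2, 1, 0, 0, 0])]
C, P = build_msm(trajs, n_states=3)
print(P)
```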

The most challenging part of building MSMs is identifying a good state decomposition. A good state decomposition should group conformations that can interconvert rapidly into the same state because this implies that they are not separated by a significant free energy barrier. Conformations that cannot interconvert rapidly, however, should be separated into different states because they are likely separated by a significant free energy barrier.

Practically speaking, it is neither possible nor desirable to determine the rate (or equivalently the probability) of transitioning between two conformations. Rather, one must consider the rates of transitioning between sets of conformations. To understand this, one can imagine trying to measure how long it takes to get home from work. You could measure the time it takes to get from the front doorway of your office to the front doorway of your home. However, it would likely be more appropriate to measure the time it took you to get from anywhere in your work place to any point in your home, regardless of whether you left work through the stairwell door and entered your home through the front door or left work through the main door and entered your home through the garage door. In the same spirit, measuring the transition rates between sets of kinetically equivalent conformations is more meaningful than measuring the transition rates between individual conformations.

In the case of MSMs, one often obtains initial sets of conformations by clustering them into microstates based on geometric criteria (Figure 4), with the objective of having conformations within a given microstate be so similar that their geometric similarity implies a kinetic similarity 10, 13, 15. This initial decomposition may be ideal for making quantitative comparisons with experiments 4, 5, 18. To gain an intuition for the system, kinetically related microstates can then be lumped into macrostates to ensure a direct connection to the underlying free energy landscape (e.g. identify various valleys and how quickly one can get from one to another) 10, 13, 15. For example, in an earlier automated algorithm, one first generated a set of microstates, lumped them into macrostates using a method called PCCA 61, and then iteratively broke the macrostates up into new microstates and re-lumped them until convergence 13. A more recent automated algorithm (MSMBuilder), which is now freely available at https://simtk.org/home/msmbuilder/, avoids any iteration by using a different clustering algorithm to obtain finer clusters and then doing a single round of lumping using either PCCA or an improved method, called PCCA+ 62, 63. This procedure is outlined schematically in Figure 4.

Figure 4

Schematic representation of the steps required for building an MSM and obtaining representative conformations for each state. First, Generalized Ensemble (GE) data (or other data for that matter) represented by points are grouped into microstates represented by circles, with darker circles for more highly populated microstates. Kinetically related microstates are then lumped together into macrostates or metastable states, represented by amorphous shapes. Finally, representative conformations are obtained by extracting the most probable conformation from each macrostate. Reprinted from Methods, 49, Bowman, GR, Huang, X, Pande, VS, Using generalized ensemble simulations and Markov state models to identify conformational states, 197-201, Copyright (2009), with permission from Elsevier.
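
The sketch below illustrates only the lumping stage of this procedure, in its simplest form: microstates are assigned to macrostates according to the sign structure of the second right eigenvector of a hypothetical microstate transition matrix, which is the core idea behind PCCA. Full PCCA/PCCA+, as used in MSMBuilder, handles many macrostates and is considerably more sophisticated.

```python
import numpy as np

# Minimal, assumed sketch of kinetic lumping: group microstates into two macrostates
# by the sign of the second right eigenvector of a hypothetical microstate transition
# matrix (the essence of PCCA; real PCCA/PCCA+ is more general).
P_micro = np.array([[0.89, 0.10, 0.01, 0.00],   # microstates 0 and 1 interconvert quickly,
                    [0.10, 0.89, 0.00, 0.01],   # as do 2 and 3; transitions between the
                    [0.01, 0.00, 0.89, 0.10],   # two pairs are rare
                    [0.00, 0.01, 0.10, 0.89]])

eigvals, right_vecs = np.linalg.eig(P_micro)
order = np.argsort(-eigvals.real)
second = right_vecs[:, order[1]].real           # slowest non-stationary mode

macrostate_of_micro = (second > 0).astype(int)  # 0/1 macrostate label for each microstate
print(macrostate_of_micro)                      # e.g. [0 0 1 1] (labels may be swapped)
```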

Regardless of how an MSM is built, it is necessary to choose a lag time and validate the final model. A first check is that the implied timescales of the model level off as the lag time is increased; this indicates that a given state decomposition is reasonable and provides a way to choose the lag time (the lag time at which the implied timescales first level off) 64. Intuitively, this equates to checking that the slowest rates (often on the order of hundreds of nanoseconds or greater) are invariant with respect to the interval at which transitions are counted (often on the order of ten nanoseconds or less). Unfortunately, judging where the implied timescales of a model level off is extremely subjective, and new methods are needed to replace this criterion. Some new methods employing information theory and Bayesian statistics point to how this may be done 59, 65. Once a lag time has been selected, the transition probability matrix can be calculated and used to further validate the MSM by comparing its dynamics to the raw simulation data (the Chapman-Kolmogorov test) 4, 5, 13.
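
As an illustration of the implied-timescale test, the hedged sketch below recomputes the transition matrix at several lag times and reports the corresponding timescales; the helper function is hypothetical and not part of any published package.

```python
import numpy as np

# Sketch (hypothetical helper): rebuild the transition matrix at several lag times and
# report the slowest implied timescales. For a good state decomposition these
# timescales level off as the lag time grows.
def implied_timescales(discrete_trajs, n_states, lags, n_timescales=3):
    results = {}
    for lag in lags:
        C = np.zeros((n_states, n_states))
        for traj in discrete_trajs:
            for t in range(len(traj) - lag):
                C[traj[t], traj[t + lag]] += 1
        row_sums = C.sum(axis=1, keepdims=True)
        P = np.divide(C, row_sums, out=np.zeros_like(C), where=row_sums > 0)
        eigvals = np.sort(np.linalg.eigvals(P).real)[::-1]
        mu = eigvals[1:n_timescales + 1]
        results[lag] = -lag / np.log(mu)          # t_i = -tau / ln(mu_i)
    return results

# Hypothetical usage with the toy trajectories from the earlier sketch:
# for lag, ts in implied_timescales(trajs, n_states=3, lags=[1, 2], n_timescales=2).items():
#     print(lag, ts)
```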

Another important question is how many macrostates to construct when building a model to gain some intuition for a system. In the past, researchers have chosen a number of macrostates based on gaps in the implied timescales 10, 13, 15. Doing so imposes a separation of timescales – fast intrastate transitions, slow interstate transitions – that should give rise to Markovian behavior, i.e., the next state a trajectory visits should be independent of its history because it should be able to enter a state and quickly lose memory of where it came from before transitioning to a new state. However, recent work has shown that there is often a continuum of timescales without any obvious gap. Thus, the number of macrostates has come to be seen as a tunable parameter that can be adjusted based on one's objectives 4, 5, 18. For example, to gain an intuition for a system one may desire to lump a dataset into as few states as possible, while still preserving the Markov property and reasonable agreement with the raw simulation data.

Reaching biologically relevant timescales with MSMs

An important advantage of MSMs is their ability to capture long timescale dynamics from many short simulations 4, 5, 9, 15, 21, 30. Traditional approaches to molecular simulation require global equilibration, i.e., each simulation must be orders of magnitude longer than the slowest relaxation process so that every possible transition can be observed multiple times 66. In contrast, MSMs only require local equilibration 6, i.e., each simulation only needs to be long enough to equilibrate within a subset of all the possible states. One can then obtain a global model by statistically stitching together many short, parallel simulations covering different parts of the conformational space, much like creating a quilt from many small patches or relay runners covering long distances.

Adaptive sampling algorithms take this reductionist approach a step further 6, 31, 32. In adaptive sampling, one first obtains an initial model and then uses Bayesian statistics to calculate the contributions of each state to statistical uncertainty in some observable of interest, like the rate of the slowest process. A new round of simulations is then started from these states in order to increase the model precision as efficiently as possible. Thus, one avoids running more simulations where they are unnecessary while gathering more data where it can be of most use. Work with toy models has shown that performing multiple rounds of adaptive sampling can quickly refine a model 31. Indeed, it has recently been shown that adaptive sampling can reduce the wall-clock time necessary to achieve a given model quality by a factor of N, where N is the number of parallel simulations run during each iteration 33. Adaptive sampling can also reduce the total computer time necessary for a given model quality by a factor of two 33. An exciting future direction will be to apply adaptive sampling to real systems of biological significance that are currently beyond the reach of computer simulations, like conformational changes in transcription elongation 67.
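
To make the idea concrete, the sketch below implements one round of a much simpler counts-based heuristic, in which states with the fewest observed transitions are assumed to dominate the statistical uncertainty; the Bayesian schemes in the cited work are more rigorous, and the simulation launcher and conformation lookup shown here are hypothetical.

```python
import numpy as np

# Sketch of one adaptive-sampling round using a crude counts-based heuristic (a
# stand-in for the full Bayesian uncertainty calculation described above).
def pick_states_for_resampling(C, n_new_sims):
    outgoing_counts = C.sum(axis=1)              # total observed transitions out of each state
    return np.argsort(outgoing_counts)[:n_new_sims]

# Hypothetical adaptive loop: build an MSM, pick the least-sampled states, launch new
# short simulations from representative conformations of those states, and repeat.
# `run_short_simulation` and `representative_conformation` are placeholders.
# C, P = build_msm(discrete_trajs, n_states)
# for state in pick_states_for_resampling(C, n_new_sims=10):
#     run_short_simulation(start=representative_conformation[state])
```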

Even with adaptive sampling, however, MSMs are not without their limitations. For example, it is still unclear how to distinguish systematic errors from statistical uncertainty, much less how to correct for them. Furthermore, while MSMs are excellent for describing free energy landscapes in terms of their basins, capturing transition states can also be quite important for understanding the mechanisms of conformational changes 46. Recent work on the application of topological methods to understanding free energy landscapes shows how to capture such transition states 46, 68. Indeed, progress on combining MSMs and topological methods to capture both free energy basins and transition states is already yielding insight into processes like RNA folding 16. Moderate speedups may also be achieved by building MSMs from generalized ensemble simulations 6, 11, 15, 69, which perform a random walk in temperature, exploiting broad sampling at high temperatures to improve mixing between states at lower temperatures 70, 71.

Concluding Remarks

MSMs are a powerful means of mapping out the conformational space of both macromolecules and macromolecular complexes. While much of the literature on MSMs has focused on methods development and validation, they have already provided important insights into processes like protein folding, aggregation and vesicle fusion. Further progress in this direction will likely have important medical benefits, both by providing a deeper understanding of the molecular phenomena underlying higher-order biological processes (and especially human disease) and by allowing more effective drug and protein design. Methodological advances in the use of MSMs could also prove useful in systems biology and eventually find application in other fields relying on network representations, such as the study of gene networks and signaling pathways.