Research Article

1 March 2001

Transition between Stochastic Evolution and Deterministic Evolution in the Presence of Selection: General Theory and Application to Virology

Authors: I. M. Rouzine, A. Rodrigo, J. M. Coffin

DOI: https://doi.org/10.1128/mmbr.65.1.151-185.2001

SUMMARY

We present here a self-contained analytic review of the role of stochastic factors acting on a virus population. We develop a simple one-locus, two-allele model of a haploid population of constant size including the factors of random drift, purifying selection, and random mutation. We consider different virological experiments: accumulation and reversion of deleterious mutations, competition between mutant and wild-type viruses, gene fixation, mutation frequencies at the steady state, divergence of two populations split from one population, and genetic turnover within a single population. In the first part of the review, we present all principal results in qualitative terms and illustrate them with examples obtained by computer simulation. In the second part, we derive the results formally from a diffusion equation of the Wright-Fisher type and boundary conditions, all derived from the first principles for the virus population model. We show that the leading factors and observable behavior of evolution differ significantly in three broad intervals of population size, N. The “neutral limit” is reached when N is smaller than the inverse selection coefficient. When N is larger than the inverse mutation rate per base, selection dominates and evolution is “almost” deterministic. If the selection coefficient is much larger than the mutation rate, there exists a broad interval of population sizes, in which weakly diverse populations are almost neutral while highly diverse populations are controlled by selection pressure. We discuss in detail the application of our results to human immunodeficiency virus population in vivo, sampling effects, and limitations of the model.

The process of evolution is a consequence of the interplay of mutation, selection, and chance on a population of organisms, leading to an observable change in its genetic makeup. Since the time of Darwin, the influence of these factors on the evolution of organisms ranging from bacteria to humans has been intensively studied, both experimentally and theoretically, leading to a very large body of literature. Only recently, however, has attention been turned toward special problems in the evolution of viruses. Virus evolution is of particular interest and importance for three reasons. First, we desire to gain an understanding (usually in the absence of a fossil record) of how modern viruses have arisen from their earlier forms, both in recent times and in parallel with the evolution of their hosts. Second, the evolution of a virus during the course of infection of a single host, or along a short transmission chain, is of great importance in creating new populations with properties altered in important ways, such as evasion of the immune response, resistance to antiviral therapy, or altered virulence. Third, because of their high replication rates, simple genomes, large population sizes, and high mutation rates, viruses make good models for studying and testing evolutionary theory.

Particular attention has focussed on understanding the evolutionary forces that act on human immunodeficiency virus (HIV) during the course of infection of a single human host. HIV displays a remarkable extent of genetic variation concurrent with a high speed of evolution: in the most variable region of the genome (env), individual genomes within a population from an infected person can vary by as much as 3 to 5% (2, 43, 78); substitutions in env accumulate at a rate of approximately 1% per year (71), 50 million times faster than in the small subunit of rRNA (61). This variation has important consequences. It allows the virus to evolve to infect different cell types (9, 20, 30) and to rapidly become resistant to otherwise highly effective antiviral drugs (10, 47, 50); it may play a role in evading the immune system (4, 56, 73, 79). Furthermore, its high mutation rate (estimated to average about 3 × 10⁻⁵ per nucleotide site per replication cycle [49]), large population size (variously estimated from about 10⁷ to 10⁸ productively infected cells), and continuous steady state, in which the large majority of virions and productively infected cells turns over every day (25, 77), create a situation which, at least in principle, is amenable to (and requires) mathematical modeling.

To date, a number of modeling approaches have been applied to understand the evolution of HIV in vivo. These approaches use either population genetic (mutation frequency distribution) or phylogenetic inference using virus sequences obtained from HIV-infected individuals. In general, they are based on one of two different theoretical frameworks to the evolution problem. Deterministic approaches, including quasispecies theory (15, 26), assume that the population size is very large, such that the frequency of a given mutation at any given time is completely predictable if one knows the initial frequency, the mutation rate, and the selection coefficient (i.e., the differential growth rate conferred by the different alleles). At first glance, such approaches would seem justified by the large number of infected cells at each generation (21); however, a number of factors, such as variation in the replication potential and generation times among infected cells, may lead to an effective population size much smaller than the actual number of infected cells. Stochastic models, as applied to HIV (to this point), proceed from the opposite assumption: that the effective population size is so small (or that selective forces are so weak) that random drift dominates over selection. The hypothesis of selectively neutral mutations has a long, successful history in describing the evolution of organisms where populations are small (and not uniformly distributed) and mutation rates are very low (36). Their applicability to virus populations remains to be established. Many of the assumptions that underlie neutral theory are not appropriate for virus populations, and a number of characteristics of HIV genetic variation in vivo, such as the uneven ratio of synonymous to nonsynonymous changes in different regions of the genome (5, 44, 48), argue against simple application of neutral theory. 
However, inclusion of selection effects in evolutionary analysis (for example, the coalescent method) presents a mathematical challenge that has not yet been fully solved in a practical fashion, although progress toward this goal has been made recently (42, 55).

As an example of the difference between deterministic and stochastic models, consider the question of the frequency in a population of a mutation that is slightly deleterious to virus replication. In a deterministic system, it can be easily calculated that the frequency of such a mutation in the population will come to equilibrium at a point equal to the mutation rate divided by the selection coefficient (24). In a stochastic system, the population will usually be completely uniform in one variant or the other (76), switching rarely but rapidly from one form to the other. This theoretical experiment is of great practical importance in that it describes the appearance of a mutation that can confer resistance to an antiviral drug even before treatment.
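The deterministic equilibrium quoted above can be checked numerically. The following is a minimal sketch, assuming a discrete-generation recursion in which mutant-infected cells yield (1 − s) times as many infectious virions and mutation is symmetric at rate μ per generation; the function names and the parameter values (s = 0.01, μ = 3 × 10⁻⁵) are illustrative choices, not taken from the text.

```python
def deterministic_step(f, s, mu):
    """One generation of an assumed infinite-population (deterministic) recursion."""
    # Selection: mutant-infected cells contribute (1 - s) times as many virions.
    f_sel = f * (1 - s) / (1 - s * f)
    # Symmetric mutation when the next generation of cells is infected.
    return f_sel * (1 - mu) + (1 - f_sel) * mu


def equilibrium_frequency(s, mu, f0=0.0, generations=20000):
    """Iterate the recursion long enough for the mutant frequency to settle."""
    f = f0
    for _ in range(generations):
        f = deterministic_step(f, s, mu)
    return f


s, mu = 0.01, 3e-5           # illustrative values with mu << s << 1
f_eq = equilibrium_frequency(s, mu)
print(f_eq, mu / s)          # the two numbers should be close when mu << s
```

Under these assumptions the iterated frequency settles close to μ/s, with small corrections of relative order s and μ/s.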

To solve this problem and many others, it is clear that a more general theoretical framework is needed: one that takes into account both selection and drift under a set of assumptions more appropriate to viruses than is found in theoretical works published to date. Our aim in this work was to develop, from first principles, a general theory that includes the effects of both selection and drift on a population. We use a set of assumptions appropriate to virus populations, focusing on the interplay between deterministic and stochastic behavior in the context of virologically realistic experiments. We apply these to the simplest possible model: mutation at a single site with only two alleles, replicating in a steady-state system (that is, a constant number of infected cells) under the influence of constant selective pressure in a single isolated population. Because we are dealing with a single locus, we do not consider recombination explicitly; because we are dealing with haploid populations, we do not have to consider allelic dominance. It should be noted that although we do not consider recombination explicitly, the presence of strong recombination must be, in fact, implied for the one-locus approximation to be quantitatively correct. Also, nonconserved loci must be spaced sufficiently far apart in the genome, depending on the recombination rate. Even in the absence of recombination, the one-locus approximation is a useful starting point for understanding interactions between selection and stochastic factors at a qualitative level. We present a complete model that considers the full range of possible values for population size, mutation rate, and selection effects. Despite its simplicity, the model is surprisingly rich in its descriptive power. 
At the extremes, the results of this model correspond to the standard results of deterministic or neutral theory; however, we have found that there is a large range of values for the key parameters in which the system behaves in an intermediate fashion: under some conditions its evolution is dominated by stochastic factors, whereas at other times it behaves in a nearly deterministic fashion. We refer to this range of parameter values as the “selection-drift” regime and describe its properties in detail.

This work is divided into two major parts. In the first, we present all the principal results in qualitative terms, using language appropriate for a reader trained in biology and with a moderate level of mathematical sophistication. This part is accompanied by a number of illustrative examples obtained by computer simulation. Although keyed to the mathematical formalism of the second part, it is designed to be read independently and to provide the reader with an understanding of the principal results and their biological significance, particularly in the context of virus populations. The second part is a formal mathematical derivation of the principal results of the model. These results are listed at the beginning of each section and derived in the following subsections. Although some of the derivation presented is not novel, in that it parallels classic work of a number of population biologists (18, 19, 23, 24, 31, 37, 81, 82), its formal application specifically to virus systems is, to the best of our knowledge, a new approach, and we present it in full for this reason, as well as to provide a thorough and self-contained review. Although some of our mathematical methods differ from the classic methods, the final results are identical.

The presentation in both parts of this work proceeds in parallel. We first develop the basic evolution equation, which describes, at least in a statistical sense, the change in frequency of a mutant allele as a function of time and the key parameters: mutation rate, selection coefficient, and population size. We then present the predicted results, for all three regimes, of a set of virological experiments: accumulation and reversion of deleterious mutations, competition between mutant and wild-type viruses, gene fixation, mutation frequencies at the steady state, divergence of two populations split from one population, and genetic turnover within a single population. Next, we discuss sampling statistics and the application of this theory to some specific real-world experimental issues of virus and organismal evolution. Finally, we discuss the application and extension of this theoretical framework to other problems, including multilocus evolution and phylogenetic analysis.

QUALITATIVE DISCUSSION AND COMPUTER SIMULATIONS

Description of the Model and the Evolution Equation

In this section, we introduce the population model and explain how to approach the problem of evolution when random factors enter the picture. First we describe a one-locus, two-allele population model based on the virus replication cycle and discuss briefly the main factors of evolution included in the model. This is followed by a discussion of the biological meaning of the evolution equation. Finally, the boundary conditions for the evolution equation describing the properties of a weakly polymorphic population are described.

Virus population model.

First, we choose a basic model of virus evolution. For the purposes of simplicity, we consider the evolution of one nucleotide position at a time, and we assume that each nucleotide has a choice between only two alleles. (Such a model applies directly to multiple loci if the evolving loci are sufficiently distant and the recombination rate is sufficiently high. Evolution at closely situated loci or in the absence of efficient recombination is not independent [see “Many loci and other aspects” below].) Conventionally, we denote the better-fit allele as wild type and the less-fit allele as mutant. A deleterious mutation event (from wild type to mutant) will be referred to as forward mutation, and an advantageous mutation event will be referred to as reverse mutation. Each separate nucleotide will be characterized by two parameters, both of which are assumed to be much less than unity: the mutation cost (or selection coefficient), s, which is the relative difference in fitness between the two alleles, and the mutation rate per base per replication cycle, μ. We assume that mutations at different nucleotides have a weak additive effect on virus fitness. In doing so, we neglect epistasis (coselection) arising due to biological interaction between nucleotides at both the nucleotide and protein levels. We also ignore linkage disequilibrium between loci due to random drift, so that different nucleotides evolve independently (see the Introduction). The mutation rate is set, in our work, to be the same in the forward and reverse directions. For example, for HIV in infected cells the mutation rate per base is in the range of 5 × 10⁻⁶ to 5 × 10⁻⁵, depending on the type of substitution (49, 68). The selection coefficient will vary over a wide range according to the specific base and to the specific conditions of replication, but it is assumed to be constant over the period of observation; in other words, there is no selection for diversity.

The basic model of virus replication is illustrated in Fig. 1. Consider the dynamics of a cell population infected by two genetic variants of a virus: a fraction (f) of cells is infected by the mutant virus, and the remaining cells (1 − f) are infected by the wild type. The number of mutant-infected cells may change with time, i.e., with each new generation of cells. The total cell count is assumed to be constant. During a generation step, each cell produces a fixed (large) number of virions and then dies and is replaced by an uninfected cell. The number of virions produced and capable of infecting new cells differs, by a factor of 1 − s, between cells infected with different variants, creating selection for the better-fit (more prolific) variant. Since the total number of infected cells is fixed and the number of virions produced per cell is large, only a small fraction of the virions infect the next generation of cells. On infecting a cell, each virion has a small chance of mutating into the opposite genetic variant, given by the mutation rate introduced above. All the virions produced by a cell afterwards represent the same genetic variant. Thus, intracellular interference between variants does not occur. (This lack of intracellular competition is a reasonable assumption for retroviruses or when the proportion of infected cells in a tissue is much lower than 100%. It may vary in other virus models, when the multiplicity of infection is high.)

Fig. 1.

Some details of the model, such as fixed burst sizes and the point of the replication cycle at which mutation occurs, are of no consequence when long timescales are considered. Overlap in time between generations of infected cells was neglected but causes a factor of 2 increase in the rate of random drift (52). By contrast, such assumptions as two variants per base and the absence of both coselection and selection for diversity are essential. The model includes a minimal set of three factors of genetic evolution: random drift due to sampling of genomes, mutation, and selection. Let us characterize briefly the effect of each of these factors on the composition of the population as it changes with time.

The model assumes that the virions infecting each new generation of cells are chosen randomly from the virions produced by the mutant and wild-type subpopulations. As a result of this random sampling of genomes, the mutant frequency experiences random drift in time (18, 80), as shown in Fig. 1a. In the absence of mutation and selection, any population composed originally of a mixture of alleles eventually becomes uniform in either genotype (i.e., the allele is fixed), with the probabilities depending on the initial composition.

Selection enters our model through the difference in the number of infectious progeny produced by cells infected with different genetic variants. Selection alone drives the system into a state consisting entirely of the better-fit variant.

Mutations, in contrast to random drift and selection, favor inhomogeneity. If the other two factors are absent, mutations push the system toward the equilibrium composition at which the total numbers of forward and reverse mutations per generation are in balance. For equal forward and reverse mutation rates assumed here, equilibrium occurs at 50% of each allele.

If all three factors are at work and there are no external perturbations, the population will eventually reach a dynamic steady state in which mutation, on average, is in balance with selection and/or random drift. In the steady state, the statistical properties of the population no longer vary with time; i.e., even though the genetic composition may fluctuate strongly with time, all the mean values, standard deviations, etc., remain constant. The whole model with the three factors of evolution is illustrated in Fig. 1b.
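The generation step described in this subsection can be sketched as a short simulation. This is an illustration under stated assumptions rather than the simulation code behind the figures: the order of operations (selection, then mutation, then random sampling of N infecting virions), the function names, and the parameter values are all choices made here.

```python
import random


def next_generation(k, N, s, mu, rng):
    """Return the number of mutant-infected cells in the next generation."""
    f = k / N
    # Selection: fraction of infectious virions that carry the mutant allele.
    p = f * (1 - s) / (1 - s * f)
    # Mutation on infection, with the same rate in both directions.
    p = p * (1 - mu) + (1 - p) * mu
    # Random drift: each of the N new cells is infected by a randomly drawn virion.
    return sum(1 for _ in range(N) if rng.random() < p)


def simulate(N, s, mu, f0, generations, seed=0):
    """Trajectory of the mutant frequency f over the given number of generations."""
    rng = random.Random(seed)
    k = round(f0 * N)
    trajectory = [k / N]
    for _ in range(generations):
        k = next_generation(k, N, s, mu, rng)
        trajectory.append(k / N)
    return trajectory


traj = simulate(N=200, s=0.05, mu=1e-3, f0=0.5, generations=300)
print(traj[0], min(traj), max(traj), traj[-1])
```

Setting s = 0 and mu = 0 leaves only random drift, and raising N makes the trajectory approach the deterministic behavior; rerunning with different seeds shows how much individual runs differ.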

Stochastic equation of evolution.

Different meanings can be assigned to the word “evolution.” For the task at hand, evolution of the population is characterized by the dependence of the frequency of cells infected with mutant virus on time. In deterministic dynamics, which applies only in very large populations of infected cells, if one knows the initial mutant frequency and has the appropriate equations, one can, in principle, predict the mutant frequency at later times with arbitrary precision. (In practise, the equations are never known exactly, since there are many different factors in play, but this is a separate issue [68].) By contrast, in the presence of random factors, the time dependence of the mutant fraction cannot be predicted even in principle. Even if one knows its precise initial value, the error with which one can predict its value later grows with time. If random factors are strong, the error in the mutant frequency and its value become eventually comparable. Evolution of the mutant frequency, in other words, is a random process.

Randomness of mutations does not mean, however, that the evolution of a population is totally arbitrary. On the contrary, useful predictions can be made about its statistical properties even if its specific state cannot be predicted. Instead of time dependence of the mutant frequency, one has to consider the time-dependent probability density [ρ(f)], defined as the chance that a given population has a mutant frequency near a particular value. The probability density, which can be introduced if both subpopulations (mutant and wild type) are large, is closely related to a histogram derived by plotting the number of times the mutant frequency of a population is observed to lie within a certain range of values. When both the number of similar experiments and the number of histogram bars are very large, the histogram becomes, in the limit, a smooth function, which is the probability density. (The histogram and the probability density differ by a constant factor: the total area under the probability-density curve [integral] is, by definition, the total probability of having any value of the mutant frequency and is, of course, equal to 1.) The density function contains information about the most relevant statistical parameters (average values and standard deviations) which can be compared with experiment (see “Experiments on evolution and observable parameters” below). In particular, the characteristic width of the probability density peak indicates the error within which the mutant frequency can be predicted.
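The relation between a histogram of replicate populations and the probability density can be illustrated numerically. The sketch below assumes pure random drift (no mutation or selection) in a hypothetical population of N = 100 infected cells, follows 1,000 replicate populations for 10 generations, and rescales the histogram of final mutant frequencies so that its total area is 1; all names and values are illustrative.

```python
import random


def drift_step(k, N, rng):
    """Resample N infected cells from a virion pool with mutant fraction k/N."""
    p = k / N
    return sum(1 for _ in range(N) if rng.random() < p)


def final_frequencies(N, f0, generations, replicates, seed=1):
    """Mutant frequency at the end of each independent replicate population."""
    rng = random.Random(seed)
    finals = []
    for _ in range(replicates):
        k = round(f0 * N)
        for _ in range(generations):
            k = drift_step(k, N, rng)
        finals.append(k / N)
    return finals


def density_histogram(values, bins):
    """Histogram rescaled so that the total area under the bars equals 1."""
    width = 1.0 / bins
    counts = [0] * bins
    for v in values:
        counts[min(int(v / width), bins - 1)] += 1  # v == 1.0 goes in the last bin
    return [c / (len(values) * width) for c in counts], width


finals = final_frequencies(N=100, f0=0.5, generations=10, replicates=1000)
density, width = density_histogram(finals, bins=20)
for i, d in enumerate(density):
    print(f"{i * width:.2f}-{(i + 1) * width:.2f}: {d:.2f}")
```

Dividing each count by the number of replicates times the bin width is the constant factor that turns raw counts into an estimate of ρ(f).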

The stochastic evolution equation (equations 1 and 2) (Fig. 2a) expresses the rate of change in the probability density with time in terms of its form at the present moment. Using such an equation and knowing the initial probability density, one can predict its form, in principle, at any time in the future, similarly to how one would predict the mutant frequency itself for a deterministic process. The difference between the two cases is that the time-dependent variable is now a function rather than a number. We derive the evolution equation directly for the population model introduced in the previous subsection, at the beginning of the mathematical part of our work (see “Mathematical results and derivations” below). The rest of the mathematical part is devoted to solving the equation for different important cases. Here we only show how the equation looks when the probability density is localized in a small region near some value of the mutant frequency and comment on its meaning from a more qualitative perspective.

Fig. 2.

The right-hand side of the equation shown in Fig. 2a is a sum of three terms, which together describe how the shape of the probability density function, ρ, changes over a short time interval, dt. The first term describes random drift, the second describes selection, and the third describes mutation. To clarify the roles of the three terms in describing evolution, we consider each of them separately, by setting the other two terms equal to 0 (Fig. 2b to d). As a convenient example, we examine a probability density localized in a small region near some value of the mutant frequency (f_max). In this example, the second term, by itself, means that the probability density increases with time on the left side of the peak and decreases on the right side of the peak. As a result, the probability density peak, whose shape stays constant, shifts to lower mutant frequencies, as it should in the presence of selection (Fig. 2c). The third term in the equation, by itself, causes a shift of the peak as well, but the direction of the shift is toward 50% composition, which is the expected effect of mutation when the forward and reverse mutation rates are, as assumed, equal (Fig. 2d). The effect of the first term in the equation is of a different kind. Due to this term, the probability density decreases in the interval between the inflection points A and B (Fig. 2b) and increases everywhere outside of the interval. As a result, the probability density spreads outward. This is random drift: the error within which one can predict the value of mutant frequency increases with time. A more general form of the stochastic equation, when the probability density, ρ(f), is spread over a broad interval of f, is given in equations 1 and 2.
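For orientation, one conventional way to write a drift-selection-mutation equation of this kind is sketched below. It is an assumed standard form for a haploid Wright-Fisher model with discrete generations, time t measured in generations, and symmetric mutation; it is not copied from Fig. 2a or equations 1 and 2, whose exact form and normalization may differ.

```latex
% Assumed standard form (not reproduced from the article's Fig. 2a):
% the three terms are, in order, random drift, selection, and mutation.
\frac{\partial \rho}{\partial t}
  = \frac{1}{2N}\,\frac{\partial^{2}}{\partial f^{2}}\bigl[f(1-f)\,\rho\bigr]
  + s\,\frac{\partial}{\partial f}\bigl[f(1-f)\,\rho\bigr]
  - \mu\,\frac{\partial}{\partial f}\bigl[(1-2f)\,\rho\bigr]
```

In this form, the selection term alone moves a narrow peak toward lower f at speed s f(1 − f), the mutation term alone moves it toward f = 0.5 at speed μ(1 − 2f), and the drift term alone broadens it, in line with the qualitative description above.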

In the equation in Fig. 2a, a physicist will recognize a particular case of the Fokker-Planck equation and a mathematician will recognize a case of the forward Kolmogorov equation (41). It was introduced into the field of population genetics by Wright (81) and then intensively used to study evolution in the presence of different factors (31-33, 37). As it turns out, the equation is much more general than the virus model we used for its derivation in the mathematical section of this review. It describes a broad range of population models, from a bacterial culture to a randomly mating population without allelic dominance (35). Originally, the approach of the Fokker-Planck equation was introduced into population genetics from a phenomenological perspective, based on analogy to gas kinetics (18). Later, the validity of this approach was confirmed for different population models (52, 75). Examples of essential factors which are not included in the equation but which may or may not be important, depending on the experimental system, are epistasis (biological interaction) and linkage between multiple loci, time variation of the selection coefficient and the population size, and allelic dominance in a diploid population (33).

A formal analogy for the system described by the evolution equation is a gas consisting of particles mixed with air and confined between two parallel walls (Fig. 3a). A value of the mutant frequency is analogous to a location between the walls, and the probability density is now the local gas density. The first term (Fig. 2a) describes the diffusion of the gas particles in the air, and the second and third terms combined describe the effect of directed force (an electric field, for example) acting on the gas particles in the presence of friction of the gas against the air. Another useful analogy is gel electrophoresis. The electrical force acting on polymer molecules and the friction against the gel matrix together create directed motion, which segregates the molecules into bands. Molecular diffusion leads to increasing bandwidths. Although the physics of the gel or gas system has nothing to do with viruses or evolution, the formal mathematical analogy between the two systems, as we shall see below, turns out to be very useful.

Fig. 3.

Boundary conditions: properties of almost monomorphic populations.

In the real world, the mutant frequency cannot be less than 0 or greater than 1, yet the master equation has no such restriction. Thus, the stochastic equation in Fig. 2a (and equations 1 and 2) is incomplete without describing what happens near the ends of the allowed interval for the mutant frequencies, 0 and 1. The analysis shown in Fig. 2 is for the case where there is a large number of minority allele copies (that is, f is not near 0 or 1) and treats the mutant frequency (f) as a continuous variable. In many important cases, one also needs to describe the evolution of a population with only a few copies of the minority variant. The boundary conditions where f is near 0 and 1 have to be derived independently from the virus population model described in “Virus population model” above. The derivation given in the mathematical section of this review shows that the conditions differ depending on the interval of population size, as follows.

The boundary conditions can be conveniently expressed in terms of the probability density flux (q), which is exactly analogous to the flux of gas particles through unit area per unit time (Fig. 3). In very large virus populations (Fig. 3b), the boundary conditions state that the flux must vanish at the “walls” corresponding to two monomorphic states, i.e., 100% mutant or 100% wild type (equation 3). In small populations (Fig. 3c), the flux is not zero (equations 5 and 6). This is because the probability of finding the virus population in a completely monomorphic state is finite and can increase or decrease in time. In the gas analogy, in the first case (Fig. 3b) gas molecules bounce off the hot walls and in the second case (Fig. 3c) the walls are cold and gas forms a condensate which can decrease or increase with time. Figuratively speaking, the probability density, just like the gas condensing in or evaporating from the liquid on a wall, can “condense” in or “evaporate” from a monomorphic state.

The real, biological interpretation of the different sets of boundary conditions is as follows. In very large virus populations (which, as we shall see, roughly correspond to almost deterministic evolution), a purely monomorphic state is unlikely: mutations destroy it very quickly. In a small population, mutations are rare and the monomorphic state can occur with a finite probability. This argument also shows that mutations affect virus evolution in a different way depending on the number of infected cells. In a large population, mutations may be important even in a very polymorphic state (e.g., if selection is small). In small populations, the role of mutations is to create a copy of the new allele in an otherwise monomorphic population; once a copy is created, mutations can be neglected until the population becomes monomorphic again. Typically, as we discuss below in the section on steady state, a new allele is lost due to random drift and repeated introduction of mutations will be needed to restore diversity.

Experiments on Evolution and Observable Parameters

In this section, we describe a few gedanken experiments on genetic evolution important for virological applications and introduce quantitative parameters suitable for experimental comparison.

To make use of the evolution equation with boundary conditions (see “Description of the model and the evolution equation” above), one needs to know the state of the system or its statistics at the initial moment of time. The initial condition depends on a particular experimental or natural setup. Virological experiments, relevant for both in vivo and in vitro situations, are as follows.

(i) Accumulation of deleterious mutants (initial condition: a pure wild-type population, i.e., f = 0).

(ii) Reversion of a deleterious mutation (initial condition: a pure mutant population, i.e., f = 1).

(iii) Growth competition (initial composition: a 50%-50% population [f = 0.5] or any other strongly polymorphic mixture).

(iv) Gene fixation (this experiment, which has received a lot of attention in population biology [19, 24, 34, 38, 80] and which is very useful for understanding other stochastic experiments, is defined only in small populations in which the total mutation rate per population, μN, is much less than 1; suppose that a single advantageous allele is introduced into an otherwise monomorphic population [f = 1/N]; the allele will have one of two fates: either it will be lost due to random drift [Fig. 1a] or it will spread to the entire population, i.e., become “fixed”; the questions are: what is the fixation probability, and, if the allele is fixed and does not become extinct, how much time will it take, counting from the moment it appeared? One can also ask a more general question: what is the probability that a new allele grows into a subpopulation of a given size before it becomes extinct?).

(v) Steady state. Whatever the initial condition, after a sufficient time, the system passes to the stochastic steady state, in which the probability density no longer depends on time; we consider this relatively simple case separately.

(vi) Genetic divergence. One splits a steady-state population into two isolated parts. Initially, both populations have a random but identical genetic composition, from which they independently diverge. As time goes on, their respective random compositions correlate less and less. The question is, what is the characteristic time at which the loss of correlation occurs?

(vii) Genetic turnover. This experiment studies the average timescale associated with random fluctuations of the mutant frequency in the steady state.
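Of the experiments listed above, gene fixation (iv) is the easiest to mimic on a computer. The sketch below is an illustration under stated assumptions: it tracks the frequency x of the advantageous allele (starting from one copy, x = 1/N), neglects mutation, assumes the competing allele produces (1 − s) times as many infectious virions, and uses hypothetical values N = 50 and s = 0.1. For comparison it evaluates the classic diffusion estimate for a haploid Wright-Fisher population, (1 − e^(−2s))/(1 − e^(−2Ns)), which is brought in here as an outside reference and is not quoted from the text.

```python
import math
import random


def fixation_trial(N, s, rng):
    """Follow one copy of the advantageous allele until loss (False) or fixation (True)."""
    k = 1
    while 0 < k < N:
        x = k / N
        # The competing allele yields (1 - s) times as many infectious virions.
        p = x / (1 - s * (1 - x))
        # Random sampling of the N virions that found the next generation.
        k = sum(1 for _ in range(N) if rng.random() < p)
    return k == N


def fixation_probability(N, s, trials, seed=2):
    """Fraction of independent trials in which the new allele takes over."""
    rng = random.Random(seed)
    return sum(fixation_trial(N, s, rng) for _ in range(trials)) / trials


def diffusion_estimate(N, s):
    """Assumed reference formula: haploid Wright-Fisher diffusion approximation."""
    return (1 - math.exp(-2 * s)) / (1 - math.exp(-2 * N * s))


N, s = 50, 0.1
p_sim = fixation_probability(N, s, trials=2000)
p_diff = diffusion_estimate(N, s)
print(p_sim, p_diff)
```

The two numbers should agree only approximately, since the diffusion estimate assumes s is much less than 1 and the simulated value carries sampling error from the finite number of trials. Most trials end with early loss of the new allele.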

The probability density (ρ) of the mutant frequency predicted by the stochastic equation is the main observable parameter. Unfortunately, to measure it directly, one would have to generate a histogram of mutant frequencies for a very large ensemble of populations. More amenable for experimental testing are the average (expectation) values (equation 36) and the standard deviations or variances (equation 37) of different stochastic parameters, which require a smaller number of populations to measure. Below we introduce some useful parameters whose statistics can be measured in the different experiments we outlined above. At the same time, their predicted statistics can be expressed via the probability density, as shown in the mathematical section of this review. In what follows, we assume that each parameter, for each given population, is measured with a high precision from a sufficiently large sample of sequences. The sampling effects will be discussed separately below.

The first parameter is the mutant frequency itself (f), which is self-explanatory. Its value can be compared directly with the experimental value, provided that the wild-type (best-fit) nucleotide is known.

The second is the intrapopulation genetic distance (T), defined as the proportion of sequence pairs (randomly sampled from the virus population) which differ at the base of interest. Although there are other ways to measure intrapopulation variability, we will use this definition, known in population biology as Nei's nucleotide diversity. It is equivalent to the standard definition of the genetic distance in virology as the average number of pairwise differences among randomly selected genomes, except that it applies to a single base rather than to a long genomic segment. By definition, T is calculated as 2f(1 − f) and varies between 0 (at f = 0 or 1) and 0.5 (at f = 0.5). The genetic distance is usually a more convenient measure of population diversity than the mutant frequency itself since it does not require knowledge of the wild-type sequence.

The third is the interpopulation genetic distance (T₁₂), which is defined in the same way as the intrapopulation genetic distance, except that the two sequences of each pair are sampled from two different populations (equation 40). The interpopulation distance is 0 when the two virus populations consist uniformly of the same genetic variant and 1 (100%) when the two virus populations are composed entirely of opposite genetic variants. The interpopulation distance, as one can show, cannot be smaller than the average of the two intrapopulation distances. Therefore, it is sometimes more convenient to consider instead the relative genetic distance between two populations (D), defined as the difference between the interpopulation distance and the average of the two intrapopulation distances [T₁₂ − (T₁ + T₂)/2]. This parameter (equation 41) varies between 0 (the two populations have an identical genetic composition) and 1 (one population is pure mutant, the other is pure wild type). There are alternative definitions of the relative distance (54). We find this definition intuitively clearer; also, its statistical moments (average, variance) are relatively easy to calculate.
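The three distance measures can be illustrated with a minimal Python sketch. The expression used for T₁₂ is an assumption read off the verbal definition (the proportion of cross-population pairs that differ); equation 40 itself is not reproduced in this section, and the function names are hypothetical. Under this assumption, D reduces algebraically to (f₁ − f₂)², which is why it is never negative.

```python
def intrapopulation_distance(f):
    """T = 2 f (1 - f): proportion of within-population pairs that differ."""
    return 2.0 * f * (1.0 - f)


def interpopulation_distance(f1, f2):
    """Assumed form of T12: a pair differs if exactly one of the two
    sequences (one drawn from each population) is mutant."""
    return f1 * (1.0 - f2) + f2 * (1.0 - f1)


def relative_distance(f1, f2):
    """D = T12 - (T1 + T2) / 2, as defined in the text."""
    t1 = intrapopulation_distance(f1)
    t2 = intrapopulation_distance(f2)
    return interpopulation_distance(f1, f2) - 0.5 * (t1 + t2)


# Example with hypothetical mutant frequencies 0.2 and 0.7
d = relative_distance(0.2, 0.7)
```

For identical compositions the sketch gives D = 0, and for one pure mutant and one pure wild-type population it gives D = 1, matching the limits quoted above.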

All the previous parameters can be measured at one time point, both for dynamic experiments (the first three experiments in the beginning) and in the steady state. Since all of them are, in general, stochastic, an average and a standard deviation have to be calculated for each. The next parameter is more complex: it requires measurement at two different times. We define it on average and for a steady-state population only.

The fourth parameter, the time correlation function of mutant frequency [K(t)], describes how quickly the system “forgets” the preceding random fluctuation of the mutant frequency (equation 45). The time correlation function usually has a maximum when the time difference is 0 and vanishes at large time differences. The characteristic time at which it decays by 50% (or, say, by a factor of e = 2.718…) from its maximum gives the timescale of random fluctuations. The form of this decay (e.g., exponential or negative power) may be a good fingerprint of a virus population model or, within a given model, of a particular population size.
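As a rough illustration of how such a function could be estimated from a single long record of mutant frequencies, the sketch below assumes that K(t) is the covariance of the frequency deviations at two times separated by t. This is an assumed reading of the verbal description, since equation 45 is not reproduced here, and the function names are hypothetical.

```python
def time_correlation(series, max_lag):
    """Estimate K(lag) = <df(t0) * df(t0 + lag)> from one time series,
    where df is the deviation of the mutant frequency from its mean."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    K = []
    for lag in range(max_lag + 1):
        pairs = n - lag  # number of time points usable at this lag
        K.append(sum(dev[t] * dev[t + lag] for t in range(pairs)) / pairs)
    return K


def half_decay_lag(K):
    """First lag at which K has dropped to half of K[0] or below
    (None if this never happens within the computed lags)."""
    for lag, value in enumerate(K):
        if value <= 0.5 * K[0]:
            return lag
    return None
```

The half-decay lag returned by the second function is one possible operational definition of the timescale of random fluctuations; the 1/e point could be used instead.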

In the mathematical section of this review, we calculate these parameters for different gedanken experiments and different intervals of population size. In this section of the review, we discuss these results qualitatively and illustrate them, when possible, with Monte Carlo simulations.

Steady State

In this section, we discuss properties of the steady-state, stochastic population in different intervals of the population size.

Neutral case: s ≪ μ.

Selection is of little significance when the selection coefficient is much less than the mutation rate. This case is probably of little practical significance for RNA viruses, with their tightly organized genomes. However, the transition between stochastic and deterministic behavior is easier to analyze when the selection factor can be neglected. Hence we start our discussion here.

The main fact of stochastic theory is that fluctuations of mutant frequency between statistically identical populations are large if populations are small (stochastic behavior) and small if populations are large (nearly deterministic behavior). In the language of the probability density (equation 52), the density is spread over a broad interval of f in small populations and is a narrow peak at very large population sizes. Transition between the two limits is controlled mostly by a single parameter μN, the product of the population size and the mutation rate. The composite parameter μN, which features extensively in population genetics (usually as Θ = 2Nμ), gives the total mutation rate for the entire population. For most RNA viruses, μN equals 1 when the number of infected cells is on the order of 10⁵ (i.e., less than the number in a small culture dish).

As the mutation rate per population increases, the probability density gradually changes its shape, as illustrated in Fig. 4 (80). This results from competition between random drift, which drives the system to one of the uniform states, and mutations, which diversify the system. At values of μN much smaller than 1 (an interval we accordingly call the drift regime in Table 1), random drift wins and the typical population is only weakly polymorphic. The probability density is, accordingly, U shaped, with a minimum at 50% composition. At the smallest values of μN (the condition is given in equation 5), the system is most likely to be in either of the purely monomorphic states, without a single opposite allele present (see “Description of the model and the evolution equations” above, where the boundary conditions are described). The total probability of any polymorphic state will be much less than 1 and on the order of μN. This estimate gives the frequency of segregating sites in a genome segment.

Fig. 4.

Table 1. Classification of regimes of genetic evolution

	Neutral limit (s ≪ μ)				In the presence of selection (s ≫ μ)
Regime	Population size	Behavior in steady state	Factors in steady state	Factors in diverse state	Population size	Behavior in steady state	Factors in steady state	Factors in diverse state
Drift	N ≪ 1/μ	Stochastic	Drift, mutations	Drift	N ≪ 1/s	Stochastic	Drift, mutations	Drift
Selection-drift					1/s ≪ N ≪ 1/μ	Stochastic	Drift, mutations, selection	Selection
Selection					N ≫ 1/μ	Deterministic	Mutations, selection	Selection
Mutation	N ≫ 1/μ	Deterministic	Mutations	Mutations

Let us move toward larger populations. As we increase the parameter μN, the U shape of the probability density flattens out (Fig. 4). The minimum at 50% composition becomes a maximum when μN is equal to 1/2. The probability density shrinks and becomes narrow as the population increases and μN becomes much larger than 1. This means that the mutant frequency is very close to the deterministic value of 1/2, owing to the balance between forward and reverse mutations. In Table 1, this limit of population sizes is denoted the mutation regime.
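The change of shape with μN can be reproduced qualitatively with a short simulation. The sketch below uses a standard haploid Wright-Fisher scheme with symmetric mutation and no selection as a stand-in; it is not necessarily identical to the virus population model of this review, and the function name, parameter values, and seed are illustrative assumptions.

```python
import numpy as np


def neutral_steady_state_sample(N, mu, generations, replicates, seed=0):
    """Sample mutant frequencies from many independent neutral populations.

    Each generation, the N genomes are redrawn binomially from the previous
    generation after symmetric mutation at rate mu (Wright-Fisher scheme).
    """
    rng = np.random.default_rng(seed)
    # Start every population in a random monomorphic state (0 or N mutants),
    # so the ensemble is already symmetric between wild type and mutant.
    k = rng.integers(0, 2, size=replicates) * N
    for _ in range(generations):
        f = k / N
        p = f * (1.0 - mu) + (1.0 - f) * mu  # frequency after mutation
        k = rng.binomial(N, p)               # random drift
    return k / N


# Drift regime (mu*N = 0.05) versus mutation regime (mu*N = 10)
f_drift = neutral_steady_state_sample(N=100, mu=5e-4, generations=500, replicates=2000)
f_mutation = neutral_steady_state_sample(N=100, mu=0.1, generations=500, replicates=2000)
```

A histogram of `f_drift` should be concentrated at the two ends of the interval (the U shape), whereas a histogram of `f_mutation` should be a single peak near 1/2.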

Case with selection: μ ≪ s ≪ 1.

The situation when the selection coefficient is less than 1 but still much larger than the mutation rate is more relevant for RNA viruses and more interesting theoretically. As in the neutral limit, the larger the population size the smaller the fluctuations.

The selection factor can be neglected only if a population is very small, much smaller than the inverse selection coefficient (Ns ≪ 1), a case that has the same properties as the above-described drift regime. At larger population sizes, selection is crucial and causes the probability density (equations 48 or 49 to 51) to be asymmetric in favor of a predominantly wild-type population.

In the limit of very large populations, when μN is much larger than 1 (termed the selection regime in Table 1), the probability density is narrow and localized near its deterministic value (equation 57). This value is given by the ratio of the mutation to the selection rate (μ/s), which we assumed to be small. At this value, mutations and selection against emerging mutants reach balance.

A result not sufficiently emphasized in the population biology literature is the existence of a wide interval in population size, between the inverse selection coefficient and the inverse mutation rate, which we term the selection-drift regime, in which all three factors of evolution are critical. Specifically, mutations produce diversity, selection restricts mutants to a low level, and random drift causes strong fluctuations between populations. The structure of the probability density in this regime is shown schematically in Fig. 5. It consists of three components. The large peak (delta function) situated at exactly zero mutant frequency means that a population is, most probably, purely wild type. The weak continuous exponential tail which decays at mutant frequencies on the order of 1/(Ns) ≪ 1 (80) means that the chance of a population being polymorphic is low and that if a population happens to be polymorphic, the proportion of mutants is small and quite random. A small peak at f = 1 becomes important only close to the lower border of the interval, when N is on the order of 1/s. The probability of finding any mutants (which is given by the total area under this curve) is low and proportional to μN (equations 49 to 51).
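The intervals of population size summarized in Table 1 can be expressed as a simple lookup. The sketch below is only a bookkeeping aid: it replaces the strong inequalities (≪, ≫) by plain comparisons, so labels near a boundary are nominal, because the actual transitions between regimes are gradual. The function name and the example values are hypothetical.

```python
def classify_regime(N, s, mu):
    """Return the Table 1 regime name for population size N,
    selection coefficient s, and mutation rate mu (per base)."""
    if s < mu:
        # Neutral limit: only drift and mutation matter.
        return "drift" if N * mu < 1.0 else "mutation"
    if N * s < 1.0:
        return "drift"            # N below the inverse selection coefficient
    if N * mu < 1.0:
        return "selection-drift"  # 1/s < N < 1/mu
    return "selection"            # N above the inverse mutation rate


# Hypothetical values: s = 0.1, mu = 1e-4
regimes = [classify_regime(N, 0.1, 1e-4) for N in (5, 1000, 10**6)]
```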

Fig. 5.

The selection-drift regime has rather interesting, even seemingly contradictory properties. On the one hand, the shape of the probability density suggests a very stochastic behavior. On the other hand, the average mutant frequency and the average genetic distance happen to coincide, over most of the regime, with their deterministic values, as if the population were much larger. Figure 6 shows the average values and the relative standard deviations for both parameters at all the population sizes. As expected, in the selection-drift regime the relative standard deviations for both the mutant frequency and the genetic distance are much larger than unity (Fig. 6b). At the same time, the average values (equation 59) are the same as in the selection regime (Fig. 6a). Notably, the fluctuations of the parameters are much stronger than could be expected from the Poisson statistics. This is a result of clonal amplification: if a single mutant appears in an otherwise wild-type population, it grows into a clone. In the sections on stochastic dynamics (see below), we will further clarify the structure of the steady state by presenting a Monte Carlo simulation of a stochastic dynamic evolution in a single population. Examples of the results of such simulations for each regime are shown in Fig. 6c.

Fig. 6.

Deterministic Dynamics and Its Boundaries

As we have shown above (see “Experiments on evolution and observable parameters”), the steady-state mutant frequency approaches its deterministic value when μN is much larger than 1. The purpose of this section, small but with a large mathematical counterpart, is to gain insight into the transition between stochasticity and determinism in the more complex case, in which parameters of the system depend on time.

Deterministic dynamics.

Deterministic and stochastic theories operate with different dynamic variables. The former considers the time dependence of the frequency of mutants, and the latter uses a more complex object, the time-dependent probability density of the mutant frequency. It is important to ensure that the two approaches converge to the same result in the limit of infinite population, when they are expected to describe deterministic evolution, albeit in a different way. For this purpose, in the mathematical section of this review we solve the dynamic stochastic equation (equation 1) for the case of large populations. The resulting probability density, as expected, is a very narrow peak located at the time-dependent mutant frequency (Fig. 7b), which satisfies the deterministic equation of evolution (equations 60 and 61).

Fig. 7.

The first term on the right-hand side of the deterministic equation (Fig. 7a) (equation 61) describes selection for the wild type, causing depletion of mutants. When one of two subpopulations (f or 1 − f) is very small, the first term becomes small, since if there is no diversity, there is no selection. The second term, describing mutations, does not vanish in a uniform population. Instead, the term vanishes at 50% composition when the effects of forward and reverse mutations cancel each other. Mutations drive the system toward 50% composition. The same evolution equation can be obtained directly from the deterministic first principles (equations 63 and 64).

The deterministic equation in Fig. 7a allows one to predict the genetic composition as a function of time for any initial condition set in an experiment (equation 62). Corresponding plots for the three cases matching the conditions of the accumulation, growth competition, and reversion experiments described above (see “Experiments on evolution and observable parameters”) are shown in Fig. 8. In all cases, after a characteristic time proportional to the inverse selection coefficient (1/s), the population approaches a steady state in which the mutant frequency saturates at a small value, the mutation rate over the selection coefficient (μ/s) (see “Steady state” above). Reversion is somewhat delayed compared to that in the two other experiments since the system first has to diversify slowly due to mutations and then still has to cross the entire interval of the mutant frequencies. Note that in both the accumulation and reversion experiments, the initial slope of the time dependence of the mutant frequency is shallow and is determined by the mutation rate (Fig. 8). Selection becomes important and causes the plots to curve after a growing subpopulation becomes sufficiently large.
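The three deterministic experiments can be sketched numerically. The equation integrated below, df/dt = −s f(1 − f) + μ(1 − 2f), is an assumed form chosen to match the verbal description of the two terms (a selection term that vanishes in a uniform population and a mutation term that vanishes at 50% composition); equation 61 itself is not reproduced in this section. The function name, step size, and parameter values are illustrative.

```python
def deterministic_trajectory(f0, s, mu, generations, dt=0.01):
    """Euler integration of df/dt = -s*f*(1 - f) + mu*(1 - 2*f).

    Returns the mutant frequency at generations 0, 1, ..., generations.
    """
    steps_per_generation = int(round(1.0 / dt))
    f = f0
    trajectory = [f]
    for _ in range(generations):
        for _ in range(steps_per_generation):
            selection = -s * f * (1.0 - f)  # depletes mutants
            mutation = mu * (1.0 - 2.0 * f)  # pushes toward 50% composition
            f += dt * (selection + mutation)
        trajectory.append(f)
    return trajectory


s, mu = 0.1, 1e-4  # hypothetical values with mu << s << 1
accumulation = deterministic_trajectory(0.0, s, mu, 500)  # starts pure wild type
competition = deterministic_trajectory(0.5, s, mu, 500)   # starts at 50% mutants
reversion = deterministic_trajectory(1.0, s, mu, 500)     # starts pure mutant
```

With these values, all three trajectories should end close to μ/s = 0.001, and the reversion curve should lag behind the other two, as described above.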

Fig. 8.

Boundaries of deterministic approximation.

Random drift, always present even in very large populations, causes the frequency of mutants to fluctuate around its deterministic value. As the population size decreases, the magnitude of fluctuations becomes comparable to the average frequency of the minority allele (either mutant or wild type), and the deterministic description breaks down. The corresponding condition on the population size varies significantly depending on the initial conditions of the experiment (equation 65). When the population starts from a monomorphic state (reversion or accumulation), the deterministic criterion is met when μN is much larger than unity. A population that is strongly diverse to start with, as in the growth competition experiment, is already deterministic at a much smaller population size in the selection-drift regime. (The criterion for diversity is that the mutant frequency must be higher than its characteristic “tail” at steady state [Fig. 5].) The reason for this difference is that a small polymorphism is influenced by rare and random mutation events while a strongly polymorphic population is controlled by selection alone.

Stochastic Dynamics: the Drift Regime

At the smallest population sizes, smaller than the inverse selection coefficient, as we found out when considering the steady state, selection can be neglected altogether. In this section, we consider the nonequilibrium dynamics in this regime. The problems of interest are those listed above (see “Experiments on evolution and observable parameters”): the decay of a strongly polymorphic state, gene fixation, transition from a monomorphic to the steady state, divergence of populations which have been separated, and the rate of genetic turnover in the steady state.

Decay of the polymorphic state and gene fixation.

We start our discussion from the population that is initially polymorphic, somewhere in the middle between 0 and 100%. As already discussed (see “Description of the model and the evolution equation”), mutations are not important in a polymorphic population, since they occur in the population with a frequency, μN, much less than 1 per generation. Therefore, random drift remains the only factor causing variation of the mutant frequency in time. As time passes, the mutant frequency drifts until the population accidentally ends up in either monomorphic state (cf. Fig. 1a). A representative random process is illustrated by computer simulation in Fig. 9b. The average time (the number of generations) it takes for a population to become monomorphic (i.e., for either variant to be fixed) is on the order of the population size (equations 81 and 82) (32, 80). The fixation time is quite random: its representative fluctuations are on the order of its average value. The same process can be understood in another way, from the time evolution of probability density. Figure 9a shows how the probability density, initially a narrow peak located, e.g., at 50% composition, gradually spreads out to the entire interval and then decays.
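The decay of a polymorphic state under pure drift can be illustrated with the following sketch. It assumes a standard haploid Wright-Fisher resampling scheme without mutation or selection, which may differ in numerical factors from the model of this review; the function name and parameter values are hypothetical.

```python
import numpy as np


def polymorphism_decay_times(N, f0, replicates, seed=0):
    """Number of generations until each replicate population becomes
    monomorphic (0 or N mutant copies), starting from mutant frequency f0."""
    rng = np.random.default_rng(seed)
    k = np.full(replicates, int(round(f0 * N)), dtype=np.int64)
    times = np.zeros(replicates, dtype=np.int64)
    active = (k > 0) & (k < N)  # populations that are still polymorphic
    t = 0
    while active.any():
        t += 1
        k[active] = rng.binomial(N, k[active] / N)  # one generation of drift
        newly_done = active & ((k == 0) | (k == N))
        times[newly_done] = t
        active &= ~newly_done
    return times


times = polymorphism_decay_times(N=100, f0=0.5, replicates=500)
mean_time = float(times.mean())
relative_spread = float(times.std() / times.mean())
```

In this scheme the mean decay time should come out on the order of N generations, with a spread comparable to the mean.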

Fig. 9.

The fact that, in a time not exceeding a few multiples of the population size, the population becomes uniform has general phylogenetic consequences. Let us divide arbitrarily a population into two groups of equal size and mark each group, say, by a different color. Then we divide each group (color) into two subgroups and mark them by two different shades. Then we divide each shade into two hues, and so on. If we continue the process of subdivision long enough, all individuals in the population will eventually have different tags. Consider now a group consisting of two subgroups. According to the above result, in a time not exceeding a few multiples of the group size, one of the two subgroups vanishes. Likewise, the surviving subgroup contains two smaller subgroups, one of which also becomes extinct in a time not exceeding a few multiples of the subgroup size, and so on. Therefore, in a time on the order of the total population size, the entire population will have the same tag, i.e., will comprise descendants from a single virus or organism. In other words, any two organisms in a population in the drift regime have a common ancestor at a past number of generations on the order of the population size. Phylogenetic methods of analyzing branching processes confirm this result, which is the basis of the coalescent method of estimating population size (39, 40, 65).

Related to the decay of polymorphism described above is gene fixation. Suppose that a single new allele is introduced into a monomorphic population at an initial moment. Eventually, after a number of replication steps, the allele will either disappear due to random drift (which is the most likely outcome) or spread to the entire population, i.e., become fixed. The questions are as follows. (i) What is the probability that the allele will get fixed? (ii) Given that the allele is lucky enough to become fixed, what is the average fixation time? As we show in the mathematical section of this review (equation 84 with f = 1), the fixation probability is the inverse of the population size (1/N) (34) and the fixation time is on the same order as the polymorphism decay time, i.e., on the order of the population size.

One can also ask more general questions. What is the probability that a single mutant genome will ever grow into a subpopulation with a given size? What is the average time spent on this growth? The results are analogous to those for full fixation, except that the subpopulation size substitutes for the total population size (equation 84). As we show in the beginning of the sections on stochastic dynamics in the mathematical section of this review, this result allows us to interpret, at a semiquantitative level, all the important results on stochastic dynamics.
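Both the fixation probability and its generalization to a subpopulation of a given size can be checked by simulation. The sketch below again assumes a neutral haploid Wright-Fisher scheme as a stand-in for the model of this review; the function name, the `target` parameter, and the numerical values are hypothetical.

```python
import numpy as np


def reach_probability(N, target, replicates, seed=0):
    """Fraction of replicates in which a single neutral mutant copy grows to
    at least `target` copies before going extinct (target = N is fixation)."""
    rng = np.random.default_rng(seed)
    k = np.ones(replicates, dtype=np.int64)  # one mutant copy per population
    active = np.ones(replicates, dtype=bool)
    while active.any():
        k[active] = rng.binomial(N, k[active] / N)  # one generation of drift
        active &= (k > 0) & (k < target)            # stop at loss or success
    return float(np.mean(k >= target))


p_fix = reach_probability(N=50, target=50, replicates=20000)    # compare with 1/50
p_clone = reach_probability(N=200, target=10, replicates=20000)  # compare with 1/10
```

In this scheme the estimate for full fixation should be close to 1/N. The estimate for a smaller target is expected to fall somewhat below 1/target, because the clone can overshoot the target size in a single generation.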

Transition from a monomorphic to a steady state.

We also consider here the accumulation of mutations starting from a purely monomorphic state, e.g., wild type (which one of the two does not matter, since selection is negligible). Eventually, mutants will be generated, one of them will become fixed (as described), and the system will switch to pure mutant. Then wild-type alleles will be generated, etc., and, in the long run, the population will be, statistically speaking, in dynamic steady state in which it switches back and forth between two monomorphic states. The system will gradually “forget” its initial state, so that the probabilities of the two monomorphic states will be equal and will be close to 1/2.

In the probability density language, this process can be described as shown in Fig. 10a. The initial peak of the probability density is very narrow and is localized at the zero mutant frequency. As time goes on, a tail of the probability density spreads into the interval between 0 and 100% mutants (equations 85 and 86) and a new peak at 100% mutants appears, reflecting a chance of early fixation of a mutant genome. The first peak decays and the second peak grows, until they become equal in the steady state (Fig. 4) (equation 87). In the gas system analogy (see “Experiments on evolution and observable parameters” above), all water is initially condensed on the left wall and then evaporates. The vapors diffuse into the container and condense again on the right wall (analogous to what happens in a freezer over time). The system reaches equilibrium when the amount of condensate on both walls is the same and there remains some gas in between.

Fig. 10.

In addition to the language of probability density, it is useful to visualize transition to the steady state directly, as a typical random process. If the probability density is analogous to the density of gas, the random dependence of the mutant frequency on time corresponds to the random trajectory of a separate gas particle. A representative Monte Carlo simulation of the equilibration process, together with the relevant timescales, is shown in Fig. 10b. The steady-state process looks like a telegraph signal between the two uniform states. The peaks in the mutant and wild-type frequencies correspond to alleles which were generated by mutations and started new subcolonies but failed to become fixed.

Two widely different timescales appear in both the representative random process and the evolution of the probability density. The typical waiting time for a switch from pure wild type to pure mutant or back is within an order of magnitude of the inverse mutation rate, 1/μ. This corresponds to the time in which the probability density becomes symmetric between the wild type and mutant (Fig. 4) (equations 86 and 87). The actual time spent on a successful switch is much shorter, within an order of magnitude of the population size, N. This corresponds to the time in which the tail of probability density is formed between 0 and 100% (equation 85). The two timescales can be derived either rigorously, from the evolution equation (equations 4 to 6), or approximately, from the gene fixation problem (equation 84). Both approaches are used in the mathematical section of this review. They agree with each other and with the simulation in Fig. 10b.

The total probability of a polymorphic state (the frequency of segregating sites in a genome) is, at any time, much less than 1 and on the order, roughly, of μN. This agrees with the result we obtained directly for the steady state (see above). Interestingly, this value is reached on a timescale of approximately N generations, i.e., much sooner than the two probabilities of monomorphic states equilibrate.

Divergence of populations which have been separated and the time correlation function.

The longer timescale, 1/μ, also appears in the time correlation function of mutant frequency, which characterizes the timescale of random fluctuation in the steady state and the divergence of populations which have been separated (see “Experiments on evolution and observable parameters” above). The value of the relative genetic distance, D, gradually changes from 0 to a constant value corresponding to statistically independent populations (equation 90). (Note that some other measures of interpopulation genetic distance used in population biology do not have an upper limit [54].) As it turns out, the time of this transition, the half time of the correlation function decay (equation 91), and the time in which the probability density becomes symmetric (above) are on the same order, the inverse mutation rate. Indeed, all three times are determined by the waiting time for a successful gene fixation.

Stochastic Dynamics: the Selection-Drift Regime

Here we consider nonequilibrium experiments in the most interesting interval of population sizes (Table 1). The relative role of selection and stochasticity in population dynamics, as derived from the evolution equation in the mathematical section of this review, depends on the initial genetic composition. The dynamics of growth competition is almost deterministic (see “Deterministic dynamics and its boundaries” above), so that this experiment need not be discussed again. In the accumulation experiment, the overall dynamics is stochastic, except for the average values of the mutant frequency and the intrapopulation distance, which are, remarkably, the same as in the corresponding deterministic conditions.

Accumulation.

As in the drift regime (see above), accumulation can be described as a spread of the peak of the probability density initially located at 0 (uniform wild type) into the interval between 0 and 1. However, unlike in the drift regime, the resulting steady state is not symmetric and consists mostly of a large peak at 0 (Fig. 5) (equation 48 or 49 to 51). The process of accumulation is reduced to generation of a small tail describing rarely occurring weakly polymorphic states (Fig. 5). As a result, the initial peak at 0 does not decay greatly and the steady state is reached in the same time as in deterministic selection (see “Deterministic dynamics and its boundaries” above) given by the inverse selection coefficient (1/s), i.e., faster than all timescales in the drift regime (equations 103 and 104).

The simulated stochastic dependence for this experiment is shown in Fig. 11. The process starts from the generation of a single allele, which tries to grow into a clone. The growth initially occurs under the condition that random drift is more important than selection. The maximum frequency that this clone can reach is determined by the characteristic mutant frequency at equilibrium, ∼1/(Ns), which corresponds to the clone size, 1/s copies (Fig. 5). Above this value, selection becomes the leading force and drift becomes a correction. Further growth of the deleterious clone cannot occur, and it soon becomes extinct. This appears as sparse peaks, the highest of which reach to the length of the “tail” of the probability density, 1/(Ns) (Fig. 5) (equation 48 or 50). The half-life of a mutant clone (width of a large peak) is the inverse selection coefficient. Note that the typical time interval between peaks, 1/(μNs), is longer than 1/s. The former time is the waiting time for a new allele that will be lucky to reach the size 1/s. The latter time is the time that the lucky clone actually spends growing and contracting before it becomes extinct again. The ratio of the two times, μN, gives the probability of finding the population in a polymorphic state (the area under the tail in Fig. 5). As in the drift regime, all these estimates can be obtained from both the evolution equation (equation 101) and the more intuitive gene fixation approach (equation 84). For comparison, simulation of an accumulation experiment in the “selection” regime (μN = 20) is shown in Fig. 12.
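A minimal ensemble simulation of the accumulation experiment in the selection-drift regime is sketched below. It assumes a haploid Wright-Fisher scheme in which mutants have relative fitness 1 − s and mutation is symmetric; this is a common textbook stand-in and not necessarily the exact model used for Fig. 11. The function name, parameter values (Ns = 30, μN = 0.03), and seed are illustrative assumptions.

```python
import numpy as np


def simulate_accumulation(N, s, mu, generations, replicates, seed=0):
    """Mutant frequencies in many populations that all start pure wild type."""
    rng = np.random.default_rng(seed)
    k = np.zeros(replicates, dtype=np.int64)  # mutant copies per population
    for _ in range(generations):
        f = k / N
        f_sel = f * (1.0 - s) / (1.0 - s * f)        # selection against mutants
        p = f_sel * (1.0 - mu) + (1.0 - f_sel) * mu  # symmetric mutation
        k = rng.binomial(N, p)                       # random drift
    return k / N


N, s, mu = 300, 0.1, 1e-4
f = simulate_accumulation(N, s, mu, generations=200, replicates=2000)
mean_f = float(f.mean())                     # compare with mu / s
relative_sd = float(f.std() / f.mean())      # expected to exceed 1
fraction_polymorphic = float(np.mean(f > 0)) # expected to be small
```

In such a run, the ensemble average should stay near the deterministic value μ/s even though most individual populations are purely wild type and the relative standard deviation exceeds unity.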

Fig. 11.

Fig. 12.

Divergence of separated populations and the time correlation function.

The characteristic times of divergence of separated populations (equation 105) and the decay time of the correlation function (equation 106) are on the order of the inverse selection coefficient, 1/s. Both experiments show for how long, on average, the system “remembers” its previous random fluctuation. The answer: for the half-life of a typical mutant clone, before it becomes extinct. This is because separate clones appear, due to mutation, at independent random times.

Reversion (fixation of an advantageous variant).

A reversion experiment, in which the initial population is uniformly mutant, behaves rather differently. Although the same scales for time and the minority allele frequency appear in this case, they have a different meaning. As in accumulation, random drift and selection dominate in smaller and larger wild-type colonies, respectively. However, in this case, selection accelerates rather than hinders the growth of a new clone. The probability that a single wild-type allele will manage to grow to a size equal to the inverse selection coefficient, 1/s, is low, on the order of s. However, above this critical size, the rest of its growth will be carried out by selection in a deterministic manner, i.e., with a probability close to 1 and over the deterministic timescale, 1/s (see “Deterministic dynamics and its boundaries” above). Hence, the bottleneck of reversion is in reaching the critical size despite random drift; after that, a clone is likely to be fixed in the population. Stochastic dynamics below the critical size is the same as in the accumulation experiment (selection is not important). The average waiting time for reversion to start is determined by the fixation probability, s, and by the frequency at which single alleles are generated in a population at each generation, μN, which gives the time ∼1/(μNs), i.e., the same scale as the waiting time for a high peak in the accumulation experiment (Fig. 11) (equation 107) (51). A few examples of reversion curves are shown in Fig. 13. Evolution of the probability density is shown in Fig. 14, including evolution of the density of polymorphic states (Fig. 14a) (equation 108) and of the two probabilities of monomorphic states (Fig. 14b) (equation 107).
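The order-of-magnitude scales quoted for the selection-drift regime can be collected in a small helper. The sketch below only restates the estimates given in the text (numerical prefactors are omitted, as in the text); the function name, dictionary keys, and example values are hypothetical.

```python
def selection_drift_scales(N, s, mu):
    """Order-of-magnitude scales in the selection-drift regime (1/s << N << 1/mu)."""
    return {
        # Clone size above which selection dominates over drift
        "critical_clone_size": 1.0 / s,
        # Corresponding mutant frequency (length of the "tail" of the density)
        "tail_frequency": 1.0 / (N * s),
        # Half-life of a clone, and the deterministic timescale
        "clone_lifetime": 1.0 / s,
        # Waiting time for a clone that reaches the critical size
        "waiting_time": 1.0 / (mu * N * s),
        # Probability of finding the population polymorphic
        "polymorphic_probability": mu * N,
    }


scales = selection_drift_scales(N=1000, s=0.1, mu=1e-4)
```

For these hypothetical values, the waiting time is about 100 generations, ten times longer than the clone lifetime of 10 generations; the ratio of the two, μN = 0.1, is the probability of a polymorphic state.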

Fig. 13.

Fig. 14.

Sampling Effects

In the previous sections, we analyzed random fluctuations of the mutant frequency within an ensemble of populations of infected cells of the same finite size. We have assumed that the value of mutant frequency, genetic distance, etc., for each population was measured accurately by counting the numbers of mutant and wild-type alleles in a sufficiently large sample of genomes. The genome samples used in real experiments are, of course, not infinitely large. Hence, the experimental estimate of any quantity is approximate and sample dependent. The sampling effects may distort the experimental results if the samples are too small. In this section, we calculate how large a sample of genomes we need to achieve a given accuracy of measurements. We focus on the intrapopulation genetic distance (T) defined above (see “Experiments on evolution and observable parameters”) for a separate nucleotide. To obtain an experimental estimate of the distance, one isolates a fixed number of sequences from the population, determines the number of nucleotide differences for each pair of sequences, and averages the result over all pairs. (This procedure is specific for our choice of the genetic distance.) The accuracy of the estimate is characterized by the relative error (ɛ), defined as the standard deviation divided by the average. The result is shown in Fig. 15a (equation 116). This formula is quite general and can be applied to any regime or particular experiment on genetic evolution. For instance, for the maximum possible intrapopulation distance, T = 0.5 (in absolute units), which corresponds to the half-and-half variant composition at the base of interest, 25% accuracy is approximately reached at a sample size of 6 genomes and 10% accuracy is predicted for a sample of 14 genomes. As the polymorphism decreases, the sample size required increases quickly. To reach, e.g., a 20% accuracy of measurement at the genetic distance T = 0.095 (0.95 and 0.05 composition at the base), one needs to sample ∼500 genomes (of which 25 ± 5 genomes will be mutant). Hence, to study rare genetic variants, it is undesirable to simply count sequences: one needs to employ alternative methods of quantitation like selective PCR. Of course, as is done often, the genetic distance can be averaged over a large number of bases; this reduces the sample size required. Such a simple solution will not work, however, if one does not know whether the bases used for averaging evolve under similar conditions or if one is interested in a specific base.
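The sample sizes quoted above can be reproduced approximately with a short calculation. The sketch below uses the standard variance of a pairwise-difference estimator (a U-statistic of order 2) for n genomes drawn independently from a population with mutant frequency f. This closed form is a textbook result assumed here for illustration; it is not necessarily written in the same form as equation 116, and the function name is hypothetical.

```python
import math


def pairwise_distance_relative_error(f, n):
    """Relative error (standard deviation / mean) of the genetic distance T
    estimated as the fraction of differing pairs among n sampled genomes."""
    T = 2.0 * f * (1.0 - f)                        # expected distance
    zeta1 = f * (1.0 - f) * (1.0 - 2.0 * f) ** 2   # variance shared by pairs with a common genome
    zeta2 = T * (1.0 - T)                          # variance of a single pair
    variance = 2.0 / (n * (n - 1)) * (2.0 * (n - 2) * zeta1 + zeta2)
    return math.sqrt(variance) / T


eps_6 = pairwise_distance_relative_error(0.5, 6)       # roughly 0.26
eps_14 = pairwise_distance_relative_error(0.5, 14)     # roughly 0.10
eps_500 = pairwise_distance_relative_error(0.05, 500)  # roughly 0.18
```

Under these assumptions the three values are close to the 25%, 10%, and 20% accuracies cited in the text for samples of 6, 14, and about 500 genomes.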

Fig. 15.

Experimental design requires making an educated prediction of the appropriate sample size and measurement methods, and one therefore needs to anticipate the intrapopulation genetic distance, at least to within an order of magnitude. At the same time, the actual value of the distance fluctuates between populations and is not known before the measurement is made. Therefore, one has to use some sort of theoretically predicted typical distance. Making such a prediction is not trivial. The expectation value is not a good choice, since, deep into the stochastic regime, a population is most probably found in a monomorphic state at any given allele. The sample size has to be optimized with respect to polymorphic states. These states have a low probability: if a population is completely uniform at a site, any size sample will be monomorphic as well. We propose to use the representative average distance (T_rep), which differs from the standard average distance in that it is averaged over polymorphic states only. (Experimentally, this can be accomplished by examining many sites and focusing attention only on the few that are polymorphic.) Quantitative differences between the two averages can be rather large. The expressions for the representative average distance in the steady-state population, for all three intervals of the population size, are shown in Fig. 15b (equation 118). One can see that the smallest distance and therefore the largest samples required correspond to the deterministic limit and the smallest samples correspond to the drift regime. This advantage of stochastic regimes is, however, canceled by a large number of different populations (or, at least, similar sites) needed for a representative assessment of polymorphism (roughly 1/μN populations or sites). 
In the steady state, assaying many populations or sites can be traded for sampling the same population or site at many time points spaced farther than the genetic turnover time (see the discussions of the time correlation function in the previous two sections).

Experimental Applications

The theoretical considerations presented here have useful and important implications for understanding the evolution not only of viruses but also of organisms generally. Their practical application, however, requires the use of appropriate experiments, designed both to test the validity of the theory and to then apply it to specific situations. In the following subsections, we present three examples of such applications. Other important experimental issues which are outside of the scope of our basic analytic review (phylogenetic studies, multilocus effects, etc.) will be briefly discussed in the next section.

Virological studies in vitro.

Viruses replicating in cell cultures offer a convenient experimental model for studying evolutionary processes. Compared to more traditional genetic models (fruit flies and bacteria), the advantages are a relatively easy control and sampling of genotypes and of external conditions, short generation times, and, especially, high mutation rates (for RNA viruses). Application of the results presented in this paper and testing of their validity to these systems are rather straightforward. We list recommendations for two kinds of experiments.

Experiments on growth competition, aimed at comparing the fitness of two chosen genetic variants, are common in the virological literature (54-56, 58-60). They are typically carried out by mixing a majority of mutant virions with wild-type virus and monitoring the change in proportion as a function of repeated passage in permissive cells. The selective advantage (s) can then be estimated from the slope of the curve relating the mutant frequency, f, to the number of generations (Fig. 8). Note that the slope of the curve where it crosses 50% is independent of whether the experiment is carried out in the selection or the selection-drift regime. New spontaneous mutations may arise, changing the virus fitness and distorting results. This problem can be avoided if the population of infected cells is chosen in the selection-drift interval (Table 1), 1/s ≪ N ≪ 1/μ. Then, on the one hand, competition dynamics is almost deterministic, until one of the two subpopulations becomes very small. On the other hand, the time in which a mutation (advantageous for the virus but unwanted by the experimentalist) will appear, 1/(μNs), is much longer than the measurement time, 1/s, required to resolve the two growth rates (assuming that all selection coefficients are on the same order and that the advantageous mutant allele is not present in the initial population, i.e., a single ex vivo clone).
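One possible way to extract s from such a competition curve is sketched below. It assumes the almost deterministic dynamics described above with negligible new mutation, so that the selection term of the evolution equation (equation 2) gives df/dt = −s f(1 − f) and ln[f/(1 − f)] declines linearly in time with slope −s. The function name `estimate_selection_coefficient` and the synthetic data are illustrative, not taken from any experiment.

```python
import math

def estimate_selection_coefficient(generations, mutant_freqs):
    """Least-squares slope of ln[f/(1-f)] versus generation number.

    Assumes deterministic competition with no new mutations, so the
    logit of the mutant frequency falls linearly with slope -s.
    All frequencies must lie strictly between 0 and 1.
    """
    logits = [math.log(f / (1.0 - f)) for f in mutant_freqs]
    n = len(logits)
    t_mean = sum(generations) / n
    y_mean = sum(logits) / n
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(generations, logits))
    sxx = sum((t - t_mean) ** 2 for t in generations)
    return -sxy / sxx

# Synthetic example: mutant starts at 90% and declines with s = 0.1.
true_s, f0 = 0.1, 0.9
times = list(range(0, 60, 5))
freqs = [1.0 / (1.0 + (1.0 - f0) / f0 * math.exp(true_s * t)) for t in times]
print(estimate_selection_coefficient(times, freqs))
```

Fitting the logit rather than the raw frequency uses all time points, not only those near 50%. Points at which one subpopulation has become very small should be excluded, since the dynamics there is no longer deterministic.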

In the opposite experiment, one starts from a monomorphic population and monitors how fast a new advantageous mutation appears and outgrows the old genetic variant. To shorten the waiting time (Fig. 13), the population size must be large, at least in the selection-drift regime. After a new colony exceeds the critical size imposed by the stochastic bottleneck (see “Stochastic dynamics: the drift regime” above), the dynamics is almost deterministic.

Based on our results, the evolutionary experiment is not a suitable way to measure the spontaneous mutation rate. For example, attempted measurements of changes in the mutation rate in bacteria due to changes in external growth conditions (adaptive mutation) (61) are difficult to interpret. First, the selection coefficient is affected by the change in external conditions as well, and this effect is likely to be more important than the change in the mutation rate. As one can show (second equation 62), in a deterministically large population, even a substantial change in the mutation rate causes only a slight shift in the reversion curve (Fig. 8). Only the selection coefficient can be reliably assessed in such an experiment. Second, such experiments depend on the details of the evolutionary model. Third, if the population is small (Fig. 13), the time dependencies of the mutant frequency will fluctuate between different cultures and the changes in the mutation rate cannot be detected due to statistical error.

HIV populations in vivo.

Data on the evolutionary behavior and genetic diversity of HIV, if understood in sufficient detail, could reveal vital information about major biological factors acting on the virus population in vivo. Of particular interest are the relative roles of stochastic factors and selective forces, the role of purifying selection versus selection for diversity, and possible variation of wild-type sequence between individuals and tissues.

One application of this model is to use HIV genetic variation as a tool to probe the underlying size and structure of the infected cell population. There has been considerable controversy in the literature about the effective population size of HIV in a representative untreated patient. The concept of the effective size was introduced by Wright as a means of referring the intensity of genetic drift in a real population to that in an ideal Wright-Fisher population (80); i.e., the effective size of a real population is the size an ideal population would have if it also had the same rate of genetic drift as that observed in the real population. Another, perhaps more intuitive, way of thinking about the effective size is to consider the inbreeding effective population size, defined as the inverse of the probability that two randomly selected individuals have a common ancestor in the previous generation. (This is conceptually close to the crude “virological” definition of the effective population size as the number of productively infected cells that produce most of the virions that infect the next generation of productively infected cells.) If this probability of identity by descent is low, it stands to reason that the population size must be large, and vice versa. One begins to see how the (inbreeding) effective population size influences the genetic diversity. If the effective size is small, the probability of identity by descent is high, and there is consequently low genetic diversity, since individuals tend to be closely related. Both definitions apply in the presence of weak selection as well, e.g., for the model system shown in Fig. 1. There are several other measures of effective size, e.g., the variance effective population size and the eigenvalue effective population size, but these rarely give different values. 
Of course, the usefulness of all these definitions depends on the hypothesis that the actual population is not too far, in the sense of its evolutionary properties, from an ideal Wright-Fisher model with suitable parameters.
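The inbreeding definition can be made concrete with a small calculation. The sketch below assumes that every infected cell of the next generation picks its parent cell independently, with probability proportional to the virion output of that parent; the list `outputs` and the function name are hypothetical and serve only to illustrate the definition.

```python
def inbreeding_effective_size(outputs):
    """Inverse probability that two randomly chosen offspring share a parent.

    `outputs` holds the relative virion output of each parent cell
    (an illustrative input; any positive weights will do).
    Assumes offspring choose parents independently and proportionally.
    """
    total = sum(outputs)
    p_same_parent = sum((w / total) ** 2 for w in outputs)
    return 1.0 / p_same_parent

# 100 cells with equal output: effective size equals the census size.
print(inbreeding_effective_size([1.0] * 100))

# 10 cells produce 90% of the virions: effective size drops to about 12.
skewed = [9.0] * 10 + [10.0 / 90.0] * 90
print(inbreeding_effective_size(skewed))
```

With equal outputs the result equals the number of cells, as in the ideal Wright-Fisher population. When a minority of cells produces most of the virions, the result is close to the size of that minority, which is the sense of the “virological” definition above.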

Assuming that selection is not important and the neutral model applies, a coalescence-based approach (see “Many loci and other aspects” below) has been used to estimate an effective size as small as 100 to 1,000 cells (45), much smaller than the total number of productively infected cells per patient, 10⁷ to 10⁹ (22). At least one other study reported similar values (66). However, other lines of evidence, including differences among rates of accumulation in different genes (74) and very high (44, 84) or very low (4, 5) ratios of synonymous to nonsynonymous mutations in some genes, imply that HIV populations are subject to significant selective influences. Therefore, population genetic methods that assume a lack of selection may yield erroneous results.

Two of us recently developed and applied a robust method to estimate the effective HIV population size in vivo based on the genetic variation at close pairs of highly diverse sites (67). As follows from the simulation examples above (Fig. 8, 9, and 13), a site cannot preserve a high diversity (f ≃ 0.5) indefinitely. Early in infection, the HIV population is almost uniform genetically or has a limited number of sequences, due to the bottleneck that occurs at transmission and to early competition between clones (12, 27, 46, 86). Therefore, highly diverse sites are sites that are caught in the act of “reversion” from mutant to wild type (i.e., of advantageous substitution). The basis for this test was to select two such sites, A(a) and B(b), where the lowercase and capital letters denote mutant and wild type, respectively, and then classify all sequences in the population into four groups (haplotypes): ab, Ab, aB, and AB. During reversion, the population starts from an almost uniform haplotype ab and arrives at an almost uniform haplotype AB. The two other haplotypes are transient. The idea of the test is that, deep in a stochastic regime and given a limited sample size, one of the four haplotype groups will be empty at any time, because the time at which reversion ensues is random. Suppose that the population is deep in the selection-drift regime. Two sites revert typically at different random times, even if their selection coefficients are equal (Fig. 13). Nearly simultaneous reversion can happen accidentally. In all cases, as can be shown, the number of well-represented subclones (i.e., the number found in a sample of the usual size [10 to 30 clones]) is typically two, rarely three, and much more rarely four, at any time point.
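The bookkeeping behind this test can be sketched as follows. This is only an illustration of the haplotype classification, not the published procedure (67): the function names, the toy sequences, and the rule that any base other than the wild-type base counts as mutant are assumptions made here.

```python
def haplotype_counts(sequences, site_a, site_b, wild_a, wild_b):
    """Classify aligned sequences into the haplotypes ab, Ab, aB, and AB.

    Capital letters denote wild type and lowercase letters denote mutant.
    Assumption: any base differing from the wild-type base is 'mutant'.
    """
    counts = {"ab": 0, "Ab": 0, "aB": 0, "AB": 0}
    for seq in sequences:
        first = "A" if seq[site_a] == wild_a else "a"
        second = "B" if seq[site_b] == wild_b else "b"
        counts[first + second] += 1
    return counts

def represented_haplotypes(counts, min_count=1):
    """Number of haplotype groups found at least min_count times."""
    return sum(1 for c in counts.values() if c >= min_count)

# Toy sample of two-base sequences; wild type is G at site 0 and A at site 1.
sample = ["GA", "GA", "TA", "TC"]
counts = haplotype_counts(sample, 0, 1, "G", "A")
print(counts, represented_haplotypes(counts))
```

In a real analysis the sample would contain the 10 to 30 clones mentioned above, and the quantity of interest is how often all four groups are represented for close pairs of highly diverse sites.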

Using sequence databases for HIV pro and env genes from drug-naive individuals (27, 44), we found that this effect is absent for close pairs of bases. We checked that this effect is not sensitive to variation of the initial genetic composition and some other factors assumed in the model and estimated the effect of recombination (derived from kinetic data) on the test to be numerically small as well. We therefore were able to conclude that a steady-state HIV population in an untreated individual, with respect to evolution of separate bases, is either in or at the border of the deterministic interval of population sizes.

Some authors have considered the possibility that an HIV population may consist of weakly connected small populations. Shedding viruses from these subpopulations into the peripheral blood could explain the presence of all four haplotypes in the above test, in apparent contrast to our conclusion. Indeed, HIV-infected cells are located within lymphoid tissue in visually distinct islands (64). However, different islands exchange virions and infected cells and may or may not be weakly connected genetically. The strong overlap of the island patterns obtained for different virus strains (64) proves that the island structure is due to nonuniform distribution of infectible cells rather than to random seeding by the virus. Next, estimates based on studies of the clearance rate of free virus from peripheral blood (85), on HIV RNA quantitation in the lymphoid tissue (22), and on the decay rate of infectious virus titer under highly active antiretroviral therapy (62) suggest that a considerable portion of virus particles produced in the tissue drain into the blood (within a few hours or less), from where they are removed within a few minutes. This implies that a good portion of virus particles infect cells far from the cells that produced them, suggesting strong virus transfer between the islands within the same tissue. On the other hand, viruses isolated from some locations (semen and the central nervous system) show phylogeny distinct from that of the main virus reservoir in the body. Genetic sampling from different islands could clear the issue (28).

Given a relatively weak role of stochastic effects detected, we decided to test which factors shape evolution in the HIV protease (pro) gene (68). Using the same database of pro sequences, we observed that variation was restricted to rare bases: an average base was variable in about 16% of patients. The intrapatient distance per individual variable site, 27%, was similar for synonymous and nonsynonymous sites, although synonymous variable sites were twice as abundant, implying that purifying selection is the dominant kind of selection. We explained these facts within the one-locus model of evolution by assuming deterministic evolution within individuals and random sampling during the transmission between individuals. We considered different variants of the model with transmission of one and several genomes and with coinfection from independent sources. The model explained the variable sites as slightly deleterious mutants that are slowly being replaced with the better-fit variant during individual infection. In the case of a single-source transmission, genetic bottlenecks at the moment of transmission effectively suppress selection, allowing mutants to accumulate along the transmission chain to the high levels observed. However, we found that even very rare coinfections from independent sources are able to counteract the bottleneck effect and keep mutants at low levels. If such coinfections occur, the plausible explanation of the high level of mutants in an inoculum is variation of the best-fit sequence between individuals due to variation in the specific immune response, combined with coselection. In this model, variation in pro is due to a cascade of mutations compensating for early antigenic escape mutations. Note that our analysis was restricted to the single-locus approximation (see below).

General applications.

The progress of evolution, in general, may be limited by the time it takes for a new advantageous (in our notation, wild-type) allele to appear and become fixed in the population. Figure 16 shows schematically the time required to reach a 50% composition as a function of the population size. An interesting conclusion about the relative role of selection and randomness follows from this diagram. The reversion half time depends on the population size. The shortest time, given mostly by the inverse selection coefficient, is reached in the deterministic limit, μN ≫ 1. This implies population sizes larger than 10⁸ to 10⁹ genomes for DNA viruses and bacteria and 10¹¹ to 10¹² for higher organisms. Such a population size exceeds the total size of any species higher on the evolutionary scale than insects. On the other hand, a mutation in a small population with a size smaller than the inverse selection coefficient (drift regime, in our terms) would be fixed only after a number of generations given by the inverse mutation rate, 1/μ, corresponding to timescales of planetary development. Put another way, nucleotides with a very small selective advantage compared to the inverse population size evolve very slowly. The characteristic values of selective advantage for the most important mutations (in an evolutionary sense) are unknown. Still, the above considerations suggest the possibility that the evolution of higher organisms may be driven, mostly, by nucleotides with s larger than the inverse population size (provided that they are sufficiently frequent in the genome), i.e., within the selection-drift interval (Table 1). Then the rate of evolution does not have to be unreasonably low and the size of a population does not have to be unreasonably large. If this is the case, one can conclude that the two factors, random drift and selection, are equally important for the rate of evolution on very large timescales (millions of years) and that neither is a small correction. 
The reversion (fixation) half time within the interval 1/s < N < 1/μ is ∼1/(μNs).
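The three order-of-magnitude estimates of the reversion half time quoted in this subsection can be collected in a short sketch. The strong inequalities (≪, ≫) of the text are replaced here by sharp boundaries at N = 1/s and N = 1/μ, and all numerical and logarithmic factors are dropped; both simplifications, as well as the function names, are assumptions of this illustration.

```python
def classify_regime(N, s, mu):
    """Interval of population size: 'drift', 'selection-drift', or 'selection'.

    Sharp boundaries at N = 1/s and N = 1/mu stand in for the
    strong inequalities used in the text; requires mu < s.
    """
    if not mu < s:
        raise ValueError("the three-interval picture assumes mu < s")
    if N < 1.0 / s:
        return "drift"
    if N < 1.0 / mu:
        return "selection-drift"
    return "selection"

def reversion_half_time(N, s, mu):
    """Order-of-magnitude reversion half time in generations."""
    regime = classify_regime(N, s, mu)
    if regime == "drift":
        return 1.0 / mu            # waiting time set by the mutation rate
    if regime == "selection-drift":
        return 1.0 / (mu * N * s)  # waiting time for a successful mutant
    return 1.0 / s                 # deterministic limit

for N in (10, 1e4, 1e7):
    print(N, classify_regime(N, 0.01, 1e-5), reversion_half_time(N, 0.01, 1e-5))
```

The three expressions match at the boundaries: 1/(μNs) equals 1/μ at N = 1/s and equals 1/s at N = 1/μ, so the estimated half time decreases continuously with the population size between the two limits.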

Fig. 16.

Many Loci and Other Aspects

The distinction between deterministic and stochastic evolutionary genetic processes is critical: under a deterministic regime, the fate of a novel mutation or allele can be known with certainty; under a stochastic regime, we are able to characterize only the statistical properties of allele frequencies and fixation times. The neutral and the deterministic cases represent the ends of a spectrum, and precisely where a population sits depends most strongly on its effective size. However, much of the theory relating effective population size and the stochastic-deterministic continuum has been worked out for simple models involving two (as in the present review) or, at most, a handful of alleles (37). These developments have allowed population geneticists to characterize the qualitative behavior of genes in populations undergoing a variety of dynamic processes, e.g., population size change, subdivision, and selection. Nonetheless, the assumptions and restrictions of these simple models typically preclude their use as descriptors of real populations, except in some particular cases (below).

In 1982, Kingman (39, 40) introduced a new way of studying the stochastic behavior of genes in a population. His framework, the coalescent, characterizes the genealogies of genes (or gene fragments), specifically the statistical distribution of times to common ancestors. It assumes the neutral model of evolution so that phylogenetic branching and accumulation of mutations are independent processes. Just as we estimated above (see “Decay of the polymorphic state and gene fixation”), the average time to the most recent common ancestor of a sample of genes is, within a numerical factor which depends on the sample size, the effective population size. The coalescent incorporates the same information as allele-based methods but in a form more relevant for today's evolutionary geneticist. This is largely because the raw data of molecular evolutionary studies in the last 15 years have been the molecular sequences, and there is an extensive and well-developed literature on molecular phylogenetic reconstruction (reference 65 and references therein). Studies on the coalescent have also shown that there is an increase in the power of parameter estimation as more independent loci are analyzed. Selection (for a long time the thorn in the side of the coalescent theorist) can, in principle, be accommodated within a coalescent-like framework. The recently developed inclusion of selection in a genealogical framework (42, 55) requires the construction of an ancestral selection graph, akin to a coalescent genealogy, which has the unusual property of coalescing and splitting as one moves back in time. The reader has to keep in mind, though, that a mathematical theory on a network (the case with selection) is technically much harder to handle (with or without computer simulation) than a theory made for a tree topology such as the neutral coalescent. The tree theory, but not the network theory, can be reduced to a one-dimensional chain of equations. 
(Similar issues arise in physics, in theoretical studies of hopping transport of electrons [63, 72].) Still, this recent achievement should stimulate the development of novel methods that work with selection as well.
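The statement that the time to the most recent common ancestor equals the effective population size up to a sample-dependent factor can be illustrated with the standard neutral coalescent result for a haploid Wright-Fisher population. The formula used below is the textbook Kingman expression, not one derived in this review: while k lineages remain, the expected waiting time to the next coalescence is 2N/[k(k − 1)] generations. The function name is illustrative.

```python
def expected_tmrca(N, sample_size):
    """Expected time (generations) to the most recent common ancestor.

    Standard neutral coalescent for a haploid population of effective
    size N: sum of mean waiting times 2N / (k (k - 1)) for k lineages.
    """
    return sum(2.0 * N / (k * (k - 1)) for k in range(2, sample_size + 1))

for n in (2, 10, 30):
    print(n, expected_tmrca(1000, n) / 1000)  # numerical factor in units of N
```

The sum equals 2N(1 − 1/n), so the numerical factor grows from 1 for a pair of genomes toward 2 for large samples.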

In some cases, recombination or point mutations break linkage disequilibrium and make loci almost independent, so that the one-locus approximation applies directly. One such case is when strong recombination is present in the system and the variable loci are spread far apart (for HIV, the respective length is predicted to be around 100 nucleotides or longer, if superinfection protection is efficient [67]). If the recombination rate is low, Muller's ratchet (6, 13, 14, 17, 53) may operate. With Muller's ratchet, random drift can lead to the elimination of the fittest genomes from a population. Once a better-fit haplotype is accidentally lost from a recombination-free population, it cannot reappear, thus clicking the ratchet another notch. Successive “clicks” cause the population to become successively less fit, on average. Back mutations can, in principle, restore the disappearing better-fit haplotypes. For a long segment of genome of a replication-competent virus, they are expected to be much less likely than the forward mutations, since the frequency of deleterious substitutions is expected to be low. As follows from Monte Carlo simulation, which we hope to discuss elsewhere, back mutations can prevent Muller's ratchet only for sites with large selection coefficients, s = 0.1 to 0.2, and provided that the gene segment is short.

Another important consideration is that selective pressures on some parts of the genome must also have an effect elsewhere. Such “background” selection may explain the very small effective size of HIV obtained from the env gene diversity under the neutral approximation. Charlesworth and colleagues (7, 8, 57) have shown that if an unsequenced region of a genome is under selection and is linked to the region under study, depressed estimates of effective size are obtained. Potentially, then, selective pressures on gag or pol can influence diversity in env. However, linkage disequilibrium obviously has an effect beyond simply lowering the effective size, and it speaks to the issue of fitness at the unit of the individual virion. If there is linkage disequilibrium, does it really make sense to look for the effects of immune-driven selection only in env or gag, as many studies have done (3, 70, 83)? To what extent is the fitness advantage of mutations in env, for instance, balanced by the loss of fitness due to mutations in pol? Such “interaction” between loci in a small population may be caused by linkage alone and applies even to mutations that additively affect the fitness of a genome. To make things more complex, nonadditive compensatory mutations (epistasis) exist, due to actual biological interaction between loci, at both the nucleotide and protein levels. Moreover, in vesicular stomatitis virus systems, epistasis may be the factor counteracting the loss of fitness due to Muller's ratchet (16). Compensatory mutations, which become advantageous only after initial resistance mutations occur, have also been observed in HIV-infected patients treated with protease inhibitors (11). The development of molecular techniques that allow full-length HIV genomes to be sequenced (69) means that it is only a matter of time before genetic data become available to study the evolutionary processes of linkage disequilibrium, background selection, and compensation in infected individuals.

Conclusions

We have analyzed in depth a broad range of problems in evolutionary dynamics in the framework of a simple one-locus, two-allele population model, which includes three basic factors: random drift, point mutation, and selection. We found that (as long as the mutation rate is lower than the selection coefficient) the dynamic properties differ drastically in three wide intervals of the population size that we call the drift, selection-drift, and selection regimes. Transition between stochastic and deterministic behavior of genetic evolution occurs in the intermediate selection-drift regime, which is expected to be very wide in the population size, especially for DNA systems. In this regime, deterministic laws govern genetically highly polymorphic populations, and almost uniform populations evolve stochastically.

Estimates of typical population sizes and of the time in which new advantageous alleles appear and become fixed in the population suggest that higher organisms may evolve while in the selection-drift regime. If this is the case, the speed of evolution depends on three parameters: mutation rate, selective advantage, and population size. Hence, selection pressure and random drift, whose relative importance for evolution is often disputed in the literature, are equally important, although they act differently: selection promotes evolution, and random drift slows it down.

The theory provides recommendations for the size of the population in different bacteriological and virological experiments in vitro aimed at either comparing the fitness of different mutants or measuring the mutation rate. For HIV populations in vivo, theory based on the purifying selection alone predicts either a weak diversity or a very low genetic turnover rate. Experimental searches for rapidly varying bases can provide biological evidence for selection for diversity due to different environments, a changing immune response, changes in host cell populations with time, and other important aspects of HIV infection.

Naturally, with any research program that requires theory to be integrated with data, there is an inevitable tension between experimental biologists, who deal daily with the complexity of real biological systems, and theoretical biologists, who “simplify, simplify, simplify” in the name of tractability. In this work, our analysis has been limited to the simplest possible case: evolution of a single locus with only two alleles. Many important aspects of evolution, including the effects of multiple loci, recombination, coselection, and migration, were not considered. Nevertheless, in-depth consideration of this simple system has yielded a surprisingly rich set of results, which should be very useful for the design of experiments in evolution and for the interpretation of patterns of genetic variation in natural infection. In the future, however, we see greater reliance on the fusion of analytic and computational methods as a means of simulating the complexity of real populations. By tying these computer-intensive methods to well-characterized mathematical and statistical methods, one has the advantage of using standard inferential procedures without sacrificing too much in the way of realism. However, the old adage that one has to walk before one can run applies to population genetics as it does elsewhere, and understanding simple evolutionary models is perhaps the surest route to coming to grips with the complexity of virus evolution.

MATHEMATICAL RESULTS AND DERIVATIONS

Description of the Model and the Evolution Equation

In this section, we will derive the diffusion-type differential equation and complement it with the boundary conditions. First, we derive the discrete Markovian equation for the virus population model; second, we reduce it to the continuous diffusion equation; and third, we determine the boundary conditions for the diffusion equation, in different intervals of the population size. In other sections, we will solve the appropriate set of equations and boundary conditions for each interval of N and different initial conditions. Table 2 contains a list of the principal notation used in this section.

Table 2. Mathematical notation

Symbol	Definition
A, B, C, F	Undetermined constants or functions
D	Relative interpopulation distance per site
δ(x)	Dirac delta function of x
δ_ij	Kronecker symbol: 1 if i = j and 0 otherwise
f	Mutant frequency
G	Gene fixation probability
g	Continuous part of the probability density
K	Time correlation function
μ	Mutation rate per generation per site
M	Mean change in the mutant frequency per unit time
N	Population size (productively infected cell number)
n	Mutant genome number
P_n	Probability of having n mutant genomes
p₀	Probability of having a pure wild-type population
p₁	Probability of having a pure mutant population
q	Probability density flux
ρ	Probability density of the mutant frequency
s	Selection coefficient
T	Intrapatient genetic distance per site
t	Time (generation number)
V_x	Variance of x
x	Any parameter (for this table only)
$\bar{x}$	Expectation value of x
x_ss, x^ss	Value of x in steady state

Main results.

We show that the stochastic evolution of the virus population is described by one of two different sets of differential equations and boundary conditions, depending on the interval of the population size, N. A large population, μN ≫ 1/ln N, usually has many copies of both the wild-type and mutant genomes. The corresponding evolution equation and the boundary conditions have the form (32)

\frac{\partial ρ}{\partial t} = - \frac{\partial q}{\partial f}

(Equation 1)

q (f, t) = - \frac{1}{2 N} \frac{\partial}{\partial f} [f (1 - f) ρ] - sf (1 - f) ρ - μ (2 f - 1) ρ

(Equation 2)

q (f, t)_{f \to 0} = q (f, t)_{f \to 1} = 0

(Equation 3)

where f is the mutant frequency, ρ is the probability density, t is the generation number (time), and s is the selection coefficient. Equations 1 to 3 are valid under the conditions s ≪ 1, μ ≪ 1, and μN ≫ 1/ln N. In this case, t and f can be treated (approximately) as continuous variables. Effects of the three terms in the right-hand side of the evolution equation given by equations 1 and 2 are illustrated in Fig. 2.

A useful analogy between the probability density and the density of a gas between two walls is discussed in the qualitative section of this review (Fig. 3). In this analogy, equation 1 expresses the fact that gas particles do not appear or disappear but only travel from one location to another. The quantity q(f,t) in equation 2 is the “probability flux,” analogous to the gas flux density defined as the net number of particles crossing a plane at f from left to right per unit area per unit time. Thus, the evolution equation written in the form of equation 1 expresses the fact that the probability density is a locally conserved quantity, just like a gas density. The boundary conditions in equation 3 state that the probability flux vanishes at the boundaries of the allowed interval in f, similar to gas particles being prohibited from crossing the confining walls (Fig. 3b).
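As a consistency check on equations 1 to 3 (a sketch, not the derivation given elsewhere in this review), one can look for a time-independent solution. With zero flux at both walls, a steady state requires the flux to vanish at every f, and equation 2 can then be integrated once. The auxiliary function u(f) is introduced here for this illustration only.

```latex
% Steady state: \partial\rho/\partial t = 0 with q(0) = q(1) = 0 implies q(f) = 0.
% Define u(f) = f(1-f)\,\rho(f). Setting q = 0 in equation 2 gives
\frac{1}{2N}\frac{du}{df} = -s\,u - \mu\,\frac{2f-1}{f(1-f)}\,u
% Because (1-2f)/[f(1-f)] = \frac{d}{df}\ln[f(1-f)], one integration yields
\ln u = -2Nsf + 2\mu N \ln[f(1-f)] + \mathrm{const}
% and therefore, up to normalization,
\rho_{ss}(f) \propto [f(1-f)]^{2\mu N - 1}\, e^{-2Nsf}
```

The exponent 2μN − 1 changes sign at μN = 1/2: for smaller μN the density piles up near f = 0 and f = 1, and for larger μN it vanishes at the walls, in line with the qualitative distinction between weakly and highly diverse populations.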

In small populations, where N ≪ 1/[μ ln(1/μ)] (see below), the population can be found, with a finite probability, in a purely monomorphic state, f = 0 or 1 (similar to condensation of gas at cold walls). In this interval, we break up the total probability density into a sum of the continuous probability density and of two singular terms, as given by

ρ (f, t) = p_{0} (t) δ (f) + p_{1} (t) δ (1 - f) + g (f, t)

(Equation 4)

where p₀ and p₁ are the probabilities of having pure wild-type and pure mutant populations, respectively; δ(f) denotes the Dirac delta function; and g(f,t), where f(1 − f) ≫ 1/N, is the continuous part of the probability density. The boundary conditions are

\frac{d p_{0}}{dt} = - q (0, t), \quad \frac{d p_{1}}{dt} = q (1, t), \quad N ≪ 1 / [μ \ln (1 / μ)]

(Equation 5)

2 μ N p_{0} = [f g (f)]_{f \to 0}, \quad 2 μ N p_{1} = [(1 - f) g (f)]_{f \to 1}

(Equation 6)

The first pair of conditions (equations 5) describes the accumulation or depletion of probability at the two boundaries, analogous to condensation or evaporation of gas (Fig. 3c). The second pair (equations 6) reflects the fact that transition between a monomorphic state, f = 0 or 1, and the closest polymorphic state, f = 1/N or (N − 1)/N, respectively, can occur due to a mutation only. This pair of equations has to be derived from first principles, i.e., from the discrete virus population model (below). The differential equation for the continuous part of the probability density, g(f,t), has the form

\frac{\partial ρ}{\partial t} = - \frac{\partial q}{\partial f}, q (f, t) = - \frac{1}{2 N} \frac{\partial}{\partial f} [f (1 - f) ρ] - sf (1 - f) ρ

(Equation 7)

which differs from the expressions at large N (equations 1 and 2) in that the term with μ in equation 2 is absent. The mutation rate enters the problem through the boundary conditions (equation 6) only.

One can easily obtain the upper bound for N, within which the above boundary conditions apply, from the boundary conditions themselves. The probability of a polymorphic state is given by ∫₀¹g(f)df. As follows from equation 6, near the boundaries g(f) diverges and is given by g(f) ∼ 2μN p₀/f and g(f) ∼ 2μN p₁/(1 − f). The integral of g(f) comes mostly from f close to 0 or 1 and is truncated at f(1 − f) ∼ 1/N. The resulting probability of a polymorphic state is comparable to the probability of a monomorphic state, p₀ + p₁, at μN ln N ≃ 1, as we stated above.
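The estimate can be written out explicitly. The sketch below assumes, to logarithmic accuracy, that the two boundary asymptotics of g(f) can simply be added over the whole interval and that the integral is cut off at f = 1/N and f = 1 − 1/N; the symbol P_poly for the probability of a polymorphic state is introduced here for illustration.

```latex
% Interpolation (assumption): g(f) \approx 2\mu N\,[\,p_0/f + p_1/(1-f)\,]
P_{\mathrm{poly}} = \int_{1/N}^{1-1/N} g(f)\,df
  \approx 2\mu N \left[\, p_0 \int_{1/N}^{1-1/N} \frac{df}{f}
        + p_1 \int_{1/N}^{1-1/N} \frac{df}{1-f} \right]
  = 2\mu N\,(p_0 + p_1)\,\ln(N-1)
% For N \gg 1 this gives
\frac{P_{\mathrm{poly}}}{p_0 + p_1} \approx 2\mu N \ln N
```

The ratio is of order unity when μN ln N is of order unity, which reproduces the upper bound on N up to a numerical factor.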

Note that the validity of these equations, as we discussed in the qualitative section of this review, is not restricted to the virus population model. The same diffusion equation (equations 1 and 2) applies to many other haploid one-locus, two-allele populations, which include the same three factors: random sampling of genomes, symmetric mutations, and purifying selection. Should some other factors come into consideration, such as allelic dominance in a diploid population or time fluctuations of selection coefficient or of other parameters, a more general equation of similar form can be written (see “stochastic equation of evolution” below) (33). In principle, the approach can be generalized to many alleles or multiple loci using partial derivatives in haplotype frequencies (37).
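A minimal simulation of one such population is sketched below. It is a generic haploid Wright-Fisher scheme, not the virus population model derived in the following subsections: the order of steps (selection, then symmetric mutation, then binomial resampling of N genomes), the function name, and the parameter values are all assumptions of this illustration. For s ≪ 1 and μ ≪ 1, the mean change of f per generation is approximately −s f(1 − f) − μ(2f − 1) and its variance is approximately f(1 − f)/N, which are the terms entering equation 2.

```python
import random

def simulate_wright_fisher(N, s, mu, f0, generations, seed=0):
    """Trajectory of the mutant frequency in a haploid population of size N.

    Each generation (illustrative ordering of steps):
      1. selection: mutants have relative fitness 1 - s;
      2. symmetric mutation with rate mu per generation;
      3. random drift: N genomes are drawn independently.
    """
    rng = random.Random(seed)
    n = round(f0 * N)  # current number of mutant genomes
    trajectory = [n / N]
    for _ in range(generations):
        f = n / N
        f_sel = f * (1.0 - s) / (1.0 - s * f)
        f_mut = f_sel * (1.0 - mu) + (1.0 - f_sel) * mu
        # Binomial sampling by N independent draws (adequate for modest N).
        n = sum(1 for _ in range(N) if rng.random() < f_mut)
        trajectory.append(n / N)
    return trajectory

# Example: N = 200 with s = 0.05 and mu = 0.001, starting from f = 0.5.
path = simulate_wright_fisher(200, 0.05, 1e-3, 0.5, 100, seed=1)
print(path[0], path[-1])
```

Repeating the run with different seeds gives an ensemble of trajectories whose histogram at a fixed time plays the role of the probability density ρ(f, t).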

Now we proceed with derivations of all these formulas.