INTRODUCTION
Coronaviruses (family
Coronaviridae, subfamily
Coronavirinae) are important pathogens of birds and mammals. Coronaviruses are positive-sense RNA viruses and are currently classified into four genera:
Alphacoronavirus,
Betacoronavirus,
Gammacoronavirus, and
Deltacoronavirus (
1). Alphacoronaviruses and betacoronaviruses are found exclusively in mammals, whereas gammacoronaviruses and deltacoronaviruses primarily infect birds. The identification of severe acute respiratory syndrome (SARS) coronavirus in 2003 (
2) prompted an intensive search for novel coronaviruses, resulting in the detection of a number of novel coronaviruses in humans, domesticated animals, and wildlife (
3–7). Interestingly, surveillance of coronaviruses in wild animals has led to the discovery of the greatest diversity of coronaviruses in bat and avian species, which suggests that these animals are the natural reservoirs of the viruses (
8–10). Indeed, phylogenetic studies of bat and avian coronaviruses suggest an ancient relationship with possible codivergence and coevolution with their hosts. Conversely, many coronaviruses found in bats and other mammals diverged near the tips of coronavirus phylogeny, suggesting that these viruses were the result of recent cross-species transmission events (
9,
11).
Molecular clock analysis based on the RNA-dependent RNA polymerase (RdRp) genomic region suggests a time of most recent common ancestor (tMRCA) for the four coronavirus genera of around 10,100 years ago, with a mean rate of 1.3 × 10
−4 nucleotide substitutions per site per year (
10). This tMRCA estimate is difficult to reconcile with a hypothetical ancient, coevolutionary relationship between coronaviruses and their bat or bird hosts (
12,
13). Moreover, a group of genetically related, yet distinct, alphacoronaviruses have been detected in different mouse-eared bats (
Myotis spp.) on multiple continents. However, these bat species do not migrate long distances, with few traveling farther than hundreds of miles to overwinter sites (Gary F. McCracken, personal communication). Yet, the tMRCA of
Alphacoronavirus is estimated to be around 200 or 4,400 years ago, on the basis of analyses of helicase (
9) and RdRp (
10), respectively. The limited interaction among these bat populations suggests a more ancient evolutionary association with alphacoronaviruses (i.e., codivergence or coevolution), which is incompatible with the relatively young viral tMRCAs. Notably, coronaviruses have a unique proofreading mechanism for viral RNA replication (
3,
14); because of the exoribonuclease activity of viral nonstructural protein 14 (Nsp14), the mutation rate of coronaviruses has been found to be similar to that of single-stranded DNA viruses (∼1 × 10
−5 to 1 × 10
−6 mutation per site per replication) and well below those measured in other RNA viruses (∼1 × 10
−3 to 1 × 10
−5 mutation per site per replication) (
15,
16). For these reasons, we speculate that there is a substantial underestimation of the length of the natural evolutionary history of coronaviruses.
Recent developments in the modeling of the evolution of RNA viruses have revealed that purifying selection can mask the ancient age of viruses that appear to have recent origins, according to a molecular clock (e.g., measles, Ebola, and avian influenza viruses) (
17). Strong purifying selection can maintain evidence of sequence homology long after saturation has occurred at synonymous sites; this phenomenon can lead to underestimation of the overall depth of a viral phylogeny. In the absence of strong purifying selection, nucleotide sequences would diverge more quickly, lose detectable homology, and become difficult to align and compare. Here, we asked if similar evolutionary patterns led to underestimation of the tMRCA of the coronavirus lineage. We employed evolutionary models that account for variation in the pressure of natural selection across sites in viral loci and lineages in their phylogenies. Our results indicate that coronaviruses are orders of magnitude older than suggested by previous molecular clock analyses.
DISCUSSION
Our results indicate that the evolutionary history of coronaviruses likely extends much further back in time than previous estimates have suggested. Across the coronavirus genome, there is evidence that standard nucleotide models underestimate the amount of evolution that has occurred by orders of magnitude; strong purifying selection had masked the evidence of thousands or millions of years of evolution in the coronavirus phylogeny. Like many other DNA and RNA viruses—including herpesviruses (
35,
36), lentiviruses (
37), bornaviruses (
38), filoviruses (
38,
39), and foamy viruses (
40,
41)—coronaviruses appear to be an ancient viral lineage. Furthermore, our results demonstrate that purifying selection masking an ancient evolutionary history is not a phenomenon constrained to negative-sense RNA viruses (
17) but can be seen in positive-sense RNA viruses like coronaviruses as well.
Interestingly, our extrapolated estimate of the tMRCA of coronaviruses infecting mammalian (alphacoronaviruses and betacoronaviruses) and avian (gammacoronaviruses and deltacoronaviruses) species of 190 to 489 (mean of 293) million years ago corresponds to the inferred tMRCA of these host species based on fossil and molecular evidence of around 325 million years ago (
42,
43). It is tempting to speculate that this correspondence between dates is evidence of coevolution and codivergence between bat and avian species and the coronavirus genera. However, given the uncertainty associated with branch length estimation under strong selection (
17) and the extreme saturation leading to a dramatic increase in the inferred tree length, our analyses may not provide an accurate estimation of the tMRCA of coronaviruses. Nevertheless, our results strongly suggest that the 10,000-year-ago tMRCA of coronaviruses is underestimated by orders of magnitude. These results leave open the possibility that coronaviruses have been infecting bats and/or birds since the origin of these clades tens of millions of years ago or possibly since their divergence from each other in the carboniferous period, over 300 million years ago. This extrapolation, rather than be considered a reliable estimate of the coronavirus tMRCA, should be viewed as a biologically plausible hypothesis based on realistic parameters (e.g., patterns of substitution rates and selection profiles). We can no longer reject an ancient coevolutionary relationship based on the molecular clock.
The degree to which host switching and codivergence have shaped coronavirus diversity remains unresolved. The observation of recent viral cross-species transmission events among mammalian (
9,
11) and avian (
8) species is clear evidence of recent host switching. Conversely, the separation of mammalian and avian coronaviruses into distinct genera is suggestive of codivergence: mammalian coronaviruses (i.e., alphacoronaviruses and betacoronaviruses) are generally inferred to be reciprocally monophyletic (
8,
10,
11). Therefore, a formal analysis of host switching and codivergence in coronaviruses would be useful for disentangling which sections of the coronavirus phylogeny represent codivergence and which represent host-switching events.
For the shorter branches (≤0.05 substitution per site) in the coronavirus phylogenetic trees, we found general agreement in the inferred branch lengths between evolutionary models (GTR + Γ
4 and BS-REL); there was no evidence of underestimation of the lengths of short branches. Therefore, for closely related viral lineages involved in recent zoonotic transmission events (e.g., SARS coronavirus in humans, bats, and civet cats), previous dating estimates (
44) are consistent with our findings. Furthermore, this observation suggests that substitution rate estimates inferred from these recent outbreaks are robust to the biasing effects of purifying selection with respect to branch length estimation (though coalescent effects may still have an impact on these recent evolutionary rate estimates [
45–47]). However, it remains unclear how the lower-than-average mutation rate of coronaviruses (10
−5 to 10
−6 mutation per site per replication) (
15) translates into a typical short-term substitution rate (10
−3 substitution per site per year) (
44). Further investigation on this topic is needed.
BS-REL is an attractive tool for future studies of ancient virus evolution. The use of evolutionary models that allow for variable selection pressure across all branches in a phylogeny when estimating branch lengths is an advance over previous implementations by Wertheim and Kosakovsky Pond (
17,
25), which necessitate
a priori identification of long internal branches bearing the mark of strong purifying selection (
30). Unlike the phylogenetic trees in this previous study, which were characterized by closely related isolates separated by long internal branches, the coronavirus phylogenies are complicated, with a mixture of long and short branches interspersed throughout the tree. The BS-REL framework represents a flexible and powerful approach to the modeling of natural selection in the evolution of viruses (
25), and it does not necessitate designating which branches experienced stronger or weaker selection pressures.
Moreover, like our previous approach to the analysis of Ebola and avian influenza viruses, BS-REL found evidence of branches experiencing saturation. Because of the way in which BS-REL parameterizes variable selection, the saturated branches were estimated to be longer in coronaviruses than in Ebola and avian influenza viruses.
The BS-REL model almost certainly overfits data on short branches with simple patterns of natural selection, which likely affects the accuracy of branch length estimates across the tree. In the case of coronaviruses, this overfitting is not a serious problem because the expansion of branch length estimates due to saturation is extreme and unlikely to produce precise estimates. A more parsimonious implementation of BS-REL would be useful for addressing issues in viral evolution in which more precise branch length estimates are needed. Appropriate modeling of variation in the strength of natural selection will be integral for obtaining more accurate tMRCAs of viral lineages. Furthermore, it is possible that employing more realistic evolutionary models, for example, in maximum-likelihood or Bayesian tree searches, could improve the quality of viral phylogenetic inference.
In summary, our results indicate that coronaviruses have an evolutionary history much longer than those suggested by phylogenetic trees inferred by using standard nucleotide evolution models. This finding allows for a coevolutionary relationship between coronaviruses and their natural hosts. It is possible that such a long-term relationship has allowed some animal species (e.g., bats) to evolve strategies to coexist with coronaviruses and vice versa (
48,
49). Further investigation of this topic might help us to better understand virus-host coevolution, the origin of coronaviruses, and other related viral families in the order
Nidovirales.