It has long been appreciated that retroviruses can contribute significantly to the genetic makeup of host organisms. Genes related to certain other viruses with single-stranded RNA genomes, formerly considered to be most unlikely candidates for such contribution, have recently been detected throughout the vertebrate phylogenetic tree (1, 6, 13). Here, we report that viruses with single-stranded DNA (ssDNA) genomes have also contributed to the genetic makeup of many organisms, stretching back as far as the Paleocene period and possibly the late Cretaceous period of evolution.
Determining the evolutionary ages of viruses can be problematic, as their mutation rates may be high and their replication may be rapid but also sporadic. To establish a lower age limit for currently circulating ssDNA viruses, we analyzed 49 published vertebrate genomic assemblies for the presence of sequences derived from the NCBI RefSeq database of 2,382 proteins from known viruses in this category, representing a total of 23 classified genera from 7 virus families. Our survey uncovered numerous high-confidence examples of endogenous sequences related to the Circoviridae and to two genera in the family Parvoviridae: the parvoviruses and dependoviruses (Fig. 1).
The Dependovirus and Parvovirus genomes are typically 4 to 6 kb in length, include 2 major open reading frames (encoding replicase proteins [Rep and NS1, respectively] and capsid proteins [Cap and VP1, respectively]), and have characteristic hairpin structures at both ends (Fig. 2). For replication, these viruses depend on host enzymes that are recruited by the viral replicase proteins to the hairpin regions, where self-primed viral DNA synthesis is initiated (2).
Circovirus genomes are typically ∼2-kb circles. DNA of the type species, porcine circovirus 1 (PCV-1), contains a stem-loop structure within the origin of replication (Fig. 2), and the largest open reading frame includes sequences that are homologous to the Parvovirus replicase open reading frame (9, 11). The circoviruses also depend on host enzymes for replication, and DNA synthesis is self-primed from a 3′-OH end formed by endonucleolytic cleavage of the stem-loop structure (4). The frequency of Dependovirus infection is estimated to be as high as 90% within an individual's lifetime. None of the dependoviruses have been associated with human disease, but related viruses in the family Parvoviridae (e.g., erythrovirus B19 and possibly human bocavirus) are pathogenic for humans, and members of both the Parvoviridae and the Circoviridae can cause a variety of animal diseases (2, 4).
For some of the ancestral endogenous sequences that we identified, phylogenetic comparisons can be used to estimate age. For example, as a Dependovirus-like sequence is present at the same location in the genomes of mice and rats, the ancestral virus must have existed before their divergence, more than 20 million years ago. Some Circovirus- and Dependovirus-related integrations also predate the split between dog and panda, about 42 million years ago. However, in most other cases, we rely on an indirect method for estimating age (1). As genomic sequences evolve, they accumulate new stop codons and insertion/deletion-induced frameshifts. The rates of these events can be tied directly to the rates of neutral sequence drift and, therefore, the time of evolution. To apply this method, we first performed a BLAST search of vertebrate genomes for all known ssDNA virus proteins (BLAST options, -p tblastn -M BLOSUM62 -e 1e-4). Candidate sequences were then recorded, along with 5 kb of flanking regions, and aligned again against the database of ssDNA viruses to find the most complete alignment (BLAST options, -p blastx -F F -w 15 -t 1500 -Z 150 -G 13 -E 1 -e 1e-2). Detected alignments were then compared with a neutral model of genome evolution, as described in the supplemental material, and the numbers of stop codons and frameshifts were converted into the expected genomic drift undergone by the sequences. The age of integration was then estimated from the known phylogeny of vertebrates (7, 10). Using these methods, we discovered that as many as 110 ssDNA virus-related sequences have been integrated into the 49 vertebrate genomes considered, during a time period ranging from the present to over 40 to 60 million years ago (Table 1; see also Tables S1 to S3 in the supplemental material).
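The idea of converting stop-codon counts into elapsed time can be illustrated with a deliberately simplified sketch. It is not the neutral model described in the supplemental material. It assumes that substitutions are uniform across sites and bases, that codon usage is uniform, that frameshifts are ignored, and that the endogenous copy drifts at the mammalian rate of 0.14% per million years quoted in this paper. The function names `nonsense_fraction` and `estimate_age_my` are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def nonsense_fraction():
    """Fraction of single-nucleotide changes in sense codons that create a stop.

    Assumption: every sense codon and every substitution is equally likely.
    """
    stops = total = 0
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # only mutate sense codons
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                total += 1
                stops += CODON_TABLE[mutant] == "*"
    return stops / total


def estimate_age_my(n_stops, n_codons, rate_per_site_per_my=0.0014):
    """Rough age (million years) implied by in-frame stop codons.

    Toy model: expected stops = n_codons * 3 * d * nonsense_fraction(),
    where d is the number of substitutions per site; age = d / rate.
    """
    if n_codons <= 0:
        raise ValueError("n_codons must be positive")
    subs_per_site = n_stops / (3 * n_codons * nonsense_fraction())
    return subs_per_site / rate_per_site_per_my
```

Because the estimate scales linearly with the stop count, short alignments with few stops give very coarse ages. A fuller treatment would also need to model frameshifts and the lineage-specific substitution rates.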
It is important to recognize that there is an intrinsic limit on how far back in time we can reach to identify ancient endogenous viral sequences. First, the sequences must be identified with confidence by BLAST or similar programs. This requirement places a lower limit on sequence identity of about 20 to 30% at the amino acid level, or about 75% at the nucleotide level (nucleotide sequences evolve nearly 2.5 times more slowly than the amino acid sequences they encode). Second, the related, present-day virus must have evolved at a rate that is not much higher than that of the endogenous sequences. The viruses for which ancestral endogenous sequences were identified in this study exhibit sequence drift similar to that associated with mammalian genomes. Setting this rate at 0.14% per million years of evolution (8), we arrive at 90 million years as the theoretical limit for the oldest sequences that can be identified using our methods. This limit drops to less than 35 million years for endogenous viral sequences in rodents and even lower for sequences related to viruses that evolve faster than mammalian genomes.
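As a rough consistency check (an illustrative reading, not a derivation given in the text), the 90-million-year figure can be recovered by assuming that divergence accumulates linearly and independently in both the endogenous copy and the circulating virus lineage, with multiple substitutions at the same site ignored. Here $p_{\min}$ is the minimum detectable nucleotide identity and $r$ is the drift rate per lineage. Both symbols are introduced only for this sketch.

```latex
t_{\max} \approx \frac{1 - p_{\min}}{2r}
         = \frac{1 - 0.75}{2 \times 0.0014\ \text{My}^{-1}}
         \approx 89\ \text{My}
```

Under the same assumptions, a higher rate $r$ in either lineage lowers $t_{\max}$ proportionally, which matches the shorter limit described for rodents and for faster-evolving viruses.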
The most widespread integrations found in our survey are derived from the dependoviruses. These include nearly complete genomes related to adeno-associated virus (AAV) in microbat, wallaby, dolphin, rabbit, mouse, and baboon (Fig. 2). We did not detect inverted terminal repeats in several integrations tested, even though repeats are common in the present-day dependoviruses. This result could be explained by sequence decay or the absence of such structures in the ancestral viruses. However, we do see sequences that resemble degraded hairpin structures to which Dependovirus Rep proteins bind, with an example from microbat integration mlEDLG-1 shown in Fig. 3. The second most widespread endogenous sequences are related to the parvoviruses. They are found in 6 of 49 vertebrate species considered, with nearly complete genomes in rat, opossum, wallaby, and guinea pig (Fig. 2).
The Dependovirus AAV2 has a strong bias for integration into human chromosome 19 during infection, driven by a host sequence that is recognized by the viral Rep protein(s). Rep mediates the formation of a synapse between viral and cellular sequences, and the cellular sequences are nicked to serve as an origin of viral replication (14). The related integrations in mice and rats, located in the same chromosomal locations, might be explained by such a mechanism. However, the extent of endogenous sequence decay and the frequency of stop codons indicate that these integrations occurred some 30 to 35 million years ago, implying that they are derived from a single event in a rodent ancestor rather than two independent integration events at the same location. Similarly, integrations EDLG-1 in dog and panda lie in chromosomal regions that can be readily aligned (based on University of California, Santa Cruz [UCSC] genome assemblies) and show sequence decay consistent with the age of the common ancestor, about 42 million years. Endogenous sequences related to the family Parvoviridae can thus be traced back more than 40 million years, and viral proteins related to this family have remained over 40% conserved.
Sequences related to circoviruses were detected in five vertebrate species (Table 1 and Table S1 in the supplemental material). At least one of these sequences, the endogenous sequence in opossum, likely represents a recent integration. Several integrations in dog, cat, and panda, on the other hand, appear to date from at least 42 million years ago, when pandas and dogs last shared a common ancestor. We see evidence for this age in data from sequence degradation (Table 1), phylogenetic analyses of endogenous Circovirus-like genomes (see Fig. S2 in the supplemental material), and genomic synteny, where integration ECLG-3 is surrounded by genes MTA3 and ARID5A in both dog and panda and integration ECLG-2 lies 35 to 43 kb downstream of gene UPF3A. In fact, Circovirus integrations may even precede the split between dogs and cats, about 55 million years ago, although the preliminary assembly and short genomic contigs for cats make synteny analysis impossible.
The most common Circovirus-related sequences detected in vertebrate genomes are derived from the rep gene. We speculate that, like those of the Parvoviridae, the ancestral Circoviridae sequences might have been copied using a primer sequence in the host DNA that resembled the viral origin and was therefore recognized by the virus Rep protein. The higher incidence of rep gene identifications may reflect higher conservation of this gene over time, or alternatively, possession of these sequences may impart some selective advantage to the host species. The largest Circovirus-related integration detected, in the opossum, comprises a short fragment of what may have been the cap gene immediately adjacent to and in the opposite orientation from the rep gene. This organization is similar to that of the present-day Circovirus genome, in which these genes share a promoter in the hairpin regions but are translated in opposite directions (Fig. 2).
In summary, our results indicate that sequences derived from ancestral members of the families Parvoviridae and Circoviridae were integrated into their hosts' genomes over the past 50 million years of evolution. Features of their replication strategies suggest mechanisms by which such integrations may have occurred. It is possible that some of the endogenous viral sequences could offer a selective advantage to the virus or the host. We note that rep open reading frame-derived proteins from some members of these families kill tumor cells selectively (3, 12). The genomic “fossils” we have discovered provide a unique glimpse into virus evolution but can give us only a lower bound on the actual ages of these families. However, numerous recent integrations suggest that their germ line transfer has continued into present times.
Acknowledgments
A.M.S. was funded by National Institutes of Health grants CA71515 and CA06927 and also by an appropriation from the Commonwealth of Pennsylvania. V.A.B. is supported by membership from Martin A. and Helen Chooljian at the Institute for Advanced Study.
We thank Marie Estes for excellent secretarial assistance and Karen Trush for help with the figures.
The contents of this paper are the responsibility solely of the authors and do not necessarily represent the official views of the National Cancer Institute or any other sponsoring organization.