Since their discovery (
76), overlapping genes, i.e., DNA sequences simultaneously encoding two or more proteins in different reading frames, have exerted a fascination on evolutionary biologists. Among several mechanisms, they can be created by a process called “overprinting” (
43), in which a DNA sequence originally encoding only one protein undergoes a genetic modification leading to the expression of a second reading frame in addition to the first one (Fig.
1). The resulting overlap encodes an ancestral, “overprinted” protein region and a protein region created de novo (i.e., not by duplication) called an “overprinting” or “novel” region (Fig.
1). At present, the creation of proteins de novo is widely thought to be very rare, in contrast to their emergence by gene duplication, which is considered the major source of new proteins (for reviews, see references 55 and 94). However, this belief might simply reflect the fact that proteins created de novo are generally very difficult to identify (
55). Indeed, a long-standing question is whether a protein that has no detectable homolog in other organisms (called an “orphan” protein or “ORFan” [
27] or “taxonomically restricted” [
110]) represents a protein created de novo in a particular organism or merely a protein that is a member of a larger family whose other members have diverged beyond recognition or have become extinct (
115). Proteins created de novo by overprinting provide a valuable opportunity to address these questions, and this constitutes one of the two strands of our study.
Practically all studies of overlapping genes have been focused on evolutionary constraints and informational characteristics at the DNA level (see, e.g., references
46,
71,
75,
84,
85, and
114). However, very little has been done to assess potential effects of the overlap on the corresponding protein products. Two studies reported that overlapping proteins are enriched in amino acids with a high codon degeneracy (arginine, leucine, and serine) (
68) and that they often simultaneously encode a cluster of basic amino acids in one frame and a stretch of acidic amino acids in the other frame (
66).
The other strand of the present study is based on earlier observations of the overlapping gene set of measles virus (
41), which suggested that protein regions encoded by overlapping genes might have a propensity toward structural disorder.
Structural disorder is an essential state of numerous proteins, in which it is associated mostly with signaling and regulation roles (
21,
96,
111). The key feature of intrinsically disordered proteins (also called “unstructured” or “natively unfolded”) is that under physiological conditions, instead of a particular three-dimensional (3D) structure, they adopt ensembles of rapidly interconverting structural forms. Different degrees of disorder exist, from random coils to molten globules (
100), and some disordered regions can become ordered under certain conditions (
21,
96,
117). A variety of computer programs have been developed to predict these regions (
19,
23,
101). Each predictor typically differs in what kind of “disorder” it identifies (
23,
78), matching only some of the types of disorder mentioned above. Therefore, in order to choose a proper predictor, it was necessary to define precisely what kind of structural disorder we expected to find in proteins encoded by overlapping genes.
At least two nonexclusive hypotheses can explain why overlapping genes might encode disordered proteins: (i) the newly created (overprinting) protein of each overlap might tend to be disordered, and (ii) structural disorder in proteins encoded by overlapping genes might alleviate evolutionary constraints imposed on their sequence by the overlap. These hypotheses are clarified below.
Intuitively, the conditions required for a protein to fold into a stable 3D configuration, including sequence composition, periodicity, and complexity, are such that structurally ordered proteins represent a vanishingly small fraction of all possible amino acid sequences. Indeed, proteins artificially created from random nucleotide sequences generally have a low secondary structure content (
107,
112). Hence our first hypothesis: novel, overprinting proteins are not expected to have a fixed 3D structure at birth, given the low probability of generating structure from a completely new sequence.
Disordered proteins are generally subject to less structural constraint than ordered ones (
13). Hence our second hypothesis: the presence of disorder in one or both products of an overlapping gene pair could greatly alleviate evolutionary constraints imposed by the overlap, allowing both protein products to scan a wider sequence space without losing their function.
Both hypotheses suppose only the lack of a rigid structure, as opposed to a total lack of structure (e.g., some proteins created de novo from a random nucleotide sequence, though lacking secondary structure, have a certain degree of order [
112]). For that reason, in this work, we use the widest possible definition of disorder, i.e., the lack of a rigid 3D structure, and we use a program whose predictions of disorder correspond to this definition, PONDR VSL2 (
69) (see Results).
In this work, we collected a large number of experimentally proven cases of proteins encoded by overlapping genes in unspliced eukaryotic RNA viruses and analyzed their sequence properties.
MATERIALS AND METHODS
Selection and curation of the data set of viral overlapping gene products.
We set out to find virus genomes containing overlapping genes whose existence was supported by experimental evidence. We first downloaded the file “Virus.ids,” release 2 July 2004 (
ftp://ftp.ncbi.nih.gov/genomes/IDS/Viruses.ids ), containing accession numbers for all complete viral genomes (except those of bacteriophages) from the NCBI viral database (
6). We then downloaded the 1,562 corresponding genomes or genome segments, corresponding to 1,098 viruses (some viruses have a segmented genome), and parsed all relevant information for each genome. Since the NCBI viral genome database (
6) is not completely reliably annotated (
62), we had to carefully select bona fide overlapping genes. We excluded from the analysis all files containing a “join” instruction (regardless of whether it reflected a splicing event, a frameshift, or a circular genome with genes crossing the genome map borders) because their manual curation would have been too time-consuming. We also excluded all DNA viruses and all viral genera in which at least one virus is known to make use of splicing, and we selected only overlaps longer than 90 nucleotides, corresponding to 30 amino acids (aa) (see Results). We considered only one prototype virus per genus. We kept overlaps only if there was biochemical evidence that both proteins they encoded existed (i.e., detection in infected cells or in in vitro translation experiments) or if such evidence was available for the protein products of a homologous gene overlap in a related virus.
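The length criterion above can be illustrated with a short sketch. It is not the pipeline used in the study: the `Cds` record, the function names, and the coordinates in the usage note are hypothetical, and the sketch assumes two coding regions on the same strand with 1-based, inclusive coordinates.

```python
from typing import NamedTuple

MIN_OVERLAP_NT = 90  # threshold stated in the text (90 nucleotides, i.e., 30 aa)


class Cds(NamedTuple):
    """A coding region on one strand (hypothetical record, 1-based inclusive)."""
    name: str
    start: int
    end: int


def overlap_length(a: Cds, b: Cds) -> int:
    """Number of nucleotides shared by two same-strand coding regions."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def in_different_frames(a: Cds, b: Cds) -> bool:
    """True when the two start positions are not a multiple of 3 apart."""
    return (a.start - b.start) % 3 != 0


def is_candidate_overlap(a: Cds, b: Cds) -> bool:
    """Apply only the frame and length criteria; experimental evidence is not modeled."""
    return in_different_frames(a, b) and overlap_length(a, b) > MIN_OVERLAP_NT
```

For example, with made-up coordinates, `is_candidate_overlap(Cds("A", 1, 600), Cds("B", 101, 400))` passes both criteria, whereas a pair sharing only 51 nucleotides does not. The biochemical-evidence and taxonomy filters described in the text require manual curation and are outside the scope of this sketch.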
Overlaps found only in one virus species might stem from a sequencing error resulting in an artifactual N-terminal or C-terminal extension. Therefore, we checked in the literature that the proteins expressed had the actual, predicted size or that several viral strains from that species also had a similar overlap. If we could not exclude a sequencing artifact, we discarded the overlap.
If the theoretical start or stop codon of an overlapping open reading frame (ORF) as described in the NCBI file was incorrect, it was manually corrected (for instance, VP5 of infectious pancreatic necrosis aquabirnavirus starts at nucleotide 113 and not at nucleotide 68 [
108]). A few unspliced RNA viruses contain bona fide overlapping genes that are not described in the corresponding NCBI genome file. They were included in the analysis, and the missing proteins they encode were manually added: rice dwarf phytoreovirus OP-ORF (
89), Theiler's cardiovirus protein L* (
104), and vesicular stomatitis Indiana vesiculovirus protein C′ (
47). We provide their sequences in File S1 in the supplemental material.
A few viruses make use of frameshifting to generate overlapping reading frames but (presumably by mistake) their genome file does not contain a “join” instruction (for instance, the mumps rubulavirus P/V overlap), and therefore they were included in the analysis. Among those, some frameshifts or editing events result in genes that are partially colinear (upstream of the frameshift) and that thus truly overlap only downstream of the frameshift. In these cases, we excluded the colinear part. For instance, in the case of the mumps rubulavirus P/V gene system we excluded the N-terminal part common to both P and V (
41). Finally, in some cases an ORF (called “1”) overlaps several ORFs (called 2, 2′, 2″, 2‴, etc.) that are colinear with each other because of alternative translation initiation sites, for instance, proteins C, C′, Y1, and Y2 in Sendai respirovirus (16). In such cases we kept only the ORF 2 with the longest overlap with ORF 1 (in this example, ORF C).
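The rule for colinear ORFs can be sketched as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, ORFs are given as `(start, end)` tuples with 1-based inclusive coordinates, and the coordinates in the usage note are invented and do not correspond to any real genome.

```python
def longest_overlap_partner(orf1, candidates):
    """Return the name of the candidate ORF sharing the most nucleotides with orf1.

    orf1       -- (start, end) of the ORF called "1"
    candidates -- dict mapping a name to the (start, end) of a colinear ORF
    """
    def overlap(a, b):
        return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

    if not candidates:
        raise ValueError("no candidate ORFs supplied")
    return max(candidates, key=lambda name: overlap(orf1, candidates[name]))
```

With invented coordinates such as `longest_overlap_partner((100, 1800), {"2": (110, 730), "2'": (180, 730), "2''": (200, 730)})`, the ORF named `"2"` is retained because its overlap with ORF 1 is the longest.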
Viral taxonomy.
Viral taxonomy changes quickly, and some names of viral taxa that are widely used by virologists are not officially recognized. Several of these taxa proved to be crucial for interpretation of our results in an evolutionary light (e.g., the proposed family
Tubiviridae [
97]). Therefore, in addition to the official taxonomy (
58), we also list proposed taxa, together with the corresponding references. The interested reader can consult the website where proposals to the International Committee for the Taxonomy of Viruses are made,
http://talk.ictvonline.org .
PONDR analysis of viral genes.
The sequences of overlapping genes and their protein products were stored in a MySQL database for analysis. Protein intrinsic disorder was predicted using PONDR VSL2 (
69), a neural network trained on a set of ordered and disordered sequences, which relies on attributes such as the composition of particular amino acids or hydropathy to predict disorder propensity along a protein sequence. PONDR predictions were also stored in the database.
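Once per-residue predictions are available, the disorder content of a region can be computed as in the following sketch. It does not call PONDR VSL2 itself; it assumes the scores have already been obtained as a list of numbers between 0 and 1. The cutoff of 0.5 and the function names are assumptions made for illustration and are not stated in the text.

```python
DISORDER_THRESHOLD = 0.5  # assumed cutoff; not specified in the text


def disorder_fraction(scores):
    """Fraction of residues whose score reaches the disorder cutoff."""
    if not scores:
        raise ValueError("empty region")
    return sum(s >= DISORDER_THRESHOLD for s in scores) / len(scores)


def split_by_overlap(scores, start, end):
    """Disorder fractions inside and outside an overlap.

    start, end -- 1-based inclusive residue positions of the overlapping region.
    Returns (fraction_inside, fraction_outside).
    """
    inside = scores[start - 1:end]
    outside = scores[:start - 1] + scores[end:]
    return disorder_fraction(inside), disorder_fraction(outside)
```

For a toy protein with scores `[0.1, 0.2, 0.7, 0.9, 0.6, 0.3]` and an overlap covering residues 3 to 5, all three overlapping residues are counted as disordered and none of the remaining residues are.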
Bootstrapping was used on the results to generate the confidence intervals shown. Ten thousand data sets of overlaps were randomly selected with replacement, and the calculations were repeated on each one of them. The 10,000 results were sorted and used to provide the boundary results for the appropriate confidence intervals.
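The resampling procedure can be sketched as a percentile bootstrap. The sketch makes several assumptions not given in the text: each overlap is summarized as a hypothetical pair `(n_disordered, n_residues)`, the statistic is the pooled fraction of disordered residues, and the confidence level defaults to 95%.

```python
import random


def pooled_disorder_fraction(overlaps):
    """Pooled fraction of disordered residues over (n_disordered, n_residues) pairs."""
    total = sum(n for _, n in overlaps)
    return sum(d for d, _ in overlaps) / total


def bootstrap_ci(overlaps, n_resamples=10_000, level=0.95, seed=0):
    """Percentile bootstrap interval obtained by resampling overlaps with replacement."""
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    stats = sorted(
        pooled_disorder_fraction(rng.choices(overlaps, k=len(overlaps)))
        for _ in range(n_resamples)
    )
    # Boundary results are read off the sorted list of resampled statistics.
    lo_idx = max(0, int(n_resamples * (1 - level) / 2))
    hi_idx = min(n_resamples - 1, int(n_resamples * (1 + level) / 2))
    return stats[lo_idx], stats[hi_idx]
```

Resampling whole overlaps, rather than individual residues, keeps residues from the same protein together, which is consistent with the description of data sets of overlaps being selected with replacement.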
The distribution of disordered regions in the overlapping regions was compared to the overall distribution of disorder in the entire data set. The null hypothesis tested was that the distribution of disorder in overlapping regions is the same as that in the entire data set; that is, we assume that there is no bias toward a greater concentration of disordered residues in overlapping regions. Using a chi-square test on sequence positions located 15 residues apart (a spacing chosen so that the sampled positions can be treated as independent), we obtained a P value, i.e., the probability of observing a difference at least as large as the one measured if the null hypothesis were true.
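A chi-square test of this kind can be sketched as follows. The sketch is an illustration under stated assumptions rather than the authors' exact procedure: it frames the comparison as a 2 × 2 table (overlapping or nonoverlapping positions versus disordered or ordered), uses the Pearson statistic without continuity correction, and the function names are hypothetical. For one degree of freedom, the upper-tail probability of the chi-square distribution equals `erfc(sqrt(chi2 / 2))`, so no external library is needed.

```python
import math


def thin_positions(labels, step=15):
    """Keep one position every `step` residues to reduce dependence between samples."""
    return labels[::step]


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test for the table [[a, b], [c, d]] (1 degree of freedom).

    a, b -- disordered and ordered counts in overlapping regions
    c, d -- disordered and ordered counts in nonoverlapping regions
    Returns (chi2, p_value).
    """
    n = a + b + c + d
    margins = (a + b, c + d, a + c, b + d)
    if 0 in margins:
        raise ValueError("a row or column of the table is empty")
    chi2 = n * (a * d - b * c) ** 2 / math.prod(margins)
    p_value = math.erfc(math.sqrt(chi2 / 2))  # exact upper tail for 1 df
    return chi2, p_value
```

A table with identical proportions in both rows, such as `chi_square_2x2(10, 10, 10, 10)`, gives a statistic of 0 and a P value of 1, as expected when there is no bias.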
Identification of putative ancestral, overprinted proteins.
As a first screen, all proteins encoded by overlapping genes were subjected to SMART analysis (
52), which includes prediction of PFAM and SMART domains, transmembrane and low-complexity regions, signal peptides, etc. The sequences of all overlapping protein regions were analyzed using (i) Psi-blast (
2); (ii) sequence profile comparison methods, which automatically run a Psi-blast query on a single sequence, align the retrieved sequence hits, derive a profile from the corresponding multiple-sequence alignment, and search the library of sequence profiles in PFAM release 23 (
25) for similar profiles (HHpred [
86], Compass [
74], and FFAS03 [
39]); and (iii) fold recognition methods (Fugue [
81] and Phyre [
9]). Finally, we submitted the 3D structures of proteins, when available, to structural similarity searches using VAST (
30) and SSM (
49). Protein regions were considered ancestral if they had statistically significant sequence or structural similarity with at least one other protein region from a different viral family (unclassified genera were counted as distinct families).
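The decision rule at the end of this procedure can be sketched in a few lines. The sketch covers only the final classification step, not the similarity searches themselves; the function name and the labels in the usage note are hypothetical. It assumes that the caller supplies, for each region, the family of every region with which statistically significant similarity was found, and that an unclassified genus is passed under its own name so that it counts as a distinct family.

```python
def is_ancestral(own_family, hit_families):
    """True if at least one significant hit comes from a different viral family.

    own_family   -- family (or unclassified genus) of the region under study
    hit_families -- families of all regions with significant similarity to it
    """
    return any(family != own_family for family in hit_families)
```

For instance, `is_ancestral("FamilyA", ["FamilyA", "GenusX"])` returns `True`, whereas a region whose only hits lie within `"FamilyA"` is not considered ancestral under this criterion.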
Prediction of structural organization of pairs of known ancestral/novel overlapping regions.
The analyses described in the previous paragraph identified known domains, transmembrane segments, etc. Refined disorder prediction was carried out as follows (respecting the principles described in reference
23). We analyzed proteins containing novel or ancestral regions using the disorder predictor iPDA. For a conservative approach, we also used the predictors Prelink and Disopred, which have a very high specificity (
113), when the presence of disorder in a certain region was dubious. If neither program predicted disorder within the region under scrutiny, we considered the whole region to be ordered. The boundaries of disordered regions were refined by visual inspection of hydrophobic cluster analysis plots (
14). To find experimental evidence of disorder, all proteins were subjected to a Blastp similarity search (
2) against the database of disordered proteins Disprot (
82), and we also carried out extensive bibliographical searches.
Analysis of amino acid composition.
Composition Profiler (
102) allows comparison of the composition of a user-defined “query” data set (for instance, overlapping regions of proteins) with that of another user-defined “background” data set (for instance, nonoverlapping regions) or with that of a precompiled data set. The precompiled data sets we used are SwissProt 51 (
4), which is most similar to the distribution of amino acids in nature; PDB Select 25, which is a subset of structures from the Protein Data Bank (
10) with less than 25% sequence identity, biased toward the composition of proteins amenable to crystallization studies; and DisProt 3.4 (
82), which is a set of sequences of experimentally determined disordered regions. Composition Profiler also allows the discovery of biases in certain groups of amino acids such as order-promoting amino acids or charged amino acids (“discover” option) (
102) and the calculation of the relative entropy (RE) of two data sets, which roughly summarizes how dissimilar their compositions are. We used a significance value of 0.01 to identify composition biases.
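The relative entropy between two compositions can be illustrated with the standard Kullback-Leibler formula, RE = Σ p_i log(p_i / q_i), summed over the 20 amino acids. The following sketch is not the Composition Profiler implementation: the use of the natural logarithm, the pseudocount that avoids division by zero, and the function names are all assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequences, pseudocount=1.0):
    """Amino acid frequencies of a set of sequences (nonstandard letters ignored)."""
    counts = Counter(aa for seq in sequences for aa in seq if aa in AMINO_ACIDS)
    total = sum(counts[aa] + pseudocount for aa in AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def relative_entropy(query, background):
    """Kullback-Leibler divergence of the query composition from the background."""
    return sum(
        query[aa] * math.log(query[aa] / background[aa]) for aa in AMINO_ACIDS
    )
```

The value is 0 when the two compositions are identical and grows as they diverge, which matches the interpretation given in the text: the higher the RE, the more the two data sets differ in composition.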
Disorder content of differentially constrained overlapping genes.
The disorder content of viral overlapping genes whose evolutionary rates are known was calculated using the PONDR VSL2 predictor. Protein sequences were taken from genome entries. The GenBank accession numbers of the genomes are as follows: hepatitis B virus, NC_003977; human T-lymphotropic virus, AF139170; simian immunodeficiency virus, U72748; human papillomavirus, AF293961; coliphage φX174, J02482; potato leafroll virus, AF453389; Sendai virus, AB039658; and cotton leaf curl virus, NC_004607.
RESULTS
Collection of a curated data set of overlapping genes from a wide range of eukaryotic RNA viruses.
We carefully selected overlapping genes whose existence was supported by experimental evidence. Indeed, including an overlapping reading frame that is in fact not translated might introduce noise in our analyses, since such sequences are not subject to evolutionary pressure. Misannotated overlaps might stem from untranslated “hypothetical” genes, from a start codon wrongly assigned upstream of the true start codon, or from an undetected splicing event that results in an exon/intron overlap instead of an overlap of coding sequences. The last possibility prompted us to exclude all viruses that are known to make use of splicing. Curation of prokaryotic viruses (bacteriophages) and of DNA viruses proved to be too difficult. Therefore, we focused on unspliced, eukaryotic RNA viruses, which are either single stranded with a plus or minus genome polarity (respectively, +ssRNA and −ssRNA) or double stranded (dsRNA), and on unspliced retroid viruses, which use both DNA and RNA in their genome (for a review, see reference
5). Only one representative virus per genus was chosen.
The construction and curation of the data set are described in Materials and Methods. We concentrated on overlaps longer than 90 nucleotides, corresponding to 30 aa, for two reasons: (i) shorter regions are unlikely to fold by themselves (
87) and are thus expected to have a lesser structural impact, and (ii) the reliability of disorder prediction increases with length (
65,
90). By taking all of these precautions, we built a very conservative, high-quality data set of 43 viral genomes containing bona fide overlapping genes.
Table
1 shows some statistics for the 43 viral genomes comprising our data set, which are presented in Tables
2 to
6. They are grouped by taxonomy, to which we have paid particular attention in order to make this work as informative as possible (see Materials and Methods).
Some viral genomes contain several pairs of overlapping genes (for instance, the
Arterivirus GP2/GP3 and GP3/GP4 overlaps [Table
2]), while some genes overlap with more than one gene (for instance, the
Orthohepadnavirus P gene overlaps with three genes: L, X, and the capsid gene [Table
3]). Therefore, in total there are 52 gene overlaps (104 overlapping regions) in the data set, involving 96 protein products (Table
1). All overlaps in the data set are sense/sense, i.e., correspond to genes found on the same nucleic acid strand, and none encodes more than two proteins in different reading frames. The mean size of viral overlaps was 138 aa (Table
1), which corresponds to the typical size of a protein domain and is much longer than typical overlaps reported to exist in bacterial genomes (
29,
71). No precise data are available for eukaryotes due to the difficulty in reliably predicting overlapping genes, but a substantial number of overlaps of comparable length have been reported (
1,
70).
Examples of bona fide overlapping genes that have not been incorporated in this study because of the restrictions described above or because of technical limitations (see Materials and Methods) include the
Bornavirus P/X gene overlap (
109), which was removed because bornaviruses are known to make use of splicing (
79), and the
Henipavirus P/V and P/C overlaps (
106), which were excluded because the genome file contained a “join” instruction (see Materials and Methods), which is generally indicative of splicing but in this case is indicative of a frameshift.
In spite of these limitations, our data set still covers a wide evolutionary range. It consists mostly of ssRNA and dsRNA viruses, with only two retroid viruses (Table
3), because most retroid viruses are spliced and have thus been excluded. The data set includes at least one representative from several large viral orders or supergroups: the (unofficial) alphavirus-like supergroup (
72,
103) (Table
4); the orders
Picornavirales,
Nidovirales (Table
2), and
Mononegavirales (Table
6); and the proposed order
Reovirales (
58) (Table
2). Thus, our data set represents a good sampling of the diversity of overlapping genes in RNA viruses.
Protein regions encoded by overlaps have a higher disorder content.
We have chosen to use the PONDR VSL2 software for the automated analysis because it has consistently been found to have one of the best combinations of specificity and sensitivity (
88) and because its definition of “disorder” is well suited to the biological question studied. Indeed, when PONDR VSL2 predicts a region to be “disordered,” what it predicts, more precisely, is that it has no fixed 3D structure (
69), which corresponds to our hypotheses about overlapping gene products (see the introduction). In addition to using PONDR, we also carried out in-depth analysis of selected proteins using a combination of structural prediction methods, as described in Materials and Methods and below. Our strategy is described in Fig.
2.
All proteins encoded by overlapping genes were subjected to prediction of structural disorder using PONDR VSL2. As shown in Fig. 3, 29% of the amino acids of the whole data set are predicted to be in a disordered state. Broken down by region, 23% of the amino acids in nonoverlapping regions are predicted to be disordered, compared with 48% of the amino acids in overlapping regions. This difference in disorder content is highly significant (chi-square value = 254.4, one degree of freedom, P = 2.7 × 10⁻⁵⁷) (see Materials and Methods). Thus, in our data set, protein regions encoded by overlapping genes show a significant bias toward structural disorder.
Identification of ancestral/novel protein pairs by their phylogenetic distribution.
One of our hypotheses (see the introduction) was that novel proteins created by overprinting tend to be disordered. Therefore, we tried to identify overlaps encoding recognizable ancestral/novel protein pairs.
Finding which protein is the ancestral one and which is the novel one in an overlapping pair is a difficult problem. Methods include (i) comparison of the codon usage of each overlapping reading frame to that of nonoverlapping genes of the viral genome (
67,
68) and (ii) assessing the phylogenetic distribution of each overlapping gene product, i.e., the extent to which they have homologs in other organisms (
43,
71). In these methods, the ancestral reading frame is assumed to be, respectively, the one having the standard genome codon usage and the one with the widest phylogenetic distribution. Whenever possible, both methods should be used together, since they are complementary (
43). However, implementing the first method with nearly 100 viral proteins is a large project in itself and is clearly outside the scope of this work. Therefore, we chose to examine the phylogenetic distribution of each overlapping gene product. We presumed that a protein region (>30 aa) involved in an overlap was ancestral only if it was conserved in at least two viral families. Given the high rate of evolution of RNA viruses (
20), this is a very stringent, and thus very conservative, criterion.
Our strategy is described in Fig.
2 and in Materials and Methods. Briefly, protein regions were considered ancestral only if they had either statistically significant sequence similarity or structural similarity with at least one other protein region from a different viral family. Sequence similarity was assessed using profile-profile comparison, and structural similarity was assessed using fold recognition methods or direct structural comparison.
We found 21 protein regions matching this criterion, coming from 20 proteins from 19 viral genera. They are presented in Table
7. Several viral families contain genera with homologous pairs of overlapping genes (i.e., both overlapping regions have homologs in another viral genus, which also overlap): the
Birnaviridae VP5/VP2 overlap, the
Tubiviridae TGB2/TGB3 overlap, and the
Tombusviridae movement protein/p19 or p14 overlap (Table
7). In these cases we retained only one viral genus per family (
Avibirnavirus,
Pomovirus, and
Tombusvirus, respectively). In the end we found 17 nonhomologous overlaps encoding ancestral regions, from 15 different genera corresponding to nine families of +ssRNA, dsRNA, and retroid viruses (Table
7).
All ancestral regions match at least one PFAM sequence family as shown using profile-profile comparison (see Materials and Methods); in other words, no ancestral region was selected only on the basis of structural similarity. (Briefly, a PFAM family is a collection of sequences of homologous protein domains or regions [
25]. Related PFAM families are grouped in “clans” [
24].)
We found no gene overlap for which both protein products were presumed to be ancestral according to the phylogenetic distribution criterion. In other words, all the overlaps selected by this method encoded, on the one hand, a protein region conserved in at least two viral families and, on the other hand, a protein region that was restricted to one family at most. This reinforces our working hypothesis that protein regions conserved in two viral families can be considered ancestral whereas the regions overlapping them are novel (see also Discussion). Table
7 presents novel protein regions together with the ancestral protein regions that they overlap.
Some ancestral regions have homologs in a very large number of viral families, and it would be highly impractical to mention all these viral families. Instead, we present in Table
7 the PFAM families (release 23) corresponding to ancestral regions. This allows the reader to visualize easily the taxonomic distribution of homologs of ancestral regions, thanks to a user-friendly service called “species” available on the PFAM website as well as relevant bibliographical references (
25).
During the analysis of this large data set, we uncovered evolutionary relationships between some viral proteins, using profile-profile comparisons (see Materials and Methods). In Table
7 we propose corresponding new PFAM families and clans (
24). Two of these suggested clans correspond to distant sequence similarities unreported so far, to our knowledge. The first involves the nucleoproteins of the
Bunyaviridae and of the unclassified genus
Tenuivirus. The second involves the C-terminal moiety of the methyltransferase-guanylyltransferase (MT-GT) (
72) of the
Altovirus group, called the “Y region” (
45). We found that it is also present in the
Typovirus group and is thus conserved throughout the alphavirus-like supergroup (Table
4). This finding is consistent with experimental evidence that the MT-GTs of this viral supergroup have a common mechanism (
56). This MT-GT is unique to these viruses and thus constitutes an important drug target for a number of human pathogens such as hepatitis E virus or chikungunya virus. Its structure has not been solved at present, and thus our finding might facilitate further protein expression studies or modeling studies.
Prediction of the structural organizations of ancestral proteins and of novel proteins.
We then predicted the structural organization of each ancestral and novel protein using a combination of complementary methods (see Materials and Methods and Fig.
2) and plotted it in Fig.
4. All 17 ancestral protein regions are predicted to be ordered. Out of the 17 novel protein regions, 6 are predicted to be mostly ordered (
Carmovirus p25,
Tombusvirus p19,
Orthohepadnavirus S domain,
Capillovirus replicase,
Orthobunyavirus nonstructural proteins, and
Carmovirus p23), 1 is predicted to be about half ordered (the
Potexvirus TGBp3), and the 10 others are predicted to be mostly disordered. Thus, these results suggest a greater tendency for intrinsic disorder in novel protein regions, which is compatible with the first hypothesis described in the introduction.
Biased sequence composition of protein regions encoded by overlaps.
Earlier studies have suggested that overlapping protein regions have a biased sequence composition, being enriched in amino acids with the highest codon degeneracy (i.e., those encoded by six different codons) (
68). We performed an exploratory analysis based on our larger data set. Using Composition Profiler (
102), we first examined global biases in amino acid composition, represented by the “RE” (see below), and then examined biases in specific amino acids. We compared the sequence composition of all overlapping regions, or of novel or ancestral regions (Table
7 and Fig.
4), to that of reference sets, i.e., Swiss-Prot, PDB, and Disprot. Roughly, they correspond, respectively, to the mean composition of proteins in nature, to that of ordered proteins, and to that of disordered proteins (see Materials and Methods). To examine biases in global composition, we calculated the RE between each data set and Swiss-Prot, which is a rough measure of their difference in mean composition (
102) (see Materials and Methods). The higher the RE of two data sets, the more they differ in composition. For instance, the REs of PDB and of Disprot relative to Swiss-Prot are, respectively, 0.002 and 0.07 (Fig.
5), which indicates that Swiss-Prot has a composition much closer to that of PDB than to that of Disprot.
Figure
5 clearly shows that overlapping regions (bar 4) have a substantial composition bias relative to Swiss-Prot (RE lower than that of Disprot but much higher than that of PDB). Considering the subset of ancestral/novel regions (listed in Table
7), we see that ancestral regions have an RE only slightly lower than that of all overlapping regions (compare bars 5 and 4) but that novel regions (bar 6) have a spectacular composition bias, with an RE more than twice that of Disprot. As a control, the RE of the “background” composition is much lower than that of the overlapping data sets (compare bar 3 and bars 4 to 6).
We then computed the relative enrichment or depletion in specific amino acids of our data sets with respect either to Swiss-Prot or to nonoverlapping regions (used as a “background” composition of viral proteins). The biases uncovered when comparing the data sets to the background were similar to those observed compared to Swiss-Prot but of lower magnitude (not shown). Consequently, in order to draw conservative conclusions, we present the composition bias of each amino acid relative to this background, instead of Swiss-Prot, in Fig.
6. Amino acids are arranged according to their codon degeneracy as described previously (
68). We also examined whether the data sets were significantly (
P < 0.01) biased in disorder-promoting or in order-promoting amino acids (listed in reference
102) using the “Discovery” option of Composition Profiler (see Materials and Methods) (Fig.
6).
Taken together, overlapping regions have a significant deviation in most amino acids (16 out of 20) and are significantly biased toward disorder, i.e., enriched in disorder-promoting amino acids and depleted in order-promoting amino acids (Fig.
6, top panel). The subsets of ancestral and of novel regions show distinct trends. Ancestral regions have a composition bias for three amino acids only (middle panel) and have no significant bias toward order or disorder. In contrast, novel regions (bottom panel) are heavily biased regarding both the number of amino acids involved (
18) and the magnitude of the bias (on average more than twice that of overlapping regions taken globally [compare top and bottom panels]). Furthermore, they are biased toward disorder (bottom panel, right).
Finally, we examined Fig.
6 qualitatively, looking for a bias of overlapping regions with respect to codon degeneracy: for instance, enrichment in amino acids encoded by highly degenerate codons (as reported in reference
68) or depletion in amino acids encoded by low-degeneracy codons. This simple visual examination suggests that overlapping regions taken globally (top panel) are enriched in amino acids with a codon degeneracy of ≥4 and depleted in amino acids with a degeneracy of <4. However, the magnitude of this bias depends upon the data set chosen as background (Swiss-Prot or nonoverlapping regions [not shown]), and this observation should be interpreted with great caution until validated by a rigorous statistical analysis of a larger data set. No clear bias with respect to codon degeneracy is visible for either the novel or ancestral regions (Fig.
6, middle and bottom panels).
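The grouping of amino acids by codon degeneracy used in this comparison can be derived directly from the standard genetic code, as in the following sketch. The variable names are hypothetical, and the sketch assumes the standard code (NCBI translation table 1), written in the conventional T, C, A, G order.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated as TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

# Degeneracy = number of codons encoding each amino acid (stop codons excluded).
DEGENERACY = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

HIGH_DEGENERACY = sorted(aa for aa, n in DEGENERACY.items() if n >= 4)
LOW_DEGENERACY = sorted(aa for aa, n in DEGENERACY.items() if n < 4)
```

Under the standard code, arginine, leucine, and serine are each encoded by six codons, and the group with a degeneracy of at least 4 additionally contains alanine, glycine, proline, threonine, and valine; the remaining 12 amino acids have a degeneracy below 4.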
In summary, the composition of overlapping protein regions is biased toward disorder-promoting amino acids. In particular, novel regions have a very large compositional bias. Overlapping regions seem to favor the use of amino acids with a high codon degeneracy (≥4), as seen using a merely qualitative approach, but this observation should be taken with caution until validated by further studies.
Specific functions of overlapping proteins.
In Table
7, we have compiled the known functions of overlapping proteins. In most cases, one function or several functions have been attributed to the full-length protein but the precise function of the novel region itself has not been determined. In cases where a function has been attributed specifically to the novel region, we included it with the associated bibliographical references. Table
7 and Fig.
4 show that all novel overprinting proteins with known function, except one (the
Orthohepadnavirus L), are “accessory” proteins (i.e., neither structural nor enzymatic), most often overprinting a structural or enzymatic protein.
Proteins generated by overprinting homologous DNA sequences are extremely diverse.
Several ancestral viral proteins of our data set, from different genera, are homologous to each other (i.e., they share statistically significant sequence similarity). They have been overprinted by proteins that show no distinguishable sequence or structural similarity to each other and thus might have been created independently in each genus. The identification of such proteins, which show a wide diversity both in function and in structure, offers an unprecedented insight into de novo protein creation by viruses. For instance, consider Fig.
4, panel 4, and the corresponding Table
7. Capilloviruses, tombusviruses, and umbraviruses encode a movement protein belonging to the “30K” superfamily, sharing a homologous central domain (
61). In these genera, the movement protein has been overprinted, respectively, by an ordered domain of unknown function that is part of a polyprotein, by a mostly ordered suppressor of RNA silencing (
105), and by a ribonucleoprotein (which also plays a role in long-distance movement) that is predicted to be disordered but might undergo a disorder-to-order transition upon binding to RNA (
92). The case of mandariviruses, trichoviruses, and capilloviruses (same panel), which all encode a homologous coat protein (
18,
44), is equally striking. In the first two genera it has been overprinted, respectively, by the disordered N-terminal domain of an RNA-binding protein and by the disordered C-terminal domain of a 30K movement protein, while in capilloviruses it is not part of an overlap.
Finally, Fig.
4, panel 3, shows that regions homologous to the shell (S) domain of the superfamilies of capsids having the SCOP fold “nucleoplasmin-like/VP (viral coat and capsid proteins)” (
3) have been overprinted in several taxonomically distant viruses by very diverse protein regions: the
Avibirnavirus VP5, a disordered antiapoptosis protein (
36); a disordered tail of the
Betatetravirus replicase; a disordered tail of
Machlomovirus p31; and a region of the
Carmovirus p25 that contains a predicted transmembrane segment (the last three having an unknown function). These examples highlight the “creativity” of nature, which, although starting from a similar material (homologous DNA sequences), did not “invent” similar proteins twice.
Disorder and sequence constraints on overlapping reading frames.
Several studies have shown that overlapping genes often encode a protein heavily constrained in sequence and another one that is much less constrained (
28,
32,
37,
59,
63,
64,
67,
77,
98). In these cases, we would expect the protein with the less constrained sequence to have the greater disorder content, since disordered proteins are less sensitive to sequence changes.
Sequence constraints on overlapping reading frames are usually measured by comparing the rate of synonymous substitutions to that of nonsynonymous substitutions for each frame, using closely related genome sequences; the frame for which this ratio is higher is considered the more constrained (
38,
71). Performing such analyses on our entire data set was beyond the scope of this work, so, in order to provide some verification of the above hypothesis, we gathered from the literature all studies that provide information on the evolutionary rate differences between specific sets of viral overlapping genes (
28,
32,
37,
59,
63,
64,
67,
77,
98). For each, we performed disorder predictions on the corresponding protein products using PONDR VSL2.
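The frame-by-frame comparison of synonymous and nonsynonymous changes described above can be illustrated with a deliberately simplified sketch. It only counts codon-level differences between two aligned, gap-free, uppercase DNA sequences in a chosen frame; it does not normalize by the number of synonymous and nonsynonymous sites, as the methods used in the cited studies do. The function name `count_changes` and the example sequences are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons in T, C, A, G order.
AA_STRING = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AA_STRING)
}


def count_changes(seq_a, seq_b, frame=0):
    """Count codons that differ synonymously or nonsynonymously in one frame.

    seq_a and seq_b are aligned, gap-free, uppercase DNA strings of equal
    length; frame is the 0-based offset of the first codon.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned and of equal length")
    synonymous = nonsynonymous = 0
    for i in range(frame, len(seq_a) - 2, 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous


if __name__ == "__main__":
    # Hypothetical pair differing by a single nucleotide (T -> C at position 3).
    a = "CTTGCAAAA"
    b = "CTCGCAAAA"
    print("frame 0:", count_changes(a, b, frame=0))  # CTT -> CTC (Leu -> Leu)
    print("frame 1:", count_changes(a, b, frame=1))  # TTG -> TCG (Leu -> Ser)
```

In this toy example, the same nucleotide substitution is synonymous in one frame and nonsynonymous in the other, which is why each frame of an overlap must be scored separately.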
Figure
7 plots the predicted disorder content of both regions encoded by each overlap. It clearly shows that in 8 cases out of 10, the less constrained frame encodes the protein region with the greater disorder content. In another case, that of human papillomavirus, the less constrained protein (E2) is only marginally less disordered than the more constrained one (E4), i.e., 89% versus 100%, respectively, which in fact corresponds to both proteins being almost entirely disordered. The last overlap (φX174) corresponds to regions of proteins D and E that are both predicted to be ordered. Thus, this preliminary exploration supports the idea that the less constrained reading frame generally encodes the more disordered region. However, this is not an absolute rule, and overlapping frames can encode two ordered protein regions simultaneously (such ordered/ordered overlaps can also be found in our data set [Fig.
4]).
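The disorder content compared in this analysis can be expressed as the fraction of residues predicted to be disordered. The sketch below assumes per-residue scores between 0 and 1 with a cutoff of 0.5, which is the usual convention for PONDR-type predictors but is stated here as an assumption; the function names and the example scores are hypothetical and are not data from this study.

```python
def disorder_content(scores, threshold=0.5):
    """Fraction of residues whose predicted disorder score reaches the cutoff."""
    if not scores:
        raise ValueError("scores must contain at least one residue")
    return sum(score >= threshold for score in scores) / len(scores)


def less_constrained_is_more_disordered(less_constrained, more_constrained,
                                        threshold=0.5):
    """True if the less constrained region has the higher disorder content."""
    return (disorder_content(less_constrained, threshold)
            > disorder_content(more_constrained, threshold))


if __name__ == "__main__":
    # Hypothetical per-residue scores for the two regions of one overlap.
    less_constrained = [0.9, 0.8, 0.2, 0.6]
    more_constrained = [0.1, 0.3, 0.7, 0.2]
    print(disorder_content(less_constrained))
    print(disorder_content(more_constrained))
    print(less_constrained_is_more_disordered(less_constrained, more_constrained))
```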
DISCUSSION
Our carefully curated data set and conservative analysis allow us to make a strong case for our prediction that proteins encoded by gene overlaps tend to be disordered and to offer unprecedented insight into their evolution.
Unfortunately, it was difficult to find experimental evidence relating to our predictions of disorder, in part because many proteins considered here are accessory ones, which are poorly characterized (see below). Examples of disorder predictions that are experimentally confirmed include the
Orthohepadnavirus protein X (
73), the N-terminal “arm” of the capsid proteins of omegatetraviruses (
35) (Fig.
4) and sobemoviruses (
51), and the N-terminal moieties of the P proteins of morbilliviruses (
42) and vesiculoviruses (
17). We could not find any evidence in the literature that would contradict our predictions, even though some regions predicted to be disordered can actually become partially ordered, e.g., the basic, N-terminal “arms” of the capsid proteins of a number of icosahedral viruses (
51). However, this is consistent with the definition of disorder used in this work (see the introduction): proteins that do not have a unique, rigid 3D structure.
Regarding our predictions of ancestral protein regions (Fig.
4), there is good evidence that most of them are correct. For instance, the reverse transcriptases of orthohepadnaviruses belong to an ancient enzyme family (
83); likewise, the S domains of capsid proteins (
34), the 30K domains of movement proteins (
61), and the MTs of the alphavirus-like supergroup (
72) are each found in more than a dozen virus families. Furthermore, evolutionary studies of viruses from our data set that used complementary analyses, such as codon usage, are in agreement with our results: they predict that the
Tymovirus polyprotein (
68) and the
Birnavirus VP2 are ancestral (
93).
We hope to obtain further insights from other organisms. For instance, we noticed a few exciting examples of ancient proteins overprinted by proteins predicted or known to be disordered (in parentheses): the ankyrin domain of mammalian p16
INK4 (p19
ARF) (
15) and the bacterial ribosomal protein L34 (N-terminal extension of RNase P) (
22).
Earlier observations on the properties of proteins encoded by overlapping genes.
There have been earlier anecdotal observations of a connection between gene overlap and structural disorder. Jordan et al. suggested that the emergence of protein C in the P/C overlap of
Paramyxoviridae (Table
6) was favored by the disordered nature of P (
40). Likewise, Narechania et al. noticed that a disordered region of the
Papillomaviridae protein E2 might have favored the overprinting of protein E4, also predicted to be disordered (
64). However, these studies gave no reliable evidence that P and E2 were ancestral.
More recently, Meier et al. expressed ideas similar to those in this work, based on the analysis of a single overlap (
60). They suggested that the abundant disorder observed in the crystal structure of the
Coronavirus protein NSP9, most likely created by overprinting the nucleoprotein (N), may reflect its recent creation as well as constraints imposed by the N reading frame.
Prior to this article, there had been only one systematic study of overlapping genes at the protein level (
68). It reported that proteins encoded by overlaps were enriched in amino acids with the highest codon degeneracy (R, L, and S). We found enrichment in R and S but not in L and no clear-cut influence of codon degeneracy. The difference might be due to the much lower number of viral genera sampled in the previous work (
68).
Recent work on (uncurated) protein products of overlapping genes of RNA viruses has made interesting connections between their relative frames, their ages, and the modes of creation of the overlap (
8). Our data set of ancestral/novel protein regions is too small to test their findings reliably, but we plan to do so once a larger data set is created.
Why structural disorder in protein products of overlapping genes?
In the introduction, we proposed two nonexclusive hypotheses to explain the increased occurrence of disorder in proteins encoded by gene overlaps: either (i) the newborn protein in each pair tends to be disordered or (ii) the presence of disorder in either protein encoded by overlapping genes lessens evolutionary constraints. In fact, our results are compatible with both hypotheses.
Indeed, almost two-thirds of novel, overlapping protein regions are disordered (Fig.
4), compared with fewer than one-fourth of nonoverlapping protein regions (Fig.
3), which is compatible with the first hypothesis. However, these results should be validated by further studies, since we could determine novel/ancestral status for only 21 overlaps out of 52.
The analysis summarized in Fig.
7 is also compatible with the second hypothesis. A number of studies have shown that overlapping genes most often encode one heavily constrained protein and another one that is much less constrained (
28,
32,
37,
59,
63,
64,
67,
77,
98). Our analysis of a limited data set formed with the proteins studied in these works suggests that the less constrained proteins are generally the more disordered, which is consistent with the second hypothesis.
Thus, it is possible that both factors invoked in the two hypotheses actually contribute to the increased disorder content of overlapping gene products. A simple and attractive explanation would be that the novel proteins of each pair generally are the less constrained ones. Further studies will be needed to address this question.
Insights for viral bioinformatics.
This work establishes several methodological points.
It is possible, with a reasonable effort, to make a thorough bioinformatics structural analysis with a large number (∼100) of proteins involved in a given biological question. At present, this kind of analysis is quite rare (see, e.g., reference
31), although it obviously adds great value when compared to global statistics (e.g., compare Fig.
3 and
4). Furthermore, such analyses are feasible for bench virologists, thanks to the availability of user-friendly web-based tools such as the MPI toolkit (
11).
Our work also suggests that viral ORFs overlapping a known coding sequence and encoding hypothetical proteins with highly biased sequence composition, which are often considered noncoding (
99) and are discarded, might in fact encode a protein. Indeed, recent exciting discoveries of overlapping genes using a systematic approach (
26) suggest that overlapping genes in viruses might be even more common than previously thought.
Most studies aimed at determining the ancestral protein encoded by a gene overlap did not take into account domain organization, with a few exceptions (
28,
64,
67). However, the present work makes it clear that overlapping gene products are often composed of several domains that might have different evolutionary histories. For instance, the overlapping parts of the
Capillovirus replicase and movement protein are each composed of several domains, as is the overlapping part of the
Tymovirus replicase (Fig.
4). Thus, analyses of overlapping gene evolution should be carried out by studying domains separately.
The study of de novo proteins should enhance our knowledge of protein space.
At present, it is thought that proteins adopt fewer than 10,000 structural folds in nature, far fewer than expected from our understanding of biophysics (
115). This discrepancy has brought about two main hypotheses: (i) some structural folds are favored by nature for unknown biophysical or functional reasons, and (ii) most proteins are descended from a limited set of ancestors by duplication (for a review, see reference
116).
All solved structures of overprinting proteins presented here and elsewhere correspond to previously unobserved folds (
53,
60). This constitutes a challenge to the first hypothesis above and even suggests that we might underestimate the number of folds created in nature, because of our limited knowledge of the 3D structures of proteins created de novo. Solving them (as advocated by Keese and Gibbs, remarkably, more than 15 years ago [
43]) might thus help to improve methods to predict the 3D structures of proteins from their sequences, a central problem of bioinformatics which crucially depends on knowing the diversity of protein folds (
33).
De novo protein creation: a significant factor in evolution?
We noted in Results that the great majority of novel proteins are “accessory” (i.e., neither structural nor enzymatic), most often overprinting a structural or enzymatic protein, confirming an earlier observation (
8). “Accessory” does not mean that they are dispensable in vivo; on the contrary, most novel regions play an important role in viral pathogenicity or spread (Table
7), as noticed by Li and Ding (
53). Thus, de novo protein creation appears to be a significant factor in viral evolution, in particular in the evolution of pathogenicity, which is poorly understood at present.
Is it limited to overprinting by viruses? At the time that this article was submitted, two systematic studies of de novo protein creation in eukaryotes (from noncoding sequences and thus not generating overlapping genes) were published. They indicate that de novo protein creation occurs at a significant and unexpected rate, having generated between 5% and 20% of orphan proteins of primates (
95) and about 12% of orphan proteins of the genus
Drosophila (
118). Reciprocally, almost all de novo-created viral proteins that we identified are orphans at the genus level, i.e., are restricted to one genus at most (see Table
7). Thus, these works and ours provide numerous examples of orphan proteins created de novo, as opposed to having diverged beyond recognition from other relatives (see the introduction).
Acknowledgments
We thank S. Longhi, B. Canard, and B. Henrissat for support; V. Uversky for useful advice; R. Belshaw, N. Chirico, and V. Brechot for useful comments on the manuscript; and F. Ferron, J. Grimes, R. Esnouf, and D. Glaser for support in the latest stages. D.K. thanks A. Gibbs and P. Keese for their inspirational work. We also thank all the authors of the excellent freely available programs and databases mentioned in this work.
C.R. gathered and classified all complete, unspliced RNA viral genomes and extracted the overlapping genes. M.K. performed the order-disorder prediction and initial analysis of the genomic data set. A.K.D. coordinated the disorder prediction study. P.R.R. supervised the disorder prediction study, performed statistical analysis on the genomic data set, gathered the data, analyzed the relationship between evolutionary constraints and intrinsic disorder, and cowrote the manuscript. D.K. conceived and coordinated the study, curated the overlapping gene data set, performed the remaining bioinformatics analyses, and cowrote the manuscript.