Relationship between viral genome and virion sizes.
We calculated the virion sizes (volumes) of 88 viruses, chosen to be as representative as possible of known viral biodiversity (i.e., covering 50 viral families and unassigned taxa) and for which accurate data to calculate virion volumes were also available (1). These viruses were dsDNA (n = 33), ssDNA (n = 6), reverse-transcribing (RT) (n = 3), dsRNA (n = 8), negative-sense ssRNA (−ssRNA) (n = 4), and positive-sense ssRNA (+ssRNA) viruses (n = 34). These data are summarized in Table 1 and presented fully in Table S1 in the supplemental material. We calculated virion volumes using a number of common structural parameters, namely virion diameter, distance from center to pole, length, height, and depth (1, 16), or used the volume reported in the original publication.
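As a rough illustration of how such structural parameters translate into volumes, the sketch below applies standard solid-geometry formulas. The mapping from virion shape to formula (sphere for spherical and icosahedral particles, cylinder for rods and filaments, spheroid for ovoid particles, cuboid for brick-shaped particles) and all function names are assumptions made here for illustration; the formulas actually used are those of the cited references.

```python
import math

def sphere_volume(diameter_nm):
    """Volume (nm^3) of a sphere; assumed here for spherical/icosahedral virions."""
    return (4.0 / 3.0) * math.pi * (diameter_nm / 2.0) ** 3

def cylinder_volume(diameter_nm, length_nm):
    """Volume (nm^3) of a cylinder; assumed here for rod-shaped or filamentous virions."""
    return math.pi * (diameter_nm / 2.0) ** 2 * length_nm

def spheroid_volume(diameter_nm, center_to_pole_nm):
    """Volume (nm^3) of a spheroid; assumed here for ovoid virions.

    The equatorial radius is diameter / 2 and the polar radius is the
    distance from center to pole.
    """
    return (4.0 / 3.0) * math.pi * (diameter_nm / 2.0) ** 2 * center_to_pole_nm

def cuboid_volume(length_nm, height_nm, depth_nm):
    """Volume (nm^3) of a cuboid; assumed here for brick-shaped virions."""
    return length_nm * height_nm * depth_nm

# Illustrative input only: a 17 nm spherical particle.
print(f"{sphere_volume(17.0):.0f} nm^3")
```

An illustrative 17-nm sphere gives a volume of about 2.6 × 10³ nm³, which shows how quickly volume shrinks or grows with the cube of a linear dimension.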
The virion volume of the viruses studied varied by 4 orders of magnitude (Table 1), with the smallest (2.6 × 10³ nm³) recorded in Circovirus (ssDNA virus) and the largest (7.53 × 10⁷ nm³) observed in Pandoravirus (dsDNA virus). The genome lengths of the viruses varied by approximately 3 orders of magnitude, with the smallest (1.68 kb) recorded in
Deltavirus (−ssRNA virus) and the largest (2,473.87 kb) in
Pandoravirus (dsDNA virus). Across the data set as a whole, we observed a significant positive correlation between genome length and virion volume (
P < 0.001). Plotting this on a log-log scale showed a strong positive linear relationship, in which 76% of the variance in the logarithm of virion volume can be accounted for by the logarithm of genome length (
P < 0.001, R² = 0.76, slope = 1.43) (
Fig. 1). It is striking that all but two viruses (the filoviruses Ebolavirus and Marburgvirus) fall within the 95% prediction interval, the range in which 95% of virion sizes are expected to lie for a given genome size (outer gray lines in
Fig. 1). Therefore, virion volume has an allometric relationship with genome length, with a mean exponent of 1.43 and with relatively tight confidence intervals (CI) (1.26 to 1.6) (
Table 2). That this exponent is significantly greater than 1 (
P < 0.001) indicates that an allometric relationship between volume and genome length is a better descriptor than a simple linear relationship. Importantly, the exponent is also significantly lower than 3 (
P < 0.001), which is the value of the standard “geometric” relationship between length and volume (i.e., as the units for volume are the units of length to the third power). This indicates that the relationship is not just a product of physical space availability (
17) (
Table 2).
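The log-log fit and its 95% prediction interval can be sketched as follows. This is a minimal illustration that assumes an ordinary least-squares fit of log10 volume on log10 genome length; the data points, the scaling factor of 200, and the function names are synthetic and hypothetical, not values from Table S1. The critical t value is passed in by the caller (2.447 for 6 degrees of freedom, from standard tables).

```python
import math

def fit_loglog(lengths_kb, volumes_nm3):
    """Ordinary least-squares fit of log10(volume) on log10(genome length)."""
    x = [math.log10(v) for v in lengths_kb]
    y = [math.log10(v) for v in volumes_nm3]
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx                      # allometric exponent b
    intercept = y_mean - slope * x_mean    # log10 of the scaling factor a
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - y_mean) ** 2 for yi in y)
    return {"n": n, "slope": slope, "intercept": intercept,
            "r2": 1.0 - rss / tss, "resid_sd": math.sqrt(rss / (n - 2)),
            "x_mean": x_mean, "sxx": sxx}

def prediction_interval(fit, length_kb, t_crit):
    """Prediction interval (nm^3) for the volume of a new virus of given genome length.

    t_crit is the two-sided critical t value for n - 2 degrees of freedom.
    """
    x0 = math.log10(length_kb)
    center = fit["intercept"] + fit["slope"] * x0
    half_width = t_crit * fit["resid_sd"] * math.sqrt(
        1.0 + 1.0 / fit["n"] + (x0 - fit["x_mean"]) ** 2 / fit["sxx"])
    return 10 ** (center - half_width), 10 ** (center + half_width)

# Synthetic data: V = 200 * L^1.43, scattered alternately up and down by 1.5-fold.
lengths = [1.68, 5.0, 10.0, 30.0, 100.0, 300.0, 1000.0, 2473.87]
volumes = [200.0 * L ** 1.43 * (1.5 if i % 2 == 0 else 1.0 / 1.5)
           for i, L in enumerate(lengths)]

fit = fit_loglog(lengths, volumes)
low, high = prediction_interval(fit, 50.0, t_crit=2.447)  # t(0.975, df = 6)
print(f"exponent = {fit['slope']:.2f}, R^2 = {fit['r2']:.2f}")
print(f"95% prediction interval at 50 kb: {low:.3g} to {high:.3g} nm^3")
```

A point lying outside this interval, as reported for the two filoviruses, has a volume that is unusual for its genome length under the fitted power law.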
To determine whether the association between volume and genome length holds among viruses of profoundly different types and whether this association is also described by an allometric relationship, we subdivided our data into viruses with spherical (i.e., spherical and icosahedral [
n = 65]) and nonspherical (brick, filamentous, ovoid, and rod [
n = 23]) virions. Spherical viruses have a median virion volume that is significantly smaller than that of nonspherical viruses (median volumes, 6.5 × 10⁴ nm³ and 8.8 × 10⁵ nm³ for spherical and nonspherical virions, respectively;
P < 0.001). In both groups there was a strong positive correlation between virion volume and genome length (
P < 0.001), and the relationship was defined well by a power law. Specifically, the allometric regression results were as follows: spherical,
R² = 0.71, P < 0.001, exponent = 1.17; and nonspherical, R² = 0.87, P < 0.001, exponent = 1.44 (
Fig. 2;
Table 2).
Next, we subdivided our data into enveloped (
n = 28) and nonenveloped (
n = 60) viral groups. Although enveloped viruses have larger genomes (medians of 148.21 kb for DNA viruses and 13.32 kb for RNA viruses) than nonenveloped viruses (36.72 kb for DNA viruses and 7.00 kb for RNA viruses) (
P < 0.001,
P = 0.004, and
P < 0.001 for all viruses, DNA viruses, and RNA viruses, respectively), both groups exhibited a significant linear relationship between log virion volume and log genome length, indicating a power law relationship between the two: enveloped,
R² = 0.85, P < 0.001, exponent = 1.37 (Fig. 3a); nonenveloped, R² = 0.72, P < 0.001, exponent = 1.06 (
Fig. 3b). Similarly, allometric relationships were observed after subdividing the data (i) into viruses with linear (
n = 77, R² = 0.72, P < 0.001, exponent = 1.06) and circular (n = 11, R² = 0.82, P < 0.001, exponent = 1.74) genomes (
Fig. 4), (ii) into dsDNA (
n = 33, R² = 0.71, P < 0.001, exponent = 1.52) and dsRNA (n = 8, R² = 0.45, P = 0.07, exponent = 0.97) viral groups (
Fig. 5), and (iii) into +ssRNA (
n = 34, R² = 0.56, P < 0.001, exponent = 1.95) and −ssRNA (n = 4, R² = 0.97, P = 0.01, exponent = 2.58) viral groups (
Fig. 6;
Table 2). Note, however, that because of the small sample sizes for the dsRNA and −ssRNA viruses, the confidence intervals for the exponent estimates are wide in both cases, and the dsRNA regression does not reach significance at the 0.05 level (P = 0.07).
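The effect of sample size on interval width can be approximated from the reported summary statistics alone. The sketch below assumes that each exponent comes from a simple ordinary least-squares regression on the log-log data, in which case the standard error of the slope follows from the slope, R², and n. The function names are hypothetical, the critical t values are taken from standard tables, and because the inputs are rounded the results are approximations that need not match Table 2 exactly.

```python
import math

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom (n - 2).
T_CRIT_95 = {2: 4.303, 6: 2.447, 32: 2.037, 86: 1.988}

def slope_standard_error(slope, r2, n):
    """Standard error of the slope in simple linear regression.

    Uses t = slope / SE = r * sqrt(n - 2) / sqrt(1 - r^2), rearranged for SE.
    """
    return abs(slope) * math.sqrt((1.0 - r2) / (r2 * (n - 2)))

def approx_ci(slope, r2, n):
    """Approximate 95% confidence interval for the slope (allometric exponent)."""
    half_width = T_CRIT_95[n - 2] * slope_standard_error(slope, r2, n)
    return slope - half_width, slope + half_width

# (exponent, R^2, n) as reported in the text above.
reported = {
    "all viruses": (1.43, 0.76, 88),
    "+ssRNA": (1.95, 0.56, 34),
    "dsRNA": (0.97, 0.45, 8),
    "-ssRNA": (2.58, 0.97, 4),
}

for name, (exponent, r2, n) in reported.items():
    lower, upper = approx_ci(exponent, r2, n)
    print(f"{name}: exponent {exponent}, approx. 95% CI {lower:.2f} to {upper:.2f}")
```

With these rounded inputs the all-virus interval comes out at roughly 1.26 to 1.60, in line with the CI reported above, whereas the −ssRNA interval (n = 4) spans roughly 1.2 to 4.0 despite its high R².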
Finally, although overlapping genes are commonly utilized in RNA viruses and small DNA viruses (
18), our results are minimally affected when accounting for overlap by estimating an adjusted genome length (
R² = 0.52, P < 0.001, exponent = 1.61).
Hence, overall these data clearly show that for a diverse set of viruses, virion volume and genome length follow a strong power law, $V = aL^{b}$, in which $V$ is the volume of the virion, $L$ is the length of the genome in base pairs, $a$ is the scaling factor, and $b$ is the allometric exponent (Table 2).
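To make the power law concrete, the short sketch below shows how the fitted exponent translates a change in genome length into a change in virion volume. The function names are hypothetical; the default exponent of 1.43 is the all-virus estimate reported above, and the scaling factor cancels out of any ratio of volumes.

```python
def predicted_volume(length, a, b):
    """Virion volume predicted by the power law V = a * L**b."""
    return a * length ** b

def fold_change_in_volume(length_ratio, b=1.43):
    """Factor by which V changes when L is multiplied by length_ratio.

    The scaling factor a cancels: (a * (k * L)**b) / (a * L**b) = k**b.
    """
    return length_ratio ** b

print(round(fold_change_in_volume(2.0), 2))    # doubling the genome: 2.69
print(round(fold_change_in_volume(10.0), 1))   # a 10-fold longer genome: 26.9
```

Under a simple linear relationship (b = 1) the same two changes would scale volume by exactly 2 and 10, which is why an exponent above 1 means that volume grows faster than genome length.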
Relationship between protein numbers, gene lengths, and virion volumes.
One explanation for the relationship between virion volume and genome length is that viruses with longer genomes produce more proteins, which in turn must be housed in larger virions. We therefore sought to determine if the number of distinct proteins encoded by each virus (see Table S1 in the supplemental material) was associated with virion volume and genome length. As we expected, larger viral genomes harbored significantly greater numbers of proteins, and this relationship was again allometric (
Fig. 7a):
R² = 0.82, P < 0.001, exponent = 1.11. Additionally, there was a strong correlation between virion volume and number of proteins (
Fig. 7b):
P < 0.001, R² = 0.61, exponent = 1.05. To investigate this further, we performed a multiple linear regression on the logarithms of virion volume, genome length, and number of proteins. This revealed that genome length was still associated with both virion volume and number of proteins after adjustment for one another (
P < 0.001) but that virion volume was associated only with genome length (
P < 0.001) and not with the number of proteins (
P = 0.71) after adjustment for genome length. As a consequence, the relationship between genome length and virion volume is not a product of the number of proteins encoded.
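The logic of this adjustment can be sketched with a small multiple regression of log volume on log genome length and log protein number. The example below is a minimal illustration on synthetic data in which volume depends on genome length only; it solves the normal equations directly and returns coefficients, not the P values reported above, which would additionally require standard errors. All names and numbers are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                      # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def multiple_regression(y, predictors):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    rows = [[1.0] + list(p) for p in zip(*predictors)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Synthetic log10 data: protein number tracks genome length with some scatter,
# while volume is generated from genome length alone.
log_length = [0.3, 0.7, 1.0, 1.5, 2.0, 2.5, 3.0, 3.4]
scatter = [0.2, -0.1, 0.15, -0.2, 0.1, -0.15, 0.05, -0.05]
log_proteins = [0.5 + 0.5 * x + s for x, s in zip(log_length, scatter)]
log_volume = [2.3 + 1.43 * x for x in log_length]

intercept, b_length, b_proteins = multiple_regression(
    log_volume, [log_length, log_proteins])
print(f"length coefficient = {b_length:.2f}, protein coefficient = {b_proteins:.2f}")
```

In this constructed case the protein coefficient is zero once genome length is in the model, even though protein number and volume are correlated when considered on their own; this is the pattern the adjusted analysis above is designed to detect.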
In marked contrast to the genome-scale associations with virion size, no such correlations were observed at the level of two key individual viral genes (on either the untransformed or log-log-transformed data). In the case of nonenveloped RNA viruses, we found no relationship between the length of the capsid gene, which encodes the structural component of the virus capsid, and virion volume: R² = 0.059, P = 0.18 (n = 32). A similar result was observed in the case of the RNA-dependent RNA polymerase gene, which encodes the enzyme responsible for replication of RNA from an RNA template (and hence is common to all RNA viruses): R² = 0.009, P = 0.60 (n = 36). Hence, these results demonstrate that the expansion of virion sizes during evolution is not due to the elongation of these genes but rather is directly linked to the expansion of total genome length.