
Banking Diverse Data

ICTVdB:
The Universal Virus Database

Viruses and viroids are the smallest infectious biological entities that depend on their host for replication. The number of viruses occurring as pathogens or silent passengers of organisms, from bacteria to the dominant mammal, is very large. As we explore new niches for life, and as the sensitivity and specificity of detection techniques improve, the list of viruses expands. Since viruses have evolved many times and infect organisms from all kingdoms of life, researchers would like to know more about their origin and evolution.

Today, the International Committee on Taxonomy of Viruses (ICTV) recognises about 1,500 virus species (1), but some 40,000 virus strains and isolates are being tracked by virologists in different fields of biology. The ICTV is the "international court" of experts that rules on the names and relationships of all viruses. Compared with bodies dealing with more complex organisms, ICTV took early initiatives in electronic data management (2) and has supported the development of ICTVdB since 1991.

Now widely recognised as a prototype biological database, ICTVdB was designed for taxonomic research, that is, for understanding relationships among viruses. Conventional taxonomy of other organisms is largely based on morphology. Although most viruses have now been seen in the electron microscope, these infectious particles tend to be better known by their chemical and genomic make-up, the complex disease symptoms in their hosts, and their vectors and geographical distribution. The headline image is a human rotavirus, a 75-80 nm particle, the major cause of diarrhea in young children, outbreaks of which coincide with abrupt temperature changes in winter. Clearly, from the database perspective, a huge range of diverse biological information has to be managed.

Developed at the Australian National University (ANU) with the support of the US National Science Foundation (NSF) and sponsored by the American Type Culture Collection (ATCC), ICTVdB has grown in concept and capability to become a major reference resource and research tool (Box 1).

Demands on DELTA

Box 2: The modular character list of ICTVdB
classification                          #1-33
original record                         #34-58
virion properties                       #59-990
genome organization                     #991-1445
antigenicity                            #1446-1505
biological properties                   #1506-2203
taxonomic structure                     #2204-2215
comments, references, and contributors  #2216-2225

The database uses the DELTA system (DEscription Language for TAxonomy) (3), developed at CSIRO Entomology by Michael Dallwitz (4) and now adopted as a world standard for data exchange in taxonomy. A distinctive feature of DELTA is its capacity to store an extraordinary diversity of data, and to translate these data into natural language for traditional reports and web publication. All the flexibility of subprograms in DELTA is exploited by ICTVdB.

On the input side, the capacity of DELTA to handle very large datasets one item at a time is ideally suited to a long list of virus properties (character list), often accompanied by extensive text comments and images. Although only partly populated, ICTVdB already lists more than 2000 virus descriptions (items) constructed from 2250 characters, some with up to 2000 states (Box 2). By the time all available data on virus isolates and strains are entered, the number of items will be closer to a million.

Virus taxonomy is very much in flux because our understanding of relationships between viruses is increasingly dependent on genomic data that continually challenge earlier decisions based on morphology. Strategies to facilitate communication across semantic boundaries are particularly important in ICTVdB, which deals with data from diverse sources such as bacteriology, agriculture, veterinary and medical sciences, each of which has evolved a distinctive vocabulary. Although terms have been standardised within ICTVdB, these standards can't be imposed on virologists in all disciplines, and they can't be imposed retrospectively on the literature.

Another input-side requirement of ICTVdB is user-friendly, online data entry for peer review of new information, ranging from molecular properties of a virus to its geographic distribution and host range. Such diverse information, with intrinsic dependencies between genomic data, protein composition, particle structure and infectivity, places particular demands on the flat file system of DELTA. These have been met by building a dependency network in data specification files. The spreadsheet display of the DELTA editor is particularly useful for reviewing these dependencies, a critical step in developing and working with the ICTVdB dataset.

Although DELTA was designed for taxonomic research, its output formats transcend these specialist interests. Its translation facilities can be used by taxonomists to construct nearest neighbour relationships, but can also be used to blend data from diverse sources. For example, ICTVdB does not itself contain sequence data, but conversion of ICTVdB data from DELTA into NEXUS format was deemed essential for comprehensive phylogenetic analyses. Such work is also indispensable for monitoring the evolution of viruses in relation to emerging diseases.
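A conversion of this kind might be sketched as follows. This is not DELTA's own translator; it is a minimal Python illustration that assumes each virus has already been reduced to one state number per multistate character. The function name `to_nexus`, the taxon labels and the state values are all hypothetical.

```python
from typing import Dict, List, Optional


def to_nexus(matrix: Dict[str, List[Optional[int]]], symbols: str = "123") -> str:
    """Render {taxon: [state, ...]} as a NEXUS DATA block.

    Assumptions (not from the article): single-digit state numbers,
    taxon labels without spaces, and None standing for a missing value.
    """
    lengths = {len(states) for states in matrix.values()}
    if len(lengths) != 1:
        raise ValueError("every taxon needs the same number of characters")
    nchar = lengths.pop()
    width = max(len(name) for name in matrix)
    symbol_list = " ".join(symbols)
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(matrix)} NCHAR={nchar};",
        f'  FORMAT DATATYPE=STANDARD SYMBOLS="{symbol_list}" MISSING=?;',
        "  MATRIX",
    ]
    for name, states in matrix.items():
        # One symbol per character; "?" marks a character that was not coded.
        row = "".join("?" if s is None else str(s) for s in states)
        lines.append(f"  {name.ljust(width)}  {row}")
    lines += ["  ;", "END;"]
    return "\n".join(lines)


# Two hypothetical viruses scored for two multistate characters.
example = to_nexus({"Virus_A": [1, 2], "Virus_B": [1, None]})
print(example)
```

The state numbers are kept as DELTA-style 1-based values and declared through `SYMBOLS`, so no renumbering is needed in this sketch. Numeric and text characters would need separate handling.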

Although most new information in virology is generated at the molecular level and is deposited in sequence databases, significant events in virology tend to be associated with "host jumping", epidemics and environmental disturbances, information on all of which can be retrieved from ICTVdB. From the outset, DELTA was designed not only to generate identification keys but also to translate its data into natural language for hard copy, into HTML for the web, and into many of the languages of the world. These output attributes will be fully exploited by ICTVdB.

 

Structural Features of ICTVdB

Although ICTVdB began as a taxonomic database (5), it now has several distinctive features not usually used in systematics, but introduced of necessity. Chief among these is its decimal code (6). Because the peculiar nomenclature used in virology defies direct and systematic interrogation in a database, and because virus taxonomy was changing rapidly, a decimal code (analogous to the code of enzyme nomenclature) seemed to offer a simple resolution to diverse problems.

Table 1. Expansion of the decimal code to accommodate revisions of Poliovirus taxonomy, and to anticipate the explosion of lower level data (serotypes, strains and isolates).

 

Level                   Original Decimal Code             Extended Decimal Code
Order                                                     00. = (not assigned)
Family                  52. = Picornaviridae              00.052. = Picornaviridae
Subfamily               52.0. = (no subfamilies)          00.052.0. = (not assigned)
Genus                   52.0.1. = Enterovirus             00.052.0.01. = Enterovirus
Subgenus (serogroup)    52.0.1.0. = (no subgenus)         Superseded by new species concept
Species (type species)  52.0.1.0.001 = Poliovirus 1       00.052.0.01.001. = Poliovirus
Species                 52.0.1.0.067 = Poliovirus 1       00.052.0.01.007. = Poliovirus
                        52.0.1.0.068 = Poliovirus 2
                        52.0.1.0.069 = Poliovirus 3
Subspecies                                                00.052.0.01.007.00. = (not assigned)
Serotype                                                  00.052.0.01.007.00.001. = Poliovirus 1
                                                          00.052.0.01.007.00.002. = Poliovirus 2
                                                          00.052.0.01.007.00.003. = Poliovirus 3
Strain or Isolate                                         00.052.0.01.007.00.001.001. = PV-1 Brunhilde
                                                          00.052.0.01.007.00.002.002. = PV-2 Mahony
                                                          00.052.0.01.007.00.002.001. = PV-2 Lansing
                                                          00.052.0.01.007.00.003.001. = PV-3 Leon

 

Decimal Code

Because virus names are changed frequently, contain diverse linguistic and geographical elements, and are usually coupled to a disease or its symptoms, virus nomenclature presents challenging semantic problems for a database. The decimal code at one and the same time affords unequivocal identification of a virus to the level of strain or isolate, and indicates its taxonomic context. The core infrastructure of ICTVdB is its distinctive "table of contents", the Index of Viruses (formerly Index Virum), a list of approved virus names sanctioned by ICTV. The decimal code is constructed in the Index of Viruses, and serves as a filename for database outputs as well as an accession number for external linkage to ICTVdB. The original DOS-based DELTA system used by ICTVdB could only accommodate 8-digit filenames. The increasing focus on lower level taxonomic information, together with taxonomic revisions, requires the decimal code to be expanded to 19 digits. The application of the expanded code to the recently revised taxonomy of Poliovirus is illustrated in Table 1.
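The way the extended code carries both identity and taxonomic context can be sketched in a few lines of Python. The field widths below are read off the Extended Decimal Code column of Table 1 and add up to the 19 digits mentioned above; the function names `parse_decimal_code` and `lineage` are hypothetical and are not part of ICTVdB or DELTA.

```python
from typing import Dict, List

# (level, number of digits), inferred from the extended codes in Table 1.
LEVELS = [
    ("order", 2),
    ("family", 3),
    ("subfamily", 1),
    ("genus", 2),
    ("species", 3),
    ("subspecies", 2),
    ("serotype", 3),
    ("strain_or_isolate", 3),
]


def parse_decimal_code(code: str) -> Dict[str, str]:
    """Split an extended decimal code into named taxonomic fields."""
    fields = code.strip().rstrip(".").split(".")
    if len(fields) > len(LEVELS):
        raise ValueError(f"too many fields in {code!r}")
    parsed = {}
    for (level, width), field in zip(LEVELS, fields):
        if len(field) != width or not field.isdigit():
            raise ValueError(f"{level} field {field!r} should have {width} digits")
        parsed[level] = field
    return parsed


def lineage(code: str) -> List[str]:
    """Return the code of each enclosing taxon, from order down to the code itself."""
    parse_decimal_code(code)  # validate before slicing
    fields = code.strip().rstrip(".").split(".")
    return [".".join(fields[: i + 1]) + "." for i in range(len(fields))]


# PV-1 Brunhilde, as listed in Table 1.
print(parse_decimal_code("00.052.0.01.007.00.001.001."))
print(lineage("00.052.0.01."))
```

Because every prefix of a code is itself the code of a higher taxon, a plain string prefix test is enough to ask whether one entry sits inside another.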

 

With the introduction of long filenames in Windows 95/NT, the expanded decimal code can be used by PCs, and is no longer confined to UNIX systems. The expansion to 19 digits should cope with even the most ambitious "splitters" in the taxonomic community. It will be necessary to differentiate provisionally assigned taxa in the dynamic database, but this can be accommodated without further assignments in the decimal code. Although individual virologists are finding the decimal code useful, this invention of necessity is by no means universally accepted among virus taxonomists.

If a database is to accept the latest data from all branches of virology, and place these diverse data into contemporary taxonomic context, it will most commonly deal with information at the level of strains and isolates. Ideally, ICTVdB will serve virus taxonomy "from the bottom up" with primary data from researchers who describe their viruses using rich and diverse semantics, reflecting geographic and linguistic factors. At the same time, the database must accept revisions and consolidations "from the top down" as the consensus in virus taxonomy reflects this new information. For example, the relegation of such widely used species names as Poliovirus 1, 2 and 3 to serotypes (Table 1), although an emotive issue (8), has been justified by pair-wise comparison of genomic data.

As the database developed, it became clear that the decimal code served as more than an unequivocal identifier for taxonomically correct internal linkages in the database. It is used as a filename for transposing ICTVdB to the web, and also serves as a surrogate accession number used by sequence databases such as EMBL and SWISS-PROT to link to ICTVdB. The decimal code unequivocally identifies a virus, and simultaneously indicates its taxonomic status from order to isolate, and should be routinely cited in publications.

 

Dependencies

Unlike many other databases that deal with relatively uniform data types and a small number of fields, ICTVdB is not a relational database, but is a flat file system. All key components of ICTVdB in DELTA format (character list, specification and items file) are readable text files, as are the directive files used for data translation and conversion. The character list of ICTVdB is distinctive in that it must accommodate data of all sorts, from the geometry of virus particles through the chemical composition of components to the host range and geographic distribution. It also supports these data with explanatory commentary and images. Each character is specified in terms of ordered or unordered multistate properties, integer or real numeric properties, text and images, the latter being handled as a special category of text. Table 2 unfolds the specification of general genomic characteristics (excluding sequences) of a virus.

 

Table 2. Components of a DELTA database; <> denotes commentary in the character list and in the items file. The natural language translation of this example will read:

Genome is (usually) monopartite; contains RNA; is 9128-9738 nucleotides long (depending on isolate) with a weight ranging between (9.0-)9.2-9.5 or 9.8 (for strain Y). Genome organisation: 5'-gag-pro-pol-env-3'. Genome map (7) (image not displayed).

 

Specification File          Character List                                                                      Items File
type                        feature                                             attribute                       code
1,OM ordered multistate     #1. genome is <whether segmented>/                  1. monopartite/                 2<usually>,1
                                                                                2. bipartite/
                                                                                3. tripartite/
2,UM unordered multistate   #2. genome contains <nucleic acid type>/            1. DNA/                         1,2
                                                                                2. RNA/
3,IN integer numeric        #3. genome <length> is/                             <number of> nucleotides long/   3,9128-9738<depending on isolate>
4,RN real numeric           #4. genome with a weight/                           kDa/                            4<ranging between>,(9.0-) 9.2-9.5/9.8<for strain Y>
5,TE text                   #5. genome organisation: <order of genes or ORFs>/                                  5<5'-gag-pro-pol-env-3'>
6,TE image                  #6. Genome map <image path to diagram>/                                             6<gm_lenti.gif>
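The item codes in Table 2 follow a compact pattern: a character number, an optional comment in angle brackets, then an optional value with an optional trailing comment. The Python sketch below parses only that simplified pattern. The names `Attribute` and `parse_attribute` are assumptions made for illustration, and the full DELTA format is richer than this regular expression allows for.

```python
import re
from typing import NamedTuple, Optional


class Attribute(NamedTuple):
    char: int                     # character number in the character list
    char_comment: Optional[str]   # <comment> placed right after the number
    value: Optional[str]          # state numbers or numeric value, if any
    value_comment: Optional[str]  # <comment> trailing the value


_ATTR = re.compile(
    r"^(\d+)"            # character number
    r"(?:<([^>]*)>)?"    # optional comment on the character
    r"(?:,(.*?))?"       # optional value after a comma
    r"(?:<([^>]*)>)?$"   # optional trailing comment
)


def parse_attribute(text: str) -> Attribute:
    """Parse one item code such as '3,9128-9738<depending on isolate>'."""
    match = _ATTR.match(text.strip())
    if match is None:
        raise ValueError(f"not a recognised item code: {text!r}")
    number, char_comment, value, value_comment = match.groups()
    value = value.strip() if value else None
    return Attribute(int(number), char_comment, value, value_comment)


# Item codes copied from the last column of Table 2.
for code in (
    "3,9128-9738<depending on isolate>",
    "4<ranging between>,(9.0-) 9.2-9.5/9.8<for strain Y>",
    "5<5'-gag-pro-pol-env-3'>",
):
    print(parse_attribute(code))
```

For text characters such as #5 the whole content sits in the comment, so `value` is `None` and the text is found in `char_comment`.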

 

At critical points in the character list, binary statements, such as virus particle with or without envelope, are used to establish dependencies so that only the subsequently valid characters can be used. These dependencies provide the internal linkage hierarchy in the data, direct the search path during interrogation and, among other things, reveal errors during data entry. The dependencies are very important for the decision-making process during identification and data comparison, and some multistate characters in key positions (e.g. plant or animal virus) can control the validity of up to 2000 characters down the line. The dependencies are automatically indicated in the spreadsheet display of DELTA (Table 3).

Table 3. Spreadsheet view of the DELTA editor. Red bars indicate cells made inapplicable through dependencies (image not displayed).


At other points in the character list "pseudo-characters" are used to overcome semantic difficulties arising in different sub-fields of virology, and to establish dependencies among blocks of characters. Table 4 shows pseudo-character 61 that handles the semantic equivalence of tegument = inner lipid protein membrane, and capsid = head of a tailed phage. It also shows the dependencies established by the states, so that state 4 for example only opens the character section 512-617, and handles the semantic equivalence of head and capsid. The dependencies build the internal hierarchy of the database.

Table 4. Semantic equivalences and dependencies among the major morphological properties of virus particles. Few viruses contain more than one component.

#61. <Virion or phage> consists of <components of particle>/  Dependent Character Blocks
1. an envelope <including inner and outer envelope>/          Envelope                  #406-458
2. a surface membrane/                                        Surface Membrane          #459-511
3. a tegument/                                                Tegument                  #791-814
4. a head <of phage treated as isometric capsid>/             Capsid (Coat Protein)     #512-564
5. a capsid <including inner and outer capsid>/               Inner Capsid              #565-617
6. a tail <of phage treated as elongated capsid>/             Tail                      #512-564
7. a nucleocapsid/                                            Nucleocapsid              #618-667
8. a core/                                                    Core                      #717-765
9. a nucleoid/                                                Nucleoid                  #668-716
10. lateral bodies/                                           Lateral Bodies            #815-838
11. a matrix/                                                 Matrix                    #766-790
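Read as a lookup table, Table 4 says which block of characters each state of pseudo-character #61 opens. The Python sketch below copies those ranges and shows how such a table could flag data entered under a block that the recorded states do not open. The function names are hypothetical, and DELTA itself expresses dependencies in its specification files rather than in code like this.

```python
from typing import Dict, Iterable, List, Set, Tuple

# State of pseudo-character #61 -> (block label, first character, last character),
# copied from Table 4.
BLOCKS: Dict[int, Tuple[str, int, int]] = {
    1: ("Envelope", 406, 458),
    2: ("Surface Membrane", 459, 511),
    3: ("Tegument", 791, 814),
    4: ("Capsid (Coat Protein)", 512, 564),
    5: ("Inner Capsid", 565, 617),
    6: ("Tail", 512, 564),
    7: ("Nucleocapsid", 618, 667),
    8: ("Core", 717, 765),
    9: ("Nucleoid", 668, 716),
    10: ("Lateral Bodies", 815, 838),
    11: ("Matrix", 766, 790),
}


def applicable_characters(states: Iterable[int]) -> Set[int]:
    """Character numbers opened by the given states of #61."""
    chars: Set[int] = set()
    for state in states:
        _, first, last = BLOCKS[state]
        chars.update(range(first, last + 1))
    return chars


def inapplicable_entries(states: Iterable[int], coded: Iterable[int]) -> List[int]:
    """Coded characters that lie in a dependent block no recorded state opens."""
    controlled = applicable_characters(BLOCKS)  # every block in Table 4
    allowed = applicable_characters(states)
    return sorted(c for c in coded if c in controlled and c not in allowed)


# A particle recorded with an envelope (1) and a nucleocapsid (7):
# character 800 belongs to the tegument block and is flagged.
print(inapplicable_entries([1, 7], [410, 620, 800, 1000]))
```

Characters outside the ranges of Table 4, such as 1000 in the example, are left alone because other controlling characters govern them.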


Images

It is said that a picture is worth a thousand words. Images of virus particles are used in several ways in ICTVdB. For example, text descriptions of key morphological characters become much more precise when they are linked in the character list to representative vignettes from electron microscope photographs. Thin section EM images of infected tissues are used to illustrate virus infection cycles and host pathology. Descriptions of all viruses generated from ICTVdB will be enhanced by EM images of the type species, irrespective of the presentation format selected. Not surprisingly, images of virus particles are among the most frequently accessed files in ICTVdB on the web. A large image file is more instructive to users, but in the database it is functionally equivalent to numerous characters, like "virus 75-80 nm in diameter" in the case of Rotavirus. File size considerations and access paths dictate that image files be stored outside the main dataset, in either local files or files accessed on the Internet.

 

Quo Vadis

The PC-based ICTVdB is presented as a natural language translation on the web using HTML conversion for the DELTA formatted data. This web environment is essential for universal access, interactive data entry and interrogation, as well as interoperability with other databases. Currently, a plethora of accessories is available, many of which are standard components of DELTA (e.g. Web Intkey, an interactive identification program). Others, like the data entry forms, Java applets and scripts used to display directory trees, have been developed specifically for use in ICTVdB. In future, interoperability will be vastly improved by XML tagging. Just as it is certain that the flow of new information about viruses will not slow, so it is certain that new technology available to ICTVdB to handle these data will grow.

Thus far, ICTVdB has been a single investigator project, with a lot of goodwill and support for software development. The principal impediments to its usefulness and sustainability are common to most biological databases. First, filling out the database requires commitment from the virological community to data entry and update. It is pleasing to have some researchers deposit new virus data in ICTVdB at the same time as they deposit sequence data elsewhere. Hopefully, this will become routine, but it remains difficult to extract existing data and to engage the expertise of busy senior scientists. Second, support beyond the development phase, now largely completed, requires a shift from public funding to a commercial context. A subscription database seems a plausible path forward, one that could see a consortium of database professionals working to maintain ICTVdB. Whatever the outcome, it is hoped that a public domain shop window can be retained, responsive to significant developments in virology, and accessible to all parties.

 

References

1. M.H.V. van Regenmortel et al. (eds), Virus Taxonomy. Classification and Nomenclature of Viruses, Seventh Report of the International Committee on Taxonomy of Viruses. Academic Press, New York, San Diego, (1999).

2. J.G. Atherton, I.R. Holmes and E.H. Jobbins, ICTV Code for the Description of Virus Characters. Monographs in Virology 14, (1983).

3. M.J. Dallwitz, T.A. Paine and E.J. Zurcher, User's Guide to the DELTA System: a general system for processing taxonomic descriptions, CSIRO Division of Entomology, Canberra, (1993).

4. M.J. Dallwitz, "A general system for coding taxonomic descriptions", Taxon, 29, 41-46, (1980).

5. C. Büchen-Osmond, L. Blaine and M.C. Horzinek, "The universal virus database of ICTV (ICTVdB)". In: Virus Taxonomy. Classification and Nomenclature of Viruses, Seventh Report of the International Committee on Taxonomy of Viruses. M.H.V. van Regenmortel et al. (eds). Academic Press, New York, San Diego, (1999).

6. C. Büchen-Osmond and M.J. Dallwitz, "Towards a universal virus database - progress in the ICTVdB", Arch. Virol., 141, 392-399, (1996).

7. C. Büchen-Osmond, "Further progress in ICTVdB, a universal virus database", Arch. Virol., 142, 1734-1739, (1997).

8. C.R. Pringle, "Virus taxonomy at the XIth International Congress of Virology, Sydney, Australia 1999", Arch. Virol., 144, 2065-2070, (1999).

 

 

 

 

Cornelia Büchen-Osmond is a virologist who trained in electron microscopical identification of viruses at the Hygiene Institut, Klinikum JW Goethe-Universität, Frankfurt a. M., Germany. She was invited to develop the universal virus database in 1992, and commenced this work in the Bioinformatics Group at Research School of Biological Sciences, Australian National University, Canberra, ACT, Australia. The project continues at Columbia University, Biosphere 2 Center, Oracle, AZ, USA. The project has been supported by NSF grants through a consultancy with the American Type Culture Collection, Manassas, VA, USA.

 

 

6 September, 2000. Last updated 18 April 2001