Introduction
High-throughput techniques have led to a rapid increase in the availability of various types of biological data. These data do not speak for themselves, but may serve as inputs for methods that generate clues or inferences. One of the main methods used for generating such inferences is the
comparative approach. For instance, when a new genome sequence is determined, an enormous amount of useful information is revealed by comparing it with other genomes and interpreting
patterns of similarity and difference, with applications in identifying regulatory sites,[1] predicting protein structures,[2] and interpreting SNP variation.[3]
Because similarities and differences among biological entities emerge by a process of descent with modification, evolutionary theory provides a mechanistic framework for interpreting biological comparisons. An evolutionary approach to comparisons emerged over several decades from the efforts of taxonomists to replace personal judgment with rigorous principles.[4,5] This approach can be distilled to three principles. First, an evolutionary analysis begins by identifying relationships, not just of similarity, but of similarity due to descent-with-modification from common ancestors, i.e. evolutionary homology. Second, as bioinformaticians often emphasize, common statistical methods that would treat evolved entities as independent samples are inappropriate: evolved entities are not independent, but have a tree-like structure of relationships, making phylogenetic trees central to rigorous analysis.[6,7] Evolutionary methods control for this relatedness (non-independence) because they "exploit phylogenies to reveal independent events of evolution".[8] Third, the events of change (along the phylogeny) that are invoked to account for observed biological differences are not ordinary transformations of biological substances (i.e. not like the development of an embryo, or the formation of a scar), but evolutionary transitions that follow the rules of evolutionary genetics, with any accompanying biases due to the dynamics of mutation, genetic transmission, and reproductive sorting (selection and drift).
The evolutionary approach is not the only possible approach to comparisons. A common alternative for analyzing comparative data is to apply generic methods of classification or machine learning, such as neural networks and support vector machines,[9] that rely on a simple principle of similarity (e.g. protein X is a dehydrogenase because its sequence looks like that of other dehydrogenases) or on "guilt by association" (e.g. protein X is involved in mercury resistance because the gene encoding X is linked chromosomally to gene Y, which is involved in mercury resistance). Relative to such heuristic approaches, the promise of evolutionary methods is that, because they incorporate a model of the actual generative process underlying the data, they will be more accurate and flexible, and particularly useful for cases in which the outcome of evolution departs significantly from the expectations of a purely functional approach, e.g. whenever mutation biases are important.[10] The full application of a comparative method based on evolutionary theory makes it possible to refine relatively vague and difficult questions about how to interpret similarities and differences into better-posed questions about rates and processes of change along the branches of a phylogenetic tree, e.g. providing a basis to assign probabilities to unknown states, such as the activity or co-factor specificity of an enzyme.[11]
In spite of their clear advantages, evolutionary analyses remain under-utilized. This may reflect a need to educate researchers on the generality of evolutionary methods. However, it also suggests a need to reduce technical barriers. The traditional computational approach to evolutionary analysis is for an expert user to manually shepherd a single set of data through a series of steps relying on domain-specific software, often with idiosyncratic interfaces, and requiring a variety of user interventions to extract intermediate results, trap errors, and customize operations. This expert-supervised approach is time-consuming, error-prone, difficult to document (and, thus, to validate or to reproduce), and therefore represents a barrier to large-scale, integrative, or multidisciplinary analyses.
The existence of substantial technical barriers is apparent from the development of methods for assigning "functions" to proteins encoded by newly determined genome sequences. Soon after this problem emerged as a major computational challenge,[12] Eisen presented compelling arguments (by reasoning from case studies) that accurate assignments would require a phylogenetic framework, not merely identification of a "best hit" via BLAST searches.[13] Nevertheless, genome annotation projects continued to develop and apply approaches based on similarity and guilt-by-association. Years went by before approximations of Eisen's rule-based "phylogenomics" framework were automated,[14,15] and only recently did Engelhardt et al. develop an explicit and generalized probabilistic model[11] to replace rule-based reasoning. Meanwhile, new problems amenable to the evolutionary approach continue to emerge, e.g. the inference of interactions between sites within a protein,[16] or between different proteins,[17] or the inference of changes in gene expression.[17,18]
An integrated solution to lower the barrier for applying an evolutionary approach might make use of a combination of technologies, including applications software, web services,[19] workflow systems,[20] data standards, and ontologies.[21,22] Powerful applications software already exists for many steps in evolutionary analysis. Access to these tools can be greatly enhanced through the use of web services and other software services, as in the myGrid[23] and BioMoby[24] projects. However, assembling these services into fully automatic workflows requires a way to standardize knowledge, thus facilitating data re-use and data interoperability.
In recent years, the utility of ontologies for standardizing knowledge has been widely demonstrated,[25–28] but the role of ontologies remains widely misunderstood. A common misconception (which emerged in the review of this paper) is that an ontology is a special kind of file format, or that a well-defined data format obviates the need for an ontology. Actually, ontologies and file formats address different problems. Data formats, which are designed to provide a concrete representation of data for purposes of storage or exchange, represent a form of syntax for "writing down" data. In contrast, an ontology focuses on semantics, that is, the meaning of the data; an ontology expressed in a given language is not necessarily tied to a specific file format (e.g. OWL statements are commonly represented in RDF/XML, but there is also an OWL functional syntax). An ontology contains not only the vocabulary (terms and labels), but also the definitions of the concepts and their relationships for a given domain. To illustrate this important distinction, let us imagine a simple FASTA file:
>AMYLASEE
TGCATNGY
A problem with this representation of data is that a computer does not have access to the semantics. By convention, a FASTA file has an identifier line (sometimes called the "definition line") starting with ">" and ending with a newline, followed by a sequence. Thus, a human expert would understand that the string "TGCATNGY" must be some kind of sequence, but could not tell if it is a DNA sequence (Thymine, Guanine, …) or a protein sequence (Threonine, Glycine, …), since the symbols could come from either the commonly used alphabet for DNA residues or that for amino acid residues. Likewise, a human expert would understand that the string "AMYLASEE" is an identifier, but not what it means in relation to the sequence: it might be "Amylase E", representing the name of a gene or protein, or it might refer to "Amy Lasee", the name of a donor or an experimenter, or it might mean something else.
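The ambiguity can be made concrete with a short Python sketch (not part of the original analysis; the alphabets are the standard IUPAC nucleotide and amino acid codes, abbreviated here):

```python
# The symbols T, G, C, A, N, Y belong both to the IUPAC nucleotide
# alphabet (N = any base, Y = pyrimidine) and to the standard
# amino-acid alphabet (Thr, Gly, Cys, Ala, Asn, Tyr), so a parser
# cannot decide which kind of sequence the record contains.
DNA_ALPHABET = set("ACGTUNRYSWKMBDHV")             # IUPAC nucleotide codes
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWYXBZ")  # amino-acid codes

def possible_alphabets(residues: str) -> list:
    """Return the alphabets under which the string is valid."""
    matches = []
    if set(residues) <= DNA_ALPHABET:
        matches.append("nucleotide")
    if set(residues) <= PROTEIN_ALPHABET:
        matches.append("protein")
    return matches

print(possible_alphabets("TGCATNGY"))  # ['nucleotide', 'protein']
```

Both interpretations are consistent with the data; nothing in the file itself resolves the choice.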
An XML version of the above FASTA example might look like this, noting that a FASTA archive may have multiple sequence records:
<?xml version="1.0"?>
<fasta_archive>
  <fasta_record>
    <identifier>AMYLASEE</identifier>
    <sequence>TGCATNGY</sequence>
  </fasta_record>
</fasta_archive>
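Because the rendering is well-formed XML, any generic parser can read it; a minimal sketch using Python's standard library:

```python
import xml.etree.ElementTree as ET

# A well-formed XML document is easy to parse with generic tools,
# but the element names remain mere strings to the machine.
doc = """
<fasta_archive>
  <fasta_record>
    <identifier>AMYLASEE</identifier>
    <sequence>TGCATNGY</sequence>
  </fasta_record>
</fasta_archive>
"""

root = ET.fromstring(doc)
record = root.find("fasta_record")
identifier = record.findtext("identifier")
sequence = record.findtext("sequence")
print(identifier, sequence)  # AMYLASEE TGCATNGY
# Nothing here tells the parser whether the sequence is DNA or protein.
```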
Rendering the data in XML format, with a schema to validate against, makes the value of the strings “AMYLASEE” and “TGCATNGY” much more interpretable, because they can be accessed and validated by readily available tools on any computer platform. However, this does not solve any of the problems of semantics noted above. We might imagine that adding extra tags would solve the problem:
<fasta_record>
  <identifier>
    <protein_name>AMYLASEE</protein_name>
  </identifier>
  <sequence>
    <protein_sequence>TGCATNGY</protein_sequence>
  </sequence>
</fasta_record>
But this does not formalize the semantics or make them accessible to a computer: unlike a human expert, the computer cannot supply the meanings hidden in the tag names, and sees only arbitrary strings like this:
<string1>
  <string2>
    <string3>AMYLASEE</string3>
  </string2>
  <string4>
    <string5>TGCATNGY</string5>
  </string4>
</string1>
How can we make it clear that "TGCATNGY" represents the sequence of amino acid residues in a protein? How can we explain the relationship between the name and the sequence? Ontologies are designed specifically to solve this kind of problem by encoding or formalizing knowledge in a computable form that can be referenced when data are described. If "TGCATNGY" is a protein sequence, we might express this by referring to SO:0000104, the "polypeptide region" concept in the Sequence Ontology;[29] or we might refer to the fourth residue not with the character "A", but with a reference to CHEBI:32433, which is the ChEBI (Chemical Entities of Biological Interest)[30] term for the L-Alanyl moiety in a polypeptide chain.
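To sketch the idea (the SO and ChEBI identifiers are real, but the triple layout and predicate names below are hypothetical illustrations, not CDAO's actual schema), such annotations can be written as subject-predicate-object statements:

```python
# Illustrative (subject, predicate, object) triples attaching
# ontology terms to the data; predicate names are hypothetical.
triples = [
    ("record:AMYLASEE", "has_part", "seq:1"),
    ("seq:1", "rdf:type", "SO:0000104"),             # "polypeptide region"
    ("seq:1_residue_4", "rdf:type", "CHEBI:32433"),  # L-Alanyl moiety
    ("seq:1", "has_residue_at_4", "seq:1_residue_4"),
]

# A machine can now answer "is this a protein sequence?" by looking up
# a stated type, rather than by guessing from the characters themselves.
def types_of(subject):
    return [o for s, p, o in triples if s == subject and p == "rdf:type"]

print(types_of("seq:1"))  # ['SO:0000104']
```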
In order to provide a formalization of knowledge that could serve as a basis for improving interoperability in comparative analysis, we initiated the design and development of a suitable ontology. From the analysis of use cases (i.e. specific tasks representing the widely used methods in evolutionary analysis) and related artefacts (e.g. file formats, database schemas, software interfaces, and so on), the inference of character histories emerged as the core problem in evolutionary comparative analysis, relying on the concepts of phylogenetic tree, Operational Taxonomic Unit (OTU), character-state data, and transition (i.e. an evolutionary change in the state of a character). These important concepts were formalized using the standard Web Ontology Language (OWL)[31] to build a prototype version of a Comparative Data Analysis Ontology (CDAO). An initial evaluation of the prototype has also been performed, encoding token data sets as CDAO instances and implementing simple query and reasoning tasks. The development of CDAO will continue in the context of supporting specific research objectives, and we anticipate that, in the near future, CDAO will help to improve data interoperability in evolutionary methods and to lower the technology barrier for applying an evolutionary approach to comparative analyses.
Discussion
The principles of evolutionary analysis follow from the assumption that descent-with-modification is the generating process for comparative biological data. Though powerful and generalizable, evolutionary analysis is difficult to apply in automatic systems. To make this approach more accessible to researchers, we have undertaken the development of a Comparative Data Analysis Ontology (CDAO). The initial implementation of CDAO, described here, covers key concepts required to perform evolutionary-based comparative analyses and has been evaluated for its capacity to support domain-specific representation and reasoning. CDAO is a SourceForge project and has a web home at www.evolutionaryontology.org/cdao. CDAO is implemented in OWL 1.1 to take advantage of the capabilities of description logics.
To understand the coverage and uses of CDAO, it is important to understand that it is not primarily an ontology of evolutionary processes or of evolutionary biology, but an ontology of evolutionary comparative analysis. The task-oriented nature of comparative analysis is apparent in concepts such as "OTU": what defines something as an OTU is that it plays a particular role in an analysis. Thus, in CDAO, a TU (the generalization of OTU) is not restricted to refer to (to be about) any particular type of biological entity. Currently, CDAO provides terms for continuous characters, discrete characters, and several subclasses of discrete characters, including sequence characters. However, at present these classes remain very abstract, as CDAO does not import biological knowledge from other ontologies except the amino acid ontology mentioned above.
A challenge for the development of CDAO is to align its classes and relations with more fundamental concepts and relations, as consensus on these fundamentals begins to emerge from work in other areas of biology.[71–73] As noted above, CDAO focuses on information artefacts rather than evolutionary processes. A phylogenetic tree clearly is not a biological entity or a process, but is more like a time-dependent model (a model in which time is one of the parameters) or a historical narrative. The relationship of such an artefact to a flesh-and-blood biological thing, e.g. the relationship of a terminal "cat" node on a phylogenetic tree to the concept of a cat, or cat species, is an issue that remains to be determined. The proper form of relationship to cross the boundary separating the universe of information artefacts from that of biological objects or processes might be something like "represents" or "is about"; clearly (by way of counter-example), the proper relation cannot be something like "has" or "part_of". Even concepts that seem familiar in comparative analysis nonetheless pose difficult conceptual problems. When we see the state of a protein sequence character represented as "Ala" for "Alanine", this does not mean precisely the free amino acid in solution, L-Alanine, because the proper constituent of a protein is the L-Alanyl moiety (i.e. CHEBI:32433 rather than CHEBI:16977).[30] But even this is not quite right, because as a character state, the change from "Ala" to "Gly", for instance, follows evolutionary rules, not strictly chemical rules; and even the non-change from "Ala" to "Ala" over time (e.g. lack of change over millions of years) is not a simple chemical preservation of a molecule. The "Ala" state is the state of an OTU that represents a population of gene-encoded proteins in some way that is difficult to grasp.
While it remains to be seen whether available upper-level ontologies are suitable for the complexity introduced by evolution, clearly they contain some useful concepts. For instance, the basic relation ontology OBO-REL[71] refers to a formal relation of reproductive descendency, the relation that provides the continuity of an evolutionary lineage, i.e. a path in a tree. Likewise, the latest version of BioTop[73] has a separate hierarchy including nucleotide residues as "informational" components of a sequence, as distinct from chemical compounds, a perspective that corresponds (in our understanding) to the way sequence data are treated in the context of evolutionary analysis. Nevertheless, a complete representation of comparative data analysis in terms of philosophically rigorous principles would seem difficult. Many important concepts in modern data analysis are not ontological in the sense of Smith,[74] including concepts such as "posterior probability" and "annotation". Thus, it may be appropriate to think of CDAO as an "application ontology" (or as a domain ontology that remains immature pending resolution of relevant philosophical issues).
A more practical challenge for the development of CDAO is to evaluate, revise and expand the ontology further, to ensure that it serves the purposes of comparative data analysis. As an ontology for comparative analysis, CDAO is designed to facilitate: integration of data from different resources; interoperation of different computational tools; creation of powerful software tools and methods based on evolutionary concepts; and interpretation of the results of comparative analysis by the non-expert.
Some of the representation challenges are foreseeable, such as fleshing out the CoordinateSystem concept to provide support for representation and reasoning about sequences (or other coordinate systems). Such an expansion is needed because, while CDAO allows the representation of sequence residues as character states, the columns of a character matrix do not have any inherent order; thus, without an explicit coordinate system, the residues are not ordered as they are in a sequence.
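A small sketch can illustrate the point (the names and data layout here are hypothetical illustrations, not actual CDAO terms): if each character carries an explicit coordinate, the states of a TU can be reassembled in sequence order, which a bare set of matrix cells does not support.

```python
# Each cell records a state for a (TU, character) pair; the characters
# themselves carry explicit coordinates. Names are hypothetical.
cells = {
    ("HumanAmy", "col_2"): "G",
    ("HumanAmy", "col_0"): "T",
    ("HumanAmy", "col_1"): "C",
}
coordinates = {"col_0": 0, "col_1": 1, "col_2": 2}

def sequence_of(tu):
    """Order a TU's states by the coordinate of their character."""
    states = [(coordinates[char], state)
              for (t, char), state in cells.items() if t == tu]
    return "".join(state for _, state in sorted(states))

print(sequence_of("HumanAmy"))  # TCG
```

Without the `coordinates` mapping, the cells alone determine only a set of states, not a sequence.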
We have described here some initial tests for representation and reasoning, but a stronger test of CDAO will be performed in the context of projects with externally defined technical or scientific goals. The kinds of projects that are most demanding for data interoperability are integrative biology studies that attempt to integrate diverse data resources, while the kinds of projects that are most demanding for software interoperability are workflow systems that aim to provide access to diverse tools. Recently, CDAO was made available for use during a Data Resource Interoperability Hackathon sponsored by the National Evolutionary Synthesis Center (March 9 to 13, 2009; http://evoinfo.nescent.org/Database_Interop_Hackathon). One group of participants used CDAO concepts to anchor metadata annotations in NeXML[49] data files. Another group translated NeXML files into CDAO RDF/XML format using XSLT technology (see "Availability of Products" above), then loaded the results into a "triple store" (a collection of subject-predicate-object statements), which was interrogated using logical queries. We expect that the evaluation and further development of CDAO will take place in the context of such projects. The wider scientific community, particularly those researchers already involved in evolutionary-based analyses, is invited to participate in the further evaluation and development of CDAO.
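To illustrate the triple-store idea (a toy in-memory sketch, not the hackathon's actual software or queries; the CDAO-style term names are illustrative, not the ontology's actual IRIs), a pattern query over subject-predicate-object statements might look like this:

```python
# A toy triple store: subject-predicate-object statements queried by
# pattern matching, where None acts as a wildcard.
store = [
    ("node:n1", "cdao:has_Parent", "node:root"),
    ("node:n2", "cdao:has_Parent", "node:root"),
    ("node:n1", "cdao:represents_TU", "tu:Human"),
    ("node:n2", "cdao:represents_TU", "tu:Mouse"),
]

def query(s=None, p=None, o=None):
    """Return all triples matching the (s, p, o) pattern."""
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# "Which nodes are children of the root?"
children = [s for s, _, _ in query(p="cdao:has_Parent", o="node:root")]
print(children)  # ['node:n1', 'node:n2']
```

Real triple stores answer the same kind of question with SPARQL queries over RDF graphs; the principle of matching statement patterns is the same.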