Special Reviews

Genome Sequence of the Nematode C. elegans: A Platform for Investigating Biology

Science
11 Dec 1998
Vol 282, Issue 5396
pp. 2012-2018

Abstract

The 97-megabase genomic sequence of the nematode Caenorhabditis elegans reveals over 19,000 genes. More than 40 percent of the predicted protein products find significant matches in other organisms. There is a variety of repeated sequences, both local and dispersed. The distinctive distribution of some repeats and highly conserved genes provides evidence for a regional organization of the chromosomes.

Supplementary Material

File (c-elegans.xhtml)

REFERENCES AND NOTES

1
M. S. Chee et al., in Cytomegaloviruses, vol. 154 of Current Topics in Microbiology and Immunology, J. K. McDougall, Ed. (Springer-Verlag, Berlin, 1990), pp. 125–169;
Fleischmann R. D., et al., Science 269, 496 (1995);
Bult C. J., et al., ibid. 273, 1058 (1996);
F. R. Blattner et al., ibid. 277, 1453 (1997); S. T. Cole et al., Nature 393, 537 (1998).
2
H. W. Mewes et al., Nature 387 (suppl.), 7 (1997);
Goffeau A., et al., Science 274, 546 (1996).
3
Coulson A. R., et al., Proc. Natl. Acad. Sci. U.S.A. 83, 7821 (1986).
4
Coulson A., et al., Bioessays 13, 413 (1991);
Coulson A., et al., Nature 335, 184 (1988);
The current status of the C. elegans physical map is accessible on the World Wide Web (20, 21).
5
The investigations contributing to the C. elegans genome project are too numerous to cite. Two early representative publications are
Greenwald I., Coulson A., Sulston J., Nucleic Acids Res. 15, 2295 (1987).
Ward S., et al., J. Mol. Biol. 199, 1 (1988).
6
Waterston R., et al., Nature Genet. 1, 114 (1992);
W. R. McCombie et al., ibid., p. 124.
7
Y. Kohara, PNE Protein Nucleic Acid Enzyme 41, 715 (1996).
8
Okimoto R., Macfarlane J. L., Clary D. O., Wolstenholme D. R., Genetics 130, 471 (1992).
9
Burke D. T., Carle G. F., Olson M. V., Science 236, 806 (1987).
10
Sulston J., et al., Nature 356, 37 (1992).
11
Wilson R., et al., ibid. 368, 32 (1994).
12
Vaudin M., et al., Nucleic Acids Res. 23, 670 (1995).
13
For details of the sequencing process, see (49). The process began with the purification of DNA from selected clones of the tiling path. The DNA was sheared mechanically, and after size selection, the resulting fragments were subcloned into M13 or plasmid vectors. Random subclones were selected for sequence generation (the shotgun sequencing approach). Generally, 900 sequence reads per 40 kb of genomic DNA were generated with fluorescent dye–labeled primers or terminators. Bases were determined with PHRED (50). An assembly of these random sequences that was generated with PHRAP (51) typically resulted in two to eight contigs. Gap closure and resolution of sequence ambiguities were achieved during finishing [using the editing packages GAP (52) and CONSED (53) and the collection of additional data] through longer reads, directed sequencing reactions using custom oligonucleotide primers on chosen templates, or additional chemistries as required. High-quality finished sequence was analyzed through the use of a suite of programs (including BLAST and GENEFINDER), and the results were stored in ACEDB and submitted to GenBank. Unfinished and finished sequence data were available to investigators by file transfer protocol (ftp) from both sequencing sites (20, 21).
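The read density quoted in this note (about 900 reads per 40 kb) can be turned into an approximate depth of coverage once a mean read length is chosen. The note does not state a read length, so the 400-base figure used below is purely an assumption for illustration, and the function name is hypothetical.

```python
def shotgun_coverage(reads: int, mean_read_length: float, region_length: int) -> float:
    """Average depth of coverage: total bases read divided by region length."""
    if region_length <= 0:
        raise ValueError("region_length must be positive")
    return reads * mean_read_length / region_length


# 900 reads over a 40-kb clone; the 400-base mean read length is an
# assumed value, not a figure taken from the note.
depth = shotgun_coverage(reads=900, mean_read_length=400, region_length=40_000)
print(f"approximate coverage: {depth:.1f}x")
```

Under that assumed read length the depth works out to roughly ninefold; a different read length scales the result linearly.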
14
Heiner C. R., Hunkapiller K. L., Chen S. M., Genome Res. 8, 557 (1998);
Lee L. G., et al., Nucleic Acids Res. 20, 2471 (1992);
J. D. Parsons, Comput. Appl. Biosci. 11, 615 (1995).
15
McMurray A. A., Sulston J. E., Quail M. A., Genome Res. 8, 562 (1998).
16
Kim U. J., Shizuya H., de Jong P. J., Nucleic Acids Res. 20, 1083 (1992).
17
Cheng S., Fockler C., Barnes W. M., Proc. Natl. Acad. Sci. U.S.A. 91, 5695 (1994).
18
A clean separation of the YAC DNA from the host chromosomal DNA sometimes required the use of yeast strains in which specific yeast chromosomes are altered in size to provide a window around the YAC that is free of the native chromosomes.
Hamer L., Johnston M., Green E. D., Proc. Natl. Acad. Sci. U.S.A. 92, 11706 (1995).
19
Devine S. E., Chissoe S. L., Eby Y., Genome Res. 7, 551 (1997).
20
22
Wicky C., et al., Proc. Natl. Acad. Sci. U.S.A. 93, 8983 (1996).
23
Every region must be sequenced either on each strand or with dye primer and dye terminator chemistry, which extensive comparisons have shown to be at least as reliable as double stranding in revealing and correcting compressions and other base-calling errors. All regions must be represented by reads from two or more independent subclones or from PCR products across the region. If subcloned PCR products are used for a region, three independent clones must be sequenced. Rare exceptions to the general rules of double stranding or alternative chemistry were permitted on the basis of the following. For regions of <50 bases where, despite valid efforts, a finisher is unable to achieve double stranding or double chemistry, the sequence may be submitted (provided the sequence is of high quality and both the finisher and his or her supervisor see no ambiguous bases). When editing, in XGAP, all sequence data must be resolved at the 75% consensus level, either by the collection of additional data or by the editing of poorly called traces. In CONSED, any consensus base with a quality <25% must be manually reviewed to determine if the available data are sufficient to unambiguously support the derived contig sequence. If not, additional data are collected.
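The finishing rules in this note can be read as a per-region checklist. The sketch below is one possible encoding of that checklist, not the project's software: the `RegionEvidence` record, its field names, and the way the rules are combined are all assumptions made for illustration. The XGAP and CONSED editing thresholds are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class RegionEvidence:
    """Hypothetical summary of the read evidence covering one region."""
    length: int                    # region length in bases
    both_strands: bool             # sequenced on each strand
    both_chemistries: bool         # dye primer and dye terminator reads present
    independent_subclones: int     # independent subclones with reads here
    direct_pcr_product: bool       # covered by a directly sequenced PCR product
    subcloned_pcr_clones: int      # independent clones of a subcloned PCR product
    signed_off_high_quality: bool = False  # finisher and supervisor see no ambiguity


def strand_or_chemistry_ok(r: RegionEvidence) -> bool:
    """Double stranding or double chemistry, with the short-region exception."""
    if r.both_strands or r.both_chemistries:
        return True
    # Rare exception: regions of <50 bases with high-quality, signed-off sequence.
    return r.length < 50 and r.signed_off_high_quality


def representation_ok(r: RegionEvidence) -> bool:
    """Two or more subclones, a PCR product, or three subcloned PCR clones."""
    return (
        r.independent_subclones >= 2
        or r.direct_pcr_product
        or r.subcloned_pcr_clones >= 3
    )


def meets_finishing_rules(r: RegionEvidence) -> bool:
    return strand_or_chemistry_ok(r) and representation_ok(r)


region = RegionEvidence(500, True, False, 2, False, 0)
print(meets_finishing_rules(region))
```

Keeping the two conditions as separate functions makes it easy to report which rule a region fails, which is the information a finisher would need before collecting more data.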
24
Each finished sequence is submitted to a series of quality control tests, including verification that all of the finishing rules (23) have been followed and a careful verification that the assembly is consistent with all restriction digest information. In addition, every finished sequence undergoes an automatic process of base calling and reassembly with different algorithms than those that were used for the initial assembly and comparison of the resultant consensus by a banded Smith-Waterman analysis [CROSSMATCH (51)] against the sequence that was obtained by the finisher. Any discrepancies in assembly or sequence, along with any regions failing to meet finishing criteria, are manually reviewed, and new data are collected as necessary. Only when all discrepancies are accounted for is the sequence passed on for annotation. In turn, if annotation flags any suspicious regions, these are again passed back to the finisher for resolution, either through additional data collection or editing.
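A banded Smith-Waterman comparison restricts the local-alignment dynamic programming matrix to cells near the main diagonal, which is cheap when two sequences are expected to be nearly identical, as with a reassembled consensus and the finisher's sequence. The following is a minimal sketch of that idea, not CROSSMATCH itself: the function name, the linear gap penalty, and the scoring values are assumptions chosen for illustration.

```python
def banded_smith_waterman(a: str, b: str, band: int = 10,
                          match: int = 1, mismatch: int = -2, gap: int = -3) -> int:
    """Best local alignment score using only cells with |i - j| <= band."""
    if band < 0:
        raise ValueError("band must be non-negative")
    prev = {}  # column index -> score in the previous row (in-band cells only)
    best = 0
    for i in range(1, len(a) + 1):
        cur = {}
        lo = max(1, i - band)
        hi = min(len(b), i + band)
        for j in range(lo, hi + 1):
            # Cells outside the band are treated as 0, i.e. a fresh local start.
            diag = prev.get(j - 1, 0) + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev.get(j, 0) + gap
            left = cur.get(j - 1, 0) + gap
            score = max(0, diag, up, left)
            cur[j] = score
            best = max(best, score)
        prev = cur
    return best


print(banded_smith_waterman("ACGTACGT", "ACGAACGT", band=2))
```

Because only about `2 * band + 1` cells are filled per row, the cost grows linearly with sequence length for a fixed band, at the price of missing alignments whose offset exceeds the band.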
25
P. Green and L. Hillier, unpublished software.
26
Fichant G. A., Burks C., J. Mol. Biol. 220, 659 (1991);
Lowe T. M., Eddy S., Nucleic Acids Res. 25, 955 (1997).
27
Altschul S. F., Gish W., Miller W., J. Mol. Biol. 215, 403 (1990);
W. Gish, WU-BLAST, unpublished software.
28
E. L. L. Sonnhammer and R. Durbin, in Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, R. Altman, D. Brutlag, P. Karp, R. Lathrop, D. Searls, Eds. (AAAI Press, Menlo Park, CA, 1994), pp. 363–368.
29
Mott R., Comput. Appl. Biosci. 13, 477 (1997).
30
Sonnhammer E. L., Eddy S. R., Birney E., Nucleic Acids Res. 26, 320 (1998);
S. R. Eddy, Curr. Opin. Struct. Biol. 6, 361 (1996).
31
We identified local tandem and inverted repeats with the programs QUICKTANDEM, TANDEM, and INVERTED (20), which search for repeats within 1-kb intervals along the genomic sequence. An index of repeat families used by the project is available at www.sanger.ac.uk/Projects/C_elegans/repeats/.
32
R. Durbin and J. Thierry-Mieg, unpublished software. Documentation, code, and data are available from anonymous ftp servers at lirmm.lirmm.fr/pub/acedb/, ftp.sanger.ac.uk/pub/acedb/, and ncbi.nlm.nih.gov/repository/acedb/.
33
In C. elegans, two or more genes can be transcribed from the same promoter, with one gene separated by no more than a few hundred nucleotides from another. In genes undergoing transsplicing, the 5′ exon begins with a splice acceptor sequence, making this 5′ exon more difficult to distinguish from internal exons. This combination of factors may result in two genes being merged into one [Blumenthal T., Trends Genet. 11, 132 (1995)].
34
We have identified 182 genes possessing alternative splice variants, which are predominately from EST data. Of these, 67 genes produce proteins that differ at their amino termini, 57 genes produce proteins that differ at the carboxyl end, and 59 genes produce proteins that display an internal variation. Of the internal variations, seven genes showed complete exon skipping. Thirty-one genes were found where the 5′ end of an exon had changed, 21 of which resulted in a difference of three or fewer codons. In contrast, of the 24 alternative transcripts that changed the 3′ end of an exon, only 4 resulted in a change of three or fewer codons.
36
R. K. Herman, in The Nematode Caenorhabditis elegans, W. B. Wood, Ed. (Cold Spring Harbor Laboratory Press, Plainview, NY, 1988), pp. 17–45;
Waterston R., Sulston J., Proc. Natl. Acad. Sci. U.S.A. 92, 10836 (1995).
37
These results were obtained with WU-BLAST (version 2.0a13MP), using default parameters and a threshold P value of 10^-3.
Green P., et al., Science 259, 1711 (1993).
38
Chervitz S. A., et al., Science 282, 2022 (1998).
39
Sonnhammer E. L., Durbin R., Genomics 46, 200 (1997).
40
GENEFINDER systematically uses statistical criteria [primarily log likelihood ratios (LLRs)] to attempt to identify likely genes within a region of genomic sequence. Candidate genes are evaluated on the basis of “scores” that reflect their splice site, translation start site, coding potential LLRs, and intron sizes. These scores are normalized by reference to the distribution of combined scores in a simulated sequence as follows: If a given combined score occurs, on average, once in every 10^s nucleotides in simulated DNA, then the corresponding normalized score is set to s. (For example, exons with a normalized score of 5.0 or greater will be found only once in every 100 kb of simulated DNA. With the current reference simulated sequence, which is 1 Mb in length, 6.0 is the maximum normalized score that can occur.) A dynamic programming algorithm is then used to find the set of nonoverlapping candidate genes (on a given strand) that has the highest total score (among all such sets). About 85% of experimentally verified “exon ORFs” (open reading frames containing true exons) in C. elegans genes in GenBank have normalized scores above 5.0 (and many of the remaining 15% are initial or terminal exons, which have a single splice site). The fraction of exons with scores >5.0 may be lower for all C. elegans genes because of the bias toward highly expressed genes (which often have very high coding segment scores) in the experimentally verified set. However, even for genes in the current verified set that are expressed at moderate to low levels, a majority of exon ORF scores exceed 5.0; this score should be an effective criterion for identifying at least part of most genes. In theory, high-scoring ORFs could arise in other ways. For example, intergenic or intronic regions having abnormal nucleotide composition might appear to have coding segments and occasionally, by chance, may have high-scoring splice sites. So far, there seem to be relatively few such regions in the C. elegans genomic sequence. These regions may account for the anomalous orphan exons that we occasionally find. In addition, there are examples where these GENEFINDER-predicted genes fall into clear gene families that are nematode-specific or have only very distant similarity outside the nematodes, for example, chemoreceptor genes (54).
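Two steps in this note lend themselves to a short illustration: the score normalization (a score seen once per 10^s simulated nucleotides gets normalized score s) and the dynamic programming choice of the highest-scoring set of nonoverlapping candidates on one strand. The sketch below is not GENEFINDER's code. The `Candidate` record, the function names, and the use of the standard weighted interval scheduling recurrence are assumptions made for illustration.

```python
import math
from bisect import bisect_right
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """Hypothetical candidate gene on one strand (inclusive integer coordinates)."""
    start: int
    end: int
    score: float


def normalized_score(occurrences: int, simulated_length: int) -> float:
    """Return s such that the score occurs once per 10**s simulated nucleotides."""
    if occurrences <= 0:
        raise ValueError("score was never observed in the simulated sequence")
    return math.log10(simulated_length / occurrences)


def best_nonoverlapping(candidates: List[Candidate]) -> List[Candidate]:
    """Select the nonoverlapping subset of candidates with the highest total score."""
    ordered = sorted(candidates, key=lambda c: c.end)
    ends = [c.end for c in ordered]
    best = [0.0] * (len(ordered) + 1)  # best[i]: optimum over the first i candidates
    prev = [0] * len(ordered)          # prev[i]: how many earlier candidates end before i starts
    for i, c in enumerate(ordered):
        prev[i] = bisect_right(ends, c.start - 1, 0, i)
        best[i + 1] = max(best[i], best[prev[i]] + c.score)
    # Trace back through the table to recover the chosen candidates.
    chosen = []
    i = len(ordered)
    while i > 0:
        c = ordered[i - 1]
        if best[prev[i - 1]] + c.score > best[i - 1]:
            chosen.append(c)
            i = prev[i - 1]
        else:
            i -= 1
    return chosen[::-1]


# A score seen 10 times in a 1-Mb simulated sequence occurs once per 100 kb.
print(normalized_score(10, 1_000_000))
picked = best_nonoverlapping([
    Candidate(1, 100, 5.0), Candidate(50, 150, 7.0),
    Candidate(120, 200, 4.0), Candidate(160, 300, 3.0),
])
print(picked)
```

Sorting by end coordinate and looking up the last compatible candidate with a binary search keeps the selection at O(n log n), so the same approach scales to the many overlapping candidates a gene finder typically produces.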
41
Pfam is a collection of protein family alignments that were constructed semiautomatically with hidden Markov models within the HMMER package. The collagen and seven transmembrane chemoreceptor data were obtained with unpublished hidden Markov models. The number of seven transmembrane chemoreceptor genes is lower than that found by Robertson (54), which could be due to pseudogenes.
42
Putative tRNA pseudogenes are identified by the search program tRNAscan-SE as sequences that are significantly related to a tRNA sequence consensus but do not appear to be likely to adopt a tRNA's canonical secondary structure (26). Many higher eukaryotic genomes have mobile, tRNA-derived short interspersed nuclear elements (SINEs). However, because they are few in number, the nematode tRNA pseudogenes seem more likely to have arisen by some rare event rather than by the extensive mobility that characterizes mobile SINEs [Daniels G. R., Deininger P. L., Nature 317, 819 (1985)].
43
A. F. Smit, Curr. Opin. Genet. Dev. 6, 743 (1996).
44
Ketting R. F., Fischer S. E. J., Plasterk R. H., Nucleic Acids Res. 25, 4041 (1997).
45
Bernardi G., Annu. Rev. Genet. 29, 445 (1995);
Dujon B., et al., Nature 369, 371 (1994).
46
The abundance of C. elegans ESTs does not directly reflect expression levels, because they are derived from cDNAs in which more abundantly expressed genes were partially selected against (6, 7).
47
Barnes T. M., Kohara Y., Coulson A., Genetics 141, 159 (1995).
48
This approach is also being used for the human genome (Sanger Centre, Washington University Genome Sequencing Center, Genome Res., in press).
49
For methodological details, see (20) or (21). For biochemical procedures, see R. K. Wilson and E. R. Mardis, in Genome Analysis: A Laboratory Manual, B. Birren, E. D. Green, S. Klapholz, R. M. Myers, J. Roskams, Eds. (Cold Spring Harbor Laboratory Press, Plainview, NY, 1997), vol. 1, pp. 397–454. For software packages, see (20) or (21) and
Dear S., et al., Genome Res. 8, 260 (1998);
M. Wendl et al., ibid., p. 975; J. D. Parsons, Comput. Appl. Biosci. 11, 615 (1995); and
Cooper M., et al., Genome Res. 6, 1110 (1996).
50
Ewing B., Hillier L., Wendl M. C., Genome Res. 8, 175 (1998);
B. Ewing and P. Green, ibid., p. 186.
51
P. Green, personal communication.
52
Bonfield J. K., Smith K. F., Staden R., Nucleic Acids Res. 23, 4992 (1995).
53
Gordon D., Abajian C., Green P., Genome Res. 8, 195 (1998).
54
Robertson H. M., Genome Res. 8, 449 (1998).
55
This work has been supported by grants from the U.S. National Human Genome Research Institute and the UK MRC. We would also like to thank the many members of the C. elegans community who have shared data and provided encouragement in the course of this project.

Authors

The C. elegans Sequencing Consortium *

Notes

*
See genome.wustl.edu/gsc/C_elegans/ and www.sanger.ac.uk/Projects/C_elegans/ for a list of authors. Address correspondence to The Washington University Genome Sequencing Center, Box 8501, 4444 Forest Park Parkway, St. Louis, MO 63108, USA. E-mail: [email protected]; or The Sanger Centre, The Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK. E-mail: [email protected]
