De novo assembly of short sequence reads

Konrad Paszkiewicz; David J Studholme

doi:10.1093/bib/bbq020

Brief Bioinform. 2010 Sep;11(5):457-72. doi: 10.1093/bib/bbq020. Epub 2010 Aug 19.

Authors

Konrad Paszkiewicz¹, David J Studholme

Affiliation

¹ Imperial College, London.

PMID: 20724458
DOI: 10.1093/bib/bbq020

Abstract

A new generation of sequencing technologies is revolutionizing molecular biology. Illumina's Solexa and Applied Biosystems' SOLiD generate gigabases of nucleotide sequence per week. However, a perceived limitation of these ultra-high-throughput technologies is their short read lengths. De novo assembly of sequence reads generated by classical Sanger capillary sequencing is a mature field of research. Unfortunately, the existing sequence assembly programs were not effective for the short sequence reads generated by the Illumina and SOLiD platforms. Early studies suggested that, in principle, sequence reads as short as 20-30 nucleotides could be used to generate useful assemblies of both prokaryotic and eukaryotic genome sequences, albeit containing many gaps. These early feasibility studies and proofs of principle inspired several bioinformatics research groups to implement new algorithms as freely available software tools specifically aimed at assembling reads of 30-50 nucleotides in length. This has led to the generation of several draft genome sequences based exclusively on short Illumina sequence reads, recently culminating in the assembly of the 2.25-Gb genome of the giant panda from Illumina sequence reads with an average length of just 52 nucleotides. As well as reviewing recent developments in the field, we discuss some practical aspects such as data filtering and submission of assembly data to public repositories.

Publication types

Review

MeSH terms

Algorithms
Animals
Base Sequence*
Computational Biology / methods
Databases, Genetic
Genome
Humans
Molecular Sequence Data
Sequence Analysis, DNA* / methods