Abstract
Abstract Motivation: Researchers need general purpose methods for objectively evaluating the accuracy of single and metagenome assemblies and for automatically detecting any errors they may contain. Current methods do not fully meet this need because they require a reference, only consider one of the many aspects of assembly quality or lack statistical justification, and none are designed to evaluate metagenome assemblies. Results: In this article, we present an Assembly Likelihood Evaluation (ALE) framework that overcomes these limitations, systematically evaluating the accuracy of an assembly in a reference-independent manner using rigorous statistical methods. This framework is comprehensive, and integrates read quality, mate pair orientation and insert length (for paired-end reads), sequencing coverage, read alignment and k-mer frequency. ALE pinpoints synthetic errors in both single and metagenomic assemblies, including single-base errors, insertions/deletions, genome rearrangements and chimeric assemblies presented in metagenomes. At the genome level with real-world data, ALE identifies three large misassemblies from the Spirochaeta smaragdinae finished genome, which were all independently validated by Pacific Biosciences sequencing. At the single-base level with Illumina data, ALE recovers 215 of 222 (97%) single nucleotide variants in a training set from a GC-rich Rhodobacter sphaeroides genome. Using real Pacific Biosciences data, ALE identifies 12 of 12 synthetic errors in a Lambda Phage genome, surpassing even Pacific Biosciences’ own variant caller, EviCons. In summary, the ALE framework provides a comprehensive, reference-independent and statistically rigorous measure of single genome and metagenome assembly accuracy, which can be used to identify misassemblies or to optimize the assembly process. Availability: ALE is released as open source software under the UoI/NCSA license at http://www.alescore.org. It is implemented in C and Python. Contact: pf98@cornell.edu or ZhongWang@lbl.gov Supplementary information: Supplementary data are available at Bioinformatics online.
Keywords
Affiliated Institutions
Related Publications
QUAST: quality assessment tool for genome assemblies
Abstract Summary: Limitations of genome sequencing techniques have led to dozens of assembly algorithms, none of which is perfect. A number of methods for comparing assemblers h...
SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing
The latest revolution in the DNA sequencing field has been brought about by the development of automated sequencers that are capable of generating giga base pair data sets quick...
Assembly of long, error-prone reads using repeat graphs
Accurate genome assembly is hampered by repetitive regions. Although long single molecule sequencing reads are better able to resolve genomic repeats than short-read data, most ...
Assemblathon 1: A competitive assessment of de novo short read assembly methods
Low-cost short read sequencing technology has revolutionized genomics, though it is only just becoming practical for the high-quality de novo assembly of a novel large genome. W...
Whole-Genome Sequencing and Assembly with High-Throughput, Short-Read Technologies
While recently developed short-read sequencing technologies may dramatically reduce the sequencing cost and eventually achieve the $1000 goal for re-sequencing, their limitation...
Publication Info
- Year
- 2013
- Type
- article
- Volume
- 29
- Issue
- 4
- Pages
- 435-443
- Citations
- 190
- Access
- Closed
External Links
Social Impact
Social media, news, blog, policy document mentions
Citation Metrics
Cite This
Identifiers
- DOI
- 10.1093/bioinformatics/bts723