ALE: a generic assembly likelihood evaluation framework for assessing the accuracy of genome and metagenome assemblies

Abstract

Motivation: Researchers need general purpose methods for objectively evaluating the accuracy of single and metagenome assemblies and for automatically detecting any errors they may contain. Current methods do not fully meet this need because they require a reference, only consider one of the many aspects of assembly quality or lack statistical justification, and none are designed to evaluate metagenome assemblies.

Results: In this article, we present an Assembly Likelihood Evaluation (ALE) framework that overcomes these limitations, systematically evaluating the accuracy of an assembly in a reference-independent manner using rigorous statistical methods. This framework is comprehensive, and integrates read quality, mate pair orientation and insert length (for paired-end reads), sequencing coverage, read alignment and k-mer frequency. ALE pinpoints synthetic errors in both single and metagenomic assemblies, including single-base errors, insertions/deletions, genome rearrangements and chimeric assemblies presented in metagenomes. At the genome level with real-world data, ALE identifies three large misassemblies from the Spirochaeta smaragdinae finished genome, which were all independently validated by Pacific Biosciences sequencing. At the single-base level with Illumina data, ALE recovers 215 of 222 (97%) single nucleotide variants in a training set from a GC-rich Rhodobacter sphaeroides genome. Using real Pacific Biosciences data, ALE identifies 12 of 12 synthetic errors in a Lambda Phage genome, surpassing even Pacific Biosciences’ own variant caller, EviCons. In summary, the ALE framework provides a comprehensive, reference-independent and statistically rigorous measure of single genome and metagenome assembly accuracy, which can be used to identify misassemblies or to optimize the assembly process.

Availability: ALE is released as open source software under the UoI/NCSA license at http://www.alescore.org. It is implemented in C and Python.

Contact: pf98@cornell.edu or ZhongWang@lbl.gov

Supplementary information: Supplementary data are available at Bioinformatics online.

Keywords

Metagenomics, Genome, Sequence assembly, Computational biology, Reference genome, Computer science, Data mining, Biology, Genetics, Gene

Related Publications

QUAST: quality assessment tool for genome assemblies

Alexey Gurevich, Vladislav Saveliev, Nikolay Vyahhi +1 more

Abstract Summary: Limitations of genome sequencing techniques have led to dozens of assembly algorithms, none of which is perfect. A number of methods for comparing assemblers h...

2013 Bioinformatics 10133 citations

SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing

Juliane C. Dohm, Claudio Lottaz, Tatiana Borodina +1 more

The latest revolution in the DNA sequencing field has been brought about by the development of automated sequencers that are capable of generating giga base pair data sets quick...

2007 Genome Research 281 citations

Assembly of long, error-prone reads using repeat graphs

Mikhail Kolmogorov, Jeffrey Yuan, Yu Lin +1 more

Accurate genome assembly is hampered by repetitive regions. Although long single molecule sequencing reads are better able to resolve genomic repeats than short-read data, most ...

2019 Nature Biotechnology 5451 citations

Assemblathon 1: A competitive assessment of de novo short read assembly methods

Dent Earl, Keith Bradnam, John St. John +68 more

Low-cost short read sequencing technology has revolutionized genomics, though it is only just becoming practical for the high-quality de novo assembly of a novel large genome. W...

2011 Genome Research 534 citations

Whole-Genome Sequencing and Assembly with High-Throughput, Short-Read Technologies

Andreas Sundquist, Mostafa Ronaghi, Haixu Tang +2 more

While recently developed short-read sequencing technologies may dramatically reduce the sequencing cost and eventually achieve the $1000 goal for re-sequencing, their limitation...

2007 PLoS ONE 126 citations

Publication Info

Year: 2013
Type: article
Volume: 29
Issue: 4
Pages: 435-443
Citations: 190
Access: Closed

Cite This

APA Style

Scott Clark, Rob Egan, Peter I. Frazier et al. (2013). ALE: a generic assembly likelihood evaluation framework for assessing the accuracy of genome and metagenome assemblies. Bioinformatics, 29(4), 435-443. https://doi.org/10.1093/bioinformatics/bts723

Identifiers

DOI: 10.1093/bioinformatics/bts723