Identifying and removing haplotypic duplication in primary genome assemblies

Abstract

Motivation: Rapid development in long-read sequencing and scaffolding technologies is accelerating the production of reference-quality assemblies for large eukaryotic genomes. However, haplotype divergence in regions of high heterozygosity often results in assemblers creating two copies rather than one copy of a region, leading to breaks in contiguity and compromising downstream steps such as gene annotation. Several tools have been developed to resolve this problem. However, they either focus only on removing contained duplicate regions, also known as haplotigs, or fail to use all the relevant information and hence make errors.

Results: Here we present a novel tool, purge_dups, that uses sequence similarity and read depth to automatically identify and remove both haplotigs and heterozygous overlaps. In comparison with current tools, we demonstrate that purge_dups can reduce heterozygous duplication and increase assembly continuity while maintaining completeness of the primary assembly. Moreover, purge_dups is fully automatic and can easily be integrated into assembly pipelines.

Availability and implementation: The source code is written in C and is available at https://github.com/dfguan/purge_dups.

Supplementary information: Supplementary data are available at Bioinformatics online.
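As a rough illustration of why combining read depth with sequence similarity helps, the following minimal sketch labels contigs with a toy rule. It is not the purge_dups algorithm: the `Contig` fields, the `classify` function, the `haploid_max` depth cutoff, the `min_cov` threshold, and the example values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    mean_depth: float    # average read depth across the contig (hypothetical input)
    best_hit_cov: float  # fraction of the contig aligned to another contig, 0..1


def classify(contig: Contig, haploid_max: float, min_cov: float = 0.8) -> str:
    """Toy rule: a contig is a duplication candidate only if it has
    roughly haploid (half) depth AND is similar to another contig."""
    half_depth = contig.mean_depth <= haploid_max
    if half_depth and contig.best_hit_cov >= min_cov:
        return "haplotig"  # mostly contained in another contig
    if half_depth and contig.best_hit_cov > 0:
        return "overlap"   # only part of the contig matches another one
    return "primary"       # full depth or no match: keep as is


# Made-up contigs; depths assume a diploid depth of about 60x.
contigs = [
    Contig("ctg1", 60.0, 0.00),
    Contig("ctg2", 29.0, 0.95),
    Contig("ctg3", 31.0, 0.20),
    Contig("ctg4", 58.0, 0.90),
]
labels = {c.name: classify(c, haploid_max=40.0) for c in contigs}
print(labels)
```

In this toy data, `ctg4` is highly similar to another contig but has full depth, so the rule keeps it. A filter based on similarity alone would have flagged it, which is the kind of error the abstract attributes to tools that do not use all the relevant information.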

Keywords

Computer science, Purge, Source code, Sequence assembly, Contiguity, Annotation, Genome, Gene duplication, Data mining, Computational biology, Biology, Programming language, Genetics, Artificial intelligence, Gene, Operating system

Publication Info

Year: 2020
Type: article
Volume: 36
Issue: 9
Pages: 2896-2898
Citations: 2530
Access: Closed

Cite This

APA Style

Dengfeng Guan, Shane McCarthy, Jonathan Wood et al. (2020). Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics, 36(9), 2896-2898. https://doi.org/10.1093/bioinformatics/btaa025

Identifiers

DOI: 10.1093/bioinformatics/btaa025