A fast, lock-free approach for efficient parallel counting of occurrences of k-mers

Abstract

Motivation: Counting the number of occurrences of every k-mer (substring of length k) in a long string is a central subproblem in many applications, including genome assembly, error correction of sequencing reads, fast multiple sequence alignment and repeat detection. Recently, the deep sequence coverage generated by next-generation sequencing technologies has caused the amount of sequence to be processed during a genome project to grow rapidly, and has rendered current k-mer counting tools too slow and memory intensive. At the same time, large multicore computers have become commonplace in research facilities allowing for a new parallel computational paradigm.

Results: We propose a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient. It is based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length. Due to their flexibility, suffix arrays have been the data structure of choice for solving many string problems. For the task of k-mer counting, important in many biological applications, Jellyfish offers a much faster and more memory-efficient solution.

Availability: The Jellyfish software is written in C++ and is GPL licensed. It is available for download at http://www.cbcb.umd.edu/software/jellyfish.

Contact: gmarcais@umd.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
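To make the counting task concrete, here is a minimal single-threaded sketch in Python. It is not Jellyfish's implementation: the 2-bit-per-base packing, the handling of non-ACGT characters, and the names `count_kmers` and `decode_kmer` are assumptions made for illustration. Packing each base into 2 bits lets a k-mer of up to 31 bases fit in 62 bits, which is one plausible reading of the 31-base limit mentioned in the abstract. An ordinary `Counter` stands in for the lock-free hash table, so the concurrency aspect is not shown.

```python
from collections import Counter

# Assumed 2-bit code per base (illustrative; not taken from the paper).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def count_kmers(seq, k):
    """Count every k-mer in seq, keyed by its 2-bit packed integer."""
    if not 1 <= k <= 31:
        raise ValueError("k must be between 1 and 31")
    mask = (1 << (2 * k)) - 1  # keeps only the low 2*k bits
    counts = Counter()
    key = 0
    valid = 0  # consecutive valid bases seen since the last reset
    for base in seq.upper():
        code = BASE_CODE.get(base)
        if code is None:
            # Assumption: any non-ACGT character breaks the window.
            key = 0
            valid = 0
            continue
        # Slide the window: shift in the new base, drop the oldest one.
        key = ((key << 2) | code) & mask
        valid += 1
        if valid >= k:
            counts[key] += 1
    return counts


def decode_kmer(key, k):
    """Turn a packed integer key back into its k-mer string."""
    bases = "ACGT"
    return "".join(bases[(key >> (2 * (k - 1 - i))) & 3] for i in range(k))
```

For example, `count_kmers("ACGTACGT", 3)` yields one packed key per distinct 3-mer, and `decode_kmer(key, 3)` recovers the readable string for each key. In a multithreaded setting the `counts[key] += 1` step is where a lock-free table would replace the dictionary update with an atomic operation.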

Keywords

Substring, Computer science, k-mer, Parallel computing, String (physics), Software, Suffix array, Lock (firearm), Hash table, Hash function, Bloom filter, Sequence assembly, Multi-core processor, Sequence (biology), Data structure, Theoretical computer science, Algorithm, Genome, Programming language, Biology, Mathematics

Affiliated Institutions

University of Maryland, College Park US

Related Publications

Velvet: Algorithms for de novo short read assembly using de Bruijn graphs

Daniel R. Zerbino, Ewan Birney

We have developed a new set of algorithms, collectively called “Velvet,” to manipulate de Bruijn graphs for genomic sequence assembly. A de Bruijn graph is a compact representat...

2008 Genome Research 9539 citations

Fast and accurate short read alignment with Burrows–Wheeler transform

Heng Li, Richard Durbin

Motivation: The enormous amount of short reads generated by the new DNA sequencing technologies call for the development of fast and accurate read alignment programs. A...

2009 Bioinformatics 59569 citations

A New Algorithm for DNA Sequence Assembly

Ramana M. Idury, Michael S. Waterman

Since the advent of rapid DNA sequencing methods in 1976, scientists have had the problem of inferring DNA sequences from sequenced fragments. Shotgun sequencing is a well-estab...

1995 Journal of Computational Biology 344 citations

GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database

Pierre-Alain Chaumeil, Aaron J. Mussig, Philip Hugenholtz +1 more

Summary: The Genome Taxonomy Database Toolkit (GTDB-Tk) provides objective taxonomic assignments for bacterial and archaeal genomes based on the GTDB. GTDB-Tk is computa...

2019 Bioinformatics 4811 citations

QUAST: quality assessment tool for genome assemblies

Alexey Gurevich, Vladislav Saveliev, Nikolay Vyahhi +1 more

Summary: Limitations of genome sequencing techniques have led to dozens of assembly algorithms, none of which is perfect. A number of methods for comparing assemblers h...

2013 Bioinformatics 10133 citations

Publication Info

Year: 2011
Type: article
Volume: 27
Issue: 6
Pages: 764-770
Citations: 4605
Access: Closed

Cite This

APA Style

Guillaume Marçais, Carl Kingsford (2011). A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27(6), 764-770. https://doi.org/10.1093/bioinformatics/btr011

Identifiers

DOI: 10.1093/bioinformatics/btr011