Abstract

Motivation: Counting the number of occurrences of every k-mer (substring of length k) in a long string is a central subproblem in many applications, including genome assembly, error correction of sequencing reads, fast multiple sequence alignment and repeat detection. Recently, the deep sequence coverage generated by next-generation sequencing technologies has caused the amount of sequence to be processed during a genome project to grow rapidly, and has rendered current k-mer counting tools too slow and memory intensive. At the same time, large multicore computers have become commonplace in research facilities, allowing for a new parallel computational paradigm.

Results: We propose a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient. It is based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length. Due to their flexibility, suffix arrays have been the data structure of choice for solving many string problems. For the task of k-mer counting, important in many biological applications, Jellyfish offers a much faster and more memory-efficient solution.

Availability: The Jellyfish software is written in C++ and is GPL licensed. It is available for download at http://www.cbcb.umd.edu/software/jellyfish.

Contact: gmarcais@umd.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
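To make the k-mer counting task described in the abstract concrete, here is a minimal single-threaded sketch in Python. It is not Jellyfish's implementation: it uses an ordinary `Counter` rather than a lock-free hash table, and the names `ENCODE`, `count_kmers` and `decode` are hypothetical. The 2-bit-per-base packing is an assumption chosen for illustration; under that assumption a k-mer of up to 31 bases occupies at most 62 bits, so it fits in a single 64-bit word.

```python
from collections import Counter

# Illustrative 2-bit code per nucleotide (an assumption, not taken from the paper).
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def count_kmers(sequence, k):
    """Count every k-mer in `sequence`, keyed by its 2-bit packed integer."""
    if not 1 <= k <= 31:
        raise ValueError("k must be between 1 and 31")
    mask = (1 << (2 * k)) - 1  # keeps only the lowest 2k bits
    counts = Counter()
    packed = 0
    valid = 0  # consecutive valid bases seen since the last reset
    for base in sequence.upper():
        code = ENCODE.get(base)
        if code is None:
            # Ambiguous base such as N: no k-mer may span it, so restart.
            packed, valid = 0, 0
            continue
        # Slide the window: shift in the new base, drop the oldest one.
        packed = ((packed << 2) | code) & mask
        valid += 1
        if valid >= k:
            counts[packed] += 1
    return counts


def decode(packed, k):
    """Turn a packed integer back into its k-letter string."""
    letters = "ACGT"
    return "".join(letters[(packed >> (2 * (k - 1 - i))) & 3] for i in range(k))


if __name__ == "__main__":
    for packed, n in sorted(count_kmers("ACGTACGT", 3).items()):
        print(decode(packed, 3), n)
```

A parallel counter in the spirit of the abstract would replace the `Counter` update with an atomic operation on a shared table; that part is deliberately left out of this sketch.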

Keywords

Substring, Computer science, k-mer, Parallel computing, String (physics), Software, Suffix array, Lock (firearm), Hash table, Hash function, Bloom filter, Sequence assembly, Multi-core processor, Sequence (biology), Data structure, Theoretical computer science, Algorithm, Genome, Programming language, Biology, Mathematics

Publication Info

Year: 2011
Type: article
Volume: 27
Issue: 6
Pages: 764-770
Citations: 4605
Access: Closed

Citation Metrics

4605
OpenAlex

Cite This

Guillaume Marçais, Carl Kingsford (2011). A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27(6), 764-770. https://doi.org/10.1093/bioinformatics/btr011

Identifiers

DOI
10.1093/bioinformatics/btr011