Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences

Abstract

Motivation: In 2001 and 2002, we published two papers (Bioinformatics, 17, 282–283; Bioinformatics, 18, 77–82) describing an ultrafast protein sequence clustering program called cd-hit. This program can efficiently cluster a huge protein database with millions of sequences. However, the applications of the underlying algorithm are not limited to protein sequence clustering. Here we present several new programs that use the same algorithm: cd-hit-2d, cd-hit-est and cd-hit-est-2d. Cd-hit-2d compares two protein datasets and reports similar matches between them; cd-hit-est clusters a DNA/RNA sequence database; and cd-hit-est-2d compares two nucleotide datasets. All these programs can handle huge datasets with millions of sequences and can be hundreds of times faster than methods based on popular sequence comparison and database search tools such as BLAST.

Availability:

Contact: liwz@sdsc.edu
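
The abstract does not restate the shared algorithm. The earlier cd-hit papers describe a greedy incremental scheme: sequences are processed from longest to shortest, each one either joins the first existing representative it matches above an identity threshold or becomes a new representative, and a short-word (k-mer) count is used to skip comparisons that cannot reach the threshold. The following is a minimal Python sketch of that idea, not the cd-hit implementation. The names `greedy_cluster`, `matched_residues` and `kmers` are hypothetical, the defaults `threshold=0.9` and `k=3` are chosen only for illustration, and `difflib.SequenceMatcher` is a crude stand-in for a real sequence alignment.

```python
import math
from difflib import SequenceMatcher


def kmers(seq, k):
    """Set of all overlapping words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def matched_residues(a, b):
    """Residues covered by common blocks; a stand-in for alignment identity."""
    blocks = SequenceMatcher(None, a, b, autojunk=False).get_matching_blocks()
    return sum(block.size for block in blocks)


def greedy_cluster(seqs, threshold=0.9, k=3):
    """Return clusters as lists; element 0 of each list is the representative."""
    clusters = []   # one list of sequences per cluster
    rep_kmers = []  # cached k-mer set of each representative
    # Longest first, so every representative is at least as long as its members.
    for seq in sorted(seqs, key=len, reverse=True):
        length = len(seq)
        min_matches = math.ceil(threshold * length - 1e-9)
        # Each unmatched residue can destroy at most k overlapping k-mers.
        min_shared = (length - k + 1) - k * (length - min_matches)
        for cluster, words in zip(clusters, rep_kmers):
            shared = sum(seq[i:i + k] in words for i in range(length - k + 1))
            if shared < min_shared:
                continue  # short-word filter: threshold is unreachable
            if matched_residues(cluster[0], seq) >= min_matches:
                cluster.append(seq)
                break
        else:
            # No representative matched: seq starts a new cluster.
            clusters.append([seq])
            rep_kmers.append(kmers(seq, k))
    return clusters
```

Calling `greedy_cluster(list_of_sequences)` returns the clusters in the order their representatives were created. Under the same assumptions, a two-dataset comparison in the spirit of cd-hit-2d could reuse the filter and identity check, taking representatives only from the first dataset and queries only from the second.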

Keywords

Cluster analysis, Computer science, Sequence (biology), Protein sequencing, Sequence database, Sequence alignment, Data mining, Computational biology, Database, Bioinformatics, Peptide sequence, Biology, Genetics, Artificial intelligence, Gene

Related Publications

Search and clustering orders of magnitude faster than BLAST

R. C. Edgar

Motivation: Biological sequence data is accumulating rapidly, motivating the development of improved high-throughput methods for sequence classification. Results: UBLAS...

2010 Bioinformatics 20899 citations

Assembling millions of short DNA sequences using SSAKE

René L. Warren, Granger G. Sutton, Steven J. M. Jones, Robert A. Holt

Summary: Novel DNA sequencing technologies with the potential for up to three orders of magnitude more sequence throughput than conventional Sanger sequencing are emerging...

2006 Bioinformatics 500 citations

Generating consensus sequences from partial order multiple sequence alignment graphs

Christopher J. Lee

Motivation: Consensus sequence generation is important in many kinds of sequence analysis ranging from sequence assembly to profile-based iterative search methods. Howe...

2003 Bioinformatics 99 citations

Learning to Count: Robust Estimates for Labeled Distances between Molecular Sequences

John O’Brien, Vladimir N. Minin, Marc A. Suchard

Researchers routinely estimate distances between molecular sequences using continuous-time Markov chain models. We present a new method, robust counting, that protects against t...

2009 Molecular Biology and Evolution 113 citations

Gapped BLAST and PSI-BLAST: a new generation of protein database search programs

Stephen F. Altschul

The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and s...

1997 Nucleic Acids Research 73388 citations

Publication Info

Year: 2006
Type: article
Volume: 22
Issue: 13
Pages: 1658-1659
Citations: 11310
Access: Closed

Cite This

APA Style

Li, W., & Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13), 1658-1659. https://doi.org/10.1093/bioinformatics/btl158

Identifiers

DOI: 10.1093/bioinformatics/btl158