Abstract
Motivation: Sequence clustering replaces groups of similar sequences in a database with single representatives. Clustering large protein databases like the NCBI Non-Redundant database (NR) using even the best currently available clustering algorithms is very time-consuming and is practical only at relatively high sequence identity thresholds. Our previous program, CD-HI, clustered NR at 90% identity in ∼1 h and at 75% identity in ∼1 day on a 1 GHz Linux PC (Li et al., Bioinformatics, 17, 282, 2001); however, even faster clustering is needed because protein databases are growing rapidly and many applications require lower identity thresholds.

Results: In our previous algorithm (CD-HI), we employed short-word filters to speed up the clustering. In this paper, we show that tolerating some redundancy makes for more efficient use of these short-word filters and increases the program's speed 100-fold. Our new program implements this technique and clusters NR at 70% identity within 2 h, and at 50% identity in ∼5 days. Although some redundancy is present after clustering, the new program's results differ from those of our previous program by less than 0.4%.

Availability: The program and its previous version are available at http://bioinformatics.burnham-inst.org/cd-hi

Contact: liwz@burnham-inst.org; adam@burnham-inst.org
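The abstract does not spell out how a short-word filter works, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a simple ungapped counting argument: if two sequences of length L differ at m positions, each mismatch can destroy at most k of the L - k + 1 overlapping words of length k, so at least L - k + 1 - k·m words must be shared. A pair that shares fewer words cannot reach the identity threshold and can skip the expensive alignment. The function names, the default word length of 2, and the integer-percent threshold are all illustrative assumptions.

```python
from collections import Counter


def word_counts(seq, k):
    """Count every overlapping word (k-mer) of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_words(a, b, k):
    """Number of k-mers common to a and b, counted with multiplicity."""
    common = word_counts(a, k) & word_counts(b, k)  # per-word minimum count
    return sum(common.values())


def min_shared_words(length, identity_pct, k):
    """Lower bound on shared k-mers for an ungapped match of the given
    length at identity_pct percent identity (assumed simplification:
    each mismatch removes at most k words)."""
    mismatches = length * (100 - identity_pct) // 100
    return max(0, length - k + 1 - k * mismatches)


def passes_filter(representative, query, identity_pct, k=2):
    """Return False only when the pair cannot meet the threshold under the
    bound above; True means a full alignment is still needed to decide."""
    required = min_shared_words(len(query), identity_pct, k)
    return shared_words(representative, query, k) >= required


if __name__ == "__main__":
    rep = "MKVLAAGIVG"
    print(passes_filter(rep, "MKVLCAGIVG", 90))  # one mismatch in ten residues
    print(passes_filter(rep, "WWWWWWWWWW", 90))  # no words in common
```

Under this bound the required count drops to zero once k·m exceeds L - k + 1 (for example, a length-10 sequence at 50% identity with k = 2), at which point the filter rejects nothing. That is one way to see why filtering becomes harder as the identity threshold is lowered.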
Publication Info
- Year: 2002
- Type: article
- Volume: 18
- Issue: 1
- Pages: 77-82
- Citations: 505
- Access: Closed
Identifiers
- DOI: 10.1093/bioinformatics/18.1.77