Abstract

Motivation: Sequence clustering replaces groups of similar sequences in a database with single representatives. Clustering large protein databases like the NCBI Non-Redundant database (NR) using even the best currently available clustering algorithms is very time-consuming and only practical at relatively high sequence identity thresholds. Our previous program, CD-HI, clustered NR at 90% identity in ∼1 h and at 75% identity in ∼1 day on a 1 GHz Linux PC (Li et al., Bioinformatics, 17, 282, 2001); however, even faster clustering is needed because protein databases are growing rapidly and many applications require lower identity thresholds.

Results: Our previous algorithm (CD-HI) employed short-word filters to speed up the clustering. In this paper, we show that tolerating some redundancy makes for more efficient use of these short-word filters and increases the program's speed 100-fold. Our new program implements this technique and clusters NR at 70% identity within 2 h, and at 50% identity in ∼5 days. Although some redundancy is present after clustering, the results of our new program differ from those of our previous program by less than 0.4%.

Availability: The program and its previous version are available at http://bioinformatics.burnham-inst.org/cd-hi

Contact: liwz@burnham-inst.org; adam@burnham-inst.org
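The abstract does not spell out how a short-word filter works, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes greedy incremental clustering (longest sequence first), an ungapped identity measure, and a simple counting bound: a sequence of length L has L - k + 1 words of length k, and each mismatch can destroy at most k of them. All function names and parameters are hypothetical, and the redundancy-tolerant variant described in the paper is not reproduced here.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Multiset of all overlapping words of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def min_matches(length, identity):
    """Fewest matching positions needed to reach the identity threshold."""
    return math.ceil(identity * length - 1e-9)  # epsilon guards float error


def required_shared_words(length, identity, k):
    """Lower bound on shared k-words implied by the identity threshold.

    Assumption: ungapped alignment, so each mismatch removes at most k words.
    """
    max_mismatches = length - min_matches(length, identity)
    return max(0, (length - k + 1) - k * max_mismatches)


def best_ungapped_matches(short, long_):
    """Slide the shorter sequence along the longer one; return the best match count."""
    best = 0
    for offset in range(len(long_) - len(short) + 1):
        matches = sum(a == b for a, b in zip(short, long_[offset:]))
        best = max(best, matches)
    return best


def greedy_cluster(seqs, identity=0.8, k=2):
    """Map each representative to its cluster members (illustrative only)."""
    reps = []      # (representative sequence, its k-word counts)
    clusters = {}  # representative -> list of members
    for seq in sorted(seqs, key=len, reverse=True):
        words = kmer_counts(seq, k)
        need = required_shared_words(len(seq), identity, k)
        for rep, rep_words in reps:
            # Cheap filter: skip the costly comparison if too few words are shared.
            if sum((words & rep_words).values()) < need:
                continue
            if best_ungapped_matches(seq, rep) >= min_matches(len(seq), identity):
                clusters[rep].append(seq)
                break
        else:
            # No representative was similar enough: seq starts a new cluster.
            reps.append((seq, words))
            clusters[seq] = [seq]
    return clusters
```

The filter only ever skips comparisons that cannot reach the threshold under the stated ungapped assumption, so in this sketch it changes the running time but not the clusters. A real tool would use a gapped alignment for the final check.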

Keywords

Cluster analysis, Redundancy (engineering), Computer science, Identity (music), Database, Sequence (biology), Sequence database, Data mining, Artificial intelligence, Biology, Genetics, Operating system

Publication Info

Year
2002
Type
article
Volume
18
Issue
1
Pages
77-82
Citations
505
Access
Closed

Cite This

Weizhong Li, Lukasz Jaroszewski, Adam Godzik (2002). Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics, 18(1), 77-82. https://doi.org/10.1093/bioinformatics/18.1.77

Identifiers

DOI
10.1093/bioinformatics/18.1.77