Tolerating some redundancy significantly speeds up clustering of large protein databases

Abstract

Motivation: Sequence clustering replaces groups of similar sequences in a database with single representatives. Clustering large protein databases like the NCBI Non-Redundant database (NR) using even the best currently available clustering algorithms is very time-consuming and only practical at relatively high sequence identity thresholds. Our previous program, CD-HI, clustered NR at 90% identity in ∼1 h and at 75% identity in ∼1 day on a 1 GHz Linux PC (Li et al., Bioinformatics, 17, 282, 2001); however, even faster clustering is needed because protein databases are growing rapidly and many applications require lower attainable thresholds.

Results: Our previous algorithm (CD-HI) used short-word filters to speed up clustering. In this paper, we show that tolerating some redundancy makes for more efficient use of these short-word filters and increases the program's speed 100-fold. Our new program implements this technique and clusters NR at 70% identity within 2 h, and at 50% identity in ∼5 days. Although some redundancy is present after clustering, the new program's results differ from the previous program's by less than 0.4%.

Availability: The program and its previous version are available at http://bioinformatics.burnham-inst.org/cd-hi

Contact: liwz@burnham-inst.org; adam@burnham-inst.org
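The short-word filtering idea mentioned in the Results can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the word length `k`, and the integer-percent identity argument are assumptions made for illustration. The sketch relies on a standard counting argument: a sequence of length L contains L - k + 1 overlapping words of length k, and each mismatch in an ungapped match can destroy at most k of them, so two sequences above a given identity must share at least a certain number of words. Pairs that share fewer can be rejected without computing an alignment.

```python
from collections import Counter


def kmers(seq, k):
    """Multiset of all overlapping words of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_words(a, b, k):
    """Number of k-words the two sequences have in common (multiset intersection)."""
    return sum((kmers(a, k) & kmers(b, k)).values())


def guaranteed_min_shared(length, identity_pct, k):
    """Lower bound on shared k-words for an ungapped match of `length` residues.

    identity_pct is an integer percentage (an assumption of this sketch, used
    to avoid floating-point rounding). Each mismatch destroys at most k words.
    """
    mismatches = length * (100 - identity_pct) // 100
    return max(0, (length - k + 1) - k * mismatches)


def passes_filter(query, representative, identity_pct, k):
    """True if the pair shares enough words to be worth aligning."""
    needed = guaranteed_min_shared(len(query), identity_pct, k)
    return shared_words(query, representative, k) >= needed


if __name__ == "__main__":
    rep = "ACDEFGHIKL"
    close = "ACDEFGHIKV"      # one mismatch relative to rep
    unrelated = "WWWWWWWWWW"  # shares no words with rep
    print(passes_filter(close, rep, 90, 2))
    print(passes_filter(unrelated, rep, 90, 2))
    # At low identity the guaranteed bound collapses to zero,
    # so a strictly safe filter can no longer reject anything.
    print(guaranteed_min_shared(100, 50, 2))
```

In this sketch the guaranteed bound falls to zero at 50% identity with two-letter words, so a filter that never misses a true pair stops rejecting candidates. Requiring more shared words than the guaranteed minimum would restore the filter's rejecting power but could occasionally miss a genuinely similar pair and leave it unclustered. That is one reading of the speed-versus-redundancy trade-off the abstract describes; the thresholds the program actually uses are not stated here.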

Keywords

Cluster analysis; Redundancy (engineering); Computer science; Identity (music); Database; Sequence (biology); Sequence database; Data mining; Artificial intelligence; Biology; Genetics; Operating system

Affiliated Institutions

Sanford Burnham Prebys Medical Discovery Institute US

Publication Info

Year: 2002
Type: article
Volume: 18
Issue: 1
Pages: 77-82
Citations: 505
Access: Closed

Cite This

APA Style

Weizhong Li, Lukasz Jaroszewski, Adam Godzik (2002). Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics, 18(1), 77-82. https://doi.org/10.1093/bioinformatics/18.1.77

Identifiers

DOI: 10.1093/bioinformatics/18.1.77