Local homology recognition and distance measures in linear time using compressed amino acid alphabets

Abstract

We describe methods for discovering local similarities and estimating evolutionary distance by identifying k-mers (contiguous subsequences of length k) common to two sequences. Given unaligned sequences of length L, these methods have O(L) time complexity. We investigated whether compressed amino acid alphabets can extend these techniques to distantly related proteins. The algorithms were evaluated for different alphabets and choices of k using a test set of 1848 pairs of structurally alignable sequences selected from the FSSP database. Distance measures derived from k-mer counting were found to correlate well with percentage identity derived from sequence alignments. Compressed alphabets improved performance in local similarity discovery, but no evidence was found of improvement in distance estimates. The local similarity discovery method was compared with the fast Fourier transform (FFT) used in MAFFT, which has O(L log L) time complexity. Our method achieves coverage comparable to FFT and is more than an order of magnitude faster. We suggest using k-mer distance for fast, approximate phylogenetic tree construction, and show that a speed improvement of more than three orders of magnitude can be achieved relative to standard distance methods, which require alignments.
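
The ideas in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the six-group alphabet in `GROUPS`, the function names, and the exact similarity formula (shared k-mer count divided by the number of k-mers in the shorter sequence) are assumptions chosen for illustration. For a fixed k, counting k-mers with a hash table takes time proportional to sequence length.

```python
from collections import Counter, defaultdict

# Hypothetical compressed alphabet: amino acids grouped by rough
# physicochemical similarity. This grouping is an illustrative
# assumption, not one of the alphabets evaluated in the paper.
GROUPS = ["AGPST", "C", "DENQ", "FWY", "HKR", "ILMV"]
COMPRESS = {aa: str(i) for i, grp in enumerate(GROUPS) for aa in grp}


def compress(seq):
    """Map each residue to its group label; unknown residues become 'X'."""
    return "".join(COMPRESS.get(aa, "X") for aa in seq.upper())


def kmer_counts(seq, k):
    """Count every contiguous subsequence of length k (one pass over seq)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_similarity(a, b, k):
    """Fraction of k-mers shared by a and b, counting multiplicity.

    Assumed formula: sum over k-mers of min(count in a, count in b),
    divided by the number of k-mers in the shorter sequence.
    A distance can then be taken as 1 - similarity.
    """
    denom = min(len(a), len(b)) - k + 1
    if denom <= 0:
        return 0.0
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    shared = sum(min(n, cb[kmer]) for kmer, n in ca.items())
    return shared / denom


def shared_kmer_diagonals(a, b, k):
    """Count shared k-mers per diagonal (offset i - j) of the a-vs-b matrix.

    Diagonals with many hits suggest candidate regions of local similarity.
    Runs in linear time when each k-mer occurs only a few times; highly
    repetitive sequences can produce more hits than that.
    """
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)
    diagonals = Counter()
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], ()):
            diagonals[i - j] += 1
    return diagonals
```

To compare distantly related proteins under this sketch, both sequences would first be passed through `compress` and the compressed strings given to `kmer_similarity` or `shared_kmer_diagonals`, so that substitutions within a group still produce matching k-mers.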

Keywords

Fast Fourier transform, Phylogenetic tree, Edit distance, Similarity (geometry), Biology, Time complexity, Pattern recognition (psychology), Sequence (biology), Mathematics, Smith–Waterman algorithm, Tree (set theory), Algorithm, Computer science, Sequence alignment, Combinatorics, Artificial intelligence, Genetics, Peptide sequence

Affiliated Institutions

Mill Valley Public Library US

Publication Info

Year: 2004
Type: article
Volume: 32
Issue: 1
Pages: 380-385
Citations: 151
Access: Closed

Cite This

APA Style

Edgar, R. C. (2004). Local homology recognition and distance measures in linear time using compressed amino acid alphabets. Nucleic Acids Research, 32(1), 380-385. https://doi.org/10.1093/nar/gkh180

Identifiers

DOI: 10.1093/nar/gkh180