Abstract
Methods for discovery of local similarities and estimation of evolutionary distance by identifying k-mers (contiguous subsequences of length k) common to two sequences are described. Given unaligned sequences of length L, these methods have O(L) time complexity. The ability of compressed amino acid alphabets to extend these techniques to distantly related proteins was investigated. The performance of these algorithms was evaluated for different alphabets and choices of k using a test set of 1848 pairs of structurally alignable sequences selected from the FSSP database. Distance measures derived from k-mer counting were found to correlate well with percentage identity derived from sequence alignments. Compressed alphabets were seen to improve performance in local similarity discovery, but no evidence was found of improvements when applied to distance estimates. The performance of our local similarity discovery method was compared with the fast Fourier transform (FFT) used in MAFFT, which has O(L log L) time complexity. The method for achieving comparable coverage to FFT is revealed here, and is more than an order of magnitude faster. We suggest using k-mer distance for fast, approximate phylogenetic tree construction, and show that a speed improvement of more than three orders of magnitude can be achieved relative to standard distance methods, which require alignments.
Keywords
Affiliated Institutions
Related Publications
A Simple, Fast, and Accurate Algorithm to Estimate Large Phylogenies by Maximum Likelihood
The increase in the number of large data sets and the complexity of current probabilistic sequence evolution models necessitates fast and reliable phylogeny reconstruction metho...
IQPNNI: Moving Fast Through Tree Space and Stopping in Time
An efficient tree reconstruction method (IQPNNI) is introduced to reconstruct a phylogenetic tree based on DNA or amino acid sequence data. Our approach combines various fast al...
PANDIT: an evolution-centric database of protein and associated nucleotide domains with inferred trees
PANDIT is a database of homologous sequence alignments accompanied by estimates of their corresponding phylogenetic trees. It provides a valuable resource to those studying phyl...
MUSCLE: multiple sequence alignment with high accuracy and high throughput
We describe MUSCLE, a new computer program for creating multiple alignments of protein sequences. Elements of the algorithm include fast distance estimation using kmer counting,...
VSEARCH: a versatile open source tool for metagenomics
Background VSEARCH is an open source and free of charge multithreaded 64-bit tool for processing and preparing metagenomics, genomics and population genomics nucleotide sequence...
Publication Info
- Year
- 2004
- Type
- article
- Volume
- 32
- Issue
- 1
- Pages
- 380-385
- Citations
- 151
- Access
- Closed
External Links
Social Impact
Social media, news, blog, policy document mentions
Citation Metrics
Cite This
Identifiers
- DOI
- 10.1093/nar/gkh180