Abstract
Abstract Motivation: Multiple sequence alignment is a basic part of much biological research, including phylogeny estimation and protein structure and function prediction. Different alignments on the same set of unaligned sequences are often compared, sometimes in order to assess the accuracy of alignment methods or to infer a consensus alignment from a set of estimated alignments. Three of the standard techniques for comparing alignments, Developer, Modeler and Total Column (TC) scores can be derived through calculations of the set of homologies that the alignments share. However, the brute-force technique for calculating this set is quadratic in the input size. The remaining standard technique, Cline Shift Score, inherently requires quadratic time. Results: In this article, we prove that each of these scores can be computed in linear time, and we present FastSP, a linear-time algorithm for calculating these scores. Even on the largest alignments we explored (one with 50 000 sequences), FastSP completed <2 min and used at most 2 GB of the main memory. The best alternative is qscore, a method whose empirical running time is approximately the same as FastSP when given sufficient memory (at least 8 GB), but whose asymptotic running time has never been theoretically established. In addition, for comparisons of large alignments under lower memory conditions (at most 4 GB of main memory), qscore uses substantial memory (up to 10 GB for the datasets we studied), took more time and failed to analyze the largest datasets. Availability: The open-source software and executables are available online at http://www.cs.utexas.edu/~phylo/software/fastsp/. Contact: tandy@cs.utexas.edu
Keywords
Affiliated Institutions
Related Publications
Aligning two sequences within a specified diagonal band
We describe an algorithm for aligning two sequences within a diagonal band that requires only O(NW) computation time and O(N) space, where N is the length of the shorter of the ...
FastTree: Computing Large Minimum Evolution Trees with Profiles instead of a Distance Matrix
Gene families are growing rapidly, but standard methods for inferring phylogenies do not scale to alignments with over 10,000 sequences. We present FastTree, a method for constr...
FAMSA: Fast and accurate multiple sequence alignment of huge protein families
Abstract Rapid development of modern sequencing platforms has contributed to the unprecedented growth of protein families databases. The abundance of sets containing hundreds of...
MMseqs software suite for fast and deep clustering and searching of large protein sequence sets
Abstract Motivation: Sequence databases are growing fast, challenging existing analysis pipelines. Reducing the redundancy of sequence databases by similarity clustering improve...
DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment.
Abstract MOTIVATION: The performance and time complexity of an improved version of the segment-to-segment approach to multiple sequence alignment is discussed. In this approach,...
Publication Info
- Year
- 2011
- Type
- article
- Volume
- 27
- Issue
- 23
- Pages
- 3250-3258
- Citations
- 63
- Access
- Closed
External Links
Social Impact
Social media, news, blog, policy document mentions
Citation Metrics
Cite This
Identifiers
- DOI
- 10.1093/bioinformatics/btr553