Abstract

Protein sequence database search programs may be evaluated both for their retrieval accuracy (the ability to separate meaningful from chance similarities) and for the accuracy of their statistical assessments of reported alignments. However, methods for improving statistical accuracy can degrade retrieval accuracy by discarding compositional evidence of sequence relatedness. This evidence may be preserved by combining essentially independent measures of alignment and compositional similarity into a unified measure of sequence similarity. A version of the BLAST protein database search program, modified to employ this new measure, outperforms the baseline program in both retrieval and statistical accuracy on ASTRAL, a SCOP-based test set.
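
The abstract does not say how the two measures are combined. As a minimal sketch, assuming each measure is expressed as a P-value and that the two are independent, one standard rule is to take the P-value of their product: for independent uniform P-values, the probability that the product is at most x equals x(1 - ln x). The function name and the choice of this rule are illustrative assumptions, not the authors' stated method.

```python
import math


def combine_independent_pvalues(p_alignment: float, p_composition: float) -> float:
    """Return the P-value of the product of two independent uniform(0, 1) P-values.

    Hypothetical helper: the names and the combination rule are assumptions
    made for illustration, not taken from the paper.
    """
    for p in (p_alignment, p_composition):
        if not 0.0 <= p <= 1.0:
            raise ValueError("P-values must lie in [0, 1]")
    product = p_alignment * p_composition
    if product == 0.0:
        return 0.0  # limit of x * (1 - ln x) as x -> 0
    # P(U1 * U2 <= x) = x * (1 - ln x) for independent uniform U1, U2
    return product * (1.0 - math.log(product))
```

Under these assumptions, two weak signals with P = 0.1 each combine to roughly 0.056, which is larger than the naive product 0.01 because the product of two P-values is not itself a P-value.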

Keywords

Similarity (geometry); Sequence (biology); Set (abstract data type); Computer science; Information retrieval; Similarity measure; Measure (data warehouse); Database; Data mining; Biology; Artificial intelligence; Genetics

MeSH Terms

Data Interpretation, Statistical; Databases, Protein; Reproducibility of Results; Sequence Alignment; Sequence Analysis, Protein; Software

Publication Info

Year: 2006
Type: article
Volume: 34
Issue: 20
Pages: 5966-5973
Citations: 55
Access: Closed

Citation Metrics

OpenAlex: 55
Influential: 4
CrossRef: 49

Cite This

Yi‐Kuo Yu, E. Michael Gertz, Richa Agarwala et al. (2006). Retrieval accuracy, statistical significance and compositional similarity in protein sequence database searches. Nucleic Acids Research, 34(20), 5966-5973. https://doi.org/10.1093/nar/gkl731

Identifiers

DOI: 10.1093/nar/gkl731
PMID: 17068079
PMCID: PMC1635310
