Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements

2001 Nucleic Acids Research 1,381 citations

Abstract

PSI-BLAST is an iterative program to search a database for proteins with distant similarity to a query sequence. We investigated over a dozen modifications to the methods used in PSI-BLAST, with the goal of improving accuracy in finding true positive matches. To evaluate performance we used a set of 103 queries for which the true positives in yeast had been annotated by human experts, and a popular measure of retrieval accuracy (ROC) that can be normalized to take on values between 0 (worst) and 1 (best). The modifications we consider novel improve the ROC score from 0.758 +/- 0.005 to 0.895 +/- 0.003. This does not include the benefits from four modifications we included in the 'baseline' version, even though they were not implemented in PSI-BLAST version 2.0. The improvement in accuracy was confirmed on a small second test set. This test involved analyzing three protein families with curated lists of true positives from the non-redundant protein database. The modification that accounts for the majority of the improvement is the use, for each database sequence, of a position-specific scoring system tuned to that sequence's amino acid composition. The use of composition-based statistics is particularly beneficial for large-scale automated applications of PSI-BLAST.
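The normalized ROC measure mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which truncation point or tie-handling rule was used, so the code below assumes the commonly used truncated form, ROC_n: for each of the first n false positives in the ranked hit list, count the true positives ranked above it, sum those counts, and divide by n times the total number of true positives. The function name `roc_n`, its parameters, and the padding rule for lists with fewer than n false positives are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def roc_n(ranked_labels: Sequence[bool], total_true: int, n: int) -> float:
    """Truncated, normalized ROC score in [0, 1] (hypothetical helper).

    ranked_labels: search hits ordered best-first; True marks a true
                   positive, False marks a false positive.
    total_true:    number of true positives that exist for the query,
                   whether or not the search retrieved them.
    n:             number of false positives at which to truncate.
    """
    if total_true <= 0 or n <= 0:
        raise ValueError("total_true and n must be positive")

    true_seen = 0   # true positives encountered so far
    false_seen = 0  # false positives encountered so far
    area = 0        # sum of true-positive counts at each false positive

    for is_true in ranked_labels:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen
            if false_seen == n:
                break

    # Assumption: if the list holds fewer than n false positives, the
    # missing ones are treated as ranked after every retrieved hit.
    area += (n - false_seen) * true_seen

    return area / (n * total_true)


if __name__ == "__main__":
    # Two true positives, a false positive, one more true positive,
    # then a second false positive; three true positives exist in total.
    hits = [True, True, False, True, False]
    print(roc_n(hits, total_true=3, n=2))
```

A ranking that places every true positive ahead of the first false positive scores 1, and one that places n false positives ahead of any true positive scores 0, matching the range described in the abstract.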

Keywords

False positive paradox; Similarity (geometry); Set (abstract data type); Sequence (biology); Biology; Computer science; Database; Test set; Data mining; Information retrieval; Statistics; Artificial intelligence; Mathematics; Genetics

MeSH Terms

Algorithms; Amino Acids; Animals; Computational Biology; Databases, Factual; Humans; Information Storage and Retrieval; Proteins; Reproducibility of Results; Sensitivity and Specificity; Sequence Alignment; Software

Publication Info

Year
2001
Type
review
Volume
29
Issue
14
Pages
2994-3005
Citations
1381
Access
Closed

Citation Metrics

OpenAlex
1381
Influential
166
CrossRef
1138

Cite This

A. A. Schaffer et al. (2001). Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Research, 29(14), 2994-3005. https://doi.org/10.1093/nar/29.14.2994

Identifiers

DOI
10.1093/nar/29.14.2994
PMID
11452024
PMCID
PMC55814
