Abstract
Protein-protein interactions (PPIs) are essential to most fundamental cellular processes, and interest in reconstructing PPI networks has been growing. However, obtaining reliable predictions remains difficult. Notably, false positive rates can exceed 80%. Correcting errors separately at each generating source is both time-consuming and inefficient, because a single test can rarely cover errors introduced at multiple levels of data processing. We propose a novel Bayesian integration method, termed nonparametric Bayes ensemble learning (NBEL), that lowers the misclassification rate (both false positives and false negatives) by automatically up-weighting the most informative data sources while down-weighting less informative and biased ones. Extensive studies indicate that NBEL is significantly more robust than classic naïve Bayes to unreliable, error-prone and contaminated data. On a large human data set, NBEL predicts many more PPIs than naïve Bayes, which suggests that previous studies may contain large numbers of false negatives as well as false positives. Validation on two high-quality human PPI datasets supports these observations. Our experiments demonstrate that it is feasible to predict PPIs computationally at high throughput with substantially reduced false positives and false negatives. The ability to predict large numbers of PPIs both reliably and automatically may encourage the use of computational approaches to correct data errors in general, and may speed up high-quality PPI prediction. Such reliable predictions may provide a solid platform for other studies, such as protein function prediction and the roles of PPIs in disease susceptibility.
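The contrast the abstract draws between naïve Bayes and source weighting can be illustrated with a toy calculation. The sketch below is not the NBEL method, whose details are not given here. It assumes each data source contributes a log-likelihood ratio for a candidate protein pair and applies hand-picked weights, where a real method would learn them from data. The function name `posterior_from_llrs`, the prior, the likelihood ratios and the weights are all hypothetical values chosen for illustration.

```python
import math


def posterior_from_llrs(prior, llrs, weights=None):
    """Combine per-source log-likelihood ratios into a posterior probability.

    With weights=None every source counts fully, which is the plain
    naive Bayes combination under conditional independence. A weight
    below 1 shrinks that source's contribution toward "no evidence".
    """
    if weights is None:
        weights = [1.0] * len(llrs)
    prior_log_odds = math.log(prior / (1.0 - prior))
    log_odds = prior_log_odds + sum(w * l for w, l in zip(weights, llrs))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical evidence for one candidate pair: two sources mildly favour
# an interaction, and a third, assumed to be contaminated, favours it strongly.
prior = 0.01
llrs = [math.log(2.0), math.log(1.5), math.log(50.0)]

# Plain naive Bayes trusts all three sources equally.
naive = posterior_from_llrs(prior, llrs)

# Down-weighting the third source (weight 0.1 is an arbitrary assumption).
weighted = posterior_from_llrs(prior, llrs, weights=[1.0, 1.0, 0.1])

print(f"naive Bayes posterior: {naive:.4f}")
print(f"weighted posterior:    {weighted:.4f}")
```

In this toy setting a single unreliable source is enough to push the naïve Bayes posterior above 0.5, while the down-weighted combination stays low. This is the kind of false positive the abstract attributes to treating all sources as equally trustworthy.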
Related Publications
The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation
Abstract Background To evaluate binary classifications and their confusion matrices, scientific researchers can employ several statistical rates, accordingly to the goal of the ...
Enrichment of High-Throughput Screening Data with Increasing Levels of Noise Using Support Vector Machines, Recursive Partitioning, and Laplacian-Modified Naive Bayesian Classifiers
High-throughput screening (HTS) plays a pivotal role in lead discovery for the pharmaceutical industry. In tandem, cheminformatics approaches are employed to increase the probab...
A review of methods for the assessment of prediction errors in conservation presence/absence models
Predicting the distribution of endangered species from habitat data is frequently perceived to be a useful technique. Models that predict the presence or absence of a species ar...
ROC Curves for Classification Trees
A common problem in medical diagnosis is to combine information from several tests or patient characteristics into a decision rule to distinguish diseased from healthy patients....
Hidden Markov models for detecting remote protein homologies.
Abstract MOTIVATION: A new hidden Markov model method (SAM-T98) for finding remote homologs of protein sequences is described and evaluated. The method begins with a single targ...
Publication Info
- Year: 2011
- Type: article
- Volume: 7
- Issue: 7
- Pages: e1002110
- Citations: 14
- Access: Closed
Identifiers
- DOI: 10.1371/journal.pcbi.1002110
- PMID: 21829334
- PMCID: PMC3145649