Abstract
A probabilistic analysis of the Rocchio relevance feedback algorithm, one of the most popular learning methods from information retrieval, is presented in a text categorization framework. The analysis results in a probabilistic version of the Rocchio classifier and offers an explanation for the TFIDF word weighting heuristic. The Rocchio classifier, its probabilistic variant, and a standard naive Bayes classifier are compared on three text categorization tasks. The results suggest that the probabilistic algorithms are preferable to the heuristic Rocchio classifier.
This research is sponsored by the Wright Laboratory, Aeronautical Systems Center, Air Force Materiel Command, USAF, and the Advanced Research Projects Agency (ARPA) under grant F33615-93-1-1330. The US Government is authorized to reproduce and distribute reprints for Government purposes, notwithstanding any copyright notation thereon. Views and conclusions contained in this document are those of the authors and should not be ...
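The abstract names the heuristic Rocchio classifier with TFIDF weighting without spelling it out. The following is a minimal sketch of one common textbook formulation, not necessarily the exact variant analyzed in the paper. The function names, the tf * log(N / df) weighting, the `alpha` and `beta` defaults, the clipping of negative weights, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def weigh(tokens, idf):
    """TFIDF-weight one tokenized document and scale it to unit length."""
    tf = Counter(t for t in tokens if t in idf)  # unseen words are ignored
    vec = {w: count * idf[w] for w, count in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {w: v / norm for w, v in vec.items()} if norm > 0 else {}


def tfidf_vectors(docs):
    """Return (vectors, idf) for tokenized docs, with idf = log(N / df)."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    idf = {w: math.log(len(docs) / df[w]) for w in df}
    return [weigh(tokens, idf) for tokens in docs], idf


def train_rocchio(vectors, labels, alpha=1.0, beta=0.25):
    """One prototype per class: alpha * mean(in-class) - beta * mean(other)."""
    prototypes = {}
    for c in set(labels):
        pos = [v for v, y in zip(vectors, labels) if y == c]
        neg = [v for v, y in zip(vectors, labels) if y != c]
        proto = defaultdict(float)
        for v in pos:
            for w, x in v.items():
                proto[w] += alpha * x / len(pos)
        for v in neg:
            for w, x in v.items():
                proto[w] -= beta * x / len(neg)
        # Assumed convention: drop components that end up non-positive.
        prototypes[c] = {w: x for w, x in proto.items() if x > 0}
    return prototypes


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * b.get(w, 0.0) for w, x in a.items())
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def classify(vec, prototypes):
    """Assign the class whose prototype is most similar to the document."""
    return max(prototypes, key=lambda c: cosine(vec, prototypes[c]))


# Toy corpus (hypothetical data, for illustration only).
docs = [
    ["ball", "goal", "team"],
    ["team", "match", "goal"],
    ["stock", "market", "price"],
    ["price", "market", "trade"],
]
labels = ["sports", "sports", "finance", "finance"]

vectors, idf = tfidf_vectors(docs)
prototypes = train_rocchio(vectors, labels)
print(classify(weigh(["goal", "match"], idf), prototypes))
```

Each class is represented by a single prototype vector, and a new document goes to the class whose prototype has the highest cosine similarity with it. This is the heuristic baseline; the probabilistic variant and the naive Bayes classifier mentioned in the abstract are not sketched here.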
Publication Info
- Year: 1997
- Type: article
- Pages: 143-151
- Citations: 1265
- Access: Closed