Abstract

A probabilistic analysis of the Rocchio relevance feedback algorithm, one of the most popular learning methods from information retrieval, is presented in a text categorization framework. The analysis results in a probabilistic version of the Rocchio classifier and offers an explanation for the TFIDF word weighting heuristic. The Rocchio classifier, its probabilistic variant and a standard naive Bayes classifier are compared on three text categorization tasks. The results suggest that the probabilistic algorithms are preferable to the heuristic Rocchio classifier.

This research is sponsored by the Wright Laboratory, Aeronautical Systems Center, Air Force Materiel Command, USAF, and the Advanced Research Projects Agency (ARPA) under grant F33615-93-1-1330. The US Government is authorized to reproduce and distribute reprints for Government purposes, notwithstanding any copyright notation thereon. Views and conclusions contained in this document are those of the authors and should not be ...

Keywords

Computer science, Probabilistic logic, tf–idf, Artificial intelligence, Categorization, Classifier (UML), Naive Bayes classifier, Text categorization, Weighting, Relevance feedback, Pattern recognition (psychology), Statistical classification, Machine learning, Natural language processing, Medicine, Support vector machine

Publication Info

Year: 1997
Type: article
Pages: 143-151
Citations: 1265 (OpenAlex)
Access: Closed

Cite This

Thorsten Joachims (1997). A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization, pp. 143-151.