Abstract

We describe the results of extensive experiments using optimized rule-based induction methods on large document collections. The goal of these methods is to discover automatically classification patterns that can be used for general document categorization or personalized filtering of free text. Previous reports indicate that human-engineered rule-based systems, requiring many man-years of developmental efforts, have been successfully built to “read” documents and assign topics to them. We show that machine-generated decision rules appear comparable to human performance, while using the identical rule-based representation. In comparison with other machine-learning techniques, results on a key benchmark from the Reuters collection show a large gain in performance, from a previously reported 67% recall/precision breakeven point to 80.5%. In the context of a very high-dimensional feature space, several methodological alternatives are examined, including universal versus local dictionaries, and binary versus frequency-related features.
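The headline figure above is a recall/precision breakeven point, the value at which a classifier's precision and recall are equal. The following is a minimal sketch of one common way to compute it, not the authors' evaluation procedure. It assumes each document carries a numeric score for a topic, which is an assumption, since a rule set may instead produce hard yes/no decisions. The function names `precision_recall` and `breakeven_point` are hypothetical.

```python
from typing import Sequence, Tuple


def precision_recall(predicted: Sequence[bool], actual: Sequence[bool]) -> Tuple[float, float]:
    """Return (precision, recall) for one topic's binary decisions."""
    true_pos = sum(1 for p, a in zip(predicted, actual) if p and a)
    n_assigned = sum(1 for p in predicted if p)   # documents the classifier labeled
    n_relevant = sum(1 for a in actual if a)      # documents that truly belong
    precision = true_pos / n_assigned if n_assigned else 0.0
    recall = true_pos / n_relevant if n_relevant else 0.0
    return precision, recall


def breakeven_point(scores: Sequence[float], actual: Sequence[bool]) -> float:
    """Precision (= recall) when exactly as many documents are assigned
    as are truly relevant.

    Assumption: documents can be ranked by a score. At that cutoff both
    measures share the same numerator and denominator, so they coincide.
    """
    n_relevant = sum(1 for a in actual if a)
    if n_relevant == 0:
        raise ValueError("breakeven is undefined with no relevant documents")
    # Rank document indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    assigned = set(ranked[:n_relevant])
    predicted = [i in assigned for i in range(len(scores))]
    precision, _recall = precision_recall(predicted, actual)
    return precision


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
    toy_labels = [True, False, True, True, False]
    print(breakeven_point(toy_scores, toy_labels))
```

In the toy data three documents are relevant, so the top three by score are assigned. Two of those are correct, which gives precision and recall of 2/3 each. For a system that emits hard decisions, the usual alternative is to vary a tuning parameter and interpolate between the nearest precision and recall values.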

Keywords

Computer science, Artificial intelligence, Categorization, Machine learning, Benchmark (surveying), Key (lock), Representation (politics), Precision and recall, Context (archaeology), Text categorization, Rule induction, Feature (linguistics), Point (geometry), Binary classification, Information retrieval, Natural language processing, Data mining, Support vector machine

Related Publications

The multiscale classifier

Proposes a rule-based inductive learning algorithm called multiscale classification (MSC). It can be applied to any N-dimensional real or binary classification problem to classi...

1996 IEEE Transactions on Pattern Analysis... 50 citations

Publication Info

Year: 1994
Type: article
Volume: 12
Issue: 3
Pages: 233-251
Citations: 860
Access: Closed

Citation Metrics

860 citations (OpenAlex)

Cite This

Chidanand Apté, Fred J. Damerau, Sholom M. Weiss (1994). Automated learning of decision rules for text categorization. ACM Transactions on Information Systems, 12(3), 233-251. https://doi.org/10.1145/183422.183423

Identifiers

DOI: 10.1145/183422.183423