Abstract
We describe the results of extensive experiments using optimized rule-based induction methods on large document collections. The goal of these methods is to automatically discover classification patterns that can be used for general document categorization or personalized filtering of free text. Previous reports indicate that human-engineered rule-based systems, requiring many man-years of development effort, have been successfully built to “read” documents and assign topics to them. We show that machine-generated decision rules appear comparable to human performance while using the identical rule-based representation. In comparison with other machine-learning techniques, results on a key benchmark from the Reuters collection show a large gain in performance, from a previously reported 67% recall/precision breakeven point to 80.5%. In the context of a very high-dimensional feature space, several methodological alternatives are examined, including universal versus local dictionaries and binary versus frequency-related features.
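The recall/precision breakeven point quoted above is the value at which precision and recall are equal as the decision cutoff is varied. The abstract does not say how the figure was computed, so the following is only a minimal sketch under assumptions: it supposes documents can be ranked by classifier confidence, and the function names and the toy ranking are hypothetical illustrations, not taken from the paper.

```python
def precision_recall(ranked_labels, k):
    """Precision and recall when the top-k ranked documents are assigned the topic.

    Assumes ranked_labels contains at least one positive (1) entry.
    """
    true_positives = sum(ranked_labels[:k])
    total_positives = sum(ranked_labels)
    precision = true_positives / k
    recall = true_positives / total_positives
    return precision, recall


def breakeven_point(ranked_labels):
    """Sweep every cutoff and return the mean of precision and recall
    at the cutoff where the two are closest (equal, if such a cutoff exists)."""
    best = None
    for k in range(1, len(ranked_labels) + 1):
        p, r = precision_recall(ranked_labels, k)
        gap = abs(p - r)
        if best is None or gap < best[0]:
            best = (gap, (p + r) / 2)
    return best[1]


# Hypothetical toy ranking: 1 = document truly belongs to the topic, 0 = it does not,
# ordered from the classifier's most confident to least confident prediction.
toy_ranking = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]
print(breakeven_point(toy_ranking))
```

In this toy ranking there are five relevant documents, and at a cutoff of five documents three of them are retrieved, so precision and recall are both 3/5. More generally, when the cutoff equals the number of relevant documents the two measures share the same numerator and denominator, which is why a breakeven value exists for a full ranking.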
Related Publications
A Comparative Study on Feature Selection in Text Categorization
This paper is a comparative study of feature selection methods in statistical learning of text categorization. The focus is on aggressive dimensionality reduction. Five methods ...
A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization
In this work we investigate the usefulness of n-grams for document indexing in text categorization (TC). We call n-gram a set g_k of n word stems, and we say that g_k occurs in a d...
The multiscale classifier
Proposes a rule-based inductive learning algorithm called multiscale classification (MSC). It can be applied to any N-dimensional real or binary classification problem to classi...
Machine learning in automated text categorization
The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of ...
Enhanced hypertext categorization using hyperlinks
A major challenge in indexing unstructured hypertext databases is to automatically extract meta-data that enables structured search using topic taxonomies, circumvents keyword a...
Publication Info
- Year: 1994
- Type: article
- Volume: 12
- Issue: 3
- Pages: 233-251
- Citations: 860
- Access: Closed
Identifiers
- DOI: 10.1145/183422.183423