Abstract
Categorization of documents is challenging, as the number of discriminating words can be very large. We present a nearest-neighbor classification scheme for text categorization in which the importance of discriminating words is learned using mutual information and weight adjustment techniques. The nearest neighbors for a particular document are then computed based on the matching words and their weights. We evaluate our scheme on both synthetic and real-world documents. Our experiments with synthetic data sets show that this scheme is robust under different emulated conditions. Empirical results on real-world documents demonstrate that this scheme outperforms state-of-the-art classification algorithms such as C4.5, RIPPER, Rainbow, and PEBLS.
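The general idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes documents are represented as sets of words, uses the mutual information between word presence and class label as a fixed word weight, and omits the weight adjustment step entirely. The function names (`mutual_information`, `similarity`, `classify`) and the choice of `k` are hypothetical.

```python
import math
from collections import Counter


def mutual_information(docs, labels):
    """Return {word: MI(word presence; class)} in nats.

    Assumption: each document is a set of words, and a word's weight is
    the mutual information between its presence and the class label.
    """
    n = len(docs)
    class_counts = Counter(labels)
    vocab = set().union(*docs)
    weights = {}
    for word in vocab:
        # Joint counts over (word present?, class); only observed pairs
        # appear, so no log(0) terms arise.
        joint = Counter((word in d, c) for d, c in zip(docs, labels))
        n_present = sum(1 for d in docs if word in d)
        mi = 0.0
        for (present, c), count in joint.items():
            p_joint = count / n
            p_word = (n_present if present else n - n_present) / n
            p_class = class_counts[c] / n
            mi += p_joint * math.log(p_joint / (p_word * p_class))
        weights[word] = mi
    return weights


def similarity(a, b, weights):
    """Sum the weights of the words that two documents share."""
    return sum(weights.get(w, 0.0) for w in a & b)


def classify(query, docs, labels, weights, k=3):
    """Majority vote among the k training documents most similar to query."""
    ranked = sorted(
        zip(docs, labels),
        key=lambda pair: similarity(query, pair[0], weights),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training data, invented purely for illustration.
train_docs = [
    {"ball", "goal", "team"},
    {"team", "score", "goal"},
    {"stock", "market", "price"},
    {"market", "trade", "price"},
]
train_labels = ["sports", "sports", "finance", "finance"]
word_weights = mutual_information(train_docs, train_labels)
prediction = classify({"goal", "team", "price"}, train_docs, train_labels, word_weights)
```

In this toy setup, words that occur in every document of one class and in none of the other (such as "goal") receive a higher weight than words that occur in only one document (such as "ball"), so shared discriminating words dominate the neighbor ranking.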
Publication Info
- Year: 1999
- Type: report
- Citations: 82
- Access: Closed
Identifiers
- DOI: 10.21236/ada439688