Abstract
This paper is a comparative study of feature selection methods in statistical learning of text categorization. The focus is on aggressive dimensionality reduction. Five methods were evaluated, including term selection based on document frequency (DF), information gain (IG), mutual information (MI), a χ²-test (CHI), and term strength (TS). We found IG and CHI most effective in our experiments. Using IG thresholding with a k-nearest-neighbor classifier on the Reuters corpus, removal of up to 98% of unique terms actually yielded improved classification accuracy (measured by average precision). DF thresholding performed similarly. Indeed, we found strong correlations between the DF, IG, and CHI values of a term. This suggests that DF thresholding, the simplest method with the lowest computational cost, can be reliably used instead of IG or CHI when the computation of these measures is too expensive. TS compares favorably with the other methods with up to 50% vocabulary reduction...
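Since the abstract names the selection criteria but the page omits their definitions, the following is a minimal sketch (not the authors' code) of how the three best-performing measures, DF, IG, and χ² (CHI), can be scored per term from a labeled corpus. The corpus format and the function name `term_scores` are illustrative assumptions.

```python
import math
from collections import defaultdict

def term_scores(docs):
    """docs: list of (set_of_terms, category_label) pairs.
    Returns per-term DF, IG, and max-over-categories chi-square scores."""
    n = len(docs)
    cat_counts = defaultdict(int)   # documents per category
    df = defaultdict(int)           # documents containing each term
    joint = defaultdict(int)        # (term, category) document counts
    for terms, cat in docs:
        cat_counts[cat] += 1
        for t in terms:
            df[t] += 1
            joint[(t, cat)] += 1

    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    h_cats = entropy([c / n for c in cat_counts.values()])  # H(C)

    ig, chi = {}, {}
    for t, dft in df.items():
        p_t = dft / n
        # Class distribution given term presence / absence.
        present = [joint[(t, c)] / dft for c in cat_counts]
        absent = [(cat_counts[c] - joint[(t, c)]) / (n - dft) if n > dft else 0
                  for c in cat_counts]
        # IG(t) = H(C) - P(t) H(C|t) - P(~t) H(C|~t)
        ig[t] = h_cats - p_t * entropy(present) - (1 - p_t) * entropy(absent)
        # chi^2(t, c) from the 2x2 contingency table; keep the max over c.
        best = 0.0
        for c, nc in cat_counts.items():
            a = joint[(t, c)]      # term present, in category
            b = dft - a            # term present, other categories
            cc = nc - a            # term absent, in category
            d = n - dft - cc       # term absent, other categories
            denom = (a + cc) * (b + d) * (a + b) * (cc + d)
            if denom:
                best = max(best, n * (a * d - cc * b) ** 2 / denom)
        chi[t] = best
    return df, ig, chi

if __name__ == "__main__":
    docs = [({"wheat", "farm"}, "grain"), ({"wheat", "export"}, "grain"),
            ({"dollar", "bank"}, "money"), ({"bank", "export"}, "money")]
    df, ig, chi = term_scores(docs)
    # Thresholding then amounts to ranking terms by a score and
    # keeping the top fraction as the reduced vocabulary.
    print(sorted(ig, key=ig.get, reverse=True))
```

Aggressive dimensionality reduction, as studied in the paper, corresponds to keeping only a small top fraction of this ranking; the strong DF/IG/CHI correlations the abstract reports mean the three rankings tend to agree.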
Publication Info
- Year: 1997
- Type: article
- Pages: 412-420
- Citations: 4766
- Access: Closed