mind the gap

mat'log

TCEJ

An Experiment System for Text Classification implemented in Java


Experiment II

This experiment shows the importance of noise reduction when (1) weighting functions are used that are not robust to noisy data and (2) the aggressiveness of the dimensionality reduction is high.


Setting

I used a subset of the Reuters 21578 ModApté split. (All ModApté training documents out of the files reut2-000.xml, reut2-001.xml, reut2-002.xml, reut2-003.xml, and all ModApté test documents out of the file reut2-018.xml). This resulted in 2561 training documents. 2008 of them were labeled with at least one category. After stop word removal and stemming the number of unique terms was 23706. The 30 nearest neighbors classifier was used to classify the 338 test documents.


Results

More than 60% of the terms in the Reuters 21578 ModApté split have both document and term frequency smaller than 2. A portion of these terms is noise, that is, misspellings, left out blanks, random numbers etc. Some weighting functions are more robust to noise than others.

Information gain, for example, is rather robust since it takes the prior probability of the categories into account (more specifically the entropy). For any term, if the right summand of the formula for information gain is small for many categories (first condition of getting a high weight value for this term), the term gets a higher weight value if the categories have higher prior probabilities (i.e., contain more documents).



Figure 1 - Formulas for information gain and chi square.


Chi-square, on the other hand, assigns also high weight values to terms with document frequency 1 (could be noise) if they are, e.g., in a document which is the only document of a category (thus making P(tk|C) 1.0, P(not tk | C) 0.0, P(not tk | not C) 1.0, P(tk | not C) 0.0), therefore reducing the number of meaningful terms in the feature space.
As an illustrative example, the following table shows ranks 37-63 according to chi-sqare (whithout normalization) without noise reduction:




Noise Reduction

I used a noise reduction method which takes also the term frequency into account (rather than just the document frequency) because

1. it is not very likely that the same misspelling, left out blank, etc. appear twice (x times) in the same document

2. we do not set all term weights to zero for documents that are the only document in a category.


Noise reduction can be applied very efficiently during the computation of the weights (remember: 60% of the words in the corpus have both DF and TF smaller than 2).

The following three graphs show the precision/recall plots for the weighting function chi-square
(1) without noise reduction
(2) when the weights of terms with both DF and TF smaller than 2 are set to zero, and
(3) when the weights of terms with both DF and TF smaller than 3 are set to zero.


chi square without noise reduction


Chi square; Reuter ModApte subset; feature space dimensionality: 500;
Classifier: 30 - nearest neighbors

Break Even Point : 0.3208377500620222 (tolerance: 0.013320665779562008)
11 Point Precision: 0.32586704646198483
Average Precision: 0.31628834971854053



chi square (weights of terms with both TF and DF < 2 are set to zero)


Chi square; Reuter ModApte subset; feature space dimensionality: 500;
weights of terms with DF and TF < 2 set to zero ("noise reduction")
Classifier: 30 - nearest neighbors

Break Even Point : 0.510619797342732 (tolerance: 0.014132677904792834)
11 Point Precision: 0.5598801714349142
Average Precision: 0.5634028693893797



chi square (weights of terms with both TF and DF < 3 are set to zero)


Chi square; Reuter ModApte subset; feature space dimensionality: 500;
weights of terms with DF and TF < 3 set to zero ("noise reduction")
Classifier: 30 - nearest neighbors

Break Even Point : 0.6641893717907771 (tolerance: 8.230351571136829E-4)
11 Point Precision: 0.6877919976669019
Average Precision: 0.7031995181941424
mind the gap