An Experiment System for Text Classification implemented in Java
Experiment II
This experiment shows the importance of noise reduction when (1)
weighting functions are used that are not robust to noisy data
and (2) the aggressiveness of the dimensionality reduction is high.
Setting
I used a subset of the Reuters 21578 ModApté split. (All ModApté training documents out of the files reut2-000.xml, reut2-001.xml, reut2-002.xml,
reut2-003.xml, and all ModApté test documents out of the file reut2-018.xml).
This resulted in 2561 training documents. 2008 of them were labeled with
at least one category. After stop word removal and stemming the number of
unique terms was 23706. The 30 nearest neighbors classifier was used to
classify the 338 test documents.
Results
More than 60% of the terms in the Reuters 21578 ModApté split have
both document and term frequency smaller than 2. A portion of these
terms is noise, that is, misspellings, left out blanks, random numbers etc.
Some weighting functions are more robust to noise than others.
Information gain, for example, is rather robust since it takes
the prior probability of the categories into account (more specifically the
entropy). For any term, if the right summand of the formula for information
gain is small for many categories (first condition of getting a high weight value for
this term), the term gets a higher weight value if the categories have
higher prior probabilities (i.e., contain
more documents).

Figure 1 - Formulas for information gain and chi square.
Chi-square, on the other hand, assigns also high weight values to terms with
document frequency 1 (could be noise) if they are, e.g., in a document which
is the only document of a category (thus making P(tk|C) 1.0,
P(not tk | C) 0.0, P(not tk | not C) 1.0, P(tk | not C) 0.0), therefore reducing the number
of meaningful terms in the feature space.
As an illustrative example, the following table shows ranks 37-63 according
to chi-sqare (whithout normalization)
without noise reduction:
Noise Reduction
I used a noise reduction method which takes also the term frequency into
account (rather than just the document frequency) because
1. it is not very likely that the same misspelling, left out blank, etc.
appear twice (x times) in the same document
2. we do not set all term weights to zero for documents that are the only document in a category.
Noise reduction can be applied very efficiently during the computation of the weights (remember: 60% of the
words in the corpus have both DF and TF smaller than 2).
The following three graphs show the precision/recall plots for the weighting
function chi-square
(1) without noise reduction
(2) when the weights of terms
with both DF and TF smaller than 2 are set to zero, and
(3) when the weights
of terms with both DF and TF smaller than 3 are set to zero.
chi square without noise reduction
Chi square; Reuter ModApte subset; feature space dimensionality: 500;
Classifier: 30 - nearest neighbors
Break Even Point : 0.3208377500620222 (tolerance: 0.013320665779562008)
11 Point Precision: 0.32586704646198483
Average Precision: 0.31628834971854053
chi square (weights of terms with both TF and DF < 2 are set to zero)
Chi square; Reuter ModApte subset; feature space dimensionality: 500;
weights of terms with DF and TF < 2 set to zero ("noise reduction")
Classifier: 30 - nearest neighbors
Break Even Point : 0.510619797342732 (tolerance: 0.014132677904792834)
11 Point Precision: 0.5598801714349142
Average Precision: 0.5634028693893797
chi square (weights of terms with both TF and DF < 3 are set to zero)
Chi square; Reuter ModApte subset; feature space dimensionality: 500;
weights of terms with DF and TF < 3 set to zero ("noise reduction")
Classifier: 30 - nearest neighbors
Break Even Point : 0.6641893717907771 (tolerance: 8.230351571136829E-4)
11 Point Precision: 0.6877919976669019
Average Precision: 0.7031995181941424