An Experiment System for Text Classification implemented in Java
Experiment I
Setting
I used a subset of the Reuters-21578 ModApté split: all ModApté training
documents from the files reut2-000.xml, reut2-001.xml, reut2-002.xml, and
reut2-003.xml, and all ModApté test documents from the file reut2-018.xml.
This resulted in 2561 training documents, 2008 of which were labeled with
at least one category. After stop word removal and stemming, the number of
unique terms was 23706. A 30-nearest-neighbors classifier was used to
classify the 338 test documents.
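To make the classifier step concrete, here is a minimal sketch of k-nearest-neighbors category scoring. It is not the code of this system: the class name KnnScorer, the sparse Map-based term vectors, the cosine similarity, and the similarity-weighted vote are all assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical k-NN category scorer over sparse term-weight vectors. */
final class KnnScorer {
    private KnnScorer() {}

    /** A training document: term weights plus its assigned categories. */
    static final class TrainingDoc {
        final Map<String, Double> vector;
        final Set<String> categories;

        TrainingDoc(Map<String, Double> vector, Set<String> categories) {
            this.vector = vector;
            this.categories = categories;
        }
    }

    /** Cosine similarity of two sparse vectors; 0 if either vector is all zeros. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    /**
     * Sums, per category, the similarities of the k most similar training
     * documents that carry that category (assumed voting scheme).
     */
    static Map<String, Double> categoryScores(Map<String, Double> query,
                                              List<TrainingDoc> training, int k) {
        double[] sims = new double[training.size()];
        Integer[] order = new Integer[training.size()];
        for (int i = 0; i < sims.length; i++) {
            sims[i] = cosine(query, training.get(i).vector);
            order[i] = i;
        }
        // Sort document indices by descending similarity to the query.
        Arrays.sort(order, (x, y) -> Double.compare(sims[y], sims[x]));

        Map<String, Double> scores = new HashMap<>();
        for (int rank = 0; rank < Math.min(k, order.length); rank++) {
            int idx = order[rank];
            for (String category : training.get(idx).categories) {
                scores.merge(category, sims[idx], Double::sum);
            }
        }
        return scores;
    }
}
```

With k = 30, each test document would receive one score per category; thresholding those scores at different levels is one way to obtain the points of a precision/recall plot.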
Results
The first three graphs show the precision/recall plots for the three weighting
functions chi-square, information gain, and MSF (for details see my final write-up).
Information gain has the best break-even point and 11-point precision. This suggests that
information gain is especially suitable for aggressive dimensionality reduction (here from 23706 unique terms to 1000 dimensions), something that has also been observed by Yang and Pedersen in [1]. MSF has a slightly better break-even point and 11-point
precision than chi-square, so MSF appears to be another candidate for a weighting function. However,
the performance of MSF has yet to be evaluated on the complete Reuters ModApté split and also
on other corpora.
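For orientation, the chi-square and information gain scores of a single term for a single category can both be computed from a 2x2 table of document counts. The sketch below follows the standard textbook definitions; the class name TermScores, the two-class form of information gain, and the use of natural logarithms are assumptions and not necessarily what this system implements. MSF is left out because its definition is given only in the write-up.

```java
/** Hypothetical term scoring from a 2x2 term/category contingency table. */
final class TermScores {
    private TermScores() {}

    /**
     * Chi-square statistic for one term and one category.
     *
     * @param a documents in the category that contain the term
     * @param b documents outside the category that contain the term
     * @param c documents in the category that do not contain the term
     * @param d documents outside the category that do not contain the term
     */
    static double chiSquare(long a, long b, long c, long d) {
        double n = a + b + c + d;
        double denominator = (double) (a + c) * (b + d) * (a + b) * (c + d);
        if (denominator == 0.0) {
            return 0.0; // a marginal count is zero, so the table carries no evidence
        }
        double diff = (double) a * d - (double) c * b;
        return n * diff * diff / denominator;
    }

    /**
     * Information gain in its two-class form: the mutual information (in nats)
     * between "term present" and "document in category". Same arguments as
     * chiSquare.
     */
    static double informationGain(long a, long b, long c, long d) {
        double n = a + b + c + d;
        if (n == 0.0) {
            return 0.0;
        }
        return cell(a, a + b, a + c, n)
             + cell(b, a + b, b + d, n)
             + cell(c, c + d, a + c, n)
             + cell(d, c + d, b + d, n);
    }

    /** One summand P(x,y) * log(P(x,y) / (P(x) * P(y))), with 0 * log 0 taken as 0. */
    private static double cell(double joint, double rowTotal, double colTotal, double n) {
        if (joint == 0.0) {
            return 0.0;
        }
        return (joint / n) * Math.log((joint * n) / (rowTotal * colTotal));
    }
}
```

To reduce the feature space, each term would typically get one score (for example its maximum over all categories), and only the highest-scoring terms would be kept.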
Graphs:
chi square
Chi-square; Reuters ModApté subset; feature space dimensionality: 1000;
all terms with both DF and TF < 2 ignored (noise reduction);
Classifier: 30-nearest neighbors
Break Even Point : 0.6625473658476959 (tolerance: 8.189707859674877E-4)
11 Point Precision: 0.6911469177724453
Average Precision: 0.7082521675463176
information gain
Information gain; Reuters ModApté subset; feature space dimensionality: 1000;
Classifier: 30-nearest neighbors
Break Even Point : 0.6964075129597327 (tolerance: 8.629585042871923E-4)
11 Point Precision: 0.7274910276202387
Average Precision: 0.7536151523480823
MSF
MSF; Reuters ModApté subset; feature space dimensionality: 1000;
all terms with both DF and TF < 2 ignored (noise reduction);
Classifier: 30-nearest neighbors
Break Even Point : 0.6782702896224785 (tolerance: 0.0025277153650526962)
11 Point Precision: 0.7034398646793802
Average Precision: 0.728789300551402
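The break-even point and 11-point precision listed under each graph can be derived from a ranked list of classifier decisions. The following is a minimal sketch and not the code used for these experiments: the class name RankedEvaluation, the boolean ranking as input, and the reading of the reported tolerance as the remaining gap between precision and recall are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical evaluation helpers for a ranked list of decisions. */
final class RankedEvaluation {
    private RankedEvaluation() {}

    /** One point on the precision/recall curve. */
    static final class Point {
        final double precision;
        final double recall;

        Point(double precision, double recall) {
            this.precision = precision;
            this.recall = recall;
        }
    }

    /**
     * Builds the curve from decisions sorted by descending score.
     * rankedRelevance[i] is true if the decision at rank i is correct.
     */
    static List<Point> curve(boolean[] rankedRelevance, int totalRelevant) {
        if (totalRelevant <= 0) {
            throw new IllegalArgumentException("totalRelevant must be positive");
        }
        List<Point> points = new ArrayList<>();
        int hits = 0;
        for (int i = 0; i < rankedRelevance.length; i++) {
            if (rankedRelevance[i]) {
                hits++;
            }
            points.add(new Point((double) hits / (i + 1), (double) hits / totalRelevant));
        }
        return points;
    }

    /** Mean interpolated precision at recall 0.0, 0.1, ..., 1.0. */
    static double elevenPointPrecision(List<Point> curve) {
        double sum = 0.0;
        for (int level = 0; level <= 10; level++) {
            double recallLevel = level / 10.0;
            double best = 0.0;
            // Interpolation: highest precision at any recall >= this level.
            for (Point p : curve) {
                if (p.recall >= recallLevel - 1e-12 && p.precision > best) {
                    best = p.precision;
                }
            }
            sum += best;
        }
        return sum / 11.0;
    }

    /**
     * Finds the curve point where precision and recall are closest.
     * Returns {break-even value, |precision - recall| at that point}.
     */
    static double[] breakEven(List<Point> curve) {
        double bestGap = Double.POSITIVE_INFINITY;
        double value = 0.0;
        for (Point p : curve) {
            double gap = Math.abs(p.precision - p.recall);
            if (gap < bestGap) {
                bestGap = gap;
                value = (p.precision + p.recall) / 2.0;
            }
        }
        return new double[] { value, bestGap };
    }
}
```

Since precision and recall rarely coincide exactly at any threshold, the second value returned by breakEven indicates how closely the reported break-even point approximates a true crossing.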
References
[1] Yiming Yang and Jan O. Pedersen. A Comparative Study on Feature Selection in Text Categorization. Proceedings of ICML-97, 14th International Conference on Machine Learning, 1997.