mind the gap

mat'log

TCEJ

An Experiment System for Text Classification implemented in Java


Experiment I


Setting

I used a subset of the Reuters-21578 ModApté split: all ModApté training documents from the files reut2-000.xml, reut2-001.xml, reut2-002.xml, and reut2-003.xml, and all ModApté test documents from the file reut2-018.xml. This resulted in 2561 training documents, 2008 of which were labeled with at least one category. After stop word removal and stemming, the number of unique terms was 23706. A 30-nearest-neighbors classifier was used to classify the 338 test documents.
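To make the classification step concrete, here is a minimal sketch of a k-nearest-neighbors scorer. It is not the TCEJ code: the class name KnnSketch, the use of cosine similarity, and the choice to sum neighbor similarities per category are all assumptions made for illustration, since the setting above only states that k = 30 neighbors were used.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration, not part of TCEJ.
class KnnSketch {
    /** A training document: sparse term weights plus its (possibly empty) category labels. */
    static final class Doc {
        final Map<String, Double> weights;
        final Set<String> categories;

        Doc(Map<String, Double> weights, Set<String> categories) {
            this.weights = weights;
            this.categories = categories;
        }
    }

    /** Cosine similarity between two sparse term-weight vectors (assumed similarity measure). */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            double w = e.getValue();
            normA += w * w;
            Double other = b.get(e.getKey());
            if (other != null) dot += w * other;
        }
        for (double w : b.values()) normB += w * w;
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / Math.sqrt(normA * normB);
    }

    /** Scores each category by summing the similarities of the k nearest training documents. */
    static Map<String, Double> scoreCategories(List<Doc> training, Map<String, Double> query, int k) {
        int n = training.size();
        double[] sims = new double[n];
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            sims[i] = cosine(training.get(i).weights, query);
            order[i] = i;
        }
        // Sort document indices by descending similarity to the query.
        Arrays.sort(order, (x, y) -> Double.compare(sims[y], sims[x]));

        Map<String, Double> scores = new HashMap<>();
        for (int rank = 0; rank < Math.min(k, n); rank++) {
            Doc neighbor = training.get(order[rank]);
            for (String category : neighbor.categories) {
                scores.merge(category, sims[order[rank]], Double::sum);
            }
        }
        return scores;
    }
}
```

With k = 30, calling KnnSketch.scoreCategories(training, testDocument, 30) would return one score per category seen among the 30 nearest neighbors. Varying a threshold on these scores is one way to obtain the points of a precision/recall curve.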


Results


The three graphs below show the precision/recall plots for the three weighting functions chi-square, information gain, and MSF (for details see my final write-up).
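For orientation, the following sketch computes chi-square and information gain for one term and one category from a 2x2 document-count table. It uses the common textbook definitions and is an assumption about the general form only. The exact variants used in the experiment, and the definition of MSF, are in the write-up and are not reproduced here. The class name TermWeightingSketch and the parameter names a, b, c, d are hypothetical.

```java
// Hypothetical illustration, not part of TCEJ.
// Counts for a term t and a category c over N = a + b + c + d documents:
//   a = documents in c that contain t      b = documents not in c that contain t
//   c = documents in c without t           d = documents not in c without t
class TermWeightingSketch {
    /** Chi-square statistic: N * (ad - cb)^2 / ((a+c)(b+d)(a+b)(c+d)). */
    static double chiSquare(long a, long b, long c, long d) {
        double n = a + b + c + d;
        double diff = (double) a * d - (double) c * b;
        double denominator = (double) (a + c) * (b + d) * (a + b) * (c + d);
        return denominator == 0.0 ? 0.0 : n * diff * diff / denominator;
    }

    /**
     * Information gain in its binary per-category form: the sum over the four
     * cells of P(x, y) * log2(P(x, y) / (P(x) * P(y))).
     */
    static double informationGain(long a, long b, long c, long d) {
        double n = a + b + c + d;
        return cell(a, a + b, a + c, n)   // term present, in category
             + cell(b, a + b, b + d, n)   // term present, not in category
             + cell(c, c + d, a + c, n)   // term absent, in category
             + cell(d, c + d, b + d, n);  // term absent, not in category
    }

    /** One summand of the information gain; an empty cell contributes 0. */
    private static double cell(double joint, double termTotal, double categoryTotal, double n) {
        if (joint == 0.0) return 0.0;
        return (joint / n) * (Math.log(joint * n / (termTotal * categoryTotal)) / Math.log(2.0));
    }
}
```

Both functions return 0 when term and category are statistically independent and grow as the term becomes more indicative of the category. To rank terms for feature selection, the per-category values still have to be combined into one score per term, for example by taking the maximum or a weighted average over categories.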





Information gain has the best break-even point and 11-point precision. This suggests that information gain is especially suitable for highly aggressive dimensionality reduction (here down to 1000 features from 23706 unique terms), something that has also been observed by Yang and Pedersen [1]. MSF has a slightly better break-even point and 11-point precision than chi-square, so MSF appears to be another candidate for a weighting function. However, the performance of MSF has yet to be evaluated on the complete Reuters ModApté split and on other corpora.
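The two summary measures compared above can be sketched as follows, given the (recall, precision) points of one curve. This is a hedged illustration and not the TCEJ implementation. It assumes that the break-even point is taken at the curve point where precision and recall are closest, that the reported "tolerance" is the remaining gap between the two, and that 11-point precision uses interpolated precision at recall levels 0.0, 0.1, ..., 1.0. The class name PrecisionRecallSketch is hypothetical.

```java
// Hypothetical illustration, not part of TCEJ.
class PrecisionRecallSketch {
    /**
     * 11-point average precision: the mean over recall levels 0.0, 0.1, ..., 1.0 of the
     * interpolated precision, i.e. the highest precision at any point with recall >= level.
     */
    static double elevenPointPrecision(double[] recall, double[] precision) {
        double sum = 0.0;
        for (int level = 0; level <= 10; level++) {
            double target = level / 10.0;
            double best = 0.0;
            for (int i = 0; i < recall.length; i++) {
                // Small epsilon guards against floating-point noise at the level boundary.
                if (recall[i] >= target - 1e-12 && precision[i] > best) {
                    best = precision[i];
                }
            }
            sum += best;
        }
        return sum / 11.0;
    }

    /**
     * Break-even point: picks the curve point where |precision - recall| is smallest.
     * Returns {breakEven, tolerance}: the mean of the two values and their remaining gap.
     */
    static double[] breakEvenPoint(double[] recall, double[] precision) {
        int bestIndex = 0;
        for (int i = 1; i < recall.length; i++) {
            double gap = Math.abs(precision[i] - recall[i]);
            if (gap < Math.abs(precision[bestIndex] - recall[bestIndex])) {
                bestIndex = i;
            }
        }
        double p = precision[bestIndex];
        double r = recall[bestIndex];
        return new double[] { (p + r) / 2.0, Math.abs(p - r) };
    }
}
```

Both methods expect two arrays of equal length, one entry per classifier threshold. A small tolerance means that a curve point lies very close to the line precision = recall, so the break-even value is well supported by the data.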


Graphs:

chi square


Chi-square; Reuters ModApté subset; feature space dimensionality: 1000;
all terms with both DF and TF < 2 ignored (noise reduction);
Classifier: 30-nearest neighbors

Break Even Point : 0.6625473658476959 (tolerance: 8.189707859674877E-4)
11 Point Precision: 0.6911469177724453
Average Precision: 0.7082521675463176



information gain


Information gain; Reuters ModApté subset; feature space dimensionality: 1000;
Classifier: 30-nearest neighbors

Break Even Point : 0.6964075129597327 (tolerance: 8.629585042871923E-4)
11 Point Precision: 0.7274910276202387
Average Precision: 0.7536151523480823



MSF


MSF; Reuters ModApté subset; feature space dimensionality: 1000;
all terms with both DF and TF < 2 ignored (noise reduction);
Classifier: 30-nearest neighbors

Break Even Point : 0.6782702896224785 (tolerance: 0.0025277153650526962)
11 Point Precision: 0.7034398646793802
Average Precision: 0.728789300551402



References

[1] Yiming Yang, Jan O. Pedersen: A Comparative Study on Feature Selection in Text Categorization. Proceedings of ICML-97, 14th International Conference on Machine Learning, 1997.