mind the gap

mat'log

TCEJ

TCEJ -- An Experiment System for Text Classification implemented in Java


Mathias Niepert, Spring 2005


News

  • Due to popular demand this page has been revived. Enjoy!
  • I added two experiments (one, two) conducted with the system. They include precision/recall plots and frequency tables.
  • Screenshot showing some of the analysis tools.


Abstract
A variety of machine learning algorithms address the problem of text classification. However, it remains difficult to compare the performance of different approaches because test corpora, term weighting and representation, and dimensionality reduction techniques are not standardized in the literature. A method that is very successful on one corpus might not be the best solution for another. TCEJ implements every step of a text categorization system in order to simplify experiments and future investigations in the area. Its main purpose is to provide a framework to which more classifiers (e.g. neural networks, other statistical methods, case-based and rule-based systems) can easily be added, and to let the user compare and evaluate different preprocessing techniques such as stemming, term weighting, and dimensionality reduction. In research areas where quality is judged almost entirely on empirical results, standardizing every step from preprocessing to classification is essential. Furthermore, TCEJ can aid the exploration and analysis of corpora through term/document/category tables and graphical tools. For the same reasons, TCEJ should also be useful for educational purposes such as classroom demonstrations and experiments.


Already Implemented Steps

Corpus Generation
Word Stemming
Removal of Stop Words
Dimensionality Reduction
Naive Bayes Classification (Bernoulli)
k-Nearest Neighbors Classification
Graphical User Interface
Evaluation (precision/recall) Interface
RCut, Thresholding
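
To make the last two items more concrete, here is a minimal sketch of rank-based thresholding (RCut) followed by micro-averaged precision and recall. It is not taken from the TCEJ source code: the class name RCutEvaluation, the method names, and the sample scores are all hypothetical, and it uses newer Java syntax than the JVM 1.5 the program was written for.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration, not part of TCEJ.
final class RCutEvaluation {

    /** RCut: assign the t highest-scoring categories to one document. */
    static Set<String> rcut(Map<String, Double> scores, int t) {
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(scores.entrySet());
        // Sort by score, highest first. The order of tied scores is unspecified.
        ranked.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        Set<String> assigned = new LinkedHashSet<>();
        for (int i = 0; i < Math.min(t, ranked.size()); i++) {
            assigned.add(ranked.get(i).getKey());
        }
        return assigned;
    }

    /** Micro-averaged {precision, recall} over all documents and categories. */
    static double[] precisionRecall(List<Set<String>> predicted, List<Set<String>> gold) {
        int tp = 0, fp = 0, fn = 0;
        for (int d = 0; d < predicted.size(); d++) {
            for (String c : predicted.get(d)) {
                if (gold.get(d).contains(c)) tp++; else fp++;
            }
            for (String c : gold.get(d)) {
                if (!predicted.get(d).contains(c)) fn++;
            }
        }
        double precision = (tp + fp) == 0 ? 0.0 : (double) tp / (tp + fp);
        double recall = (tp + fn) == 0 ? 0.0 : (double) tp / (tp + fn);
        return new double[] { precision, recall };
    }

    public static void main(String[] args) {
        // Made-up classifier scores for two documents.
        Map<String, Double> doc1 = new HashMap<>();
        doc1.put("Sports", 0.7);
        doc1.put("Health", 0.2);
        doc1.put("Science", 0.1);
        Map<String, Double> doc2 = new HashMap<>();
        doc2.put("Business", 0.4);
        doc2.put("Travel", 0.5);
        doc2.put("Movie", 0.1);

        List<Set<String>> predicted = Arrays.asList(rcut(doc1, 1), rcut(doc2, 1));
        List<Set<String>> gold = Arrays.asList(
                Collections.singleton("Sports"),
                Collections.singleton("Business"));

        double[] pr = precisionRecall(predicted, gold);
        // One correct and one wrong assignment: prints "precision=0.5 recall=0.5"
        System.out.println("precision=" + pr[0] + " recall=" + pr[1]);
    }
}
```

Raising t trades precision for recall: with t = 2 both correct categories are found in this example (recall 1.0), while precision stays at 0.5 because each document also receives one wrong category.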


Future work

C4.5 Decision Tree Classification?
Improve Batch Evaluation Interface



Resources


Feel free to use the source code, corpora, and all files that are listed below. However, if you use the program for research or the classroom, please do acknowledge the author.

Documentation (pdf) [I apologize for the poor English; the paper was written just a few months after I had arrived in the US]

Complete Workspace (tar.gz)
(13.3 MB, with test document scorings, both corpora, JavaDoc, and batch evaluation files)

To start the program with a maximum heap size of 512 MB, use the following command
(JVM 1.5 recommended but not required):

"java -Xmx512m TCEJ"

Source Code (tar.gz)
(238 kB, including .class files)

Java Documentation (html)

Corpora

Reuters-21578 Corpus [5] (tar.gz)
Converted from SGML to XML with sx [6]
I had to remove some control characters (decimal code < 30) that the SAX2 XML parser would have rejected. Apart from that, the corpus is unchanged.
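
A clean-up of this kind could look roughly like the following sketch. It is not the code that was used to prepare the corpus; the class name XmlCharFilter and its methods are hypothetical. Note also that it applies the Char production of the XML 1.0 specification (tab, line feed, carriage return, and code points from 0x20 upward, excluding surrogates, 0xFFFE and 0xFFFF), which is a slightly different rule than the "decimal < 30" cut-off mentioned above.

```java
// Hypothetical illustration, not part of TCEJ.
final class XmlCharFilter {

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the input with all disallowed code points dropped. */
    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (isXml10Char(cp)) {
                out.appendCodePoint(cp);
            }
            // Advance by one or two chars, depending on the code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A string containing the control character with decimal code 3.
        String dirty = "oil" + (char) 3 + " prices";
        System.out.println(strip(dirty)); // prints "oil prices"
    }
}
```

Running the text through such a filter before handing it to the parser leaves tabs and line breaks intact, so the document layout is preserved.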

Self-made Training corpus (tar.gz)
Categories: Sports {30 training documents}, Health {30}, Science {27}, Business {23}, Education {24}, Travel {6}, Movie {10}; 150 training documents overall, with 700 words per document. The corpus also includes 50 test documents. All documents are New York Times articles.

References

[1] Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification (1999), Eui-Hong (Sam) Han, George Karypis and Vipin Kumar, Lecture Notes in Computer Science

[2] A Comparison of Event Models for Naive Bayes Text Classification (1998), Andrew McCallum, Kamal Nigam

[3] Machine Learning in Automated Text Categorization (2002), Fabrizio Sebastiani, ACM Computing Surveys, Volume 34, Issue 1

[4] A Comparative Study on Feature Selection in Text Categorization (1997), Yiming Yang, Jan O. Pedersen, Proceedings of ICML-97, 14th International Conference on Machine Learning

[5] Reuters-21578 text categorization test collection, Distribution 1.0, 26 September 1997,
David D. Lewis, AT&T Labs - Research
The corpus in SGML format can be downloaded at http://www.daviddlewis.com/resources/testcollections/reuters21578/

[6] SX, SGML to XML converter, by James Clark, http://xml.coverpages.org/sx-doc.html
