Mathias Niepert, Spring 2005
A variety of machine learning algorithms address the problem of text classification. However, it remains difficult to compare the performance of different approaches because test corpora, term weighting and representation, and dimensionality reduction techniques are not standardized in the literature. A method that is very successful on one corpus might not be the best solution for another.
The goal of TCEJ is to implement every step of a text categorization system in order to simplify experiments and future investigations in the area. Its main purpose is to provide a framework to which more classifiers (e.g. neural network classification, other statistical methods, case- and rule-based systems) can easily be added, and to let the user compare and evaluate different preprocessing techniques such as stemming, term weighting, and dimensionality reduction. In research areas where quality is judged almost entirely on empirical results, standardizing every step from preprocessing to classification is essential. Furthermore, TCEJ can aid the exploration and analysis of corpora using term/document/category tables and graphical tools. For the same reasons, TCEJ will also be useful for educational purposes such as classroom demonstrations and experiments.
Already Implemented Steps
Corpus Generation
Word Stemming
Removal of Stop Words
Dimensionality Reduction
Naive Bayes Classification (Bernoulli)
k-Nearest Neighbors Classification
Graphical User Interface
Evaluation (precision/recall) Interface
RCut, Thresholding
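As an illustration of the evaluation and thresholding steps listed above, here is a minimal Java sketch of rank-based thresholding (RCut) followed by micro-averaged precision and recall. It is not taken from the TCEJ source code: the class name RCutSketch, the method names, the example scores, and the choice of micro-averaging are all assumptions made for this example, and it assumes Java 9 or later for Map.of and List.of.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not part of TCEJ.
final class RCutSketch {

    /** RCut: keep the t categories with the highest scores for one document. */
    static Set<String> rcut(Map<String, Double> scores, int t) {
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(scores.entrySet());
        // Sort by descending score; ties are broken by category name for determinism.
        ranked.sort((a, b) -> {
            int c = Double.compare(b.getValue(), a.getValue());
            return c != 0 ? c : a.getKey().compareTo(b.getKey());
        });
        Set<String> assigned = new HashSet<>();
        for (int i = 0; i < Math.min(t, ranked.size()); i++) {
            assigned.add(ranked.get(i).getKey());
        }
        return assigned;
    }

    /** Micro-averaged precision and recall over all documents: {precision, recall}. */
    static double[] microPrecisionRecall(List<Set<String>> predicted, List<Set<String>> truth) {
        int tp = 0, fp = 0, fn = 0;
        for (int d = 0; d < predicted.size(); d++) {
            for (String c : predicted.get(d)) {
                if (truth.get(d).contains(c)) tp++; else fp++;
            }
            for (String c : truth.get(d)) {
                if (!predicted.get(d).contains(c)) fn++;
            }
        }
        // Define both measures as 0 when their denominator is empty.
        double precision = tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
        double recall = tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
        return new double[] { precision, recall };
    }

    public static void main(String[] args) {
        // Made-up classifier scores for two test documents.
        Map<String, Double> doc1 = Map.of("Sports", 0.9, "Health", 0.4, "Travel", 0.1);
        Map<String, Double> doc2 = Map.of("Sports", 0.2, "Health", 0.3, "Travel", 0.7);
        List<Set<String>> predicted = List.of(rcut(doc1, 1), rcut(doc2, 1));
        List<Set<String>> truth = List.of(Set.of("Sports"), Set.of("Health"));
        double[] pr = microPrecisionRecall(predicted, truth);
        System.out.println("precision=" + pr[0] + " recall=" + pr[1]);
    }
}
```

With t = 1 every document receives exactly one category, so raising t trades precision for recall; comparing such settings on the same corpus is the kind of experiment the evaluation interface is meant to support.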
Future Work
C4.5 Decision Tree Classification ?
Improve Batch Evaluation Interface
Resources
Feel free to use the source code, corpora, and all files that are listed below. However, if you use the program for research or the classroom, please do acknowledge the author.
Documentation (pdf) [I apologize for the poor English; the paper was written just a few months after I had arrived in the US]
Complete Workspace (tar.gz)
(13.3 MB, with test document scorings, both corpora, JavaDoc, and batch evaluation files)
To start the program with a maximum Java heap size of 512 MB (JVM 1.5 recommended but not required), use the command:
"java -Xmx512m TCEJ"
Source Code (tar.gz)
(238 kB, including .class files)
Java Documentation (html)
Corpora
Reuters-21578 Corpus [5] (tar.gz)
Converted from SGML to XML with sx [6]. I had to remove some characters (character codes below decimal 30) that would have been rejected by the SAX2 XML parser. Apart from that, the corpus is unchanged.
Self-made Training Corpus (tar.gz)
Categories: Sports {30 Training Documents}, Health {30}, Science {27}, Business {23}, Education {24}, Travel {6}, Movie {10}; overall 150 training documents with 700 words per document. This corpus includes 50 test documents. All documents are New York Times articles.
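The character clean-up mentioned for the Reuters-21578 corpus can be sketched in Java as follows. This is an illustration only and not the code that was used to prepare the corpus: instead of the decimal threshold quoted in the note, the sketch filters by the Char production of the XML 1.0 specification, which below 0x20 permits only tab, line feed, and carriage return. The class name XmlCharFilterSketch, the method names, and the sample string are assumptions made for this example.

```java
// Hypothetical sketch, not part of TCEJ.
final class XmlCharFilterSketch {

    /** Returns true if the code point is allowed by the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Drops every code point that an XML 1.0 parser would reject. */
    static String stripInvalidXmlChars(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (isXml10Char(cp)) {
                out.appendCodePoint(cp);
            }
            // Advance by one or two chars depending on the code point's width.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up fragment containing two control characters (ETX and ESC).
        String dirty = "<BODY>Oil prices\u0003 rose\u001b today.\n</BODY>";
        System.out.println(stripInvalidXmlChars(dirty));
    }
}
```

Running such a filter over each file before parsing leaves tabs and line breaks in place, so the document text itself is not altered beyond the removed control characters.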