tc
Class TCExperiment

java.lang.Object
  extended by tc.TCExperiment

public class TCExperiment
extends java.lang.Object

This class represents an experiment with one corpus. It holds the properties of the experiment (weighting function, stemming, stopword removal, dimensionality, classifier properties, etc.) and provides methods to load and test the corpus, fill tables, and create graphs for the GUI.


Field Summary
(package private)  java.util.ArrayList batchEvaluationResults
          holds all the batch evaluations made so far
(package private)  TCCorpusLoader corpusLoader
          the loader for the corpus
(package private)  java.lang.String corpusName
          holds the name of the corpus
(package private)  TCCorpusDistributionsData distributionsDataCorpus
          contains the mean and standard deviation of some gaussian distributions
(package private)  TCCorpusDistributionsData distributionsDataForTermsOfInterest
          contains the mean and standard deviation of some gaussian distributions
(package private)  int featureSpaceDimension
          the current feature space dimension (dimension of the feature term vector)
(package private)  int kNN_k
          number of nearest neighbors used for the kNN algorithm
(package private)  double kNNThreshold
          the threshold for the kNN classifier
(package private)  double naiveBayesThreshold
          the threshold for the naive bayes classifier
(package private)  javax.swing.JFrame parentFrame
          the parent frame (used for progress monitoring etc.)
(package private)  java.util.ArrayList[] precisionRecallResults
          holds all the (recall, precision) values for the precision and recall graph
(package private)  int stemmerMethod
          indicates which stemming method should be used
(package private)  java.lang.String stopWordsString
          the string containing all the words considered as stop words
(package private)  TCCorpus testCorpus
          the actual corpus used for this experiment
(package private)  boolean useStopWordRemoval
          indicates if stop words should be removed
(package private)  int weightFunctionIndex
          the index of the weighting function currently used
(package private)  int weightFunctionSumIndex
          the method of summation of the weighting function currently in use
 
Constructor Summary
TCExperiment()
          constructor.
 
Method Summary
 java.lang.String classifyDocumentAndProvideReport(java.lang.String path, int classifierIndex)
          classifies a single document and provides a textual report
 TCAccuracyResult computeClassifierAccuracyKNN(double stepSize)
          computes the accuracy (break even, 11pt precision, precision-recall values, etc.) for the kNN classifier.
 TCAccuracyResult computeClassifierAccuracyNBB(double stepSize)
          computes the accuracy (break even, precision-recall values) for the naive bayes classifier.
 boolean computeWeightValues(int minimumDFTFThreshold, javax.swing.ProgressMonitor progressMonitor)
          computes the weight values for the experiment corpus
 Graph2D createEvaluationGraph(int evaluationNumber)
          creates the batch evaluation graph where different settings (e.g. different weighting functions for different dimensionalities) are plotted
 Graph2D createGDGraph(int distributionIndex)
          creates the gaussian distributions graph for later use
 Graph2D createPrecisionRecallGraph(int classifierIndex)
          creates the precision<->recall graph based on an ArrayList with evaluation results for many different thresholds
 TCEvaluationResult evaluateClassification(int classifierIndex)
          Evaluates the kNN or NBB classifier with the current settings and calculates recall and precision.
 void evaluateClassificationBatch()
          Evaluates (precision and/or recall) one or more classifiers for different types of settings (weighting function, properties of the classification algorithm, number of unique terms, etc.) and fills a TCBatchResult object which can be saved to disk or used for plotting the curves.
 TCEvaluationResult evaluateKNNClassificationForThreshold(double threshold, boolean display)
          Evaluates the kNN classifier with the current settings (and one single threshold) and calculates recall and precision.
 TCEvaluationResult evaluateNBBClassificationForThreshold(double threshold, boolean display)
          Evaluates the naive bayes bernoulli classifier with the current settings (and one single threshold) and calculates recall and precision.
 void exportDocumentVectors(java.lang.String path, int documentType)
           
 TCTableData fillBatchEvaluationDataTable()
          creates the table with the batch evaluation data
 TCTableData fillCategoryDataTable()
          assembles the data of every category (number of documents, etc.)
 TCTableData fillDocumentDataTable(int property)
          fills the table with properties of all the documents
 TCTableData fillTermDataTable()
          shows the properties of all the terms in the terms of interest arraylist (which are the terms after the dimensionality reduction step)
 int getNumberOfCategories()
          returns the number of categories in the experiment corpus
 int getNumberOfDocuments()
          returns the number of documents in the experiment corpus
 int getNumberOfLabeledDocuments()
          returns the number of documents which were assigned to at least one category
 int getNumberOfProcessedWords()
          returns the number of processed words (the number of words which were sorted into the term vectors by the loadCorpus algorithms)
 int getNumberOfTestDocuments()
          returns the number of test documents currently loaded
 int getNumberOfTestDocumentsWithLabel()
          returns the number of test documents with at least one label (which are assigned to at least one category)
 int getNumberOfUniqueTerms()
          returns the number of distinct (unique) terms in the corpus
 double getNumberOfWordsPerDocument()
          returns the average number of words per document in the corpus
 java.lang.String getStopWordsString()
           
 int getWeightFunctionIndex()
          returns the weight function index currently in use
 int getWeightFunctionSumMethodIndex()
          returns the index of the weight function summation method
 void loadBatchEvaluationResult(java.lang.String path)
           
 void loadClassificatorScoring(java.lang.String path, int classifierIndex)
          loads a scoring file.
 void loadCorpus(boolean useSWR, int stemMethod)
          loads one of the implemented corpora.
 void loadStopWordsString()
          loads the file with all the stop words and saves it in a String for later use
 double reduceFeatureSpaceDimension(int fSD, int minimumDFTFFrequency)
          reduces the feature space and calculates the distributions and the tfidf values for all the documents in the corpus
 void saveBatchEvaluationResult(java.lang.String path, int resultIndex)
          saves a batch evaluation to the given path
 void saveClassificatorScoring(java.lang.String path, int classifierIndex)
          saves the classifier scorings for later use.
 void setCorpusName(java.lang.String name)
          sets the label of the corpus
 void setKNNk(int k)
          sets the k value for the kNN classifier
 void setKNNThreshold(double t)
          sets the threshold for the kNN classifier
 void setNaiveBayesThreshold(double t)
          sets the threshold for the naive bayes classifier
 void setParentJFrame(javax.swing.JFrame frame)
          sets the JFrame this class was created from (used for progress monitoring etc.)
 void setWeightFunctionIndex(int wfindex)
          sets the weighting function which should be used for the dimensionality reduction step
 void setWeightFunctionSumMethodIndex(int wfsmi)
          sets the method of summation for the weighting function currently in use
 void trainKNNClassifier()
          calculates all the kNN similarity scores for every test document loaded by the corpusLoader.
 void trainNaiveBayesClassifier()
          trains the naive bayes classifier (calculates the category probabilities for every document); the results are saved in the TCCorpus class for later processing
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

corpusName

java.lang.String corpusName
holds the name of the corpus


stemmerMethod

int stemmerMethod
indicates which stemming method should be used


useStopWordRemoval

boolean useStopWordRemoval
indicates if stop words should be removed


weightFunctionIndex

int weightFunctionIndex
the index of the weighting function currently used


weightFunctionSumIndex

int weightFunctionSumIndex
the method of summation of the weighting function currently in use


featureSpaceDimension

int featureSpaceDimension
the current feature space dimension (dimension of the feature term vector)


kNN_k

int kNN_k
number of nearest neighbors used for the kNN algorithm


naiveBayesThreshold

double naiveBayesThreshold
the threshold for the naive bayes classifier


kNNThreshold

double kNNThreshold
the threshold for the kNN classifier


stopWordsString

java.lang.String stopWordsString
the string containing all the words considered as stop words


testCorpus

TCCorpus testCorpus
the actual corpus used for this experiment


corpusLoader

TCCorpusLoader corpusLoader
the loader for the corpus


distributionsDataForTermsOfInterest

TCCorpusDistributionsData distributionsDataForTermsOfInterest
contains the mean and standard deviation of some gaussian distributions


distributionsDataCorpus

TCCorpusDistributionsData distributionsDataCorpus
contains the mean and standard deviation of some gaussian distributions


precisionRecallResults

java.util.ArrayList[] precisionRecallResults
holds all the (recall, precision) values for the precision and recall graph


batchEvaluationResults

java.util.ArrayList batchEvaluationResults
holds all the batch evaluations made so far


parentFrame

javax.swing.JFrame parentFrame
the parent frame (used for progress monitoring etc.)

Constructor Detail

TCExperiment

public TCExperiment()
Constructor. Sets all values to their defaults and initializes the helper classes.

Method Detail

setCorpusName

public void setCorpusName(java.lang.String name)
sets the label of the corpus

Parameters:
name - the label of the corpus

setParentJFrame

public void setParentJFrame(javax.swing.JFrame frame)
sets the JFrame this class was created from (used for progress monitoring etc.)

Parameters:
frame - the JFrame object

setKNNk

public void setKNNk(int k)
sets the k value for the kNN classifier

Parameters:
k - number of nearest neighbors considered

setNaiveBayesThreshold

public void setNaiveBayesThreshold(double t)
sets the threshold for the naive bayes classifier

Parameters:
t - the threshold value

setKNNThreshold

public void setKNNThreshold(double t)
sets the threshold for the kNN classifier

Parameters:
t - the threshold value

setWeightFunctionIndex

public void setWeightFunctionIndex(int wfindex)
sets the weighting function which should be used for the dimensionality reduction step

Parameters:
wfindex - the index of the weighting function

setWeightFunctionSumMethodIndex

public void setWeightFunctionSumMethodIndex(int wfsmi)
sets the method of summation for the weighting function currently in use

Parameters:
wfsmi - the index of the summation method

getWeightFunctionIndex

public int getWeightFunctionIndex()
returns the weight function index currently in use

Returns:
the index of the weighting function currently in use

getWeightFunctionSumMethodIndex

public int getWeightFunctionSumMethodIndex()
returns the index of the weight function summation method

Returns:
index of the summation method

getNumberOfDocuments

public int getNumberOfDocuments()
returns the number of documents in the experiment corpus

Returns:
number of documents in the loaded corpus

getNumberOfLabeledDocuments

public int getNumberOfLabeledDocuments()
returns the number of documents which were assigned to at least one category

Returns:
the number of labeled documents in the corpus

getNumberOfCategories

public int getNumberOfCategories()
returns the number of categories in the experiment corpus

Returns:
the number of categories in the currently loaded corpus

getNumberOfUniqueTerms

public int getNumberOfUniqueTerms()
returns the number of distinct (unique) terms in the corpus

Returns:
the number of unique terms

getNumberOfProcessedWords

public int getNumberOfProcessedWords()
returns the number of processed words (the number of words which were sorted into the term vectors by the loadCorpus algorithms)

Returns:
the number of processed words

getNumberOfWordsPerDocument

public double getNumberOfWordsPerDocument()
returns the average number of words per document in the corpus

Returns:
average number of words per document

getNumberOfTestDocuments

public int getNumberOfTestDocuments()
returns the number of test documents currently loaded

Returns:
the number of test documents loaded

getNumberOfTestDocumentsWithLabel

public int getNumberOfTestDocumentsWithLabel()
returns the number of test documents with at least one label (which are assigned to at least one category)

Returns:
the number of labeled test documents loaded

loadStopWordsString

public void loadStopWordsString()
loads the file with all the stop words and saves it in a String for later use
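
The idea of keeping all stop words in one String and using it to filter tokens can be sketched as follows. This is only an illustration under assumptions: the class StopWordSketch, its method names, and the whitespace-separated format of the stop word string are hypothetical and are not taken from TCExperiment.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: none of these names come from TCExperiment.
class StopWordSketch {

    /** Splits a whitespace-separated stop word string into a lookup set. */
    static Set<String> parseStopWords(String stopWordsString) {
        Set<String> stopWords = new HashSet<>();
        for (String word : stopWordsString.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                stopWords.add(word.toLowerCase(Locale.ROOT));
            }
        }
        return stopWords;
    }

    /** Returns the lower-cased tokens of the text that are not stop words. */
    static List<String> removeStopWords(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopWords = parseStopWords("the a of and on");
        // prints [cat, sat, mat]
        System.out.println(removeStopWords("The cat sat on the mat", stopWords));
    }
}
```

Converting the string to a set once keeps each lookup cheap when many documents are filtered.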


getStopWordsString

public java.lang.String getStopWordsString()

loadCorpus

public void loadCorpus(boolean useSWR,
                       int stemMethod)
loads one of the implemented corpora. To use a different corpus, a load function for that corpus has to be written.

Parameters:
useSWR - indicates if the stopword removal should be applied
stemMethod - indicates which stemming method should be used

computeWeightValues

public boolean computeWeightValues(int minimumDFTFThreshold,
                                   javax.swing.ProgressMonitor progressMonitor)
computes the weight values for the experiment corpus

Parameters:
progressMonitor - show progress with a progressMonitor
Returns:
indicates if the algorithm could finish without being interrupted

reduceFeatureSpaceDimension

public double reduceFeatureSpaceDimension(int fSD,
                                          int minimumDFTFFrequency)
reduces the feature space and calculates the distributions and the tfidf values for all the documents in the corpus

Parameters:
fSD - the dimensionality the feature space should have after the reduction step
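
A feature space reduction of this kind can be illustrated by keeping the highest-weighted terms and then computing tf-idf values for them. The sketch below is not the TCExperiment implementation: the class FeatureSelectionSketch, its methods, the example weights, and the tf * ln(N / df) variant of tf-idf are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: names and the tf-idf variant are assumptions.
class FeatureSelectionSketch {

    /** Returns the 'dimension' highest-weighted terms, best first. */
    static List<String> selectTopTerms(Map<String, Double> termWeights, int dimension) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(termWeights.entrySet());
        // Sort by weight descending; break ties alphabetically for determinism.
        entries.sort((a, b) -> {
            int byWeight = Double.compare(b.getValue(), a.getValue());
            return byWeight != 0 ? byWeight : a.getKey().compareTo(b.getKey());
        });
        List<String> selected = new ArrayList<>();
        for (int i = 0; i < Math.min(dimension, entries.size()); i++) {
            selected.add(entries.get(i).getKey());
        }
        return selected;
    }

    /** tf-idf as tf * ln(N / df); returns 0 for terms that never occur. */
    static double tfIdf(int termFrequency, int documentFrequency, int numberOfDocuments) {
        if (termFrequency == 0 || documentFrequency == 0) {
            return 0.0;
        }
        return termFrequency * Math.log((double) numberOfDocuments / documentFrequency);
    }

    /** Made-up weights used only for the demonstration. */
    static Map<String, Double> exampleWeights() {
        Map<String, Double> weights = new HashMap<>();
        weights.put("oil", 0.9);
        weights.put("the", 0.1);
        weights.put("wheat", 0.7);
        return weights;
    }

    public static void main(String[] args) {
        System.out.println(selectTopTerms(exampleWeights(), 2));
        System.out.println(tfIdf(3, 2, 10));
    }
}
```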

classifyDocumentAndProvideReport

public java.lang.String classifyDocumentAndProvideReport(java.lang.String path,
                                                         int classifierIndex)
classifies a single document and provides a textual report

Parameters:
path - the path to the document to classify
classifierIndex - the index of the classifier used for the classification
Returns:
the textual report to pass to the console

trainKNNClassifier

public void trainKNNClassifier()
calculates all the kNN similarity scores for every test document loaded by the corpusLoader. This can take a long time.
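
One common way to compute a kNN similarity score is to rank the training documents by cosine similarity and sum the similarities of the k nearest ones that belong to a category. The following sketch shows that formulation. It is an assumption that the scoring works roughly like this; the class KnnScoreSketch and its methods are hypothetical and do not come from TCExperiment.

```java
import java.util.Arrays;

// Hypothetical sketch: one common kNN scoring scheme, not TCExperiment code.
class KnnScoreSketch {

    /** Cosine similarity of two equally long vectors; 0 if either is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Sums the similarities of the k nearest training documents that carry the category. */
    static double categoryScore(double[] testDoc, double[][] trainDocs,
                                boolean[] inCategory, int k) {
        double[] sims = new double[trainDocs.length];
        Integer[] order = new Integer[trainDocs.length];
        for (int i = 0; i < trainDocs.length; i++) {
            sims[i] = cosine(testDoc, trainDocs[i]);
            order[i] = i;
        }
        // Most similar training documents first.
        Arrays.sort(order, (x, y) -> Double.compare(sims[y], sims[x]));
        double score = 0.0;
        for (int rank = 0; rank < Math.min(k, order.length); rank++) {
            if (inCategory[order[rank]]) {
                score += sims[order[rank]];
            }
        }
        return score;
    }

    public static void main(String[] args) {
        double[][] train = {{1, 0}, {0, 1}, {1, 1}};
        boolean[] inCategory = {true, false, true};
        System.out.println(categoryScore(new double[] {1, 0}, train, inCategory, 2));
    }
}
```

A document would then be assigned to the category when this score reaches the kNN threshold. The cost grows with the number of test documents times the number of training documents, which is consistent with the note that this step is slow.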


trainNaiveBayesClassifier

public void trainNaiveBayesClassifier()
trains the naive bayes classifier (calculates the category probabilities for every document). The results are saved in the TCCorpus class for later processing.
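
For a Bernoulli naive bayes model, the category score of a document is usually built from the prior of the category and, for every term in the feature space, the probability that the term is present or absent. The sketch below shows this textbook formulation in log space with add-one smoothing. The class BernoulliNbSketch, its methods, and the smoothing choice are assumptions, not the TCExperiment code.

```java
// Hypothetical sketch: textbook Bernoulli naive Bayes scoring, not TCExperiment code.
class BernoulliNbSketch {

    /** Add-one smoothed estimate of P(term present | category). */
    static double termProbability(int docsInCategoryWithTerm, int docsInCategory) {
        return (docsInCategoryWithTerm + 1.0) / (docsInCategory + 2.0);
    }

    /**
     * Log of P(category) * product over all terms of
     * P(term present | category) or 1 - P(term present | category).
     */
    static double logScore(boolean[] termPresent, double[] termProbabilities,
                           double categoryPrior) {
        double log = Math.log(categoryPrior);
        for (int t = 0; t < termPresent.length; t++) {
            double p = termProbabilities[t];
            // Absent terms contribute too; that is what makes the model Bernoulli.
            log += Math.log(termPresent[t] ? p : 1.0 - p);
        }
        return log;
    }

    public static void main(String[] args) {
        double[] probabilities = {termProbability(3, 8), termProbability(0, 8)};
        boolean[] document = {true, false};
        System.out.println(logScore(document, probabilities, 0.25));
    }
}
```

Working in log space avoids numeric underflow when the feature space has many dimensions.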


evaluateClassification

public TCEvaluationResult evaluateClassification(int classifierIndex)
Evaluates the kNN or NBB classifier with the current settings and calculates recall and precision. The evaluation tests every test document once for one particular setting (using a single threshold given by the user).

Returns:
object which holds the evaluation information
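
Precision and recall for a single threshold can be computed from the classifier scores and the true labels as sketched here. The class PrecisionRecallSketch and its method are hypothetical, and the convention of reporting 1.0 when a denominator is zero is an assumption made for this example.

```java
// Hypothetical sketch: names and the zero-denominator convention are assumptions.
class PrecisionRecallSketch {

    /**
     * Returns {precision, recall} for one category. A document counts as
     * assigned when its score is at least the threshold.
     */
    static double[] precisionRecall(double[] scores, boolean[] relevant, double threshold) {
        int truePositives = 0, falsePositives = 0, falseNegatives = 0;
        for (int i = 0; i < scores.length; i++) {
            boolean assigned = scores[i] >= threshold;
            if (assigned && relevant[i]) {
                truePositives++;
            } else if (assigned) {
                falsePositives++;
            } else if (relevant[i]) {
                falseNegatives++;
            }
        }
        int assignedTotal = truePositives + falsePositives;
        int relevantTotal = truePositives + falseNegatives;
        // Assumed convention: an empty denominator yields 1.0.
        double precision = assignedTotal == 0 ? 1.0 : (double) truePositives / assignedTotal;
        double recall = relevantTotal == 0 ? 1.0 : (double) truePositives / relevantTotal;
        return new double[] {precision, recall};
    }

    public static void main(String[] args) {
        double[] scores = {0.9, 0.8, 0.4, 0.2};
        boolean[] relevant = {true, false, true, false};
        double[] result = precisionRecall(scores, relevant, 0.5);
        System.out.println("precision=" + result[0] + " recall=" + result[1]);
    }
}
```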

computeClassifierAccuracyKNN

public TCAccuracyResult computeClassifierAccuracyKNN(double stepSize)
computes the accuracy (break even, 11pt precision, precision-recall values, etc.) for the kNN classifier.

Returns:
object which holds the accuracy information
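
The measures named above can be derived by sweeping the threshold in steps of stepSize and collecting one (recall, precision) pair per step. The sketch below assumes scores in the range 0 to 1, takes the break-even value at the point where precision and recall are closest, and uses the usual interpolated 11-point average. The class AccuracySweepSketch and all of its methods are hypothetical and are not the TCExperiment implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: names and conventions are assumptions, not TCExperiment code.
class AccuracySweepSketch {

    /** Returns {recall, precision}; scores >= threshold count as assigned. */
    static double[] recallPrecision(double[] scores, boolean[] relevant, double threshold) {
        int tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < scores.length; i++) {
            boolean assigned = scores[i] >= threshold;
            if (assigned && relevant[i]) tp++;
            else if (assigned) fp++;
            else if (relevant[i]) fn++;
        }
        // Assumed convention: an empty denominator yields 1.0.
        double precision = (tp + fp) == 0 ? 1.0 : (double) tp / (tp + fp);
        double recall = (tp + fn) == 0 ? 1.0 : (double) tp / (tp + fn);
        return new double[] {recall, precision};
    }

    /** Collects one {recall, precision} pair per threshold 0, stepSize, ..., 1. */
    static List<double[]> sweep(double[] scores, boolean[] relevant, double stepSize) {
        List<double[]> points = new ArrayList<>();
        int steps = (int) Math.round(1.0 / stepSize);
        for (int s = 0; s <= steps; s++) {
            points.add(recallPrecision(scores, relevant, s * stepSize));
        }
        return points;
    }

    /** Average of precision and recall at the point where they are closest. */
    static double breakEven(List<double[]> points) {
        double bestGap = Double.MAX_VALUE;
        double value = 0.0;
        for (double[] p : points) {
            double gap = Math.abs(p[0] - p[1]);
            if (gap < bestGap) {
                bestGap = gap;
                value = (p[0] + p[1]) / 2.0;
            }
        }
        return value;
    }

    /** Interpolated precision averaged over recall levels 0.0, 0.1, ..., 1.0. */
    static double elevenPointPrecision(List<double[]> points) {
        double sum = 0.0;
        for (int level = 0; level <= 10; level++) {
            double recallLevel = level / 10.0;
            double best = 0.0;
            for (double[] p : points) {
                // Interpolation: best precision at this recall level or beyond.
                if (p[0] >= recallLevel - 1e-12) best = Math.max(best, p[1]);
            }
            sum += best;
        }
        return sum / 11.0;
    }

    public static void main(String[] args) {
        double[] scores = {0.9, 0.8, 0.4, 0.2};
        boolean[] relevant = {true, false, true, false};
        List<double[]> points = sweep(scores, relevant, 0.25);
        System.out.println("break even: " + breakEven(points));
        System.out.println("11pt precision: " + elevenPointPrecision(points));
    }
}
```

A smaller step size gives a finer curve at the cost of more evaluations.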

computeClassifierAccuracyNBB

public TCAccuracyResult computeClassifierAccuracyNBB(double stepSize)
computes the accuracy (break even, precision-recall values) for the naive bayes classifier.

Returns:
object which holds the accuracy information

evaluateKNNClassificationForThreshold

public TCEvaluationResult evaluateKNNClassificationForThreshold(double threshold,
                                                                boolean display)
Evaluates the kNN classifier with the current settings (and one single threshold) and calculates recall and precision. Will be reduced to one routine for all classifiers.

Returns:
object which holds the evaluation information

evaluateNBBClassificationForThreshold

public TCEvaluationResult evaluateNBBClassificationForThreshold(double threshold,
                                                                boolean display)
Evaluates the naive bayes bernoulli classifier with the current settings (and one single threshold) and calculates recall and precision. Will be reduced to one routine for all classifiers.

Returns:
object which holds the evaluation information

evaluateClassificationBatch

public void evaluateClassificationBatch()
Evaluates (precision and/or recall) one or more classifiers for different types of settings (weighting function, properties of the classification algorithm, number of unique terms, etc.) and fills a TCBatchResult object which can be saved to disk or used for plotting the curves. Note: this method has not been finished yet.


fillBatchEvaluationDataTable

public TCTableData fillBatchEvaluationDataTable()
creates the table with the batch evaluation data

Returns:
instance of TCTableData for further processing

fillCategoryDataTable

public TCTableData fillCategoryDataTable()
assembles the data of every category (number of documents, etc.)

Returns:
instance of TCTableData for further processing

fillDocumentDataTable

public TCTableData fillDocumentDataTable(int property)
fills the table with properties of all the documents

Returns:
instance of TCTableData for further processing

fillTermDataTable

public TCTableData fillTermDataTable()
shows the properties of all the terms in the terms of interest arraylist (which are the terms after the dimensionality reduction step)

Returns:
instance of TCTableData for further processing

createGDGraph

public Graph2D createGDGraph(int distributionIndex)
creates the gaussian distributions graph for later use

Returns:
the graph to display
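
A Gaussian distribution graph is typically drawn from a mean and a standard deviation, which are the values the distributions data fields are described as holding. The sketch below estimates both from sample values and evaluates the density that such a curve would plot. The class GaussianSketch, its methods, and the use of the population standard deviation are assumptions and are not taken from TCExperiment or Graph2D.

```java
// Hypothetical sketch: not taken from TCExperiment or Graph2D.
class GaussianSketch {

    /** Arithmetic mean of the values. */
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Population standard deviation (divides by n, an assumption). */
    static double standardDeviation(double[] values) {
        double m = mean(values);
        double squared = 0.0;
        for (double v : values) {
            squared += (v - m) * (v - m);
        }
        return Math.sqrt(squared / values.length);
    }

    /** Gaussian probability density at x. */
    static double density(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2.0 * Math.PI));
    }

    public static void main(String[] args) {
        double[] values = {2, 4, 4, 4, 5, 5, 7, 9};
        double m = mean(values);
        double sd = standardDeviation(values);
        // Sample the curve at a few points, as a plot would.
        for (double x = m - 2 * sd; x <= m + 2 * sd; x += sd) {
            System.out.println(x + " -> " + density(x, m, sd));
        }
    }
}
```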

createPrecisionRecallGraph

public Graph2D createPrecisionRecallGraph(int classifierIndex)
creates the precision<->recall graph based on an ArrayList with evaluation results for many different thresholds

Returns:
the graph to display

createEvaluationGraph

public Graph2D createEvaluationGraph(int evaluationNumber)
creates the batch evaluation graph where different settings (e.g. different weighting functions for different dimensionalities) are plotted

Parameters:
evaluationNumber - the index in the batch evaluation ArrayList. Currently only one batch evaluation can be saved and loaded (index = 0).
Returns:
the graph to display

saveBatchEvaluationResult

public void saveBatchEvaluationResult(java.lang.String path,
                                      int resultIndex)
                               throws java.io.IOException
saves a batch evaluation to the given path

Parameters:
path - the path of the file to save the batch evaluation result in
resultIndex - the index of the batch evaluation result to save
Throws:
java.io.IOException

loadBatchEvaluationResult

public void loadBatchEvaluationResult(java.lang.String path)
                               throws java.io.IOException,
                                      java.lang.ClassNotFoundException
Throws:
java.io.IOException
java.lang.ClassNotFoundException

saveClassificatorScoring

public void saveClassificatorScoring(java.lang.String path,
                                     int classifierIndex)
                              throws java.io.IOException
saves the classifier scorings for later use. This saves a lot of processing time.

Parameters:
path - the path of the file to save the scoring in
classifierIndex - the index of the classifier the scoring is saved for
Throws:
java.io.IOException

loadClassificatorScoring

public void loadClassificatorScoring(java.lang.String path,
                                     int classifierIndex)
                              throws java.io.IOException,
                                     java.lang.ClassNotFoundException
loads a scoring file.

Parameters:
path - the path of the file the scoring is loaded from
classifierIndex - the index of the classifier for which the scoring is loaded
Throws:
java.io.IOException
java.lang.ClassNotFoundException
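
The declared exceptions (IOException and ClassNotFoundException) suggest that scorings are stored with standard Java object serialization, although the page does not confirm this. Under that assumption, a save and load round trip could look like the sketch below. The class ScoringPersistenceSketch, its methods, and the double[][] score layout are hypothetical.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical sketch: assumes Java object serialization and a double[][] layout.
class ScoringPersistenceSketch {

    /** Writes the score matrix to the given path. */
    static void save(String path, double[][] scores) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(path))) {
            out.writeObject(scores);
        }
    }

    /** Reads a score matrix previously written by save. */
    static double[][] load(String path) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(path))) {
            return (double[][]) in.readObject();
        }
    }

    /** Saves to a temporary file, reloads, and reports whether the data survived. */
    static boolean roundTripMatches(double[][] scores) {
        try {
            Path file = Files.createTempFile("scoring", ".ser");
            try {
                save(file.toString(), scores);
                return Arrays.deepEquals(scores, load(file.toString()));
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException | ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        double[][] scores = {{0.1, 0.9}, {0.5, 0.5}};
        System.out.println("round trip ok: " + roundTripMatches(scores));
    }
}
```

Persisting the scores this way means an expensive scoring run only has to be done once per corpus and setting.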

exportDocumentVectors

public void exportDocumentVectors(java.lang.String path,
                                  int documentType)
                           throws java.io.IOException
Throws:
java.io.IOException