tc
Class TCCorpus

java.lang.Object
  extended by tc.TCCorpus

public class TCCorpus
extends java.lang.Object

Class representing the whole corpus of documents. Contains all the methods and members for the dimensionality reduction step (different weighting functions) and the implementations of k-nearest neighbor, Naive Bayes (Bernoulli), etc.


Field Summary
(package private)  java.util.ArrayList allCategoryLabels
          contains all category labels of all the categories in the corpus
private  java.util.ArrayList allTerms
          all terms in this corpus (with the corresponding number of occurrences, term frequency, ...); see class TCTerm
private  java.util.ArrayList allTermsOfInterest
          contains all the terms after the application of a dimensionality reduction step.
(package private)  java.util.ArrayList kNNScoreList
          holds the {labels, score} pair for every document computed by the kNN algorithm
(package private)  java.util.ArrayList myCategories
          All categories in this corpus
(package private)  java.util.ArrayList myDocuments
          all the documents belonging to this corpus
(package private)  java.util.ArrayList naiveBayesScoreList
          holds the Naive Bayes (per-category) scores for every document, computed for the current dimensionality
private  int numberOfDocuments
          the number of documents in the corpus
private  int numberOfProcessedWords
          number of processed words, that is, all (not necessarily unique) words that appear in all the documents
private  int numberOfUniqueTerms
          the number of different terms in the corpus
(package private)  double numberOfWordsPerDocument
          average number of words per document
(package private)  int overallDocumentFrequency
          the sum of all document frequencies of all terms in the entire corpus
(package private)  int overallTermFrequency
          the sum of all term frequencies of all terms in the entire corpus
(package private)  double termFrequencyDistributionOfAllTermsMean
          the mean of the distribution of term frequencies of all terms in the corpus
(package private)  double termFrequencyDistributionOfAllTermsOfInterestMean
          the mean of the distribution of term frequencies of all terms of interest
(package private)  double termFrequencyDistributionOfAllTermsOfInterestStdDev
          the standard deviation of the distribution of term frequencies of all terms of interest
(package private)  double termFrequencyDistributionOfAllTermsStdDev
          standard deviation of the distribution of term frequencies in the entire corpus
 
Constructor Summary
TCCorpus()
           
 
Method Summary
 void addDocument(TCDocument newDocument)
          adds a new document to the corpus.
 TCkNNResult applyKNNToDocumentFromFile(TCDocument documentToClassify, int method, int k)
          applies the kNN classifier to the given document (loaded from a file before!).
 void applyKNNToTestDocument(TCDocument documentToClassify, int method)
          applies the kNN classifier to the given document (out of the test documents!): calculates the similarity to all documents in the training set and saves the scores in a list, together with the labels the documents belong to
 TCNaiveBayesResult applyNaiveBayesBernoulliToDocumentFromFile(TCDocument documentToClassify)
          Classifies one document out of the test documents with the naive bayes algorithm (bernoulli).
 void applyNaiveBayesBernoulliToTestDocument(TCDocument documentToClassify)
          Classifies one document out of the test documents with the naive bayes algorithm (bernoulli).
 double complexDimensionReduction(double Threshold)
          Fills the allTermsOfInterest term-vector with terms whose weight value is greater than the given threshold
 double complexDimensionReduction2(int NumberOfDimensions, int documentFrequencyThreshold)
          Fills the allTermsOfInterest term-vector with the most important terms (according to the weight values of the terms)
 void computeAllTFIDFValues(boolean normalize)
          computes all TFIDF values for all the documents in the corpus, given the terms of interest
 void computeGaussianDistributionForAllTermsInCorpus(TCCorpusDistributionsData distributionsData)
          computes mean and standard deviation of the distribution of frequencies in the entire corpus.
 void computeGaussianDistributionsForAllTermsOfInterestInCorpus(TCCorpusDistributionsData distributionsData)
          computes mean and standard deviation for the plot of the Gaussian distribution for the "terms of interest", that is, all the terms that have been chosen for the current feature space
 void computeInverseDocumentFrequencies()
          Calculates the inverse document frequency for every "term of interest" (all the terms in the reduced feature space) in the corpus
 boolean computeWeightValues(int reductionMethodNumber, int howToSum, int documentFrequencyThreshold, javax.swing.ProgressMonitor calcWeightsProgressMonitor)
          Computes the weight of each term according to the given weighting method
 void exportTrainingDocumentVectors(java.lang.String path)
           
 java.util.ArrayList getAllCategoryLabels()
          returns an arraylist with all the category labels in the corpus
 java.util.ArrayList getAllTerms()
          returns the ArrayList containing all different terms of the corpus
 java.util.ArrayList getAllTermsOfInterest()
          returns the ArrayList containing all terms of interest of the corpus (filled by the dimensionality reduction step)
 TCkNNResult getKNNResultForTestDocument(TCDocument documentToClassify, int k)
          returns the kNN result for the given test document and the given k of the k-nearest-neighbors classifier
 java.util.ArrayList getMyCategories()
           
 java.util.ArrayList getMyDocuments()
           
 TCNaiveBayesResult getNaiveBayesResultForTestDocument(TCDocument documentToClassify)
          gets the Naive Bayes classification result for the given test document
 int getNumberOfCategories()
          returns the number of categories in the training (!) set
 int getNumberOfDocuments()
          returns the number of documents
 int getNumberOfLabeledDocuments()
          returns the number of labeled documents in the corpus (training set); documents are only added to the myDocuments ArrayList if they are assigned to at least one category
 int getNumberOfProcessedWords()
           
 int getNumberOfTerms()
          returns the number of unique terms in the corpus
 int getNumberOfTermsOfInterest()
           
 int getNumberOfUniqueTerms()
           
 double getNumberOfWordsPerDocument()
           
 java.util.ArrayList getScoreList(int classifierIndex)
          returns the score list for the given classifier
 double getTermFrequencyDistributionMean()
           
 double getTermFrequencyDistributionOfAllTermsOfInterestMean()
           
 double getTermFrequencyDistributionOfAllTermsOfInterestStdDev()
           
 double getTermFrequencyDistributionStdDev()
           
 boolean hasKnownCategoryLabel(java.util.ArrayList labels)
          checks if at least one of the labels in "labels" is a known label (a label of a category to which at least one training document was assigned)
 boolean myWeightingFunction(int method, int documentFrequencyThreshold, javax.swing.ProgressMonitor calcWeightsProgressMonitor)
          my own weighting function (more later)
 void removeTerm(TCTerm term)
          removes the given term (recursively, i.e., in every category and in every document)
 void setScoreList(java.util.ArrayList scoreList, int classifierIndex)
          sets the score list for the given classifier
 double simpleDimensionReduction(int Threshold)
          Fills the allTermsOfInterest term-vector with terms whose document frequency value is greater than the given threshold
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

myCategories

java.util.ArrayList myCategories
All categories in this corpus


myDocuments

java.util.ArrayList myDocuments
all the documents belonging to this corpus


allCategoryLabels

java.util.ArrayList allCategoryLabels
contains all category labels of all the categories in the corpus


allTerms

private java.util.ArrayList allTerms
all terms in this corpus (with the corresponding number of occurrences, term frequency, ...); see class TCTerm


allTermsOfInterest

private java.util.ArrayList allTermsOfInterest
contains all the terms after the application of a dimensionality reduction step. All classification algorithms use these terms for their computations.


numberOfDocuments

private int numberOfDocuments
the number of documents in the corpus


numberOfUniqueTerms

private int numberOfUniqueTerms
the number of different terms in the corpus


numberOfProcessedWords

private int numberOfProcessedWords
number of processed words, that is, all (not necessarily unique) words that appear in all the documents


numberOfWordsPerDocument

double numberOfWordsPerDocument
average number of words per document


termFrequencyDistributionOfAllTermsMean

double termFrequencyDistributionOfAllTermsMean
the mean of the distribution of term frequencies of all terms in the corpus


termFrequencyDistributionOfAllTermsOfInterestMean

double termFrequencyDistributionOfAllTermsOfInterestMean
the mean of the distribution of term frequencies of all terms of interest


termFrequencyDistributionOfAllTermsStdDev

double termFrequencyDistributionOfAllTermsStdDev
standard deviation of the distribution of term frequencies in the entire corpus


termFrequencyDistributionOfAllTermsOfInterestStdDev

double termFrequencyDistributionOfAllTermsOfInterestStdDev
the standard deviation of the distribution of term frequencies of all terms of interest


overallDocumentFrequency

int overallDocumentFrequency
the sum of all document frequencies of all terms in the entire corpus


overallTermFrequency

int overallTermFrequency
the sum of all term frequencies of all terms in the entire corpus


kNNScoreList

java.util.ArrayList kNNScoreList
holds the {labels, score} pair for every document computed by the kNN algorithm


naiveBayesScoreList

java.util.ArrayList naiveBayesScoreList
holds the Naive Bayes (per-category) scores for every document, computed for the current dimensionality

Constructor Detail

TCCorpus

public TCCorpus()
Method Detail

getTermFrequencyDistributionMean

public double getTermFrequencyDistributionMean()

getTermFrequencyDistributionStdDev

public double getTermFrequencyDistributionStdDev()

getTermFrequencyDistributionOfAllTermsOfInterestMean

public double getTermFrequencyDistributionOfAllTermsOfInterestMean()

getTermFrequencyDistributionOfAllTermsOfInterestStdDev

public double getTermFrequencyDistributionOfAllTermsOfInterestStdDev()

getNumberOfLabeledDocuments

public int getNumberOfLabeledDocuments()
returns the number of labeled documents in the corpus (training set). Documents are only added to the myDocuments ArrayList if they are assigned to at least one category.

Returns:
the number of labeled documents

getNumberOfWordsPerDocument

public double getNumberOfWordsPerDocument()

getNumberOfProcessedWords

public int getNumberOfProcessedWords()

getNumberOfUniqueTerms

public int getNumberOfUniqueTerms()

getAllTerms

public java.util.ArrayList getAllTerms()
returns the ArrayList containing all different terms of the corpus

Returns:
all unique terms of the corpus

getScoreList

public java.util.ArrayList getScoreList(int classifierIndex)
returns the score list for the given classifier

Parameters:
classifierIndex - the index of the classifier
Returns:
the scoring (as an arraylist)

setScoreList

public void setScoreList(java.util.ArrayList scoreList,
                         int classifierIndex)
sets the score list for the given classifier

Parameters:
scoreList - the arraylist with the actual scores
classifierIndex - the index of the classifier

getAllTermsOfInterest

public java.util.ArrayList getAllTermsOfInterest()
returns the ArrayList containing all terms of interest of the corpus (filled by the dimensionality reduction step)

Returns:
all unique terms of interest of the corpus

getNumberOfTermsOfInterest

public int getNumberOfTermsOfInterest()

getMyCategories

public java.util.ArrayList getMyCategories()

getMyDocuments

public java.util.ArrayList getMyDocuments()

getNumberOfDocuments

public int getNumberOfDocuments()
returns the number of documents

Returns:
number of documents in the corpus

getNumberOfCategories

public int getNumberOfCategories()
returns the number of categories in the training (!) set

Returns:
the number of distinct categories

getAllCategoryLabels

public java.util.ArrayList getAllCategoryLabels()
returns an arraylist with all the category labels in the corpus

Returns:
the arraylist containing all the labels

hasKnownCategoryLabel

public boolean hasKnownCategoryLabel(java.util.ArrayList labels)
checks if at least one of the labels in "labels" is a known label (a label of a category to which at least one training document was assigned)

Parameters:
labels - the category labels of the document
Returns:
true, if the label is known

getNumberOfTerms

public int getNumberOfTerms()
returns the number of unique terms in the corpus

Returns:
number of unique terms in the corpus

removeTerm

public void removeTerm(TCTerm term)
removes the given term (recursively, i.e., in every category and in every document)

Parameters:
term - term to delete out of the corpus

computeGaussianDistributionForAllTermsInCorpus

public void computeGaussianDistributionForAllTermsInCorpus(TCCorpusDistributionsData distributionsData)
computes mean and standard deviation of the distribution of frequencies in the entire corpus. (considering all loaded unique words)

Parameters:
distributionsData - the distribution data object to save the results in
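
The page does not say how the mean and standard deviation are computed. As a minimal sketch, assuming the plain arithmetic mean and the population standard deviation (dividing by n) over one term frequency per unique term, the calculation could look as follows. The class and method names are hypothetical and not part of TCCorpus.

```java
final class TermFrequencyStatsSketch {
    /** Arithmetic mean of the given term frequencies (one entry per unique term). */
    static double mean(int[] termFrequencies) {
        if (termFrequencies.length == 0) {
            throw new IllegalArgumentException("at least one term is required");
        }
        double sum = 0.0;
        for (int tf : termFrequencies) {
            sum += tf;
        }
        return sum / termFrequencies.length;
    }

    /** Population standard deviation (divides by n; this choice is an assumption). */
    static double standardDeviation(int[] termFrequencies) {
        double mean = mean(termFrequencies);
        double sumOfSquares = 0.0;
        for (int tf : termFrequencies) {
            double deviation = tf - mean;
            sumOfSquares += deviation * deviation;
        }
        return Math.sqrt(sumOfSquares / termFrequencies.length);
    }
}
```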

computeGaussianDistributionsForAllTermsOfInterestInCorpus

public void computeGaussianDistributionsForAllTermsOfInterestInCorpus(TCCorpusDistributionsData distributionsData)
computes mean and standard deviation for the plot of the Gaussian distribution for the "terms of interest", that is, all the terms that have been chosen for the current feature space

Parameters:
distributionsData - the data object to save the results in

computeInverseDocumentFrequencies

public void computeInverseDocumentFrequencies()
Calculates the inverse document frequency for every "term of interest" (all the terms in the reduced feature space) in the corpus
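
The exact formula used by this method is not documented here. A common definition is idf(t) = log(N / df(t)), where N is the number of documents and df(t) the number of documents containing term t. The following sketch assumes that variant; the class and method names are hypothetical.

```java
final class IdfSketch {
    /**
     * Inverse document frequency as log(N / df).
     * Assumed formula; TCCorpus may use a different variant (e.g. with smoothing).
     */
    static double inverseDocumentFrequency(int numberOfDocuments, int documentFrequency) {
        if (numberOfDocuments <= 0 || documentFrequency <= 0) {
            throw new IllegalArgumentException("counts must be positive");
        }
        return Math.log((double) numberOfDocuments / documentFrequency);
    }
}
```

Under this definition, a term that occurs in every document gets an IDF of 0, and rarer terms get larger values.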


computeWeightValues

public boolean computeWeightValues(int reductionMethodNumber,
                                   int howToSum,
                                   int documentFrequencyThreshold,
                                   javax.swing.ProgressMonitor calcWeightsProgressMonitor)
Computes the weight of each term according to the given weighting method

Parameters:
reductionMethodNumber - The number of the weighting function (0 = chi-square, 1 = mutual-information, 2 = odds-ratio, 3 = NGL, 4 = GSS, 5 = relevancy-score, 6 = information-gain, 7 = document frequency)

howToSum - How the results for every category should be combined into the result for the entire corpus (0 = sum, 1 = weighted sum, 2 = maximum)
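
To illustrate one of the weighting functions and the three howToSum modes, here is a rough sketch. It assumes the standard chi-square statistic over a 2x2 term/category contingency table, and it assumes that the "weighted sum" weights each category score by the category prior P(c). Neither assumption is confirmed by this page, and all names are hypothetical.

```java
final class ChiSquareSketch {
    /**
     * Chi-square for one term and one category.
     * a: documents in the category containing the term
     * b: documents outside the category containing the term
     * c: documents in the category without the term
     * d: documents outside the category without the term
     */
    static double chiSquare(long a, long b, long c, long d) {
        long n = a + b + c + d;
        double denominator = (double) (a + c) * (b + d) * (a + b) * (c + d);
        if (denominator == 0.0) {
            return 0.0; // degenerate table: no evidence of dependence
        }
        double diff = (double) a * d - (double) c * b;
        return n * diff * diff / denominator;
    }

    /** Combines per-category scores: 0 = sum, 1 = weighted sum (by assumed prior), 2 = maximum. */
    static double combine(double[] perCategoryScores, double[] categoryPriors, int howToSum) {
        double result = (howToSum == 2) ? Double.NEGATIVE_INFINITY : 0.0;
        for (int i = 0; i < perCategoryScores.length; i++) {
            switch (howToSum) {
                case 0:
                    result += perCategoryScores[i];
                    break;
                case 1:
                    result += categoryPriors[i] * perCategoryScores[i];
                    break;
                case 2:
                    result = Math.max(result, perCategoryScores[i]);
                    break;
                default:
                    throw new IllegalArgumentException("unknown howToSum: " + howToSum);
            }
        }
        return result;
    }
}
```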

myWeightingFunction

public boolean myWeightingFunction(int method,
                                   int documentFrequencyThreshold,
                                   javax.swing.ProgressMonitor calcWeightsProgressMonitor)
my own weighting function (more later)


computeAllTFIDFValues

public void computeAllTFIDFValues(boolean normalize)
computes all TFIDF values for all the documents in the corpus, given the terms of interest

Parameters:
normalize - indicates if the values should be normalized (cosine normalization)
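
As a minimal sketch of what TFIDF weighting with optional cosine normalization usually means, the following assumes the weight of a term in a document is tf * idf and that normalization divides the vector by its Euclidean length. The actual TCCorpus computation may differ; the names below are hypothetical.

```java
final class TfIdfSketch {
    /**
     * Computes tf * idf for every term of interest in one document.
     * With normalize = true the vector is scaled to unit length (cosine normalization).
     */
    static double[] tfIdf(int[] termFrequencies, double[] idf, boolean normalize) {
        if (termFrequencies.length != idf.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double[] weights = new double[termFrequencies.length];
        double sumOfSquares = 0.0;
        for (int i = 0; i < weights.length; i++) {
            weights[i] = termFrequencies[i] * idf[i];
            sumOfSquares += weights[i] * weights[i];
        }
        if (normalize && sumOfSquares > 0.0) {
            double norm = Math.sqrt(sumOfSquares);
            for (int i = 0; i < weights.length; i++) {
                weights[i] /= norm;
            }
        }
        return weights;
    }
}
```

Normalizing makes documents of different lengths comparable, which matters for the similarity computations used by kNN.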

simpleDimensionReduction

public double simpleDimensionReduction(int Threshold)
Fills the allTermsOfInterest term-vector with terms whose document frequency value is greater than the given threshold

Parameters:
Threshold - the document frequency threshold
Returns:
the aggressivity of the dimension reduction
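
The selection rule described above can be sketched as follows. How the return value is defined is not stated on this page; the sketch assumes the ratio of the original number of terms to the number of retained terms, which is one definition found in the text categorization literature. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class DfThresholdSketch {
    /** Returns the indices of all terms whose document frequency is greater than the threshold. */
    static List<Integer> selectByDocumentFrequency(int[] documentFrequencies, int threshold) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < documentFrequencies.length; i++) {
            if (documentFrequencies[i] > threshold) {
                kept.add(i);
            }
        }
        return kept;
    }

    /** Assumed definition: original term count divided by retained term count. */
    static double aggressivity(int originalTerms, int keptTerms) {
        if (keptTerms <= 0) {
            throw new IllegalArgumentException("no terms retained");
        }
        return (double) originalTerms / keptTerms;
    }
}
```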

complexDimensionReduction

public double complexDimensionReduction(double Threshold)
Fills the allTermsOfInterest term-vector with terms whose weight value is greater than the given threshold

Parameters:
Threshold - the weight threshold
Returns:
the aggressivity of the dimension reduction

complexDimensionReduction2

public double complexDimensionReduction2(int NumberOfDimensions,
                                         int documentFrequencyThreshold)
Fills the allTermsOfInterest term-vector with the most important terms (according to the weight values of the terms)

Parameters:
NumberOfDimensions - the number of dimensions for the feature vectors (size of the allTermsOfInterest term-vector)
Returns:
the aggressivity of the dimension reduction
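
A possible reading of this method is "keep the N highest-weighted terms". The role of documentFrequencyThreshold is not documented here; the sketch below assumes it excludes terms whose document frequency does not exceed the threshold before ranking. The names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TopNTermsSketch {
    /** Returns the indices of the numberOfDimensions highest-weighted terms, best first. */
    static List<Integer> selectTopTerms(double[] weights, int[] documentFrequencies,
                                        int numberOfDimensions, int documentFrequencyThreshold) {
        List<Integer> candidates = new ArrayList<>();
        for (int i = 0; i < weights.length; i++) {
            // Assumed pre-filter: drop terms that occur in too few documents.
            if (documentFrequencies[i] > documentFrequencyThreshold) {
                candidates.add(i);
            }
        }
        // Sort candidate indices by descending weight.
        candidates.sort(Comparator.comparingDouble((Integer i) -> weights[i]).reversed());
        int size = Math.min(numberOfDimensions, candidates.size());
        return new ArrayList<>(candidates.subList(0, size));
    }
}
```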

applyKNNToTestDocument

public void applyKNNToTestDocument(TCDocument documentToClassify,
                                   int method)
applies the kNN classifier to the given document (out of the test documents!). Calculates the similarity to all documents in the training set and saves the scores in a list, together with the labels the documents belong to.

Parameters:
documentToClassify - the document to classify
method - the similarity function
1: cosine similarity
2: Euclidean distance
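
The two similarity functions named above have standard definitions, sketched below for dense feature vectors. How TCCorpus stores document vectors is not shown on this page, so the double[] representation and the names are assumptions.

```java
final class SimilaritySketch {
    /** Cosine of the angle between a and b; returns 0 if either vector is all zeros. */
    static double cosineSimilarity(double[] a, double[] b) {
        checkSameLength(a, b);
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Euclidean distance between a and b (smaller means more similar). */
    static double euclideanDistance(double[] a, double[] b) {
        checkSameLength(a, b);
        double sumOfSquares = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }

    private static void checkSameLength(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
    }
}
```

Note that the two measures point in opposite directions: a higher cosine similarity means a closer neighbor, while a higher Euclidean distance means a more distant one.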

applyKNNToDocumentFromFile

public TCkNNResult applyKNNToDocumentFromFile(TCDocument documentToClassify,
                                              int method,
                                              int k)
applies the kNN classifier to the given document (loaded from a file before!). Calculates the similarity to all documents in the training set and returns the TCkNNResult.

Parameters:
documentToClassify - the document to classify
method - the similarity function
1: cosine similarity
2: Euclidean distance

getKNNResultForTestDocument

public TCkNNResult getKNNResultForTestDocument(TCDocument documentToClassify,
                                               int k)
returns the kNN result for the given test document and the given k of the k-nearest-neighbors classifier

Parameters:
k - the number of nearest neighbors to consider
Returns:
the TCkNNResult for the document
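
How a result is derived from the stored {labels, score} pairs is not spelled out here. One common scheme, shown as a sketch only, takes the k best-scoring training documents and sums their scores per label. The sketch simplifies each training document to a single label and assumes a higher score means a closer neighbor; the Neighbor class and all other names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class KnnVoteSketch {
    /** Hypothetical {label, score} pair for one training document. */
    static final class Neighbor {
        final String label;
        final double score;

        Neighbor(String label, double score) {
            this.label = label;
            this.score = score;
        }
    }

    /** Returns the label with the highest summed score among the k best neighbors. */
    static String classify(List<Neighbor> scoredNeighbors, int k) {
        List<Neighbor> sorted = new ArrayList<>(scoredNeighbors);
        sorted.sort(Comparator.comparingDouble((Neighbor n) -> n.score).reversed());

        Map<String, Double> votes = new HashMap<>();
        for (Neighbor n : sorted.subList(0, Math.min(k, sorted.size()))) {
            votes.merge(n.label, n.score, Double::sum);
        }

        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> entry : votes.entrySet()) {
            if (entry.getValue() > bestScore) {
                best = entry.getKey();
                bestScore = entry.getValue();
            }
        }
        return best; // null if there are no neighbors
    }
}
```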

applyNaiveBayesBernoulliToTestDocument

public void applyNaiveBayesBernoulliToTestDocument(TCDocument documentToClassify)
Classifies one document out of the test documents with the naive bayes algorithm (bernoulli). Saves the results in the ArrayList naiveBayesScoreList for later further processing.

Parameters:
documentToClassify - the document to classify
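
For orientation, the Bernoulli variant of Naive Bayes scores a category from the presence or absence of every term of interest, not from term counts. The sketch below computes a log score for one category, assuming Laplace smoothing of the per-term probabilities. The smoothing choice and all names are assumptions, not the documented behavior of this method.

```java
final class BernoulliNaiveBayesSketch {
    /**
     * log P(c) + sum over terms of log P(term present | c) or log P(term absent | c).
     *
     * termPresent[t]                  whether term t occurs in the document to classify
     * docsInCategoryContainingTerm[t] training documents of the category containing term t
     * docsInCategory                  training documents in the category
     * totalDocs                       training documents overall
     */
    static double logScore(boolean[] termPresent, int[] docsInCategoryContainingTerm,
                           int docsInCategory, int totalDocs) {
        double score = Math.log((double) docsInCategory / totalDocs);
        for (int t = 0; t < termPresent.length; t++) {
            // Laplace-smoothed estimate of P(term t present | category); assumed smoothing.
            double p = (docsInCategoryContainingTerm[t] + 1.0) / (docsInCategory + 2.0);
            score += termPresent[t] ? Math.log(p) : Math.log(1.0 - p);
        }
        return score;
    }
}
```

The category with the highest log score would be the predicted one. Absent terms contribute to the score as well, which is what distinguishes the Bernoulli model from the multinomial one.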

applyNaiveBayesBernoulliToDocumentFromFile

public TCNaiveBayesResult applyNaiveBayesBernoulliToDocumentFromFile(TCDocument documentToClassify)
Classifies one document out of the test documents with the naive bayes algorithm (bernoulli). Saves the results in the ArrayList naiveBayesScoreList for later further processing.

Parameters:
documentToClassify - the document to classify

getNaiveBayesResultForTestDocument

public TCNaiveBayesResult getNaiveBayesResultForTestDocument(TCDocument documentToClassify)
gets the Naive Bayes classification result for the given test document

Returns:

exportTrainingDocumentVectors

public void exportTrainingDocumentVectors(java.lang.String path)
                                   throws java.io.IOException
Throws:
java.io.IOException

addDocument

public void addDocument(TCDocument newDocument)
                 throws java.lang.CloneNotSupportedException
adds a new document to the corpus. When loading a corpus, this is the only method that must be called once the TCDocument object has been created.

Parameters:
newDocument - the new document
Throws:
java.lang.CloneNotSupportedException