java.lang.Object
  tc.TCCorpus
Class representing the whole corpus of documents. Contains the methods and members for the dimensionality reduction step (different weighting functions) and the implementations of the k-nearest-neighbor and Naive Bayes (Bernoulli) classifiers, among others.
| Field Summary | |
(package private) java.util.ArrayList |
allCategoryLabels
contains all category labels of all the categories in the corpus |
private java.util.ArrayList |
allTerms
all terms in this corpus (with the corresponding number of occurrences, term frequency, ...); see class TCTerm |
private java.util.ArrayList |
allTermsOfInterest
contains all the terms after the application of a dimensionality reduction step. |
(package private) java.util.ArrayList |
kNNScoreList
holds the {labels, score} pair for every document computed by the kNN algorithm |
(package private) java.util.ArrayList |
myCategories
All categories in this corpus |
(package private) java.util.ArrayList |
myDocuments
all the documents belonging to this corpus |
(package private) java.util.ArrayList |
naiveBayesScoreList
holds the naive Bayes (category) scores for every document, according to the current dimensionality |
private int |
numberOfDocuments
the number of documents in the corpus |
private int |
numberOfProcessedWords
number of processed words, that is, all (not necessarily unique) words that appear in all the documents |
private int |
numberOfUniqueTerms
the number of different terms in the corpus |
(package private) double |
numberOfWordsPerDocument
average number of words per document |
(package private) int |
overallDocumentFrequency
the sum of all document frequencies of all terms in the entire corpus |
(package private) int |
overallTermFrequency
the sum of all term frequencies of all terms in the entire corpus |
(package private) double |
termFrequencyDistributionOfAllTermsMean
the mean of the distribution of term frequencies of all terms in the corpus |
(package private) double |
termFrequencyDistributionOfAllTermsOfInterestMean
the mean of the distribution of term frequencies of all terms of interest |
(package private) double |
termFrequencyDistributionOfAllTermsOfInterestStdDev
the standard deviation of the distribution of term frequencies of all terms of interest |
(package private) double |
termFrequencyDistributionOfAllTermsStdDev
standard deviation of the distribution of term frequencies in the entire corpus |
| Constructor Summary | |
TCCorpus()
|
|
| Method Summary | |
void |
addDocument(TCDocument newDocument)
adds a new document to the corpus. |
TCkNNResult |
applyKNNToDocumentFromFile(TCDocument documentToClassify,
int method,
int k)
applies the kNN classifier to the given document (loaded from a file before!). |
void |
applyKNNToTestDocument(TCDocument documentToClassify,
int method)
applies the kNN classifier to the given document (one of the test documents!): calculates the similarity to all documents in the training set and saves the scores in a list, together with the labels those documents belong to |
TCNaiveBayesResult |
applyNaiveBayesBernoulliToDocumentFromFile(TCDocument documentToClassify)
Classifies one document out of the test documents with the naive bayes algorithm (bernoulli). |
void |
applyNaiveBayesBernoulliToTestDocument(TCDocument documentToClassify)
Classifies one document out of the test documents with the naive bayes algorithm (bernoulli). |
double |
complexDimensionReduction(double Threshold)
Fills the allTermsOfInterest term-vector with terms whose weight value is greater than the given threshold |
double |
complexDimensionReduction2(int NumberOfDimensions,
int documentFrequencyThreshold)
Fills the allTermsOfInterest term-vector with the most important terms (according to the weight values of the terms) |
void |
computeAllTFIDFValues(boolean normalize)
computes all TFIDF values for all the documents in the corpus, given the terms of interest |
void |
computeGaussianDistributionForAllTermsInCorpus(TCCorpusDistributionsData distributionsData)
computes mean and standard deviation of the distribution of frequencies in the entire corpus. |
void |
computeGaussianDistributionsForAllTermsOfInterestInCorpus(TCCorpusDistributionsData distributionsData)
computes the mean and standard deviation for the plot of the Gaussian distribution of the "terms of interest", that is, all the terms which have been chosen for the current feature space |
void |
computeInverseDocumentFrequencies()
Calculates the inverse document frequency for every "term of interest" (all the terms in the reduced feature space) in the corpus |
boolean |
computeWeightValues(int reductionMethodNumber,
int howToSum,
int documentFrequencyThreshold,
javax.swing.ProgressMonitor calcWeightsProgressMonitor)
Computes the weight of each term according to the given weighting method |
void |
exportTrainingDocumentVectors(java.lang.String path)
|
java.util.ArrayList |
getAllCategoryLabels()
returns an arraylist with all the category labels in the corpus |
java.util.ArrayList |
getAllTerms()
returns the ArrayList containing all different terms of the corpus |
java.util.ArrayList |
getAllTermsOfInterest()
returns the ArrayList containing all terms of interest of the corpus (filled by the dimensionality reduction step) |
TCkNNResult |
getKNNResultForTestDocument(TCDocument documentToClassify,
int k)
returns the kNN result, given the test document index and k for the k nearest neighbors classifier |
java.util.ArrayList |
getMyCategories()
|
java.util.ArrayList |
getMyDocuments()
|
TCNaiveBayesResult |
getNaiveBayesResultForTestDocument(TCDocument documentToClassify)
gets the naive bayes classification result for the test document with the given index |
int |
getNumberOfCategories()
returns the number of categories in the training (!) set |
int |
getNumberOfDocuments()
returns the number of documents |
int |
getNumberOfLabeledDocuments()
returns the number of labeled documents in the corpus (training set). Documents are only added to the myDocuments ArrayList if they are assigned to at least one category |
int |
getNumberOfProcessedWords()
|
int |
getNumberOfTerms()
returns the number of unique terms in the corpus |
int |
getNumberOfTermsOfInterest()
|
int |
getNumberOfUniqueTerms()
|
double |
getNumberOfWordsPerDocument()
|
java.util.ArrayList |
getScoreList(int classifierIndex)
returns the score list for the given classifier |
double |
getTermFrequencyDistributionMean()
|
double |
getTermFrequencyDistributionOfAllTermsOfInterestMean()
|
double |
getTermFrequencyDistributionOfAllTermsOfInterestStdDev()
|
double |
getTermFrequencyDistributionStdDev()
|
boolean |
hasKnownCategoryLabel(java.util.ArrayList labels)
checks if at least one of the labels in "labels" is a known label (a label of a category which at least one training document was assigned to) |
boolean |
myWeightingFunction(int method,
int documentFrequencyThreshold,
javax.swing.ProgressMonitor calcWeightsProgressMonitor)
my own weighting function (more later) |
void |
removeTerm(TCTerm term)
removes the given term (recursively, i.e., in every category and in every document) |
void |
setScoreList(java.util.ArrayList scoreList,
int classifierIndex)
sets the score list for the given classifier |
double |
simpleDimensionReduction(int Threshold)
Fills the allTermsOfInterest term-vector with terms whose document frequency value is greater than the given threshold |
| Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
java.util.ArrayList myCategories
java.util.ArrayList myDocuments
java.util.ArrayList allCategoryLabels
private java.util.ArrayList allTerms
private java.util.ArrayList allTermsOfInterest
private int numberOfDocuments
private int numberOfUniqueTerms
private int numberOfProcessedWords
double numberOfWordsPerDocument
double termFrequencyDistributionOfAllTermsMean
double termFrequencyDistributionOfAllTermsOfInterestMean
double termFrequencyDistributionOfAllTermsStdDev
double termFrequencyDistributionOfAllTermsOfInterestStdDev
int overallDocumentFrequency
int overallTermFrequency
java.util.ArrayList kNNScoreList
java.util.ArrayList naiveBayesScoreList
| Constructor Detail |
public TCCorpus()
| Method Detail |
public double getTermFrequencyDistributionMean()
public double getTermFrequencyDistributionStdDev()
public double getTermFrequencyDistributionOfAllTermsOfInterestMean()
public double getTermFrequencyDistributionOfAllTermsOfInterestStdDev()
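The four getters above expose the mean and standard deviation of a term-frequency distribution. This page does not show how TCCorpus computes them, so the following is only a minimal sketch, assuming the term frequencies are available as a plain `int[]` and that the population form of the standard deviation (dividing by n) is meant. `TermFrequencyStats` is a hypothetical helper, not part of the `tc` package.

```java
// Hypothetical helper for illustration; not part of the documented tc package.
class TermFrequencyStats {

    /** Arithmetic mean of the given term frequencies. */
    static double mean(int[] termFrequencies) {
        double sum = 0.0;
        for (int tf : termFrequencies) {
            sum += tf;
        }
        return sum / termFrequencies.length;
    }

    /** Population standard deviation (divides by n; this choice is an assumption). */
    static double stdDev(int[] termFrequencies) {
        double m = mean(termFrequencies);
        double squaredDeviations = 0.0;
        for (int tf : termFrequencies) {
            squaredDeviations += (tf - m) * (tf - m);
        }
        return Math.sqrt(squaredDeviations / termFrequencies.length);
    }
}
```

For the frequencies 2, 4, 4, 4, 5, 5, 7, 9 this sketch yields a mean of 5 and a standard deviation of 2.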
public int getNumberOfLabeledDocuments()
public double getNumberOfWordsPerDocument()
public int getNumberOfProcessedWords()
public int getNumberOfUniqueTerms()
public java.util.ArrayList getAllTerms()
public java.util.ArrayList getScoreList(int classifierIndex)
classifierIndex - the index of the classifier
public void setScoreList(java.util.ArrayList scoreList,
int classifierIndex)
scoreList - the ArrayList with the actual scores
classifierIndex - the index of the classifier
public java.util.ArrayList getAllTermsOfInterest()
public int getNumberOfTermsOfInterest()
public java.util.ArrayList getMyCategories()
public java.util.ArrayList getMyDocuments()
public int getNumberOfDocuments()
public int getNumberOfCategories()
public java.util.ArrayList getAllCategoryLabels()
public boolean hasKnownCategoryLabel(java.util.ArrayList labels)
labels - the category labels of the document
public int getNumberOfTerms()
public void removeTerm(TCTerm term)
term - the term to delete from the corpus
public void computeGaussianDistributionForAllTermsInCorpus(TCCorpusDistributionsData distributionsData)
distributionsData - the distribution data object to save the results in
public void computeGaussianDistributionsForAllTermsOfInterestInCorpus(TCCorpusDistributionsData distributionsData)
distributionsData - the data object to save the results in
public void computeInverseDocumentFrequencies()
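computeInverseDocumentFrequencies() is documented as calculating the inverse document frequency for every term of interest. The exact formula is not given on this page; a minimal sketch, assuming the common definition idf(t) = ln(N / df(t)) and a hypothetical map from term to document frequency, might look like this. `IdfSketch` and its parameter names are illustrative only.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the documented tc package.
class IdfSketch {

    /**
     * Computes idf(t) = ln(N / df(t)) for every term.
     * Assumes every document frequency is at least 1.
     */
    static Map<String, Double> inverseDocumentFrequencies(
            Map<String, Integer> documentFrequencies, int numberOfDocuments) {
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> entry : documentFrequencies.entrySet()) {
            double ratio = (double) numberOfDocuments / entry.getValue();
            idf.put(entry.getKey(), Math.log(ratio));
        }
        return idf;
    }
}
```

Under this definition a term that occurs in every document gets an IDF of 0, so it contributes nothing to a TFIDF weight.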
public boolean computeWeightValues(int reductionMethodNumber,
int howToSum,
int documentFrequencyThreshold,
javax.swing.ProgressMonitor calcWeightsProgressMonitor)
reductionMethodNumber - the number of the weighting function
(0 = chi-square, 1 = mutual-information, 2 = odds-ratio, 3 = NGL, 4 = GSS,
5 = relevancy-score, 6 = information-gain, 7 = document frequency)
howToSum - how the results for every category should be combined into the result for the entire corpus
(0 = sum, 1 = weighted sum, 2 = maximum)
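The howToSum codes describe how per-category weights of a term are combined into one corpus-wide weight. The sketch below illustrates the three options under stated assumptions: the per-category scores are already computed, and the "weighted sum" weights each score by a per-category weight such as the category's share of training documents (this weighting is an assumption; the page does not say which weights are used). `CategoryScoreCombiner` is a hypothetical name.

```java
// Hypothetical helper for illustration; not part of the documented tc package.
class CategoryScoreCombiner {
    // Codes as listed for the howToSum parameter.
    static final int SUM = 0;
    static final int WEIGHTED_SUM = 1;
    static final int MAXIMUM = 2;

    /**
     * Combines the per-category scores of one term into a single value.
     * categoryWeights is only read for WEIGHTED_SUM (assumed: one weight per category).
     */
    static double combine(double[] perCategoryScores, double[] categoryWeights, int howToSum) {
        double result = (howToSum == MAXIMUM) ? Double.NEGATIVE_INFINITY : 0.0;
        for (int c = 0; c < perCategoryScores.length; c++) {
            switch (howToSum) {
                case SUM:
                    result += perCategoryScores[c];
                    break;
                case WEIGHTED_SUM:
                    result += categoryWeights[c] * perCategoryScores[c];
                    break;
                case MAXIMUM:
                    result = Math.max(result, perCategoryScores[c]);
                    break;
                default:
                    throw new IllegalArgumentException("unknown howToSum: " + howToSum);
            }
        }
        return result;
    }
}
```

With scores {1, 2, 3} and weights {0.5, 0.25, 0.25}, the three modes give 6, 1.75, and 3 respectively.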
public boolean myWeightingFunction(int method,
int documentFrequencyThreshold,
javax.swing.ProgressMonitor calcWeightsProgressMonitor)
public void computeAllTFIDFValues(boolean normalize)
normalize - indicates if the values should be normalized (cosine normalization)
public double simpleDimensionReduction(int Threshold)
Threshold - the document frequency threshold
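simpleDimensionReduction is described as keeping the terms whose document frequency is greater than the threshold. A minimal sketch of that filter follows, assuming a hypothetical map from term to document frequency in place of the TCTerm objects. The meaning of the method's double return value is not documented here, so the sketch simply returns the kept terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the documented tc package.
class DocumentFrequencyFilter {

    /** Returns the terms whose document frequency is strictly greater than the threshold. */
    static List<String> termsAboveThreshold(Map<String, Integer> documentFrequencies, int threshold) {
        List<String> termsOfInterest = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : documentFrequencies.entrySet()) {
            if (entry.getValue() > threshold) {
                termsOfInterest.add(entry.getKey());
            }
        }
        return termsOfInterest;
    }
}
```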
public double complexDimensionReduction(double Threshold)
Threshold - the weight value threshold
public double complexDimensionReduction2(int NumberOfDimensions,
int documentFrequencyThreshold)
NumberOfDimensions - the number of dimensions for the feature vectors
(size of the allTermsOfInterest term-vector)
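complexDimensionReduction2 is described as keeping the most important terms by weight, up to NumberOfDimensions. One way to picture this is a filter followed by a top-N selection. The sketch below assumes that terms at or below documentFrequencyThreshold are discarded first (the strictness of that comparison is an assumption) and uses a hypothetical `WeightedTerm` record in place of TCTerm.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration; not part of the documented tc package.
class TopTermSelector {

    /** Minimal stand-in for a term with a precomputed weight. */
    static final class WeightedTerm {
        final String text;
        final double weight;
        final int documentFrequency;

        WeightedTerm(String text, double weight, int documentFrequency) {
            this.text = text;
            this.weight = weight;
            this.documentFrequency = documentFrequency;
        }
    }

    /** Keeps the numberOfDimensions highest-weighted terms above the document frequency threshold. */
    static List<String> select(List<WeightedTerm> terms, int numberOfDimensions,
                               int documentFrequencyThreshold) {
        List<WeightedTerm> candidates = new ArrayList<>();
        for (WeightedTerm term : terms) {
            if (term.documentFrequency > documentFrequencyThreshold) {
                candidates.add(term);
            }
        }
        // Highest weight first.
        candidates.sort(Comparator.comparingDouble((WeightedTerm t) -> t.weight).reversed());

        List<String> selected = new ArrayList<>();
        int limit = Math.min(numberOfDimensions, candidates.size());
        for (int i = 0; i < limit; i++) {
            selected.add(candidates.get(i).text);
        }
        return selected;
    }
}
```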
public void applyKNNToTestDocument(TCDocument documentToClassify,
int method)
documentToClassify - the document to classify
method - the similarity function
public TCkNNResult applyKNNToDocumentFromFile(TCDocument documentToClassify,
int method,
int k)
documentToClassify - the document to classify
method - the similarity function
public TCkNNResult getKNNResultForTestDocument(TCDocument documentToClassify,
int k)
k - the number of nearest neighbors to consider
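The kNN methods above score a document against every training document with a similarity function and then derive a result from the k most similar ones. The sketch below is one possible reading, assuming cosine similarity as the similarity function and a vote in which each of the k neighbors adds its similarity to its own label. Which similarity functions the `method` parameter actually selects, and how TCkNNResult is built, is not documented on this page. `KnnSketch` and its array-based inputs are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the documented tc package.
class KnnSketch {

    /** Cosine similarity of two equally long vectors; 0 if either vector is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns the label with the highest summed similarity among the k nearest training vectors. */
    static String classify(double[][] trainingVectors, String[] trainingLabels,
                           double[] query, int k) {
        double[] similarities = new double[trainingVectors.length];
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < trainingVectors.length; i++) {
            similarities[i] = cosine(trainingVectors[i], query);
            order.add(i);
        }
        // Most similar training document first.
        order.sort((x, y) -> Double.compare(similarities[y], similarities[x]));

        Map<String, Double> votes = new HashMap<>();
        int limit = Math.min(k, order.size());
        for (int rank = 0; rank < limit; rank++) {
            int index = order.get(rank);
            votes.merge(trainingLabels[index], similarities[index], Double::sum);
        }

        String bestLabel = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> vote : votes.entrySet()) {
            if (vote.getValue() > bestScore) {
                bestScore = vote.getValue();
                bestLabel = vote.getKey();
            }
        }
        return bestLabel;
    }
}
```

Storing the similarities once and choosing k afterwards mirrors the split between applyKNNToTestDocument (which saves scores) and getKNNResultForTestDocument (which takes k), though that correspondence is an interpretation of the summaries, not a documented fact.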
public void applyNaiveBayesBernoulliToTestDocument(TCDocument documentToClassify)
documentToClassify - the document to classify
public TCNaiveBayesResult applyNaiveBayesBernoulliToDocumentFromFile(TCDocument documentToClassify)
documentToClassify - the document to classify
public TCNaiveBayesResult getNaiveBayesResultForTestDocument(TCDocument documentToClassify)
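The Naive Bayes methods above use the Bernoulli model, in which every term of interest contributes to a category's score whether it is present in the document or absent from it. The following is a minimal sketch of one category's log score, assuming Laplace (add-one) smoothing and array inputs indexed by term. The smoothing choice and all names are assumptions; the page does not describe how TCNaiveBayesResult is computed.

```java
// Hypothetical helper for illustration; not part of the documented tc package.
class BernoulliNaiveBayesSketch {

    /**
     * Log score of one category for one document:
     * log P(c) + sum over terms of log P(t|c) if the term is present,
     * or log(1 - P(t|c)) if it is absent.
     * P(t|c) is estimated with add-one smoothing (an assumption).
     */
    static double logScore(boolean[] termPresentInDocument,
                           int[] categoryDocumentsContainingTerm,
                           int documentsInCategory,
                           int totalDocuments) {
        double score = Math.log((double) documentsInCategory / totalDocuments);
        for (int t = 0; t < termPresentInDocument.length; t++) {
            double pTermGivenCategory =
                    (categoryDocumentsContainingTerm[t] + 1.0) / (documentsInCategory + 2.0);
            score += Math.log(termPresentInDocument[t]
                    ? pTermGivenCategory
                    : 1.0 - pTermGivenCategory);
        }
        return score;
    }
}
```

A classifier built on this would compute the score for every category and pick the highest one.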
public void exportTrainingDocumentVectors(java.lang.String path)
throws java.io.IOException
java.io.IOException
public void addDocument(TCDocument newDocument)
throws java.lang.CloneNotSupportedException
newDocument - the new document
java.lang.CloneNotSupportedException