tc
Class TCDocument

java.lang.Object
  extended by tc.TCDocument

public class TCDocument
extends java.lang.Object

This class represents a single document in the TC approach. It contains methods for preprocessing (tokenizing, filtering the most frequently used words, and so on), for creating the bag of words, and for calculating the TFIDF values, the similarity to another vector, etc.


Field Summary
private  java.util.ArrayList allTerms
          all the distinct terms in the document
(package private)  java.util.ArrayList categoryLabels
          an arraylist with all category labels for this document
private  int numberOfProcessedWords
          the number of processed words for the document (all word tokens with length > 1)
private  int numberOfUniqueTerms
          the number of unique (distinct) terms in the document
(package private)  int testSetIndex
           
 
Constructor Summary
TCDocument(java.util.ArrayList cl)
          Constructor of this Class
 
Method Summary
 void computeTFIDFValues(java.util.ArrayList corpusAllTermsOfInterest, boolean normalize)
          Computes the TFIDF values for this document given the corpus terms of interest and their IDF values
 java.util.ArrayList getAllTerms()
          returns the ArrayList with all unique terms
 java.lang.String getCategoryLabel(int i)
          returns the i-th category label for the document
 java.lang.String getCategoryLabelAsString(int index)
           
 java.lang.String getCategoryLabelsAsString()
          returns the category labels for the document as a formatted string
 TCkNNListElement getCosineSimilarity(TCDocument documentToClassify, java.util.ArrayList corpusAllTermsOfInterest)
          (Only for unnormalized document vectors.) Calculates the cosine similarity of two document weight vectors, given the vector of all relevant terms in the corpus (corpusAllTermsOfInterest). Can, e.g., be used for the k-NN algorithm.
 TCkNNListElement getCosineSimilarityFast(TCDocument documentToClassify, java.util.ArrayList corpusAllTermsOfInterest)
          Calculates the cosine similarity of two document weight vectors, given the vector of all relevant terms in the corpus (corpusAllTermsOfInterest). Can, e.g., be used for the k-NN algorithm.
 TCkNNListElement getEuclidianDistance(TCDocument documentToClassify, java.util.ArrayList corpusAllTermsOfInterest)
          Calculates the Euclidean distance between two document weight vectors, given the vector of all relevant terms in the corpus (corpusAllTermsOfInterest). Can, e.g., be used for the k-NN algorithm.
 int getNumberOfLabels()
          returns the number of categories the document is assigned to
 int getNumberOfProcessedWords()
          the number of processed words in the document
 int getNumberOfUniqueTerms()
          returns the number of unique terms in the document
 java.lang.String getTerm(int i)
          returns the term with index i
 int getTermFrequency(int i)
          Returns the term frequency of the i-th term in the ArrayList (number of times the term occurs in the document)
 int getTestSetIndex()
           
 boolean isLabeled()
          returns true if the document is assigned to at least one category
 boolean isLabeledAs(java.lang.String label)
          checks if the document is labeled with "label"
 void loadData(java.lang.String inputString, boolean isPath, boolean removeStopWords, java.lang.String stop_word_string, int stemmerMethod)
          Loads the document data, creates the bag of terms for this document and saves the number of occurrences for each term
 void removeTerm(TCTerm term)
          Removes the term out of this document (forever!)
 void setTermFrequency(int i, int occ)
          sets the term frequency of a term in the document (used in the loadData function)
 void setTestSetIndex(int index)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

testSetIndex

int testSetIndex

categoryLabels

java.util.ArrayList categoryLabels
an arraylist with all category labels for this document


allTerms

private java.util.ArrayList allTerms
all the distinct terms in the document


numberOfUniqueTerms

private int numberOfUniqueTerms
the number of unique (distinct) terms in the document


numberOfProcessedWords

private int numberOfProcessedWords
the number of processed words for the document (all word tokens with length > 1)

Constructor Detail

TCDocument

public TCDocument(java.util.ArrayList cl)
Constructor of this Class

Parameters:
cl - the ArrayList containing all the category labels for the document (list is empty if it doesn't belong to any category)
Method Detail

isLabeled

public boolean isLabeled()
returns true if the document is assigned to at least one category

Returns:
true if the document is assigned to at least one category, otherwise false

setTestSetIndex

public void setTestSetIndex(int index)

getTestSetIndex

public int getTestSetIndex()

getNumberOfLabels

public int getNumberOfLabels()
returns the number of categories the document is assigned to

Returns:
the number of categories the document is assigned to

getCategoryLabelsAsString

public java.lang.String getCategoryLabelsAsString()
returns the category labels for the document as a formatted string

Returns:
the formatted string containing all the labels for the document

getCategoryLabelAsString

public java.lang.String getCategoryLabelAsString(int index)

getCategoryLabel

public java.lang.String getCategoryLabel(int i)
returns the i-th category label for the document

Parameters:
i - the index for the position in the ArrayList
Returns:
the i-th category label

getNumberOfUniqueTerms

public int getNumberOfUniqueTerms()
returns the number of unique terms in the document

Returns:
the number of unique terms in the document

getAllTerms

public java.util.ArrayList getAllTerms()
returns the ArrayList with all unique terms

Returns:
the unique terms in the document as an ArrayList of TCTerm-Objects

getNumberOfProcessedWords

public int getNumberOfProcessedWords()
the number of processed words in the document

Returns:
the number of processed words

isLabeledAs

public boolean isLabeledAs(java.lang.String label)
checks if the document is labeled with "label"

Parameters:
label - the label to check for
Returns:
true if the document is labeled with "label"

getTermFrequency

public int getTermFrequency(int i)
Returns the term frequency of the i-th term in the ArrayList (number of times the term occurs in the document)

Parameters:
i - index of term in the term-vector
Returns:
number of times the term occurs in the document

setTermFrequency

public void setTermFrequency(int i,
                             int occ)
sets the term frequency of a term in the document (used in the loadData function)

Parameters:
i - the index of the term in the document
occ - the new number of occurrences for this term

getTerm

public java.lang.String getTerm(int i)
returns the term with index i

Parameters:
i - the index of the term within the document
Returns:
the term with index i within the document

removeTerm

public void removeTerm(TCTerm term)
Removes the term out of this document (forever!)

Parameters:
term - the term to be removed

getEuclidianDistance

public TCkNNListElement getEuclidianDistance(TCDocument documentToClassify,
                                             java.util.ArrayList corpusAllTermsOfInterest)
Calculates the Euclidean distance between two document weight vectors, given the vector of all relevant terms in the corpus (corpusAllTermsOfInterest). Can, e.g., be used for the k-NN algorithm.

Parameters:
documentToClassify - The document to which the distance is calculated
corpusAllTermsOfInterest - The terms of interest in the entire corpus
Returns:
The Euclidean distance between the weight vectors
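
As an illustration of the computation described above, here is a minimal, self-contained sketch. It is not the TCDocument implementation: the class name EuclideanDistanceSketch, the use of a Map from term to weight, and the rule that a missing term has weight 0 are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone sketch; not part of the tc package.
public class EuclideanDistanceSketch {

    // Sums squared weight differences over the corpus terms of interest.
    // Assumption: a term that a document does not contain has weight 0.
    public static double euclideanDistance(Map<String, Double> a,
                                           Map<String, Double> b,
                                           List<String> termsOfInterest) {
        double sumOfSquares = 0.0;
        for (String term : termsOfInterest) {
            double diff = a.getOrDefault(term, 0.0) - b.getOrDefault(term, 0.0);
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        Map<String, Double> docA = new HashMap<>();
        docA.put("apple", 3.0);
        Map<String, Double> docB = new HashMap<>();
        docB.put("banana", 4.0);
        List<String> terms = Arrays.asList("apple", "banana", "cherry");
        // sqrt(3^2 + 4^2 + 0^2) = 5.0
        System.out.println(euclideanDistance(docA, docB, terms));
    }
}
```

For k-NN, a smaller distance means a closer neighbour, so candidates would be sorted in ascending order.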

getCosineSimilarity

public TCkNNListElement getCosineSimilarity(TCDocument documentToClassify,
                                            java.util.ArrayList corpusAllTermsOfInterest)
(Only for unnormalized document vectors.) Calculates the cosine similarity of two document weight vectors, given the vector of all relevant terms in the corpus (corpusAllTermsOfInterest). Can, e.g., be used for the k-NN algorithm.

Parameters:
documentToClassify - The document to which the similarity is calculated
corpusAllTermsOfInterest - The terms of interest in the entire corpus
Returns:
The cosine similarity for the weight-vectors
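
The following sketch shows the standard cosine similarity for unnormalized weight vectors: the dot product divided by the product of the two vector lengths. It is an illustration only. The class name CosineSimilaritySketch, the Map-based vectors, and the choice to return 0 for an all-zero vector are assumptions, not behaviour documented for TCDocument.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone sketch; not part of the tc package.
public class CosineSimilaritySketch {

    // cos(a, b) = (a . b) / (|a| * |b|), taken over the corpus terms of interest.
    public static double cosineSimilarity(Map<String, Double> a,
                                          Map<String, Double> b,
                                          List<String> termsOfInterest) {
        double dot = 0.0;
        double squaredNormA = 0.0;
        double squaredNormB = 0.0;
        for (String term : termsOfInterest) {
            double wa = a.getOrDefault(term, 0.0);
            double wb = b.getOrDefault(term, 0.0);
            dot += wa * wb;
            squaredNormA += wa * wa;
            squaredNormB += wb * wb;
        }
        // Assumed convention: an all-zero vector is similar to nothing.
        if (squaredNormA == 0.0 || squaredNormB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(squaredNormA) * Math.sqrt(squaredNormB));
    }

    public static void main(String[] args) {
        Map<String, Double> docA = new HashMap<>();
        docA.put("apple", 1.0);
        docA.put("banana", 1.0);
        Map<String, Double> docB = new HashMap<>();
        docB.put("apple", 1.0);
        List<String> terms = Arrays.asList("apple", "banana");
        // 1 / (sqrt(2) * 1), roughly 0.7071
        System.out.println(cosineSimilarity(docA, docB, terms));
    }
}
```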

getCosineSimilarityFast

public TCkNNListElement getCosineSimilarityFast(TCDocument documentToClassify,
                                                java.util.ArrayList corpusAllTermsOfInterest)
Calculates the cosine similarity of two document weight vectors, given the vector of all relevant terms in the corpus (corpusAllTermsOfInterest). Can, e.g., be used for the k-NN algorithm.

Parameters:
documentToClassify - The document to which the similarity is calculated
corpusAllTermsOfInterest - The terms of interest in the entire corpus
Returns:
The cosine similarity for the weight-vectors
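
This page does not say how the fast variant differs from getCosineSimilarity. One common shortcut, shown here purely as an assumption, is that vectors which are already cosine-normalized have length 1, so their cosine similarity reduces to a plain dot product. The class and method names below are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; not part of the tc package.
public class NormalizedCosineSketch {

    // Scales a weight vector to unit length (cosine normalization).
    public static Map<String, Double> normalize(Map<String, Double> weights) {
        double sumOfSquares = 0.0;
        for (double w : weights.values()) {
            sumOfSquares += w * w;
        }
        double norm = Math.sqrt(sumOfSquares);
        Map<String, Double> unit = new HashMap<>();
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            unit.put(e.getKey(), norm == 0.0 ? 0.0 : e.getValue() / norm);
        }
        return unit;
    }

    // For unit-length vectors the dot product equals the cosine similarity.
    // Iterating over the smaller map skips terms that cannot contribute.
    public static double dotProduct(Map<String, Double> a, Map<String, Double> b) {
        Map<String, Double> smaller = a.size() <= b.size() ? a : b;
        Map<String, Double> larger = (smaller == a) ? b : a;
        double dot = 0.0;
        for (Map.Entry<String, Double> e : smaller.entrySet()) {
            dot += e.getValue() * larger.getOrDefault(e.getKey(), 0.0);
        }
        return dot;
    }

    public static void main(String[] args) {
        Map<String, Double> docA = new HashMap<>();
        docA.put("x", 3.0);
        docA.put("y", 4.0);
        Map<String, Double> docB = new HashMap<>();
        docB.put("x", 1.0);
        // Normalized: docA = (0.6, 0.8), docB = (1.0, 0.0); similarity is about 0.6
        System.out.println(dotProduct(normalize(docA), normalize(docB)));
    }
}
```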

computeTFIDFValues

public void computeTFIDFValues(java.util.ArrayList corpusAllTermsOfInterest,
                               boolean normalize)
Computes the TFIDF values for this document given the corpus terms of interest and their IDF values

Parameters:
corpusAllTermsOfInterest - all the terms of the entire corpus (with all values) which are of interest
normalize - indicates if the TFIDF value will be normalized (cosine normalization)
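
A minimal sketch of the idea, assuming the simplest weighting (term frequency multiplied by a precomputed IDF value) and L2 scaling for the cosine normalization. The exact TFIDF variant used by TCDocument is not stated on this page, and the class name TfIdfSketch and the Map-based inputs are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; not part of the tc package.
public class TfIdfSketch {

    // weight(term) = tf(term) * idf(term), restricted to the terms of interest.
    // With normalize == true, the weights are divided by the vector's L2 norm.
    public static Map<String, Double> tfIdf(Map<String, Integer> termFrequencies,
                                            Map<String, Double> idfOfInterest,
                                            boolean normalize) {
        Map<String, Double> weights = new HashMap<>();
        double sumOfSquares = 0.0;
        for (Map.Entry<String, Integer> e : termFrequencies.entrySet()) {
            Double idf = idfOfInterest.get(e.getKey());
            if (idf == null) {
                continue; // term is not among the corpus terms of interest
            }
            double weight = e.getValue() * idf;
            weights.put(e.getKey(), weight);
            sumOfSquares += weight * weight;
        }
        if (normalize && sumOfSquares > 0.0) {
            double norm = Math.sqrt(sumOfSquares);
            for (Map.Entry<String, Double> e : weights.entrySet()) {
                e.setValue(e.getValue() / norm);
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        Map<String, Integer> tf = new HashMap<>();
        tf.put("cat", 3);
        tf.put("dog", 2);
        tf.put("bird", 7); // ignored below: no IDF entry
        Map<String, Double> idf = new HashMap<>();
        idf.put("cat", 1.0);
        idf.put("dog", 2.0);
        System.out.println(tfIdf(tf, idf, false)); // cat = 3.0, dog = 4.0
        System.out.println(tfIdf(tf, idf, true));  // cat is about 0.6, dog about 0.8
    }
}
```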

loadData

public void loadData(java.lang.String inputString,
                     boolean isPath,
                     boolean removeStopWords,
                     java.lang.String stop_word_string,
                     int stemmerMethod)
              throws java.io.IOException
Loads the document data, creates the bag of terms for this document and saves the number of occurrences for each term

Parameters:
inputString - Contains either a path to a raw document or the document itself
isPath - indicates if inputString is a path to a single document or the document content as a string
removeStopWords - If true, the swl.txt file is used to ignore certain words
Throws:
java.io.IOException
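
To make the bag-of-terms step concrete, here is a small sketch that counts term occurrences in a string. The length filter (tokens must be longer than one character) follows the description of numberOfProcessedWords. Everything else is an assumption for illustration: the class name BagOfTermsSketch, lower-casing, splitting on non-alphanumeric characters, and passing stop words as a Set. File loading and stemming are left out.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-alone sketch; not part of the tc package.
public class BagOfTermsSketch {

    // Builds a term -> occurrence count map from raw text.
    public static Map<String, Integer> bagOfTerms(String text, Set<String> stopWords) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        // Assumed tokenization: lower-case, split on anything that is not a-z or 0-9.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.length() <= 1) {
                continue; // keep only word tokens with length > 1
            }
            if (stopWords.contains(token)) {
                continue; // stop-word removal
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Set<String> stopWords = Collections.singleton("the");
        // prints {cat=2, saw=1}
        System.out.println(bagOfTerms("The cat saw a cat.", stopWords));
    }
}
```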