java.lang.Object
  tc.TCDocument
This class represents a single document in the TC approach. It contains methods for preprocessing (tokenizing, filtering the most used words, ...), for creating the bag of words, and for calculating the TFIDF values, the similarity to another vector, etc.
| Field Summary | |
private java.util.ArrayList |
allTerms
all the distinct terms in the document |
(package private) java.util.ArrayList |
categoryLabels
an arraylist with all category labels for this document |
private int |
numberOfProcessedWords
the number of processed words for the document (all word tokens with length > 1) |
private int |
numberOfUniqueTerms
the number of unique (distinct) terms in the document |
(package private) int |
testSetIndex
|
| Constructor Summary | |
TCDocument(java.util.ArrayList cl)
Constructor of this Class |
|
| Method Summary | |
void |
computeTFIDFValues(java.util.ArrayList corpusAllTermsOfInterest,
boolean normalize)
Computes the TFIDF values for this document given the corpus terms of interest and their IDF values |
java.util.ArrayList |
getAllTerms()
returns the ArrayList with all unique terms |
java.lang.String |
getCategoryLabel(int i)
returns the i-th category label for the document |
java.lang.String |
getCategoryLabelAsString(int index)
|
java.lang.String |
getCategoryLabelsAsString()
returns the category labels for the document as a formatted string |
TCkNNListElement |
getCosineSimilarity(TCDocument documentToClassify,
java.util.ArrayList corpusAllTermsOfInterest)
(ONLY FOR UNNORMALIZED DOCUMENT VECTORS!!) The function calculates the cosine similarity of two Document-Weight-Vectors given the vector of all relevant terms in the corpus (corpusAllTermsOfInterest) Can e.g. |
TCkNNListElement |
getCosineSimilarityFast(TCDocument documentToClassify,
java.util.ArrayList corpusAllTermsOfInterest)
The function calculates the cosine similarity of two Document-Weight-Vectors given the vector of all relevant terms in the corpus (corpusAllTermsOfInterest) Can e.g. |
TCkNNListElement |
getEuclidianDistance(TCDocument documentToClassify,
java.util.ArrayList corpusAllTermsOfInterest)
The function calculates the Euclidean distance between two Document-Weight-Vectors given the vector of all relevant terms in the corpus (corpusAllTermsOfInterest) Can e.g. |
int |
getNumberOfLabels()
returns the number of categories the document is assigned to |
int |
getNumberOfProcessedWords()
the number of processed words in the document |
int |
getNumberOfUniqueTerms()
returns the number of unique terms in the document |
java.lang.String |
getTerm(int i)
returns the term with index i |
int |
getTermFrequency(int i)
Returns the term frequency of the i-th term in the ArrayList (number of times the term occurs in the document) |
int |
getTestSetIndex()
|
boolean |
isLabeled()
returns true if the document is assigned to at least one category |
boolean |
isLabeledAs(java.lang.String label)
checks if the document is labeled with "label" |
void |
loadData(java.lang.String inputString,
boolean isPath,
boolean removeStopWords,
java.lang.String stop_word_string,
int stemmerMethod)
Loads the document data, creates the bag of terms for this document and saves the number of occurrences for each term |
void |
removeTerm(TCTerm term)
Removes the term from this document (forever!) |
void |
setTermFrequency(int i,
int occ)
sets the term frequency of a term in the document (used in the loadData function) |
void |
setTestSetIndex(int index)
|
| Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
int testSetIndex
java.util.ArrayList categoryLabels
private java.util.ArrayList allTerms
private int numberOfUniqueTerms
private int numberOfProcessedWords
| Constructor Detail |
public TCDocument(java.util.ArrayList cl)
cl - the ArrayList containing all the category labels for the document
(list is empty if it doesn't belong to any category)
| Method Detail |
public boolean isLabeled()
public void setTestSetIndex(int index)
public int getTestSetIndex()
public int getNumberOfLabels()
public java.lang.String getCategoryLabelsAsString()
public java.lang.String getCategoryLabelAsString(int index)
public java.lang.String getCategoryLabel(int i)
i - the index for the position in the ArrayList
public int getNumberOfUniqueTerms()
public java.util.ArrayList getAllTerms()
public int getNumberOfProcessedWords()
public boolean isLabeledAs(java.lang.String label)
label - the label to check for
public int getTermFrequency(int i)
i - index of term in the term-vector
public void setTermFrequency(int i,
int occ)
i - the index of the term in the document
occ - the new number of occurrences for this term
public java.lang.String getTerm(int i)
i - the index of the term within the document
public void removeTerm(TCTerm term)
term - the term to be removed
public TCkNNListElement getEuclidianDistance(TCDocument documentToClassify,
java.util.ArrayList corpusAllTermsOfInterest)
documentToClassify - The document to which the distance is calculated
corpusAllTermsOfInterest - The terms of interest in the entire corpus
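As a rough illustration of the measure named above, the following sketch computes the Euclidean distance between two weight vectors that are indexed by the same list of corpus terms. It does not use TCDocument or TCkNNListElement: the class name `EuclideanSketch`, the method `distance`, and the use of plain `double[]` arrays in place of document weight vectors are all assumptions made for the example.

```java
final class EuclideanSketch {
    // Hypothetical helper: both arrays hold one weight per corpus term of
    // interest, in the same term order.
    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff; // accumulate squared differences
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] doc = {3.0, 0.0, 4.0};
        double[] other = {0.0, 0.0, 0.0};
        System.out.println(distance(doc, other)); // prints 5.0
    }
}
```

A smaller value means the two documents are closer, so a k-nearest-neighbour ranking based on it would sort ascending.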
public TCkNNListElement getCosineSimilarity(TCDocument documentToClassify,
java.util.ArrayList corpusAllTermsOfInterest)
documentToClassify - The document to which the similarity is calculated
corpusAllTermsOfInterest - The terms of interest in the entire corpus
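For orientation, cosine similarity on unnormalized vectors divides the dot product by the product of the two vector lengths. The sketch below shows that calculation on plain arrays. `CosineSketch` and `cosine` are hypothetical names, and returning 0.0 for a zero-length vector is a convention chosen for the example, not something the documentation above specifies.

```java
final class CosineSketch {
    // Hypothetical helper: a and b hold raw (unnormalized) term weights in the
    // same corpus term order.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double squaredA = 0.0;
        double squaredB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            squaredA += a[i] * a[i];
            squaredB += b[i] * b[i];
        }
        if (squaredA == 0.0 || squaredB == 0.0) {
            return 0.0; // assumption: an empty vector is similar to nothing
        }
        return dot / (Math.sqrt(squaredA) * Math.sqrt(squaredB));
    }

    public static void main(String[] args) {
        // Orthogonal vectors share no terms, so the similarity is zero.
        System.out.println(cosine(new double[]{1.0, 0.0}, new double[]{0.0, 1.0})); // prints 0.0
    }
}
```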
public TCkNNListElement getCosineSimilarityFast(TCDocument documentToClassify,
java.util.ArrayList corpusAllTermsOfInterest)
documentToClassify - The document to which the similarity is calculated
corpusAllTermsOfInterest - The terms of interest in the entire corpus
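The documentation does not say how the fast variant differs from getCosineSimilarity. One plausible reading, and only an assumption here, is that it relies on vectors that are already cosine-normalized, in which case the similarity reduces to a plain dot product. The sketch below illustrates that reduction; `NormalizedCosineSketch`, `normalize`, and `dot` are hypothetical names.

```java
final class NormalizedCosineSketch {
    // Scales a vector to unit length; a zero vector is returned unchanged (all zeros).
    static double[] normalize(double[] v) {
        double squared = 0.0;
        for (double x : v) {
            squared += x * x;
        }
        double[] out = new double[v.length];
        if (squared == 0.0) {
            return out;
        }
        double length = Math.sqrt(squared);
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] / length;
        }
        return out;
    }

    // For unit-length inputs this dot product equals the cosine similarity.
    static double dot(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] a = normalize(new double[]{3.0, 4.0});
        double[] b = normalize(new double[]{4.0, 3.0});
        System.out.println(dot(a, b)); // approximately 0.96
    }
}
```

Normalizing once per document and reusing the result avoids recomputing both vector lengths for every pair, which is where the saving would come from under this assumption.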
public void computeTFIDFValues(java.util.ArrayList corpusAllTermsOfInterest,
boolean normalize)
corpusAllTermsOfInterest - all the terms of the entire corpus (with all values) which are of interest
normalize - indicates if the TFIDF value will be normalized (cosine normalization)
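The exact weighting formula used by this method is not given here. The sketch below assumes the common form, weight = term frequency × IDF, followed by optional cosine normalization (dividing every weight by the vector length). `TfidfSketch` and `weights` are hypothetical names, and the IDF values are passed in as a plain array rather than read from the corpus term list.

```java
import java.util.Arrays;

final class TfidfSketch {
    // termFrequencies[i] and idf[i] refer to the same corpus term of interest.
    static double[] weights(int[] termFrequencies, double[] idf, boolean normalize) {
        if (termFrequencies.length != idf.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        double[] w = new double[idf.length];
        double squaredLength = 0.0;
        for (int i = 0; i < w.length; i++) {
            w[i] = termFrequencies[i] * idf[i]; // assumed weighting: tf * idf
            squaredLength += w[i] * w[i];
        }
        if (normalize && squaredLength > 0.0) {
            double length = Math.sqrt(squaredLength);
            for (int i = 0; i < w.length; i++) {
                w[i] /= length; // cosine normalization: scale to unit length
            }
        }
        return w;
    }

    public static void main(String[] args) {
        int[] tf = {2, 0, 1};
        double[] idf = {0.5, 1.0, 2.0};
        System.out.println(Arrays.toString(weights(tf, idf, false))); // prints [1.0, 0.0, 2.0]
    }
}
```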
public void loadData(java.lang.String inputString,
boolean isPath,
boolean removeStopWords,
java.lang.String stop_word_string,
int stemmerMethod)
throws java.io.IOException
inputString - Contains either a path to a raw document or the document itself
isPath - indicates if inputString is a path to a single document or the document content as a string
removeStopWords - If true, the swl.txt file is used to filter out stop words
java.io.IOException
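To make the bag-of-terms idea concrete, the sketch below counts term occurrences in a string. It is not the loadData implementation. Lowercasing, splitting on non-letter characters, and passing the stop words as a `Set` instead of reading swl.txt are assumptions, stemming is left out entirely, and `BagOfTermsSketch` and `count` are hypothetical names. The only rule taken from the documentation above is that tokens must be longer than one character.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class BagOfTermsSketch {
    // Returns each distinct term with its number of occurrences, in first-seen order.
    static Map<String, Integer> count(String text, boolean removeStopWords, Set<String> stopWords) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        // Assumed tokenization: lowercase, then split on anything that is not a-z.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.length() <= 1) {
                continue; // only word tokens with length > 1 are processed
            }
            if (removeStopWords && stopWords.contains(token)) {
                continue;
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the"));
        System.out.println(count("The cat saw a cat.", true, stopWords)); // prints {cat=2, saw=1}
    }
}
```

In this sketch the map size corresponds to the number of unique terms, and the sum of its values to the number of processed words that survived filtering.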