java.lang.Object
org.xml.sax.helpers.DefaultHandler
tc.TCCorpusLoader
tc.TCCorpusLoaderReuters
Loads the Reuters corpus (the ModApte split in the current configuration) using a SAX parser. See the source comments for more details.
| Field Summary | |
(package private) int |
corpusConfiguration
the configuration of the corpus, e.g., a subset, the split etc. |
(package private) boolean |
in_TOPICS_D
if true, the next character string will be a label (the parser is within a TOPICS and a D tag) |
(package private) boolean |
inBODY
the next character string will be the content of the document |
(package private) boolean |
inREUTERS
if true, the current document will be added |
(package private) boolean |
inTOPICS
the parser is currently within a TOPICS tag |
(package private) boolean |
inTRAINDOCUMENT
indicates whether the currently processed document is a training document or a test document |
(package private) java.lang.String |
text
a string temporarily holding either a label or the content of the document, depending on the element being parsed |
(package private) static java.util.ArrayList |
topTenCategoryLabels
|
| Fields inherited from class tc.TCCorpusLoader |
corpus, labels, numberOfTestDocumentsWithLabel, stemmerMethod, stopWordsString, testSet, useStopWordRemoval |
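The boolean fields above describe a flag-based SAX handler: each flag records which element the parser is currently inside, and the character data is routed accordingly. The following is a minimal sketch of that pattern, not the source of tc.TCCorpusLoaderReuters. The class name ReutersFlagSketch, the summaries list, the wrapping CORPUS root element, and the use of the LEWISSPLIT attribute to set inTRAINDOCUMENT are assumptions made for illustration. The sketch also accumulates text in a StringBuilder rather than a String, because a SAX parser may deliver one run of text in several characters() calls.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of a flag-based handler; field names follow the Field Summary.
class ReutersFlagSketch extends DefaultHandler {
    boolean inREUTERS, inTOPICS, in_TOPICS_D, inBODY, inTRAINDOCUMENT;
    StringBuilder text = new StringBuilder();

    // Hypothetical result holders (the real class stores documents in a TCCorpus).
    List<String> currentLabels = new ArrayList<>();
    String currentBody = "";
    List<String> summaries = new ArrayList<>();

    @Override
    public void startElement(String namespaceURI, String localName,
                             String qName, Attributes atts) {
        if (qName.equals("REUTERS")) {
            inREUTERS = true;
            // Assumption: the split is read from the LEWISSPLIT attribute.
            inTRAINDOCUMENT = "TRAIN".equals(atts.getValue("LEWISSPLIT"));
            currentLabels = new ArrayList<>();
            currentBody = "";
        } else if (qName.equals("TOPICS")) {
            inTOPICS = true;
        } else if (qName.equals("D") && inTOPICS) {
            in_TOPICS_D = true;
            text.setLength(0); // start collecting a label
        } else if (qName.equals("BODY")) {
            inBODY = true;
            text.setLength(0); // start collecting the document content
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only text inside <TOPICS><D> or <BODY> is of interest.
        if (in_TOPICS_D || inBODY) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String namespaceURI, String localName, String qName) {
        if (qName.equals("D") && in_TOPICS_D) {
            currentLabels.add(text.toString());
            in_TOPICS_D = false;
        } else if (qName.equals("TOPICS")) {
            inTOPICS = false;
        } else if (qName.equals("BODY")) {
            currentBody = text.toString().trim();
            inBODY = false;
        } else if (qName.equals("REUTERS")) {
            // The document is complete: record split, labels, and content.
            summaries.add((inTRAINDOCUMENT ? "train" : "test")
                    + " " + currentLabels + " " + currentBody);
            inREUTERS = false;
        }
    }

    // Parses a well-formed XML string and returns one summary line per document.
    static List<String> parse(String xml) {
        try {
            ReutersFlagSketch handler = new ReutersFlagSketch();
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new ByteArrayInputStream(
                    xml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.summaries;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<CORPUS><REUTERS LEWISSPLIT=\"TRAIN\">"
                + "<TOPICS><D>grain</D><D>wheat</D></TOPICS>"
                + "<BODY>Wheat prices rose.</BODY></REUTERS></CORPUS>";
        parse(xml).forEach(System.out::println);
    }
}
```

Resetting the buffer only when a D or BODY element opens keeps whitespace between elements out of the collected text, which is one reason a handler of this kind tracks several flags instead of a single "inside document" state.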
| Constructor Summary | |
TCCorpusLoaderReuters(int configuration,
boolean useSWR,
int stemMethod,
java.lang.String sws)
initialize the loader |
|
| Method Summary | |
void |
characters(char[] ch,
int start,
int length)
|
void |
endDocument()
|
void |
endElement(java.lang.String namespaceURI,
java.lang.String localName,
java.lang.String qName)
|
java.util.ArrayList |
getTestSet()
|
TCCorpus |
loadCorpus()
|
void |
readXML(java.io.InputStream inStream)
Read XML from input stream and parse, generating SAX events |
void |
startDocument()
|
void |
startElement(java.lang.String namespaceURI,
java.lang.String localName,
java.lang.String qName,
org.xml.sax.Attributes atts)
|
| Methods inherited from class tc.TCCorpusLoader |
getNumberOfTestDocumentsWithLabel |
| Methods inherited from class org.xml.sax.helpers.DefaultHandler |
endPrefixMapping, error, fatalError, ignorableWhitespace, notationDecl, processingInstruction, resolveEntity, setDocumentLocator, skippedEntity, startPrefixMapping, unparsedEntityDecl, warning |
| Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
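The summary of readXML says that it parses an input stream and generates SAX events, which the class then receives through its own startDocument, startElement, endElement, characters, and endDocument callbacks. One plausible way to wire this up is sketched below, assuming the standard JAXP SAXParserFactory and a handler that passes itself to the parser. The class ReadXmlSketch, its documentsSeen counter, the countDocuments helper, and the CORPUS root element are hypothetical and only serve to show the flow of events; the input is assumed to be well-formed XML.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: a handler that feeds a stream to a SAX parser and
// receives the resulting events itself.
class ReadXmlSketch extends DefaultHandler {
    int documentsSeen;   // hypothetical counter, not a field of the real class
    boolean finished;    // set once endDocument() has been called

    // Same signature as the documented readXML; the body is an assumption.
    public void readXML(InputStream inStream) {
        try {
            // "this" is registered as the handler, so the callbacks below fire.
            SAXParserFactory.newInstance().newSAXParser().parse(inStream, this);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse corpus stream", e);
        }
    }

    @Override
    public void startDocument() {
        documentsSeen = 0;
        finished = false;
    }

    @Override
    public void startElement(String namespaceURI, String localName,
                             String qName, Attributes atts) {
        if (qName.equals("REUTERS")) {
            documentsSeen++;
        }
    }

    @Override
    public void endDocument() {
        finished = true;
    }

    // Convenience wrapper for in-memory input; returns -1 if parsing did not finish.
    static int countDocuments(String xml) {
        ReadXmlSketch loader = new ReadXmlSketch();
        loader.readXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return loader.finished ? loader.documentsSeen : -1;
    }

    public static void main(String[] args) {
        System.out.println(countDocuments("<CORPUS><REUTERS/><REUTERS/></CORPUS>"));
    }
}
```

Wrapping the checked parser exceptions inside readXML matches the documented signature, which declares no throws clause, although how the real class reports parse failures is not stated here.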
| Field Detail |
static java.util.ArrayList topTenCategoryLabels
boolean inTRAINDOCUMENT
boolean inREUTERS
boolean in_TOPICS_D
boolean inTOPICS
boolean inBODY
java.lang.String text
int corpusConfiguration
| Constructor Detail |
public TCCorpusLoaderReuters(int configuration,
boolean useSWR,
int stemMethod,
java.lang.String sws)
| Method Detail |
public void readXML(java.io.InputStream inStream)
public void startDocument()
throws org.xml.sax.SAXException
public void endDocument()
throws org.xml.sax.SAXException
public void startElement(java.lang.String namespaceURI,
java.lang.String localName,
java.lang.String qName,
org.xml.sax.Attributes atts)
throws org.xml.sax.SAXException
public void endElement(java.lang.String namespaceURI,
java.lang.String localName,
java.lang.String qName)
throws org.xml.sax.SAXException
public void characters(char[] ch,
int start,
int length)
throws org.xml.sax.SAXException
public java.util.ArrayList getTestSet()
getTestSet in class TCCorpusLoader
public TCCorpus loadCorpus()
loadCorpus in class TCCorpusLoader