tc
Class TCCorpusLoaderReuters

java.lang.Object
  extended by org.xml.sax.helpers.DefaultHandler
      extended by tc.TCCorpusLoader
          extended by tc.TCCorpusLoaderReuters
All Implemented Interfaces:
org.xml.sax.ContentHandler, org.xml.sax.DTDHandler, org.xml.sax.EntityResolver, org.xml.sax.ErrorHandler

public class TCCorpusLoaderReuters
extends TCCorpusLoader

Loads the Reuters-21578 corpus (ModApté split in the current configuration) using a SAX parser (see the source comments for more details).


Field Summary
(package private)  int corpusConfiguration
          the configuration of the corpus, e.g., which subset or split to load
(package private)  boolean in_TOPICS_D
          if true, the next character string will be a label (the parser is within a D tag inside a TOPICS tag)
(package private)  boolean inBODY
          if true, the next character string will be the content of the document
(package private)  boolean inREUTERS
          if true, the current document will be added
(package private)  boolean inTOPICS
          the parser is currently within a TOPICS tag
(package private)  boolean inTRAINDOCUMENT
          indicates whether the currently processed document is a training document or a test document
(package private)  java.lang.String text
          a string temporarily holding either a label or the content of the document, depending on the parser state
(package private) static java.util.ArrayList topTenCategoryLabels
           
 
Fields inherited from class tc.TCCorpusLoader
corpus, labels, numberOfTestDocumentsWithLabel, stemmerMethod, stopWordsString, testSet, useStopWordRemoval
 
Constructor Summary
TCCorpusLoaderReuters(int configuration, boolean useSWR, int stemMethod, java.lang.String sws)
          initializes the loader
 
Method Summary
 void characters(char[] ch, int start, int length)
           
 void endDocument()
           
 void endElement(java.lang.String namespaceURI, java.lang.String localName, java.lang.String qName)
           
 java.util.ArrayList getTestSet()
           
 TCCorpus loadCorpus()
           
 void readXML(java.io.InputStream inStream)
          Reads XML from the input stream and parses it, generating SAX events
 void startDocument()
           
 void startElement(java.lang.String namespaceURI, java.lang.String localName, java.lang.String qName, org.xml.sax.Attributes atts)
           
 
Methods inherited from class tc.TCCorpusLoader
getNumberOfTestDocumentsWithLabel
 
Methods inherited from class org.xml.sax.helpers.DefaultHandler
endPrefixMapping, error, fatalError, ignorableWhitespace, notationDecl, processingInstruction, resolveEntity, setDocumentLocator, skippedEntity, startPrefixMapping, unparsedEntityDecl, warning
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

topTenCategoryLabels

static java.util.ArrayList topTenCategoryLabels

inTRAINDOCUMENT

boolean inTRAINDOCUMENT
indicates whether the currently processed document is a training document or a test document


inREUTERS

boolean inREUTERS
if true, the current document will be added


in_TOPICS_D

boolean in_TOPICS_D
if true, the next character string will be a label (the parser is within a D tag inside a TOPICS tag)


inTOPICS

boolean inTOPICS
the parser is currently within a TOPICS tag


inBODY

boolean inBODY
if true, the next character string will be the content of the document


text

java.lang.String text
a string temporarily holding either a label or the content of the document, depending on the parser state


corpusConfiguration

int corpusConfiguration
the configuration of the corpus, e.g., which subset or split to load. Current values:
0: Reuters-21578 ModApté split
1: Reuters-21578 ModApté split (subset)
2: Reuters-21578 ModApté [10]

Constructor Detail

TCCorpusLoaderReuters

public TCCorpusLoaderReuters(int configuration,
                             boolean useSWR,
                             int stemMethod,
                             java.lang.String sws)
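initializes the loader with the given corpus configuration (see corpusConfiguration). The parameters useSWR, stemMethod, and sws presumably correspond to the inherited fields useStopWordRemoval, stemmerMethod, and stopWordsString.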

Method Detail

readXML

public void readXML(java.io.InputStream inStream)
Reads XML from the input stream and parses it, generating SAX events
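
The source of readXML is not shown on this page, so the following is only a minimal sketch of how a DefaultHandler subclass can feed an input stream to the JDK's SAX parser and receive events on itself. The class name ReadXmlSketch, the elementCount field, and the countElements helper are hypothetical and exist only for illustration; the sketch wraps checked exceptions in a RuntimeException, which may differ from what the real method does.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for a loader class; not the actual tc implementation.
class ReadXmlSketch extends DefaultHandler {
    // Counts startElement events so the effect of parsing is observable.
    int elementCount = 0;

    @Override
    public void startElement(String namespaceURI, String localName,
                             String qName, Attributes atts) {
        elementCount++;
    }

    // Parses the stream and delivers all SAX events to this handler.
    public void readXML(InputStream inStream) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser parser = factory.newSAXParser();
            parser.parse(inStream, this);
        } catch (Exception e) {
            // Assumption: failures are rethrown unchecked in this sketch.
            throw new RuntimeException(e);
        }
    }

    // Illustration helper: parse an in-memory document and report the count.
    static int countElements(String xml) {
        ReadXmlSketch handler = new ReadXmlSketch();
        handler.readXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return handler.elementCount;
    }
}
```

Passing the handler itself to the parser is what lets the startDocument, startElement, characters, endElement, and endDocument callbacks listed below act on the loader's own fields.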


startDocument

public void startDocument()
                   throws org.xml.sax.SAXException
Throws:
org.xml.sax.SAXException

endDocument

public void endDocument()
                 throws org.xml.sax.SAXException
Throws:
org.xml.sax.SAXException

startElement

public void startElement(java.lang.String namespaceURI,
                         java.lang.String localName,
                         java.lang.String qName,
                         org.xml.sax.Attributes atts)
                  throws org.xml.sax.SAXException
Throws:
org.xml.sax.SAXException

endElement

public void endElement(java.lang.String namespaceURI,
                       java.lang.String localName,
                       java.lang.String qName)
                throws org.xml.sax.SAXException
Throws:
org.xml.sax.SAXException

characters

public void characters(char[] ch,
                       int start,
                       int length)
                throws org.xml.sax.SAXException
Throws:
org.xml.sax.SAXException
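
The three callbacks startElement, endElement, and characters carry no descriptions here, so the sketch below only illustrates the general pattern the field descriptions suggest: state flags are set when a tag opens, text is accumulated while a flag is set, and the accumulated text is consumed when the tag closes. It is not the actual tc code. The class FlagHandlerSketch, its labels and body fields, the parse helper, and the sample input are assumptions made for illustration; only the flag names and the TOPICS, D, and BODY tag names come from this page.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler illustrating flag-based SAX state tracking.
class FlagHandlerSketch extends DefaultHandler {
    boolean inTOPICS;     // parser is within a TOPICS tag
    boolean in_TOPICS_D;  // parser is within a D tag inside a TOPICS tag
    boolean inBODY;       // parser is within a BODY tag
    // A buffer rather than a String, because a SAX parser may deliver one
    // text node in several characters() calls.
    StringBuilder text = new StringBuilder();
    List<String> labels = new ArrayList<>();
    String body = "";

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("TOPICS")) {
            inTOPICS = true;
        } else if (qName.equals("D") && inTOPICS) {
            in_TOPICS_D = true;
            text.setLength(0); // start collecting a new label
        } else if (qName.equals("BODY")) {
            inBODY = true;
            text.setLength(0); // start collecting the document content
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only the given slice of ch is valid; ignore text outside tracked tags.
        if (in_TOPICS_D || inBODY) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("TOPICS")) {
            inTOPICS = false;
        } else if (qName.equals("D") && in_TOPICS_D) {
            labels.add(text.toString());
            in_TOPICS_D = false;
        } else if (qName.equals("BODY")) {
            body = text.toString();
            inBODY = false;
        }
    }

    // Illustration helper: parse an in-memory document with this handler.
    static FlagHandlerSketch parse(String xml) {
        try {
            FlagHandlerSketch handler = new FlagHandlerSketch();
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Because D tags can also occur outside TOPICS, the sketch checks inTOPICS before treating a D tag's text as a label, which matches the description of in_TOPICS_D.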

getTestSet

public java.util.ArrayList getTestSet()
Overrides:
getTestSet in class TCCorpusLoader

loadCorpus

public TCCorpus loadCorpus()
Overrides:
loadCorpus in class TCCorpusLoader