XmlFragmentWhitespacingDocumentReader (illinois-cogcomp-nlp 3.1.29 API)

java.lang.Object
- edu.illinois.cs.cogcomp.nlp.corpusreaders.AnnotationReader<T>
  - edu.illinois.cs.cogcomp.nlp.corpusreaders.AbstractIncrementalCorpusReader<TextAnnotation>
    - edu.illinois.cs.cogcomp.nlp.corpusreaders.XmlFragmentWhitespacingDocumentReader

All Implemented Interfaces:

IResetableIterator<TextAnnotation>, Iterable<TextAnnotation>, Iterator<TextAnnotation>
```
public class XmlFragmentWhitespacingDocumentReader
extends AbstractIncrementalCorpusReader<TextAnnotation>
```
Generates one TextAnnotation object per file for a corpus whose files contain XML fragments. Created for the DEFT ERE collection for the Belief and Sentiment task (LDC2016E27). All documents appear to be forum data that is XML-like rather than full XML.

The LDC README in LDC2016E27 indicates that these documents are XML fragments, not full XML. They should therefore be treated as raw text, even though they contain XML-escaped character forms: character offsets for standoff annotation refer to these expanded (escaped) forms. The structure consists of one or more "post" elements, each possibly containing one or more "quote" elements (which may be nested), and possibly other tags (image files and other URL-like content, possibly HTML formatting).

This reader (initial implementation) tries to clean up the text as much as possible while preserving the character offsets of the original text. It does so by whitespacing the XML and other tags; the Illinois Tokenizer should be able to handle this in an offset-preserving way. The TextAnnotations are returned with their TextID fields set to the name of the source file.

WARNING! No effort is made to represent the inter-post/quoted segment structure.

When trying to align annotations to the original file, beware the following annotation property, explained in the README from the corpus: Because each CMP document is extracted verbatim from source XML files, certain characters in its content (ampersands, angle brackets, etc.) are escaped according to the XML specification. The offsets of text extents are based on treating this escaped text as-is (e.g. "&amp;" in a cmp.txt file is counted as five characters). Whenever any such string of "raw" text is included in a .rich_ere.xml file (as the text extent to which an annotation is applied), a second level of escaping has been applied, so that XML parsing of the ERE XML file will produce a string that exactly matches the source text.
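The two levels of escaping described in the README can be illustrated with a small, self-contained sketch. The class and method names below are hypothetical and are not part of illinois-cogcomp-nlp; the sketch only handles `&`, `<` and `>`, and assumes the source text already contains a singly escaped ampersand.

```java
final class EscapedOffsetSketch {

    /** Escapes '&', '<' and '>'. The ampersand must be replaced first. */
    static String xmlEscape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    /** Reverses xmlEscape. The ampersand entity must be replaced last. */
    static String xmlUnescape(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        // Text as it appears in the source file: the entity stays escaped
        // and every one of its characters counts towards the offsets.
        String sourceText = "AT&amp;T";
        System.out.println("source length: " + sourceText.length());

        // Second level of escaping, as applied when the extent is stored
        // inside an XML annotation file.
        String storedInXml = xmlEscape(sourceText);
        System.out.println("stored form:   " + storedInXml);

        // One round of XML parsing (unescaping) recovers the source text.
        System.out.println("round trip ok: " + xmlUnescape(storedInXml).equals(sourceText));
    }
}
```

The point of the sketch is that offsets should be computed against `sourceText`, not against a fully unescaped string.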

Field Summary

Fields
Modifier and Type Field and Description

protected String fileId

protected String newFileText

protected TextAnnotationBuilder taBuilder
- Fields inherited from class edu.illinois.cs.cogcomp.nlp.corpusreaders.AbstractIncrementalCorpusReader
  fileList, sourceDirectory
- Fields inherited from class edu.illinois.cs.cogcomp.nlp.corpusreaders.AnnotationReader
  corpusName, currentAnnotationId, resourceManager

Constructor Summary

Constructors
Constructor and Description
`XmlFragmentWhitespacingDocumentReader(String corpusName, String sourceDirectory, String sourceFileExtension, String annotationFileExtension)` Assumes that all files come from a single source directory and that the directory contains no extraneous files.

Method Summary

All Methods Instance Methods Concrete Methods
Modifier and Type	Method and Description
`String`	`generateReport()` Generates a human-readable report of the annotations read from the source file (plus whatever other relevant statistics the user should know about).
`List<TextAnnotation>`	`getAnnotationsFromFile(List<Path> corpusFileListEntry)` Given an entry from the corpus file list generated by `getFileListing()`, parses its contents and returns zero or more TextAnnotation objects.
`List<List<Path>>`	`getFileListing()` Generates a list of the files comprising the corpus.
`protected String`	`getRequiredFileExtension()` Excludes any files not possessing this extension.
`void`	`reset()` Override this to conform to whatever the derived class's state mechanism requires.
`protected String`	`stripText(String original)` This method can be overridden to do more complex parsing.

Methods inherited from class edu.illinois.cs.cogcomp.nlp.corpusreaders.AbstractIncrementalCorpusReader
getSourceDirectory, hasNext, initializeReader, next

Methods inherited from class edu.illinois.cs.cogcomp.nlp.corpusreaders.AnnotationReader
iterator, remove

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface java.lang.Iterable
forEach, spliterator

Methods inherited from interface java.util.Iterator
forEachRemaining

- Field Detail
  - taBuilder
```
protected TextAnnotationBuilder taBuilder
```
  - fileId
```
protected String fileId
```
  - newFileText
```
protected String newFileText
```
- Constructor Detail
  - XmlFragmentWhitespacingDocumentReader
```
public XmlFragmentWhitespacingDocumentReader(String corpusName,
                                             String sourceDirectory,
                                             String sourceFileExtension,
                                             String annotationFileExtension)
                                      throws Exception
```
    Assumes that all files come from a single source directory and that the directory contains no extraneous files.
    
    Parameters:
    
    corpusName -
    
    sourceDirectory -
    
    Throws:
    
    IOException
    
    Exception
- Method Detail
  - reset
```
public void reset()
```
    Description copied from class: AnnotationReader
    
    Override this to conform to whatever the derived class's state mechanism requires.
    
    Specified by:
    
    reset in interface IResetableIterator<TextAnnotation>
    
    Overrides:
    
    reset in class AbstractIncrementalCorpusReader<TextAnnotation>
  - getRequiredFileExtension
```
protected String getRequiredFileExtension()
```
    Exclude any files not possessing this extension. TODO: make this configurable
    
    Returns:
    
    the required file extension.
  - getFileListing
```
public List<List<Path>> getFileListing()
                                throws IOException
```
    Generates a list of the files comprising the corpus. Each entry is expected to yield one or more TextAnnotation objects, though the way the iterator is implemented also allows a corpus file to yield zero TextAnnotations.
    
    Specified by:
    
    getFileListing in class AbstractIncrementalCorpusReader<TextAnnotation>
    
    Returns:
    
    a list of Path objects corresponding to files containing corpus documents to process.
    
    Throws:
    
    IOException
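
    A listing of this shape could be built roughly as follows. This is a minimal sketch using only java.nio, not the library's implementation: the class name, the `listCorpusFiles` helper, the one-file-per-entry grouping, and the `.cmp.txt` extension used in `main` are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class FileListingSketch {

    /**
     * Lists the regular files in sourceDirectory whose names end with
     * requiredExtension, wrapping each one in its own single-element list
     * (assumption: one source file per corpus entry).
     */
    static List<List<Path>> listCorpusFiles(Path sourceDirectory, String requiredExtension)
            throws IOException {
        List<Path> matching;
        try (Stream<Path> entries = Files.list(sourceDirectory)) {
            matching = entries
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(requiredExtension))
                    .sorted() // deterministic order across runs
                    .collect(Collectors.toList());
        }
        List<List<Path>> listing = new ArrayList<>();
        for (Path p : matching) {
            listing.add(Collections.singletonList(p));
        }
        return listing;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("corpus");
        Files.write(dir.resolve("doc1.cmp.txt"), "<post>hello</post>".getBytes());
        Files.write(dir.resolve("notes.md"), "not a corpus file".getBytes());

        // Only the file with the required extension is listed.
        for (List<Path> entry : listCorpusFiles(dir, ".cmp.txt")) {
            System.out.println(entry);
        }
    }
}
```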
  - stripText
```
protected String stripText(String original)
```
    This method can be overridden to do a more complex parsing.
    
    Parameters:
    
    original -
    
    Returns:
    
    the stripped text.
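
    The class description says that tags are whitespaced so that character offsets are preserved. A minimal sketch of that idea follows; it is not the library's actual `stripText` implementation. The class name, the `whitespaceTags` helper, and the simple definition of a tag (everything from `<` to the next `>`) are assumptions.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagWhitespacingSketch {

    // Assumption: a tag is any run from '<' up to and including the next '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /**
     * Replaces every character of every tag with a space, so the result has
     * the same length as the input and untouched text keeps its offsets.
     * Escaped entities such as "&amp;" are left exactly as they are.
     */
    static String whitespaceTags(String original) {
        char[] chars = original.toCharArray();
        Matcher m = TAG.matcher(original);
        while (m.find()) {
            Arrays.fill(chars, m.start(), m.end(), ' ');
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        String original = "<post author=\"a\">Tom &amp; Ann<quote>hi</quote></post>";
        String cleaned = whitespaceTags(original);

        System.out.println("[" + cleaned + "]");
        System.out.println("same length:  " + (cleaned.length() == original.length()));
        System.out.println("same offsets: " + (cleaned.indexOf("Tom") == original.indexOf("Tom")));
    }
}
```

    Because the output has the same length as the input, a span found in the cleaned text can be used directly as an offset into the original file.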
  - getAnnotationsFromFile
```
public List<TextAnnotation> getAnnotationsFromFile(List<Path> corpusFileListEntry)
                                            throws Exception
```
    Given an entry from the corpus file list generated by getFileListing(), parses its contents and returns zero or more TextAnnotation objects.
    
    Specified by:
    
    getAnnotationsFromFile in class AbstractIncrementalCorpusReader<TextAnnotation>
    
    Parameters:
    
    corpusFileListEntry - corpus file containing content to be processed
    
    Returns:
    
    List of TextAnnotation objects extracted from the corpus file
    
    Throws:
    
    Exception
  - generateReport
```
public String generateReport()
```
    Generates a human-readable report of the annotations read from the source file (plus whatever other relevant statistics the user should know about).
    
    Overrides:
    
    generateReport in class AbstractIncrementalCorpusReader<TextAnnotation>
