XmlDocumentReader (illinois-cogcomp-nlp 3.1.29 API)

java.lang.Object
- edu.illinois.cs.cogcomp.nlp.corpusreaders.AnnotationReader<T>
  - edu.illinois.cs.cogcomp.nlp.corpusreaders.AbstractIncrementalCorpusReader<XmlTextAnnotation>
    - edu.illinois.cs.cogcomp.nlp.corpusreaders.XmlDocumentReader

All Implemented Interfaces:

IResetableIterator<XmlTextAnnotation>, Iterable<XmlTextAnnotation>, Iterator<XmlTextAnnotation>

Direct Known Subclasses:

EREDocumentReader
```
public class XmlDocumentReader
extends AbstractIncrementalCorpusReader<XmlTextAnnotation>
```
Generates an XmlTextAnnotation object per file for a corpus consisting of files containing xml fragments or full xml trees. This implementation was created for the DEFT ERE collection, but should generalize to other tasks by substituting an appropriately parameterized XmlDocumentReader.

The ERE documents appear to be forum data: xml-ish, but not full xml. The LDC README in LDC2016E27 indicates that these documents are xml fragments, not full xml. They should therefore be treated as raw text, even though they contain xml-escaped character forms: character offsets for standoff annotation will refer to these expanded forms.

This reader generates a cleaned-up, text-only version of the document and also retrieves information from the xml markup. The accompanying StringTransformation maps offsets in the cleaned-up text back to the original xml file. The xml markup offsets correspond to the original xml document.

The xml document structure consists of one or more "post" elements, each possibly containing one or more "quote" elements (which may be nested). Posts may also contain other tags (image files and other url-like content, possibly html formatting), though these will generally be escaped. This reader handles these cases by internally normalizing the escaped tags and treating them like regular xml elements.

The XmlTextAnnotations are returned with TextID fields set to the name of the source file. While no effort is made to represent the inter-post/quoted segment structure, the xml markup information allows it to be reconstructed (look for entries with key SPAN_INFO, whose value set will contain one entry naming the xml tag). When accessing the TextAnnotation element of the XmlTextAnnotation, be aware that you must use the accompanying StringTransformation to recover the offsets from the source xml file.
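
The idea of mapping cleaned-text offsets back to the source can be illustrated with a minimal, standalone sketch. It is not the library's StringTransformation: the class `OffsetMapSketch` and its methods are hypothetical names invented for this illustration. It assumes tags are simply anything between `<` and `>`, and it leaves xml-escaped character forms untouched as raw text.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of illinois-cogcomp-nlp. */
final class OffsetMapSketch {
    /** The source text with everything between '<' and '>' removed. */
    final String cleanText;
    /** originalOffsets[i] is the index in the source of cleanText.charAt(i). */
    private final int[] originalOffsets;

    private OffsetMapSketch(String cleanText, int[] originalOffsets) {
        this.cleanText = cleanText;
        this.originalOffsets = originalOffsets;
    }

    /** Strip tag-like spans and remember where each kept character came from. */
    static OffsetMapSketch stripTags(String source) {
        StringBuilder clean = new StringBuilder();
        List<Integer> offsets = new ArrayList<>();
        boolean inTag = false;
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c == '<') {
                inTag = true;               // start skipping markup
            } else if (c == '>' && inTag) {
                inTag = false;              // markup ends; resume copying
            } else if (!inTag) {
                clean.append(c);
                offsets.add(i);
            }
        }
        int[] map = new int[offsets.size()];
        for (int k = 0; k < map.length; k++) {
            map[k] = offsets.get(k);
        }
        return new OffsetMapSketch(clean.toString(), map);
    }

    /** Map a [start, end) span over the clean text to a [start, end) span in the source. */
    int[] toOriginalSpan(int start, int end) {
        if (start < 0 || end > cleanText.length() || start >= end) {
            throw new IllegalArgumentException("invalid span: " + start + ", " + end);
        }
        return new int[] {originalOffsets[start], originalOffsets[end - 1] + 1};
    }

    public static void main(String[] args) {
        String source = "<post author=\"a\">Hi <quote>there</quote> all</post>";
        OffsetMapSketch sketch = stripTags(source);
        int start = sketch.cleanText.indexOf("there");
        int[] span = sketch.toOriginalSpan(start, start + "there".length());
        // The recovered span selects the same word in the original markup.
        System.out.println(source.substring(span[0], span[1]));
    }
}
```

Keeping one source index per retained character makes the reverse lookup a plain array access. A real transformation would also have to account for normalized escaped tags, which this sketch ignores.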

Field Summary

Fields
Modifier and Type Field and Description

protected String fileId
- Fields inherited from class edu.illinois.cs.cogcomp.nlp.corpusreaders.AbstractIncrementalCorpusReader
  fileList, sourceDirectory
- Fields inherited from class edu.illinois.cs.cogcomp.nlp.corpusreaders.AnnotationReader
  corpusName, currentAnnotationId, resourceManager

Constructor Summary

Constructors
Constructor and Description

XmlDocumentReader(ResourceManager rm, XmlTextAnnotationMaker xmlTextAnnotationMaker)
Instantiate a reader for an xml corpus.

Method Summary

Modifier and Type	Method and Description
`String`	`generateReport()` Generate a human-readable report of annotations read from the source file (plus whatever other relevant statistics the user should know about).
`List<XmlTextAnnotation>`	`getAnnotationsFromFile(List<Path> corpusFileListEntry)` Given an entry from the corpus file list generated by `getFileListing()`, parse its contents and get zero or more TextAnnotation objects.
`List<List<Path>>`	`getFileListing()` Generate a list of lists of files comprising the corpus.
`protected String`	`getRequiredAnnotationFileExtension()` Exclude any files not possessing this extension.
`protected String`	`getRequiredSourceFileExtension()` Exclude any files not possessing this extension.
`protected void`	`initializeReader()` This method is called by the base class constructor, so all subclass-specific object initialization must be done here.
`void`	`reset()` Set the reader to start from the beginning of the corpus.

Methods inherited from class edu.illinois.cs.cogcomp.nlp.corpusreaders.AbstractIncrementalCorpusReader
getSourceDirectory, hasNext, next

Methods inherited from class edu.illinois.cs.cogcomp.nlp.corpusreaders.AnnotationReader
iterator, remove

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface java.lang.Iterable
forEach, spliterator

Methods inherited from interface java.util.Iterator
forEachRemaining

- Field Detail
  - fileId
```
protected String fileId
```
- Constructor Detail
  - XmlDocumentReader
```
public XmlDocumentReader(ResourceManager rm,
                         XmlTextAnnotationMaker xmlTextAnnotationMaker)
                  throws Exception
```
    Instantiate a reader for an xml corpus. The default implementation assumes a single source corpus from which the user wants to strip xml markup while recording relevant xml markup info. The XmlTextAnnotationMaker should be configured to process the xml markup in the files you want to process.
    
    Parameters:
    
    rm - resourceManager with configuration specs (source and annotation directories, file extensions, etc.)
    
    xmlTextAnnotationMaker - parses xml text and generates an XmlTextAnnotation.
    
    Throws:
    
    IOException
    
    Exception
- Method Detail
  - initializeReader
```
protected void initializeReader()
```
    This method is called by the base class constructor, so all subclass-specific object initialization must be done here. This default implementation assumes that annotation and source are both provided in the same file.
    
    Overrides:
    
    initializeReader in class AbstractIncrementalCorpusReader<XmlTextAnnotation>
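
    Why initialization has to live in this method follows from Java's construction order: the base class constructor runs before any subclass field initializer or constructor body. The sketch below uses hypothetical stand-in classes (`BaseReaderSketch`, `ChildReaderSketch`), not the library's own, to show that a value assigned in the hook survives only if the field has no inline initializer.

```java
/** Hypothetical stand-in for a base reader whose constructor calls an overridable hook. */
abstract class BaseReaderSketch {
    protected BaseReaderSketch() {
        // Runs before any subclass field initializer or subclass constructor body.
        initializeReader();
    }

    protected abstract void initializeReader();
}

/** Hypothetical subclass showing which assignments survive construction. */
final class ChildReaderSketch extends BaseReaderSketch {
    // No inline initializer: the value assigned in initializeReader() is kept.
    protected String fileId;

    // Inline initializer: executes after the base constructor returns,
    // so it overwrites whatever initializeReader() stored here.
    protected String clobbered = "inline";

    ChildReaderSketch() {
        super();
    }

    @Override
    protected void initializeReader() {
        fileId = "set-in-initializeReader";
        clobbered = "set-in-initializeReader";
    }

    public static void main(String[] args) {
        ChildReaderSketch reader = new ChildReaderSketch();
        System.out.println(reader.fileId);     // value from initializeReader()
        System.out.println(reader.clobbered);  // value from the inline initializer
    }
}
```

    In practice this means state the reader needs should be assigned inside the hook, and the corresponding fields should be declared without initializers.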
  - reset
```
public void reset()
```
    Set the reader to start from the beginning of the corpus.
    
    Specified by:
    
    reset in interface IResetableIterator<XmlTextAnnotation>
    
    Overrides:
    
    reset in class AbstractIncrementalCorpusReader<XmlTextAnnotation>
  - getFileListing
```
public List<List<Path>> getFileListing()
                                throws IOException
```
    Generate a list of lists of files comprising the corpus. Each entry is expected to generate one or more TextAnnotation objects, though the way the iterator is implemented allows for corpus files to generate zero TextAnnotations if you are feeling picky. Each entry in the list is itself a list in which the first file contains the source document. If that file does not also contain the annotation info, the remaining entries in the list name the file(s) containing the annotation markup. The default implementation assumes only a single self-contained file is provided for each document.
    
    Specified by:
    
    getFileListing in class AbstractIncrementalCorpusReader<XmlTextAnnotation>
    
    Returns:
    
    a List of Lists of Path objects, each containing a source file and corresponding markup files.
    
    Throws:
    
    IOException
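
    The shape of the returned listing, one inner list per document with the source file first, can be sketched with the standard library alone. This is an assumption-laden illustration rather than the reader's actual implementation: `FileListingSketch` and `listCorpus` are hypothetical names, and the directory and extension are passed in directly instead of coming from a ResourceManager.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical illustration only; not part of illinois-cogcomp-nlp. */
final class FileListingSketch {
    /**
     * Build one single-element entry per regular file in sourceDir whose name
     * ends with requiredExtension. Each entry holds only the source file, which
     * matches the "single self-contained file per document" assumption.
     */
    static List<List<Path>> listCorpus(Path sourceDir, String requiredExtension)
            throws IOException {
        List<Path> matches = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(sourceDir)) {
            for (Path candidate : stream) {
                String name = candidate.getFileName().toString();
                if (Files.isRegularFile(candidate) && name.endsWith(requiredExtension)) {
                    matches.add(candidate);
                }
            }
        }
        Collections.sort(matches); // stable, repeatable corpus order

        List<List<Path>> listing = new ArrayList<>();
        for (Path source : matches) {
            // A standoff corpus would append annotation files after the source file.
            listing.add(Collections.singletonList(source));
        }
        return listing;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("corpus-sketch");
        Files.write(dir.resolve("doc1.xml"), "<post>hello</post>".getBytes());
        Files.write(dir.resolve("notes.txt"), "ignored".getBytes());
        for (List<Path> entry : listCorpus(dir, ".xml")) {
            System.out.println(entry);
        }
    }
}
```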
  - getAnnotationsFromFile
```
public List<XmlTextAnnotation> getAnnotationsFromFile(List<Path> corpusFileListEntry)
                                               throws Exception
```
    Given an entry from the corpus file list generated by getFileListing(), parse its contents and get zero or more TextAnnotation objects. This allows for the case where corpus annotations are provided in standoff format in one or more files separate from the source document. In such cases, the first file in the list should contain the source document and the rest should be the corresponding markup files. In this default implementation, it is assumed that a single file contains both source and markup.
    
    Specified by:
    
    getAnnotationsFromFile in class AbstractIncrementalCorpusReader<XmlTextAnnotation>
    
    Parameters:
    
    corpusFileListEntry - a list of files, the first of which is a source file.
    
    Returns:
    
    List of TextAnnotation objects extracted from the corpus file.
    
    Throws:
    
    Exception
  - generateReport
```
public String generateReport()
```
    Generate a human-readable report of annotations read from the source file (plus whatever other relevant statistics the user should know about).
    
    Overrides:
    
    generateReport in class AbstractIncrementalCorpusReader<XmlTextAnnotation>
  - getRequiredSourceFileExtension
```
protected String getRequiredSourceFileExtension()
```
    Exclude any files not possessing this extension.
    
    Returns:
    
    the required file extension.
  - getRequiredAnnotationFileExtension
```
protected String getRequiredAnnotationFileExtension()
```
    Exclude any files not possessing this extension.
    
    Returns:
    
    the required file extension.
