public class XmlDocumentReader extends AbstractIncrementalCorpusReader<XmlTextAnnotation>
Generates one XmlTextAnnotation object per file for a corpus consisting of files containing xml
fragments or full xml trees. This implementation has been created for the DEFT ERE collection, but should
generalize to other tasks by substituting an appropriately parameterized XmlDocumentReader.
The ERE documents appear to be forum data, not full xml, but xml-ish. The LDC README in LDC2016E27 indicates
that these documents are xml fragments, not full xml. Therefore, they should be treated as raw
text, even though they contain xml-escaped character forms: character offsets for standoff
annotation will refer to these expanded forms. This reader generates a cleaned-up, text-only
version of the document and also retrieves information from the xml markup. The StringTransformation
that accompanies it maps the cleaned-up text offsets back to the original xml file. The xml markup
offsets correspond to the original xml document.
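The offset bookkeeping described above can be illustrated with a minimal, self-contained sketch. It is not the library's StringTransformation API: the class `EscapedTextOffsetSketch`, its `Cleaned` holder, and the `clean` and `toOriginal` methods are hypothetical names introduced only for this example, and the sketch handles just the five predefined xml entities.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; NOT the StringTransformation class used by this reader. */
final class EscapedTextOffsetSketch {

    /** Cleaned text plus, for each cleaned character, its offset in the original string. */
    static final class Cleaned {
        final String text;
        final int[] originalOffsets;

        Cleaned(String text, int[] originalOffsets) {
            this.text = text;
            this.originalOffsets = originalOffsets;
        }

        /** Map an offset in the cleaned text back to the original (escaped) text. */
        int toOriginal(int cleanedOffset) {
            return originalOffsets[cleanedOffset];
        }
    }

    // The five predefined xml entities; each one collapses to a single character.
    private static final String[][] ENTITIES = {
        {"&amp;", "&"}, {"&lt;", "<"}, {"&gt;", ">"}, {"&quot;", "\""}, {"&apos;", "'"}
    };

    static Cleaned clean(String original) {
        StringBuilder out = new StringBuilder();
        List<Integer> offsets = new ArrayList<>();
        int i = 0;
        while (i < original.length()) {
            String replacement = String.valueOf(original.charAt(i));
            int consumed = 1;
            for (String[] entity : ENTITIES) {
                if (original.startsWith(entity[0], i)) {
                    replacement = entity[1];
                    consumed = entity[0].length();
                    break;
                }
            }
            out.append(replacement);
            offsets.add(i);          // record where this cleaned character started
            i += consumed;
        }
        int[] map = new int[offsets.size()];
        for (int k = 0; k < map.length; k++) {
            map[k] = offsets.get(k);
        }
        return new Cleaned(out.toString(), map);
    }

    public static void main(String[] args) {
        Cleaned c = clean("AT&amp;T rocks");
        int cleanedStart = c.text.indexOf("rocks");
        System.out.println(c.text);                                            // AT&T rocks
        System.out.println(cleanedStart + " -> " + c.toOriginal(cleanedStart)); // 5 -> 9
    }
}
```

The point of the sketch is that any span found in the cleaned text must be translated through the recorded mapping before it is compared with standoff annotations, which refer to the original file.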
The Xml document structure consists of one or more "post" elements, each possibly containing one or
more "quote" elements (which may be nested) and which may have other tags (image files and other url-like
stuff, possibly html formatting), though these will generally be escaped. This reader handles
these problems, internally normalizing these escaped tags and treating them like regular xml elements.
The XmlTextAnnotations will be returned with TextID fields set to the name of the source file.
While no effort is made to represent the inter-post/quoted segment structure, the xml markup information
allows it to be reconstructed (look for entries with key SPAN_INFO, whose value set will contain one entry
naming the xml tag).
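As a rough sketch of how the post/quote structure might be rebuilt from markup spans, the example below nests spans by containment. It assumes each markup entry can be reduced to a tag name plus start and end character offsets; the `TagSpan` class and the `outline` method are hypothetical and do not correspond to any type in this library.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;

/** Hypothetical illustration: rebuild nesting from (tag, start, end) spans. */
final class SpanNestingSketch {

    /** Assumed shape of one markup span; end is exclusive. */
    static final class TagSpan {
        final String tag;
        final int start;
        final int end;

        TagSpan(String tag, int start, int end) {
            this.tag = tag;
            this.start = start;
            this.end = end;
        }
    }

    /** Return one line per span, indented by two spaces per level of enclosing spans. */
    static List<String> outline(List<TagSpan> spans) {
        List<TagSpan> sorted = new ArrayList<>(spans);
        // Outer spans first: earlier start, and for equal starts the longer span.
        sorted.sort(Comparator.comparingInt((TagSpan s) -> s.start)
                .thenComparingInt(s -> -s.end));

        Deque<TagSpan> open = new ArrayDeque<>();
        List<String> lines = new ArrayList<>();
        for (TagSpan s : sorted) {
            // Close every span that ends before this one begins.
            while (!open.isEmpty() && open.peek().end <= s.start) {
                open.pop();
            }
            StringBuilder line = new StringBuilder();
            for (int i = 0; i < open.size(); i++) {
                line.append("  ");
            }
            line.append(s.tag).append(" [").append(s.start).append(',').append(s.end).append(')');
            lines.add(line.toString());
            open.push(s);
        }
        return lines;
    }

    public static void main(String[] args) {
        List<TagSpan> spans = Arrays.asList(
                new TagSpan("quote", 20, 40),
                new TagSpan("post", 0, 100),
                new TagSpan("post", 100, 150),
                new TagSpan("quote", 10, 60));
        for (String line : outline(spans)) {
            System.out.println(line);
        }
    }
}
```

The sketch assumes well-nested spans, which is what one would expect from xml elements; overlapping spans would need extra handling.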
When accessing the TextAnnotation element of the XmlTextAnnotation, be aware that you must use the
accompanying StringTransformation to recover the offsets from the source xml file.

Modifier and Type | Field and Description |
---|---|
protected String | fileId |
fileList, sourceDirectory
corpusName, currentAnnotationId, resourceManager
Constructor and Description |
---|
XmlDocumentReader(ResourceManager rm, XmlTextAnnotationMaker xmlTextAnnotationMaker) Instantiate a reader for an xml corpus. |
Modifier and Type | Method and Description |
---|---|
String | generateReport() generate a human-readable report of annotations read from the source file (plus whatever other relevant statistics the user should know about). |
List<XmlTextAnnotation> | getAnnotationsFromFile(List<Path> corpusFileListEntry) given an entry from the corpus file list generated by getFileListing(), parse its contents and get zero or more TextAnnotation objects. |
List<List<Path>> | getFileListing() generate a list of lists of files comprising the corpus. |
protected String | getRequiredAnnotationFileExtension() Exclude any files not possessing this extension. |
protected String | getRequiredSourceFileExtension() Exclude any files not possessing this extension. |
protected void | initializeReader() this method is called by the base class constructor, so all subclass-specific object initialization must be done here. |
void | reset() set the reader to start from the beginning of the corpus. |
getSourceDirectory, hasNext, next
iterator, remove
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
forEach, spliterator
forEachRemaining
protected String fileId
public XmlDocumentReader(ResourceManager rm, XmlTextAnnotationMaker xmlTextAnnotationMaker) throws Exception
Instantiate a reader for an xml corpus. The XmlTextAnnotationMaker should be configured to
process the xml markup in the files you want to process.
Parameters:
rm - resourceManager with configuration specs (source and annotation directories, file extensions, etc.)
xmlTextAnnotationMaker - parses xml text and generates an XmlTextAnnotation.
Throws:
IOException
Exception
protected void initializeReader()
initializeReader in class AbstractIncrementalCorpusReader<XmlTextAnnotation>
public void reset()
reset in interface IResetableIterator<XmlTextAnnotation>
reset in class AbstractIncrementalCorpusReader<XmlTextAnnotation>
public List<List<Path>> getFileListing() throws IOException
getFileListing in class AbstractIncrementalCorpusReader<XmlTextAnnotation>
Throws:
IOException
public List<XmlTextAnnotation> getAnnotationsFromFile(List<Path> corpusFileListEntry) throws Exception
Given an entry from the corpus file list generated by getFileListing(), parse its
contents and get zero or more TextAnnotation objects. This allows for the case where corpus
annotations are provided in standoff format in one or more files separate from the source
document. In such cases, the first file in the list should contain the source document
and the rest should be the corresponding markup files.
In this default implementation, it is assumed that a single file contains both source and markup.
getAnnotationsFromFile in class AbstractIncrementalCorpusReader<XmlTextAnnotation>
Parameters:
corpusFileListEntry - a list of files, the first of which is a source file.
Throws:
Exception
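The file-list convention described above (one inner list per document, source file first) can be sketched with plain java.nio. This is an illustration under stated assumptions, not this class's actual implementation: `FileListingSketch` and `listSingleFileEntries` are hypothetical names, and the sketch covers only the single-file case where source and markup live in the same file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical illustration of a corpus file listing with one source file per entry. */
final class FileListingSketch {

    /**
     * Build one single-element entry per file in sourceDir whose name ends with
     * requiredExtension; files without that extension are excluded.
     */
    static List<List<Path>> listSingleFileEntries(Path sourceDir, String requiredExtension)
            throws IOException {
        List<Path> sources;
        try (Stream<Path> files = Files.list(sourceDir)) {
            sources = files
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(requiredExtension))
                    .sorted()
                    .collect(Collectors.toList());
        }
        List<List<Path>> entries = new ArrayList<>();
        for (Path source : sources) {
            // The first (and here only) element of each entry is the source file.
            entries.add(Collections.singletonList(source));
        }
        return entries;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("corpus");
        Files.createFile(dir.resolve("a.xml"));
        Files.createFile(dir.resolve("b.txt"));
        Files.createFile(dir.resolve("c.xml"));
        for (List<Path> entry : listSingleFileEntries(dir, ".xml")) {
            System.out.println(entry.get(0).getFileName());
        }
    }
}
```

For a corpus with separate standoff files, each inner list would instead hold the source file followed by its markup files.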
public String generateReport()
generateReport in class AbstractIncrementalCorpusReader<XmlTextAnnotation>
protected String getRequiredSourceFileExtension()
protected String getRequiredAnnotationFileExtension()
Copyright © 2017. All rights reserved.