public class XmlFragmentWhitespacingDocumentReader extends AbstractIncrementalCorpusReader<TextAnnotation>
Because each CMP document is extracted verbatim from source XML files, certain characters in its content (ampersands, angle brackets, etc.) are escaped according to the XML specification. The offsets of text extents are based on treating this escaped text as-is (e.g. "&amp;" in a cmp.txt file is counted as five characters). Whenever any such string of "raw" text is included in a .rich_ere.xml file (as the text extent to which an annotation is applied), a second level of escaping is applied, so that XML parsing of the ERE XML file produces a string that exactly matches the source text.
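The offset convention and the second level of escaping can be illustrated with a minimal, self-contained sketch. The class `EscapedOffsetDemo` and its helper methods are hypothetical names used only for illustration; they are not part of this reader's API, and the escaping shown covers only ampersands and angle brackets.

```java
// Sketch only: EscapedOffsetDemo, escapeXml and unescapeXml are hypothetical helpers,
// not methods of XmlFragmentWhitespacingDocumentReader.
final class EscapedOffsetDemo {

    /** Applies one level of XML escaping; an existing entity such as "&amp;" is escaped again. */
    static String escapeXml(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Removes one level of escaping, roughly what an XML parser does to character data. */
    static String unescapeXml(String text) {
        // "&amp;" is replaced last so that "&amp;lt;" becomes "&lt;" rather than "<".
        return text.replace("&lt;", "<").replace("&gt;", ">").replace("&amp;", "&");
    }

    public static void main(String[] args) {
        // Raw text as it would appear in a cmp.txt file: the entity counts as five characters.
        String raw = "AT&amp;T";
        System.out.println(raw.length());            // 8, not 4

        // The same extent as it would be written inside a .rich_ere.xml file.
        String inEreXml = escapeXml(raw);
        System.out.println(inEreXml);                // AT&amp;amp;T

        // Parsing the ERE XML recovers a string identical to the raw source text.
        System.out.println(unescapeXml(inEreXml).equals(raw));
    }
}
```

Under these assumptions, character offsets computed against the raw cmp.txt text remain valid after the ERE extent has been parsed, because one round of unescaping yields exactly the escaped source string.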
Modifier and Type | Field and Description |
---|---|
protected String | fileId |
protected String | newFileText |
protected TextAnnotationBuilder | taBuilder |
fileList, sourceDirectory
corpusName, currentAnnotationId, resourceManager
Constructor and Description |
---|
XmlFragmentWhitespacingDocumentReader(String corpusName, String sourceDirectory, String sourceFileExtension, String annotationFileExtension) Assumes files are all from a single source directory, and that no extraneous files are included in that directory. |
Modifier and Type | Method and Description |
---|---|
String | generateReport() Generate a human-readable report of annotations read from the source file (plus whatever other relevant statistics the user should know about). |
List<TextAnnotation> | getAnnotationsFromFile(List<Path> corpusFileListEntry) Given an entry from the corpus file list generated by getFileListing(), parse its contents and get zero or more TextAnnotation objects. |
List<List<Path>> | getFileListing() Generate a list of files comprising the corpus. |
protected String | getRequiredFileExtension() Exclude any files not possessing this extension. |
void | reset() Override this to conform to whatever the derived class's state mechanism requires. |
protected String | stripText(String original) This method can be overridden to do more complex parsing. |
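The summary for stripText(String) says only that it can be overridden for more complex parsing. As a rough illustration of the kind of offset-preserving stripping the class name suggests, the sketch below replaces every character of an XML tag with a space so that the remaining text keeps its original offsets. This is an assumption about the intent, not the library's actual implementation; `TagWhitespacer` and `whitespaceTags` are hypothetical names.

```java
// Sketch only: TagWhitespacer is a hypothetical stand-alone helper, not the reader's stripText().
final class TagWhitespacer {

    /**
     * Replaces each character inside <...> with a space (newlines are kept),
     * so the output has the same length as the input and text offsets are unchanged.
     */
    static String whitespaceTags(String original) {
        StringBuilder sb = new StringBuilder(original.length());
        boolean inTag = false;
        for (int i = 0; i < original.length(); i++) {
            char c = original.charAt(i);
            if (c == '<') {
                inTag = true;
            }
            if (inTag) {
                sb.append(c == '\n' ? '\n' : ' ');
            } else {
                sb.append(c);
            }
            if (c == '>') {
                inTag = false;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String source = "<post>Hi &amp; bye</post>";
        String stripped = whitespaceTags(source);
        // Same length, and the text starts at the same offset as in the source.
        System.out.println(stripped.length() == source.length());
        System.out.println(stripped.indexOf("Hi") == source.indexOf("Hi"));
    }
}
```

Escaped entities such as "&amp;" are deliberately left untouched here, which matches the convention that offsets are counted over the escaped text.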
getSourceDirectory, hasNext, initializeReader, next
iterator, remove
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
forEach, spliterator
forEachRemaining
protected TextAnnotationBuilder taBuilder
protected String fileId
protected String newFileText
public XmlFragmentWhitespacingDocumentReader(String corpusName, String sourceDirectory, String sourceFileExtension, String annotationFileExtension) throws Exception
Parameters:
corpusName
sourceDirectory
Throws:
IOException
Exception
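public void reset()
Description copied from class: AnnotationReader
Override this to conform to whatever the derived class's state mechanism requires.
reset in interface IResetableIterator<TextAnnotation>
reset in class AbstractIncrementalCorpusReader<TextAnnotation>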
protected String getRequiredFileExtension()
public List<List<Path>> getFileListing() throws IOException
getFileListing in class AbstractIncrementalCorpusReader<TextAnnotation>
Throws:
IOException
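To make the shape of the List<List<Path>> return value concrete, here is a minimal sketch of a single-directory listing that keeps only files with a required extension and wraps each file in its own one-element entry. It is an illustration under stated assumptions, not the library's implementation: `FileListingSketch` and `listCorpusFiles` are hypothetical names, and the one-path-per-entry layout and the sort order are assumptions.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Sketch only: FileListingSketch is a hypothetical helper, not part of the corpus reader API.
final class FileListingSketch {

    /** Lists regular files in one directory whose names end with the required extension. */
    static List<List<Path>> listCorpusFiles(Path sourceDirectory, String requiredExtension)
            throws IOException {
        List<List<Path>> listing = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(sourceDirectory)) {
            for (Path p : stream) {
                boolean matches = p.getFileName().toString().endsWith(requiredExtension);
                if (Files.isRegularFile(p) && matches) {
                    // Assumption: each corpus entry holds exactly one source file.
                    listing.add(Collections.singletonList(p));
                }
            }
        }
        // Sort for a deterministic iteration order (DirectoryStream order is unspecified).
        listing.sort(Comparator.comparing(entry -> entry.get(0).toString()));
        return listing;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("corpus");
        Files.write(dir.resolve("doc1.xml"), new byte[0]);
        Files.write(dir.resolve("notes.txt"), new byte[0]);
        System.out.println(listCorpusFiles(dir, ".xml").size());
    }
}
```

Each inner list would then be the argument passed to getAnnotationsFromFile(List<Path>), which is why extraneous files in the source directory need to be filtered out before iteration starts.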
protected String stripText(String original)
Parameters:
original
public List<TextAnnotation> getAnnotationsFromFile(List<Path> corpusFileListEntry) throws Exception
Given an entry from the corpus file list generated by getFileListing(), parse its contents and get zero or more TextAnnotation objects.
getAnnotationsFromFile in class AbstractIncrementalCorpusReader<TextAnnotation>
Parameters:
corpusFileListEntry - corpus file containing content to be processed
Throws:
Exception
public String generateReport()
generateReport in class AbstractIncrementalCorpusReader<TextAnnotation>
Copyright © 2017. All rights reserved.