public class EREDocumentReader extends XmlDocumentReader
Modifier and Type | Class and Description |
---|---|
static class |
EREDocumentReader.EreCorpus
The prefix indicates the language; the suffix indicates the release.
ENRX, ESRX, and ZHRX are ERE releases in English, Spanish, and Chinese, respectively, from LDC/DEFT.
KBPXX is a Knowledge Base Population corpus from year XX.
|
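The naming convention described for EreCorpus (language prefix, release suffix) can be illustrated with a small standalone sketch. The class `CorpusNameSketch`, its `describe` method, and the exact name shapes it accepts (two-letter language code, then `R` and a release number, or `KBP` and a two-digit year) are assumptions made for illustration; they are not part of EREDocumentReader, and the real enum may contain other values.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper illustrating the corpus naming scheme; not part of the library. */
final class CorpusNameSketch {
    // Assumed shape of an ERE release name: language code, "R", release number.
    private static final Pattern ERE = Pattern.compile("(EN|ES|ZH)R(\\d+)");
    // Assumed shape of a KBP corpus name: "KBP" followed by a two-digit year.
    private static final Pattern KBP = Pattern.compile("KBP(\\d{2})");

    /** Returns a human-readable description of a corpus name such as "ENR3". */
    static String describe(String corpusName) {
        Matcher ere = ERE.matcher(corpusName);
        if (ere.matches()) {
            return languageName(ere.group(1)) + " ERE release " + ere.group(2);
        }
        Matcher kbp = KBP.matcher(corpusName);
        if (kbp.matches()) {
            return "KBP corpus, year " + kbp.group(1);
        }
        throw new IllegalArgumentException("Unrecognized corpus name: " + corpusName);
    }

    // Maps the language prefixes named in the class description to language names.
    private static String languageName(String prefix) {
        switch (prefix) {
            case "EN": return "English";
            case "ES": return "Spanish";
            case "ZH": return "Chinese";
            default: throw new IllegalArgumentException("Unknown language prefix: " + prefix);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("ENR3"));  // English ERE release 3
        System.out.println(describe("ZHR2"));  // Chinese ERE release 2
        System.out.println(describe("KBP17")); // KBP corpus, year 17
    }
}
```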
fileId
fileList, sourceDirectory
corpusName, currentAnnotationId, resourceManager
Constructor and Description |
---|
EREDocumentReader(EREDocumentReader.EreCorpus ereCorpus,
String corpusRoot,
boolean throwExceptionOnXmlParseFailure)
Builds an EREDocumentReader configured for the specified ERE release.
|
EREDocumentReader(EREDocumentReader.EreCorpus ereCorpus,
TextAnnotationBuilder taBuilder,
String corpusRoot,
boolean throwExceptionOnXmlParseFailure)
Builds an EREDocumentReader configured for the specified ERE release, using the provided TextAnnotationBuilder
(allows for a non-English, non-UIUC tokenizer).
|
EREDocumentReader(ResourceManager rm,
XmlTextAnnotationMaker xmlTextAnnotationMaker) |
Modifier and Type | Method and Description |
---|---|
static ResourceManager |
buildEreConfig(String ereCorpusVal,
String corpusRoot)
Sets a range of configuration parameters based on which ERE release the user specifies.
|
static XmlTextAnnotationMaker |
buildEreXmlTextAnnotationMaker(String ereCorpusVal,
boolean throwExceptionOnXmlParseFailure)
Builds an XmlTextAnnotationMaker that handles the source files from the specified ERE release.
|
static XmlTextAnnotationMaker |
buildXmlTextAnnotationMaker(EREDocumentReader.EreCorpus ereCorpus,
boolean throwExceptionOnXmlParseFail)
Builds an
XmlTextAnnotationMaker for reading an ERE-format English corpus. |
static XmlTextAnnotationMaker |
buildXmlTextAnnotationMaker(TextAnnotationBuilder textAnnotationBuilder,
EREDocumentReader.EreCorpus ereCorpus,
boolean throwExceptionOnXmlParseFail)
Builds an
XmlTextAnnotationMaker expecting ERE annotation. |
List<XmlTextAnnotation> |
getAnnotationsFromFile(List<Path> corpusFileListEntry)
Given an entry from the corpus file list generated by
getFileListing(), parses its
contents and gets zero or more TextAnnotation objects. |
List<List<Path>> |
getFileListing()
The ERE corpus directory has two subdirectories: source/ and ere/.
|
static String |
getPostViewName() |
generateReport, getRequiredAnnotationFileExtension, getRequiredSourceFileExtension, initializeReader, reset
getSourceDirectory, hasNext, next
iterator, remove
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
forEach, spliterator
forEachRemaining
public static final String QUOTE
public static final String AUTHOR
public static final String ID
public static final String DATETIME
public static final String POST
public static final String DOC
public static final String ORIG_AUTHOR
public static final String HEADLINE
public static final String IMG
public static final String SNIP
public static final String SQUISH
public static final String STUFF
public static final String SARCASM
public static final String ENTITIES
public static final String ENTITY
public static final String FILLERS
public static final String FILLER
public static final String OFFSET
public static final String TYPE
public static final String ENTITY_MENTION
public static final String NOUN_TYPE
public static final String PRO
public static final String NOM
public static final String NAM
public static final String FILL
public static final String LENGTH
public static final String MENTION_TEXT
public static final String MENTION_HEAD
public static final String SPECIFICITY
public static final String REALIS
public static final String RELATIONS
public static final String RELATION
public static final String RELATION_MENTION
public static final String HOPPERS
public static final String HOPPER
public static final String EVENT_MENTION
public static final String EVENT_ARGUMENT
public static final String WAYS
public static final String SUBTYPE
public static final String ARG_ONE
public static final String ARG_TWO
public static final String ENTITY_MENTION_ID
public static final String ENTITY_ID
public static final String ROLE
public static final String FILLER_ID
public static final String DATELINE
public static final String CORPUS_TYPE
public static final String SOURCE
public static final String TRIGGER
public static final String ORIGIN
public static final String UNKNOWN_KBID
public static final String KBID
public static final String EntityMentionTypeAttribute
public static final String EntityIdAttribute
public static final String EntityMentionIdAttribute
public static final String EntityHeadStartCharOffset
public static final String EntityHeadEndCharOffset
public static final String EntitySpecificityAttribute
public static final String RelationIdAttribute
public static final String RelationMentionIdAttribute
public static final String RelationSubtypeAttribute
public static final String RelationTypeAttribute
public static final String RelationRealisAttribute
public static final String RelationSourceRoleAttribute
public static final String RelationTargetRoleAttribute
public static final String EventIdAttribute
public static final String EventMentionIdAttribute
public static final String EntityKbIdAttribute
public static final String NAME_START
public static final String NAME_END
public static final String UNSPECIFIED
public final Map<String,Set<String>> tagsWithAtts
public EREDocumentReader(EREDocumentReader.EreCorpus ereCorpus, TextAnnotationBuilder taBuilder, String corpusRoot, boolean throwExceptionOnXmlParseFailure) throws Exception
ereCorpus
- a value from enum EreCorpus (e.g. 'ENR1', 'ENR2', or 'ENR3')
taBuilder
- TextAnnotationBuilder for the target language of choice
throwExceptionOnXmlParseFailure
- if 'true', throw an exception if the XML parser fails
Exception
public EREDocumentReader(EREDocumentReader.EreCorpus ereCorpus, String corpusRoot, boolean throwExceptionOnXmlParseFailure) throws Exception
ereCorpus
- a value from enum EreCorpus (e.g. 'ENR1', 'ENR2', or 'ENR3')
throwExceptionOnXmlParseFailure
- if 'true', throw an exception if the XML parser fails
Exception
public EREDocumentReader(ResourceManager rm, XmlTextAnnotationMaker xmlTextAnnotationMaker) throws Exception
Exception
public static XmlTextAnnotationMaker buildEreXmlTextAnnotationMaker(String ereCorpusVal, boolean throwExceptionOnXmlParseFailure) throws Exception
ereCorpusVal
- a value corresponding to enum EreCorpus (e.g. 'ENR1', 'ENR2', or 'ENR3')
throwExceptionOnXmlParseFailure
- if 'true', the XML reader will throw an exception if it finds, e.g., mismatched XML open/close tags
Exception
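To make the effect of a flag like throwExceptionOnXmlParseFailure concrete, the following standalone sketch checks whether open and close tags match, and either throws or returns `false` depending on the flag. It is only an illustration of strict versus lenient handling: the class `TagBalanceSketch`, its method names, and the regex-based tag scan are hypothetical and do not reflect how XmlTextAnnotationMaker actually parses its input.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical strict/lenient tag-balance check; not the library's parser. */
final class TagBalanceSketch {
    // Matches <name ...>, </name>, and <name ... />. Comments and CDATA are not handled.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z_][\\w.-]*)[^<>]*?(/?)>");

    /**
     * Returns true if every opening tag has a matching closing tag.
     * On a mismatch, throws if the flag is true; otherwise returns false.
     */
    static boolean isBalanced(String xml, boolean throwExceptionOnXmlParseFailure) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(xml);
        while (m.find()) {
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = !m.group(3).isEmpty();
            String name = m.group(2);
            if (selfClosing) {
                continue; // <img/> opens and closes in one step
            }
            if (!closing) {
                open.push(name);
            } else if (open.isEmpty() || !open.pop().equals(name)) {
                return fail("Unexpected closing tag </" + name + "> at offset " + m.start(),
                        throwExceptionOnXmlParseFailure);
            }
        }
        if (!open.isEmpty()) {
            return fail("Unclosed tag <" + open.peek() + ">", throwExceptionOnXmlParseFailure);
        }
        return true;
    }

    // Strict mode raises an exception; lenient mode reports failure via the return value.
    private static boolean fail(String message, boolean strict) {
        if (strict) {
            throw new IllegalStateException(message);
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("<doc><post>hi</post></doc>", true)); // true
        System.out.println(isBalanced("<doc><post>hi</doc>", false));       // false
    }
}
```

In lenient mode a caller can skip the offending document and continue; in strict mode the first malformed file stops processing, which is useful when validating a new corpus release.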
public static ResourceManager buildEreConfig(String ereCorpusVal, String corpusRoot) throws Exception
ereCorpusVal
- a value corresponding to enum EreCorpus (e.g. 'ENR1', 'ENR2', or 'ENR3')
corpusRoot
- the root directory of the corpus on your file system
Exception
public static XmlTextAnnotationMaker buildXmlTextAnnotationMaker(EREDocumentReader.EreCorpus ereCorpus, boolean throwExceptionOnXmlParseFail)
Builds an XmlTextAnnotationMaker for reading an ERE-format English corpus.
ereCorpus
- which ERE release is being processed; this affects which tag blocks are marked
throwExceptionOnXmlParseFail
- if 'true', throw an exception if the XML parser fails
public static XmlTextAnnotationMaker buildXmlTextAnnotationMaker(TextAnnotationBuilder textAnnotationBuilder, EREDocumentReader.EreCorpus ereCorpus, boolean throwExceptionOnXmlParseFail)
Builds an XmlTextAnnotationMaker expecting ERE annotation. The TextAnnotationBuilder must be configured for the target language.
textAnnotationBuilder
- a TextAnnotationBuilder with a tokenizer suited to the target language
throwExceptionOnXmlParseFail
- if 'true', the XmlTextAnnotationMaker will throw an exception if any errors are found in the source XML
public static String getPostViewName()
public List<List<Path>> getFileListing() throws IOException
The ERE corpus directory has two subdirectories: source/ and ere/. Expects super.getSourceDirectory()
to return the root directory of the ERE corpus, under which
should be data/source/ and data/ere/ directories containing source files and annotation files,
respectively.
getFileListing in class XmlDocumentReader
IOException
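As a rough illustration of the `List<List<Path>>` shape returned here, the sketch below pairs each source file with the annotation files that belong to it, placing the source file first in each entry. The class `FileListingSketch`, the example file names, and the matching rule (an annotation file name starts with the source file's stem followed by a dot) are assumptions made for this example; the actual matching logic in EREDocumentReader may differ.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical pairing of source files with annotation files; not the library's code. */
final class FileListingSketch {
    /**
     * Builds one entry per source file: the source file first, followed by every
     * annotation file whose name starts with the source file's stem plus ".".
     * Source files with no annotation files still get a one-element entry.
     */
    static List<List<Path>> pair(List<Path> sourceFiles, List<Path> annotationFiles) {
        List<List<Path>> listing = new ArrayList<>();
        for (Path source : sourceFiles) {
            String prefix = stem(source) + ".";
            List<Path> entry = new ArrayList<>();
            entry.add(source);
            for (Path annotation : annotationFiles) {
                if (annotation.getFileName().toString().startsWith(prefix)) {
                    entry.add(annotation);
                }
            }
            listing.add(entry);
        }
        return listing;
    }

    // File name up to (not including) the first dot, e.g. "doc1" for "doc1.xml".
    private static String stem(Path file) {
        String name = file.getFileName().toString();
        int dot = name.indexOf('.');
        return dot < 0 ? name : name.substring(0, dot);
    }

    public static void main(String[] args) {
        // Hypothetical file names under the data/source/ and data/ere/ directories.
        List<Path> sources = Arrays.asList(
                Paths.get("data/source/doc1.xml"),
                Paths.get("data/source/doc10.xml"));
        List<Path> annotations = Arrays.asList(
                Paths.get("data/ere/doc1.rich_ere.xml"),
                Paths.get("data/ere/doc10.rich_ere.xml"));
        for (List<Path> entry : pair(sources, annotations)) {
            System.out.println(entry);
        }
    }
}
```

Appending the dot to the stem before matching keeps `doc1` from also picking up the annotation files of `doc10`.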
public List<XmlTextAnnotation> getAnnotationsFromFile(List<Path> corpusFileListEntry) throws Exception
Given an entry from the corpus file list generated by getFileListing(), parses its
contents and gets zero or more TextAnnotation objects. This allows for the case where corpus
annotations are provided in standoff format in one or more files separate from the source
document. In such cases, the first file in the list should contain the source document
and the rest should be the corresponding markup files.
In this default implementation, it is assumed that a single file contains both source and markup.
getAnnotationsFromFile in class XmlDocumentReader
corpusFileListEntry
- a list of files, the first of which is a source file
Exception
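The convention that the first file in an entry is the source document and the remaining files are markup can be captured in a small value type. The sketch below is a hypothetical helper (`CorpusEntrySketch` is not a library class) showing one way a caller might split an entry before parsing; the file names are invented for the example.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical view of one corpus file list entry; not part of the library. */
final class CorpusEntrySketch {
    final Path source;            // first file in the entry: the source document
    final List<Path> markupFiles; // remaining files: standoff markup, possibly empty

    private CorpusEntrySketch(Path source, List<Path> markupFiles) {
        this.source = source;
        this.markupFiles = markupFiles;
    }

    /** Splits an entry into its source file and zero or more markup files. */
    static CorpusEntrySketch of(List<Path> corpusFileListEntry) {
        if (corpusFileListEntry == null || corpusFileListEntry.isEmpty()) {
            throw new IllegalArgumentException("Entry must contain at least a source file");
        }
        Path source = corpusFileListEntry.get(0);
        List<Path> markup = new ArrayList<>(
                corpusFileListEntry.subList(1, corpusFileListEntry.size()));
        return new CorpusEntrySketch(source, Collections.unmodifiableList(markup));
    }

    public static void main(String[] args) {
        CorpusEntrySketch entry = of(Arrays.asList(
                Paths.get("data/source/doc1.xml"),
                Paths.get("data/ere/doc1.part1.xml"),
                Paths.get("data/ere/doc1.part2.xml")));
        System.out.println("source: " + entry.source);
        System.out.println("markup: " + entry.markupFiles);
    }
}
```

A single-element entry yields an empty markup list, which corresponds to the default case where one file contains both source and markup.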
Copyright © 2017. All rights reserved.