public class ERENerReader extends EREDocumentReader
EREDocumentReader.EreCorpus
Modifier and Type | Field and Description |
---|---|
protected int[] | ends |
static String | IS_FOUND |
protected int[] | starts |
ARG_ONE, ARG_TWO, AUTHOR, CORPUS_TYPE, DATELINE, DATETIME, deletableSpanTags, DOC, ENTITIES, ENTITY, ENTITY_ID, ENTITY_MENTION, ENTITY_MENTION_ID, EntityHeadEndCharOffset, EntityHeadStartCharOffset, EntityIdAttribute, EntityKbIdAttribute, EntityMentionIdAttribute, EntityMentionTypeAttribute, EntitySpecificityAttribute, EVENT_ARGUMENT, EVENT_MENTION, EventIdAttribute, EventMentionIdAttribute, FILL, FILLER, FILLER_ID, FILLERS, HEADLINE, HOPPER, HOPPERS, ID, IMG, KBID, LENGTH, MENTION_HEAD, MENTION_TEXT, NAM, NAME_END, NAME_START, NOM, NOUN_TYPE, OFFSET, ORIG_AUTHOR, ORIGIN, POST, PRO, QUOTE, REALIS, RELATION, RELATION_MENTION, RelationIdAttribute, RelationMentionIdAttribute, RelationRealisAttribute, RELATIONS, RelationSourceRoleAttribute, RelationSubtypeAttribute, RelationTargetRoleAttribute, RelationTypeAttribute, ROLE, SARCASM, SNIP, SOURCE, SPECIFICITY, SQUISH, STUFF, SUBTYPE, tagsToIgnore, tagsWithAtts, TRIGGER, TYPE, UNKNOWN_KBID, UNSPECIFIED, WAYS
fileId
fileList, sourceDirectory
corpusName, currentAnnotationId, resourceManager
Constructor and Description |
---|
ERENerReader(EREDocumentReader.EreCorpus ereCorpus, String corpusRoot, boolean throwExceptionOnXmlParseFailure, boolean addNominalMentions, boolean addFillers) Reads Named Entity -- and possibly nominal mention -- annotation from an English ERE-format corpus. |
ERENerReader(EREDocumentReader.EreCorpus ereCorpus, TextAnnotationBuilder textAnnotationBuilder, String corpusRoot, boolean throwExceptionOnXmlParseFailure, boolean addNominalMentions, boolean addFillers) Reads Named Entity -- and possibly nominal mention -- annotation from an ERE-format corpus. |
Modifier and Type | Method and Description |
---|---|
protected void | compileOffsets(SpanLabelView tokens) Get the start and end offsets of all constituents and store them. Note that these are based on the cleaned-up text, so they need to be mapped back to the original text. |
protected int | findEndIndex(int endOffset, String rawText) Find the index of the first token constituent that has end char offset "endOffset" and return the value one higher than that index (to instantiate Constituents, which use one-past-the-end indexing). |
protected int | findEndIndexIgnoreError(int endOffset) Find the index of the first constituent at endOffset. |
protected int | findStartIndex(int startOffset) Find the index of the first constituent at startOffset. |
protected int | findStartIndexIgnoreError(int startOffset) Find the index of the first constituent *near* startOffset. |
String | generateReport() Generates a report of the Entities and Mentions read and generated. |
List<XmlTextAnnotation> | getAnnotationsFromFile(List<Path> corpusFileListEntry) Given an entry from the corpus file list generated by EREDocumentReader.getFileListing(), parse its contents and get zero or more TextAnnotation objects. |
String | getCorefViewName() |
protected void | getEntitiesFromFile(Document doc, View nerView, XmlTextAnnotation xmlTa) Read entity mentions and populate the view provided. |
protected void | getFillersFromFile(Document doc, View nerView, XmlTextAnnotation xmlTa) |
protected Constituent | getMentionConstituent(String mentionId) After reading a file's entity information, allows the client to find a Constituent corresponding to an entity mention id. |
String | getMentionViewName() |
protected IntPair | getTokenOffsets(int origStartOffset, int origEndOffset, String mentionForm, XmlTextAnnotation xmlTa) Find the token offsets in the TextAnnotation that correspond to the source character offsets for the given mention. |
void | readEntity(Node eNode, View view, XmlTextAnnotation xmlTa) Read the entities from the gold standard xml and produce appropriate constituents in the view. |
void | reset() Set the reader to start from the beginning of the corpus. |
buildEreConfig, buildEreXmlTextAnnotationMaker, buildXmlTextAnnotationMaker, buildXmlTextAnnotationMaker, getFileListing, getPostViewName
getRequiredAnnotationFileExtension, getRequiredSourceFileExtension, initializeReader
getSourceDirectory, hasNext, next
iterator, remove
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
forEach, spliterator
forEachRemaining
public static final String IS_FOUND
protected int[] starts
protected int[] ends
public ERENerReader(EREDocumentReader.EreCorpus ereCorpus, String corpusRoot, boolean throwExceptionOnXmlParseFailure, boolean addNominalMentions, boolean addFillers) throws Exception
ereCorpus - specifies ERE release -- and therefore, corpus directory structure and xml tag set
corpusRoot - data root of corpus directory
throwExceptionOnXmlParseFailure -
addNominalMentions - a flag that, if true, indicates that nominal mentions should be read, and that the view created should be named ViewNames.MENTION_ERE.
addFillers - if 'true', indicates that non-coreferable mentions should be added.
Throws: Exception
public ERENerReader(EREDocumentReader.EreCorpus ereCorpus, TextAnnotationBuilder textAnnotationBuilder, String corpusRoot, boolean throwExceptionOnXmlParseFailure, boolean addNominalMentions, boolean addFillers) throws Exception
ereCorpus - specifies ERE release -- and therefore, corpus directory structure and xml tag set
textAnnotationBuilder - TextAnnotationBuilder for target language
corpusRoot - data root of corpus directory
throwExceptionOnXmlParseFailure -
addNominalMentions - a flag that, if true, indicates that nominal mentions should be read, and that the view created should be named ViewNames.MENTION_ERE.
addFillers - if 'true', indicates that non-coreferable mentions should be added.
Throws: Exception
public void reset()
Description copied from class: XmlDocumentReader
Set the reader to start from the beginning of the corpus.
reset in interface IResetableIterator<XmlTextAnnotation>
reset in class XmlDocumentReader
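To illustrate what reset() offers a caller, here is a minimal sketch of an iterator that can be rewound for a second pass over the same items. `ResettableReader` and its list-backed storage are hypothetical stand-ins written for this example; they are not the library's XmlDocumentReader or IResetableIterator, whose internals are not shown on this page.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical list-backed reader; illustrates reset() only, not the library class. */
final class ResettableReader<T> implements Iterator<T> {
    private final List<T> items;
    private int position = 0;

    ResettableReader(List<T> items) {
        this.items = items;
    }

    @Override
    public boolean hasNext() {
        return position < items.size();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return items.get(position++);
    }

    /** Rewind so that the next call to next() returns the first item again. */
    void reset() {
        position = 0;
    }

    public static void main(String[] args) {
        ResettableReader<String> reader = new ResettableReader<>(Arrays.asList("doc1", "doc2"));
        while (reader.hasNext()) {
            System.out.println("first pass: " + reader.next());
        }
        reader.reset(); // the second pass starts from "doc1" again
        while (reader.hasNext()) {
            System.out.println("second pass: " + reader.next());
        }
    }
}
```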
public List<XmlTextAnnotation> getAnnotationsFromFile(List<Path> corpusFileListEntry) throws Exception
Description copied from class: EREDocumentReader
Given an entry from the corpus file list generated by EREDocumentReader.getFileListing(), parse its contents and get zero or more TextAnnotation objects. This allows for the case where corpus annotations are provided in standoff format in one or more files separate from the source document. In such cases, the first file in the list should contain the source document and the rest should be the corresponding markup files.
In this default implementation, it is assumed that a single file contains both source and markup.
getAnnotationsFromFile in class EREDocumentReader
corpusFileListEntry - a list of files, the first of which is a source file.
Throws: Exception
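The file-list convention described above (source document first, standoff markup files after it) can be sketched as a small helper. `CorpusEntry` and the file names are hypothetical and exist only for illustration; the library's own handling of the list is not shown on this page.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper illustrating the file-list-entry convention; not part of the library. */
final class CorpusEntry {
    final Path source;        // first file: the source document
    final List<Path> markup;  // remaining files: standoff annotation, possibly empty

    CorpusEntry(List<Path> corpusFileListEntry) {
        if (corpusFileListEntry.isEmpty()) {
            throw new IllegalArgumentException("entry must contain at least a source file");
        }
        source = corpusFileListEntry.get(0);
        markup = corpusFileListEntry.subList(1, corpusFileListEntry.size());
    }

    public static void main(String[] args) {
        // File names are invented for this example.
        CorpusEntry entry = new CorpusEntry(Arrays.asList(
                Paths.get("doc1.xml"), Paths.get("doc1.rich_ere.xml")));
        System.out.println(entry.source + " + " + entry.markup.size() + " markup file(s)");
    }
}
```

An entry with a single file corresponds to the case where one file holds both source and markup, so `markup` is empty.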
protected void getFillersFromFile(Document doc, View nerView, XmlTextAnnotation xmlTa) throws XMLException
Throws: XMLException
protected void getEntitiesFromFile(Document doc, View nerView, XmlTextAnnotation xmlTa) throws XMLException
doc - XML document containing entity information.
nerView - View to populate with new entity mentions
Throws: XMLException
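As a rough illustration of reading entity mentions out of an XML document, the sketch below walks entity elements and their mentions with the JDK's DOM parser. The element and attribute names ("entity", "entity_mention", "type", "noun_type", "offset", "length") are assumptions suggested by the inherited constants ENTITY, ENTITY_MENTION, TYPE, NOUN_TYPE, OFFSET and LENGTH; their actual string values are not given on this page. The filtering rule for nominal mentions is also an assumption, and `EntityMentionSketch` is a hypothetical class, not the ERENerReader implementation.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical sketch of walking entity markup; tag and attribute names are assumed. */
final class EntityMentionSketch {

    /** Returns one "type:nounType:offset:length" string per mention that is kept. */
    static List<String> readMentions(String xml, boolean addNominalMentions) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("unparseable markup", e);
        }
        List<String> mentions = new ArrayList<>();
        NodeList entities = doc.getElementsByTagName("entity");
        for (int i = 0; i < entities.getLength(); i++) {
            Element entity = (Element) entities.item(i);
            NodeList entityMentions = entity.getElementsByTagName("entity_mention");
            for (int j = 0; j < entityMentions.getLength(); j++) {
                Element mention = (Element) entityMentions.item(j);
                String nounType = mention.getAttribute("noun_type");
                // Assumed rule: names are always kept, nominals only when requested.
                boolean keep = nounType.equals("NAM")
                        || (addNominalMentions && nounType.equals("NOM"));
                if (keep) {
                    mentions.add(entity.getAttribute("type") + ":" + nounType + ":"
                            + mention.getAttribute("offset") + ":" + mention.getAttribute("length"));
                }
            }
        }
        return mentions;
    }

    public static void main(String[] args) {
        String xml = "<entities><entity id=\"e1\" type=\"PER\">"
                + "<entity_mention id=\"m1\" noun_type=\"NAM\" offset=\"6\" length=\"5\"/>"
                + "<entity_mention id=\"m2\" noun_type=\"NOM\" offset=\"20\" length=\"9\"/>"
                + "</entity></entities>";
        System.out.println(readMentions(xml, false));
        System.out.println(readMentions(xml, true));
    }
}
```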
protected void compileOffsets(SpanLabelView tokens)
tokens - SpanLabelView containing Token info (from TextAnnotation)
protected int findStartIndex(int startOffset)
startOffset - the character offset we want.
protected int findStartIndexIgnoreError(int startOffset)
startOffset - the character offset we want.
protected int findEndIndex(int endOffset, String rawText)
endOffset - the character offset for which we want a corresponding token index.
protected int findEndIndexIgnoreError(int endOffset)
endOffset - the character offset we want.
public void readEntity(Node eNode, View view, XmlTextAnnotation xmlTa) throws XMLException
eNode - the entity node, contains the more specific mentions of that entity.
view - the span label view we will add the labels to.
Throws: XMLException
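The offset lookups documented above (compileOffsets filling the starts and ends arrays, then findStartIndex and findEndIndex searching them) can be sketched as follows. This is a minimal stand-in under stated assumptions, not the ERENerReader source: `TokenOffsetIndex` is hypothetical, end offsets are assumed to be exclusive, a failed lookup returns -1, and the "near" variant is one possible reading of findStartIndexIgnoreError.

```java
/** Hypothetical stand-in for the token offset lookups; not the ERENerReader source. */
final class TokenOffsetIndex {
    private final int[] starts; // start character offset of each token
    private final int[] ends;   // end character offset of each token (assumed exclusive)

    /** Plays the role of compileOffsets: record every token's character span. */
    TokenOffsetIndex(int[][] tokenSpans) {
        starts = new int[tokenSpans.length];
        ends = new int[tokenSpans.length];
        for (int i = 0; i < tokenSpans.length; i++) {
            starts[i] = tokenSpans[i][0];
            ends[i] = tokenSpans[i][1];
        }
    }

    /** Index of the first token starting exactly at startOffset, or -1 if none (assumed). */
    int findStartIndex(int startOffset) {
        for (int i = 0; i < starts.length; i++) {
            if (starts[i] == startOffset) {
                return i;
            }
        }
        return -1;
    }

    /** Assumed lenient variant: first token whose span ends after startOffset, or -1. */
    int findStartIndexNear(int startOffset) {
        for (int i = 0; i < ends.length; i++) {
            if (ends[i] > startOffset) {
                return i;
            }
        }
        return -1;
    }

    /** One past the index of the first token ending at endOffset, or -1 if none (assumed). */
    int findEndIndex(int endOffset) {
        for (int i = 0; i < ends.length; i++) {
            if (ends[i] == endOffset) {
                return i + 1; // one-past-the-end, as Constituents expect
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // "Barack Obama spoke": tokens at [0,6), [7,12), [13,18)
        TokenOffsetIndex index = new TokenOffsetIndex(new int[][] {{0, 6}, {7, 12}, {13, 18}});
        int start = index.findStartIndex(0);
        int end = index.findEndIndex(12);
        System.out.println("token span for \"Barack Obama\": " + start + ".." + end);
    }
}
```

Returning the end index as one past the last token lets the pair be used directly as a half-open token range, which is the reason the findEndIndex description gives for adding one.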
protected IntPair getTokenOffsets(int origStartOffset, int origEndOffset, String mentionForm, XmlTextAnnotation xmlTa)
origStartOffset - start character offset from xml markup
origEndOffset - end character offset from xml markup
mentionForm - mention form from xml markup
xmlTa - XmlTextAnnotation object storing original xml, transformed text, extracted xml markup, and corresponding TextAnnotation
public String getMentionViewName()
public String getCorefViewName()
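The getTokenOffsets parameters describe character offsets taken from the xml markup, while the token offsets are compiled from the cleaned-up text. One way to bridge the two is to record, while stripping tags, where each surviving character lands. The sketch below shows that idea only; `OffsetMapper` is hypothetical, the tag stripping is deliberately naive, end offsets are assumed exclusive, and the library's actual mechanism (held in XmlTextAnnotation) is not shown on this page.

```java
/** Hypothetical sketch of mapping markup offsets onto cleaned text; not the library's code. */
final class OffsetMapper {
    final String cleanText;
    private final int[] origToClean; // per original char: offset in cleanText, or -1 if removed

    /** Removes everything between '<' and '>' and remembers where each kept character went. */
    OffsetMapper(String xml) {
        StringBuilder sb = new StringBuilder();
        origToClean = new int[xml.length()];
        boolean inTag = false;
        for (int i = 0; i < xml.length(); i++) {
            char c = xml.charAt(i);
            if (c == '<') {
                inTag = true;
            }
            if (inTag) {
                origToClean[i] = -1;
            } else {
                origToClean[i] = sb.length();
                sb.append(c);
            }
            if (c == '>') {
                inTag = false;
            }
        }
        cleanText = sb.toString();
    }

    /** Maps an original [start, end) span to cleaned offsets; null if it cannot be verified. */
    int[] toCleanSpan(int origStart, int origEnd, String mentionForm) {
        if (origStart < 0 || origEnd > origToClean.length || origStart >= origEnd) {
            return null;
        }
        int start = origToClean[origStart];
        int last = origToClean[origEnd - 1];
        if (start < 0 || last < 0) {
            return null; // span begins or ends inside removed markup
        }
        int end = last + 1;
        // Sanity check against the mention text recorded in the markup.
        return cleanText.substring(start, end).equals(mentionForm) ? new int[] {start, end} : null;
    }

    public static void main(String[] args) {
        OffsetMapper mapper = new OffsetMapper("<post>Obama spoke</post>");
        int[] span = mapper.toCleanSpan(6, 11, "Obama");
        System.out.println(mapper.cleanText.substring(span[0], span[1]));
    }
}
```

Checking the mapped span against the mention form gives an early signal when markup offsets and text have drifted apart; a token index lookup over the cleaned offsets would follow this step.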
protected Constituent getMentionConstituent(String mentionId)
mentionId - mentionId parsed from the annotation file
public String generateReport()
generateReport in class XmlDocumentReader
Copyright © 2017. All rights reserved.