ERENerReader (illinois-cogcomp-nlp 3.1.29 API)

java.lang.Object
- edu.illinois.cs.cogcomp.nlp.corpusreaders.AnnotationReader<T>
  - edu.illinois.cs.cogcomp.nlp.corpusreaders.AbstractIncrementalCorpusReader<XmlTextAnnotation>
    - edu.illinois.cs.cogcomp.nlp.corpusreaders.XmlDocumentReader
      - edu.illinois.cs.cogcomp.nlp.corpusreaders.ereReader.EREDocumentReader
        - edu.illinois.cs.cogcomp.nlp.corpusreaders.ereReader.ERENerReader

All Implemented Interfaces:

IResetableIterator<XmlTextAnnotation>, Iterable<XmlTextAnnotation>, Iterator<XmlTextAnnotation>

Direct Known Subclasses:

EREMentionRelationReader
```
public class ERENerReader
extends EREDocumentReader
```
Reads ERE data and instantiates TextAnnotations with the corresponding NER view. Also provides functionality to support combination with readers of other ERE annotations from the same source.

ERE annotations are provided in stand-off form: each source file (in xml, and from which character offsets are computed) has one or more corresponding annotation files (also in xml). Each annotation file corresponds to a span of the source file and contains all information about entities, relations, and events for that span. Entity and event identifiers presumably carry across spans from the same document.

This reader allows the user to generate either a mention view or an NER view. Named entities can be identified in a mention view via the type attribute of each mention.

TODO: ascertain whether NER mentions can overlap. Probably not.

TODO: allow non-token-level annotations (i.e. subtokens).

This code is based on Tom Redman's code for generating CoNLL-format ERE NER data.
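
The stand-off scheme described above can be illustrated with a minimal, self-contained sketch. It is not part of illinois-cogcomp-nlp: the class `StandoffSketch`, its `Mention` record, and the offset-plus-length encoding are all assumptions made for illustration (the inherited OFFSET and LENGTH fields suggest such an encoding, but the page does not confirm it).

```java
/** Hypothetical illustration of stand-off annotation; not part of illinois-cogcomp-nlp. */
final class StandoffSketch {

    /** A mention as an annotation file might record it: id, type, and character span. */
    static final class Mention {
        final String id;
        final String type;
        final int offset; // character offset into the source text (assumed encoding)
        final int length; // number of characters covered (assumed encoding)

        Mention(String id, String type, int offset, int length) {
            this.id = id;
            this.type = type;
            this.offset = offset;
            this.length = length;
        }
    }

    /** Recovers the surface string that a mention points at in the source text. */
    static String surface(String source, Mention m) {
        return source.substring(m.offset, m.offset + m.length);
    }

    public static void main(String[] args) {
        // Offsets count every character of the source file, markup included.
        String source = "<post author=\"jdoe\">Alice visited Paris.</post>";
        // In real data the offset comes from the annotation file; indexOf stands in here.
        Mention m = new Mention("m-1", "GPE", source.indexOf("Paris"), "Paris".length());
        System.out.println(m.id + " [" + m.type + "] -> " + surface(source, m));
    }
}
```

Because the offsets refer to the raw source, any cleanup of the text (for example, removing xml tags) requires the offsets to be remapped before they can be aligned with tokens.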

Author:

mssammon

Nested Class Summary
- Nested classes/interfaces inherited from class edu.illinois.cs.cogcomp.nlp.corpusreaders.ereReader.EREDocumentReader
  EREDocumentReader.EreCorpus

Field Summary

Fields
Modifier and Type	Field and Description
`protected int[]`	`ends`
`static String`	`IS_FOUND`
`protected int[]`	`starts`
- Fields inherited from class edu.illinois.cs.cogcomp.nlp.corpusreaders.ereReader.EREDocumentReader
  ARG_ONE, ARG_TWO, AUTHOR, CORPUS_TYPE, DATELINE, DATETIME, deletableSpanTags, DOC, ENTITIES, ENTITY, ENTITY_ID, ENTITY_MENTION, ENTITY_MENTION_ID, EntityHeadEndCharOffset, EntityHeadStartCharOffset, EntityIdAttribute, EntityKbIdAttribute, EntityMentionIdAttribute, EntityMentionTypeAttribute, EntitySpecificityAttribute, EVENT_ARGUMENT, EVENT_MENTION, EventIdAttribute, EventMentionIdAttribute, FILL, FILLER, FILLER_ID, FILLERS, HEADLINE, HOPPER, HOPPERS, ID, IMG, KBID, LENGTH, MENTION_HEAD, MENTION_TEXT, NAM, NAME_END, NAME_START, NOM, NOUN_TYPE, OFFSET, ORIG_AUTHOR, ORIGIN, POST, PRO, QUOTE, REALIS, RELATION, RELATION_MENTION, RelationIdAttribute, RelationMentionIdAttribute, RelationRealisAttribute, RELATIONS, RelationSourceRoleAttribute, RelationSubtypeAttribute, RelationTargetRoleAttribute, RelationTypeAttribute, ROLE, SARCASM, SNIP, SOURCE, SPECIFICITY, SQUISH, STUFF, SUBTYPE, tagsToIgnore, tagsWithAtts, TRIGGER, TYPE, UNKNOWN_KBID, UNSPECIFIED, WAYS
- Fields inherited from class edu.illinois.cs.cogcomp.nlp.corpusreaders.XmlDocumentReader
  fileId
- Fields inherited from class edu.illinois.cs.cogcomp.nlp.corpusreaders.AbstractIncrementalCorpusReader
  fileList, sourceDirectory
- Fields inherited from class edu.illinois.cs.cogcomp.nlp.corpusreaders.AnnotationReader
  corpusName, currentAnnotationId, resourceManager

Constructor Summary

Constructors
Constructor and Description
`ERENerReader(EREDocumentReader.EreCorpus ereCorpus, String corpusRoot, boolean throwExceptionOnXmlParseFailure, boolean addNominalMentions, boolean addFillers)` Reads Named Entity -- and possibly nominal mention -- annotation from an English ERE-format corpus.
`ERENerReader(EREDocumentReader.EreCorpus ereCorpus, TextAnnotationBuilder textAnnotationBuilder, String corpusRoot, boolean throwExceptionOnXmlParseFailure, boolean addNominalMentions, boolean addFillers)` Reads Named Entity -- and possibly nominal mention -- annotation from an ERE-format corpus.

Method Summary

All Methods Instance Methods Concrete Methods
Modifier and Type	Method and Description
`protected void`	`compileOffsets(SpanLabelView tokens)` Get the start and end offsets of all constituents and store them. Note that these are based on the cleaned-up text, so they need to be mapped back to the original text.
`protected int`	`findEndIndex(int endOffset, String rawText)` Find the index of the first token constituent that has end char offset "endOffset" and return the value one higher than that index (to instantiate Constituents, which use one-past-the-end indexing).
`protected int`	`findEndIndexIgnoreError(int endOffset)` Find the index of the first constituent with end character offset endOffset.
`protected int`	`findStartIndex(int startOffset)` Find the index of the first constituent at startOffset.
`protected int`	`findStartIndexIgnoreError(int startOffset)` Find the index of the first constituent near startOffset.
`String`	`generateReport()` Generates report of Entities and Mentions read and generated.
`List<XmlTextAnnotation>`	`getAnnotationsFromFile(List<Path> corpusFileListEntry)` Given an entry from the corpus file list generated by `EREDocumentReader.getFileListing()`, parse its contents and get zero or more TextAnnotation objects.
`String`	`getCorefViewName()`
`protected void`	`getEntitiesFromFile(Document doc, View nerView, XmlTextAnnotation xmlTa)` Read entity mentions and populate the view provided.
`protected void`	`getFillersFromFile(Document doc, View nerView, XmlTextAnnotation xmlTa)`
`protected Constituent`	`getMentionConstituent(String mentionId)` After reading a file's entity information, allows the client to find a Constituent corresponding to an entity mention id.
`String`	`getMentionViewName()`
`protected IntPair`	`getTokenOffsets(int origStartOffset, int origEndOffset, String mentionForm, XmlTextAnnotation xmlTa)` Find the token offsets in the TextAnnotation that correspond to the source character offsets for the given mention.
`void`	`readEntity(Node eNode, View view, XmlTextAnnotation xmlTa)` Read the entities from the gold standard xml and produce appropriate constituents in the view.
`void`	`reset()` Set the reader to start from the beginning of the corpus.

Methods inherited from class edu.illinois.cs.cogcomp.nlp.corpusreaders.ereReader.EREDocumentReader
buildEreConfig, buildEreXmlTextAnnotationMaker, buildXmlTextAnnotationMaker, buildXmlTextAnnotationMaker, getFileListing, getPostViewName

Methods inherited from class edu.illinois.cs.cogcomp.nlp.corpusreaders.XmlDocumentReader
getRequiredAnnotationFileExtension, getRequiredSourceFileExtension, initializeReader

Methods inherited from class edu.illinois.cs.cogcomp.nlp.corpusreaders.AbstractIncrementalCorpusReader
getSourceDirectory, hasNext, next

Methods inherited from class edu.illinois.cs.cogcomp.nlp.corpusreaders.AnnotationReader
iterator, remove

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface java.lang.Iterable
forEach, spliterator

Methods inherited from interface java.util.Iterator
forEachRemaining

- Field Detail
  - IS_FOUND
```
public static final String IS_FOUND
```
    See Also:
    
    Constant Field Values
  - starts
```
protected int[] starts
```
  - ends
```
protected int[] ends
```
- Constructor Detail
  - ERENerReader
```
public ERENerReader(EREDocumentReader.EreCorpus ereCorpus,
                    String corpusRoot,
                    boolean throwExceptionOnXmlParseFailure,
                    boolean addNominalMentions,
                    boolean addFillers)
             throws Exception
```
    Reads Named Entity -- and possibly nominal mention -- annotation from an English ERE-format corpus.
    
    Parameters:
    
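    ereCorpus - specifies ERE release -- and therefore, corpus directory structure and xml tag set
    
    corpusRoot - data root of corpus directory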
    
    throwExceptionOnXmlParseFailure - if true, throw an exception when an xml file fails to parse.
    
    addNominalMentions - a flag that, if true, indicates that nominal mentions should be read and that the view created should be named ViewNames.MENTION_ERE.
    
    addFillers - if 'true', indicates that non-coreferable mentions should be added.
    
    Throws:
    
    Exception
  - ERENerReader
```
public ERENerReader(EREDocumentReader.EreCorpus ereCorpus,
                    TextAnnotationBuilder textAnnotationBuilder,
                    String corpusRoot,
                    boolean throwExceptionOnXmlParseFailure,
                    boolean addNominalMentions,
                    boolean addFillers)
             throws Exception
```
    Reads Named Entity -- and possibly nominal mention -- annotation from an ERE-format corpus.
    
    Parameters:
    
    ereCorpus - specifies ERE release -- and therefore, corpus directory structure and xml tag set
    
    textAnnotationBuilder - TextAnnotationBuilder for target language
    
    corpusRoot - data root of corpus directory
    
    throwExceptionOnXmlParseFailure - if true, throw an exception when an xml file fails to parse.
    
    addNominalMentions - a flag that, if true, indicates that nominal mentions should be read and that the view created should be named ViewNames.MENTION_ERE.
    
    addFillers - if 'true', indicates that non-coreferable mentions should be added.
    
    Throws:
    
    Exception
- Method Detail
  - reset
```
public void reset()
```
    Description copied from class: XmlDocumentReader
    
    Set the reader to start from the beginning of the corpus.
    
    Specified by:
    
    reset in interface IResetableIterator<XmlTextAnnotation>
    
    Overrides:
    
    reset in class XmlDocumentReader
  - getAnnotationsFromFile
```
public List<XmlTextAnnotation> getAnnotationsFromFile(List<Path> corpusFileListEntry)
                                               throws Exception
```
    Description copied from class: EREDocumentReader
    
    Given an entry from the corpus file list generated by EREDocumentReader.getFileListing(), parse its contents and get zero or more TextAnnotation objects. This allows for the case where corpus annotations are provided in standoff format in one or more files separate from the source document. In such cases, the first file in the list should contain the source document and the rest should be the corresponding markup files. In this default implementation, it is assumed that a single file contains both source and markup.
    
    Overrides:
    
    getAnnotationsFromFile in class EREDocumentReader
    
    Parameters:
    
    corpusFileListEntry - a list of files, the first of which is a source file.
    
    Returns:
    
    List of TextAnnotation objects extracted from the corpus file.
    
    Throws:
    
    Exception
  - getFillersFromFile
```
protected void getFillersFromFile(Document doc,
                                  View nerView,
                                  XmlTextAnnotation xmlTa)
                           throws XMLException
```
    Throws:
    
    XMLException
  - getEntitiesFromFile
```
protected void getEntitiesFromFile(Document doc,
                                   View nerView,
                                   XmlTextAnnotation xmlTa)
                            throws XMLException
```
    Read entity mentions and populate the view provided.
    
    Parameters:
    
    doc - XML document containing entity information.
    
    nerView - View to populate with new entity mentions
    
    Throws:
    
    XMLException
  - compileOffsets
```
protected void compileOffsets(SpanLabelView tokens)
```
    Get the start and end offsets of all constituents and store them. Note that these are based on the cleaned-up text, so they need to be mapped back to the original text.
    
    Parameters:
    
    tokens - SpanLabelView containing Token info (from TextAnnotation)
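    
    The idea of compiling per-token offsets can be sketched as follows. This is a minimal stand-alone illustration, not the library's implementation: the class `CompileOffsetsSketch`, the whitespace tokenization, and the use of exclusive end offsets are all assumptions made here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: record each token's character span in two parallel arrays. */
final class CompileOffsetsSketch {

    /**
     * Splits text on whitespace and returns {starts, ends}: for token i,
     * starts[i] is its first character offset and ends[i] is one past its last.
     */
    static int[][] compile(String text) {
        List<Integer> starts = new ArrayList<>();
        List<Integer> ends = new ArrayList<>();
        int i = 0;
        int n = text.length();
        while (i < n) {
            while (i < n && Character.isWhitespace(text.charAt(i))) {
                i++; // skip the gap between tokens
            }
            if (i == n) {
                break; // only trailing whitespace was left
            }
            int start = i;
            while (i < n && !Character.isWhitespace(text.charAt(i))) {
                i++; // advance to the end of the token
            }
            starts.add(start);
            ends.add(i);
        }
        return new int[][] {toArray(starts), toArray(ends)};
    }

    private static int[] toArray(List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        int[][] offsets = compile("Alice visited Paris");
        System.out.println(java.util.Arrays.toString(offsets[0])); // [0, 6, 14]
        System.out.println(java.util.Arrays.toString(offsets[1])); // [5, 13, 19]
    }
}
```

    Keeping the offsets in parallel arrays makes later lookups by character offset a simple scan over ints.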
  - findStartIndex
```
protected int findStartIndex(int startOffset)
```
    Find the index of the first constituent at startOffset.
    
    Parameters:
    
    startOffset - the character offset we want.
    
    Returns:
    
    the index of the first constituent.
  - findStartIndexIgnoreError
```
protected int findStartIndexIgnoreError(int startOffset)
```
    Find the index of the first constituent *near* startOffset.
    
    Parameters:
    
    startOffset - the character offset we want.
    
    Returns:
    
    the index of the first constituent.
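    
    The difference between an exact lookup and a tolerant ("near") lookup can be sketched as below. This is only an illustration under stated assumptions: the class `StartIndexSketch`, the -1 sentinel, the ascending order of `starts`, and the choice of "last token starting at or before the offset" as the meaning of "near" are hypothetical and may differ from what the library does.

```java
/** Hypothetical sketch of exact versus tolerant start-offset lookup. */
final class StartIndexSketch {

    /** Exact lookup: index of the first token starting at startOffset, or -1 if none does. */
    static int findStartIndex(int[] starts, int startOffset) {
        for (int i = 0; i < starts.length; i++) {
            if (starts[i] == startOffset) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Tolerant lookup: index of the last token starting at or before startOffset,
     * or -1 if every token starts later. Assumes starts is sorted ascending.
     */
    static int findStartIndexNear(int[] starts, int startOffset) {
        int best = -1;
        for (int i = 0; i < starts.length && starts[i] <= startOffset; i++) {
            best = i;
        }
        return best;
    }

    public static void main(String[] args) {
        int[] starts = {0, 6, 14}; // "Alice visited Paris"
        System.out.println(findStartIndex(starts, 6));     // 1
        System.out.println(findStartIndex(starts, 8));     // -1
        System.out.println(findStartIndexNear(starts, 8)); // 1
    }
}
```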
  - findEndIndex
```
protected int findEndIndex(int endOffset,
                           String rawText)
```
    Find the index of the first token constituent that has end char offset "endOffset" and return the value one higher than that index (to instantiate Constituents, which use one-past-the-end indexing).
    
    Parameters:
    
    endOffset - the character offset for which we want a corresponding token index.
    
    Returns:
    
    one more than the index of the first token whose end character offset is endOffset.
  - findEndIndexIgnoreError
```
protected int findEndIndexIgnoreError(int endOffset)
```
    Find the index of the first constituent with end character offset endOffset. Return that index + 1 (for past-the-end indexing used by Constituents).
    
    Parameters:
    
    endOffset - the character offset we want.
    
    Returns:
    
    one plus the index of the first token that has that end character offset.
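    
    One-past-the-end indexing means the returned value can be used directly as the exclusive end of a token range. The sketch below illustrates this; the class `EndIndexSketch` and the -1 sentinel are assumptions made for the example, not the library's behavior.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of end-offset lookup with one-past-the-end indexing. */
final class EndIndexSketch {

    /** Returns one plus the index of the first token ending at endOffset, or -1 if none does. */
    static int findEndIndex(int[] ends, int endOffset) {
        for (int i = 0; i < ends.length; i++) {
            if (ends[i] == endOffset) {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("Alice", "visited", "Paris");
        int[] ends = {5, 13, 19}; // exclusive end offsets in "Alice visited Paris"
        int start = 1;                    // token index of "visited"
        int end = findEndIndex(ends, 19); // 3, one past the index of "Paris"
        // The half-open range [start, end) selects the tokens of the span.
        System.out.println(String.join(" ", tokens.subList(start, end))); // visited Paris
    }
}
```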
  - readEntity
```
public void readEntity(Node eNode,
                       View view,
                       XmlTextAnnotation xmlTa)
                throws XMLException
```
    Read the entities from the gold standard xml and produce appropriate constituents in the view. NOTE: the constituents will not be ordered when we are done.
    
    Parameters:
    
    eNode - the entity node, contains the more specific mentions of that entity.
    
    view - the span label view we will add the labels to.
    
    Throws:
    
    XMLException
  - getTokenOffsets
```
protected IntPair getTokenOffsets(int origStartOffset,
                                  int origEndOffset,
                                  String mentionForm,
                                  XmlTextAnnotation xmlTa)
```
    Find the token offsets in the TextAnnotation that correspond to the source character offsets for the given mention.
    
    Parameters:
    
    origStartOffset - start character offset from xml markup
    
    origEndOffset - end character offset from xml markup
    
    mentionForm - mention form from xml markup
    
    xmlTa - XmlTextAnnotation object storing original xml, transformed text, extracted xml markup, and corresponding TextAnnotation
    
    Returns:
    
    IntPair(-1, -1) if the specified offsets correspond to a deleted span (and hence likely a name mention in xml metadata, e.g. post author); null if no mapped tokens could be found (possibly because the indexes refer to the middle of a single token that the tokenizer could not segment); otherwise, the corresponding token indexes.
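    
    The three-way return contract above can be sketched with a simplified, stand-alone model. Everything here is hypothetical: the class `TokenOffsetsSketch`, the `int[]` standing in for IntPair, the representation of deleted spans, and the remapping logic are assumptions for illustration and do not reproduce the library's code. Mentions that only partially overlap a deleted span are not handled.

```java
/** Hypothetical sketch of mapping source character offsets to token indexes. */
final class TokenOffsetsSketch {

    /**
     * @param deleted sorted, non-overlapping [start, end) spans removed from the source text
     * @param starts  token start offsets in the cleaned text
     * @param ends    token end offsets (exclusive) in the cleaned text
     * @return {-1, -1} if the mention lies inside a deleted span, null if no tokens align,
     *         otherwise {startToken, endToken} with endToken one past the last token
     */
    static int[] tokenOffsets(int origStart, int origEnd, int[][] deleted,
                              int[] starts, int[] ends) {
        int shiftStart = 0; // characters deleted before origStart
        int shiftEnd = 0;   // characters deleted before origEnd
        for (int[] span : deleted) {
            if (origStart >= span[0] && origEnd <= span[1]) {
                return new int[] {-1, -1}; // mention sits entirely in removed markup
            }
            if (span[1] <= origStart) {
                shiftStart += span[1] - span[0];
            }
            if (span[1] <= origEnd) {
                shiftEnd += span[1] - span[0];
            }
        }
        int startIdx = indexOf(starts, origStart - shiftStart);
        int endIdx = indexOf(ends, origEnd - shiftEnd);
        if (startIdx < 0 || endIdx < 0) {
            return null; // offsets fall in the middle of a token
        }
        return new int[] {startIdx, endIdx + 1};
    }

    private static int indexOf(int[] values, int target) {
        for (int i = 0; i < values.length; i++) {
            if (values[i] == target) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Source: "<p>Alice visited Paris</p>"; cleaned text: "Alice visited Paris".
        int[][] deleted = {{0, 3}, {22, 26}};
        int[] starts = {0, 6, 14};
        int[] ends = {5, 13, 19};
        int[] paris = tokenOffsets(17, 22, deleted, starts, ends);
        System.out.println(paris[0] + ", " + paris[1]);                       // 2, 3
        System.out.println(tokenOffsets(4, 8, deleted, starts, ends) == null); // true
    }
}
```

    A caller would typically check for null first, then for the (-1, -1) marker, before building a constituent from the returned indexes.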
  - getMentionViewName
```
public String getMentionViewName()
```
  - getCorefViewName
```
public String getCorefViewName()
```
  - getMentionConstituent
```
protected Constituent getMentionConstituent(String mentionId)
```
    After reading a file's entity information, allows the client to find a Constituent corresponding to an entity mention id. Returns 'null' if the Constituent does not exist, which can happen because of a problem with the annotation file (inaccurate offsets), incorrect tokenization (target name is part of a compound term), or other constraints (e.g. overlapping entity mentions are prohibited).
    
    Parameters:
    
    mentionId - mentionId parsed from the annotation file
    
    Returns:
    
    Constituent corresponding to the mentionId, or null if it is not found
  - generateReport
```
public String generateReport()
```
    Generates report of Entities and Mentions read and generated. Note that these may differ: this reader relies on its own tokenization (none is provided in the source corpus) and if token segmentation differs, mentions specified in the source may not be found in the text by this reader.
    
    Overrides:
    
    generateReport in class XmlDocumentReader
    
    Returns:
    
    String describing annotations read and generated.
