edu.illinois.cs.cogcomp.lbj.coref.ir.docs
Class DocBase

java.lang.Object
  extended by edu.illinois.cs.cogcomp.lbj.coref.ir.docs.DocBase
All Implemented Interfaces:
Doc, java.io.Serializable
Direct Known Subclasses:
DocPlainText, DocXMLBase

public abstract class DocBase
extends java.lang.Object
implements Doc, java.io.Serializable

Represents one document from a corpus, including the text, annotations of coreference, relations, entities, and other relevant information. Also contains methods to load input from XML files.

Author:
Eric Bengtson
See Also:
Serialized Form

Nested Class Summary
static class DocBase.PosSource
           
 
Field Summary
(package private) static int goodEnds
           
(package private) static int goodStarts
           
protected  java.lang.String m_annotationAuthor
           
protected  java.lang.String m_baseFN
           
private  java.util.Map<Mention,Mention> m_bestMentionMap
           
protected  boolean m_bNeedsCasing
           
private  boolean m_bUsePredEntities
           
private  boolean m_bUsePredMentions
           
protected  LBJ2.classify.Classifier m_caser
           
private  java.util.Map<Pair<Mention,Mention>,CExample> m_cExMap
           
private  java.util.Map<java.lang.Integer,java.lang.Integer> m_charWordMap
           
private  ChainSolution<Mention> m_corefChains
           
private  java.util.Map<java.lang.String,java.lang.Integer> m_corpusWordCounts
           
private  java.lang.String m_countingText
           
protected  java.lang.String m_dateTime
           
private  Aligner<Mention> m_defaultAligner
           
protected  java.lang.String m_docID
           
protected  java.lang.String m_docType
           
private  java.util.Map<java.lang.String,java.lang.Integer> m_docWordCounts
           
protected  java.lang.String m_encoding
           
private  java.util.Map<java.lang.Integer,java.util.Set<Mention>> m_extentStartWordNumMentionMap
           
private  java.util.Map<Mention,GExample> m_gExMap
           
protected  java.lang.String m_headline
           
private  java.util.Map<java.lang.Integer,java.util.Map<java.lang.Integer,java.lang.Boolean>> m_headPredictionMap
           
private  java.util.Map<java.lang.Integer,java.util.Set<Mention>> m_headStartWordNumMentionMap
           
private  java.lang.Boolean m_isCaseSensitive
           
private  java.util.Map<Mention,java.util.Set<Mention>> m_mentionsContaining
           
private  java.util.List<java.util.List<Mention>> m_mentsInSents
           
private  java.util.Map<java.lang.String,java.lang.Integer> m_mentWordCounts
           
private  int m_nSents
           
private  java.util.List<java.util.List<java.lang.String>> m_phrases
           
private  java.util.List<java.lang.String> m_pos
           
private  java.util.List<Entity> m_predEntities
           
private  java.util.List<Mention> m_predMentions
           
private  java.util.Map<Mention,Mention> m_predToTrueMention
           
private  java.util.List<java.lang.Integer> m_quoteNestLevel
           
private  java.util.List<Relation> m_relations
           
private  java.util.Map<Pair<java.lang.Integer,java.lang.Integer>,Pair<java.util.List<Mention>,java.util.List<Mention>>> m_sentenceMentionsPair
           
protected  java.lang.String m_slug
           
protected  java.lang.String m_source
           
protected  java.lang.String m_text
           
private  int m_textStartCharNum
           
private  java.util.List<Entity> m_trueEntities
           
private  java.util.List<Mention> m_trueMentions
           
private  boolean m_trueMentionsSorted
           
protected  java.lang.String m_version
           
private  java.util.Map<java.lang.Integer,java.lang.Integer> m_wordNumCharNumMap
           
private  java.util.Map<java.lang.Integer,java.lang.Integer> m_wordNumSentNumMap
           
private  java.util.List<java.lang.String> m_words
           
(package private) static int medEnds
           
private static long serialVersionUID
           
(package private) static int totalMentions
           
 
Constructor Summary
DocBase()
          Basic constructor: Not recommended.
 
Method Summary
 void addHeadPrediction(int firstWN, int lastWN, boolean pred)
           
protected  void addPredEntities(java.util.List<Entity> ents)
          Backed internally.
protected  void addRelation(Relation r)
           
protected  void addTrueEntity(Entity e)
          Can be made public, but then need to ensure that e's mentions are all added.
protected  void addTrueMention(Mention m)
           
protected  void alignPredMentsToTrue()
           
protected  void buildMentionsContaining()
           
protected  void buildMentionsInSents()
           
 void calcAndSetQuotes()
          Determines the location of quotes and sets them.
 Mention getBestMentionFor(Mention m)
          Gets the canonical mention of the entity containing m.
 CExample getCExampleFor(Mention m1, Mention m2)
          Returns the unique CExample for the given pair of mentions in the given order.
 java.util.Map<Entity,java.util.Map<java.lang.Integer,java.lang.String>> getCoherenceInfo()
          Gets the coherence info using the value of usePredictedEntities() to determine whether to use predicted entities.
 java.util.Map<Entity,java.util.Map<java.lang.Integer,java.lang.String>> getCoherenceInfo(boolean usePred)
          Gets a grid indicating the mention type for each combination of entities and sentences.
 ChainSolution<Mention> getCorefChains()
          Gets the partition of mentions into coreferential sets.
 java.lang.String getDocID()
          Gets the ID for this document, as a string.
 java.util.List<Entity> getEntities()
          Gets the entities, in no particular order.
 Entity getEntityFor(Mention m)
          Currently implemented slowly.
 Entity getEntityFor(Mention m, boolean usePred)
          Currently implemented slowly.
protected  Entity getEntityFor(Mention m, java.util.List<Entity> entities)
          Currently implemented slowly.
 GExample getGExampleFor(Mention m)
          Returns the unique GExample for the given mention.
 boolean getHeadPrediction(int firstWN, int lastWN)
           
 double getInCorpusInverseFreq(java.lang.String word)
          Gets the inverse of the number of occurrences of the specified word in the corpus.
 double getInDocInverseFreq(java.lang.String word)
          Gets the inverse of the number of occurrences of the specified word in the document.
 double getInverseTrueHeadFreq(int wordNum)
          Gets the inverse true head frequency of the word at the specified position.
 double getInverseTrueHeadFreq(java.lang.String word)
          Gets the inverse of the number of occurrences of the specified word in the heads of the true mentions in the document.
 Mention getMention(int n)
           
 java.util.List<Mention> getMentions()
          Gets the mentions of the document, sorted (typically in document order).
 java.util.Set<Mention> getMentionsContainedIn(Mention m)
          Gets the set of mentions whose head is entirely contained within a specified mention's extent, including the specified mention itself.
 java.util.Set<Mention> getMentionsContaining(Mention m)
          Gets the set of mentions whose extents entirely contain a specified mention's extent, including the specified mention itself.
 java.util.List<Mention> getMentionsInSent(int sentNum)
          Gets a list of the mentions in a specified sentence in order.
 Pair<java.util.List<Mention>,java.util.List<Mention>> getMentionsInSentences(int s1, int s2)
          Gets a pair of lists of mentions, one for each of the two specified sentences.
 java.util.Set<Mention> getMentionsWithExtentStartingAt(int startWord)
          Returns the set of mentions whose extents start at the specified word number, or an empty set if none are found.
 java.util.Set<Mention> getMentionsWithHeadStartingAt(int startWord)
          Returns the set of mentions whose heads start at the specified word number, or an empty set if none are found.
 int getNumMentions()
           
 int getNumRelations()
          Gets the number of relations.
 int getNumSentences()
          Returns the number of sentences in the document.
 java.lang.String getPlainText()
          Gets the text that is the basis for counting, including the start/end characters in Chunk objects.
 java.util.List<java.lang.String> getPOS()
          Gets a list of the Part-Of-Speech tags for the words of the document.
 java.lang.String getPOS(int posNum)
          Gets the Part-Of-Speech tag for the word at the posNum position in the document.
 java.util.List<Entity> getPredEntities()
          Gets a list of predicted entities, in no particular order.
 Mention getPredMention(int n)
           
 java.util.List<Mention> getPredMentions()
          Gets a sorted list of predicted mentions.
 int getQuoteNestLevel(int wordNum)
          Indicates the number of nested quotes the specified word is in.
 Relation getRelation(int number)
          Gets the specified relation.
 int getSentNum(int wordNum)
          Gets the sentence number for the specified word.
 java.lang.String getShortEID(java.lang.String longID)
           
 int getStartCharNum(int wordNum)
          Gets the zero-based position of the first character of a word.
 int getTextFirstWordNum()
          Gets the word number of the first word in the main text of the document (as distinguished from headlines and metadata that may be included in the plain text).
 java.util.List<Entity> getTrueEntities()
          Gets a list of true entities, in no particular order.
 Mention getTrueMention(int n)
           
 Mention getTrueMentionFor(Mention pred)
          Gets the true mention aligned with the specified mention.
 java.util.List<Mention> getTrueMentions()
          Gets a sorted list of true mentions.
 java.util.Map<java.lang.String,java.lang.Integer> getWholeDocCounts()
          Gets the counts for the words in the document.
 java.lang.String getWord(int wordNum)
          Gets the specified word.
 int getWordNum(int charNum)
          Determines the word number (zero-based) of the word at charNum, or if no word is at charNum, return the word number of the closest word appearing after charNum, or if no such word exists, return -1.
 java.util.List<java.lang.String> getWords()
          Gets a list of the surface forms of the words of the document.
 boolean hasHeadPrediction(int firstWN, int lastWN)
          Checks to see whether a prediction has been stored for whether the closed interval [firstWN, lastWN] word sequence is a head.
 boolean hasPredEntities()
          Indicates whether predicted entities are available.
 boolean hasPredMentions()
          Indicates whether predicted mentions have been set.
 boolean hasTrueEntities()
          Indicates whether true entities are available.
 boolean hasTrueMentions()
          Indicates whether true mentions have been set.
protected  void initMembersDefault()
           
 boolean isCaseSensitive()
          Indicates whether the document is case sensitive.
protected  void loadChunkedText(java.lang.String filename)
          Loads text that has been preprocessed offline.
 void loadFromText(java.lang.String plainText)
          Builds the document from the given plain text, automatically splitting sentences, determining quote levels, determining part-of-speech tags, and splitting words by an automatic word-splitting algorithm.
 void loadFromText(java.lang.String plainText, boolean doWordSplit, boolean doPOSTag)
          Builds the document from the given plain text, automatically splitting sentences, determining quote levels, determining part-of-speech tags, and either splitting words by whitespace or using a word-splitter.
protected  java.lang.String loadPOSTaggerOutput()
          Loads the output of the SNoW-based POS tagger.
protected  void loadPOSTags(java.lang.String content)
          Loads text that has been preprocessed.
 void loadSGMText(java.lang.String filename)
           
protected  java.util.Map<Mention,Mention> makeBestMentionMap()
           
 Chunk makeChunk(int startWord, int endWord)
          Create a chunk spanning the specified words in this document.
static void printChunkValidity()
          Verify that all mentions start and end on phrase boundaries.
protected  void recordWordLocation(int wn, int startCN, int endCN)
          Records the fact that a word is located at characters startCN through endCN (inclusive).
protected  java.lang.String removeTagsAndExtraNL(java.lang.String a)
           
protected  java.lang.String repeat(java.lang.String s, int n)
           
 void save()
          Writes the document to a file using serialization.
 void setCorpusCounts(java.util.Map<java.lang.String,java.lang.Integer> counts)
          Sets the corpus counts for the words in the corpus.
protected  void setPlainText(java.lang.String text)
          Should be set before words are set.
protected  void setPOSTags(java.util.List<java.lang.String> tags)
          Sets the POS tags.
 void setPredEntities(ChainSolution<Mention> sol)
          Sets the predicted entities to be those specified by sol.
 void setPredictedMentions(java.util.Collection<Mention> ments)
          Sets the predicted mentions and records a preference for using them.
 void setQuoteLevels(java.util.List<java.lang.Integer> quoteLevels)
          Sets the quote levels, which indicate the number of nested quotations in which each word is embedded.
protected  void setSentenceNumbers(java.util.List<java.lang.Integer> sentNums)
          Sets the sentence numbers for each word.
 void setUsePredictedEntities(boolean usePred)
          Sets the preference for using predicted entities or true entities.
 void setUsePredictedMentions(boolean usePred)
          Sets the preference for using predicted mentions or true mentions.
 void setWords(java.util.List<java.lang.String> words)
           
 void setWords(java.util.List<java.lang.String> words, boolean backwardsCompatible)
          Sets the words, aligns them with the plain text, and records statistics about them.
protected  java.util.List<Entity> sortEntitiesByListOrder(java.util.List<Entity> ents, java.util.List<Entity> ordered)
          Does NOT modify in place (but this may change).
protected  void sortPredictedMentions()
          Sorts predicted mentions in natural order, which is the textual order by default.
protected  void sortTrueMentions()
          Sorts true mentions in natural order, which is the textual order by default.
 java.lang.String toAnnotatedString(boolean showPOS)
          Gets the document as a string annotated with mention boundaries, with square brackets for true mentions, asterisks for false alarms, and triangle brackets for missed mentions, and optionally annotated with Part-Of-Speech tags.
 java.lang.String toAnnotatedString(boolean showPOS, boolean showMTypes, boolean showETypes, boolean showEIDs)
          Gets the document as a string annotated with mention boundaries, with square brackets for true mentions, asterisks for false alarms, and triangle brackets for missed mentions, and optionally annotated with Part-Of-Speech tags, mention types, entity types, and entity IDs.
 java.lang.String toCoherenceTableString()
          Gets the coherence grid represented as a string, laid out in a grid.
 java.lang.String toCoherenceTableString(boolean usePred)
          Gets the coherence grid represented as a string, laid out in a grid.
 java.lang.String toString()
           
 java.lang.String toSubstituteString()
          Gets the document as a string where each mention has been replaced by the most specific mention coreferential with it.
protected  java.lang.String translateEscaped(java.lang.String escaped, int cursor)
          Translates a round, square, or curly brace escaped as -LBR- or -RBR-, or a pair of quotes escaped as a double quote character.
 boolean usePredictedEntities()
          Indicates whether requests for entities will return predicted entities or true entities.
 boolean usePredictedMentions()
          Indicates whether requests for mentions will return predicted mentions or true mentions.
 void write(boolean usePredictions)
          Writes this Doc in the appropriate format.
abstract  void write(java.lang.String filename, boolean usePredictions)
          Writes this Doc in the appropriate format.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

serialVersionUID

private static final long serialVersionUID
See Also:
Constant Field Values

m_baseFN

protected java.lang.String m_baseFN

totalMentions

static int totalMentions

goodStarts

static int goodStarts

goodEnds

static int goodEnds

medEnds

static int medEnds

m_bUsePredEntities

private boolean m_bUsePredEntities

m_trueEntities

private java.util.List<Entity> m_trueEntities

m_predEntities

private java.util.List<Entity> m_predEntities

m_corefChains

private ChainSolution<Mention> m_corefChains

m_relations

private java.util.List<Relation> m_relations

m_bUsePredMentions

private boolean m_bUsePredMentions

m_trueMentions

private java.util.List<Mention> m_trueMentions

m_trueMentionsSorted

private boolean m_trueMentionsSorted

m_predMentions

private java.util.List<Mention> m_predMentions

m_defaultAligner

private Aligner<Mention> m_defaultAligner

m_predToTrueMention

private java.util.Map<Mention,Mention> m_predToTrueMention

m_caser

protected LBJ2.classify.Classifier m_caser

m_bNeedsCasing

protected boolean m_bNeedsCasing

m_phrases

private java.util.List<java.util.List<java.lang.String>> m_phrases

m_source

protected java.lang.String m_source

m_docType

protected java.lang.String m_docType

m_version

protected java.lang.String m_version

m_annotationAuthor

protected java.lang.String m_annotationAuthor

m_encoding

protected java.lang.String m_encoding

m_docID

protected java.lang.String m_docID

m_slug

protected java.lang.String m_slug

m_dateTime

protected java.lang.String m_dateTime

m_headline

protected java.lang.String m_headline

m_text

protected java.lang.String m_text

m_textStartCharNum

private int m_textStartCharNum

m_words

private java.util.List<java.lang.String> m_words

m_pos

private java.util.List<java.lang.String> m_pos

m_quoteNestLevel

private java.util.List<java.lang.Integer> m_quoteNestLevel

m_mentWordCounts

private java.util.Map<java.lang.String,java.lang.Integer> m_mentWordCounts

m_docWordCounts

private java.util.Map<java.lang.String,java.lang.Integer> m_docWordCounts

m_corpusWordCounts

private java.util.Map<java.lang.String,java.lang.Integer> m_corpusWordCounts

m_countingText

private java.lang.String m_countingText

m_headStartWordNumMentionMap

private java.util.Map<java.lang.Integer,java.util.Set<Mention>> m_headStartWordNumMentionMap

m_extentStartWordNumMentionMap

private java.util.Map<java.lang.Integer,java.util.Set<Mention>> m_extentStartWordNumMentionMap

m_charWordMap

private java.util.Map<java.lang.Integer,java.lang.Integer> m_charWordMap

m_nSents

private int m_nSents

m_wordNumSentNumMap

private java.util.Map<java.lang.Integer,java.lang.Integer> m_wordNumSentNumMap

m_wordNumCharNumMap

private java.util.Map<java.lang.Integer,java.lang.Integer> m_wordNumCharNumMap

m_sentenceMentionsPair

private java.util.Map<Pair<java.lang.Integer,java.lang.Integer>,Pair<java.util.List<Mention>,java.util.List<Mention>>> m_sentenceMentionsPair

m_mentsInSents

private java.util.List<java.util.List<Mention>> m_mentsInSents

m_mentionsContaining

private java.util.Map<Mention,java.util.Set<Mention>> m_mentionsContaining

m_bestMentionMap

private java.util.Map<Mention,Mention> m_bestMentionMap

m_headPredictionMap

private java.util.Map<java.lang.Integer,java.util.Map<java.lang.Integer,java.lang.Boolean>> m_headPredictionMap

m_cExMap

private java.util.Map<Pair<Mention,Mention>,CExample> m_cExMap

m_gExMap

private java.util.Map<Mention,GExample> m_gExMap

m_isCaseSensitive

private java.lang.Boolean m_isCaseSensitive
Constructor Detail

DocBase

public DocBase()
Basic constructor: Not recommended.

Method Detail

initMembersDefault

protected void initMembersDefault()

loadSGMText

public void loadSGMText(java.lang.String filename)
Parameters:
filename - The file containing the text of the document.
Throws:
XMLException

removeTagsAndExtraNL

protected java.lang.String removeTagsAndExtraNL(java.lang.String a)

loadPOSTags

protected void loadPOSTags(java.lang.String content)
Loads text that has been preprocessed. Derives word splits, sentence splits, and POS tags from the specified string.

Parameters:
content - The text annotated with part of speech tags.

loadPOSTaggerOutput

protected java.lang.String loadPOSTaggerOutput()
Loads the output of the SNoW-based POS tagger. Uses the SNoW-based POS tagger given on the command line. Should not be called until the text has been set.

Returns:
The POS-tagged content, or null on failure. The format of the output is as follows: (WORD POS) ... with one line per sentence.

loadChunkedText

protected void loadChunkedText(java.lang.String filename)
Loads text that has been preprocessed offline. Derives word splits, sentence splits, and POS tags from the specified file.

Parameters:
filename - The name of a file containing the chunked text.

translateEscaped

protected java.lang.String translateEscaped(java.lang.String escaped,
                                            int cursor)
Translates a round, square, or curly brace escaped as -LBR- or -RBR-, or a pair of quotes escaped as a double quote character.

Returns:
The translated brace, or the escaped argument unchanged if no matching brace is recognized.

calcAndSetQuotes

public void calcAndSetQuotes()
Determines the location of quotes and sets them. Must be called after plain text and words are set.


loadFromText

public void loadFromText(java.lang.String plainText)
Builds the document from the given plain text, automatically splitting sentences, determining quote levels, determining part-of-speech tags, and splitting words by an automatic word-splitting algorithm. Mentions and entities will not be set here.

Parameters:
plainText - The text of the document.

loadFromText

public void loadFromText(java.lang.String plainText,
                         boolean doWordSplit,
                         boolean doPOSTag)
Builds the document from the given plain text, automatically splitting sentences, determining quote levels, determining part-of-speech tags, and either splitting words by whitespace or using a word-splitter. Mentions and entities will not be set here.

Parameters:
plainText - The text of the document.
doWordSplit - If true, words will be split by an automatic word-splitting algorithm; otherwise words will be assumed to be separated by whitespace.
doPOSTag - If true, POS tags will be generated by the LBJPOS algorithm. Otherwise, no tags will be set.
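
As a rough illustration of the doWordSplit = false case, the sketch below splits text on runs of whitespace. WhitespaceSplit and its words method are hypothetical names introduced here; the splitting code DocBase actually uses is not shown on this page, so treat this as an assumption about the general idea only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; not part of DocBase.
class WhitespaceSplit {
    // Splits on runs of whitespace; returns an empty list for blank input.
    static List<String> words(String plainText) {
        String trimmed = plainText.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }
}
```

With this kind of split, punctuation stays attached to a word unless the input already separates it with whitespace, which is presumably why an automatic word splitter is offered as the alternative.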

setPlainText

protected void setPlainText(java.lang.String text)
Should be set before words are set.

Parameters:
text - The plain text, used for determining character counts.

setWords

public void setWords(java.util.List<java.lang.String> words)

setWords

public void setWords(java.util.List<java.lang.String> words,
                     boolean backwardsCompatible)
Sets the words, aligns them with the plain text, and records statistics about them. Must be called after setPlainText() has been called.

Parameters:
words - The words (copied defensively).
backwardsCompatible - Attempt to alter the algorithm to conform to the behavior in a previously published paper.

setPOSTags

protected void setPOSTags(java.util.List<java.lang.String> tags)
Sets the POS tags. The number of POS tags must equal the number of words already set. Thus, this method must be called after setWords().

Parameters:
tags - A list of POS tags, in the same order as the words (copied defensively).
Throws:
java.lang.IllegalArgumentException - if tags.size() != words.size()
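
The documented contract (size check against the words already set, then a defensive copy) might look like the following minimal sketch. TaggedWords is a hypothetical stand-in class, not the DocBase source, and its field names are assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the word/tag bookkeeping; not the DocBase source.
class TaggedWords {
    private List<String> words = new ArrayList<>();
    private List<String> pos = new ArrayList<>();

    void setWords(List<String> newWords) {
        words = new ArrayList<>(newWords); // defensive copy
    }

    void setPOSTags(List<String> tags) {
        // Reject tag lists that do not line up one-to-one with the words.
        if (tags.size() != words.size()) {
            throw new IllegalArgumentException(
                "Expected " + words.size() + " tags but got " + tags.size());
        }
        pos = new ArrayList<>(tags); // defensive copy
    }

    List<String> getPOS() {
        return Collections.unmodifiableList(pos);
    }
}
```

Because the tags are copied, later changes to the caller's list do not affect the stored tags.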

setQuoteLevels

public void setQuoteLevels(java.util.List<java.lang.Integer> quoteLevels)
Sets the quote levels, which indicate the number of nested quotations in which each word is embedded. The number of elements in the List should equal the number of words already set. Must be called after setWords().

Parameters:
quoteLevels - A list of quote levels, in the same order as the words (copied defensively).
Throws:
java.lang.IllegalArgumentException - if quoteLevels.size() != words.size()

setSentenceNumbers

protected void setSentenceNumbers(java.util.List<java.lang.Integer> sentNums)
Sets the sentence numbers for each word. The number of elements should equal the number of words already set. Must be called after setWords().

Parameters:
sentNums - A list of sentence numbers, in the same order as the words (copied defensively).
Throws:
java.lang.IllegalArgumentException - if sentNums.size() != words.size() or if sentNums is non-monotonic.
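
The two failure conditions listed above can be sketched as a small validation routine. SentenceNumbers and validate are hypothetical names, and reading "non-monotonic" as "some sentence number is smaller than the one before it" is an assumption; the actual check in DocBase may be stricter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the documented checks; not the DocBase source.
class SentenceNumbers {
    // Returns a defensive copy if there is one entry per word and the
    // sequence never decreases (assumed meaning of "monotonic").
    static List<Integer> validate(List<Integer> sentNums, int numWords) {
        if (sentNums.size() != numWords) {
            throw new IllegalArgumentException(
                "Expected " + numWords + " sentence numbers but got " + sentNums.size());
        }
        for (int i = 1; i < sentNums.size(); i++) {
            if (sentNums.get(i) < sentNums.get(i - 1)) {
                throw new IllegalArgumentException("Non-monotonic at word " + i);
            }
        }
        return new ArrayList<>(sentNums);
    }
}
```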

recordWordLocation

protected void recordWordLocation(int wn,
                                  int startCN,
                                  int endCN)
Records the fact that a word is located at characters startCN through endCN (inclusive).
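
One way to picture how recorded word locations support the lookup described for getWordNum(int) in the method summary (the word at charNum, else the closest following word, else -1) is a sorted map from character positions to word numbers. CharWordIndex is a hypothetical class; DocBase's m_charWordMap is declared as a plain Map, so the use of TreeMap here is an assumption made for the sketch.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical index from character positions to word numbers; not the DocBase source.
class CharWordIndex {
    private final TreeMap<Integer, Integer> charToWord = new TreeMap<>();

    // Records that word wn occupies characters startCN through endCN (inclusive).
    void recordWordLocation(int wn, int startCN, int endCN) {
        for (int cn = startCN; cn <= endCN; cn++) {
            charToWord.put(cn, wn);
        }
    }

    // Word at charNum, else the nearest word starting after it, else -1.
    int getWordNum(int charNum) {
        Map.Entry<Integer, Integer> entry = charToWord.ceilingEntry(charNum);
        return entry == null ? -1 : entry.getValue();
    }
}
```

For the text "Hi there", recording word 0 at characters 0 through 1 and word 1 at characters 3 through 7 means a lookup at the space (character 2) falls through to word 1.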


getPlainText

public java.lang.String getPlainText()
Description copied from interface: Doc
Gets the text that is the basis for counting, including the start/end characters in Chunk objects.

Specified by:
getPlainText in interface Doc
Returns:
The plain text.

getDocID

public java.lang.String getDocID()
Description copied from interface: Doc
Gets the ID for this document, as a string.

Specified by:
getDocID in interface Doc
Returns:
The document ID.

isCaseSensitive

public boolean isCaseSensitive()
Description copied from interface: Doc
Indicates whether the document is case sensitive.

Specified by:
isCaseSensitive in interface Doc
Returns:
Whether the document is case sensitive.

getSentNum

public int getSentNum(int wordNum)
Description copied from interface: Doc
Gets the sentence number for the specified word.

Specified by:
getSentNum in interface Doc
Parameters:
wordNum - the zero-based position of the word whose sentence number is desired.
Returns:
The zero-based sentence number.

getNumSentences

public int getNumSentences()
Description copied from interface: Doc
Returns the number of sentences in the document.

Specified by:
getNumSentences in interface Doc
Returns:
The number of sentences.

setUsePredictedEntities

public void setUsePredictedEntities(boolean usePred)
Description copied from interface: Doc
Sets the preference for using predicted entities or true entities.

Specified by:
setUsePredictedEntities in interface Doc
Parameters:
usePred - if true, prefer to use predicted entities, otherwise, prefer true entities.

usePredictedEntities

public boolean usePredictedEntities()
Description copied from interface: Doc
Indicates whether requests for entities will return predicted entities or true entities.

Specified by:
usePredictedEntities in interface Doc
Returns:
Whether predicted or true entities are to be used.

getEntities

public java.util.List<Entity> getEntities()
Description copied from interface: Doc
Gets the entities, in no particular order. If Doc.usePredictedEntities() and predicted entities are available, return them; otherwise return true entities.

Specified by:
getEntities in interface Doc
Returns:
An unmodifiable view of the entities.
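
The selection rule described above (predicted entities only when they are both preferred and available, otherwise true entities, returned as an unmodifiable view) can be sketched as follows. EntityStore is a hypothetical generic stand-in, and treating "available" as "the list is non-empty" is an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the entity bookkeeping; not the DocBase source.
class EntityStore<E> {
    private final List<E> trueEntities = new ArrayList<>();
    private final List<E> predEntities = new ArrayList<>();
    private boolean usePred = false;

    void addTrue(E e) { trueEntities.add(e); }
    void addPred(E e) { predEntities.add(e); }
    void setUsePredictedEntities(boolean usePred) { this.usePred = usePred; }

    // Assumption: predicted entities count as available once any have been added.
    boolean hasPredEntities() { return !predEntities.isEmpty(); }

    // Falls back to true entities when predictions are not preferred or not available.
    List<E> getEntities() {
        List<E> chosen = (usePred && hasPredEntities()) ? predEntities : trueEntities;
        return Collections.unmodifiableList(chosen);
    }
}
```

Callers that try to modify the returned list get an UnsupportedOperationException, which is the standard behavior of Collections.unmodifiableList.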

getPredEntities

public java.util.List<Entity> getPredEntities()
Description copied from interface: Doc
Gets a list of predicted entities, in no particular order.

Specified by:
getPredEntities in interface Doc
Returns:
An unmodifiable view of the predicted entities or an empty list.

getTrueEntities

public java.util.List<Entity> getTrueEntities()
Description copied from interface: Doc
Gets a list of true entities, in no particular order.

Specified by:
getTrueEntities in interface Doc
Returns:
An unmodifiable view of the true entities or an empty list.

getCorefChains

public ChainSolution<Mention> getCorefChains()
Description copied from interface: Doc
Gets the partition of mentions into coreferential sets.

Specified by:
getCorefChains in interface Doc
Returns:
A reference to the chain solution representing the predicted partitioning of mentions into entities, or null if none has been set.

getEntityFor

public Entity getEntityFor(Mention m)
Currently implemented slowly.

Specified by:
getEntityFor in interface Doc
Parameters:
m - The mention whose entity is desired.
Returns:
The entity containing m, or null if not found.

getEntityFor

public Entity getEntityFor(Mention m,
                           boolean usePred)
Currently implemented slowly.

Specified by:
getEntityFor in interface Doc
Parameters:
m - The mention whose entity is desired.
usePred - Whether to return a predicted entity or a true entity.
Returns:
The entity containing m, or null if the entity of the specified type is not available.

getEntityFor

protected Entity getEntityFor(Mention m,
                              java.util.List<Entity> entities)
Currently implemented slowly.


addTrueEntity

protected void addTrueEntity(Entity e)
Can be made public, but then need to ensure that e's mentions are all added.


setPredEntities

public void setPredEntities(ChainSolution<Mention> sol)
Description copied from interface: Doc
Sets the predicted entities to be those specified by sol. Entity IDs are automatically created, and each mention's setPredictedEntityID() method is called. Also sets usePredictedEntities to true. The entities are backed internally, but the mentions are not duplicated.

Specified by:
setPredEntities in interface Doc
Parameters:
sol - The partition of mentions from which to derive entities.

hasPredEntities

public boolean hasPredEntities()
Description copied from interface: Doc
Indicates whether predicted entities are available.

Specified by:
hasPredEntities in interface Doc
Returns:
Whether predicted entities have been set.

hasTrueEntities

public boolean hasTrueEntities()
Description copied from interface: Doc
Indicates whether true entities are available.

Specified by:
hasTrueEntities in interface Doc
Returns:
Whether true entities have been set.

addPredEntities

protected void addPredEntities(java.util.List<Entity> ents)
Backed internally.


getCExampleFor

public CExample getCExampleFor(Mention m1,
                               Mention m2)
Description copied from interface: Doc
Returns the unique CExample for the given pair of mentions in the given order. Doc is the head of a collection of related examples; as such, it needs to return the same CExample any time an inference-based classifier is used.

Specified by:
getCExampleFor in interface Doc
Parameters:
m1 - The first mention.
m2 - The second mention.
Returns:
The unique CExample referring to the ordered pair m1, m2.
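
The uniqueness guarantee (the same object for the same ordered pair, every time) is essentially memoization keyed by the pair. The sketch below shows that idea with a hypothetical PairExampleCache class; it uses a standard-library map entry as the pair key rather than the Pair type that appears in DocBase's field list, and the factory parameter is an assumption.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical memoizing cache keyed by an ordered pair; not the DocBase source.
class PairExampleCache<M, X> {
    private final Map<SimpleImmutableEntry<M, M>, X> cache = new HashMap<>();
    private final BiFunction<M, M, X> factory;

    PairExampleCache(BiFunction<M, M, X> factory) {
        this.factory = factory;
    }

    // Returns the one cached example for (m1, m2); (m2, m1) is a different key.
    X getFor(M m1, M m2) {
        SimpleImmutableEntry<M, M> key = new SimpleImmutableEntry<>(m1, m2);
        return cache.computeIfAbsent(key, k -> factory.apply(m1, m2));
    }
}
```

The key type relies on the mention type having consistent equals and hashCode implementations; without them, two equal mentions could end up with separate examples.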

getGExampleFor

public GExample getGExampleFor(Mention m)
Description copied from interface: Doc
Returns the unique GExample for the given mention. Doc is the head of a collection of related examples; as such, it needs to return the same GExample any time an inference-based classifier is used.

Specified by:
getGExampleFor in interface Doc
Parameters:
m - The mention.
Returns:
The unique GExample referring to the mention m.

setUsePredictedMentions

public void setUsePredictedMentions(boolean usePred)
Description copied from interface: Doc
Sets the preference for using predicted mentions or true mentions.

Specified by:
setUsePredictedMentions in interface Doc
Parameters:
usePred - if true, prefer to use predicted mentions, otherwise, prefer true mentions.

usePredictedMentions

public boolean usePredictedMentions()
Description copied from interface: Doc
Indicates whether requests for mentions will return predicted mentions or true mentions.

Specified by:
usePredictedMentions in interface Doc
Returns:
Whether predicted or true mentions are to be used.

getMentions

public java.util.List<Mention> getMentions()
Description copied from interface: Doc
Gets the mentions of the document, sorted (typically in document order). Returns predicted mentions or true mentions depending on the result of usePredictedMentions().

Specified by:
getMentions in interface Doc
Returns:
mentions sorted by their natural ordering (usually document ordering).

getPredMentions

public java.util.List<Mention> getPredMentions()
Description copied from interface: Doc
Gets a sorted list of predicted mentions.

Specified by:
getPredMentions in interface Doc
Returns:
sorted predicted mentions, or an empty list if none available.

getTrueMentions

public java.util.List<Mention> getTrueMentions()
Description copied from interface: Doc
Gets a sorted list of true mentions.

Specified by:
getTrueMentions in interface Doc
Returns:
sorted true mentions, or an empty list if none available.

hasPredMentions

public boolean hasPredMentions()
Description copied from interface: Doc
Indicates whether predicted mentions have been set.

Specified by:
hasPredMentions in interface Doc
Returns:
Whether predicted mentions have been set.

hasTrueMentions

public boolean hasTrueMentions()
Description copied from interface: Doc
Indicates whether true mentions have been set.

Specified by:
hasTrueMentions in interface Doc
Returns:
Whether true mentions have been set.

setPredictedMentions

public void setPredictedMentions(java.util.Collection<Mention> ments)
Description copied from interface: Doc
Sets the predicted mentions and records a preference for using them.

Specified by:
setPredictedMentions in interface Doc
Parameters:
ments - The predicted mentions (copied defensively).
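
As a rough illustration of what "copied defensively" and "records a preference" can mean in practice, the sketch below copies the incoming collection, sorts the copy, and flips a preference flag. PredictedMentionHolder is a hypothetical class, strings stand in for Mention objects, and sorting at set time is an assumption rather than documented behavior.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical holder illustrating a defensive copy; not the library's code.
final class PredictedMentionHolder {
    private List<String> predicted = new ArrayList<String>();
    private boolean preferPredicted = false;

    // Copies the caller's collection so later changes to it are not visible here.
    void setPredicted(Collection<String> ments) {
        predicted = new ArrayList<String>(ments);
        Collections.sort(predicted); // assumption: keep the copy in natural order
        preferPredicted = true;      // setting predictions also records the preference
    }

    // Read-only view, so callers cannot alter the stored list.
    List<String> getPredicted() {
        return Collections.unmodifiableList(predicted);
    }

    boolean usePredicted() {
        return preferPredicted;
    }
}
```

Because the holder keeps its own copy, adding to the caller's collection after the call does not change what getPredicted() returns.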

alignPredMentsToTrue

protected void alignPredMentsToTrue()

addTrueMention

protected void addTrueMention(Mention m)

getNumMentions

public int getNumMentions()

sortTrueMentions

protected void sortTrueMentions()
Sorts true mentions in natural order, which is the textual order by default.

See Also:
Mention.compareTo(Mention)

sortPredictedMentions

protected void sortPredictedMentions()
Sorts predicted mentions in natural order, which is the textual order by default.

See Also:
Mention.compareTo(Mention)

getMention

public Mention getMention(int n)

getPredMention

public Mention getPredMention(int n)

getTrueMention

public Mention getTrueMention(int n)

getTrueMentionFor

public Mention getTrueMentionFor(Mention pred)
Description copied from interface: Doc
Gets the true mention aligned with the specified mention.

Specified by:
getTrueMentionFor in interface Doc
Parameters:
pred - A predicted mention.
Returns:
The true mention aligned with pred.

getBestMentionFor

public Mention getBestMentionFor(Mention m)
Description copied from interface: Doc
Gets the canonical mention of the entity containing m.

Specified by:
getBestMentionFor in interface Doc
Parameters:
m - A mention.
Returns:
The canonical mention for m.

getMentionsWithHeadStartingAt

public java.util.Set<Mention> getMentionsWithHeadStartingAt(int startWord)
Description copied from interface: Doc
Returns the set of mentions whose heads start at the specified word number, or an empty set if none are found. May be backed internally or not: no guarantees are made (yet).

Specified by:
getMentionsWithHeadStartingAt in interface Doc
Parameters:
startWord - A word number.
Returns:
The set of mentions whose heads start at startWord.

getMentionsWithExtentStartingAt

public java.util.Set<Mention> getMentionsWithExtentStartingAt(int startWord)
Description copied from interface: Doc
Returns the set of mentions whose extents start at the specified word number, or an empty set if none are found. May be backed internally or not: no guarantees are made (yet).

Specified by:
getMentionsWithExtentStartingAt in interface Doc
Parameters:
startWord - A word number.
Returns:
The set of mentions whose extents start at startWord.

getMentionsContainedIn

public java.util.Set<Mention> getMentionsContainedIn(Mention m)
Description copied from interface: Doc
Gets the set of mentions whose head is entirely contained within a specified mention's extent, including the specified mention itself. Returns predicted or true mentions according to the result of getMentions().

Specified by:
getMentionsContainedIn in interface Doc
Parameters:
m - The specified mention.
Returns:
The set of mentions contained in m.
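
The containment test described here compares one mention's head against another mention's extent. The following sketch shows that comparison under stated assumptions: Span is a hypothetical stand-in for Mention, and all positions are inclusive word numbers.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of head-in-extent containment; not the library's code.
final class SpanContainment {
    // Stand-in for Mention: inclusive word numbers for the head and the extent.
    static final class Span {
        final int headStart, headEnd, extentStart, extentEnd;

        Span(int headStart, int headEnd, int extentStart, int extentEnd) {
            this.headStart = headStart;
            this.headEnd = headEnd;
            this.extentStart = extentStart;
            this.extentEnd = extentEnd;
        }
    }

    // True if inner's head lies entirely within outer's extent.
    static boolean headInsideExtent(Span inner, Span outer) {
        return outer.extentStart <= inner.headStart && inner.headEnd <= outer.extentEnd;
    }

    // All spans whose head is contained in m's extent; m qualifies for itself
    // whenever its own head lies inside its own extent.
    static Set<Span> containedIn(List<Span> all, Span m) {
        Set<Span> result = new LinkedHashSet<Span>();
        for (Span s : all) {
            if (headInsideExtent(s, m)) {
                result.add(s);
            }
        }
        return result;
    }
}
```

For "the president of France" with word numbers 0 to 3, the mention headed by "president" (extent 0 to 3) contains both itself and "France", whereas "France" contains only itself.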

getMentionsContaining

public java.util.Set<Mention> getMentionsContaining(Mention m)
Description copied from interface: Doc
Gets the set of mentions whose extents entirely contain a specified mention's extent, including the specified mention itself. Returns predicted or true mentions according to the result of getMentions().

Specified by:
getMentionsContaining in interface Doc
Parameters:
m - The specified mention.
Returns:
The set of mentions whose extents contain (or equal) the extent of m. Returns predicted or true mentions according to what getMentions() returns.

buildMentionsContaining

protected void buildMentionsContaining()

getMentionsInSent

public java.util.List<Mention> getMentionsInSent(int sentNum)
Description copied from interface: Doc
Gets a list of the mentions in a specified sentence in order. Returns true or predicted mentions according to the value of usePredictedMentions().

Specified by:
getMentionsInSent in interface Doc
Parameters:
sentNum - The number of the specified sentence.
Returns:
A List of the mentions in the specified sentence, in the order that they appear in the sentence.

buildMentionsInSents

protected void buildMentionsInSents()

getMentionsInSentences

public Pair<java.util.List<Mention>,java.util.List<Mention>> getMentionsInSentences(int s1,
                                                                                    int s2)
Description copied from interface: Doc
Gets all the mentions in the two specified sentences as a pair of lists, one list per sentence.

Specified by:
getMentionsInSentences in interface Doc
Parameters:
s1 - The number of the first sentence.
s2 - The number of the second sentence.
Returns:
A pair of lists of mentions, one for each sentence.

makeChunk

public Chunk makeChunk(int startWord,
                       int endWord)
Description copied from interface: Doc
Creates a chunk spanning the specified words in this document.

Specified by:
makeChunk in interface Doc
Parameters:
startWord - The position of the first word in the desired chunk.
endWord - The position of the last word in the desired chunk.
Returns:
The desired chunk.

getWords

public java.util.List<java.lang.String> getWords()
Description copied from interface: Doc
Gets a list of the surface forms of the words of the document.

Specified by:
getWords in interface Doc
Returns:
A list of Strings of words, in the order they appear.

getWord

public java.lang.String getWord(int wordNum)
Description copied from interface: Doc
Gets the specified word.

Specified by:
getWord in interface Doc
Parameters:
wordNum - The position of the specified word (as an index into a List).
Returns:
The wordNumth word as a string.

getPOS

public java.util.List<java.lang.String> getPOS()
Description copied from interface: Doc
Gets a list of the Part-Of-Speech tags for the words of the document. The tag set is that output by the LBJ POS tagger.

Specified by:
getPOS in interface Doc
Returns:
A list of Part-Of-Speech tags corresponding to the words of the document.
See Also:
POSTagger

getPOS

public java.lang.String getPOS(int posNum)
Description copied from interface: Doc
Gets the Part-Of-Speech tag for the word at the posNum position in the document.

Specified by:
getPOS in interface Doc
Parameters:
posNum - The position of the word whose POS tag should be returned.
Returns:
The Part-Of-Speech tag for the desired word position.
See Also:
POSTagger

getWordNum

public int getWordNum(int charNum)
Description copied from interface: Doc
Determines the zero-based word number of the word at charNum. If no word is at charNum, returns the word number of the closest word appearing after charNum; if no such word exists, returns -1.

Specified by:
getWordNum in interface Doc
Parameters:
charNum - The character number.
Returns:
The word number corresponding to the specified character number.
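
The lookup rule above (covering word, else next word, else -1) can be sketched with two parallel offset arrays. CharToWord and its array layout are hypothetical; the class itself lists a m_charWordMap field, whose actual use is not documented here.

```java
// Hypothetical sketch of character-to-word lookup; not the library's code.
final class CharToWord {
    private final int[] starts; // inclusive start offset of each word, ascending
    private final int[] ends;   // exclusive end offset of each word

    CharToWord(int[] starts, int[] ends) {
        this.starts = starts;
        this.ends = ends;
    }

    // Word covering charNum, else the closest word after it, else -1.
    int wordNum(int charNum) {
        for (int w = 0; w < starts.length; w++) {
            // The first word ending after charNum either covers it or follows it.
            if (charNum < ends[w]) {
                return w;
            }
        }
        return -1;
    }
}
```

For the text "Hi  there" (two spaces), with "Hi" at offsets 0 to 1 and "there" at 4 to 8, offsets 2 and 3 fall in whitespace and resolve to word 1, while offset 9 is past the last word and yields -1.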

getTextFirstWordNum

public int getTextFirstWordNum()
Description copied from interface: Doc
Gets the word number of the first word in the main text of the document (as distinguished from headlines and metadata that may be included in the plain text).

Specified by:
getTextFirstWordNum in interface Doc
Returns:
The word number of the first word in the main text.

getStartCharNum

public int getStartCharNum(int wordNum)
Description copied from interface: Doc
Gets the zero-based position of the first character of a word.

Specified by:
getStartCharNum in interface Doc
Parameters:
wordNum - The zero-based position of the word in the document.
Returns:
The zero-based position of the first character of the word within the plain text, or -1 if wordNum is invalid.

getQuoteNestLevel

public int getQuoteNestLevel(int wordNum)
Description copied from interface: Doc
Indicates the number of nested quotes the specified word is in. 0 is the base level of the text.

Specified by:
getQuoteNestLevel in interface Doc
Parameters:
wordNum - The position of the specified word.
Returns:
The number of nested quotes.

getInverseTrueHeadFreq

public double getInverseTrueHeadFreq(int wordNum)
Description copied from interface: Doc
Gets the inverse true head frequency of the word at the specified position.

Specified by:
getInverseTrueHeadFreq in interface Doc
Parameters:
wordNum - The position in the document of the specified word.
Returns:
The inverse true head frequency of the specified word, or 1.0 if the word is not in a true head.
See Also:
Doc.getInverseTrueHeadFreq(String)

getInverseTrueHeadFreq

public double getInverseTrueHeadFreq(java.lang.String word)
Description copied from interface: Doc
Gets the inverse of the number of occurrences of the specified word in the heads of the true mentions in the document.

Specified by:
getInverseTrueHeadFreq in interface Doc
Parameters:
word - The specified word.
Returns:
The inverse true head frequency of the specified word, or 1.0 if the word is not found in any heads.

getInDocInverseFreq

public double getInDocInverseFreq(java.lang.String word)
Description copied from interface: Doc
Gets the inverse of the number of occurrences of the specified word in the document. Not normalized.

Specified by:
getInDocInverseFreq in interface Doc
Parameters:
word - The specified word.
Returns:
The inverse of the number of times the word occurs in the document, or 1.0 if the word does not occur.
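
The unnormalized inverse frequency described here is simply one divided by the raw count, with 1.0 as the fallback for unseen words. The sketch below shows that arithmetic; InverseFreq and its method names are hypothetical, not part of the library.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of unnormalized inverse frequency; not the library's code.
final class InverseFreq {
    // Counts how many times each word occurs in the given word list.
    static Map<String, Integer> count(List<String> words) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String w : words) {
            Integer n = counts.get(w);
            counts.put(w, n == null ? 1 : n + 1);
        }
        return counts;
    }

    // 1 / count, or 1.0 when the word was never seen.
    static double inverse(Map<String, Integer> counts, String word) {
        Integer n = counts.get(word);
        if (n == null || n == 0) {
            return 1.0;
        }
        return 1.0 / n;
    }
}
```

Note that under this definition a word that occurs exactly once and a word that never occurs both score 1.0, so the value alone does not distinguish the two cases.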

getInCorpusInverseFreq

public double getInCorpusInverseFreq(java.lang.String word)
Description copied from interface: Doc
Gets the inverse of the number of occurrences of the specified word in the corpus. Not normalized.

Specified by:
getInCorpusInverseFreq in interface Doc
Parameters:
word - The specified word.
Returns:
The inverse of the number of times the word occurs in the corpus, or 1.0 if the word does not occur.

getWholeDocCounts

public java.util.Map<java.lang.String,java.lang.Integer> getWholeDocCounts()
Description copied from interface: Doc
Gets the counts for the words in the document. Returns a copy, which may be slow or space-consuming.

Specified by:
getWholeDocCounts in interface Doc
Returns:
A map from words to counts of words in the document.

setCorpusCounts

public void setCorpusCounts(java.util.Map<java.lang.String,java.lang.Integer> counts)
Description copied from interface: Doc
Sets the corpus counts for the words in the corpus. Makes a copy of the map, which may be slow or space-consuming.

Specified by:
setCorpusCounts in interface Doc
Parameters:
counts - A map from words to counts of words in the corpus.

getNumRelations

public int getNumRelations()
Description copied from interface: Doc
Gets the number of relations.

Specified by:
getNumRelations in interface Doc
Returns:
The number of relations.

getRelation

public Relation getRelation(int number)
Description copied from interface: Doc
Gets the specified relation. Relations are not yet emphasized.

Specified by:
getRelation in interface Doc
Parameters:
number - the number of the desired relation.
Returns:
The desired relation.

addRelation

protected void addRelation(Relation r)

hasHeadPrediction

public boolean hasHeadPrediction(int firstWN,
                                 int lastWN)
Checks whether a prediction has been stored for the word sequence spanning the closed interval [firstWN, lastWN]. Does NOT return whether that sequence is a head.


getHeadPrediction

public boolean getHeadPrediction(int firstWN,
                                 int lastWN)

addHeadPrediction

public void addHeadPrediction(int firstWN,
                              int lastWN,
                              boolean pred)
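
The three head-prediction methods separate "is a prediction stored?" from "what was predicted?". A minimal sketch of that distinction follows, assuming a map keyed by the closed interval. HeadPredictionStore is hypothetical, and returning false from get when nothing is stored is an assumption, since the behavior of getHeadPrediction in that case is not documented.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a head-prediction store; not the library's code.
final class HeadPredictionStore {
    // Key is the closed interval [firstWN, lastWN] as a two-element list.
    private final Map<List<Integer>, Boolean> preds = new HashMap<List<Integer>, Boolean>();

    void add(int firstWN, int lastWN, boolean pred) {
        preds.put(Arrays.asList(firstWN, lastWN), pred);
    }

    // Whether any prediction (true or false) has been stored for the interval.
    boolean has(int firstWN, int lastWN) {
        return preds.containsKey(Arrays.asList(firstWN, lastWN));
    }

    // The stored prediction; assumption: false when nothing has been stored.
    boolean get(int firstWN, int lastWN) {
        Boolean p = preds.get(Arrays.asList(firstWN, lastWN));
        return p != null && p.booleanValue();
    }
}
```

A stored negative prediction makes has return true while get returns false, which is why callers would check has before trusting get.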

toString

public java.lang.String toString()
Overrides:
toString in class java.lang.Object

toAnnotatedString

public java.lang.String toAnnotatedString(boolean showPOS,
                                          boolean showMTypes,
                                          boolean showETypes,
                                          boolean showEIDs)
Description copied from interface: Doc
Gets the document as a string annotated with mention boundaries, with square brackets for true mentions, asterisks for false alarms, and angle brackets for missed mentions, and optionally annotated with Part-Of-Speech tags, mention types, entity types, and entity IDs. Predicted Entity IDs will be shown if available.

Specified by:
toAnnotatedString in interface Doc
Parameters:
showPOS - Whether the Part-Of-Speech tags should be shown.
showMTypes - Whether mention types should be shown.
showETypes - Whether entity types should be shown.
showEIDs - Whether entity IDs should be shown.
Returns:
The text of the document, annotated.

toAnnotatedString

public java.lang.String toAnnotatedString(boolean showPOS)
Description copied from interface: Doc
Gets the document as a string annotated with mention boundaries, with square brackets for true mentions, asterisks for false alarms, and angle brackets for missed mentions, and optionally annotated with Part-Of-Speech tags.

Specified by:
toAnnotatedString in interface Doc
Parameters:
showPOS - Whether the Part-Of-Speech tags should be shown.
Returns:
The text of the document, annotated.

toSubstituteString

public java.lang.String toSubstituteString()
Description copied from interface: Doc
Gets the document as a string where each mention has been replaced by the most specific mention coreferential with it.

Specified by:
toSubstituteString in interface Doc
Returns:
The text of the document with the extent of each mention replaced by the most specific mention in its entity. Uses the mentions supplied by getMentions().

makeBestMentionMap

protected java.util.Map<Mention,Mention> makeBestMentionMap()

getCoherenceInfo

public java.util.Map<Entity,java.util.Map<java.lang.Integer,java.lang.String>> getCoherenceInfo(boolean usePred)
Description copied from interface: Doc
Gets a grid indicating the mention type for each combination of entities and sentences. If a mention is predicted to belong to its true entity, its mention type will be uppercase; but if it is predicted to be in the wrong entity (due to coreference mistake) its mention type will be lowercase and the mention's entity ID will be appended after its mention type.

Specified by:
getCoherenceInfo in interface Doc
Parameters:
usePred - Whether predicted entities should be used.
Returns:
A map from entities to a map from sentence numbers to strings, representing the grid described above.
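
The casing rule for a single grid cell can be sketched as below. This is a guess at the formatting only: CoherenceCell is hypothetical, the mention type values are illustrative, the sketch assumes true and predicted entity IDs are directly comparable, and it assumes the appended ID is that of the predicted entity, which the description does not state.

```java
import java.util.Locale;

// Hypothetical helper for formatting one grid cell; not the library's code.
final class CoherenceCell {
    // mentionType: an illustrative label such as "NAM".
    // trueEntityId / predEntityId: assumed to be directly comparable IDs.
    static String format(String mentionType, String trueEntityId, String predEntityId) {
        if (trueEntityId.equals(predEntityId)) {
            // Correctly clustered mention: uppercase type, no ID.
            return mentionType.toUpperCase(Locale.ROOT);
        }
        // Assumption: the appended ID is that of the (wrong) predicted entity.
        return mentionType.toLowerCase(Locale.ROOT) + predEntityId;
    }
}
```

Under these assumptions, a correctly placed mention of type "nam" renders as "NAM", and one misplaced into entity "E7" renders as "namE7".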

getCoherenceInfo

public java.util.Map<Entity,java.util.Map<java.lang.Integer,java.lang.String>> getCoherenceInfo()
Description copied from interface: Doc
Gets the coherence info using the value of usePredictedEntities() to determine whether to use predicted entities.

Specified by:
getCoherenceInfo in interface Doc
Returns:
Coherence info as described in the one parameter version of this method.

toCoherenceTableString

public java.lang.String toCoherenceTableString(boolean usePred)
Description copied from interface: Doc
Gets the coherence grid represented as a string, laid out in a grid.

Specified by:
toCoherenceTableString in interface Doc
Parameters:
usePred - Whether predicted entities should be used.
Returns:
A coherence grid as a string.
See Also:
Doc.getCoherenceInfo()

toCoherenceTableString

public java.lang.String toCoherenceTableString()
Description copied from interface: Doc
Gets the coherence grid represented as a string, laid out in a grid. Predicted entities will be used as determined by the value of usePredictedEntities().

Specified by:
toCoherenceTableString in interface Doc
Returns:
A coherence grid as a string.
See Also:
Doc.getCoherenceInfo()

repeat

protected java.lang.String repeat(java.lang.String s,
                                  int n)

sortEntitiesByListOrder

protected java.util.List<Entity> sortEntitiesByListOrder(java.util.List<Entity> ents,
                                                         java.util.List<Entity> ordered)
Does NOT modify in place (but this may change).


getShortEID

public java.lang.String getShortEID(java.lang.String longID)

save

public void save()
          throws java.io.IOException
Description copied from interface: Doc
Writes the document to a file using serialization.

Specified by:
save in interface Doc
Throws:
java.io.IOException

write

public void write(boolean usePredictions)
Description copied from interface: Doc
Writes this Doc in the appropriate format.

Specified by:
write in interface Doc
Parameters:
usePredictions - Whether predicted mentions and entities should be written.

write

public abstract void write(java.lang.String filename,
                           boolean usePredictions)
Description copied from interface: Doc
Writes this Doc in the appropriate format.

Specified by:
write in interface Doc
Parameters:
filename - The name of the target file.
usePredictions - Whether predicted mentions and entities should be written.

printChunkValidity

public static void printChunkValidity()
Verifies that all mentions start and end on phrase boundaries.