DocBase

edu.illinois.cs.cogcomp.lbj.coref.ir.docs
Class DocBase

java.lang.Object
  edu.illinois.cs.cogcomp.lbj.coref.ir.docs.DocBase

All Implemented Interfaces:: Doc, java.io.Serializable

Direct Known Subclasses:: DocPlainText, DocXMLBase

public abstract class DocBase
extends java.lang.Object
implements Doc, java.io.Serializable

Represents one document from a corpus, including the text, annotations of coreference, relations, entities, and other relevant information. Also contains methods to load input from XML files.

Author:: Eric Bengtson
See Also:: Serialized Form

Nested Class Summary
`static class`	`DocBase.PosSource`

Field Summary
`(package private) static int`	`goodEnds`
`(package private) static int`	`goodStarts`
`protected java.lang.String`	`m_annotationAuthor`
`protected java.lang.String`	`m_baseFN`
`private java.util.Map<Mention,Mention>`	`m_bestMentionMap`
`protected boolean`	`m_bNeedsCasing`
`private boolean`	`m_bUsePredEntities`
`private boolean`	`m_bUsePredMentions`
`protected LBJ2.classify.Classifier`	`m_caser`
`private java.util.Map<Pair<Mention,Mention>,CExample>`	`m_cExMap`
`private java.util.Map<java.lang.Integer,java.lang.Integer>`	`m_charWordMap`
`private ChainSolution<Mention>`	`m_corefChains`
`private java.util.Map<java.lang.String,java.lang.Integer>`	`m_corpusWordCounts`
`private java.lang.String`	`m_countingText`
`protected java.lang.String`	`m_dateTime`
`private Aligner<Mention>`	`m_defaultAligner`
`protected java.lang.String`	`m_docID`
`protected java.lang.String`	`m_docType`
`private java.util.Map<java.lang.String,java.lang.Integer>`	`m_docWordCounts`
`protected java.lang.String`	`m_encoding`
`private java.util.Map<java.lang.Integer,java.util.Set<Mention>>`	`m_extentStartWordNumMentionMap`
`private java.util.Map<Mention,GExample>`	`m_gExMap`
`protected java.lang.String`	`m_headline`
`private java.util.Map<java.lang.Integer,java.util.Map<java.lang.Integer,java.lang.Boolean>>`	`m_headPredictionMap`
`private java.util.Map<java.lang.Integer,java.util.Set<Mention>>`	`m_headStartWordNumMentionMap`
`private java.lang.Boolean`	`m_isCaseSensitive`
`private java.util.Map<Mention,java.util.Set<Mention>>`	`m_mentionsContaining`
`private java.util.List<java.util.List<Mention>>`	`m_mentsInSents`
`private java.util.Map<java.lang.String,java.lang.Integer>`	`m_mentWordCounts`
`private int`	`m_nSents`
`private java.util.List<java.util.List<java.lang.String>>`	`m_phrases`
`private java.util.List<java.lang.String>`	`m_pos`
`private java.util.List<Entity>`	`m_predEntities`
`private java.util.List<Mention>`	`m_predMentions`
`private java.util.Map<Mention,Mention>`	`m_predToTrueMention`
`private java.util.List<java.lang.Integer>`	`m_quoteNestLevel`
`private java.util.List<Relation>`	`m_relations`
`private java.util.Map<Pair<java.lang.Integer,java.lang.Integer>,Pair<java.util.List<Mention>,java.util.List<Mention>>>`	`m_sentenceMentionsPair`
`protected java.lang.String`	`m_slug`
`protected java.lang.String`	`m_source`
`protected java.lang.String`	`m_text`
`private int`	`m_textStartCharNum`
`private java.util.List<Entity>`	`m_trueEntities`
`private java.util.List<Mention>`	`m_trueMentions`
`private boolean`	`m_trueMentionsSorted`
`protected java.lang.String`	`m_version`
`private java.util.Map<java.lang.Integer,java.lang.Integer>`	`m_wordNumCharNumMap`
`private java.util.Map<java.lang.Integer,java.lang.Integer>`	`m_wordNumSentNumMap`
`private java.util.List<java.lang.String>`	`m_words`
`(package private) static int`	`medEnds`
`private static long`	`serialVersionUID`
`(package private) static int`	`totalMentions`

Constructor Summary
`DocBase()` Basic constructor: Not recommended.

Method Summary
`void`	`addHeadPrediction(int firstWN, int lastWN, boolean pred)`
`protected void`	`addPredEntities(java.util.List<Entity> ents)` Backed internally.
`protected void`	`addRelation(Relation r)`
`protected void`	`addTrueEntity(Entity e)` Could be made public, but callers would then need to ensure that all of e's mentions are added.
`protected void`	`addTrueMention(Mention m)`
`protected void`	`alignPredMentsToTrue()`
`protected void`	`buildMentionsContaining()`
`protected void`	`buildMentionsInSents()`
`void`	`calcAndSetQuotes()` Determines the location of quotes and sets them.
`Mention`	`getBestMentionFor(Mention m)` Gets the canonical mention of the entity containing `m`.
`CExample`	`getCExampleFor(Mention m1, Mention m2)` Returns the unique `CExample` for the given pair of mentions in the given order.
`java.util.Map<Entity,java.util.Map<java.lang.Integer,java.lang.String>>`	`getCoherenceInfo()` Gets the coherence info using the value of `usePredictedEntities()` to determine whether to use predicted entities.
`java.util.Map<Entity,java.util.Map<java.lang.Integer,java.lang.String>>`	`getCoherenceInfo(boolean usePred)` Gets a grid indicating the mention type for each combination of entities and sentences.
`ChainSolution<Mention>`	`getCorefChains()` Gets the partition of mentions into coreferential sets.
`java.lang.String`	`getDocID()` Gets the ID for this document, as a string.
`java.util.List<Entity>`	`getEntities()` Gets the entities, in no particular order.
`Entity`	`getEntityFor(Mention m)` Currently implemented slowly.
`Entity`	`getEntityFor(Mention m, boolean usePred)` Currently implemented slowly.
`protected Entity`	`getEntityFor(Mention m, java.util.List<Entity> entities)` Currently implemented slowly.
`GExample`	`getGExampleFor(Mention m)` Returns the unique `GExample` for the given mention.
`boolean`	`getHeadPrediction(int firstWN, int lastWN)`
`double`	`getInCorpusInverseFreq(java.lang.String word)` Gets the inverse of the number of occurrences of the specified word in the corpus.
`double`	`getInDocInverseFreq(java.lang.String word)` Gets the inverse of the number of occurrences of the specified word in the document.
`double`	`getInverseTrueHeadFreq(int wordNum)` Gets the inverse true head frequency of the word at the specified position.
`double`	`getInverseTrueHeadFreq(java.lang.String word)` Gets the inverse of the number of occurrences of the specified word in the heads of the true mentions in the document.
`Mention`	`getMention(int n)`
`java.util.List<Mention>`	`getMentions()` Gets the mentions of the document, sorted (typically in document order).
`java.util.Set<Mention>`	`getMentionsContainedIn(Mention m)` Gets the set of mentions whose head is entirely contained within a specified mention's extent, including the specified mention itself.
`java.util.Set<Mention>`	`getMentionsContaining(Mention m)` Gets the set of mentions whose extents entirely contain a specified mention's extent, including the specified mention itself.
`java.util.List<Mention>`	`getMentionsInSent(int sentNum)` Gets a list of the mentions in a specified sentence in order.
`Pair<java.util.List<Mention>,java.util.List<Mention>>`	`getMentionsInSentences(int s1, int s2)` Gets a pair of lists of mentions, one for each of the two specified sentences.
`java.util.Set<Mention>`	`getMentionsWithExtentStartingAt(int startWord)` Returns the set of mentions whose extents start at the specified word number, or an empty set if none are found.
`java.util.Set<Mention>`	`getMentionsWithHeadStartingAt(int startWord)` Returns the set of mentions whose heads start at the specified word number, or an empty set if none are found.
`int`	`getNumMentions()`
`int`	`getNumRelations()` Gets the number of relations.
`int`	`getNumSentences()` Returns the number of sentences in the document.
`java.lang.String`	`getPlainText()` Gets the text that is the basis for counting, including the start/end characters in Chunk objects.
`java.util.List<java.lang.String>`	`getPOS()` Gets a list of the Part-Of-Speech tags for the words of the document.
`java.lang.String`	`getPOS(int posNum)` Gets the Part-Of-Speech tag for the word at the `posNum` position in the document.
`java.util.List<Entity>`	`getPredEntities()` Gets a list of predicted entities, in no particular order.
`Mention`	`getPredMention(int n)`
`java.util.List<Mention>`	`getPredMentions()` Gets a sorted list of predicted mentions.
`int`	`getQuoteNestLevel(int wordNum)` Indicates the number of nested quotes the specified word is in.
`Relation`	`getRelation(int number)` Gets the specified relation.
`int`	`getSentNum(int wordNum)` Gets the sentence number for the specified word.
`java.lang.String`	`getShortEID(java.lang.String longID)`
`int`	`getStartCharNum(int wordNum)` Gets the zero-based position of the first character of a word.
`int`	`getTextFirstWordNum()` Gets the word number of the first word in the main text of the document (as distinguished from headlines and metadata that may be included in the plain text).
`java.util.List<Entity>`	`getTrueEntities()` Gets a list of true entities, in no particular order.
`Mention`	`getTrueMention(int n)`
`Mention`	`getTrueMentionFor(Mention pred)` Gets the true mention aligned with the specified mention.
`java.util.List<Mention>`	`getTrueMentions()` Gets a sorted list of true mentions.
`java.util.Map<java.lang.String,java.lang.Integer>`	`getWholeDocCounts()` Gets the counts for the words in the document.
`java.lang.String`	`getWord(int wordNum)` Gets the specified word.
`int`	`getWordNum(int charNum)` Determines the zero-based word number of the word at `charNum`; if no word is at `charNum`, returns the word number of the closest word appearing after `charNum`, or -1 if no such word exists.
`java.util.List<java.lang.String>`	`getWords()` Gets a list of the surface forms of the words of the document.
`boolean`	`hasHeadPrediction(int firstWN, int lastWN)` Checks to see whether a prediction has been stored for whether the closed interval [firstWN, lastWN] word sequence is a head.
`boolean`	`hasPredEntities()` Indicates whether predicted entities are available.
`boolean`	`hasPredMentions()` Indicates whether predicted mentions have been set.
`boolean`	`hasTrueEntities()` Indicates whether true entities are available.
`boolean`	`hasTrueMentions()` Indicates whether true mentions have been set.
`protected void`	`initMembersDefault()`
`boolean`	`isCaseSensitive()` Indicates whether the document is case sensitive.
`protected void`	`loadChunkedText(java.lang.String filename)` Loads text that has been preprocessed offline.
`void`	`loadFromText(java.lang.String plainText)` Builds the document from the given plain text, automatically splitting sentences, determining quote levels, determining part-of-speech tags, and splitting words by an automatic word-splitting algorithm.
`void`	`loadFromText(java.lang.String plainText, boolean doWordSplit, boolean doPOSTag)` Builds the document from the given plain text, automatically splitting sentences, determining quote levels, determining part-of-speech tags, and either splitting words by whitespace or using a word-splitter.
`protected java.lang.String`	`loadPOSTaggerOutput()` Loads the output of the SNoW-based POS tagger.
`protected void`	`loadPOSTags(java.lang.String content)` Loads text that has been preprocessed.
`void`	`loadSGMText(java.lang.String filename)`
`protected java.util.Map<Mention,Mention>`	`makeBestMentionMap()`
`Chunk`	`makeChunk(int startWord, int endWord)` Create a chunk spanning the specified words in this document.
`static void`	`printChunkValidity()` Verify that all mentions start and end on phrase boundaries.
`protected void`	`recordWordLocation(int wn, int startCN, int endCN)` Records the fact that a word is located at characters `startCN` through `endCN` (inclusive).
`protected java.lang.String`	`removeTagsAndExtraNL(java.lang.String a)`
`protected java.lang.String`	`repeat(java.lang.String s, int n)`
`void`	`save()` Writes the document to a file using serialization.
`void`	`setCorpusCounts(java.util.Map<java.lang.String,java.lang.Integer> counts)` Sets the corpus counts for the words in the corpus.
`protected void`	`setPlainText(java.lang.String text)` Should be set before words are set.
`protected void`	`setPOSTags(java.util.List<java.lang.String> tags)` Sets the POS tags.
`void`	`setPredEntities(ChainSolution<Mention> sol)` Sets the predicted entities to be those specified by `sol`.
`void`	`setPredictedMentions(java.util.Collection<Mention> ments)` Sets the predicted mentions and records a preference for using them.
`void`	`setQuoteLevels(java.util.List<java.lang.Integer> quoteLevels)` Sets the quote levels, which indicate the number of nested quotations in which each word is embedded.
`protected void`	`setSentenceNumbers(java.util.List<java.lang.Integer> sentNums)` Sets the sentence numbers for each word.
`void`	`setUsePredictedEntities(boolean usePred)` Sets the preference for using predicted entities or true entities.
`void`	`setUsePredictedMentions(boolean usePred)` Sets the preference for using predicted mentions or true mentions.
`void`	`setWords(java.util.List<java.lang.String> words)`
`void`	`setWords(java.util.List<java.lang.String> words, boolean backwardsCompatible)` Sets the words, aligns them with the plain text, and records statistics about them.
`protected java.util.List<Entity>`	`sortEntitiesByListOrder(java.util.List<Entity> ents, java.util.List<Entity> ordered)` Does NOT modify in place (but this may change).
`protected void`	`sortPredictedMentions()` Sorts predicted mentions in natural order, which is the textual order by default.
`protected void`	`sortTrueMentions()` Sorts true mentions in natural order, which is the textual order by default.
`java.lang.String`	`toAnnotatedString(boolean showPOS)` Gets the document as a string annotated with mention boundaries, with square brackets for true mentions, asterisks for false alarms, and triangle brackets for missed mentions, and optionally annotated with Part-Of-Speech tags.
`java.lang.String`	`toAnnotatedString(boolean showPOS, boolean showMTypes, boolean showETypes, boolean showEIDs)` Gets the document as a string annotated with mention boundaries, with square brackets for true mentions, asterisks for false alarms, and triangle brackets for missed mentions, and optionally annotated with Part-Of-Speech tags, mention types, entity types, and entity IDs.
`java.lang.String`	`toCoherenceTableString()` Gets the coherence grid represented as a string, laid out in a grid.
`java.lang.String`	`toCoherenceTableString(boolean usePred)` Gets the coherence grid represented as a string, laid out in a grid.
`java.lang.String`	`toString()`
`java.lang.String`	`toSubstituteString()` Gets the document as a string where each mention has been replaced by the most specific mention coreferential with it.
`protected java.lang.String`	`translateEscaped(java.lang.String escaped, int cursor)` Translates a round, square, or curly brace escaped as -LBR- or -RBR-, or a pair of quotes escaped as a double quote character.
`boolean`	`usePredictedEntities()` Indicates whether requests for entities will return predicted entities or true entities.
`boolean`	`usePredictedMentions()` Indicates whether requests for mentions will return predicted mentions or true mentions.
`void`	`write(boolean usePredictions)` Writes this Doc in the appropriate format.
`abstract void`	`write(java.lang.String filename, boolean usePredictions)` Writes this Doc in the appropriate format.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait`

Field Detail

serialVersionUID

private static final long serialVersionUID

See Also:: Constant Field Values

m_baseFN

protected java.lang.String m_baseFN

totalMentions

static int totalMentions

goodStarts

static int goodStarts

goodEnds

static int goodEnds

medEnds

static int medEnds

m_bUsePredEntities

private boolean m_bUsePredEntities

m_trueEntities

private java.util.List<Entity> m_trueEntities

m_predEntities

private java.util.List<Entity> m_predEntities

m_corefChains

private ChainSolution<Mention> m_corefChains

m_relations

private java.util.List<Relation> m_relations

m_bUsePredMentions

private boolean m_bUsePredMentions

m_trueMentions

private java.util.List<Mention> m_trueMentions

m_trueMentionsSorted

private boolean m_trueMentionsSorted

m_predMentions

private java.util.List<Mention> m_predMentions

m_defaultAligner

private Aligner<Mention> m_defaultAligner

m_predToTrueMention

private java.util.Map<Mention,Mention> m_predToTrueMention

m_caser

protected LBJ2.classify.Classifier m_caser

m_bNeedsCasing

protected boolean m_bNeedsCasing

m_phrases

private java.util.List<java.util.List<java.lang.String>> m_phrases

m_source

protected java.lang.String m_source

m_docType

protected java.lang.String m_docType

m_version

protected java.lang.String m_version

m_annotationAuthor

protected java.lang.String m_annotationAuthor

m_encoding

protected java.lang.String m_encoding

m_docID

protected java.lang.String m_docID

m_slug

protected java.lang.String m_slug

m_dateTime

protected java.lang.String m_dateTime

m_headline

protected java.lang.String m_headline

m_text

protected java.lang.String m_text

m_textStartCharNum

private int m_textStartCharNum

m_words

private java.util.List<java.lang.String> m_words

m_pos

private java.util.List<java.lang.String> m_pos

m_quoteNestLevel

private java.util.List<java.lang.Integer> m_quoteNestLevel

m_mentWordCounts

private java.util.Map<java.lang.String,java.lang.Integer> m_mentWordCounts

m_docWordCounts

private java.util.Map<java.lang.String,java.lang.Integer> m_docWordCounts

m_corpusWordCounts

private java.util.Map<java.lang.String,java.lang.Integer> m_corpusWordCounts

m_countingText

private java.lang.String m_countingText

m_headStartWordNumMentionMap

private java.util.Map<java.lang.Integer,java.util.Set<Mention>> m_headStartWordNumMentionMap

m_extentStartWordNumMentionMap

private java.util.Map<java.lang.Integer,java.util.Set<Mention>> m_extentStartWordNumMentionMap

m_charWordMap

private java.util.Map<java.lang.Integer,java.lang.Integer> m_charWordMap

m_nSents

private int m_nSents

m_wordNumSentNumMap

private java.util.Map<java.lang.Integer,java.lang.Integer> m_wordNumSentNumMap

m_wordNumCharNumMap

private java.util.Map<java.lang.Integer,java.lang.Integer> m_wordNumCharNumMap

m_sentenceMentionsPair

private java.util.Map<Pair<java.lang.Integer,java.lang.Integer>,Pair<java.util.List<Mention>,java.util.List<Mention>>> m_sentenceMentionsPair

m_mentsInSents

private java.util.List<java.util.List<Mention>> m_mentsInSents

m_mentionsContaining

private java.util.Map<Mention,java.util.Set<Mention>> m_mentionsContaining

m_bestMentionMap

private java.util.Map<Mention,Mention> m_bestMentionMap

m_headPredictionMap

private java.util.Map<java.lang.Integer,java.util.Map<java.lang.Integer,java.lang.Boolean>> m_headPredictionMap

m_cExMap

private java.util.Map<Pair<Mention,Mention>,CExample> m_cExMap

m_gExMap

private java.util.Map<Mention,GExample> m_gExMap

m_isCaseSensitive

private java.lang.Boolean m_isCaseSensitive

Constructor Detail

DocBase

public DocBase()

Basic constructor: Not recommended.

Method Detail

initMembersDefault

protected void initMembersDefault()

loadSGMText

public void loadSGMText(java.lang.String filename)

Parameters:: filename - The file containing the text of the document.
Throws:: XMLException

removeTagsAndExtraNL

protected java.lang.String removeTagsAndExtraNL(java.lang.String a)

loadPOSTags

protected void loadPOSTags(java.lang.String content)

Loads text that has been preprocessed. Derives word splits, sentence splits, and POS tags from the specified string.

Parameters:: content - The text annotated with part of speech tags.

loadPOSTaggerOutput

protected java.lang.String loadPOSTaggerOutput()

Loads the output of the SNoW-based POS tagger. Uses the SNoW-based POS tagger given on the command line. Should not be called until the text has been set.

Returns:: The POS-tagged content, or null on failure. The format of the output is as follows: (WORD POS) ... with one line per sentence.
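To illustrate the output format described above, the following sketch parses `(WORD POS)` pairs, one sentence per line, into parallel lists of words, tags, and zero-based sentence numbers. It is not the DocBase implementation: the class `PosOutputSketch`, its `Parsed` holder, and the assumption that words and tags contain no whitespace or parentheses are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the documented API.
class PosOutputSketch {
    // One "(WORD POS)" pair. Assumption: neither part contains whitespace or parentheses.
    private static final Pattern PAIR =
        Pattern.compile("\\(([^\\s()]+)\\s+([^\\s()]+)\\)");

    /** Parallel lists: the i-th word, its tag, and its zero-based sentence number. */
    static final class Parsed {
        final List<String> words = new ArrayList<>();
        final List<String> tags = new ArrayList<>();
        final List<Integer> sentNums = new ArrayList<>();
    }

    static Parsed parse(String content) {
        Parsed parsed = new Parsed();
        int sentNum = 0;
        for (String line : content.split("\\R")) {
            Matcher m = PAIR.matcher(line);
            boolean foundAny = false;
            while (m.find()) {
                parsed.words.add(m.group(1));
                parsed.tags.add(m.group(2));
                parsed.sentNums.add(sentNum);
                foundAny = true;
            }
            if (foundAny) {
                sentNum++; // lines without any pair are not counted as sentences
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        Parsed p = parse("(The DT) (dog NN) (barks VBZ)\n(It PRP) (runs VBZ)");
        System.out.println(p.words);
        System.out.println(p.tags);
        System.out.println(p.sentNums);
    }
}
```

The three lists produced here correspond in shape to the arguments of setWords(), setPOSTags(), and setSentenceNumbers(), although whether loadPOSTags() builds them this way is not stated in this documentation.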

loadChunkedText

protected void loadChunkedText(java.lang.String filename)

Loads text that has been preprocessed offline. Derives word splits, sentence splits, and POS tags from the specified file.

Parameters:: filename - The name of a file containing the chunked text.

translateEscaped

protected java.lang.String translateEscaped(java.lang.String escaped,
                                            int cursor)

Translates a round, square, or curly brace escaped as -LBR- or -RBR-, or a pair of quotes escaped as a double quote character.

Returns:: The translated brace, or escaped (the input) if no matching brace is recognized.

calcAndSetQuotes

public void calcAndSetQuotes()

Determines the location of quotes and sets them. Must be called after plain text and words are set.

loadFromText

public void loadFromText(java.lang.String plainText)

Builds the document from the given plain text, automatically splitting sentences, determining quote levels, determining part-of-speech tags, and splitting words by an automatic word-splitting algorithm. Mentions and entities will not be set here.

Parameters:: plainText - The text of the document.

loadFromText

public void loadFromText(java.lang.String plainText,
                         boolean doWordSplit,
                         boolean doPOSTag)

Builds the document from the given plain text, automatically splitting sentences, determining quote levels, determining part-of-speech tags, and either splitting words by whitespace or using a word-splitter. Mentions and entities will not be set here.

Parameters:: plainText - The text of the document.; doWordSplit - If true, words will be split by an automatic word-splitting algorithm; otherwise words will be assumed to be separated by whitespace.; doPOSTag - If true, POS tags will be generated by the LBJPOS algorithm. Otherwise, no tags will be set.

setPlainText

protected void setPlainText(java.lang.String text)

Should be set before words are set.

Parameters:: text - The plain text, used for determining character counts.

setWords

public void setWords(java.util.List<java.lang.String> words)

setWords

public void setWords(java.util.List<java.lang.String> words,
                     boolean backwardsCompatible)

Sets the words, aligns them with the plain text, and records statistics about them. Must be called after setPlainText() has been called.

Parameters:: words - The words (copied defensively).; backwardsCompatible - If true, attempts to alter the algorithm to conform to the behavior in a previously published paper.

setPOSTags

protected void setPOSTags(java.util.List<java.lang.String> tags)

Sets the POS tags. The number of POS tags must equal the number of words already set, so this method must be called after setWords().

Parameters:: tags - A list of POS tags, in the same order as the words (copied defensively).
Throws:: java.lang.IllegalArgumentException - if tags.size() != words.size()
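The contract above (size check, defensive copy, call order) can be sketched as follows. This is a minimal stand-alone illustration, not the DocBase source: the class `PosTagSetterSketch` and its fields are hypothetical, and the exception message is invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the words/tags portion of a document.
class PosTagSetterSketch {
    private List<String> words = new ArrayList<>();
    private List<String> tags = new ArrayList<>();

    void setWords(List<String> newWords) {
        words = new ArrayList<>(newWords); // defensive copy
    }

    void setPOSTags(List<String> newTags) {
        // Reject tag lists that do not line up one-to-one with the words.
        if (newTags.size() != words.size()) {
            throw new IllegalArgumentException(
                "Expected " + words.size() + " tags but got " + newTags.size());
        }
        tags = new ArrayList<>(newTags); // defensive copy
    }

    List<String> getPOS() {
        return Collections.unmodifiableList(tags);
    }

    public static void main(String[] args) {
        PosTagSetterSketch doc = new PosTagSetterSketch();
        doc.setWords(Arrays.asList("Dogs", "bark"));
        List<String> callerTags = new ArrayList<>(Arrays.asList("NNS", "VBP"));
        doc.setPOSTags(callerTags);
        callerTags.set(0, "XX"); // later changes by the caller do not reach the stored copy
        System.out.println(doc.getPOS()); // prints [NNS, VBP]
    }
}
```

Because the list is copied on the way in, the caller can reuse or modify its own list without affecting the stored tags.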

setQuoteLevels

public void setQuoteLevels(java.util.List<java.lang.Integer> quoteLevels)

Sets the quote levels, which indicate the number of nested quotations in which each word is embedded. The number of elements in the List should equal the number of words already set. Must be called after setWords().

Parameters:: quoteLevels - A list of quote levels, in the same order as the words (copied defensively).
Throws:: java.lang.IllegalArgumentException - if quoteLevels.size() != words.size()

setSentenceNumbers

protected void setSentenceNumbers(java.util.List<java.lang.Integer> sentNums)

Sets the sentence numbers for each word. The number of elements should equal the number of words already set. Must be called after setWords().

Parameters:: sentNums - A list of sentence numbers, in the same order as the words (copied defensively).
Throws:: java.lang.IllegalArgumentException - if sentNums.size() != words.size() or if sentNums is non-monotonic.
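The two failure conditions above can be illustrated with a small validator. This is a sketch under stated assumptions, not the actual implementation: `SentenceNumberCheckSketch` is a hypothetical name, "non-monotonic" is interpreted as "decreasing at some position", and the sentence count is assumed to be the last zero-based sentence number plus one.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical validator for per-word sentence numbers.
class SentenceNumberCheckSketch {
    /**
     * Checks that there is one sentence number per word and that the numbers
     * never decrease, then returns the assumed sentence count.
     */
    static int validate(List<Integer> sentNums, int numWords) {
        if (sentNums.size() != numWords) {
            throw new IllegalArgumentException(
                "Expected " + numWords + " sentence numbers but got " + sentNums.size());
        }
        for (int i = 1; i < sentNums.size(); i++) {
            if (sentNums.get(i) < sentNums.get(i - 1)) {
                throw new IllegalArgumentException(
                    "Sentence numbers decrease at word " + i);
            }
        }
        // Assumption: sentence numbers are zero-based, so count = last + 1.
        return sentNums.isEmpty() ? 0 : sentNums.get(sentNums.size() - 1) + 1;
    }

    public static void main(String[] args) {
        System.out.println(validate(Arrays.asList(0, 0, 1, 2), 4)); // prints 3
    }
}
```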

recordWordLocation

protected void recordWordLocation(int wn,
                                  int startCN,
                                  int endCN)

Records the fact that a word is located at characters startCN through endCN (inclusive).
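One way to picture how recorded word locations support the lookups documented for getWordNum(int) and getStartCharNum(int) is the sketch below. It is an assumption-laden illustration, not the DocBase source: the class `WordLocationSketch` is hypothetical, a `TreeMap` is chosen here for convenience, and returning -1 from `getStartCharNum` for an unknown word is an invented convention.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical index from character positions to word numbers and back.
class WordLocationSketch {
    private final TreeMap<Integer, Integer> charToWord = new TreeMap<>();
    private final Map<Integer, Integer> wordToStartChar = new HashMap<>();

    /** Records that word wn occupies characters startCN through endCN (inclusive). */
    void recordWordLocation(int wn, int startCN, int endCN) {
        for (int c = startCN; c <= endCN; c++) {
            charToWord.put(c, wn);
        }
        wordToStartChar.put(wn, startCN);
    }

    /** Word at charNum, else the closest word after charNum, else -1. */
    int getWordNum(int charNum) {
        Map.Entry<Integer, Integer> entry = charToWord.ceilingEntry(charNum);
        return entry == null ? -1 : entry.getValue();
    }

    /** Zero-based first character of the word, or -1 if unknown (assumed convention). */
    int getStartCharNum(int wordNum) {
        Integer start = wordToStartChar.get(wordNum);
        return start == null ? -1 : start;
    }

    public static void main(String[] args) {
        // Plain text "Hello world": "Hello" spans 0..4, "world" spans 6..10.
        WordLocationSketch index = new WordLocationSketch();
        index.recordWordLocation(0, 0, 4);
        index.recordWordLocation(1, 6, 10);
        System.out.println(index.getWordNum(3));  // prints 0
        System.out.println(index.getWordNum(5));  // prints 1 (the space maps to the next word)
        System.out.println(index.getWordNum(11)); // prints -1
    }
}
```

Storing every character of the inclusive range keeps the lookup to a single `ceilingEntry` call, at the cost of one map entry per character.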

getPlainText

public java.lang.String getPlainText()

Description copied from interface: Doc

Gets the text that is the basis for counting, including the start/end characters in Chunk objects.

Specified by:: getPlainText in interface Doc

Returns:: The plain text.

getDocID

public java.lang.String getDocID()

Description copied from interface: Doc

Gets the ID for this document, as a string.

Specified by:: getDocID in interface Doc

Returns:: The document ID.

isCaseSensitive

public boolean isCaseSensitive()

Description copied from interface: Doc

Indicates whether the document is case sensitive.

Specified by:: isCaseSensitive in interface Doc

Returns:: Whether the document is case sensitive.

getSentNum

public int getSentNum(int wordNum)

Description copied from interface: Doc

Gets the sentence number for the specified word.

Specified by:: getSentNum in interface Doc

Parameters:: wordNum - the zero-based position of the word whose sentence number is desired.
Returns:: The zero-based sentence number.

getNumSentences

public int getNumSentences()

Description copied from interface: Doc

Returns the number of sentences in the document.

Specified by:: getNumSentences in interface Doc

Returns:: The number of sentences.

setUsePredictedEntities

public void setUsePredictedEntities(boolean usePred)

Description copied from interface: Doc

Sets the preference for using predicted entities or true entities.

Specified by:: setUsePredictedEntities in interface Doc

Parameters:: usePred - if true, prefer to use predicted entities, otherwise, prefer true entities.

usePredictedEntities

public boolean usePredictedEntities()

Description copied from interface: Doc

Indicates whether requests for entities will return predicted entities or true entities.

Specified by:: usePredictedEntities in interface Doc

Returns:: Whether predicted or true entities are to be used.

getEntities

public java.util.List<Entity> getEntities()

Description copied from interface: Doc

Gets the entities, in no particular order. If Doc.usePredictedEntities() returns true and predicted entities are available, returns them; otherwise returns the true entities.

Specified by:: getEntities in interface Doc

Returns:: An unmodifiable view of the entities.
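The selection rule and the unmodifiable view described above can be sketched as follows. This is an illustration only: `EntityViewSketch` is a hypothetical class, strings stand in for `Entity` objects, and "available" is assumed to mean that the predicted list is non-empty.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical holder for true and predicted entities.
class EntityViewSketch {
    private final List<String> trueEntities = new ArrayList<>();
    private final List<String> predEntities = new ArrayList<>();
    private boolean usePred = false;

    void addTrueEntity(String e) { trueEntities.add(e); }
    void addPredEntity(String e) { predEntities.add(e); }
    void setUsePredictedEntities(boolean usePred) { this.usePred = usePred; }

    /** Predicted entities if preferred and available, otherwise true entities. */
    List<String> getEntities() {
        boolean predAvailable = !predEntities.isEmpty(); // assumed meaning of "available"
        List<String> chosen = (usePred && predAvailable) ? predEntities : trueEntities;
        return Collections.unmodifiableList(chosen); // read-only view, not a copy
    }

    public static void main(String[] args) {
        EntityViewSketch doc = new EntityViewSketch();
        doc.addTrueEntity("T1");
        doc.setUsePredictedEntities(true);
        System.out.println(doc.getEntities()); // prints [T1], no predictions yet
        doc.addPredEntity("P1");
        System.out.println(doc.getEntities()); // prints [P1]
    }
}
```

Callers that try to add to or remove from the returned list get an `UnsupportedOperationException`, which is the standard behavior of `Collections.unmodifiableList`.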

getPredEntities

public java.util.List<Entity> getPredEntities()

Description copied from interface: Doc

Gets a list of predicted entities, in no particular order.

Specified by:: getPredEntities in interface Doc

Returns:: An unmodifiable view of the predicted entities or an empty list.

getTrueEntities

public java.util.List<Entity> getTrueEntities()

Description copied from interface: Doc

Gets a list of true entities, in no particular order.

Specified by:: getTrueEntities in interface Doc

Returns:: An unmodifiable view of the true entities or an empty list.

getCorefChains

public ChainSolution<Mention> getCorefChains()

Description copied from interface: Doc

Gets the partition of mentions into coreferential sets.

Specified by:: getCorefChains in interface Doc

Returns:: A reference to the chain solution representing the predicted partitioning of mentions into entities, or null if none has been set.

getEntityFor

public Entity getEntityFor(Mention m)

Currently implemented slowly.

Specified by:: getEntityFor in interface Doc

Parameters:: m - The mention whose entity is desired.
Returns:: The entity containing m, or null if not found.

getEntityFor

public Entity getEntityFor(Mention m,
                           boolean usePred)

Currently implemented slowly.

Specified by:: getEntityFor in interface Doc

Parameters:: m - The mention whose entity is desired.; usePred - Whether to return a predicted entity or a true entity.
Returns:: The entity containing m, or null if the entity of the specified type is not available.

getEntityFor

protected Entity getEntityFor(Mention m,
                              java.util.List<Entity> entities)

Currently implemented slowly.

addTrueEntity

protected void addTrueEntity(Entity e)

Could be made public, but callers would then need to ensure that all of e's mentions are added.

setPredEntities

public void setPredEntities(ChainSolution<Mention> sol)

Description copied from interface: Doc

Sets the predicted entities to be those specified by sol. Entity IDs are automatically created, and each mention's setPredictedEntityID() method is called. Also sets usePredictedEntities to true. The entities are backed internally, but the mentions are not duplicated.

Specified by:: setPredEntities in interface Doc

Parameters:: sol - The partition of mentions from which to derive entities.

hasPredEntities

public boolean hasPredEntities()

Description copied from interface: Doc

Indicates whether predicted entities are available.

Specified by:: hasPredEntities in interface Doc

Returns:: Whether predicted entities have been set.

hasTrueEntities

public boolean hasTrueEntities()

Description copied from interface: Doc

Indicates whether true entities are available.

Specified by:: hasTrueEntities in interface Doc

Returns:: Whether true entities have been set.

addPredEntities

protected void addPredEntities(java.util.List<Entity> ents)

Backed internally.

getCExampleFor

public CExample getCExampleFor(Mention m1,
                               Mention m2)

Description copied from interface: Doc

Returns the unique CExample for the given pair of mentions in the given order. Doc is the head of a collection of related examples; as such, it needs to return the same CExample any time an inference-based classifier is used.

Specified by:: getCExampleFor in interface Doc

Parameters:: m1 - The first mention.; m2 - The second mention.
Returns:: The unique CExample referring to the ordered pair m1, m2.
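The "same example for the same ordered pair" guarantee amounts to memoization keyed on the pair. The sketch below shows one way this could look; it is not the DocBase source. `ExampleCacheSketch` and `PairExample` are hypothetical, strings stand in for `Mention` objects, and `AbstractMap.SimpleImmutableEntry` is used as the pair key in place of the library's own `Pair` type.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.HashMap;
import java.util.Map;

// Hypothetical cache that hands out one example object per ordered pair.
class ExampleCacheSketch {
    /** Stand-in for an example built from an ordered pair of mentions. */
    static final class PairExample {
        final String first;
        final String second;

        PairExample(String first, String second) {
            this.first = first;
            this.second = second;
        }
    }

    private final Map<Map.Entry<String, String>, PairExample> cache = new HashMap<>();

    /** Returns the same PairExample instance every time for the same (m1, m2). */
    PairExample getExampleFor(String m1, String m2) {
        // The entry's equals/hashCode depend on both elements in order,
        // so (m1, m2) and (m2, m1) are different keys.
        Map.Entry<String, String> key = new SimpleImmutableEntry<>(m1, m2);
        return cache.computeIfAbsent(key, k -> new PairExample(k.getKey(), k.getValue()));
    }

    public static void main(String[] args) {
        ExampleCacheSketch doc = new ExampleCacheSketch();
        PairExample a = doc.getExampleFor("Obama", "he");
        System.out.println(a == doc.getExampleFor("Obama", "he")); // prints true
        System.out.println(a == doc.getExampleFor("he", "Obama")); // prints false
    }
}
```

Returning the identical instance matters when several classifiers or an inference step attach state to the example, since all of them then see the same object.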

getGExampleFor

public GExample getGExampleFor(Mention m)

Description copied from interface: Doc

Returns the unique GExample for the given mention. Doc is the head of a collection of related examples; as such, it needs to return the same GExample any time an inference-based classifier is used.

Specified by:: getGExampleFor in interface Doc

Parameters:: m - The mention.
Returns:: The unique GExample referring to the mention m.

setUsePredictedMentions

public void setUsePredictedMentions(boolean usePred)

Description copied from interface: Doc

Sets the preference for using predicted mentions or true mentions.

Specified by:: setUsePredictedMentions in interface Doc

Parameters:: usePred - if true, prefer to use predicted mentions, otherwise, prefer true mentions.

usePredictedMentions

public boolean usePredictedMentions()

Description copied from interface: Doc

Indicates whether requests for mentions will return predicted mentions or true mentions.

Specified by:: usePredictedMentions in interface Doc

Returns:: Whether predicted or true mentions are to be used.

getMentions

public java.util.List<Mention> getMentions()

Description copied from interface: Doc

Gets the mentions of the document, sorted (typically in document order). Returns predicted mentions or true mentions depending on the result of usePredictedMentions().

Specified by:: getMentions in interface Doc

Returns:: mentions sorted by their natural ordering (usually document ordering).

getPredMentions

public java.util.List<Mention> getPredMentions()

Description copied from interface: Doc

Gets a sorted list of predicted mentions.

Specified by:: getPredMentions in interface Doc

Returns:: sorted predicted mentions, or an empty list if none available.

getTrueMentions

public java.util.List<Mention> getTrueMentions()

Description copied from interface: Doc

Gets a sorted list of true mentions.

Specified by:: getTrueMentions in interface Doc

Returns:: sorted true mentions, or an empty list if none available.

hasPredMentions

public boolean hasPredMentions()

Description copied from interface: Doc

Indicates whether predicted mentions have been set.

Specified by:: hasPredMentions in interface Doc

Returns:: Whether predicted mentions have been set.

hasTrueMentions

public boolean hasTrueMentions()

Description copied from interface: Doc

Indicates whether true mentions have been set.

Specified by:: hasTrueMentions in interface Doc

Returns:: Whether true mentions have been set.

setPredictedMentions

public void setPredictedMentions(java.util.Collection<Mention> ments)

Description copied from interface: Doc

Sets the predicted mentions and records a preference for using them.

Specified by:: setPredictedMentions in interface Doc

Parameters:: ments - The predicted mentions (copied defensively).

alignPredMentsToTrue

protected void alignPredMentsToTrue()

addTrueMention

protected void addTrueMention(Mention m)

getNumMentions

public int getNumMentions()

sortTrueMentions

protected void sortTrueMentions()

Sorts true mentions in natural order, which is the textual order by default.

See Also:: Mention.compareTo(Mention)

sortPredictedMentions

protected void sortPredictedMentions()

Sorts predicted mentions in natural order, which is the textual order by default.

See Also:: Mention.compareTo(Mention)

getMention

public Mention getMention(int n)

getPredMention

public Mention getPredMention(int n)

getTrueMention

public Mention getTrueMention(int n)

getTrueMentionFor

public Mention getTrueMentionFor(Mention pred)

Description copied from interface: Doc

Gets the true mention aligned with the specified mention.

Specified by:: getTrueMentionFor in interface Doc

Parameters:: pred - A predicted mention.
Returns:: The true mention aligned with pred.

getBestMentionFor

public Mention getBestMentionFor(Mention m)

Description copied from interface: Doc

Gets the canonical mention of the entity containing m.

Specified by:: getBestMentionFor in interface Doc

Parameters:: m - A mention.
Returns:: The canonical mention for m.

getMentionsWithHeadStartingAt

public java.util.Set<Mention> getMentionsWithHeadStartingAt(int startWord)

Description copied from interface: Doc

Returns the set of mentions whose heads start at the specified word number, or an empty set if none are found. The returned set may or may not be backed by the document's internal data: no guarantees are made (yet).

Specified by:: getMentionsWithHeadStartingAt in interface Doc

Parameters:: startWord - A word number.
Returns:: The set of mentions whose heads start at startWord.

getMentionsWithExtentStartingAt

public java.util.Set<Mention> getMentionsWithExtentStartingAt(int startWord)

Description copied from interface: Doc

Returns the set of mentions whose extents start at the specified word number, or an empty set if none are found. The returned set may or may not be backed by the document's internal data: no guarantees are made (yet).

Specified by:: getMentionsWithExtentStartingAt in interface Doc

Parameters:: startWord - A word number.
Returns:: The set of mentions whose extents start at startWord.

getMentionsContainedIn

public java.util.Set<Mention> getMentionsContainedIn(Mention m)

Description copied from interface: Doc

Gets the set of mentions whose head is entirely contained within a specified mention's extent, including the specified mention itself. Returns predicted or true mentions according to the result of getMentions().

Specified by:: getMentionsContainedIn in interface Doc

Parameters:: m - The specified mention.
Returns:: The set of mentions contained in m.
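
The containment rule above compares another mention's head against the specified mention's extent. A minimal sketch of that test follows, assuming inclusive word-number spans; ContainmentSketch and Span are hypothetical names and do not reflect the actual Mention API.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the head-inside-extent containment test. */
class ContainmentSketch {
    /** Stand-in for Mention: inclusive word spans for head and extent. */
    static final class Span {
        final String text;
        final int headStart, headEnd, extentStart, extentEnd;
        Span(String text, int headStart, int headEnd, int extentStart, int extentEnd) {
            this.text = text;
            this.headStart = headStart;
            this.headEnd = headEnd;
            this.extentStart = extentStart;
            this.extentEnd = extentEnd;
        }
    }

    /** Collects every span whose head lies entirely inside outer's extent. */
    static Set<Span> containedIn(Span outer, List<Span> all) {
        Set<Span> result = new LinkedHashSet<>();
        for (Span s : all) {
            if (s.headStart >= outer.extentStart && s.headEnd <= outer.extentEnd) {
                result.add(s);
            }
        }
        return result;
    }
}
```

As long as a mention's head lies within its own extent, the specified mention passes the test itself, which matches the "including the specified mention itself" wording.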

getMentionsContaining

public java.util.Set<Mention> getMentionsContaining(Mention m)

Description copied from interface: Doc

Gets the set of mentions whose extents entirely contain a specified mention's extent, including the specified mention itself. Returns predicted or true mentions according to the result of getMentions().

Specified by:: getMentionsContaining in interface Doc

Parameters:: m - The specified mention.
Returns:: The Set of Mention objects whose extents contain (or are equal to) the extent of m. Returns predicted or true mentions according to what getMentions() returns.

buildMentionsContaining

protected void buildMentionsContaining()

getMentionsInSent

public java.util.List<Mention> getMentionsInSent(int sentNum)

Description copied from interface: Doc

Gets a list of the mentions in a specified sentence in order. Returns true or predicted mentions according to the value of usePredictedMentions().

Specified by:: getMentionsInSent in interface Doc

Parameters:: sentNum - The number of the specified sentence.
Returns:: A List of the mentions in the specified sentence, in the order that they appear in the sentence.

buildMentionsInSents

protected void buildMentionsInSents()

getMentionsInSentences

public Pair<java.util.List<Mention>,java.util.List<Mention>> getMentionsInSentences(int s1,
                                                                                    int s2)

Description copied from interface: Doc

Gets a pair of lists of mentions, one for each of the two specified sentences. Each list contains all the mentions in its sentence.

Specified by:: getMentionsInSentences in interface Doc

Parameters:: s1 - The number of the first sentence.; s2 - The number of the second sentence.
Returns:: A pair of lists of mentions, one for each sentence.

makeChunk

public Chunk makeChunk(int startWord,
                       int endWord)

Description copied from interface: Doc

Create a chunk spanning the specified words in this document.

Specified by:: makeChunk in interface Doc

Parameters:: startWord - The position of the first word in desired chunk.; endWord - The position of the last word in the desired chunk.
Returns:: The desired chunk.

getWords

public java.util.List<java.lang.String> getWords()

Description copied from interface: Doc

Gets a list of the surface forms of the words of the document.

Specified by:: getWords in interface Doc

Returns:: A list of Strings of words, in the order they appear.

getWord

public java.lang.String getWord(int wordNum)

Description copied from interface: Doc

Gets the specified word.

Specified by:: getWord in interface Doc

Parameters:: wordNum - The position of the specified word (as an index into a List).
Returns:: The word at position wordNum, as a string.

getPOS

public java.util.List<java.lang.String> getPOS()

Description copied from interface: Doc

Gets a list of the Part-Of-Speech tags for the words of the document. The tag set is that output by the LBJ POS tagger.

Specified by:: getPOS in interface Doc

Returns:: A list of Part-Of-Speech tags corresponding to the words of the document.
See Also:: POSTagger

getPOS

public java.lang.String getPOS(int posNum)

Description copied from interface: Doc

Gets the Part-Of-Speech tag for the word at the posNum position in the document.

Specified by:: getPOS in interface Doc

Parameters:: posNum - The position of the word whose POS tag should be returned.
Returns:: The Part-Of-Speech tag for the desired word position.
See Also:: POSTagger

getWordNum

public int getWordNum(int charNum)

Description copied from interface: Doc

Determines the zero-based word number of the word at charNum. If no word is at charNum, returns the word number of the closest word appearing after charNum. If no such word exists, returns -1.

Specified by:: getWordNum in interface Doc

Parameters:: charNum - The character number.
Returns:: The word number corresponding to the specified character number.
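
The lookup rule above (the word at charNum, else the next word, else -1) can be sketched with a sorted map keyed by the offset of each word's last character. This is an illustration only: CharToWordSketch and its whitespace tokenization are assumptions, and the real DocBase may build its character-to-word map differently.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of a character-offset to word-number lookup. */
class CharToWordSketch {
    // Maps the offset of each word's LAST character to its word number.
    private final TreeMap<Integer, Integer> lastCharToWord = new TreeMap<>();

    /** Splits on whitespace (an assumption) and records each word's span. */
    CharToWordSketch(String text) {
        int wordNum = 0;
        int i = 0;
        while (i < text.length()) {
            if (Character.isWhitespace(text.charAt(i))) {
                i++;
                continue;
            }
            int end = i;
            while (end + 1 < text.length() && !Character.isWhitespace(text.charAt(end + 1))) {
                end++;
            }
            lastCharToWord.put(end, wordNum++);
            i = end + 1;
        }
    }

    /** Word at charNum, else the closest word after it, else -1. */
    int getWordNum(int charNum) {
        // The first word ending at or after charNum either covers it or follows it.
        Map.Entry<Integer, Integer> e = lastCharToWord.ceilingEntry(charNum);
        return e == null ? -1 : e.getValue();
    }
}
```

Keying on the last character lets a single ceilingEntry call cover both cases: offsets inside a word and offsets in the gap before it.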

getTextFirstWordNum

public int getTextFirstWordNum()

Description copied from interface: Doc

Gets the word number of the first word in the main text of the document (as distinguished from headlines and metadata that may be included in the plain text).

Specified by:: getTextFirstWordNum in interface Doc

Returns:: The word number of the first word in the main text.

getStartCharNum

public int getStartCharNum(int wordNum)

Description copied from interface: Doc

Gets the zero-based position of the first character of a word.

Specified by:: getStartCharNum in interface Doc

Parameters:: wordNum - The zero-based position of the word in the document.
Returns:: The zero-based position, within the plain text, of the first character of the word, or -1 if wordNum is invalid.

getQuoteNestLevel

public int getQuoteNestLevel(int wordNum)

Description copied from interface: Doc

Indicates the number of nested quotes the specified word is in. 0 is the base level of the text.

Specified by:: getQuoteNestLevel in interface Doc

Parameters:: wordNum - The position of the specified word.
Returns:: The number of nested quotes.
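
One way to derive such levels is a single pass that tracks quotation depth. The sketch below is an assumption-laden illustration, not the library's algorithm: it presumes Penn Treebank style tokens, where `` opens and '' closes a quotation, and it places the quote marks themselves at the outer level. QuoteLevelSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: per-word quote nesting depth in one pass. */
class QuoteLevelSketch {
    /** Assumes "``" opens and "''" closes a quotation. */
    static List<Integer> quoteLevels(List<String> words) {
        List<Integer> levels = new ArrayList<>();
        int depth = 0;
        for (String w : words) {
            // A closing mark leaves the quotation before it is recorded.
            if (w.equals("''") && depth > 0) {
                depth--;
            }
            levels.add(depth);
            // Words after an opening mark are one level deeper.
            if (w.equals("``")) {
                depth++;
            }
        }
        return levels;
    }
}
```

With this convention, unquoted text sits at level 0, which agrees with the base level described above.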

getInverseTrueHeadFreq

public double getInverseTrueHeadFreq(int wordNum)

Description copied from interface: Doc

Gets the inverse true head frequency of the word at the specified position.

Specified by:: getInverseTrueHeadFreq in interface Doc

Parameters:: wordNum - The position in the document of the specified word.
Returns:: The inverse true head frequency of the specified word, or 1.0 if the word is not in a true head.
See Also:: Doc.getInverseTrueHeadFreq(String)

getInverseTrueHeadFreq

public double getInverseTrueHeadFreq(java.lang.String word)

Description copied from interface: Doc

Gets the inverse of the number of occurrences of the specified word in the heads of the true mentions in the document.

Specified by:: getInverseTrueHeadFreq in interface Doc

Parameters:: word - The specified word.
Returns:: The inverse true head frequency of the specified word, or 1.0 if the word is not found in any heads.

getInDocInverseFreq

public double getInDocInverseFreq(java.lang.String word)

Description copied from interface: Doc

Gets the inverse of the number of occurrences of the specified word in the document. Not normalized.

Specified by:: getInDocInverseFreq in interface Doc

Parameters:: word - The specified word.
Returns:: The inverse of the number of times the word occurs in the document, or 1.0 if the word does not occur.
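
The unnormalized inverse frequency described above is simply 1/count, with 1.0 as the fallback for unseen words. A minimal sketch, using the hypothetical helper class InverseFreqSketch and a plain word-count map rather than DocBase's internal fields:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of an unnormalized inverse word frequency. */
class InverseFreqSketch {
    /** Counts occurrences of each word. */
    static Map<String, Integer> count(Iterable<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    /** Returns 1/count, or 1.0 when the word was never seen. */
    static double inverseFreq(Map<String, Integer> counts, String word) {
        Integer c = counts.get(word);
        return (c == null || c == 0) ? 1.0 : 1.0 / c;
    }
}
```

Note that under this definition a word occurring exactly once and a word that never occurs both yield 1.0, so callers cannot tell the two cases apart from the return value alone.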

getInCorpusInverseFreq

public double getInCorpusInverseFreq(java.lang.String word)

Description copied from interface: Doc

Gets the inverse of the number of occurrences of the specified word in the corpus. Not normalized.

Specified by:: getInCorpusInverseFreq in interface Doc

Parameters:: word - The specified word.
Returns:: The inverse of the number of times the word occurs in the corpus, or 1.0 if the word does not occur.

getWholeDocCounts

public java.util.Map<java.lang.String,java.lang.Integer> getWholeDocCounts()

Description copied from interface: Doc

Gets the counts for the words in the document. Returns a copy, which may be slow or space consuming.

Specified by:: getWholeDocCounts in interface Doc

Returns:: A map from words to counts of words in the document.
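
Returning a copy protects the document's internal counts from outside modification, at the cost noted above. The sketch below illustrates the pattern with a hypothetical CountsCopySketch class; the field and method bodies are assumptions, not DocBase's source.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a getter that returns a defensive copy. */
class CountsCopySketch {
    private final Map<String, Integer> docWordCounts = new HashMap<>();

    /** Adds one occurrence of the word to the internal counts. */
    void record(String word) {
        docWordCounts.merge(word, 1, Integer::sum);
    }

    /** Copies the map so callers cannot alter the internal state. */
    Map<String, Integer> getWholeDocCounts() {
        return new HashMap<>(docWordCounts);
    }
}
```

The copy is linear in the number of distinct words, so callers that need the counts repeatedly may prefer to hold on to one returned map.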

setCorpusCounts

public void setCorpusCounts(java.util.Map<java.lang.String,java.lang.Integer> counts)

Description copied from interface: Doc

Sets the corpus counts for the words in the corpus. Makes a copy of the map, which may be slow or space consuming.

Specified by:: setCorpusCounts in interface Doc

Parameters:: counts - A map from words to counts of words in the corpus.

getNumRelations

public int getNumRelations()

Description copied from interface: Doc

Gets the number of relations.

Specified by:: getNumRelations in interface Doc

Returns:: The number of relations.

getRelation

public Relation getRelation(int number)

Description copied from interface: Doc

Gets the specified relation. Relations are not yet emphasized.

Specified by:: getRelation in interface Doc

Parameters:: number - the number of the desired relation.
Returns:: The desired relation.

addRelation

protected void addRelation(Relation r)

hasHeadPrediction

public boolean hasHeadPrediction(int firstWN,
                                 int lastWN)

Checks whether a prediction has been stored for the word sequence spanning the closed interval [firstWN, lastWN]. (Does NOT return whether the sequence is a head.)

getHeadPrediction

public boolean getHeadPrediction(int firstWN,
                                 int lastWN)

addHeadPrediction

public void addHeadPrediction(int firstWN,
                              int lastWN,
                              boolean pred)
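
The three head-prediction methods separate "has a prediction been stored?" from "what was the prediction?". A minimal sketch of that distinction follows, assuming a map keyed by the closed interval; HeadPredictionSketch is a hypothetical class, and throwing when no prediction is stored is an assumption, since this page does not document that case.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a store of head predictions keyed by word interval. */
class HeadPredictionSketch {
    private final Map<List<Integer>, Boolean> predictions = new HashMap<>();

    /** Builds a key for the closed interval [firstWN, lastWN]. */
    private static List<Integer> key(int firstWN, int lastWN) {
        return Arrays.asList(firstWN, lastWN);
    }

    void addHeadPrediction(int firstWN, int lastWN, boolean pred) {
        predictions.put(key(firstWN, lastWN), pred);
    }

    /** True if any prediction (positive or negative) is stored. */
    boolean hasHeadPrediction(int firstWN, int lastWN) {
        return predictions.containsKey(key(firstWN, lastWN));
    }

    /** Returns the stored prediction; failing on a missing entry is an assumption. */
    boolean getHeadPrediction(int firstWN, int lastWN) {
        Boolean pred = predictions.get(key(firstWN, lastWN));
        if (pred == null) {
            throw new IllegalStateException("no prediction stored for this interval");
        }
        return pred;
    }
}
```

A caller would typically check hasHeadPrediction first, since a stored negative prediction and a missing prediction are different states.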

toString

public java.lang.String toString()

Overrides:: toString in class java.lang.Object

toAnnotatedString

public java.lang.String toAnnotatedString(boolean showPOS,
                                          boolean showMTypes,
                                          boolean showETypes,
                                          boolean showEIDs)

Description copied from interface: Doc

Gets the document as a string annotated with mention boundaries: square brackets mark true mentions, asterisks mark false alarms, and angle brackets mark missed mentions. Optionally, the string is also annotated with Part-Of-Speech tags, mention types, entity types, and entity IDs. Predicted entity IDs will be shown if available.

Specified by:: toAnnotatedString in interface Doc

Parameters:: showPOS - Whether the Part-Of-Speech tags should be shown.; showMTypes - Whether mention types should be shown.; showETypes - Whether entity types should be shown.; showEIDs - Whether entity IDs should be shown.
Returns:: The text of the document, annotated.

toAnnotatedString

public java.lang.String toAnnotatedString(boolean showPOS)

Description copied from interface: Doc

Specified by:: toAnnotatedString in interface Doc

Parameters:: showPOS - Whether the Part-Of-Speech tags should be shown.
Returns:: The text of the document, annotated.

toSubstituteString

public java.lang.String toSubstituteString()

Description copied from interface: Doc

Gets the document as a string where each mention has been replaced by the most specific mention coreferential with it.

Specified by:: toSubstituteString in interface Doc

Returns:: The text of the document with the extent of each mention replaced by the most specific mention in its entity. Uses the mentions supplied by getMentions().

makeBestMentionMap

protected java.util.Map<Mention,Mention> makeBestMentionMap()

getCoherenceInfo

public java.util.Map<Entity,java.util.Map<java.lang.Integer,java.lang.String>> getCoherenceInfo(boolean usePred)

Description copied from interface: Doc

Gets a grid indicating the mention type for each combination of entity and sentence. If a mention is predicted to belong to its true entity, its mention type will be uppercase. If it is predicted to be in the wrong entity (due to a coreference mistake), its mention type will be lowercase and the mention's entity ID will be appended after its mention type.

Specified by:: getCoherenceInfo in interface Doc

Parameters:: usePred - Whether predicted entities should be used.
Returns:: A map from entities to a map from sentence numbers to strings, representing the grid described above.

getCoherenceInfo

public java.util.Map<Entity,java.util.Map<java.lang.Integer,java.lang.String>> getCoherenceInfo()

Description copied from interface: Doc

Gets the coherence info using the value of usePredictedEntities() to determine whether to use predicted entities.

Specified by:: getCoherenceInfo in interface Doc

Returns:: Coherence info as described in the one parameter version of this method.

toCoherenceTableString

public java.lang.String toCoherenceTableString(boolean usePred)

Description copied from interface: Doc

Gets the coherence grid represented as a string, laid out in a grid.

Specified by:: toCoherenceTableString in interface Doc

Returns:: A coherence grid as a string.
See Also:: Doc.getCoherenceInfo()

toCoherenceTableString

public java.lang.String toCoherenceTableString()

Description copied from interface: Doc

Gets the coherence grid represented as a string, laid out in a grid. Predicted entities will be used as determined by the value of usePredictedEntities().

Specified by:: toCoherenceTableString in interface Doc

Returns:: A coherence grid as a string.
See Also:: Doc.getCoherenceInfo()

repeat

protected java.lang.String repeat(java.lang.String s,
                                  int n)

sortEntitiesByListOrder

protected java.util.List<Entity> sortEntitiesByListOrder(java.util.List<Entity> ents,
                                                         java.util.List<Entity> ordered)

Does NOT modify in place (but this may change).

getShortEID

public java.lang.String getShortEID(java.lang.String longID)

save

public void save()
          throws java.io.IOException

Description copied from interface: Doc

Writes the document to a file using serialization.

Specified by:: save in interface Doc

Throws:: java.io.IOException

write

public void write(boolean usePredictions)

Description copied from interface: Doc

Writes this Doc in the appropriate format.

Specified by:: write in interface Doc

Parameters:: usePredictions - Whether predicted mentions and entities should be written.

write

public abstract void write(java.lang.String filename,
                           boolean usePredictions)

Description copied from interface: Doc

Writes this Doc in the appropriate format.

Specified by:: write in interface Doc

Parameters:: filename - The name of the target file.; usePredictions - Whether predicted mentions and entities should be written.

printChunkValidity

public static void printChunkValidity()

Verify that all mentions start and end on phrase boundaries.

