edu.illinois.cs.cogcomp.lbj.coref.io.loaders
Class DocLoader

java.lang.Object
  extended by edu.illinois.cs.cogcomp.lbj.coref.io.loaders.DocLoader
Direct Known Subclasses:
DocAPFLoader, DocFromTextLoader, DocPlainTextLoader

public abstract class DocLoader
extends java.lang.Object

Loads a corpus of documents.

To load documents, construct a subclass of this class and then call loadDocs(), or call loadDoc(java.lang.String) with the type of input that subclass expects (see the relevant subclass for details).

To get the default document loader (which currently loads documents from annotated .apf.xml files), use getDefaultLoader(java.lang.String).


Field Summary
protected  LBJ2.classify.Classifier m_caser
          Classifier that decides the true case (uppercase, etc) of text.
protected  java.lang.String m_fileListFN
          Name of file containing list of document filenames, one per line.
protected  MentionDecoder m_mdDecoder
          Decoder that extracts predicted mentions from a document.
protected  LBJ2.classify.Classifier m_mTypeClassifier
          Classifier that determines the mention types of a mention.
 
Constructor Summary
DocLoader()
          Default constructor.
DocLoader(MentionDecoder mentionDecoder, LBJ2.classify.Classifier mTyper)
          Construct a loader for use when no file is used.
DocLoader(java.lang.String fileListFN)
          Construct a loader that loads a list of documents from a file.
DocLoader(java.lang.String fileListFN, MentionDecoder mentionDecoder, LBJ2.classify.Classifier mTyper)
          Construct a loader that loads a list of documents from a file.
 
Method Summary
protected abstract  Doc createDoc(java.lang.String inputString)
          Create a document from the given string, treating inputString as a filename or as text depending on the subclass.
static DocLoader getDefaultLoader()
          Gets the default loader.
static DocLoader getDefaultLoader(java.lang.String fileList)
          Gets the default loader.
 java.lang.String[] getFilenames(java.lang.String fileListFN)
          Opens the given file and reads a list of filenames from it, one per line.
protected  java.util.List<Mention> getPredMents(Doc doc)
          Predicts mentions using the mention decoder, sets the mention types predicted by the mention type classifier, and sets entity types using the entity type feature.
 Doc loadDoc(java.lang.String inputString)
          Loads a document.
 java.util.List<Doc> loadDocs()
          Load all the documents using filename and utilities already set.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

m_caser

protected LBJ2.classify.Classifier m_caser
Classifier that decides the true case (uppercase, etc) of text. Not currently used.


m_fileListFN

protected java.lang.String m_fileListFN
Name of file containing list of document filenames, one per line.


m_mdDecoder

protected MentionDecoder m_mdDecoder
Decoder that extracts predicted mentions from a document.


m_mTypeClassifier

protected LBJ2.classify.Classifier m_mTypeClassifier
Classifier that determines the mention types of a mention. Takes Mention objects as input and returns the type as a string, "NAM", "NOM", "PRE", or "PRO".

Constructor Detail

DocLoader

public DocLoader(java.lang.String fileListFN,
                 MentionDecoder mentionDecoder,
                 LBJ2.classify.Classifier mTyper)
Construct a loader that loads a list of documents from a file. The file contains a list of filenames, one per line. Mentions will be predicted using the provided decoders and classifiers.

Parameters:
fileListFN - The name of the corpus file, containing a list of document filenames, one per line.
mentionDecoder - The mention decoder extracts mentions from a document.
mTyper - Determines the mention types of each mention. Takes Mention objects as input and returns the type as a string, "NAM", "NOM", "PRE", or "PRO".

DocLoader

public DocLoader(MentionDecoder mentionDecoder,
                 LBJ2.classify.Classifier mTyper)
Construct a loader for use when no file containing a list of document filenames is used. In this case, do not call loadDocs(); instead, call loadDoc(String inputString) using the text as the input. Mentions will be predicted using the provided decoder and classifier.

Parameters:
mentionDecoder - The mention decoder extracts mentions from a document.
mTyper - Determines the mention types of each mention. Takes Mention objects as input and returns the type as a string, "NAM", "NOM", "PRE", or "PRO".

DocLoader

public DocLoader(java.lang.String fileListFN)
Construct a loader that loads a list of documents from a file. The file contains a list of filenames, one per line. The resulting documents will have true mentions but no predicted mentions.

Parameters:
fileListFN - The name of the corpus file, containing a list of document filenames, one per line.

DocLoader

public DocLoader()
Default constructor. For use when no file is used. In this case, do not call loadDocs(), but rather call loadDoc(String inputString) using the text as the input.

Method Detail

loadDocs

public java.util.List<Doc> loadDocs()
Load all the documents using filename and utilities already set.

Returns:
A list of documents, possibly empty if IO problems occur.

loadDoc

public Doc loadDoc(java.lang.String inputString)
Loads a document. Delegates to the createDoc method, which may treat inputString as a filename or as text.

Parameters:
inputString - The filename or text, depending on the subclass. If a filename, it may end with the appropriate extension.
Returns:
a document corresponding to the inputString, either representing the text of inputString or saved in the file named by inputString

createDoc

protected abstract Doc createDoc(java.lang.String inputString)
Create a document from the given string, treating inputString as a filename or as text depending on the subclass.

Parameters:
inputString - The filename or text, depending on the subclass. If a filename, it may end with the appropriate extension.
Returns:
a document corresponding to the inputString, either representing the text of inputString or saved in the file named by inputString
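
The delegation from loadDoc to createDoc described above can be illustrated with a minimal, self-contained sketch. It does not use the library: SketchDoc, SketchLoader, SketchTextLoader, and loadAll are hypothetical stand-ins invented for this example, and the real DocLoader may do additional work (such as predicting mentions) that is omitted here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the library's Doc class; it holds only the text.
class SketchDoc {
    final String text;
    SketchDoc(String text) { this.text = text; }
}

// Hypothetical stand-in that mirrors the shape of DocLoader:
// the public loadDoc method delegates to the abstract createDoc method.
abstract class SketchLoader {
    // Each subclass decides whether inputString is a filename or raw text.
    protected abstract SketchDoc createDoc(String inputString);

    public SketchDoc loadDoc(String inputString) {
        return createDoc(inputString);
    }

    // Assumed helper (not part of the documented API): loads several inputs in order.
    public List<SketchDoc> loadAll(List<String> inputs) {
        List<SketchDoc> docs = new ArrayList<>();
        for (String input : inputs) {
            docs.add(loadDoc(input));
        }
        return docs;
    }
}

// A subclass that treats the input string as the document text itself.
class SketchTextLoader extends SketchLoader {
    @Override
    protected SketchDoc createDoc(String inputString) {
        return new SketchDoc(inputString);
    }
}

class LoaderSketchDemo {
    public static void main(String[] args) {
        SketchLoader loader = new SketchTextLoader();
        SketchDoc doc = loader.loadDoc("John saw Mary. He waved.");
        System.out.println(doc.text);
    }
}
```

Callers see the same loadDoc signature regardless of which subclass they hold, which is presumably why the interpretation of inputString is left to the subclass.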

getFilenames

public java.lang.String[] getFilenames(java.lang.String fileListFN)
Opens the given file and reads a list of filenames from it, one per line.

Parameters:
fileListFN - The name of a file, relative to the "fileLists" directory in the classpath, containing a list of filenames.
Returns:
An array of strings corresponding to filenames read from the specified file, or an empty array on failure.
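
As a rough illustration of the contract above (one filename per line, empty array on failure), here is a self-contained sketch. It is not the library's implementation: FileListSketch and readFilenames are hypothetical names, the sketch reads from an ordinary filesystem path rather than resolving against the "fileLists" directory in the classpath, and skipping blank lines is an assumption.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FileListSketch {
    // Reads one filename per line from a plain filesystem path.
    // Returns an empty array if the file cannot be read.
    static String[] readFilenames(String fileListPath) {
        List<String> names = new ArrayList<>();
        try {
            List<String> lines =
                Files.readAllLines(Paths.get(fileListPath), StandardCharsets.UTF_8);
            for (String line : lines) {
                String trimmed = line.trim();
                // Assumption: blank lines are ignored.
                if (!trimmed.isEmpty()) {
                    names.add(trimmed);
                }
            }
        } catch (IOException e) {
            return new String[0];
        }
        return names.toArray(new String[0]);
    }

    public static void main(String[] args) throws IOException {
        // Write a temporary file list, read it back, then clean up.
        Path list = Files.createTempFile("fileList", ".txt");
        Files.write(list, Arrays.asList("doc1.apf.xml", "doc2.apf.xml"),
                    StandardCharsets.UTF_8);
        for (String name : readFilenames(list.toString())) {
            System.out.println(name);
        }
        Files.delete(list);
    }
}
```

Because failure yields an empty array rather than an exception, a caller that needs to distinguish "empty list" from "unreadable file" would have to check for the file's existence separately.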

getPredMents

protected java.util.List<Mention> getPredMents(Doc doc)
Predicts mentions using the mention decoder, sets the mention types predicted by the mention type classifier, and sets entity types using the entity type feature. To be called by the loadDoc or loadDocs methods.

Parameters:
doc - The document whose mentions should be predicted.
Returns:
The predicted mentions.

getDefaultLoader

public static DocLoader getDefaultLoader(java.lang.String fileList)
Gets the default loader. This version is used when a list of files is specified.

Parameters:
fileList - The name of the file list (see the DocAPFLoader constructor).
Returns:
the default DocLoader, which is currently DocAPFLoader.

getDefaultLoader

public static DocLoader getDefaultLoader()
Gets the default loader. This version is used when the loader does not take parameters.

Returns:
the default DocLoader, which is currently DocAPFLoader.