public class WordsInDocumentByDirectory extends Object implements edu.illinois.cs.cogcomp.lbjava.parse.Parser
This parser returns arrays of Strings, each representing all the words in a
document. It assumes the documents can be found in the subdirectories of a
user-supplied directory. The name of each subdirectory becomes the label for
every document it contains. That label appears as the first element of the
returned array. The documents are returned in a randomized order by default,
but that behavior is configurable. The "words" in each document are computed
simply by splitting on whitespace.

Modifier and Type | Field and Description |
---|---|
protected List<File> | files: The list of all files to be parsed. |
protected int | filesIndex: Points to the next element of files to be parsed. |
Constructor and Description |
---|
WordsInDocumentByDirectory(String directory): Creates a new parser that reads all subdirectories and randomizes the order in which their contents are returned. |
WordsInDocumentByDirectory(String directory, String[] exceptions): Creates a new parser that reads all subdirectories except for the named exceptions and randomizes the order in which their contents are returned. |
WordsInDocumentByDirectory(String directory, String[] exceptions, boolean shuffle): Creates a new parser that reads all subdirectories except for the named exceptions. |
WordsInDocumentByDirectory(String directory, String[] exceptions, boolean shuffle, long seed): Creates a new parser that reads all subdirectories except for the named exceptions. |
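
The constructor options above (excluded subdirectories, shuffling, and a seed) can be illustrated with a standalone sketch. This is not the library's implementation: the class name `LabeledFileCollector` and its methods `collect` and `labelOf` are hypothetical, the code uses only the JDK, and it assumes that documents sit exactly one level below the label subdirectories and that a seed of -1 leaves the random number generator unseeded.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical helper, not part of LBJava: mimics the behavior the
// constructor documentation describes.
public class LabeledFileCollector {
    /**
     * Lists every regular file found one level below {@code directory},
     * skipping subdirectories named in {@code exceptions} (may be null).
     */
    public static List<File> collect(String directory, String[] exceptions,
                                     boolean shuffle, long seed) {
        Set<String> skipped = new HashSet<>();
        if (exceptions != null) skipped.addAll(Arrays.asList(exceptions));

        List<File> files = new ArrayList<>();
        File[] subdirectories = new File(directory).listFiles(File::isDirectory);
        if (subdirectories == null) return files; // not a readable directory
        Arrays.sort(subdirectories); // deterministic order before any shuffling

        for (File subdirectory : subdirectories) {
            if (skipped.contains(subdirectory.getName())) continue;
            File[] documents = subdirectory.listFiles(File::isFile);
            if (documents == null) continue;
            Arrays.sort(documents);
            files.addAll(Arrays.asList(documents));
        }

        if (shuffle) {
            // Assumption: seed == -1 means "do not seed the generator".
            Random random = (seed == -1) ? new Random() : new Random(seed);
            Collections.shuffle(files, random);
        }
        return files;
    }

    /** The label of a document is the name of the subdirectory holding it. */
    public static String labelOf(File document) {
        return document.getParentFile().getName();
    }
}
```

In this sketch, passing a fixed seed makes the shuffled order reproducible across runs, which is the usual reason for exposing a seed parameter.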
Modifier and Type | Method and Description |
---|---|
void | close(): Frees any resources this parser may be holding. |
static String[] | fileToArray(File file, String label): Reads in the specified file, splits it on whitespace, and adds all resulting words to an array which it returns. |
Object | next(): Returns the next labeled array of words. |
void | reset(): Sets filesIndex back to 0. |
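
The behavior described for fileToArray (read a file, split on whitespace, put the label first) can be sketched with the JDK alone. The class `FileToArraySketch` and its helper `labeledWords` are hypothetical names introduced for illustration, and reading the file as UTF-8 is an assumption; the source does not say which encoding the library uses or how it reports I/O errors.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the LBJava implementation.
public class FileToArraySketch {
    /** Returns {label, word1, word2, ...} for text already held in memory. */
    public static String[] labeledWords(String text, String label) {
        List<String> result = new ArrayList<>();
        result.add(label); // the label is always the first element
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) result.add(word); // empty input yields no words
        }
        return result.toArray(new String[0]);
    }

    /** Reads the file (assumed UTF-8) and delegates to labeledWords. */
    public static String[] fileToArray(File file, String label) {
        try {
            byte[] bytes = Files.readAllBytes(file.toPath());
            return labeledWords(new String(bytes, StandardCharsets.UTF_8), label);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, `labeledWords("the  quick\n\tbrown fox\n", "animals")` yields the array `{"animals", "the", "quick", "brown", "fox"}`: runs of spaces, tabs, and newlines all count as a single separator.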
protected int filesIndex
Points to the next element of files to be parsed.

public WordsInDocumentByDirectory(String directory)
directory - This directory contains subdirectories (whose names will be used as labels) which contain the documents to be parsed.

public WordsInDocumentByDirectory(String directory, String[] exceptions)
directory - This directory contains subdirectories (whose names will be used as labels) which contain the documents to be parsed.
exceptions - None of the subdirectories whose names appear in this array will be parsed. It may be null if there are no exceptions.

public WordsInDocumentByDirectory(String directory, String[] exceptions, boolean shuffle)
directory - This directory contains subdirectories (whose names will be used as labels) which contain the documents to be parsed.
exceptions - None of the subdirectories whose names appear in this array will be parsed. It may be null if there are no exceptions.
shuffle - Whether or not to randomly shuffle the order in which examples are returned.

public WordsInDocumentByDirectory(String directory, String[] exceptions, boolean shuffle, long seed)
directory - This directory contains subdirectories (whose names will be used as labels) which contain the documents to be parsed.
exceptions - None of the subdirectories whose names appear in this array will be parsed. It may be null if there are no exceptions.
shuffle - Whether or not to randomly shuffle the order in which examples are returned.
seed - For the random number generator. If set to -1, no seed is used.

public void reset()
Sets filesIndex back to 0.
Specified by: reset in interface edu.illinois.cs.cogcomp.lbjava.parse.Parser

public Object next()
Specified by: next in interface edu.illinois.cs.cogcomp.lbjava.parse.Parser

public void close()
Specified by: close in interface edu.illinois.cs.cogcomp.lbjava.parse.Parser
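
The interplay between filesIndex, next(), and reset() amounts to a cursor over a list. The sketch below is a hypothetical stand-in rather than the library's code: `CursorSketch` is an invented name, the list holds plain strings instead of files, and returning null once the list is exhausted is an assumption about the end-of-input convention that this page does not state.

```java
import java.util.List;

// Hypothetical stand-in for the parser's cursor logic, not LBJava code.
public class CursorSketch {
    protected final List<String> files; // stands in for the list of files
    protected int filesIndex = 0;       // points to the next element to return

    public CursorSketch(List<String> files) {
        this.files = files;
    }

    /** Returns the next element, or null when none remain (assumption). */
    public Object next() {
        if (filesIndex >= files.size()) return null;
        return files.get(filesIndex++);
    }

    /** Rewinds the cursor so the same elements can be read again. */
    public void reset() {
        filesIndex = 0;
    }
}
```

Because reset() only rewinds the index, a second pass in this sketch returns the elements in the same order as the first; any shuffling happened once, when the list was built.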

public static String[] fileToArray(File file, String label)
file - The file to read in.
label - The label associated with this file, which should appear as the first element of the returned array.
Returns: label followed by all the words in the file in the order they appeared.

Copyright © 2017. All rights reserved.