public class SentenceSplitter
extends edu.illinois.cs.cogcomp.lbjava.parse.LineByLine
The user can then retrieve Sentences one at a time with the next() method, or all at once with the splitAll() method. The returned Sentences' start and end fields represent offsets into the file they were extracted from. Every character in between those two offsets inclusive, including extra spaces, newlines, etc., is included in the Sentence as it appeared in the paragraph.
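To make the inclusive-offset convention concrete, here is a minimal, self-contained sketch. It does not use the library: the InclusiveOffsets class, its Span type, and the textOf helper are hypothetical names introduced for illustration. The sketch assumes the offsets are zero-based character positions, and the example spans are hand-picked rather than produced by the splitter.

```java
// Hypothetical illustration only; none of these names belong to the LBJava API.
final class InclusiveOffsets {
    /** Hypothetical stand-in for a sentence's start and end fields (both inclusive). */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Recovers the text of a span; end is inclusive, so substring needs end + 1. */
    static String textOf(String source, Span span) {
        return source.substring(span.start, span.end + 1);
    }

    public static void main(String[] args) {
        String source = "It rained.  We stayed in.";
        // Hand-picked spans (assumed zero-based), not output of the real splitter.
        Span first = new Span(0, 9);
        Span second = new Span(12, 24);
        System.out.println(textOf(source, first));  // It rained.
        System.out.println(textOf(source, second)); // We stayed in.
    }
}
```

Because the end offset is inclusive, the length of a span is end - start + 1, not end - start.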
A main(String[]) method is also implemented, which applies this class to plain text in a straightforward way.
See Also: Sentence
Modifier and Type | Field and Description |
---|---|
protected int | currentOffset: Contains the offset of a paragraph currently being processed. |
protected int | index: When the constructor taking an array argument is used, this variable keeps track of the element in the array currently being used. |
protected String[] | input: When the constructor taking an array argument is used, this variable stores that array. |
protected LinkedList<Sentence> | sentences: Contains sentences ready to be returned to the user upon request. |
Constructor and Description |
---|
SentenceSplitter(String file): Splits the given file into sentences. |
SentenceSplitter(String[] input): Splits the given input into sentences. |
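The array constructor expects one element per line, with line terminators already removed. The following is a minimal sketch of one way to prepare such an array from a block of text. The LinesForSplitter class and toLines helper are hypothetical names, and keeping empty lines in the array is an assumption made here so that no input line is silently dropped.

```java
import java.util.Arrays;

// Hypothetical helper; not part of the LBJava API.
final class LinesForSplitter {
    /**
     * Splits text on any line terminator (\n, \r\n, \r, ...). The limit of -1
     * keeps trailing empty strings, so blank lines survive as empty elements.
     */
    static String[] toLines(String text) {
        return text.split("\\R", -1);
    }

    public static void main(String[] args) {
        String text = "First line.\r\nSecond line.\n\nNew paragraph.";
        String[] lines = toLines(text);
        System.out.println(Arrays.toString(lines));
        // With the library on the classpath, the array could then be passed on:
        // new SentenceSplitter(lines)
    }
}
```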
Modifier and Type | Method and Description |
---|---|
protected boolean | boundary(int index, Word word, Word next1, Word next2): Determines whether the given punctuation represents the end of a sentence based on elements of the paragraph immediately surrounding the punctuation. |
protected boolean | endsWithCloseBracket(Word w): Determines whether the argument ends with any of the following varieties of closing bracket: ) } ] -RBR- . |
protected boolean | endsWithQuote(Word w): Determines whether the argument ends with any of the following varieties of closing quote: ' '' ''' " '" . |
protected String | getParagraph(): Extracts one paragraph at a time from the input. |
protected boolean | hasStartMarker(Word w): Determines whether the argument contains any of the following varieties of "start marker" at its beginning: an open quote, an open bracket, or a capital letter. |
protected boolean | isClose(Word w): Determines whether the argument represents a closing bracket or a closing quote. |
protected boolean | isClosingBracket(Word w): Determines whether the argument is exactly equal to any of the following varieties of closing bracket: ) } ] -RBR- . |
protected boolean | isClosingQuote(Word w): Determines whether the argument is exactly equal to any of the following varieties of closing quote: ' '' ''' " '" . |
protected boolean | isHonorific(Word w): Determines whether the argument is exactly equal to any of the honorifics this class recognizes. |
protected boolean | isTerminal(Word w): Determines whether the argument is exactly equal to any of the following terminal abbreviations: Esq Jr Sr M.D Ph.D . |
protected boolean | isTimeZone(Word w): Determines whether the argument is a United States time zone abbreviation (AST, CST, EST, HST, MST, PST, ADT, CDT, EDT, HDT, MDT, PDT, or UTC-11). |
static void | main(String[] args): Run this program on a file containing plain text, and it will produce the same text on STDOUT, rearranged so that each line contains exactly one sentence. |
Object | next(): Retrieves the next sentence off the queue and returns it. |
protected void | process(String paragraph): This method does the actual work, deciding where sentences begin and end and populating the sentences member variable. |
protected String | readLine(): If the constructor taking a file name as input was used, this method simply calls the method of the same name in LineByLine; otherwise, it returns the next element of the array. |
protected boolean | sentenceBeginner(Word word): Simple check to see if the given word can reliably be identified as the first word of a sentence. |
Sentence[] | splitAll(): Retrieves, in array form, every sentence found in the input paragraphs that have been provided so far. |
protected boolean | startsWithOpenBracket(Word w): Determines whether the argument starts with any of the following varieties of open bracket: ( { [ -LBR- . |
protected boolean | startsWithOpenQuote(Word w): Determines whether the argument starts with any of the following varieties of open quote: ` `` ``` " "` . |
protected boolean | startsWithQuote(Word w): Determines whether the first character of the argument is any of the three varieties of quotes: ' " `. |
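Several of the predicates above come in pairs that differ only in whether the whole token must match or just one end of it (for example isClosingBracket versus endsWithCloseBracket). The sketch below illustrates that distinction on plain strings. It is not the library's implementation: the BracketChecks class is hypothetical, and it assumes the comparison is made against the token's surface text, whereas the documented methods take a Word.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration on plain strings; the documented methods take a Word.
final class BracketChecks {
    // The closing-bracket varieties named in the method summary.
    private static final List<String> CLOSING = Arrays.asList(")", "}", "]", "-RBR-");

    /** Exact match: the whole token must be a closing bracket. */
    static boolean isClosingBracket(String token) {
        return CLOSING.contains(token);
    }

    /** Suffix match: the token only has to end with a closing bracket. */
    static boolean endsWithCloseBracket(String token) {
        for (String bracket : CLOSING) {
            if (token.endsWith(bracket)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isClosingBracket(")"));         // true
        System.out.println(isClosingBracket("done)"));     // false
        System.out.println(endsWithCloseBracket("done)")); // true
        System.out.println(endsWithCloseBracket("(done")); // false
    }
}
```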
protected int currentOffset
protected LinkedList<Sentence> sentences
protected int index
protected String[] input
public SentenceSplitter(String file)
Parameters:
file - The name of the file to sentence split.
public SentenceSplitter(String[] input)
Parameters:
input - Plain text. Each element of this array represents a line, with any line termination characters removed.
public static void main(String[] args)
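Run this program on a file containing plain text, and it will produce the same text on STDOUT, rearranged so that each line contains exactly one sentence.
Usage:
java edu.illinois.cs.cogcomp.lbjava.nlp.SentenceSplitter <file name>
Parameters:
args - The command line arguments.
protected String readLine()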
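If the constructor taking a file name as input was used, this method simply calls the method of the same name in LineByLine; otherwise, it returns the next element of the array.
Overrides:
readLine in class edu.illinois.cs.cogcomp.lbjava.parse.LineByLine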
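The array branch of readLine() can be pictured with a small stand-alone sketch. ArrayLineSource is a hypothetical class, not the library's code; it mirrors the documented input and index fields and assumes that null signals the end of the input once the array is exhausted.

```java
// Hypothetical sketch of the array-backed branch; not the library's implementation.
final class ArrayLineSource {
    private final String[] input; // the lines supplied by the caller
    private int index = 0;        // position of the next element to hand out

    ArrayLineSource(String[] input) {
        this.input = input;
    }

    /** Returns the next element, or null once the array is exhausted (assumed end signal). */
    String readLine() {
        return index < input.length ? input[index++] : null;
    }

    public static void main(String[] args) {
        ArrayLineSource source = new ArrayLineSource(new String[] {"one", "two"});
        System.out.println(source.readLine()); // one
        System.out.println(source.readLine()); // two
        System.out.println(source.readLine()); // null
    }
}
```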
protected String getParagraph()
public Object next()
Returns:
The next sentence, or null if there are no more sentences.
public Sentence[] splitAll()
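The two retrieval styles, one sentence at a time or everything at once, can be sketched with a plain queue. SentenceQueue below is a hypothetical, simplified stand-in that holds strings and never reads further input. Whether the real splitAll() empties the queue is an assumption of this sketch, which is why its bulk method is named drainAll.

```java
import java.util.LinkedList;

// Hypothetical, simplified stand-in for the sentences queue; not the library's code.
final class SentenceQueue {
    private final LinkedList<String> sentences = new LinkedList<>();

    void add(String sentence) {
        sentences.add(sentence);
    }

    /** Returns the oldest queued sentence, or null if there are no more sentences. */
    String next() {
        return sentences.isEmpty() ? null : sentences.removeFirst();
    }

    /** Returns everything still queued as an array and empties the queue (assumption). */
    String[] drainAll() {
        String[] all = sentences.toArray(new String[0]);
        sentences.clear();
        return all;
    }

    public static void main(String[] args) {
        SentenceQueue queue = new SentenceQueue();
        queue.add("A.");
        queue.add("B.");
        queue.add("C.");
        System.out.println(queue.next());            // A.
        System.out.println(queue.drainAll().length); // 2
        System.out.println(queue.next());            // null
    }
}
```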
protected void process(String paragraph)
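This method does the actual work, deciding where sentences begin and end and populating the sentences member variable.
Parameters:
paragraph - The paragraph to process.
protected boolean boundary(int index, Word word, Word next1, Word next2)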
Parameters:
index - The index of the punctuation in question in its word.
word - The word containing the punctuation.
next1 - The word one after the word containing the punctuation.
next2 - The word two after the word containing the punctuation.
protected boolean sentenceBeginner(Word word)
Parameters:
word - The word in question.
protected boolean startsWithQuote(Word w)
Parameters:
w - The word in question.
Returns:
true if and only if the first character of the argument is any of the three varieties of quotes.
protected boolean endsWithQuote(Word w)
Parameters:
w - The word in question.
Returns:
true if and only if the argument ends with any of the varieties of quotes named above.
protected boolean isClose(Word w)
Parameters:
w - The word in question.
Returns:
true if and only if the argument represents either a closing bracket or a closing quote.
protected boolean isClosingBracket(Word w)
Parameters:
w - The word in question.
Returns:
true if and only if the argument is exactly equal to any of the above varieties of closing bracket.
protected boolean isClosingQuote(Word w)
Parameters:
w - The word in question.
Returns:
true if and only if the argument is exactly equal to any of the above varieties of closing quote.
protected boolean hasStartMarker(Word w)
Parameters:
w - The word in question.
Returns:
true if and only if the argument starts with a "start marker".
protected boolean startsWithOpenQuote(Word w)
Parameters:
w - The word in question.
Returns:
true if and only if the argument starts with one of the varieties of open quote named above.
protected boolean startsWithOpenBracket(Word w)
Parameters:
w - The word in question.
Returns:
true if and only if the argument starts with any of the varieties of open bracket named above.
protected boolean endsWithCloseBracket(Word w)
Parameters:
w - The word in question.
Returns:
true if and only if the argument ends with any of the varieties of closing bracket named above.
protected boolean isTimeZone(Word w)
Parameters:
w - The word in question.
Returns:
true if and only if the argument matches any of the above time zone abbreviations.
protected boolean isTerminal(Word w)
Parameters:
w - The word in question.
Returns:
true if and only if the argument matches any of the above terminal abbreviations.
protected boolean isHonorific(Word w)
Parameters:
w - The word in question.
Returns:
true if and only if the argument is exactly equal to any of the honorifics this class recognizes.

Copyright © 2017. All rights reserved.