public class TokenizerStateMachine extends Object
This is the state machine used to parse text.
The parsing states are defined by the TokenizerState enumeration, and the set of state transitions is defined by the TokenType enumeration. These two enumerations define the dimensions of the state machine matrix.
States are assumed to be nested rather than linear, so there is a state stack that contains State objects. Each State object encapsulates the state itself as well as the logic to manage popping and pushing that state, so the State objects do the work of actually creating the TextAnnotation objects.
The StateProcessor interface defines the contract that classes operating on state transitions must adhere to in order to process new tokens as they are presented to the state machine.
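The nested-state idea described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `SketchState`, `NestedStateStackSketch`, the `closed` list, and the state names are hypothetical stand-ins, not the actual State class or fields of TokenizerStateMachine.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the nested State objects described above.
class SketchState {
    final String kind; // illustrative label, e.g. "IN_SENTENCE" or "IN_WORD"
    final int start;   // character offset where the state was opened
    int end = -1;      // character offset where the state was closed, -1 while open

    SketchState(String kind, int start) {
        this.kind = kind;
        this.start = start;
    }
}

// Assumed behavior: states nest, so the innermost open state is on top of the stack.
class NestedStateStackSketch {
    final List<SketchState> stack = new ArrayList<>();
    final List<SketchState> closed = new ArrayList<>(); // hypothetical record of popped states

    // Open a new state nested inside the current one.
    void push(SketchState newState) {
        stack.add(newState);
    }

    // Close the innermost state at offset `where`; returns false if nothing is open.
    boolean pop(int where) {
        if (stack.isEmpty()) {
            return false;
        }
        SketchState top = stack.remove(stack.size() - 1);
        top.end = where;
        closed.add(top);
        return true;
    }

    // The innermost open state, or null if the stack is empty.
    SketchState getCurrent() {
        return stack.isEmpty() ? null : stack.get(stack.size() - 1);
    }
}
```

In this sketch a word state is pushed on top of a sentence state and popped first, which is the nesting the description refers to; the real State objects additionally create the TextAnnotation output, which is not modeled here.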
Potential issues:
Modifier and Type | Field and Description |
---|---|
protected ArrayList<edu.illinois.cs.cogcomp.nlp.tokenizer.TokenizerStateMachine.State> |
completed
the state stack, since state can be nested.
|
protected int |
current
the character offset.
|
protected ArrayList<edu.illinois.cs.cogcomp.nlp.tokenizer.TokenizerStateMachine.State> |
stack
the state stack, since state can be nested.
|
protected int |
state
the state we are in currently.
|
protected StateProcessor[][] |
statemachine
This is the state machine.
|
protected char[] |
text
the text to process.
|
protected String |
textstring
the text to process.
|
Constructor and Description |
---|
TokenizerStateMachine(boolean splitOnDash)
Init the state machine decision matrix and the text annotation.
|
Modifier and Type | Method and Description |
---|---|
protected edu.illinois.cs.cogcomp.nlp.tokenizer.TokenizerStateMachine.State |
getCurrent()
get the current state.
|
protected boolean |
isContinue()
Any number of periods beyond two will continue the sentence rather than ending it.
|
protected boolean |
isURL()
We have encountered a colon in the input data stream. Check whether it is part of a URL; if it is, advance the cursor and return true, otherwise return false.
|
protected void |
parseText(String intext)
Process the input text delineating sentences and words.
|
protected char |
peek(int offset)
get the character at the given offset from the current position.
|
protected boolean |
pop(int where)
Pop the current state identifier off the stack.
|
protected void |
push(edu.illinois.cs.cogcomp.nlp.tokenizer.TokenizerStateMachine.State newState,
int where)
Push a new state identifier onto the stack.
|
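The lookahead helpers summarized in the table above (peek, isURL, isContinue) could work roughly as in the following sketch. This is not the library's implementation: `LookaheadSketch` is a hypothetical class, and the `'\0'` out-of-range sentinel, the `://` URL test, and the "advance to the next whitespace" rule are all assumptions chosen for illustration.

```java
// Hypothetical, simplified cursor with lookahead; not the actual TokenizerStateMachine code.
class LookaheadSketch {
    final char[] text; // the text to process
    int current;       // the character offset of the cursor

    LookaheadSketch(String intext) {
        this.text = intext.toCharArray();
        this.current = 0;
    }

    // Character at `offset` from the cursor; '\0' when out of range (assumed sentinel).
    char peek(int offset) {
        int index = current + offset;
        return (index >= 0 && index < text.length) ? text[index] : '\0';
    }

    // Assumed rule: the cursor sits on a colon, and "://" marks a URL.
    // On a match, advance the cursor to the next whitespace (or end of text).
    boolean isURL() {
        if (peek(0) != ':' || peek(1) != '/' || peek(2) != '/') {
            return false;
        }
        while (current < text.length && !Character.isWhitespace(text[current])) {
            current++;
        }
        return true;
    }

    // Count consecutive periods starting at the cursor; more than two
    // is treated as continuing the sentence rather than ending it.
    boolean isContinue() {
        int periods = 0;
        while (peek(periods) == '.') {
            periods++;
        }
        return periods > 2;
    }
}
```

Returning a sentinel from `peek` instead of throwing keeps the lookahead checks free of bounds tests, at the cost of treating a literal NUL character in the input as "no character".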
protected ArrayList<edu.illinois.cs.cogcomp.nlp.tokenizer.TokenizerStateMachine.State> stack
protected ArrayList<edu.illinois.cs.cogcomp.nlp.tokenizer.TokenizerStateMachine.State> completed
protected StateProcessor[][] statemachine
This is the state machine. The first dimension is indexed by tokenizer state (the TokenizerState enum), and the second dimension is indexed by token type (the TokenType enum). The values of this array are implementations of the StateProcessor interface. Each is responsible for processing one character at a time, but it can also look ahead and back, since it has full access to the contents of this class.
The matrix is initialized using Java lambda expressions, solely for brevity; considering the number of entries, the expressiveness of a verbose interface implementation does not seem necessary. The StateProcessor interface has only one method, which takes a char.
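A two-dimensional dispatch matrix of lambdas, as described above, might look like the following minimal sketch. The names are assumptions: `CharProcessor` stands in for StateProcessor (its real method name is not given here), and the two small enums are illustrative, not the actual TokenizerState and TokenType constants.

```java
// Hypothetical single-method interface mirroring the description of StateProcessor.
interface CharProcessor {
    void process(char c);
}

// Illustrative enums; the real TokenizerState and TokenType constants may differ.
enum SketchTokenizerState { IN_SENTENCE, IN_WORD }
enum SketchTokenType { TEXT, WHITESPACE }

class MatrixSketch {
    final StringBuilder log = new StringBuilder();
    SketchTokenizerState state = SketchTokenizerState.IN_SENTENCE;

    // Rows are indexed by state ordinal, columns by token type ordinal.
    final CharProcessor[][] matrix = {
        { // IN_SENTENCE
            c -> { state = SketchTokenizerState.IN_WORD; log.append('[').append(c); }, // TEXT
            c -> { }                                                                   // WHITESPACE
        },
        { // IN_WORD
            c -> log.append(c),                                                        // TEXT
            c -> { state = SketchTokenizerState.IN_SENTENCE; log.append(']'); }        // WHITESPACE
        }
    };

    // Classify a character into one of the illustrative token types.
    static SketchTokenType classify(char c) {
        return Character.isWhitespace(c) ? SketchTokenType.WHITESPACE : SketchTokenType.TEXT;
    }

    // Feed each character to the processor selected by (current state, token type).
    String run(String intext) {
        for (char c : intext.toCharArray()) {
            matrix[state.ordinal()][classify(c).ordinal()].process(c);
        }
        if (state == SketchTokenizerState.IN_WORD) {
            log.append(']'); // close a word left open at end of input
            state = SketchTokenizerState.IN_SENTENCE;
        }
        return log.toString();
    }
}
```

For example, `new MatrixSketch().run("hi  yo")` brackets each word and returns `[hi][yo]`. Because the lambdas are instance-level, they can read and update the enclosing object's fields, which is the "full access to the contents of this class" property the description mentions.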
protected char[] text
protected String textstring
protected int current
protected int state
public TokenizerStateMachine(boolean splitOnDash)
protected boolean isContinue()
protected boolean isURL()
protected char peek(int offset)
offset
- the offset from the current position.
protected edu.illinois.cs.cogcomp.nlp.tokenizer.TokenizerStateMachine.State getCurrent()
protected boolean pop(int where)
where
- the position to terminate the previous token and start the new one.
protected void push(edu.illinois.cs.cogcomp.nlp.tokenizer.TokenizerStateMachine.State newState, int where)
newState
- the new state to push.
where
- the start position.
protected void parseText(String intext)
intext
- the text to parse.
Copyright © 2017. All rights reserved.