public class TokenizerTextAnnotationBuilder extends Object implements TextAnnotationBuilder
Fields inherited from interface TextAnnotationBuilder: SPLIT_ON_DASH
Constructor and Description |
---|
TokenizerTextAnnotationBuilder(Tokenizer tokenizer): Instantiate a TextAnnotationBuilder. |
Modifier and Type | Method and Description |
---|---|
TextAnnotation | buildTextAnnotation(String corpusId, String id, String text, String[] tokens, int[] sentenceEndPositions): Create a new text annotation using the given text, the tokens and the sentence boundary positions (only the ending positions), specified in terms of the tokens. |
static TextAnnotation | buildTextAnnotation(String corpusId, String textId, String text, String[] tokens, int[] sentenceEndPositions, String sentenceViewGenerator, double sentenceViewScore): Instantiate a TextAnnotation using a SentenceViewGenerator to create an explicit Sentence view. |
TextAnnotation | createTextAnnotation(String text): Create a TextAnnotation for the text argument, using the Tokenizer provided at construction. |
TextAnnotation | createTextAnnotation(String corpusId, String textId, String text): Tokenize the input text (split into sentences and "words" within sentences) and populate a TextAnnotation object. |
TextAnnotation | createTextAnnotation(String corpusId, String textId, String text, Tokenizer.Tokenization tokenization): A stub method that should not be called with this Builder. |
String | getName() |
public TokenizerTextAnnotationBuilder(Tokenizer tokenizer)
Parameters:
tokenizer - The Tokenizer that will split text into sentences and words.
public static TextAnnotation buildTextAnnotation(String corpusId, String textId, String text, String[] tokens, int[] sentenceEndPositions, String sentenceViewGenerator, double sentenceViewScore)
Parameters:
corpusId - a field in TextAnnotation that the client can use for book-keeping (e.g. to track texts from the same corpus)
textId - a field in TextAnnotation that the client can use for book-keeping (e.g. to identify a specific document by some reference string)
text - the plain English text to process
tokens - the token Strings, in order from the original text
sentenceEndPositions - token offsets of sentence ends (one-past-the-end indexing)
sentenceViewGenerator - the name of the source of the sentence split
sentenceViewScore - a score that may indicate how reliable the sentence split information is
Returns:
a TextAnnotation with ViewNames.TOKENS and ViewNames.SENTENCE views.
public String getName()
Specified by:
getName in interface TextAnnotationBuilder
public TextAnnotation createTextAnnotation(String text) throws IllegalArgumentException
Specified by:
createTextAnnotation in interface TextAnnotationBuilder
Parameters:
text - the text from which to build the TextAnnotation
Returns:
a TextAnnotation with ViewNames.SENTENCE and ViewNames.TOKENS views and default corpus id and text id fields.
Throws:
IllegalArgumentException - if the tokenizer has problems processing the text.
public TextAnnotation createTextAnnotation(String corpusId, String textId, String text) throws IllegalArgumentException
Specified by:
createTextAnnotation in interface TextAnnotationBuilder
Parameters:
corpusId - a field in TextAnnotation that the client can use for book-keeping (e.g. to track texts from the same corpus)
textId - a field in TextAnnotation that the client can use for book-keeping (e.g. to identify a specific document by some reference string)
text - the plain English text to process
Returns:
a TextAnnotation with ViewNames.TOKENS and ViewNames.SENTENCE views.
Throws:
IllegalArgumentException - if the tokenizer has problems with the input text.
public TextAnnotation createTextAnnotation(String corpusId, String textId, String text, Tokenizer.Tokenization tokenization) throws IllegalArgumentException
Use BasicTextAnnotationBuilder if you need to create a TextAnnotation from pre-tokenized text.
Specified by:
createTextAnnotation in interface TextAnnotationBuilder
Parameters:
text - Raw text string
tokenization - An instance containing tokens, character offsets, and sentence boundaries.
Throws:
IllegalArgumentException
public TextAnnotation buildTextAnnotation(String corpusId, String id, String text, String[] tokens, int[] sentenceEndPositions)
For example, for the text "Jack went up the hill. So did Jill.", the tokens would be the array {"Jack", "went", "up", "the", "hill", ".", "So", "did", "Jill", "."} and the sentence boundary array would be {6, 10}. If the last element of the sentence boundary array is not equal to the size of the tokens array, an IllegalArgumentException is raised.
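The boundary convention can be illustrated with a small, self-contained sketch. It does not call this library; the class SentenceEndPositionsDemo and its two helper methods are hypothetical names introduced only to model the contract stated here. Because the last end position must equal the number of tokens, the ten tokens of the example sentence pair give the end positions {6, 10}.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the documented API: it only models the
// sentenceEndPositions contract (exclusive ends, last end == tokens.length).
class SentenceEndPositionsDemo {

    // Builds one-past-the-end sentence boundaries from per-sentence token counts.
    static int[] endPositions(int[] tokensPerSentence) {
        int[] ends = new int[tokensPerSentence.length];
        int running = 0;
        for (int i = 0; i < tokensPerSentence.length; i++) {
            running += tokensPerSentence[i];
            ends[i] = running; // exclusive end: index of the next sentence's first token
        }
        return ends;
    }

    // Splits tokens into sentences; rejects boundaries that do not cover all tokens.
    // Treating an empty boundary array as invalid is an assumption of this sketch.
    static List<String[]> splitSentences(String[] tokens, int[] sentenceEndPositions) {
        int n = sentenceEndPositions.length;
        if (n == 0 || sentenceEndPositions[n - 1] != tokens.length) {
            throw new IllegalArgumentException(
                    "Last sentence end position must equal the number of tokens (" + tokens.length + ")");
        }
        List<String[]> sentences = new ArrayList<>();
        int start = 0;
        for (int end : sentenceEndPositions) {
            sentences.add(Arrays.copyOfRange(tokens, start, end));
            start = end;
        }
        return sentences;
    }

    public static void main(String[] args) {
        String[] tokens = {"Jack", "went", "up", "the", "hill", ".", "So", "did", "Jill", "."};
        int[] ends = endPositions(new int[] {6, 4}); // 6 tokens, then 4 tokens
        System.out.println(Arrays.toString(ends));
        for (String[] sentence : splitSentences(tokens, ends)) {
            System.out.println(String.join(" ", sentence));
        }
    }
}
```

Computing the end positions as a running total of per-sentence token counts keeps them consistent with the tokens array by construction, so the final check only fails when the boundaries were produced independently of the tokens.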
Parameters:
corpusId - A string that identifies the corpus
id - A string that identifies this text
text - The text itself
tokens - The array of tokens of this text
sentenceEndPositions - The ending positions of sentences, specified as indices into the tokens array. Note that the end positions are exclusive: for example, if a sentence ends at the 7th token (index 6), then the end position for that sentence is 7.
Copyright © 2017. All rights reserved.