public class TokenizerTextAnnotationBuilder extends Object implements TextAnnotationBuilder
SPLIT_ON_DASH

| Constructor and Description |
|---|
| TokenizerTextAnnotationBuilder(Tokenizer tokenizer): instantiates a TextAnnotationBuilder. |
| Modifier and Type | Method and Description |
|---|---|
| TextAnnotation | buildTextAnnotation(String corpusId, String id, String text, String[] tokens, int[] sentenceEndPositions): creates a new TextAnnotation from the given text, the tokens, and the sentence boundary positions (ending positions only), specified in terms of the tokens. |
| static TextAnnotation | buildTextAnnotation(String corpusId, String textId, String text, String[] tokens, int[] sentenceEndPositions, String sentenceViewGenerator, double sentenceViewScore): instantiates a TextAnnotation using a SentenceViewGenerator to create an explicit Sentence view. |
| TextAnnotation | createTextAnnotation(String text): creates a TextAnnotation for the text argument, using the Tokenizer provided at construction. |
| TextAnnotation | createTextAnnotation(String corpusId, String textId, String text): tokenizes the input text (splits it into sentences and "words" within sentences) and populates a TextAnnotation object. |
| TextAnnotation | createTextAnnotation(String corpusId, String textId, String text, Tokenizer.Tokenization tokenization): a stub method that should not be called with this Builder. |
| String | getName() |
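
The createTextAnnotation methods listed above rely on a Tokenizer to split text into sentences and words. The sketch below is a toy stand-in written only to illustrate the shape of that output (a token array plus exclusive sentence end positions). The class `ToyTokenizerSketch`, its `Result` type, and the splitting rule (whitespace, with a trailing `.`, `!` or `?` treated as a sentence-final token) are all assumptions made for illustration; they are not the library's Tokenizer and will not match its behaviour on real text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the library's Tokenizer implementation.
final class ToyTokenizerSketch {

    // Holds the two arrays that the buildTextAnnotation methods expect.
    static final class Result {
        final String[] tokens;
        final int[] sentenceEndPositions;

        Result(String[] tokens, int[] sentenceEndPositions) {
            this.tokens = tokens;
            this.sentenceEndPositions = sentenceEndPositions;
        }
    }

    // Splits on whitespace and peels a trailing '.', '!' or '?' into its own token.
    // Each sentence end is recorded as a one-past-the-end token index.
    static Result tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        List<Integer> ends = new ArrayList<>();
        for (String chunk : text.trim().split("\\s+")) {
            if (chunk.isEmpty()) {
                continue; // happens only for empty or all-whitespace input
            }
            char last = chunk.charAt(chunk.length() - 1);
            boolean terminal = last == '.' || last == '!' || last == '?';
            if (terminal && chunk.length() > 1) {
                tokens.add(chunk.substring(0, chunk.length() - 1));
                tokens.add(String.valueOf(last));
            } else {
                tokens.add(chunk);
            }
            if (terminal) {
                ends.add(tokens.size());
            }
        }
        // Close a final sentence that lacks terminal punctuation.
        if (!tokens.isEmpty() && (ends.isEmpty() || ends.get(ends.size() - 1) != tokens.size())) {
            ends.add(tokens.size());
        }
        int[] endArray = ends.stream().mapToInt(Integer::intValue).toArray();
        return new Result(tokens.toArray(new String[0]), endArray);
    }

    public static void main(String[] args) {
        Result r = tokenize("Jack went up the hill. So did Jill.");
        System.out.println(String.join(" | ", r.tokens));
        System.out.println(java.util.Arrays.toString(r.sentenceEndPositions));
    }
}
```

The last recorded end position always equals the number of tokens, which is the invariant the five-argument buildTextAnnotation method is documented to check.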
public TokenizerTextAnnotationBuilder(Tokenizer tokenizer)
tokenizer - The Tokenizer that will split text into sentences and words.

public static TextAnnotation buildTextAnnotation(String corpusId, String textId, String text, String[] tokens, int[] sentenceEndPositions, String sentenceViewGenerator, double sentenceViewScore)

corpusId - a field in TextAnnotation that can be used by the client for book-keeping (e.g. track texts from the same corpus)
textId - a field in TextAnnotation that can be used by the client for book-keeping (e.g. identify a specific document by some reference string)
text - the plain English text to process
tokens - the token Strings, in order from the original text
sentenceEndPositions - token offsets of sentence ends (one-past-the-end indexing)
sentenceViewGenerator - the name of the source of the sentence split
sentenceViewScore - a score that may indicate how reliable the sentence split information is
Returns: a TextAnnotation with ViewNames.TOKENS and ViewNames.SENTENCE views.

public String getName()
Specified by: getName in interface TextAnnotationBuilder

public TextAnnotation createTextAnnotation(String text) throws IllegalArgumentException

Specified by: createTextAnnotation in interface TextAnnotationBuilder
text - the text from which to build the TextAnnotation
Returns: a TextAnnotation with ViewNames.SENTENCE and ViewNames.TOKENS views and default corpus id and text id fields.
Throws: IllegalArgumentException - if the tokenizer has problems processing the text.

public TextAnnotation createTextAnnotation(String corpusId, String textId, String text) throws IllegalArgumentException
Specified by: createTextAnnotation in interface TextAnnotationBuilder
corpusId - a field in TextAnnotation that can be used by the client for book-keeping (e.g. track texts from the same corpus)
textId - a field in TextAnnotation that can be used by the client for book-keeping (e.g. identify a specific document by some reference string)
text - the plain English text to process
Returns: a TextAnnotation with ViewNames.TOKENS and ViewNames.SENTENCE views.
Throws: IllegalArgumentException - if the tokenizer has problems with the input text.

public TextAnnotation createTextAnnotation(String corpusId, String textId, String text, Tokenizer.Tokenization tokenization) throws IllegalArgumentException
A stub method that should not be called with this Builder. Use BasicTextAnnotationBuilder if you need to create a TextAnnotation from pre-tokenized text.

Specified by: createTextAnnotation in interface TextAnnotationBuilder
text - Raw text string
tokenization - An instance containing tokens, character offsets, and sentence boundaries.
Throws: IllegalArgumentException

public TextAnnotation buildTextAnnotation(String corpusId, String id, String text, String[] tokens, int[] sentenceEndPositions)
For example, for the text "Jack went up the hill. So did Jill.", the tokens would be the array {"Jack", "went", "up", "the", "hill", ".", "So", "did", "Jill", "."} and the sentence boundary array would be {6, 10}: the first sentence covers six tokens and the array holds ten tokens in total. If the last element of the sentence boundary array is not equal to the size of the tokens array, an IllegalArgumentException is raised.
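
The exclusive indexing can be checked with a small standalone sketch. Counting the tokens in the example gives ten, so with one-past-the-end positions the boundaries used here are {6, 10}. The class `SentenceEndPositionsSketch` and its methods are hypothetical helpers written for illustration, not part of TokenizerTextAnnotationBuilder; the strictly-increasing check in particular is an assumption, since the documentation only states the requirement on the last element.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of TokenizerTextAnnotationBuilder.
final class SentenceEndPositionsSketch {

    // Checks that end positions are strictly increasing exclusive token indices
    // (an assumption) and that the last one equals tokens.length (documented).
    static void validate(String[] tokens, int[] sentenceEndPositions) {
        int previous = 0;
        for (int end : sentenceEndPositions) {
            if (end <= previous || end > tokens.length) {
                throw new IllegalArgumentException("Invalid sentence end position: " + end);
            }
            previous = end;
        }
        if (previous != tokens.length) {
            throw new IllegalArgumentException(
                    "Last sentence end (" + previous + ") must equal the token count ("
                            + tokens.length + ")");
        }
    }

    // Groups tokens into sentences; each sentence covers the half-open range [start, end).
    static List<List<String>> toSentences(String[] tokens, int[] sentenceEndPositions) {
        validate(tokens, sentenceEndPositions);
        List<List<String>> sentences = new ArrayList<>();
        int start = 0;
        for (int end : sentenceEndPositions) {
            sentences.add(Arrays.asList(Arrays.copyOfRange(tokens, start, end)));
            start = end;
        }
        return sentences;
    }

    public static void main(String[] args) {
        String[] tokens = {"Jack", "went", "up", "the", "hill", ".", "So", "did", "Jill", "."};
        int[] ends = {6, 10};
        // prints [[Jack, went, up, the, hill, .], [So, did, Jill, .]]
        System.out.println(toSentences(tokens, ends));
    }
}
```

Passing {6, 11} for the same tokens makes `validate` throw, because 11 exceeds the ten available tokens.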
corpusId - A string that identifies the corpus
id - A string that identifies this text
text - The text itself
tokens - The array of tokens of this text
sentenceEndPositions - The ending positions of sentences, specified as indices into the tokens array. Note that the end positions are exclusive: for example, if the last token of a sentence is at index 7 of the tokens array, then the end position for that sentence is 8.

Copyright © 2017. All rights reserved.