public class TextCleanerStringTransformation extends Object
(See also StringTransformationCleanup.)
This class replicates functionality from TextCleaner but operates on StringTransformations instead
of Strings. This allows character offsets in the original string to be recovered after cleanup has
taken place, even when the cleanup itself does not preserve character offsets.
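The offset-recovery idea described above can be illustrated with a minimal, self-contained sketch. It does not use the library's StringTransformation class; `OffsetTrackingSketch`, its fields, and `originalOffset` are hypothetical names introduced only for illustration, and the real class may track edits quite differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the idea behind StringTransformation; NOT the library class.
class OffsetTrackingSketch {
    final String original;
    final String transformed;
    // origOffsets[i] is the index in `original` that produced transformed.charAt(i)
    final int[] origOffsets;

    OffsetTrackingSketch(String original, Pattern pattern, String replacement) {
        this.original = original;
        StringBuilder out = new StringBuilder();
        List<Integer> map = new ArrayList<>();
        Matcher m = pattern.matcher(original);
        int pos = 0;
        while (m.find()) {
            // Unmatched text is copied one-to-one, so offsets map to themselves.
            for (int i = pos; i < m.start(); i++) {
                out.append(original.charAt(i));
                map.add(i);
            }
            // Every character of the (literal) replacement maps back to the match start.
            for (int i = 0; i < replacement.length(); i++) {
                out.append(replacement.charAt(i));
                map.add(m.start());
            }
            pos = m.end();
        }
        // Copy whatever follows the last match.
        for (int i = pos; i < original.length(); i++) {
            out.append(original.charAt(i));
            map.add(i);
        }
        this.transformed = out.toString();
        this.origOffsets = map.stream().mapToInt(Integer::intValue).toArray();
    }

    // Maps an offset in the cleaned text back to an offset in the original text.
    int originalOffset(int transformedOffset) {
        return origOffsets[transformedOffset];
    }

    public static void main(String[] args) {
        OffsetTrackingSketch s =
            new OffsetTrackingSketch("Tom &amp; Jerry", Pattern.compile("&amp;"), "&");
        int cleanedIndex = s.transformed.indexOf("Jerry");
        System.out.println(s.transformed);
        System.out.println(cleanedIndex + " -> " + s.originalOffset(cleanedIndex));
    }
}
```

The cleaned text is shorter than the original, yet the position of a token found in the cleaned text can still be mapped back to where it started in the original input.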
Modifier and Type | Field and Description |
---|---|
protected static Pattern | adhocFormatPattern |
protected static Pattern | atSymbolPattern |
protected static Pattern | badApostrophePattern |
protected static Pattern | controlSequencePattern |
protected static Map<String,String> | escXlmToChar |
protected static org.slf4j.Logger | logger |
protected static Pattern | repeatPunctuationPattern |
protected static Pattern | tagAttributePatter2 - find attributes in an xml tag instance. |
protected static Pattern | tagAttributePattern |
protected static Pattern | underscorePattern |
protected static Pattern | whitespacePattern |
protected static String | xmlAmp |
protected static String | xmlApos |
protected static Pattern | xmlEscapeCharPattern |
protected static String | xmlGt |
protected static String | xmlLt |
protected static String | xmlQuot |
protected static Pattern | xmlTagNamePattern - used to extract the name of the tag, so we can match tag and attribute name. |
protected static Pattern | xmlTagPattern |
Constructor and Description |
---|
TextCleanerStringTransformation(ResourceManager rm) |
Modifier and Type | Method and Description |
---|---|
StringTransformation | cleanText(StringTransformation origTextSt) - attempts to remove/replace characters likely to cause problems for NLP tools; output should be ASCII only. |
protected static String | getSubstituteStr(String substr) |
protected static boolean | keep(List<String> tagNames, List<String> attributeNames, String tagname, String attrName) - determines whether we should keep the value for this tag/attribute. |
static StringTransformation | replaceCharacters(StringTransformation origTextSt, Pattern pattern, String replacement) - applies a matcher that finds instances of one or more sequences and replaces them with a specific string (including the empty string). |
static StringTransformation | replaceDuplicatePunctuation(StringTransformation origTextSt) - replaces duplicate punctuation with single punctuation. |
static StringTransformation | replaceMisusedApostropheSymbol(StringTransformation origTextSt) - attempts to replace an incorrect apostrophe symbol (after other substitutions for non-standard quotation marks). |
static StringTransformation | replaceXmlEscapedChars(StringTransformation xmlTextSt) - given an XML string, replaces XML-escaped characters with their ASCII equivalent, padded with whitespace. |
static String | replaceXmlTags(String origText) - identifies XML tags and replaces them with an equal number of space characters. |
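As a rough illustration of the length-preserving approach that the replaceXmlTags summary describes, the sketch below blanks out each tag with the same number of spaces, so every remaining character keeps its original offset. The class name `XmlTagBlankingSketch`, the method `blankTags`, and the simplified `TAG` pattern are assumptions made for this example; the class's own xmlTagPattern is not shown on this page and is probably more careful.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only; not the library implementation of replaceXmlTags.
class XmlTagBlankingSketch {
    // Hypothetical, simplified tag pattern: '<', then anything up to the next '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Overwrites each matched tag with spaces, leaving the string length unchanged.
    static String blankTags(String text) {
        StringBuilder out = new StringBuilder(text);
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            for (int i = m.start(); i < m.end(); i++) {
                out.setCharAt(i, ' ');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String in = "<b>hi</b> there";
        String cleaned = blankTags(in);
        // Same length, and "there" sits at the same index before and after cleaning.
        System.out.println(in.length() == cleaned.length());
        System.out.println(in.indexOf("there") == cleaned.indexOf("there"));
    }
}
```

Because the output has the same length as the input, no offset bookkeeping is needed for this step, which may be why this method can work on a plain String.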
protected static final String xmlQuot
protected static final String xmlAmp
protected static final String xmlApos
protected static final String xmlLt
protected static final String xmlGt
protected static final Pattern xmlEscapeCharPattern
protected static final Pattern underscorePattern
protected static final Pattern controlSequencePattern
protected static final Pattern atSymbolPattern
protected static final Pattern badApostrophePattern
protected static org.slf4j.Logger logger
protected static Pattern repeatPunctuationPattern
protected static Pattern xmlTagPattern
protected static Pattern xmlTagNamePattern
protected static Pattern whitespacePattern
protected static Pattern tagAttributePatter2
protected static Pattern tagAttributePattern
protected static Pattern adhocFormatPattern
public TextCleanerStringTransformation(ResourceManager rm) throws SAXException, IOException
Throws:
SAXException
IOException
public static StringTransformation replaceMisusedApostropheSymbol(StringTransformation origTextSt)
public static String replaceXmlTags(String origText)
Parameters:
origText - text to clean
protected static boolean keep(List<String> tagNames, List<String> attributeNames, String tagname, String attrName)
Parameters:
tagNames - the names of the tags to keep.
attributeNames - the corresponding attributes within the above tags that we want to keep.
tagname - the name of the current tag.
attrName - the name of the current attribute.
public static StringTransformation replaceXmlEscapedChars(StringTransformation xmlTextSt)
Parameters:
xmlTextSt - StringTransformation containing the text string to be modified
public static StringTransformation replaceCharacters(StringTransformation origTextSt, Pattern pattern, String replacement)
Parameters:
origTextSt -
public static StringTransformation replaceDuplicatePunctuation(StringTransformation origTextSt)
Duplicate punctuation sometimes results from a typo, or from ad hoc web formatting; in the latter case, the replacement may not have the ideal effect.
In addition, double dashes and ellipses may cause problems for NLP components; collapsing them should help, though it may introduce extra sentence breaks in the case of ellipses.
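The behaviour described for replaceDuplicatePunctuation can be approximated with a backreference regex on a plain String. This is only a sketch: `DuplicatePunctuationSketch`, `collapse`, and the `REPEATED` pattern are hypothetical, and the class's actual repeatPunctuationPattern is not shown on this page and may differ.

```java
import java.util.regex.Pattern;

// Illustrative only; not the library implementation.
class DuplicatePunctuationSketch {
    // Hypothetical pattern: one ASCII punctuation character followed by
    // one or more repeats of that same character.
    private static final Pattern REPEATED = Pattern.compile("(\\p{Punct})\\1+");

    // Collapses each run of identical punctuation to a single character.
    static String collapse(String text) {
        return REPEATED.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        // "??" becomes "?", "..." becomes ".", and "--" becomes "-".
        System.out.println(collapse("Really??  Wait... no -- yes"));
    }
}
```

Note that the ellipsis collapses to a single period, which a downstream sentence splitter could treat as a sentence boundary; this is the side effect mentioned above.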
public StringTransformation cleanText(StringTransformation origTextSt)
Copyright © 2017. All rights reserved.