public class TextCleaner extends Object
Other methods try to remove problem character sequences from text to avoid known problems that such sequences cause for NLP components, but they do not preserve character offsets.
| Constructor and Description |
|---|
| TextCleaner(ResourceManager rm_) |
| Modifier and Type | Method and Description |
|---|---|
| static String | cleanDiscussionForumXml(StringTransformation xmlText, List<String> tagsWithText, List<String> tagsWithAtts, List<String> attributeNames): This method removes XML markup, for the most part. |
| String | cleanText(String text_): Attempts to remove or replace characters likely to cause problems for NLP tools; output should be ASCII only. |
| static void | main(String[] args): Test here. |
| static String | removeXmlLeaveAttributes(String origText, List<String> tagNames, List<String> attributeNames): This method removes XML markup, for the most part, but for the provided set of attribute names it leaves their values in place and in the same position. |
| static String | replaceControlSequence(String origText_): Noisy character sequences seen in some crawled text look like rendered control sequences; this method strips them all. |
| static String | replaceDuplicatePunctuation(String origText_): Replaces duplicate punctuation with single punctuation. |
| static String | replaceMisusedApostropheSymbol(String origText_): Attempts to replace an incorrect apostrophe symbol (after other substitutions for non-standard quotation marks). |
| static String | replaceTildesAndStars(String origText_): Web documents sometimes use tildes and stars either for page section breaks or as bullets for bullet points; these may cause problems for NLP components. |
| static String | replaceUnderscores(String origText_): Replaces underscores with dashes (many crawled news articles seem to have replaced em- or en-dashes with underscores). |
| static StringTransformation | replaceXmlEscapedChars(StringTransformation xmlTextSt): Given an XML string, replaces XML-escaped characters with their ASCII equivalents, padded with whitespace. |
| static String | replaceXmlTags(String origText) |
public TextCleaner(ResourceManager rm_) throws SAXException, IOException
Throws: SAXException, IOException
public static String replaceMisusedApostropheSymbol(String origText_)
public static String removeXmlLeaveAttributes(String origText, List<String> tagNames, List<String> attributeNames)
This method removes XML markup, for the most part, but for the provided set of attribute names it leaves their values in place and in the same position. Content within quote tags is also replaced with white space (for the purpose of cleaning up the ERE code only).
origText - the original text markup.
tagNames - the names of tags containing the attributes to leave; MUST BE LOWERCASE.
attributeNames - the names of attributes to leave; MUST BE LOWERCASE.
public static String cleanDiscussionForumXml(StringTransformation xmlText, List<String> tagsWithText, List<String> tagsWithAtts, List<String> attributeNames)
This method removes XML markup, for the most part. Content within quote tags is left in place (though the quote tags themselves are removed), and the offsets are reported with the other specified attributes.
xmlText - the original XML text.
tagsWithText - the names of tags containing text other than body text (e.g. headlines, quotes).
tagsWithAtts - the names of tags containing the attributes to leave; MUST BE LOWERCASE.
attributeNames - the names of attributes to leave; MUST BE LOWERCASE.
public static StringTransformation replaceXmlEscapedChars(StringTransformation xmlTextSt)
xmlTextSt - StringTransformation containing the text string to be modified.
public static String replaceUnderscores(String origText_)
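The underscore replacement described in the method summary can be illustrated with a minimal standalone sketch. The class name UnderscoreSketch is hypothetical, and the use of String.replace is an assumption; this is not the library's source, only one way the documented behavior could be reproduced.

```java
// Illustrative stand-in, not the TextCleaner source.
// UnderscoreSketch is a hypothetical class name used only for this example.
final class UnderscoreSketch {

    /** Replaces every underscore with an ASCII hyphen. */
    static String replaceUnderscores(String text) {
        return text.replace('_', '-');
    }

    public static void main(String[] args) {
        // prints "The deal - signed Monday - was final."
        System.out.println(replaceUnderscores("The deal _ signed Monday _ was final."));
    }
}
```

Note that a blanket replacement like this also alters underscores inside identifiers or file names, which may matter for text that mixes prose and code.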
public static String replaceTildesAndStars(String origText_)
This method strips tildes and stars completely.
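As a rough illustration of stripping tildes and stars, the sketch below removes every run of those two characters with a regular expression. The class name TildeStarSketch and the exact pattern are assumptions made for this example, not the library's implementation.

```java
import java.util.regex.Pattern;

// Illustrative stand-in, not the TextCleaner source.
// TildeStarSketch is a hypothetical class name used only for this example.
final class TildeStarSketch {

    // Inside a character class, '*' is a literal star, so no escaping is needed.
    private static final Pattern TILDES_AND_STARS = Pattern.compile("[~*]+");

    /** Removes all tildes and stars; surrounding whitespace is left untouched. */
    static String stripTildesAndStars(String text) {
        return TILDES_AND_STARS.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripTildesAndStars("~~~ Section ~~~\n* first item"));
    }
}
```

Because the characters are deleted rather than replaced, the output is shorter than the input and character offsets shift.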
public static String replaceDuplicatePunctuation(String origText_)
Duplicate punctuation sometimes results from a typo or from ad-hoc web formatting; in the latter case, collapsing it may not have the ideal effect.
In addition, double dashes and ellipses may cause problems for NLP components; this should help, though it may introduce extra sentence breaks in the case of ellipses.
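The collapsing behavior and its side effect on ellipses can be sketched with a backreference-based regular expression. The class name DuplicatePunctuationSketch and the choice of \p{Punct} as the punctuation set are assumptions; the library may use a different character set.

```java
import java.util.regex.Pattern;

// Illustrative stand-in, not the TextCleaner source.
// DuplicatePunctuationSketch is a hypothetical class name used only for this example.
final class DuplicatePunctuationSketch {

    // Group 1 captures one ASCII punctuation character; \1+ matches its repeats.
    private static final Pattern REPEATED = Pattern.compile("(\\p{Punct})\\1+");

    /** Collapses each run of an identical punctuation character to one occurrence. */
    static String collapse(String text) {
        return REPEATED.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(collapse("Wait!!! What??"));      // prints "Wait! What?"
        System.out.println(collapse("one -- two"));          // prints "one - two"
        System.out.println(collapse("and then... nothing")); // prints "and then. nothing"
    }
}
```

The last call shows the caveat mentioned above: an ellipsis collapses to a single period, which a sentence splitter may treat as a sentence boundary.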
public static String replaceControlSequence(String origText_)
public static void main(String[] args) throws Exception
args - not used.
Throws: Exception
Copyright © 2017. All rights reserved.