public class TextCleaner extends Object
Other methods try to remove problematic character sequences from text, to avoid known problems that such sequences cause for NLP components, but do not preserve character offsets.
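The point about character offsets can be made concrete with a small sketch. It is not taken from TextCleaner; the class and method names are hypothetical, and it only contrasts deleting a problem sequence (later characters shift left) with overwriting it with spaces (later characters keep their positions).

```java
// Illustrative sketch only: hypothetical helper names, not part of TextCleaner.
final class OffsetSketch {

    /** Deletes every occurrence of seq, so characters after it shift left. */
    static String strip(String text, String seq) {
        return text.replace(seq, "");
    }

    /** Overwrites every occurrence of seq with spaces, so the length is unchanged. */
    static String pad(String text, String seq) {
        StringBuilder blanks = new StringBuilder();
        for (int i = 0; i < seq.length(); i++) {
            blanks.append(' ');
        }
        return text.replace(seq, blanks.toString());
    }

    public static void main(String[] args) {
        String original = "see ~~~ Paris";
        System.out.println(original.indexOf("Paris"));                    // prints 8
        System.out.println(strip(original, "~~~").indexOf("Paris"));      // prints 5
        System.out.println(pad(original, "~~~").indexOf("Paris"));        // prints 8
    }
}
```

Annotations computed on the padded text can be mapped straight back to the original, whereas annotations on the stripped text cannot.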
Constructor and Description |
---|
TextCleaner(ResourceManager rm_) |
Modifier and Type | Method and Description |
---|---|
static String | cleanDiscussionForumXml(StringTransformation xmlText, List<String> tagsWithText, List<String> tagsWithAtts, List<String> attributeNames) Removes XML markup, for the most part. |
String | cleanText(String text_) Attempts to remove or replace characters likely to cause problems for NLP tools; the output should be ASCII only. |
static void | main(String[] args) Test here. |
static String | removeXmlLeaveAttributes(String origText, List<String> tagNames, List<String> attributeNames) Removes XML markup, for the most part, but leaves the values of the provided attribute names in place and in the same position. |
static String | replaceControlSequence(String origText_) Strips noisy character sequences, seen in some crawled text, that look like rendered control sequences. |
static String | replaceDuplicatePunctuation(String origText_) Replaces duplicate punctuation with single punctuation. |
static String | replaceMisusedApostropheSymbol(String origText_) Attempts to replace an incorrect apostrophe symbol (after other substitutions for non-standard quotation marks). |
static String | replaceTildesAndStars(String origText_) Web documents sometimes use tildes and stars either for page section breaks or as bullets for bullet points; these may cause problems for NLP components. |
static String | replaceUnderscores(String origText_) Replaces underscores with dashes (many crawled news articles seem to have replaced em- or en-dashes with underscores). |
static StringTransformation | replaceXmlEscapedChars(StringTransformation xmlTextSt) Given an XML string, replaces XML-escaped characters with their ASCII equivalents, padded with whitespace. |
static String | replaceXmlTags(String origText) |
public TextCleaner(ResourceManager rm_) throws SAXException, IOException
Throws: SAXException, IOException
public static String replaceMisusedApostropheSymbol(String origText_)
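As a rough illustration of this kind of substitution, the sketch below replaces a few characters that are often typed in place of an apostrophe. The set of characters (backtick, acute accent, right single quotation mark) and the rule that they must sit between two letters are assumptions made for the example; the documentation does not say which symbols the real method targets.

```java
import java.util.regex.Pattern;

// Illustrative sketch only: the character set and the letter-context rule are assumptions.
final class ApostropheSketch {

    // Backtick, acute accent (U+00B4) or right single quotation mark (U+2019),
    // but only when there is a letter on both sides.
    private static final Pattern MISUSED =
            Pattern.compile("(?<=\\p{L})[`\\u00B4\\u2019](?=\\p{L})");

    static String fixApostrophes(String text) {
        return MISUSED.matcher(text).replaceAll("'");
    }

    public static void main(String[] args) {
        System.out.println(fixApostrophes("don`t stop"));   // prints don't stop
    }
}
```

Restricting the match to letter contexts leaves symbols used as quotation marks around a word untouched.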
public static String removeXmlLeaveAttributes(String origText, List<String> tagNames, List<String> attributeNames)
Content within quote tags is also replaced with whitespace (for the purpose of cleaning up the ERE code only).
origText - the original text markup.
tagNames - the names of tags containing the attributes to leave, MUST BE LOWERCASE.
attributeNames - the names of attributes to leave, MUST BE LOWERCASE.
public static String cleanDiscussionForumXml(StringTransformation xmlText, List<String> tagsWithText, List<String> tagsWithAtts, List<String> attributeNames)
Content within quote tags is left in place (though the quote tags themselves are removed), and its offsets are reported with the other specified attributes.
xmlText - the original XML text.
tagsWithText - the names of tags containing text other than body text (e.g. headlines, quotes).
tagsWithAtts - the names of tags containing the attributes to leave, MUST BE LOWERCASE.
attributeNames - the names of attributes to leave, MUST BE LOWERCASE.
public static StringTransformation replaceXmlEscapedChars(StringTransformation xmlTextSt)
xmlTextSt - StringTransformation containing the text string to be modified.
public static String replaceUnderscores(String origText_)
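A minimal sketch of the underscore-to-dash substitution, assuming a simple one-for-one character replacement; whether the real method treats runs of underscores differently is not stated here, and the class name is hypothetical.

```java
// Illustrative sketch only: assumes each underscore becomes exactly one dash.
final class UnderscoreSketch {

    static String replaceUnderscores(String text) {
        return text.replace('_', '-');
    }

    public static void main(String[] args) {
        // prints: talks stalled - again - on Monday
        System.out.println(replaceUnderscores("talks stalled _ again _ on Monday"));
    }
}
```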
public static String replaceTildesAndStars(String origText_)
Strips tildes and stars completely.
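The stripping behavior might look like the following sketch, which assumes that every `~` and `*` is deleted and that nothing is put in its place; the class and method names are hypothetical.

```java
// Illustrative sketch only: assumes plain deletion of '~' and '*' characters.
final class TildeStarSketch {

    static String stripTildesAndStars(String text) {
        return text.replaceAll("[~*]+", "");
    }

    public static void main(String[] args) {
        // prints " first item " (surrounding spaces are left as they were)
        System.out.println("\"" + stripTildesAndStars("* first item ~~~") + "\"");
    }
}
```

Because the characters are deleted rather than padded, offsets into the cleaned string no longer match the original text.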
public static String replaceDuplicatePunctuation(String origText_)
Duplicate punctuation sometimes results from a typo, or from ad-hoc web formatting; in the latter case, this replacement may not have the ideal effect.
In addition, double dashes and ellipses may cause problems for NLP components; this method should help, though it may introduce extra sentence breaks in the case of ellipses.
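The effect described above, including the ellipsis caveat, can be sketched with a back-reference regex. The set of punctuation marks handled here is an assumption for the example, and the class name is hypothetical.

```java
// Illustrative sketch only: the punctuation set below is an assumption.
final class DuplicatePunctuationSketch {

    /** Collapses a run of the same punctuation mark into a single mark. */
    static String collapse(String text) {
        return text.replaceAll("([.,!?;:-])\\1+", "$1");
    }

    public static void main(String[] args) {
        // prints: Wait! Really? Yes - maybe.
        System.out.println(collapse("Wait!!! Really?? Yes -- maybe..."));
    }
}
```

Note how the trailing ellipsis becomes a single period, which a sentence splitter would read as a sentence break.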
public static String replaceControlSequence(String origText_)
public static void main(String[] args) throws Exception
args - not used.
Throws: Exception
Copyright © 2017. All rights reserved.