public class XmlDocumentProcessor extends Object
Modifier and Type | Class and Description |
---|---|
class |
XmlDocumentProcessor.SpanInfo
a structure to store span information: label, offsets, attributes (including value offsets)
|
Modifier and Type | Field and Description |
---|---|
static String |
SPAN_INFO
tag to indicate that the associated information denotes a span in source xml with the associated label.
|
Constructor and Description |
---|
XmlDocumentProcessor(Set<String> deletableSpanTags,
Map<String,Set<String>> tagsWithAtts,
Set<String> singletonTags,
boolean throwExceptionOnUnrecognizedTag)
specify tags that determine processor behavior.
|
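The constructor arguments are ordinary java.util collections, so they can be assembled before the processor is created. The following is a minimal sketch, not taken from the library: the class `ProcessorConfigSketch`, the helper `lowercaseSet`, and the tag and attribute names (`quote`, `post`, `author`, `datetime`, `img`) are all hypothetical. It lowercases every name because the parameter documentation states "MUST BE LOWERCASE"; whether that applies to tag names, attribute names, or both is an assumption here. The constructor call is left as a comment so the sketch compiles without the library on the classpath.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class ProcessorConfigSketch {

    /** Collects names into a set, lowercasing each one (the docs ask for lowercase names). */
    static Set<String> lowercaseSet(String... names) {
        Set<String> result = new HashSet<>();
        for (String name : names) {
            result.add(name.toLowerCase(Locale.ROOT));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical tag and attribute names, chosen only for illustration.
        Set<String> deletableSpanTags = lowercaseSet("quote");
        Map<String, Set<String>> tagsWithAtts = new HashMap<>();
        tagsWithAtts.put("post", lowercaseSet("author", "datetime"));
        Set<String> singletonTags = lowercaseSet("img");
        boolean throwExceptionOnUnrecognizedTag = true;

        // With the library on the classpath, these would be passed as:
        // XmlDocumentProcessor processor = new XmlDocumentProcessor(
        //         deletableSpanTags, tagsWithAtts, singletonTags, throwExceptionOnUnrecognizedTag);
        System.out.println(deletableSpanTags + " " + tagsWithAtts + " " + singletonTags
                + " " + throwExceptionOnUnrecognizedTag);
    }
}
```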
Modifier and Type | Method and Description |
---|---|
static Map<IntPair,Set<String>> |
compileAttributeValues(List<XmlDocumentProcessor.SpanInfo> xmlMarkup)
builds a map of attribute value offsets to attribute value to support search for metadata matching
entity mentions
|
static Map<IntPair,XmlDocumentProcessor.SpanInfo> |
compileOffsetSpanMapping(List<XmlDocumentProcessor.SpanInfo> retainedTagInfo)
generate a mapping from span offset to span info
|
Pair<StringTransformation,List<XmlDocumentProcessor.SpanInfo>> |
processXml(String xmlText)
This class removes XML markup, for the most part.
|
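To illustrate the general idea behind processXml (removing markup while keeping track of where tagged spans sat in the source), here is a simplified stand-in. It is not the library's implementation: `MarkupStripSketch`, `Span`, `Result`, and `strip` are hypothetical names, the regular expression only recognizes simple open and close tags, and singleton tags, deletable spans, and attribute offsets are not handled.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MarkupStripSketch {

    /** Hypothetical span record: a tag label plus content offsets into the source xml. */
    static final class Span {
        final String label;
        final int start; // offset of the first content character in the source
        final int end;   // offset one past the last content character
        Span(String label, int start, int end) {
            this.label = label;
            this.start = start;
            this.end = end;
        }
    }

    /** Clean text plus the spans found, in the order their close tags appear. */
    static final class Result {
        final String cleanText;
        final List<Span> spans;
        Result(String cleanText, List<Span> spans) {
            this.cleanText = cleanText;
            this.spans = spans;
        }
    }

    // Matches simple open or close tags; self-closing tags are not supported here.
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    static Result strip(String xml) {
        StringBuilder clean = new StringBuilder();
        List<Span> spans = new ArrayList<>();
        Deque<String> openLabels = new ArrayDeque<>();
        Deque<Integer> openStarts = new ArrayDeque<>();
        Matcher m = TAG.matcher(xml);
        int last = 0;
        while (m.find()) {
            clean.append(xml, last, m.start()); // keep the text between tags
            last = m.end();
            String label = m.group(2).toLowerCase(Locale.ROOT);
            if (m.group(1).isEmpty()) {
                // Open tag: remember where its content begins.
                openLabels.push(label);
                openStarts.push(m.end());
            } else {
                // Close tag: it must match the most recent open tag.
                if (openLabels.isEmpty() || !openLabels.peek().equals(label)) {
                    throw new IllegalArgumentException("unmatched close tag: " + label);
                }
                openLabels.pop();
                spans.add(new Span(label, openStarts.pop(), m.start()));
            }
        }
        clean.append(xml, last, xml.length());
        if (!openLabels.isEmpty()) {
            throw new IllegalArgumentException("unclosed tag: " + openLabels.peek());
        }
        return new Result(clean.toString(), spans);
    }
}
```

Because each `Span` stores offsets into the original string, `xml.substring(span.start, span.end)` recovers the tagged content, including any nested markup. The real processor additionally returns a StringTransformation, which this sketch does not model.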
public static final String SPAN_INFO
public XmlDocumentProcessor(Set<String> deletableSpanTags, Map<String,Set<String>> tagsWithAtts, Set<String> singletonTags, boolean throwExceptionOnUnrecognizedTag)
deletableSpanTags
- the names of tags containing text to be excluded from clean text (e.g. quotes)
tagsWithAtts
- the names of tags containing the attributes to retain, paired with sets of attribute names. MUST BE LOWERCASE.
singletonTags
- the names of one-off tags (i.e. no close tag) whose contents must be skipped entirely
public static Map<IntPair,Set<String>> compileAttributeValues(List<XmlDocumentProcessor.SpanInfo> xmlMarkup)
xmlMarkup
- xml span information collected from the source document
public static Map<IntPair,XmlDocumentProcessor.SpanInfo> compileOffsetSpanMapping(List<XmlDocumentProcessor.SpanInfo> retainedTagInfo)
retainedTagInfo
- span information produced by XmlDocumentProcessor
public Pair<StringTransformation,List<XmlDocumentProcessor.SpanInfo>> processXml(String xmlText)
Text within quote tags is left in place (though the quote tags themselves are removed), and its offsets are reported with the other specified attributes.
This class has some facility for handling nested tags. Open tags without a matching close tag are checked against the tags to ignore (provided at construction); if found, they are ignored (deleted). Otherwise, an exception is thrown.
xmlText
- StringTransformation whose basis is the original xml text.
Copyright © 2017. All rights reserved.