edu.illinois.cs.cogcomp.lbj.coref.features
Class StringSimilarityFeatures

java.lang.Object
  extended by edu.illinois.cs.cogcomp.lbj.coref.features.StringSimilarityFeatures

public class StringSimilarityFeatures
extends java.lang.Object

A collection of features relating to the similarity of strings.


Constructor Summary
protected StringSimilarityFeatures()
          Should not need to construct this static feature library.
 
Method Summary
static int calcLevenshteinEditDist(java.lang.String a, java.lang.String b)
          Calculates the (unnormalized) Levenshtein edit distance for a pair of strings.
static boolean doLastWordsMatch(CExample ex, boolean useHead)
          Determines whether the last word of one mention matches the last word of another mention.
static double getEdit(CExample ex, boolean useHead)
          Gets the character-based edit distance, normalized by the length of the longer string.
static int getNumDiffCapitalizedWords(CExample ex)
          Determines the number of capitalized words that occur in exactly one mention.
static boolean leftSubstring(CExample ex, boolean useHead)
          Determines whether one mention's text begins with the text of the other mention.
static boolean prenominalModifierWordMatchAnotherOrHeadWord(CExample ex)
          Determines whether any words of one mention preceding its head match any words of the other mention that precede or are contained in the head.
static boolean prenominalModifierWordMatchAnotherOrHeadWord(Mention m1, Mention m2)
          Determines whether a noun preceding the head of m1 matches a noun preceding or in the head of m2. No assumptions are made about the textual order of m1 and m2.
static boolean rightSubstring(CExample ex, boolean useHead)
          Determines whether one mention's text ends with the text of the other mention.
static boolean stringsMatchByWords(java.lang.String s1, java.lang.String s2, java.lang.String[] ignoreWords)
          Determines whether two strings contain the same sequence of words, after dropping ignoreWords.
static boolean subsequence(CExample ex, boolean useHead)
          Determines whether the sequence of words in one mention is a subsequence of the words in another mention.
static boolean subsequence(java.lang.String big, java.lang.String small)
          Determines whether the sequence of words in small is a subsequence of the words in big.
static boolean substring(CExample ex, boolean useHead)
          Determines whether one mention's text is a substring of the other's.
static boolean textMatchSoon(CExample ex, boolean useHead)
          Determines whether the text (either the heads or the extents) of the mentions match, after dropping stop words.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

StringSimilarityFeatures

protected StringSimilarityFeatures()
Should not need to construct this static feature library.

Method Detail

substring

public static boolean substring(CExample ex,
                                boolean useHead)
Determines whether one mention's text is a substring of the other's. Uses case-insensitive comparison.

Parameters:
ex - The example whose mentions' text will be compared.
useHead - if true, compare head text, otherwise compare extent text.

leftSubstring

public static boolean leftSubstring(CExample ex,
                                    boolean useHead)
Determines whether one mention's text begins with the text of the other mention. Uses case-insensitive comparison.

Parameters:
ex - The example whose mentions' text will be compared.
useHead - if true, compare head text, otherwise compare extent text.

rightSubstring

public static boolean rightSubstring(CExample ex,
                                     boolean useHead)
Determines whether one mention's text ends with the text of the other mention. Uses case-insensitive comparison.

Parameters:
ex - The example whose mentions' text will be compared.
useHead - if true, compare head text, otherwise compare extent text.
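
The three substring checks above can be illustrated on plain strings. The following is a minimal sketch, not this library's implementation: the class SubstringSketch and its methods are hypothetical, they take raw strings instead of a CExample, and they assume that "one mention's text ... the other" means the test is tried in both directions.

```java
import java.util.Locale;

// Hypothetical stand-in for the CExample-based methods; operates on raw strings.
class SubstringSketch {
    // True if either string contains the other, ignoring case.
    static boolean substring(String a, String b) {
        String x = a.toLowerCase(Locale.ROOT);
        String y = b.toLowerCase(Locale.ROOT);
        return x.contains(y) || y.contains(x);
    }

    // True if either string begins with the other, ignoring case.
    static boolean leftSubstring(String a, String b) {
        String x = a.toLowerCase(Locale.ROOT);
        String y = b.toLowerCase(Locale.ROOT);
        return x.startsWith(y) || y.startsWith(x);
    }

    // True if either string ends with the other, ignoring case.
    static boolean rightSubstring(String a, String b) {
        String x = a.toLowerCase(Locale.ROOT);
        String y = b.toLowerCase(Locale.ROOT);
        return x.endsWith(y) || y.endsWith(x);
    }
}
```

Under these assumptions, "Bill Clinton" and "CLINTON" satisfy substring and rightSubstring but not leftSubstring.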

textMatchSoon

public static boolean textMatchSoon(CExample ex,
                                    boolean useHead)
Determines whether the text (either the heads or the extents) of the mentions match, after dropping stop words.

Parameters:
ex - The example whose mentions will be compared.
useHead - if true, compare head text, otherwise compare extent text.

doLastWordsMatch

public static boolean doLastWordsMatch(CExample ex,
                                       boolean useHead)
Determines whether the last word of one mention matches the last word of another mention. Uses case-insensitive comparison.

Parameters:
ex - The example whose mentions' text will be compared.
useHead - if true, compare head text, otherwise compare extent text.

prenominalModifierWordMatchAnotherOrHeadWord

public static boolean prenominalModifierWordMatchAnotherOrHeadWord(CExample ex)
Determines whether any words of one mention preceding its head match any words of the other mention that precede or are contained in the head.

Parameters:
ex - The example whose mentions will be compared.
Returns:
Whether a pre-head modifier matches another pre-head modifier or head word.

prenominalModifierWordMatchAnotherOrHeadWord

public static boolean prenominalModifierWordMatchAnotherOrHeadWord(Mention m1,
                                                                   Mention m2)
Determines whether a noun preceding the head of m1 matches a noun preceding or in the head of m2. No assumptions are made about the textual order of m1 and m2. Uses POS tags to determine whether words are nouns.

Parameters:
m1 - The first mention.
m2 - The second mention.
Returns:
true if any nouns from m1 that precede its head match any nouns from m2 preceding or in its head.

subsequence

public static boolean subsequence(CExample ex,
                                  boolean useHead)
Determines whether the sequence of words in one mention is a subsequence of the words in another mention. Uses case-insensitive comparisons.

Parameters:
ex - The example whose mentions will be compared.
useHead - Whether heads or extents should be compared.
Returns:
Whether one mention's words are a subsequence of another mention's.

subsequence

public static boolean subsequence(java.lang.String big,
                                  java.lang.String small)
Determines whether the sequence of words in small is a subsequence of the words in big. Words are split using whitespace ("\s+") and compared using a case-insensitive comparison.

Parameters:
big - The bigger string.
small - The smaller string.
Returns:
Whether one string's words are a subsequence of the other string's.
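
A word-level subsequence test of this kind can be sketched with a single pass over the longer string. This is an illustration only, not the library's code: the class WordSubsequenceSketch is hypothetical, and treating an empty small string as a subsequence of anything is an assumption.

```java
// Hypothetical sketch of a word-level, case-insensitive subsequence test.
class WordSubsequenceSketch {
    static boolean isWordSubsequence(String big, String small) {
        // Assumption: an empty (or all-whitespace) small string trivially matches.
        if (small.trim().isEmpty()) {
            return true;
        }
        String[] bigWords = big.trim().split("\\s+");
        String[] smallWords = small.trim().split("\\s+");
        int next = 0; // index of the next word of small still to be found
        for (String word : bigWords) {
            if (next < smallWords.length && word.equalsIgnoreCase(smallWords[next])) {
                next++;
            }
        }
        // Every word of small was found in big, in order.
        return next == smallWords.length;
    }
}
```

The matched words need not be adjacent, only in the same order: "bill clinton" is a subsequence of "President Bill Jefferson Clinton", while "Clinton Bill" is not.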

getNumDiffCapitalizedWords

public static int getNumDiffCapitalizedWords(CExample ex)
Determines the number of capitalized words that occur in exactly one mention. Uses CASE-SENSITIVE comparisons.

Parameters:
ex - The example whose mentions will be compared.
Returns:
The number of capitalized words appearing in exactly one mention.
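
One way to read this count is as the size of the symmetric difference between the two mentions' sets of capitalized words. The sketch below works on raw strings and is not the library's implementation; the class CapitalizedDiffSketch is hypothetical, and two details are assumptions: a word counts as capitalized when its first character is uppercase, and a word repeated within a mention is counted once.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: counts capitalized words found in exactly one of two strings.
class CapitalizedDiffSketch {
    // Collects the distinct words whose first character is uppercase.
    static Set<String> capitalizedWords(String text) {
        Set<String> out = new HashSet<>();
        for (String w : text.trim().split("\\s+")) {
            if (!w.isEmpty() && Character.isUpperCase(w.charAt(0))) {
                out.add(w);
            }
        }
        return out;
    }

    // Size of the symmetric difference; comparison is case-sensitive.
    static int numDiffCapitalizedWords(String a, String b) {
        Set<String> inA = capitalizedWords(a);
        Set<String> inB = capitalizedWords(b);
        int count = 0;
        for (String w : inA) {
            if (!inB.contains(w)) count++;
        }
        for (String w : inB) {
            if (!inA.contains(w)) count++;
        }
        return count;
    }
}
```

Because the comparison is case-sensitive, "IBM" and "Ibm" are treated as two different capitalized words in this sketch.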

stringsMatchByWords

public static boolean stringsMatchByWords(java.lang.String s1,
                                          java.lang.String s2,
                                          java.lang.String[] ignoreWords)
Determines whether two strings contain the same sequence of words, after dropping ignoreWords. Note: Words are split by single whitespace characters (\s) rather than by one or more whitespace characters (\s+) for backwards compatibility.

Parameters:
s1 - The first string.
s2 - The second string.
ignoreWords - The words to be ignored. The words should be supplied in lowercase.
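
The note about splitting on \s rather than \s+ matters when a string contains consecutive whitespace: Java's String.split("\\s") then yields empty tokens between the separators. The sketch below shows that effect. It is not the library's implementation: the class WordMatchSketch is hypothetical, and both the lowercasing of words before comparison and the retention of empty tokens are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of word-sequence matching after dropping ignored words.
class WordMatchSketch {
    // Splits on single whitespace characters, lowercases, and drops ignored words.
    static List<String> keptWords(String s, Set<String> ignore) {
        List<String> kept = new ArrayList<>();
        for (String w : s.split("\\s")) {
            String lower = w.toLowerCase(Locale.ROOT);
            if (!ignore.contains(lower)) {
                kept.add(lower); // empty tokens from doubled whitespace are kept
            }
        }
        return kept;
    }

    static boolean matchByWords(String s1, String s2, String[] ignoreWords) {
        Set<String> ignore = new HashSet<>(Arrays.asList(ignoreWords));
        return keptWords(s1, ignore).equals(keptWords(s2, ignore));
    }
}
```

In this sketch, "the White House" matches "White House" when "the" is ignored, but "the  cat" (two spaces) does not match "the cat", because the doubled space leaves an empty token in the first word list.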

getEdit

public static double getEdit(CExample ex,
                             boolean useHead)
Gets the character-based edit distance, normalized by the length of the longer string.

Parameters:
ex - The example whose mentions should be compared.
useHead - Whether the heads or extents should be compared.
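
The normalization described above divides a raw edit distance by the length of the longer string. A minimal sketch follows; it is not the library's code, the class NormalizedEditSketch is hypothetical, and returning 0.0 when both strings are empty is an assumption made to avoid dividing by zero.

```java
// Hypothetical sketch: normalizes a precomputed edit distance by the longer length.
class NormalizedEditSketch {
    static double normalize(int distance, String a, String b) {
        int longer = Math.max(a.length(), b.length());
        if (longer == 0) {
            return 0.0; // assumption: two empty strings are at distance 0
        }
        return (double) distance / longer;
    }
}
```

For "kitten" and "sitting", whose Levenshtein distance is 3, this gives 3/7. Since the distance never exceeds the longer length, the result lies in [0, 1].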

calcLevenshteinEditDist

public static int calcLevenshteinEditDist(java.lang.String a,
                                          java.lang.String b)
Calculates the (unnormalized) Levenshtein edit distance for a pair of strings. Applies the algorithm given in the Wikipedia article as of 3/24/2008.

Parameters:
a - One string.
b - Another string.
Returns:
The edit distance, as an integer.
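
For reference, the standard dynamic-programming formulation of Levenshtein distance can be sketched as follows. This is a generic two-row version written for illustration, not necessarily the code used by this class; the class LevenshteinSketch is hypothetical.

```java
// Hypothetical sketch of the standard Levenshtein dynamic program (two-row variant).
class LevenshteinSketch {
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1]; // distances for the previous prefix of a
        int[] curr = new int[b.length() + 1]; // distances for the current prefix of a
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] tmp = prev; // reuse the two rows instead of allocating a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

The sketch compares characters exactly, so case differences count as substitutions.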