edu.illinois.cs.cogcomp.lbj.coref.features
Class Gazetteers

java.lang.Object
  extended by edu.illinois.cs.cogcomp.lbj.coref.features.Gazetteers

public class Gazetteers
extends java.lang.Object

A collection of gazetteers. Each gazetteer is a set of items. Gazetteers whose names end in CS are case-sensitive; the others contain only lowercase items. Any gazetteer may contain ambiguous items, which might appear in multiple gazetteers. For example, "Israel" is both a male first name and a country name. The first request for any gazetteer loads all of them, and they are kept in memory afterward.
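The lookup convention described above can be sketched as follows. This is a minimal illustration, not the library's code: the two sets are hypothetical stand-ins for what getMaleFirstNames() and getCountriesCS() might return, and the helper names inLowercased and inCaseSensitive are invented for the example. The point is that a token should be lowercased before it is tested against a gazetteer without the CS suffix, and tested as-is against a CS gazetteer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class GazetteerLookupSketch {
    // Hypothetical stand-ins for Gazetteers.getMaleFirstNames() and
    // Gazetteers.getCountriesCS(); the real sets are loaded from files.
    static final Set<String> MALE_FIRST_NAMES =
            new HashSet<>(Arrays.asList("israel", "john"));
    static final Set<String> COUNTRIES_CS =
            new HashSet<>(Arrays.asList("Israel", "France"));

    // Non-CS gazetteers hold only lowercase items, so lowercase the token first.
    static boolean inLowercased(Set<String> gazetteer, String token) {
        return gazetteer.contains(token.toLowerCase(Locale.ROOT));
    }

    // CS gazetteers preserve case, so the token is tested unchanged.
    static boolean inCaseSensitive(Set<String> gazetteer, String token) {
        return gazetteer.contains(token);
    }

    public static void main(String[] args) {
        String token = "Israel"; // ambiguous: appears in both gazetteers
        System.out.println(inLowercased(MALE_FIRST_NAMES, token));     // true
        System.out.println(inCaseSensitive(COUNTRIES_CS, token));      // true
        System.out.println(inCaseSensitive(COUNTRIES_CS, "israel"));   // false
    }
}
```

Because an item such as "Israel" can match more than one gazetteer, a match is best treated as evidence for a category rather than a definitive label.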

Author:
Eric Bengtson

Field Summary
protected static java.util.Set<java.lang.String> cities
           
protected static java.util.Set<java.lang.String> citiesCS
           
protected static java.util.Set<java.lang.String> commonWords
           
protected static java.util.Set<java.lang.String> commonWords5
           
protected static java.util.Set<java.lang.String> corporations
           
protected static java.util.Set<java.lang.String> corporationsCS
           
protected static java.util.Set<java.lang.String> countries
           
protected static java.util.Set<java.lang.String> countriesCS
           
protected static java.util.Set<java.lang.String> countriesDemAdj
           
protected static java.util.Set<java.lang.String> countriesDemAdjCS
           
protected static java.util.Set<java.lang.String> femaleFirstNames
           
protected static java.util.Set<java.lang.String> femaleFirstNamesCS
           
protected static boolean gazetteersInitialized
           
protected static java.util.Set<java.lang.String> honors
           
protected static java.util.Set<java.lang.String> inflectedWords
           
protected static java.util.Set<java.lang.String> lastNames
           
protected static java.util.Set<java.lang.String> lastNamesCS
           
protected static java.util.Set<java.lang.String> lowercaseWords
           
protected static java.util.Set<java.lang.String> maleFirstNames
           
protected static java.util.Set<java.lang.String> maleFirstNamesCS
           
protected static java.util.Set<java.lang.String> orgClosings
           
protected static java.util.Set<java.lang.String> pluralNouns
           
protected static java.util.Set<java.lang.String> polParties
           
protected static java.util.Set<java.lang.String> polPartiesCS
           
protected static java.util.Set<java.lang.String> prepositions
           
protected static java.util.Set<java.lang.String> pronouns
           
protected static java.util.Set<java.lang.String> sayWords
           
protected static java.util.Set<java.lang.String> singularNouns
           
protected static java.util.Set<java.lang.String> sportTeams
           
protected static java.util.Set<java.lang.String> sportTeamsCS
           
protected static java.util.Set<java.lang.String> states
           
protected static java.util.Set<java.lang.String> statesCS
           
protected static java.util.Set<java.lang.String> stopWords
           
protected static java.util.Set<java.lang.String> universities
           
protected static java.util.Set<java.lang.String> universitiesCS
           
 
Constructor Summary
protected Gazetteers()
          Not intended to be instantiated; this feature collection is accessed through its static members.
 
Method Summary
static java.util.Set<java.lang.String> getCities()
          Gets the cities gazetteer.
static java.util.Set<java.lang.String> getCitiesCS()
          Gets the case-sensitive cities gazetteer.
static java.util.Set<java.lang.String> getCommonWords()
          Gets the common words gazetteer.
static java.util.Set<java.lang.String> getCommonWords5()
          Gets the gazetteer of common words appearing more than five times.
static java.util.Set<java.lang.String> getCorporations()
          Gets the corporations gazetteer.
static java.util.Set<java.lang.String> getCorporationsCS()
          Gets the case-sensitive corporations gazetteer.
static java.util.Set<java.lang.String> getCountries()
          Gets the countries gazetteer.
static java.util.Set<java.lang.String> getCountriesCS()
          Gets the case-sensitive countries gazetteer.
static java.util.Set<java.lang.String> getCountriesDemAdj()
          Gets the countries, country adjectives, and country people names gazetteer.
static java.util.Set<java.lang.String> getCountriesDemAdjCS()
          Gets the case-sensitive countries, country adjectives, and country people names gazetteer.
static java.util.Set<java.lang.String> getFemaleFirstNames()
          Gets the female first names gazetteer.
static java.util.Set<java.lang.String> getFemaleFirstNamesCS()
          Gets the case-sensitive female first names gazetteer.
static java.util.Set<java.lang.String> getHonors()
          Gets the honorary titles gazetteer.
static java.util.Set<java.lang.String> getInflectedWords()
          Gets the inflected words gazetteer.
static java.util.Set<java.lang.String> getLastNames()
          Gets the last names gazetteer.
static java.util.Set<java.lang.String> getLastNamesCS()
          Gets the case-sensitive last names gazetteer.
static java.util.Set<java.lang.String> getLowercaseWords()
          Gets the lowercase words gazetteer.
static java.util.Set<java.lang.String> getMaleFirstNames()
          Gets the male first names gazetteer.
static java.util.Set<java.lang.String> getMaleFirstNamesCS()
          Gets the case-sensitive male first names gazetteer.
static java.util.Set<java.lang.String> getOrgClosings()
          Gets the organization identifier suffixes gazetteer.
static java.util.Set<java.lang.String> getPluralNouns()
          Gets the plural nouns gazetteer.
static java.util.Set<java.lang.String> getPolParties()
          Gets the political parties gazetteer.
static java.util.Set<java.lang.String> getPrepositions()
          Gets the prepositions gazetteer.
static java.util.Set<java.lang.String> getPronouns()
          Gets the pronouns gazetteer.
static java.util.Set<java.lang.String> getSayWords()
          Gets the say words gazetteer.
static java.util.Set<java.lang.String> getSingularNouns()
          Gets the singular nouns gazetteer.
static java.util.Set<java.lang.String> getSportTeams()
          Gets the sports teams gazetteer.
static java.util.Set<java.lang.String> getSportTeamsCS()
          Gets the case-sensitive sports teams gazetteer.
static java.util.Set<java.lang.String> getStates()
          Gets the US states gazetteer.
static java.util.Set<java.lang.String> getStatesCS()
          Gets the case-sensitive US states gazetteer.
static java.util.Set<java.lang.String> getStopWords()
          Gets the stop words gazetteer.
static java.util.Set<java.lang.String> getUniversities()
          Gets the universities gazetteer.
static java.util.Set<java.lang.String> getUniversitiesCS()
          Gets the case-sensitive universities gazetteer.
private static void initGazetteers()
          Loads the gazetteers from files in the gazetteers directory located in a directory on the classpath.
protected static java.util.Set<java.lang.String> loadLinesAsSet(java.lang.String filename, boolean lower)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

gazetteersInitialized

protected static boolean gazetteersInitialized

honors

protected static java.util.Set<java.lang.String> honors

maleFirstNames

protected static java.util.Set<java.lang.String> maleFirstNames

femaleFirstNames

protected static java.util.Set<java.lang.String> femaleFirstNames

lastNames

protected static java.util.Set<java.lang.String> lastNames

orgClosings

protected static java.util.Set<java.lang.String> orgClosings

countriesDemAdj

protected static java.util.Set<java.lang.String> countriesDemAdj

countries

protected static java.util.Set<java.lang.String> countries

cities

protected static java.util.Set<java.lang.String> cities

states

protected static java.util.Set<java.lang.String> states

polParties

protected static java.util.Set<java.lang.String> polParties

corporations

protected static java.util.Set<java.lang.String> corporations

sportTeams

protected static java.util.Set<java.lang.String> sportTeams

universities

protected static java.util.Set<java.lang.String> universities

inflectedWords

protected static java.util.Set<java.lang.String> inflectedWords

lowercaseWords

protected static java.util.Set<java.lang.String> lowercaseWords

singularNouns

protected static java.util.Set<java.lang.String> singularNouns

pluralNouns

protected static java.util.Set<java.lang.String> pluralNouns

sayWords

protected static java.util.Set<java.lang.String> sayWords

pronouns

protected static java.util.Set<java.lang.String> pronouns

prepositions

protected static java.util.Set<java.lang.String> prepositions

stopWords

protected static java.util.Set<java.lang.String> stopWords

commonWords

protected static java.util.Set<java.lang.String> commonWords

commonWords5

protected static java.util.Set<java.lang.String> commonWords5

maleFirstNamesCS

protected static java.util.Set<java.lang.String> maleFirstNamesCS

femaleFirstNamesCS

protected static java.util.Set<java.lang.String> femaleFirstNamesCS

lastNamesCS

protected static java.util.Set<java.lang.String> lastNamesCS

countriesDemAdjCS

protected static java.util.Set<java.lang.String> countriesDemAdjCS

countriesCS

protected static java.util.Set<java.lang.String> countriesCS

citiesCS

protected static java.util.Set<java.lang.String> citiesCS

statesCS

protected static java.util.Set<java.lang.String> statesCS

polPartiesCS

protected static java.util.Set<java.lang.String> polPartiesCS

corporationsCS

protected static java.util.Set<java.lang.String> corporationsCS

sportTeamsCS

protected static java.util.Set<java.lang.String> sportTeamsCS

universitiesCS

protected static java.util.Set<java.lang.String> universitiesCS
Constructor Detail

Gazetteers

protected Gazetteers()
Not intended to be instantiated; this feature collection is accessed through its static members.

Method Detail

getMaleFirstNames

public static java.util.Set<java.lang.String> getMaleFirstNames()
Gets the male first names gazetteer. The gazetteer is a set of known male first names, in lowercase. The list may include ambiguous items. If the gazetteers have not been loaded, they will be loaded first.

Returns:
A case-insensitive set of male first names.
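
The "loaded first" behavior shared by every getter on this page can be sketched as below. This is an assumption-laden illustration rather than the class's actual source: the class name LazyGazetteerSketch, the hard-coded items, and the loadCount counter are hypothetical, and the synchronized keyword is added here for safety; the documentation above does not say whether the real class is thread-safe. Only the overall shape (a static flag such as gazetteersInitialized, and one initializer that fills every set) comes from the field and method summaries.

```java
import java.util.HashSet;
import java.util.Set;

public class LazyGazetteerSketch {
    // Mirrors the documented gazetteersInitialized flag.
    protected static boolean initialized = false;
    protected static Set<String> maleFirstNames;
    protected static Set<String> lastNames;
    // Hypothetical counter, present only to show that loading happens once.
    static int loadCount = 0;

    // Loads every gazetteer at once; later calls return immediately.
    private static synchronized void init() {
        if (initialized) {
            return;
        }
        maleFirstNames = new HashSet<>();
        maleFirstNames.add("john");   // placeholder data, not from the real files
        lastNames = new HashSet<>();
        lastNames.add("smith");       // placeholder data, not from the real files
        loadCount++;
        initialized = true;
    }

    public static Set<String> getMaleFirstNames() {
        init();
        return maleFirstNames;
    }

    public static Set<String> getLastNames() {
        init();
        return lastNames;
    }

    public static void main(String[] args) {
        System.out.println(getMaleFirstNames().contains("john")); // true
        System.out.println(getLastNames().contains("smith"));     // true
        System.out.println(loadCount);                            // 1
    }
}
```

One consequence of this design is that the first getter call pays the cost of loading every gazetteer, while all later calls are plain field reads.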

getMaleFirstNamesCS

public static java.util.Set<java.lang.String> getMaleFirstNamesCS()
Gets the case-sensitive male first names gazetteer. The gazetteer is a set of known male first names, case preserved. The list may include ambiguous items. If the gazetteers have not been loaded, they will be loaded first.

Returns:
A case-sensitive set of male first names.

getFemaleFirstNames

public static java.util.Set<java.lang.String> getFemaleFirstNames()
Gets the female first names gazetteer. The gazetteer is a set of known female first names, in lowercase. The list may include ambiguous items. If the gazetteers have not been loaded, they will be loaded first.

Returns:
A case-insensitive set of female first names.

getFemaleFirstNamesCS

public static java.util.Set<java.lang.String> getFemaleFirstNamesCS()
Gets the case-sensitive female first names gazetteer. The gazetteer is a set of known female first names, case preserved. The list may include ambiguous items. If the gazetteers have not been loaded, they will be loaded first.

Returns:
A case-sensitive set of female first names.

getLastNames

public static java.util.Set<java.lang.String> getLastNames()
Gets the last names gazetteer. The gazetteer is a set of known last names, in lowercase. The list may include ambiguous items. If the gazetteers have not been loaded, they will be loaded first.

Returns:
A case-insensitive set of last names.

getLastNamesCS

public static java.util.Set<java.lang.String> getLastNamesCS()
Gets the case-sensitive last names gazetteer. The gazetteer is a set of known last names, case preserved. The list may include ambiguous items. If the gazetteers have not been loaded, they will be loaded first.

Returns:
A case-sensitive set of last names.

getHonors

public static java.util.Set<java.lang.String> getHonors()
Gets the honorary titles gazetteer. The gazetteer is a set of honorary titles such as "mr" and "mrs", in lowercase. The list may include ambiguous items. If the gazetteers have not been loaded, they will be loaded first.

Returns:
A case-insensitive set of honorary titles like "mr" and "mrs".

getCities

public static java.util.Set<java.lang.String> getCities()
Gets the cities gazetteer. The gazetteer is a set of known cities, in lowercase. The list may include ambiguous items. If the gazetteers have not been loaded, they will be loaded first.

Returns:
A case-insensitive set of city names.

getCitiesCS

public static java.util.Set<java.lang.String> getCitiesCS()
Gets the case-sensitive cities gazetteer. The gazetteer is a set of known cities, case preserved. The list may include ambiguous items. If the gazetteers have not been loaded, they will be loaded first.

Returns:
A case-sensitive set of city names.

getStates

public static java.util.Set<java.lang.String> getStates()
Gets the US states gazetteer. The gazetteer is the set of the states in the United States, in lowercase. The list may include ambiguous items. If the gazetteers have not been loaded, they will be loaded first.

Returns:
A case-insensitive set of the states in the United States.

getStatesCS

public static java.util.Set<java.lang.String> getStatesCS()
Gets the case-sensitive US states gazetteer. The gazetteer is the set of the states in the United States, case preserved. The list may include ambiguous items. If the gazetteers have not been loaded, they will be loaded first.

Returns:
A case-sensitive set of the states in the United States.

getCountries

public static java.util.Set<java.lang.String> getCountries()
Gets the countries gazetteer. The gazetteer is the set of countries, in lowercase. The list may include ambiguous items. If the gazetteers have not been loaded, they will be loaded first.

Returns:
A case-insensitive set of countries.

getCountriesCS

public static java.util.Set<java.lang.String> getCountriesCS()
Gets the case-sensitive countries gazetteer. The gazetteer is the set of countries, case preserved. The list may include ambiguous items. If the gazetteers have not been loaded, they will be loaded first.

Returns:
A case-sensitive set of countries.

getCountriesDemAdj

public static java.util.Set<java.lang.String> getCountriesDemAdj()
Gets the countries, country adjectives, and country people names gazetteer. The gazetteer is the set of all countries, country adjectives, and country people names, in lowercase. A country adjective is the adjective form of a country; for example "american". A country people name is the term used to describe the residents of a country; for example "americans". The list may include ambiguous items. If the gazetteers have not been loaded, they will be loaded first.

Returns:
A case-insensitive set of countries, country adjectives, and country people names.

getCountriesDemAdjCS

public static java.util.Set<java.lang.String> getCountriesDemAdjCS()
Gets the case-sensitive countries, country adjectives, and country people names gazetteer. The gazetteer is the set of all countries, country adjectives, and country people names, case preserved. A country adjective is the adjective form of a country; for example "American". A country people name is the term used to describe the residents of a country; for example "Americans". The list may include ambiguous items. If the gazetteers have not been loaded, they will be loaded first.

Returns:
A case-sensitive set of countries, country adjectives, and country people names.

getPolParties

public static java.util.Set<java.lang.String> getPolParties()
Gets the political parties gazetteer. The gazetteer is a set of known political parties, in lowercase. The list may include ambiguous items. If the gazetteers have not been loaded, they will be loaded first.

Returns:
A case-insensitive set of political parties.

getCorporations

public static java.util.Set<java.lang.String> getCorporations()
Gets the corporations gazetteer. The gazetteer is a set of known corporations, in lowercase. The list may include ambiguous items. If the gazetteers have not been loaded, they will be loaded first.

Returns:
A case-insensitive set of corporation names.

getCorporationsCS

public static java.util.Set<java.lang.String> getCorporationsCS()
Gets the case-sensitive corporations gazetteer. The gazetteer is a set of known corporations, case preserved. The list may include ambiguous items. If the gazetteers have not been loaded, they will be loaded first.

Returns:
A case-sensitive set of corporation names.

getOrgClosings

public static java.util.Set<java.lang.String> getOrgClosings()
Gets the organization identifier suffixes gazetteer. The gazetteer is a set of organization identifier suffixes, such as "inc", "llc", and "org", in lowercase. The list may include ambiguous items. If the gazetteers have not been loaded, they will be loaded first.

Returns:
A case-insensitive set of organization identifier suffixes, such as "inc", "llc", and "org".
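
As an illustration of how a suffix gazetteer like this one might be used, the sketch below tests whether the last token of a mention is an organization closing. Everything here is hypothetical: the class and method names are invented, the set is a stand-in built from the three examples given above, and stripping trailing periods and commas before the lookup is an assumption about how tokens would be normalized, not documented behavior.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class OrgClosingSketch {
    // Hypothetical stand-in for Gazetteers.getOrgClosings().
    static final Set<String> ORG_CLOSINGS =
            new HashSet<>(Arrays.asList("inc", "llc", "org"));

    // Returns true if the final whitespace-separated token of the mention,
    // lowercased and with trailing periods or commas removed, is in the set.
    static boolean endsWithOrgClosing(String mention) {
        String[] tokens = mention.trim().split("\\s+");
        String last = tokens[tokens.length - 1];
        last = last.replaceAll("[.,]+$", "").toLowerCase(Locale.ROOT);
        return ORG_CLOSINGS.contains(last);
    }

    public static void main(String[] args) {
        System.out.println(endsWithOrgClosing("Acme Inc."));     // true
        System.out.println(endsWithOrgClosing("Acme Widgets"));  // false
    }
}
```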

getSportTeams

public static java.util.Set<java.lang.String> getSportTeams()
Gets the sports teams gazetteer. The gazetteer is a set of known sports teams, in lowercase. The list may include ambiguous items. If the gazetteers have not been loaded, they will be loaded first.

Returns:
A case-insensitive set of sports team names.

getSportTeamsCS

public static java.util.Set<java.lang.String> getSportTeamsCS()
Gets the case-sensitive sports teams gazetteer. The gazetteer is a set of known sports teams, case preserved. The list may include ambiguous items. If the gazetteers have not been loaded, they will be loaded first.

Returns:
A case-sensitive set of sports team names.

getUniversities

public static java.util.Set<java.lang.String> getUniversities()
Gets the universities gazetteer. The gazetteer is a set of known universities, in lowercase. The list may include ambiguous items. If the gazetteers have not been loaded, they will be loaded first.

Returns:
A case-insensitive set of university names.

getUniversitiesCS

public static java.util.Set<java.lang.String> getUniversitiesCS()
Gets the case-sensitive universities gazetteer. The gazetteer is a set of known universities, case preserved. The list may include ambiguous items. If the gazetteers have not been loaded, they will be loaded first.

Returns:
A case-sensitive set of university names.

getStopWords

public static java.util.Set<java.lang.String> getStopWords()
Gets the stop words gazetteer. The gazetteer is a set of stop words such as "and" and "of", in lowercase. The list may include ambiguous items. If the gazetteers have not been loaded, they will be loaded first.

Returns:
A case-insensitive set of stop words such as "and" and "of".

getPrepositions

public static java.util.Set<java.lang.String> getPrepositions()
Gets the prepositions gazetteer. The gazetteer is a set of prepositions, in lowercase. The list may include ambiguous items. If the gazetteers have not been loaded, they will be loaded first.

Returns:
A case-insensitive set of prepositions.

getPronouns

public static java.util.Set<java.lang.String> getPronouns()
Gets the pronouns gazetteer. The gazetteer is a set of pronouns, in lowercase. The list may include ambiguous items. If the gazetteers have not been loaded, they will be loaded first.

Returns:
A case-insensitive set of pronouns.

getSingularNouns

public static java.util.Set<java.lang.String> getSingularNouns()
Gets the singular nouns gazetteer. The gazetteer is a set of singular nouns, in lowercase. The list may include ambiguous items. If the gazetteers have not been loaded, they will be loaded first.

Returns:
A case-insensitive set of singular nouns.

getPluralNouns

public static java.util.Set<java.lang.String> getPluralNouns()
Gets the plural nouns gazetteer. The gazetteer is a set of plural nouns, in lowercase. The list may include ambiguous items. If the gazetteers have not been loaded, they will be loaded first.

Returns:
A case-insensitive set of plural nouns.

getSayWords

public static java.util.Set<java.lang.String> getSayWords()
Gets the say words gazetteer. The gazetteer is a set of words synonymous with "say", in lowercase. The list may include ambiguous items. If the gazetteers have not been loaded, they will be loaded first.

Returns:
A case-insensitive set of words synonymous with "say".

getLowercaseWords

public static java.util.Set<java.lang.String> getLowercaseWords()
Gets the lowercase words gazetteer. The gazetteer is a set of words that begin with a lowercase letter in a dictionary (probably indicating that they are not proper names), in lowercase. The list may include ambiguous items. If the gazetteers have not been loaded, they will be loaded first.

Returns:
A case-insensitive set of words that begin with a lowercase letter in a dictionary (probably indicating that they are not proper names).

getInflectedWords

public static java.util.Set<java.lang.String> getInflectedWords()
Gets the inflected words gazetteer. The gazetteer is a set of all words in a dictionary, including inflected forms (past forms, plural forms, etc.), in lowercase. The list may include ambiguous items. If the gazetteers have not been loaded, they will be loaded first.

Returns:
A case-insensitive set of all words in a dictionary, including inflected forms (past forms, plural forms, etc.).

getCommonWords5

public static java.util.Set<java.lang.String> getCommonWords5()
Gets the gazetteer of common words appearing more than five times. The gazetteer is a set of words appearing more than five times in the ACE 2004 Corpus. May include words from the test set. The list may include ambiguous items. If the gazetteers have not been loaded, they will be loaded first.

Returns:
A case-insensitive set of words appearing more than five times in the ACE 2004 Corpus. May include words from the test set.

getCommonWords

public static java.util.Set<java.lang.String> getCommonWords()
Gets the common words gazetteer. The gazetteer is a set of words appearing frequently in the ACE 2004 Corpus. May include words from the test set. The list may include ambiguous items. If the gazetteers have not been loaded, they will be loaded first.

Returns:
A case-insensitive set of words appearing frequently in the ACE 2004 Corpus. May include words from the test set.

initGazetteers

private static void initGazetteers()
Loads the gazetteers from files in the gazetteers directory located in a directory on the classpath.


loadLinesAsSet

protected static java.util.Set<java.lang.String> loadLinesAsSet(java.lang.String filename,
                                                                boolean lower)
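
This page gives only the signature of loadLinesAsSet, so the sketch below is a guess at its likely behavior, not the library's implementation. It assumes one item per line, skips blank lines, and lowercases each item when lower is true. To stay self-contained it reads from a java.io.Reader rather than a filename; in the real class the file would presumably be resolved on the classpath, for example through ClassLoader.getResourceAsStream, but that is an assumption as well.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class LoadLinesSketch {
    // Hypothetical variant of loadLinesAsSet that takes a Reader instead of
    // a filename. Each non-blank line becomes one item in the returned set.
    static Set<String> loadLinesAsSet(Reader source, boolean lower) {
        Set<String> items = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String item = line.trim();
                if (item.isEmpty()) {
                    continue; // assumed: blank lines carry no item
                }
                items.add(lower ? item.toLowerCase(Locale.ROOT) : item);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return items;
    }

    public static void main(String[] args) {
        String data = "Israel\nFrance\n\n";
        // lower = true yields the lowercase form used by non-CS gazetteers.
        System.out.println(loadLinesAsSet(new StringReader(data), true).contains("israel"));  // true
        // lower = false preserves case, as the CS gazetteers do.
        System.out.println(loadLinesAsSet(new StringReader(data), false).contains("Israel")); // true
    }
}
```

Under these assumptions, one source file could populate both a lowercase gazetteer and its CS counterpart by calling the loader twice with different values of lower.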