Data for Entity and Relation Recognition Experiments

Each file consists of several blocks, where each block describes the entities and relations in one sentence.

The format of each block is:

a sentence in table format
empty line
relation descriptors (may be empty or more than one line)
empty line

In the table of a sentence, each row represents an element (a single word or consecutive words) in the sentence. Meaningful columns include:

col-2: Entity class label (B-Unknown, B-Peop, or B-Loc, meaning other_ent, person, and location, respectively)
col-3: Element order number
col-5: Part-of-speech tags
col-6: Words

Other columns can simply be ignored.
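
As an illustration of the column layout above, here is a minimal Python sketch that pulls the four meaningful columns out of one table row. It assumes that columns are separated by whitespace, that "col-2" means the second column counting from 1, and that no field contains whitespace itself; none of this is stated in this description, so check it against the actual files. The names Element, parse_row, and LABEL_NAMES are hypothetical, and the sample row (including the values in the ignored columns) is made up.

```python
from typing import NamedTuple

# Mapping of entity class labels to their meanings, as listed above.
LABEL_NAMES = {"B-Unknown": "other_ent", "B-Peop": "person", "B-Loc": "location"}


class Element(NamedTuple):
    label: str   # col-2: entity class label
    index: int   # col-3: element order number
    pos: str     # col-5: part-of-speech tags
    words: str   # col-6: words


def parse_row(line: str) -> Element:
    """Parse one row of a sentence table (assumes whitespace-separated columns)."""
    cols = line.split()
    # Columns are numbered from 1 in the description, so col-2 is cols[1], etc.
    return Element(label=cols[1], index=int(cols[2]), pos=cols[4], words=cols[5])


# Made-up row for illustration; only columns 2, 3, 5, and 6 are meaningful.
row = parse_row("0 B-Loc 3 O NNP Chicago O O O")
print(row, LABEL_NAMES[row.label])
```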

A relation descriptor has three fields.

1st field : the element number of the first argument.
2nd field : the element number of the second argument.
3rd field : the name of the relation (e.g. kill or birthplace).
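
The block format and the relation descriptors described above can be read with a small state machine. The following Python sketch is one possible reading, not a reference implementation. It assumes whitespace-separated fields, and it assumes that a block without relations shows up as a table followed by two empty lines. The function name parse_blocks and the sample text are hypothetical.

```python
def parse_blocks(lines):
    """Yield (rows, relations) for each block.

    rows is a list of column lists (one per table row);
    relations is a list of (first_arg, second_arg, relation_name) tuples.
    """
    rows, relations = [], []
    in_table = True  # True while reading the sentence table
    for raw in lines:
        line = raw.strip()
        if in_table:
            if line:
                rows.append(line.split())
            elif rows:
                in_table = False  # the empty line after the table
        else:
            if line:
                first, second, name = line.split()
                relations.append((int(first), int(second), name))
            else:
                # The empty line after the relation descriptors ends the block.
                yield rows, relations
                rows, relations = [], []
                in_table = True
    if rows:
        # The file ended without a final empty line.
        yield rows, relations


# Made-up sample: one sentence with a kill relation, one with no relation.
sample = """\
0 B-Peop 0 O NNP Alice O O O
0 B-Unknown 1 O VBD killed O O O
0 B-Peop 2 O NNP Bob O O O

0 2 kill

1 B-Unknown 0 O PRP It O O O
1 B-Unknown 1 O VBD rained O O O


"""

blocks = list(parse_blocks(sample.splitlines()))
for rows, relations in blocks:
    print(len(rows), "elements;", relations)
```

The argument numbers in each relation tuple refer to the element order numbers (col-3) of the rows in the same block.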

The following corpora were used in D. Roth and W. Yih, "Probabilistic Reasoning for Entity & Relation Recognition", COLING'02, Aug. 2002:

kill.corp, m-kill.corp (updated by Mark Sammons) : sentences that have the kill relation.
birthplace.corp, m-birthplace.corp (updated by Mark Sammons) : sentences that have the born_in relation.
negative.corp : sentences that have NO relation.
all.corp : the three kinds of files above combined into one.

The following corpus was used in D. Roth and W. Yih, "A Linear Programming Formulation for Global Inference in Natural Language Tasks", CoNLL'04, May 2004:

conll04.corp : an updated corpus that has more relations, including located in, work for, organization based in, live in, and kill.