Data for Entity and Relation Recognition Experiments
Each file consists of several blocks, where each block describes the entities and
relations in one sentence.
The format of each block is:
- a sentence in table format
- empty line
- relation descriptors (zero or more lines, one descriptor per line)
- empty line
In the table of a sentence, each row represents an element (a single word or consecutive words) in the
sentence. Meaningful columns include:
- col-2: Entity class label (B-Unknown, B-Peop, or B-Loc, which mean other_ent, person, and location, respectively)
- col-3: Element order number
- col-5: Part-of-speech tags
- col-6: Words
Other columns can simply be ignored.
A relation descriptor has three fields.
- 1st field : the element number of the first argument.
- 2nd field : the element number of the second argument.
- 3rd field : the name of the relation (e.g. kill or birthplace).
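The block format described above can be read with a small state machine. The following Python code is only a minimal sketch, not an official reader for these files. It assumes that columns are separated by whitespace and that "col-2", "col-3", etc. count columns from 1. The names Element, Sentence, and parse_corpus are hypothetical, and the sample text is invented for illustration (including the "O" filler values in the ignored columns).

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    label: str   # col-2: entity class label
    index: int   # col-3: element order number
    pos: str     # col-5: part-of-speech tag(s)
    words: str   # col-6: the word(s) of the element


@dataclass
class Sentence:
    elements: list = field(default_factory=list)
    relations: list = field(default_factory=list)  # (arg1, arg2, name)


def parse_corpus(text):
    """Parse blocks of: table rows, empty line, relation lines, empty line."""
    sentences = []
    current = Sentence()
    in_relations = False  # False: reading the table; True: reading relations
    for line in text.splitlines():
        fields = line.split()  # assumption: whitespace-separated columns
        if not fields:
            if in_relations:
                # Second empty line of the block: the block is complete.
                sentences.append(current)
                current = Sentence()
                in_relations = False
            elif current.elements:
                # First empty line: the table is done, relations follow.
                in_relations = True
            continue
        if in_relations:
            name = " ".join(fields[2:])
            current.relations.append((int(fields[0]), int(fields[1]), name))
        else:
            # 1-based col-2, col-3, col-5, col-6 are 0-based 1, 2, 4, 5.
            current.elements.append(
                Element(fields[1], int(fields[2]), fields[4], fields[5])
            )
    if current.elements:  # file ended without the closing empty line
        sentences.append(current)
    return sentences


# Invented sample: one block with a relation, one block with none.
SAMPLE = """\
0 B-Peop 0 O NNP John O O O
0 O 1 O VBD shot O O O
0 B-Peop 2 O NNP Richard O O O

0 2 kill

1 B-Loc 0 O NNP Paris O O O
1 O 1 O VBZ sleeps O O O


"""

sentences = parse_corpus(SAMPLE)
for s in sentences:
    by_index = {e.index: e for e in s.elements}
    for arg1, arg2, name in s.relations:
        print(name, by_index[arg1].words, by_index[arg2].words)
```

Looking up the arguments through the element order number (col-3), rather than through the row position, keeps the sketch independent of whether a given file numbers its elements from 0 or from 1.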
The following corpora were used in D. Roth and W. Yih,
"Probabilistic Reasoning for Entity & Relation Recognition",
COLING'02, Aug. 2002:
kill.corp, m-kill.corp (updated by Mark Sammons) : sentences that have the kill relation.
birthplace.corp, m-birthplace.corp (updated by Mark Sammons) : sentences that have the born_in relation.
negative.corp : sentences that have NO relation.
all.corp : the three files above combined into one.
The following corpus was used in D. Roth and W. Yih,
"A Linear Programming Formulation for Global Inference in Natural Language Tasks",
CoNLL'04, May 2004:
conll04.corp : an updated corpus that has more relations, including located in, work for, organization based in, live in, and kill.