You can download CuratorDemo.java.
The Curator is a service which provides the ability to retrieve and perform
annotations on a text.
Visit the status page to see if the Curator is currently running
and what annotations are available. You can also query the Curator using the
describeAnnotations() method.
The Curator web demo acts as a reference Curator client. You can use
the demo to explore the annotations and experiment with the Curator. The web
demo is also useful for testing text that you are having problems with in
your own code, verifying that the same problem occurs on the web demo can
help us find bugs quicker.
Before you start you will need the following jars in your classpath:
libthrift.jar
curator-interfaces.jar
- sl4j api (
slf4j-api-1.5.8.jar )
- sl4j framework wrapper (
slf4j-simple-1.5.8.jar )
- commons-lang 2.5 (
commons-lang-2.5.jar )
You can find all of these jars by downloading the Curator package, and running ./bootstrap.sh in the download directory. This will populate the lib folder, and the above jars can be found there.
|
import java.util.List;
import java.util.Map;
import java.util.ArrayList;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import edu.illinois.cs.cogcomp.thrift.curator.Curator;
import edu.illinois.cs.cogcomp.thrift.curator.Record;
import edu.illinois.cs.cogcomp.thrift.base.ServiceUnavailableException;
import edu.illinois.cs.cogcomp.thrift.base.AnnotationFailedException;
import edu.illinois.cs.cogcomp.thrift.base.Span;
import edu.illinois.cs.cogcomp.thrift.base.Labeling;
import edu.illinois.cs.cogcomp.thrift.base.Clustering;
import edu.illinois.cs.cogcomp.thrift.base.Forest;
import edu.illinois.cs.cogcomp.thrift.base.Tree;
import edu.illinois.cs.cogcomp.thrift.base.Node;
|
Client Code Walkthrough
This example will walk you through creating a client to the Curator and
performing various annotations.
|
public class CuratorDemo {
public static void main(String[] args) {
|
Let's first define where the Curator is running. (Needless to say, hostname and port need to be updated to a server and port running Curator).
|
String hostname = "myhost.cs.uiuc.edu";
int port = 9090;
|
This part is boilerplate code for Thrift which is needed to
create a client. We will first get a transport and then wrap it in a
framed transport because we are using a non-blocking server. Finally
define a protocol that uses the transport.
|
TTransport transport = new TSocket(hostname, port);
transport = new TFramedTransport(transport);
TProtocol protocol = new TBinaryProtocol(transport);
|
Create the client from the protocol. Every Thrift service has an inner Client class.
|
Curator.Client client = new Curator.Client(protocol);
|
Here is the text will will use for this demo. Notice how it is untokenized.
|
String text = "With less than 11 weeks to go to the final round of climate "
+ "talks in Copenhagen, the UN chief, Ban Ki-Moon did not bother to hide "
+ "his frustration in his opening remarks. \"The world's glaciers are "
+ "now melting faster than human progress to protect them -- or us,\" he "
+ "said. Others shared his gloom. \"Today we are on a path to failure,"
+ "\" said France's Nicolas Sarkozy.";
|
Now we are going to make the call to the Curator. We must first open
the transport and then make the call to the Curator.
|
Record record = null;
try {
transport.open();
|
The Curator has multiple methods (you can view all the methods on
the Curator API page). We will focus on the provide
method which takes three arguments:
view_name represented as a String. (e.g., pos , ner ,
srl etc). For a list of the available view names call
describeAnnotations() or check the status page.
text (also a String) the text you want to process. This can contain
multiple sentences and new lines.
boolean representing whether the annotation should be
reprocessed regardless of the cache.
|
record = client.provide("ner", text, false);
|
Don't forget to close the transport once you have finished using
the Curator.
|
transport.close();
} catch (ServiceUnavailableException e) {
if (transport.isOpen())
transport.close();
e.printStackTrace();
} catch (AnnotationFailedException e) {
if (transport.isOpen())
transport.close();
e.printStackTrace();
} catch (TException e) {
if (transport.isOpen())
transport.close();
e.printStackTrace();
}
|
The record object now contains the annotations request and
cached. Records contain the raw text that was processed, an
identifier and multiple maps which hold the annotations. Each
annotation is access by a specific key.
-
Labeling annotations are stored in the labelViews field,
accessed with record.getLabelViews() which returns a Map<String,
Labeling> where the key is the name of the annotation. (e.g., pos ,
ner , quantities ).
-
Clustering annotations are stored in
clusterViews . record.getClusterViews() returns a Map<String,
Clustering> .
-
Forest annotations are stored in
parseViews . record.getParseViews() returns a Map<String,
Forest> .
The semantics of each data structure depends on the producing
annotator. The README files files for each server should
contain a description of the semantics.
|
String rawText = record.getRawText();
String identifier = record.getIdentifier();
Map<String, Labeling> labelViews = record.getLabelViews();
Map<String, Clustering> clusterViews = record.getClusterViews();
Map<String, Forest> parseViews = record.getParseViews();
|
We can now iterate over the named entity annotations we requested
printing out the detected entities and their label. Spans
contain multiple fields but the most important fields are the start
and ending fields which refer to the position the span covers in
the record's rawText field. See the Span definition for more
information.
The output of this loop should be:
LOC : Copenhagen
ORG : UN
PER : Ban Ki-Moon
LOC : France
PER : Nicolas Sarkozy
|
for (Span span : labelViews.get("ner").getLabels()) {
System.out.println(span.getLabel() + " : "
+ text.substring(span.getStart(), span.getEnding()));
}
|
Using pretokenized text
Often corpora come pre-tokenized. Running a tokenizer over
pre-tokenized text can lead to really bad tokenization. The Curator
has additional methods that are designed for pretokenized text.
These methods are prefixed with ws to signify whitespace
methods. When a ws method is called the Curator will run a
whitespace tokenizer over the text to obtain the tokens.
|
|
wsprovde works liked provide except rather than taking a String
representing the text it takes a List<String> where each String is
a whitespace tokenized sentence. We also concatenate the sentences in sentences and assign it to text , for convenience in printing later.
|
List<String> sentences = new ArrayList<String>();
sentences.add("BHP Billiton , based in Australia , is offering $ 130 for each "
+ "share of Potash , a 16 percent premium to its closing price "
+ "Monday .");
sentences.add("But Potash 's shares , which trade mainly on the New York Stock "
+ "Exchange , surged 28 percent to $ 143.17 .");
text = sentences.get(0) + " " + sentences.get(1);
|
Now we have the data structure to pass to the Curator we will open a
transport and the make the call like before. This time we will
request the chunker (shallow parse) annotation. Note that the second argument to wsprovides is a List, not a String.
|
try {
transport.open();
record = client.wsprovide("chunk", sentences, false);
transport.close();
} catch (ServiceUnavailableException e) {
if (transport.isOpen())
transport.close();
e.printStackTrace();
} catch (AnnotationFailedException e) {
if (transport.isOpen())
transport.close();
e.printStackTrace();
} catch (TException e) {
if (transport.isOpen())
transport.close();
e.printStackTrace();
}
|
Note that the record returned has exactly the same structure as
before, however the field whitespaced will now be true .
Iterating over the annotations should produce:
NP : BHP Billiton
VP : based
PP : in
NP : Australia
VP : is offering
NP : $ 130
PP : for
NP : each share
PP : of
NP : Potash
NP : a 16 percent premium
PP : to
NP : its closing price
NP : Monday
NP : Potash
NP : 's shares
NP : which
VP : trade
AD :VP mainly
PP : on
NP : the New York Stock Exchange
VP : surged
NP : 28 percent
PP : to
NP : $ 143.17
|
boolean whitespaced = record.isWhitespaced();
for (Span span : record.getLabelViews().get("chunk").getLabels()) {
System.out.println(span.getLabel() + " : "
+ text.substring(span.getStart(), span.getEnding()));
}
}
}
|