CuratorDemo.java

# You can download CuratorDemo.java. The Curator is a service that retrieves and performs annotations on text. Visit the status page to see whether the Curator is currently running and which annotations are available. You can also query the Curator using the `describeAnnotations()` method. The Curator web demo acts as a reference Curator client. You can use the demo to explore the annotations and experiment with the Curator. The web demo is also useful for testing text that causes problems in your own code: verifying that the same problem occurs on the web demo helps us find bugs more quickly. Before you start, you will need the following jars in your classpath: `libthrift.jar`; `curator-interfaces.jar`; the slf4j API (`slf4j-api-1.5.8.jar`); the slf4j framework wrapper (`slf4j-simple-1.5.8.jar`); and commons-lang 2.5 (`commons-lang-2.5.jar`). You can find all of these jars by downloading the Curator package and running `./bootstrap.sh` in the download directory. This populates the `lib` folder, where the above jars can be found.	import java.util.List; import java.util.Map; import java.util.ArrayList; import org.apache.thrift.TException; import org.apache.thrift.protocol.TBinaryProtocol; import org.apache.thrift.protocol.TProtocol; import org.apache.thrift.transport.TFramedTransport; import org.apache.thrift.transport.TSocket; import org.apache.thrift.transport.TTransport; import edu.illinois.cs.cogcomp.thrift.curator.Curator; import edu.illinois.cs.cogcomp.thrift.curator.Record; import edu.illinois.cs.cogcomp.thrift.base.ServiceUnavailableException; import edu.illinois.cs.cogcomp.thrift.base.AnnotationFailedException; import edu.illinois.cs.cogcomp.thrift.base.Span; import edu.illinois.cs.cogcomp.thrift.base.Labeling; import edu.illinois.cs.cogcomp.thrift.base.Clustering; import edu.illinois.cs.cogcomp.thrift.base.Forest; import edu.illinois.cs.cogcomp.thrift.base.Tree; import edu.illinois.cs.cogcomp.thrift.base.Node;
# Client Code Walkthrough This example will walk you through creating a client to the Curator and performing various annotations.	public class CuratorDemo { public static void main(String[] args) {
# Let's first define where the Curator is running. (Needless to say, `hostname` and `port` need to be updated to a server and port running Curator).	String hostname = "myhost.cs.uiuc.edu"; int port = 9090;
# This part is Thrift boilerplate needed to create a client. We first get a transport and then wrap it in a framed transport because the server is non-blocking. Finally, we define a protocol that uses the transport.	TTransport transport = new TSocket(hostname, port); transport = new TFramedTransport(transport); TProtocol protocol = new TBinaryProtocol(transport);
# Create the client from the protocol. Every Thrift service has an inner Client class.	Curator.Client client = new Curator.Client(protocol);
# Here is the text we will use for this demo. Notice that it is untokenized.	String text = "With less than 11 weeks to go to the final round of climate " + "talks in Copenhagen, the UN chief, Ban Ki-Moon did not bother to hide " + "his frustration in his opening remarks. \"The world's glaciers are " + "now melting faster than human progress to protect them -- or us,\" he " + "said. Others shared his gloom. \"Today we are on a path to failure," + "\" said France's Nicolas Sarkozy.";
# Now we are going to make the call to the Curator. We must first open the transport and then make the call to the Curator.	Record record = null; try { transport.open();
# The Curator has multiple methods (you can view all of them on the Curator API page). We will focus on the `provide` method, which takes three arguments: `view_name`, a String naming the annotation to produce (e.g., `pos`, `ner`, `srl`), where the available view names are listed by `describeAnnotations()` or on the status page; `text`, also a String, the text you want to process, which can contain multiple sentences and new lines; and a `boolean` indicating whether the annotation should be reprocessed regardless of the cache.	record = client.provide("ner", text, false);
# Don't forget to close the transport once you have finished using the Curator.	transport.close(); } catch (ServiceUnavailableException e) { if (transport.isOpen()) transport.close(); e.printStackTrace(); } catch (AnnotationFailedException e) { if (transport.isOpen()) transport.close(); e.printStackTrace(); } catch (TException e) { if (transport.isOpen()) transport.close(); e.printStackTrace(); }
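# The three catch blocks above each repeat the same cleanup. One alternative is to close the transport in a single `finally` block, which runs whether or not the call succeeds. The following is a minimal sketch of that pattern, not part of CuratorDemo.java: `FakeTransport`, `Call` and `withTransport` are hypothetical names, and `FakeTransport` only imitates the `open()`, `close()` and `isOpen()` methods used in this walkthrough so that the sketch compiles without the Thrift jars.

```java
// Hypothetical stand-in for a Thrift transport, so the sketch needs no jars.
class FakeTransport {
    private boolean open = false;
    void open() { open = true; }
    void close() { open = false; }
    boolean isOpen() { return open; }
}

class TransportCleanupSketch {
    /** Hypothetical callback standing in for a call such as client.provide(...). */
    interface Call<T> {
        T run() throws Exception;
    }

    /** Opens the transport, runs the call, and always closes the transport afterwards. */
    static <T> T withTransport(FakeTransport transport, Call<T> call) {
        try {
            transport.open();
            return call.run();
        } catch (Exception e) {
            // Same handling as the walkthrough: report the failure and carry on.
            e.printStackTrace();
            return null;
        } finally {
            // Runs on success and on failure, so the cleanup is written once.
            if (transport.isOpen()) {
                transport.close();
            }
        }
    }

    public static void main(String[] args) {
        FakeTransport transport = new FakeTransport();
        String result = withTransport(transport, () -> "annotated");
        System.out.println(result + ", open=" + transport.isOpen());
    }
}
```

With the real client, the body of the callback would be the `client.provide(...)` call, and a `null` result would signal that the annotation could not be obtained.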
# The `record` object now contains the annotations that were requested and cached. Records contain the raw text that was processed, an identifier, and multiple maps which hold the annotations. Each annotation is accessed by a specific key. Labeling annotations are stored in the `labelViews` field, accessed with `record.getLabelViews()`, which returns a `Map<String, Labeling>` where the key is the name of the annotation (e.g., `pos`, `ner`, `quantities`). Clustering annotations are stored in `clusterViews`; `record.getClusterViews()` returns a `Map<String, Clustering>`. Forest annotations are stored in `parseViews`; `record.getParseViews()` returns a `Map<String, Forest>`. The semantics of each data structure depend on the producing annotator. The README files for each server should contain a description of the semantics.	String rawText = record.getRawText(); String identifier = record.getIdentifier(); Map<String, Labeling> labelViews = record.getLabelViews(); Map<String, Clustering> clusterViews = record.getClusterViews(); Map<String, Forest> parseViews = record.getParseViews();
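# Because the views are plain Java maps, `get` returns `null` for a view name that is not present, and dereferencing that result throws a `NullPointerException`. The sketch below shows one way to guard the lookup. It is an illustration only: `requireView` is a hypothetical helper, and a `Map<String, String>` stands in for the record's `Map<String, Labeling>` so that the example runs without the Curator jars.

```java
import java.util.HashMap;
import java.util.Map;

class ViewLookupSketch {
    /** Returns the named view, or fails with a message listing the views that exist. */
    static <V> V requireView(Map<String, V> views, String name) {
        V view = (views == null) ? null : views.get(name);
        if (view == null) {
            throw new IllegalStateException("No '" + name + "' view; available: "
                    + (views == null ? "[]" : views.keySet().toString()));
        }
        return view;
    }

    public static void main(String[] args) {
        // Stand-in for record.getLabelViews(); the values are placeholders.
        Map<String, String> labelViews = new HashMap<String, String>();
        labelViews.put("ner", "named entity labeling");

        System.out.println(requireView(labelViews, "ner"));
        try {
            requireView(labelViews, "pos");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```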
# We can now iterate over the named entity annotations we requested printing out the detected entities and their label. Spans contain multiple fields but the most important fields are the `start` and `ending` fields which refer to the position the span covers in the record's `rawText` field. See the Span definition for more information. The output of this loop should be: LOC : Copenhagen ORG : UN PER : Ban Ki-Moon LOC : France PER : Nicolas Sarkozy	for (Span span : labelViews.get("ner").getLabels()) { System.out.println(span.getLabel() + " : " + text.substring(span.getStart(), span.getEnding())); }
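# The loop above passes `start` and `ending` straight to `String.substring`, which treats its second argument as exclusive, so `ending` is used as the offset just past the last character of the span. The following self-contained sketch illustrates that convention. `SimpleSpan` and `spanOf` are hypothetical stand-ins that model only the label and the two offsets; they are not the Thrift `Span` class.

```java
class SpanOffsetSketch {
    /** Hypothetical stand-in for a span: a label plus character offsets. */
    static final class SimpleSpan {
        final String label;
        final int start;   // offset of the first character
        final int ending;  // offset just past the last character

        SimpleSpan(String label, int start, int ending) {
            this.label = label;
            this.start = start;
            this.ending = ending;
        }
    }

    /** Builds a span covering the first occurrence of surface in rawText. */
    static SimpleSpan spanOf(String rawText, String label, String surface) {
        int start = rawText.indexOf(surface);
        if (start < 0) {
            throw new IllegalArgumentException("'" + surface + "' not found in text");
        }
        return new SimpleSpan(label, start, start + surface.length());
    }

    /** Formats a span the same way as the loop in the walkthrough. */
    static String render(String rawText, SimpleSpan span) {
        return span.label + " : " + rawText.substring(span.start, span.ending);
    }

    public static void main(String[] args) {
        String rawText = "Talks in Copenhagen";
        SimpleSpan span = spanOf(rawText, "LOC", "Copenhagen");
        System.out.println(span.start + "-" + span.ending); // prints 9-19
        System.out.println(render(rawText, span));          // prints LOC : Copenhagen
    }
}
```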
# Using pretokenized text Often corpora come pre-tokenized. Running a tokenizer over pre-tokenized text can lead to really bad tokenization. The Curator has additional methods that are designed for pretokenized text. These methods are prefixed with `ws` to signify whitespace methods. When a `ws` method is called the Curator will run a whitespace tokenizer over the text to obtain the tokens.
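# To make "whitespace tokenizer" concrete, the sketch below splits a pre-tokenized sentence on whitespace and records the character offsets of each token. It illustrates the idea only and is not the Curator's own tokenizer; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class WhitespaceTokenizerSketch {
    /** Returns {start, end} offsets (end exclusive) of each whitespace-delimited token. */
    static List<int[]> tokenOffsets(String sentence) {
        List<int[]> offsets = new ArrayList<int[]>();
        int i = 0;
        int n = sentence.length();
        while (i < n) {
            // Skip any run of whitespace before the next token.
            while (i < n && Character.isWhitespace(sentence.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume the token up to the next whitespace character.
            while (i < n && !Character.isWhitespace(sentence.charAt(i))) {
                i++;
            }
            if (i > start) {
                offsets.add(new int[] {start, i});
            }
        }
        return offsets;
    }

    /** Returns the token strings for the offsets computed above. */
    static List<String> tokens(String sentence) {
        List<String> result = new ArrayList<String>();
        for (int[] span : tokenOffsets(sentence)) {
            result.add(sentence.substring(span[0], span[1]));
        }
        return result;
    }

    public static void main(String[] args) {
        // Punctuation and clitics are already separate tokens in pre-tokenized text.
        System.out.println(tokens("Potash 's shares ."));
    }
}
```

Because the input is already tokenized, punctuation such as `.` and clitics such as `'s` come out as their own tokens without any further splitting.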
# `wsprovide` works like `provide`, except that rather than taking a String representing the text it takes a `List<String>` where each String is a whitespace-tokenized sentence. We also concatenate the sentences in `sentences` and assign the result to `text`, for convenience in printing later.	List<String> sentences = new ArrayList<String>(); sentences.add("BHP Billiton , based in Australia , is offering $ 130 for each " + "share of Potash , a 16 percent premium to its closing price " + "Monday ."); sentences.add("But Potash 's shares , which trade mainly on the New York Stock " + "Exchange , surged 28 percent to $ 143.17 ."); text = sentences.get(0) + " " + sentences.get(1);
# Now that we have the data structure to pass to the Curator, we open the transport and then make the call as before. This time we request the chunker (shallow parse) annotation. Note that the second argument to `wsprovide` is a List, not a String.	try { transport.open(); record = client.wsprovide("chunk", sentences, false); transport.close(); } catch (ServiceUnavailableException e) { if (transport.isOpen()) transport.close(); e.printStackTrace(); } catch (AnnotationFailedException e) { if (transport.isOpen()) transport.close(); e.printStackTrace(); } catch (TException e) { if (transport.isOpen()) transport.close(); e.printStackTrace(); }
# Note that the record returned has exactly the same structure as before; however, the field `whitespaced` will now be `true`. Iterating over the annotations should produce: NP : BHP Billiton VP : based PP : in NP : Australia VP : is offering NP : $ 130 PP : for NP : each share PP : of NP : Potash NP : a 16 percent premium PP : to NP : its closing price NP : Monday NP : Potash NP : 's shares NP : which VP : trade ADVP : mainly PP : on NP : the New York Stock Exchange VP : surged NP : 28 percent PP : to NP : $ 143.17	boolean whitespaced = record.isWhitespaced(); for (Span span : record.getLabelViews().get("chunk").getLabels()) { System.out.println(span.getLabel() + " : " + text.substring(span.getStart(), span.getEnding())); } } }