LBJava Runtime Reference
After using
LBJava to
design a classifier, the next step is to write a pure Java application that
uses it. This reference first introduces the methods every automatically
generated LBJava classifier makes available within pure Java code. These methods
comprise a simple interface for predicting, online learning, and testing with
a classifier. Once comfortable with the basics, we'll move on to more
advanced classifier usage with BatchTrainer, a comprehensive solution for
efficient training.
This reference assumes the reader is already familiar with how to write a
classifier in LBJava. For a gentle introduction, read the
tutorial on
machine learning and natural language parsing, and take a look at the
provided example code there. We will use the TextClassifier
developed in that code as a running example in this reference.
Getting Started
We'll start by assuming that all learning takes place when the LBJava compiler runs. It is also possible to learn online, i.e., while the application is running, but we'll defer that discussion to the Online Learning section below. To gain access to the learned classifier within your Java program, simply instantiate an object of the classifier's generated class, which has the same name as the classifier.
TextClassifier classifier = new TextClassifier();
The classifier is now ready to make predictions on example objects.
TextClassifier
was defined to take as input an array of strings
containing the words in a document and to make a discrete prediction as
output. Thus, if we have an array of strings available, we can retrieve
TextClassifier's prediction like this:
String[] documentInArray = ...
String prediction = classifier.discreteValue(documentInArray);
The prediction made by the classifier will be one of the string labels it observed during training. And that's it! The programmer is now free to use the classifier's predictions however s/he chooses.
There's one important technical point to be aware of here. The instance
of class TextClassifier
that we just created above does not
actually contain the classifier that LBJava learned for us. It is merely a
"clone" object that contains internally a reference to the real
classifier. Thus, if our Java application creates instances of this class in
different places and performs any operation that modifies the behavior of the
classifier (like online learning), all instances will appear to be affected by
the changes.
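To make this concrete, here is a minimal sketch of that shared state, assuming the tutorial's setup in which a document is a String[] with its label in the first element (the label and words below are hypothetical). Learning through one instance is immediately visible through any other:

TextClassifier a = new TextClassifier();
TextClassifier b = new TextClassifier();

// Hypothetical labeled document; element 0 holds the label, as in the tutorial.
String[] document = { "someLabel", "some", "words" };
a.learn((Object) document);  // online learning through instance a ...
                             // (the cast is explained in the Online Learning section below)

// ... affects instance b too, since both refer to the same underlying classifier.
String prediction = b.discreteValue(document);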
Prediction
We've already seen how to get the prediction from a discrete valued classifier. This technique works no matter how the classifier was defined, be it hard-coded, learned, or what have you. When the classifier is learned, it can go further than merely providing the prediction value it likes best. In addition, it can provide a score for each of the prediction values it chose amongst, thereby giving an indication of how confident the classifier is in its prediction. The prediction with the highest score is the one selected.
Scores are returned by a classifier in a ScoreSet object, obtained by calling the scores(Object) method and passing in the same example object that you would have passed to discreteValue(Object). Once you have a ScoreSet, you can get the score for any particular prediction value using the get(String) method, which returns a double. Alternatively, you can retrieve all scores in an array and iterate over them, like this:
ScoreSet scores = classifier.scores(documentInArray);
Score[] scoresArray = scores.toArray();
for (Score score : scoresArray)
    System.out.println("prediction: " + score.value + ", score: " + score.score);
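If we already know which prediction value we're interested in, we can query its score directly with get(String). The label below is purely hypothetical; it must be one of the labels the classifier observed during training:

ScoreSet scores = classifier.scores(documentInArray);
// "sports" is a hypothetical label; substitute one seen in your training data.
double sportsScore = scores.get("sports");
System.out.println("score for \"sports\": " + sportsScore);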
Finally, LBJava also lets you define real-valued classifiers, which return doubles in the Java application. If you have such a classifier, you can retrieve its prediction on an example object by calling the realValue(Object) method:
double prediction = realClassifier.realValue(someExampleObject);
Online Learning
A classifier generated by the LBJava compiler can also continue learning from
labeled examples in the Java application. Since TextClassifier takes an array of Strings as input, we merely have to get our hands on such an array, stick its label in the first element of the array (since that's how TextLabel was defined), and pass it to the classifier's learn(Object) method.
LBJava also provides a learning method for passing multiple example objects to the classifier at once. Its signature is learn(Object[]). This way, no matter what type your example objects actually have, they can all be passed to the classifier simultaneously in an array. In the case of TextClassifier, however, this causes a small technical difficulty. An array of Strings is intended to represent a single example for our classifier. However, when we try to call learn(Object), the compiler will interpret the argument as an array of String example objects and dispatch the call to learn(Object[]) instead. We must perform a cast at the call site to ensure the correct behavior:
String[] documentInArray = ...
documentInArray[0] = label;
classifier.learn((Object) documentInArray);
To be sure, the cast in the last line of code above is only necessary when
the input type of our classifier is an array, as it is for TextClassifier. If our classifier were defined to take a single String as input (for example), the cast above wouldn't be necessary.
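As a sketch of the batch variant discussed above, suppose we have a handful of labeled documents on hand (the labels and words below are hypothetical). Because each document is itself a String[], the outer array already has type Object[], so no cast is needed:

// Each inner array is one labeled document, with the label in element 0.
String[][] labeledDocuments = {
    { "label1", "some", "words" },
    { "label2", "other", "words" }
};
// A String[][] is an Object[], so this call resolves to learn(Object[]).
classifier.learn(labeledDocuments);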
Now that we know how to get our classifier to learn, let's see how to make
it forget. The contents of a classifier can be completely cleared out by
calling the forget()
method. After this method is called, the
classifier returns to the state it was in before it observed any training
examples.
classifier.forget();
One reason we may wish to forget everything that a classifier has learned
is that we'd like to try new learning algorithm parameters (e.g. learning
rates, thresholds, etc.). All LBJava learning algorithms provide an inner class
named Parameters
that contains default settings for all their
parameters. Simply instantiate such an object, overwrite the parameters that
need to be updated, and call the setParameters(Parameters)
method.
Our TextClassifier uses the system's default learning algorithm: SparseNetworkLearner. It's actually a meta-algorithm that uses another learning algorithm internally. Looking at SparseNetworkLearner's Parameters class, we see that the choice of internal algorithm is its only parameter. The system's default choice here is SparseWinnow. SparseWinnow has its own Parameters class as well. Note that it has many parameters, most of which are inherited from parent classes. If we'd like to change TextClassifier's parameters, we might choose to do so like this:
SparseNetworkLearner.Parameters snlp = new SparseNetworkLearner.Parameters();
SparseWinnow.Parameters swp = new SparseWinnow.Parameters();
swp.learningRate = 1.4;
swp.beta = .7;
snlp.baseLTU = new SparseWinnow(swp);
classifier.setParameters(snlp);
Testing a Discrete Classifier
When a learned classifier returns discrete values, LBJava provides the handy
TestDiscrete
class for measuring the classifier's prediction performance. This class can
be used either as a stand-alone program or as a library for use inside a Java
application. In either case, we'll need to provide TestDiscrete
with the following three items:
- The classifier whose performance we're measuring (e.g. TextClassifier).
- An oracle classifier that knows the true labels (e.g. TextLabel).
- A parser (i.e., any class implementing the Parser interface) that returns objects of our classifiers' input type.
If we'd like to use TestDiscrete
on the command line, the
parser must provide a constructor that takes a single String
argument as input. TextClassifier
uses the
WordsInDocumentByDirectory
parser, which meets this requirement, so we can test our classifier on the
command line like this:
$ java LBJ2.classify.TestDiscrete \
      edu.illinois.cs.cogcomp.lbj.tutorial.TextClassifier \
      edu.illinois.cs.cogcomp.lbj.tutorial.TextLabel \
      LBJ2.nlp.WordsInDocumentByDirectory \
      "path/to/testing/data"
Alternatively, we can call TestDiscrete
from within our Java
application. This comes in handy if the interface to our parser isn't so
simple, or when we'd like to do further processing with the performance
numbers themselves. The simplest way to do so is to pass as input instances
of our classifiers and our parser, like this:
TextLabel oracle = new TextLabel();
Parser parser = new WordsInDocumentByDirectory("path/to/testing/data");
TestDiscrete tester = TestDiscrete.testDiscrete(classifier, oracle, parser);
tester.printPerformance(System.out);
The above Java code does exactly the same thing as the command line above.
We can also exert more fine-grained control over the computed statistics.
Starting from a new instance of TestDiscrete
, we can call
reportPrediction(String,String)
every time we acquire both a
prediction value and a label. Then, at any time, we can either call the
printPerformance(PrintStream)
method to produce the standard
output in table form or any of the methods whose names start with
get
to retrieve individual statistics. The example code below
retrieves the overall precision, recall, F1, and accuracy
measures in an array.
TestDiscrete tester = new TestDiscrete();
...
tester.reportPrediction(classifier.discreteValue(documentInArray),
                        oracle.discreteValue(documentInArray));
...
double[] performance = tester.getOverallStats();
System.out.println("Overall Accuracy: " + performance[3]);
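Assuming the array is ordered as listed above (precision, recall, F1, accuracy), which is consistent with index 3 holding accuracy in the example, we can print all four overall statistics like this:

// Assumes getOverallStats() orders its entries: precision, recall, F1, accuracy.
String[] names = { "Precision", "Recall", "F1", "Accuracy" };
for (int i = 0; i < names.length && i < performance.length; ++i)
    System.out.println("Overall " + names[i] + ": " + performance[i]);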
Saving Your Work
If we've done any forget()ing and learn()ing within our Java application, we'll probably be interested in saving what we learned at some point. No problem; simply call the save() method.
classifier.save();
This operation overwrites the model and lexicon files that
were originally generated by the LBJava compiler. A model file stores the
values of the learned parameters (not to be confused with the manually set
learning algorithm parameters mentioned above). A lexicon file stores
the classifier's feature index, used for quick access to the learnable
parameters when training for multiple rounds. These files are written by the
LBJava compiler and by the save()
method (though only initially; see
below) in the same directory where the TextClassifier.class
file
is written.
We may also wish to train several versions of our classifier; perhaps each
version will use different manually set parameters. But how can we do this if
each instance of our TextClassifier
class is actually a "clone",
merely pointing to the real classifier object? Easy: just use the
TextClassifier
constructor that takes model and lexicon filenames
as input:
TextClassifier c2 = new TextClassifier("myModel.lc", "myLexicon.lex");
This instance of our classifier is not a clone, simply by virtue of our
chosen constructor. It has its own completely independent learnable
parameters. Furthermore, if myModel.lc
and
myLexicon.lex
exist, they will be read from disk into
c2
. If not, then calling this constructor creates them. Either
way, we can now train our classifier however we choose and then simply call
c2.save()
to save everything into the specified files.