OpenNLP Tutorial
The Apache OpenNLP library is a machine-learning-based toolkit for processing natural language text. It includes a sentence detector, a tokenizer, a name finder, a part-of-speech (POS) tagger, a chunker, and a parser. Its API is easy to integrate into a Java program. However, parts of the official documentation are outdated.
In this tutorial, I will show you how to use Apache OpenNLP through a set of simple examples.
0. Download Jar Files and Set Up Environment
Before starting the examples, you need to download the required jar files and add them to your project build path. The jar files are located in "apache-opennlp-1.5.3-bin.zip", which can be downloaded from the Apache OpenNLP download page (accessed in March 2014).
Unzip the .zip file and copy the 4 jar files in the "lib" directory to your project. In addition, you will later need some model files, depending on what you want to do (shown in the examples below), which can be downloaded here.
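Since every example below opens a model file by name, it can help to confirm the files are in place before running anything. The following is only a sketch using the JDK (no OpenNLP classes): the class name ModelCheck and the helper missingModels are made up for illustration, and it assumes the model files sit in the working directory, as the example code in this tutorial does.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP.
final class ModelCheck {
    // Model file names opened by the examples in this tutorial.
    static final String[] MODEL_FILES = {
        "en-sent.bin", "en-token.bin", "en-ner-person.bin",
        "en-pos-maxent.bin", "en-chunker.bin", "en-parser-chunking.bin"
    };

    // Returns the names of the model files that are not present in dir.
    static List<String> missingModels(Path dir) {
        List<String> missing = new ArrayList<>();
        for (String name : MODEL_FILES) {
            if (!Files.isRegularFile(dir.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumption: models are stored in the current working directory.
        List<String> missing = missingModels(Paths.get("."));
        if (missing.isEmpty()) {
            System.out.println("All model files found.");
        } else {
            System.out.println("Missing model files: " + missing);
        }
    }
}
```

You only need the models for the examples you actually run, so a non-empty list is not necessarily a problem.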
1. Sentence Detector
The sentence detector detects sentence boundaries. Given the following paragraph:
Hi. How are you? This is Mike.
the sentence detector returns an array of strings. In this case, the array has two elements, as shown below.
Hi. How are you?
This is Mike.
Example Code:
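import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.util.InvalidFormatException;

public static void SentenceDetect() throws InvalidFormatException, IOException {
    String paragraph = "Hi. How are you? This is Mike.";

    // always start with a model; a model is learned from training data
    InputStream is = new FileInputStream("en-sent.bin");
    SentenceModel model = new SentenceModel(is);
    is.close(); // the model is fully loaded, so the stream can be closed

    SentenceDetectorME sdetector = new SentenceDetectorME(model);
    String sentences[] = sdetector.sentDetect(paragraph);

    // print every detected sentence instead of hard-coding the indices
    for (String sentence : sentences)
        System.out.println(sentence);
}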
2. Tokenizer
Tokens are usually words separated by spaces, but there are exceptions. For example, "isn't" gets split into "is" and "n't", since it is a contraction of "is not". Our sentence is separated into the following tokens:
Hi . How are you ? This is Mike .
Example Code:
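import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.InvalidFormatException;

public static void Tokenize() throws InvalidFormatException, IOException {
    // load the tokenizer model
    InputStream is = new FileInputStream("en-token.bin");
    TokenizerModel model = new TokenizerModel(is);
    is.close();

    Tokenizer tokenizer = new TokenizerME(model);
    String tokens[] = tokenizer.tokenize("Hi. How are you? This is Mike.");

    // print one token per line
    for (String a : tokens)
        System.out.println(a);
}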
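For comparison, here is what plain whitespace splitting does to the same sentence. This is a small JDK-only sketch (the class name WhitespaceSplitDemo is made up for illustration) and it does not use OpenNLP at all; it only shows that punctuation stays glued to the words, which is the case the tokens listed above avoid.

```java
import java.util.Arrays;

// Hypothetical demo class; uses only the JDK.
final class WhitespaceSplitDemo {
    // Splits text on runs of whitespace, leaving punctuation attached to words.
    static String[] splitOnWhitespace(String text) {
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String[] parts = splitOnWhitespace("Hi. How are you? This is Mike.");
        // prints [Hi., How, are, you?, This, is, Mike.]
        System.out.println(Arrays.toString(parts));
    }
}
```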
3. Name Finder
As its name suggests, the name finder finds names in text. It accepts an array of strings (the tokens of a sentence) and finds the names inside. The following example shows what it can do.
Example Code:
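import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public static void findName() throws IOException {
    // load the person-name model
    InputStream is = new FileInputStream("en-ner-person.bin");
    TokenNameFinderModel model = new TokenNameFinderModel(is);
    is.close();

    NameFinderME nameFinder = new NameFinderME(model);

    // the name finder works on a tokenized sentence
    String[] sentence = new String[] {
        "Mike", "Smith", "is", "a", "good", "person"
    };

    // each span marks a range of token indices that forms a name
    Span nameSpans[] = nameFinder.find(sentence);
    for (Span s : nameSpans)
        System.out.println(s.toString());
}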
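A span only carries token indices, so to get the name itself you have to map it back onto the token array. The sketch below shows that mapping with plain integers, under the assumption that the model marks the first two tokens as a person name; the class SpanTextDemo and the method coveredText are made up for illustration. With OpenNLP, the two integers would come from span.getStart() and span.getEnd(), where the end index is exclusive.

```java
import java.util.Arrays;

// Hypothetical demo class; uses only the JDK.
final class SpanTextDemo {
    // Joins tokens[start] .. tokens[end - 1] with spaces (end is exclusive).
    static String coveredText(String[] tokens, int start, int end) {
        return String.join(" ", Arrays.copyOfRange(tokens, start, end));
    }

    public static void main(String[] args) {
        String[] sentence = {"Mike", "Smith", "is", "a", "good", "person"};
        // Assumed span [0..2): stands in for span.getStart() and span.getEnd().
        System.out.println(coveredText(sentence, 0, 2)); // prints Mike Smith
    }
}
```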
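4. POS Tagger
The POS tagger assigns a part-of-speech tag to each token. The example below splits the input on whitespace, so punctuation stays attached to the words, and prints each token followed by its tag:
Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NNP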
Example Code:
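import java.io.File;
import java.io.IOException;
import java.io.StringReader;

import opennlp.tools.cmdline.PerformanceMonitor;
import opennlp.tools.cmdline.postag.POSModelLoader;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public static void POSTag() throws IOException {
    POSModel model = new POSModelLoader()
        .load(new File("en-pos-maxent.bin"));
    PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
    POSTaggerME tagger = new POSTaggerME(model);

    String input = "Hi. How are you? This is Mike.";
    ObjectStream<String> lineStream = new PlainTextByLineStream(
        new StringReader(input));

    perfMon.start();
    String line;
    while ((line = lineStream.read()) != null) {
        // split the line on whitespace, then tag each token
        String whitespaceTokenizerLine[] = WhitespaceTokenizer.INSTANCE
            .tokenize(line);
        String[] tags = tagger.tag(whitespaceTokenizerLine);

        // a POSSample prints as "token_TAG token_TAG ..."
        POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
        System.out.println(sample.toString());

        perfMon.incrementCounter();
    }
    perfMon.stopAndPrintFinalResult();
}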
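If you want to work with the tags rather than just print them, the "token_TAG" text can be split back into pairs. The following is a JDK-only sketch, assuming the output format shown above; the class TaggedLineDemo and its parse method are made up for illustration and are not part of OpenNLP. It splits on the last underscore so that tokens containing an underscore still parse.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; uses only the JDK.
final class TaggedLineDemo {
    // Turns "token_TAG token_TAG ..." into a list of {token, tag} pairs.
    static List<String[]> parse(String taggedLine) {
        List<String[]> pairs = new ArrayList<>();
        for (String item : taggedLine.trim().split("\\s+")) {
            int cut = item.lastIndexOf('_');
            if (cut <= 0) {
                continue; // skip items without a token before the underscore
            }
            pairs.add(new String[] {item.substring(0, cut), item.substring(cut + 1)});
        }
        return pairs;
    }

    public static void main(String[] args) {
        String line = "Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NNP";
        for (String[] pair : parse(line)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

When you already have the arrays in hand, as in the example code above, reading the tokens and tags arrays directly is simpler than parsing the printed text.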
5. Chunker
The chunker may not be a concern for some users, but it is worth mentioning here. The chunker partitions a sentence into a set of chunks, using the tokens generated by the tokenizer together with their POS tags.
I don't have a good example that shows why the chunker is useful, but you can replace the sentence in the following example code with your own and experiment with the output.
Example Code:
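import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;

import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.cmdline.PerformanceMonitor;
import opennlp.tools.cmdline.postag.POSModelLoader;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.Span;

public static void chunk() throws IOException {
    POSModel model = new POSModelLoader()
        .load(new File("en-pos-maxent.bin"));
    PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
    POSTaggerME tagger = new POSTaggerME(model);

    String input = "Hi. How are you? This is Mike.";
    ObjectStream<String> lineStream = new PlainTextByLineStream(
        new StringReader(input));

    perfMon.start();
    String line;
    // these keep the tokens and tags of the last line read
    String whitespaceTokenizerLine[] = null;
    String[] tags = null;
    while ((line = lineStream.read()) != null) {
        whitespaceTokenizerLine = WhitespaceTokenizer.INSTANCE
            .tokenize(line);
        tags = tagger.tag(whitespaceTokenizerLine);

        POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
        System.out.println(sample.toString());
        perfMon.incrementCounter();
    }
    perfMon.stopAndPrintFinalResult();

    // chunker
    InputStream is = new FileInputStream("en-chunker.bin");
    ChunkerModel cModel = new ChunkerModel(is);
    is.close(); // the original never closed this stream

    ChunkerME chunkerME = new ChunkerME(cModel);

    // one chunk tag per token
    String result[] = chunkerME.chunk(whitespaceTokenizerLine, tags);
    for (String s : result)
        System.out.println(s);

    // the same chunks as ranges of token indices
    Span[] span = chunkerME.chunkAsSpans(whitespaceTokenizerLine, tags);
    for (Span s : span)
        System.out.println(s.toString());
}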
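The chunk tags printed above are per token, so it takes one more step to see the actual phrases. The sketch below groups tokens into phrases, assuming the tags follow the usual B-/I-/O scheme (for example B-NP starts a noun phrase and I-NP continues it). The class ChunkGroupDemo, the method group, and the tags in main are hand-written for illustration; they are not real model output.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; uses only the JDK.
final class ChunkGroupDemo {
    // Groups tokens into "TYPE: words" phrases from per-token B-/I-/O tags.
    static List<String> group(String[] tokens, String[] chunkTags) {
        List<String> phrases = new ArrayList<>();
        StringBuilder current = null;
        String currentType = null;
        for (int i = 0; i < tokens.length; i++) {
            String tag = chunkTags[i];
            boolean continues = tag.startsWith("I-") && current != null
                    && tag.substring(2).equals(currentType);
            if (continues) {
                current.append(' ').append(tokens[i]);
                continue;
            }
            // Any other tag closes the chunk that is currently open.
            if (current != null) {
                phrases.add(currentType + ": " + current);
                current = null;
                currentType = null;
            }
            // B-X (or a stray I-X) opens a new chunk; O opens nothing.
            if (tag.startsWith("B-") || tag.startsWith("I-")) {
                currentType = tag.substring(2);
                current = new StringBuilder(tokens[i]);
            }
        }
        if (current != null) {
            phrases.add(currentType + ": " + current);
        }
        return phrases;
    }

    public static void main(String[] args) {
        // Illustrative input only; these tags were written by hand.
        String[] tokens = {"This", "is", "a", "useful", "website"};
        String[] tags = {"B-NP", "B-VP", "B-NP", "I-NP", "I-NP"};
        // prints [NP: This, VP: is, NP: a useful website]
        System.out.println(group(tokens, tags));
    }
}
```

The chunkAsSpans call in the example code gives you the same grouping as index ranges, so this kind of helper is only needed if you start from the tag array.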
6. Parser
Given the sentence "Programcreek is a very huge and useful website.", the parser can return the parse tree shown in the comment at the end of the example code.
Example Code:
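import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.cmdline.parser.ParserTool;
import opennlp.tools.parser.Parse;
import opennlp.tools.parser.Parser;
import opennlp.tools.parser.ParserFactory;
import opennlp.tools.parser.ParserModel;
import opennlp.tools.util.InvalidFormatException;

public static void Parse() throws InvalidFormatException, IOException {
    // http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Parser#Training_Tool
    InputStream is = new FileInputStream("en-parser-chunking.bin");
    ParserModel model = new ParserModel(is);
    is.close();

    Parser parser = ParserFactory.create(model);

    String sentence = "Programcreek is a very huge and useful website.";
    // ask for the single best parse of the sentence
    Parse topParses[] = ParserTool.parseLine(sentence, parser, 1);

    for (Parse p : topParses)
        p.show();

    /*
     * Output:
     * (TOP (S (NP (NN Programcreek) ) (VP (VBZ is) (NP (DT a) (ADJP (RB very)
     * (JJ huge) (CC and) (JJ useful) ) ) ) (. website.) ) )
     */
}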
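The bracketed tree is easy to read by eye, but if you only want the words and their tags, you can pull the innermost "(TAG word)" groups out of the printed text. This is a JDK-only sketch that works on the output string shown in the comment above; the class ParseLeavesDemo and the method leaves are made up for illustration, and the regular expression assumes the bracket format shown there.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; uses only the JDK.
final class ParseLeavesDemo {
    // Matches innermost "(TAG word)" groups, which are the leaves of the tree.
    private static final Pattern LEAF =
            Pattern.compile("\\(([^\\s()]+) ([^\\s()]+)\\)");

    // Returns the leaves of a bracketed parse as "word/TAG" strings.
    static List<String> leaves(String bracketed) {
        List<String> result = new ArrayList<>();
        Matcher m = LEAF.matcher(bracketed);
        while (m.find()) {
            result.add(m.group(2) + "/" + m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String tree = "(TOP (S (NP (NN Programcreek) ) (VP (VBZ is) (NP (DT a) "
                + "(ADJP (RB very) (JJ huge) (CC and) (JJ useful) ) ) ) (. website.) ) )";
        for (String leaf : leaves(tree)) {
            System.out.println(leaf);
        }
    }
}
```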
References:
Even though the official documentation is not great, some parts of the OpenNLP site are still useful.
wiki: http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Main_Page
Java Doc: http://opennlp.apache.org/documentation/1.5.2-incubating/apidocs/opennlp-tools/index.html