
OpenNLP Tutorial

Apache OpenNLP is a machine-learning toolkit for processing natural language text. It includes a sentence detector, a tokenizer, a name finder, a part-of-speech (POS) tagger, a chunker, and a parser. Its APIs are easy to integrate into a Java program. However, parts of the documentation are outdated.

In this tutorial, I will show you how to use Apache OpenNLP through a set of simple examples.

0. Download Jar Files and Set Up Environment

Before starting the examples, download the required jar files and add them to your project build path. The jar files are located in “apache-opennlp-1.5.3-bin.zip”, which can be downloaded here.

As of March 2014, the download page looks like the following:

[Screenshot: OpenNLP download page]

Unzip the .zip file and copy the 4 jar files in the “lib” directory to your project. In addition, you will later need to download model files for the tasks you want to run (shown in the examples below); they can be downloaded here.

1. Sentence Detector

The sentence detector detects sentence boundaries. Given the following paragraph:

Hi. How are you? This is Mike.

the sentence detector returns an array of strings. In this case, the array has two elements, shown below.

Hi. How are you? 
This is Mike.

Example Code:

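import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.util.InvalidFormatException;

public static void SentenceDetect() throws InvalidFormatException,
		IOException {
	String paragraph = "Hi. How are you? This is Mike.";

	// always start with a model; a model is learned from training data
	// try-with-resources closes the model file even if loading fails
	try (InputStream is = new FileInputStream("en-sent.bin")) {
		SentenceModel model = new SentenceModel(is);
		SentenceDetectorME sdetector = new SentenceDetectorME(model);

		String[] sentences = sdetector.sentDetect(paragraph);

		// print every detected sentence instead of assuming exactly two
		for (String sentence : sentences)
			System.out.println(sentence);
	}
}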

2. Tokenizer

Tokens are usually words separated by spaces, but there are exceptions. For example, “isn’t” is split into “is” and “n’t”, since it is a contraction of “is not”. Our sentence is split into the following tokens:

Hi
.
How
are
you
?
This
is
Mike
.

Example Code:

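import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.InvalidFormatException;

public static void Tokenize() throws InvalidFormatException, IOException {
	// try-with-resources closes the model file even if loading fails
	try (InputStream is = new FileInputStream("en-token.bin")) {
		TokenizerModel model = new TokenizerModel(is);
		Tokenizer tokenizer = new TokenizerME(model);

		String[] tokens = tokenizer.tokenize("Hi. How are you? This is Mike.");

		// print one token per line
		for (String a : tokens)
			System.out.println(a);
	}
}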
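
To see why splitting on spaces alone is not enough, here is a minimal sketch that uses only the Java standard library. It is a naive regex splitter, not OpenNLP's TokenizerME, and the class name NaiveTokenizerSketch is made up for illustration. It separates punctuation from words in our sample sentence, but unlike the trained model it knows nothing about contractions such as “isn’t”.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of OpenNLP.
class NaiveTokenizerSketch {
	// a token is either a run of letters/digits or a single punctuation character
	private static final Pattern TOKEN =
			Pattern.compile("[\\p{L}\\p{N}]+|[^\\p{L}\\p{N}\\s]");

	static List<String> tokenize(String text) {
		List<String> tokens = new ArrayList<String>();
		Matcher m = TOKEN.matcher(text);
		while (m.find())
			tokens.add(m.group());
		return tokens;
	}

	public static void main(String[] args) {
		// prints the same ten tokens as the list above, one per line
		for (String t : tokenize("Hi. How are you? This is Mike."))
			System.out.println(t);
	}
}
```

A rule like this breaks down quickly on real text (abbreviations, contractions, URLs), which is why OpenNLP learns the token boundaries from training data instead.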

3. Name Finder

As its name suggests, the name finder finds names in text. It accepts an array of strings (the tokens of a sentence) and finds the names inside, as the following example shows.

Example Code:

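import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public static void findName() throws IOException {
	TokenNameFinderModel model;
	// the model is read completely in the constructor, so the stream can be closed right away
	try (InputStream is = new FileInputStream("en-ner-person.bin")) {
		model = new TokenNameFinderModel(is);
	}

	NameFinderME nameFinder = new NameFinderME(model);

	// the name finder expects an already tokenized sentence
	String[] sentence = new String[] {
		"Mike", "Smith", "is", "a", "good", "person"
	};

	// each Span is a range of token indices, e.g. [0..2)
	Span[] nameSpans = nameFinder.find(sentence);

	for (Span s : nameSpans)
		System.out.println(s.toString());
}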
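
For this input the method prints [0..2), as readers report in the comments. A Span is a half-open range of token indices: the start is included and the end is excluded, so [0..2) covers tokens 0 and 1. The following stdlib-only sketch shows how such a range maps back to text. The class SpanRangeSketch and its covered() helper are made up for illustration; with OpenNLP you would take the two indices from the Span's getStart() and getEnd() methods.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of OpenNLP.
class SpanRangeSketch {
	// joins tokens[start] .. tokens[end - 1]; end is exclusive
	static String covered(String[] tokens, int start, int end) {
		return String.join(" ", Arrays.copyOfRange(tokens, start, end));
	}

	public static void main(String[] args) {
		String[] sentence = { "Mike", "Smith", "is", "a", "good", "person" };
		// the span [0..2) reported by the name finder
		System.out.println(covered(sentence, 0, 2)); // prints "Mike Smith"
	}
}
```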

4. POS Tagger

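The POS tagger assigns a part-of-speech tag to each token. The example code below splits the input with a whitespace tokenizer, so punctuation stays attached to the preceding word. The output for our paragraph is:

Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NNP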

Example Code:

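import java.io.File;
import java.io.IOException;
import java.io.StringReader;

import opennlp.tools.cmdline.PerformanceMonitor;
import opennlp.tools.cmdline.postag.POSModelLoader;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public static void POSTag() throws IOException {
	POSModel model = new POSModelLoader()
		.load(new File("en-pos-maxent.bin"));
	// reports throughput on System.err, counted in sentences ("sent")
	PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
	POSTaggerME tagger = new POSTaggerME(model);

	String input = "Hi. How are you? This is Mike.";
	ObjectStream<String> lineStream = new PlainTextByLineStream(
			new StringReader(input));

	perfMon.start();
	String line;
	while ((line = lineStream.read()) != null) {
		// whitespace tokenization keeps punctuation attached, e.g. "Hi."
		String[] whitespaceTokenizerLine = WhitespaceTokenizer.INSTANCE
				.tokenize(line);
		String[] tags = tagger.tag(whitespaceTokenizerLine);

		// POSSample.toString() prints the tokens as token_TAG pairs
		POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
		System.out.println(sample.toString());

		perfMon.incrementCounter();
	}
	perfMon.stopAndPrintFinalResult();
}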
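
The tagged output is plain text in token_TAG form, so it is easy to take apart again, for example to recover the original sentence. Below is a minimal sketch that uses only the Java standard library. The class TaggedTextSketch and its methods are hypothetical names chosen for this illustration, and the sketch assumes that pairs are separated by whitespace and that the tag follows the last underscore.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of OpenNLP.
class TaggedTextSketch {
	// splits "token_TAG" at the last underscore into { token, tag }
	static String[] splitPair(String pair) {
		int i = pair.lastIndexOf('_');
		if (i < 0)
			return new String[] { pair, "" }; // no tag present
		return new String[] { pair.substring(0, i), pair.substring(i + 1) };
	}

	// drops the tags and joins the tokens with single spaces
	static String untag(String tagged) {
		List<String> words = new ArrayList<String>();
		for (String pair : tagged.trim().split("\\s+"))
			words.add(splitPair(pair)[0]);
		return String.join(" ", words);
	}

	public static void main(String[] args) {
		String tagged = "Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NNP";
		System.out.println(untag(tagged)); // prints "Hi. How are you? This is Mike."
	}
}
```

Splitting at the last underscore rather than the first keeps tokens that contain an underscore themselves intact.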

5. Chunker

The chunker may not be a concern for some users, but it is worth mentioning here. It partitions a sentence into a set of chunks, using the tokens generated by the tokenizer together with their POS tags.
I don’t have a good example to show why the chunker is useful, but you can replace the sentence in the following example code with your own and experiment with the output.

Example Code:

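import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;

import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.cmdline.PerformanceMonitor;
import opennlp.tools.cmdline.postag.POSModelLoader;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.Span;

public static void chunk() throws IOException {
	// step 1: POS-tag the input, exactly as in the POS tagger example
	POSModel model = new POSModelLoader()
			.load(new File("en-pos-maxent.bin"));
	PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
	POSTaggerME tagger = new POSTaggerME(model);

	String input = "Hi. How are you? This is Mike.";
	ObjectStream<String> lineStream = new PlainTextByLineStream(
			new StringReader(input));

	perfMon.start();
	String line;
	String[] whitespaceTokenizerLine = null;
	String[] tags = null;
	while ((line = lineStream.read()) != null) {
		whitespaceTokenizerLine = WhitespaceTokenizer.INSTANCE
				.tokenize(line);
		tags = tagger.tag(whitespaceTokenizerLine);

		POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
		System.out.println(sample.toString());
		perfMon.incrementCounter();
	}
	perfMon.stopAndPrintFinalResult();

	// step 2: load the chunker model; the stream is closed once the model is read
	ChunkerModel cModel;
	try (InputStream is = new FileInputStream("en-chunker.bin")) {
		cModel = new ChunkerModel(is);
	}

	// the chunker works on the tokens and tags of the last line read above
	ChunkerME chunkerME = new ChunkerME(cModel);

	// one chunk tag per token
	String[] result = chunkerME.chunk(whitespaceTokenizerLine, tags);
	for (String s : result)
		System.out.println(s);

	// the same chunks expressed as ranges of token indices
	Span[] span = chunkerME.chunkAsSpans(whitespaceTokenizerLine, tags);
	for (Span s : span)
		System.out.println(s.toString());
}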
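
The chunk() method returns one tag per token. The tags mark where a chunk begins (for example B-NP), where it continues (I-NP), and which tokens belong to no chunk (O). The following stdlib-only sketch groups tokens into bracketed chunks from such tags. The class BioChunkSketch is a made-up name, and the tokens and tags in main() are hand-written for illustration; they are not output from the en-chunker.bin model.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of OpenNLP.
class BioChunkSketch {
	// groups tokens into strings like "[NP the current account deficit]"
	static List<String> group(String[] tokens, String[] tags) {
		List<String> chunks = new ArrayList<String>();
		StringBuilder current = null; // the chunk being built, if any
		for (int i = 0; i < tokens.length; i++) {
			String tag = tags[i];
			if (tag.startsWith("B-")) {
				// a new chunk starts; close the previous one first
				if (current != null)
					chunks.add(current.append("]").toString());
				current = new StringBuilder("[" + tag.substring(2) + " " + tokens[i]);
			} else if (tag.startsWith("I-") && current != null) {
				// the token continues the open chunk
				current.append(" ").append(tokens[i]);
			} else {
				// "O" (or a stray I- tag): the token stands alone
				if (current != null) {
					chunks.add(current.append("]").toString());
					current = null;
				}
				chunks.add(tokens[i]);
			}
		}
		if (current != null)
			chunks.add(current.append("]").toString());
		return chunks;
	}

	public static void main(String[] args) {
		// illustrative input, not real model output
		String[] tokens = { "He", "reckons", "the", "current", "account", "deficit", "." };
		String[] tags = { "B-NP", "B-VP", "B-NP", "I-NP", "I-NP", "I-NP", "O" };
		for (String chunk : group(tokens, tags))
			System.out.println(chunk);
	}
}
```

The chunkAsSpans() call in the example above gives the same grouping directly, as ranges of token indices.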

6. Parser

Given the sentence “Programcreek is a very huge and useful website.”, the parser returns the parse tree shown in the comment at the end of the example code.

Example Code:

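import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.cmdline.parser.ParserTool;
import opennlp.tools.parser.Parse;
import opennlp.tools.parser.Parser;
import opennlp.tools.parser.ParserFactory;
import opennlp.tools.parser.ParserModel;
import opennlp.tools.util.InvalidFormatException;

public static void Parse() throws InvalidFormatException, IOException {
	// http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Parser#Training_Tool
	ParserModel model;
	// try-with-resources closes the model file even if loading fails
	try (InputStream is = new FileInputStream("en-parser-chunking.bin")) {
		model = new ParserModel(is);
	}

	Parser parser = ParserFactory.create(model);

	String sentence = "Programcreek is a very huge and useful website.";
	// ask for the single best parse of the sentence
	Parse[] topParses = ParserTool.parseLine(sentence, parser, 1);

	// show() prints the bracketed parse tree to standard output
	for (Parse p : topParses)
		p.show();

	/*
	 * (TOP (S (NP (NN Programcreek) ) (VP (VBZ is) (NP (DT a) (ADJP (RB
	 * very) (JJ huge) (CC and) (JJ useful) ) ) ) (. website.) ) )
	 */
}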
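
The printed tree is a bracketed string in which every innermost pair of parentheses holds a tag and a word. If you only need the words and their tags, a regular expression is enough. The following stdlib-only sketch pulls them out of the output shown in the comment above. The class ParseLeavesSketch is a made-up name for this illustration, and the sketch assumes that neither tags nor words contain spaces or parentheses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of OpenNLP.
class ParseLeavesSketch {
	// matches an innermost "(TAG word)" group
	private static final Pattern LEAF =
			Pattern.compile("\\(([^\\s()]+) ([^\\s()]+)\\)");

	// returns the leaves of a bracketed tree as "word/TAG" strings
	static List<String> leaves(String tree) {
		List<String> out = new ArrayList<String>();
		Matcher m = LEAF.matcher(tree);
		while (m.find())
			out.add(m.group(2) + "/" + m.group(1));
		return out;
	}

	public static void main(String[] args) {
		String tree = "(TOP (S (NP (NN Programcreek) ) (VP (VBZ is) (NP (DT a) "
				+ "(ADJP (RB very) (JJ huge) (CC and) (JJ useful) ) ) ) (. website.) ) )";
		for (String leaf : leaves(tree))
			System.out.println(leaf);
	}
}
```

Note the last leaf, website./. : the final period stays attached to “website.” and the whole token is tagged as punctuation, which is worth keeping in mind when you feed untokenized sentences to the parser.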

References:
Even though the documentation is not good, some parts of the official OpenNLP site are still useful.
wiki: http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Main_Page
Java Doc: http://opennlp.apache.org/documentation/1.5.2-incubating/apidocs/opennlp-tools/index.html

  • bezzu

    I am using en-pos-maxent.bin file for POSTagging. But it is giving Invalid format exception. When I googled i found that I have to remove tags.tag dict from the bin file. How to remove that? Please help me

  • Sujata Mehta

    Hi, we are trying to design a grammar checker using machine learning for which we need the POS tagger and parser functions of OpenNLP. But it keeps giving the exception : Usage: POSDictionaryWriter [-encoding encoding] dictionary tag_files . We cant figure out what the problem is !! Please help us .

  • ryanlr

    Yes, OpenNLP can run on windows.

  • dhanashree

    hi..what are the basic requirements for opennlp?can it run on windows7 32bit or not?

  • Jayant

    I have started learning Open NLP and curious to know that once we have POS Tagger , can we get its corresponding English sentence back ?

    For example:

    Input string: Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NNP

    OutPut string: Hi. How are you? This is Mike.

    Any help or pointers is highly appreciated.

  • yudhir

    Anybody have any files relating to conll2003 NER (Named entity Relation) .

  • pavel

    Would be great to see how to parse with the latest OpenNLP version. Seems like the ParserTool has gone missing.

  • Jerome Chung

    it’s a great tutorial,thanks.

  • Gemmaicle

    Hi! how do I include the models in netbeans. The classes are not recognize like SentenceModel and SentenceDetectorME. tnx!

  • mahi

    Hi, from the parser output i want to remove stop words because i want to get only meaningful words, is there any way to do this any aip?

  • xera

    For No 6. Parser, how do I write the output (p.show()) to the textfile?

  • ryanlr

    This is a complicated problem if OpenNLP does not provide API to do that. This link may be helpful: http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.sentdetect.training

  • Mohamed

    How to create our models or to add ather models like Sport, art …

  • ryanlr

    I don’t think OpenNLP can do that. But when you parse the sentence, you get a tree. Then you may form the clause you want by defining your own rules.

  • roostae

    I want to extract subordinate clause,main clause,relative clause,restrictive relative clause,non-restrictive relative clause from sentences but I don’t know how doing this work. for example:

    “I first saw her in Paris, where I lived in the early nineties.”
    [main clause][relative clause]

    “She held out the hand that was hurt.”
    [main clause][restrictive relative clause]

    please help me to do this work?

  • ryanlr

    Models are learned from training data, and then used to process new data.

  • vivek john

    what are the model files used for?

  • Paolo

    hi…thank you for this tutorial.
    do you know witch input type format are supported by openNLP? for example… txt, pdf, doc, xml etc…

    thank you
    ragards from Italy

  • Ron

    Yes, findName() should print range.

  • ASHISH

    findName() prints the following output , is it correct?….

    [0..2)

  • Alessandra

    I would like to provide (train) a POS tagger model for italian language. I have some questions:
    - may I use a token_tag pair list in place of a tagged sentence list? Something like
    casa_NOUN
    e_CON (that is Conjunction)

    - Do I need to provide a tag dictionary? Is there a default tag dictionary?
    thanks

  • Adam

    Thanks so much for posting this! I really appreciate it. In the first example, with sentence boundary detection, why is “Hi. How are you?” shown as 1 sentence? Is this a bug in the program?

    Thanks

  • developerSh

    The very simple and effective example code to start the work on opennlp 10x a lot

  • http://corvettebrasil.com Carlos

    Hi there,

    Is it possible to post here the source code or email it to me? Thanks!

  • Giri

    simple and good ..

  • Admin

    thanks a lot.

  • Joern

    Official documentation for 1.5.2 is located here:
    http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html

    The documentation over at SourceForge is outdated.