OpenNLP Tutorial

The Apache OpenNLP library is a machine-learning-based toolkit for processing natural language text. It includes a sentence detector, a tokenizer, a name finder, a part-of-speech (POS) tagger, a chunker, and a parser. Its APIs are easy to integrate into a Java program. However, the documentation is weak and contains some outdated information. I walked through the commonly used functions and got every component working. Below are code examples with brief explanations. As you will see, processing natural language is similar to how compilers process programming languages.

Before we start the examples, we need to download the required files. The only jar required is opennlp-tools-1.5.2-incubating.jar. In addition, we need the pre-trained model files (the .bin files loaded in the examples below), which are a separate download.

1. Sentence Detector

The sentence detector finds sentence boundaries. Given the following paragraph:

Hi. How are you? This is Mike.

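the sentence detector returns an array of strings. In this case, the array has two elements, shown below. Note that the model does not split after “Hi.”, so “Hi. How are you?” comes back as a single element.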

Hi. How are you?
This is Mike.

Example Code:

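	// Imports (at the top of the enclosing source file):
	import java.io.FileInputStream;
	import java.io.IOException;
	import java.io.InputStream;
	import opennlp.tools.sentdetect.SentenceDetectorME;
	import opennlp.tools.sentdetect.SentenceModel;
	import opennlp.tools.util.InvalidFormatException;

	public static void SentenceDetect() throws InvalidFormatException,
			IOException {
		String paragraph = "Hi. How are you? This is Mike.";

		// Always start with a model; a model is learned from training data.
		// try-with-resources closes the stream even if loading fails.
		try (InputStream is = new FileInputStream("en-sent.bin")) {
			SentenceModel model = new SentenceModel(is);
			SentenceDetectorME sdetector = new SentenceDetectorME(model);

			String[] sentences = sdetector.sentDetect(paragraph);

			// Print every detected sentence instead of assuming there are two.
			for (String sentence : sentences)
				System.out.println(sentence);
		}
	}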

2. Tokenizer

Tokens are usually words separated by spaces, but there are exceptions. For example, “isn’t” is split into “is” and “n’t”, since it is a contraction of “is not”. Our sentence is separated into the following tokens:

Hi
.
How
are
you
?
This
is
Mike
.

Example Code:

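	// Imports (at the top of the enclosing source file):
	import java.io.FileInputStream;
	import java.io.IOException;
	import java.io.InputStream;
	import opennlp.tools.tokenize.Tokenizer;
	import opennlp.tools.tokenize.TokenizerME;
	import opennlp.tools.tokenize.TokenizerModel;
	import opennlp.tools.util.InvalidFormatException;

	public static void Tokenize() throws InvalidFormatException, IOException {
		// Load the tokenizer model; the stream is closed automatically.
		try (InputStream is = new FileInputStream("en-token.bin")) {
			TokenizerModel model = new TokenizerModel(is);
			Tokenizer tokenizer = new TokenizerME(model);

			String[] tokens = tokenizer.tokenize("Hi. How are you? This is Mike.");

			// Print one token per line.
			for (String token : tokens)
				System.out.println(token);
		}
	}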
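
To see why a tokenizer is needed at all, here is a minimal sketch that compares a plain whitespace split with a simple punctuation-aware split. This is not how OpenNLP tokenizes (TokenizerME uses a trained model); it is a regular-expression stand-in using only the JDK, and the class name SimpleTokenizerDemo is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleTokenizerDemo {
	// A token is either a run of word characters or a single character
	// that is neither a word character nor whitespace (i.e. punctuation).
	private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

	// Regex-based stand-in for a tokenizer; NOT the OpenNLP algorithm.
	static List<String> tokenize(String text) {
		List<String> tokens = new ArrayList<>();
		Matcher m = TOKEN.matcher(text);
		while (m.find())
			tokens.add(m.group());
		return tokens;
	}

	public static void main(String[] args) {
		String text = "Hi. How are you? This is Mike.";

		// Whitespace split keeps punctuation glued to words ("Hi.", "you?").
		System.out.println(String.join(" | ", text.split("\\s+")));

		// The regex split separates punctuation into its own tokens.
		System.out.println(String.join(" | ", tokenize(text)));
	}
}
```

For the sample sentence this simple rule produces the same ten tokens listed above. It would not reproduce the “is” / “n’t” split, though; that kind of behavior is what the trained model adds.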

3. Name Finder

As its name suggests, the name finder finds names in text. It accepts an array of strings (the tokens of one sentence) and returns the positions of the names it finds. The following example shows what it can do.

Example Code:

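	// Imports (at the top of the enclosing source file):
	import java.io.FileInputStream;
	import java.io.IOException;
	import java.io.InputStream;
	import opennlp.tools.namefind.NameFinderME;
	import opennlp.tools.namefind.TokenNameFinderModel;
	import opennlp.tools.util.Span;

	public static void findName() throws IOException {
		TokenNameFinderModel model;
		// The model is fully loaded by the constructor, so the stream can be
		// closed before the name finder is used.
		try (InputStream is = new FileInputStream("en-ner-person.bin")) {
			model = new TokenNameFinderModel(is);
		}

		NameFinderME nameFinder = new NameFinderME(model);

		// The input is an already tokenized sentence.
		String[] sentence = new String[] {
				"Mike", "Smith", "is", "a", "good", "person" };

		// Each Span is a half-open token range: [0..2) covers tokens 0 and 1.
		Span[] nameSpans = nameFinder.find(sentence);

		for (Span s : nameSpans)
			System.out.println(s.toString());
	}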
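
The name finder prints each match as a token range such as [0..2), which readers have reported as the output for the sentence above. As a minimal sketch of how to read that output, the following plain-Java example turns a half-open range over the token array back into text. It does not use OpenNLP: TokenRange is a hypothetical stand-in for opennlp.tools.util.Span so that the snippet runs without the jar or the model files.

```java
import java.util.Arrays;

class SpanDemo {
	// Hypothetical stand-in for opennlp.tools.util.Span. The range is
	// half-open: start is inclusive, end is exclusive.
	static class TokenRange {
		final int start;
		final int end;

		TokenRange(int start, int end) {
			this.start = start;
			this.end = end;
		}

		// Joins the tokens covered by this range with single spaces.
		String coveredText(String[] tokens) {
			return String.join(" ", Arrays.copyOfRange(tokens, start, end));
		}

		@Override
		public String toString() {
			return "[" + start + ".." + end + ")";
		}
	}

	public static void main(String[] args) {
		String[] sentence = { "Mike", "Smith", "is", "a", "good", "person" };
		// Assumed for illustration: the range reported for findName().
		TokenRange name = new TokenRange(0, 2);
		System.out.println(name + " -> " + name.coveredText(sentence));
	}
}
```

So [0..2) means “tokens 0 and 1”, here “Mike Smith”. With the real library, the same lookup can be done with the span's getStart() and getEnd() values against the token array passed to find().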

4. POS Tagger

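The POS tagger assigns a part-of-speech tag to each token. The example below splits the input on whitespace only, so punctuation stays attached to the neighboring word (for example “you?”). Its output looks like this:

Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NNP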

Example Code:

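	// Imports (at the top of the enclosing source file):
	import java.io.File;
	import java.io.IOException;
	import java.io.StringReader;
	import opennlp.tools.cmdline.PerformanceMonitor;
	import opennlp.tools.cmdline.postag.POSModelLoader;
	import opennlp.tools.postag.POSModel;
	import opennlp.tools.postag.POSSample;
	import opennlp.tools.postag.POSTaggerME;
	import opennlp.tools.tokenize.WhitespaceTokenizer;
	import opennlp.tools.util.ObjectStream;
	import opennlp.tools.util.PlainTextByLineStream;

	public static void POSTag() throws IOException {
		POSModel model = new POSModelLoader()
				.load(new File("en-pos-maxent.bin"));
		// Reports throughput (sentences per second) on System.err.
		PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
		POSTaggerME tagger = new POSTaggerME(model);

		String input = "Hi. How are you? This is Mike.";
		ObjectStream<String> lineStream = new PlainTextByLineStream(
				new StringReader(input));

		perfMon.start();
		String line;
		while ((line = lineStream.read()) != null) {
			// Whitespace tokenization keeps punctuation attached to words.
			String[] whitespaceTokenizerLine = WhitespaceTokenizer.INSTANCE
					.tokenize(line);
			String[] tags = tagger.tag(whitespaceTokenizerLine);

			// POSSample.toString() prints the tokens as word_TAG pairs.
			POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
			System.out.println(sample.toString());

			perfMon.incrementCounter();
		}
		perfMon.stopAndPrintFinalResult();
	}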
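
The tagger output joins each token to its tag with an underscore. If you want to work with the tags rather than just print them, a minimal sketch like the following splits such a line back into word and tag pairs. It uses only the JDK; TaggedLineDemo and splitTagged are made-up names for illustration, not part of OpenNLP.

```java
import java.util.ArrayList;
import java.util.List;

class TaggedLineDemo {
	// Splits "word_TAG word_TAG ..." into {word, tag} pairs. The LAST
	// underscore is used as the separator, so a word that itself contains
	// an underscore is still split at the tag boundary.
	static List<String[]> splitTagged(String line) {
		List<String[]> pairs = new ArrayList<>();
		for (String item : line.trim().split("\\s+")) {
			int sep = item.lastIndexOf('_');
			if (sep <= 0)
				continue; // not of the form word_TAG, skip it
			pairs.add(new String[] { item.substring(0, sep), item.substring(sep + 1) });
		}
		return pairs;
	}

	public static void main(String[] args) {
		String tagged = "Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NNP";
		for (String[] pair : splitTagged(tagged))
			System.out.println(pair[0] + "\t" + pair[1]);
	}
}
```

Inside POSTag() itself this parsing is unnecessary, because the tokens and tags are already available as two parallel arrays; the sketch is only useful when all you have is the printed line.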

5. Chunker

The chunker may not matter to every user, but it is worth mentioning here. The chunker partitions a sentence into a set of chunks (such as noun phrases and verb phrases), using the tokens and their POS tags as input.
I don’t have a good example that shows why the chunker is useful, but you can replace the sentence in the following example code with your own and experiment with the output.

Example Code:

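	// Imports (at the top of the enclosing source file):
	import java.io.File;
	import java.io.FileInputStream;
	import java.io.IOException;
	import java.io.InputStream;
	import java.io.StringReader;
	import opennlp.tools.chunker.ChunkerME;
	import opennlp.tools.chunker.ChunkerModel;
	import opennlp.tools.cmdline.PerformanceMonitor;
	import opennlp.tools.cmdline.postag.POSModelLoader;
	import opennlp.tools.postag.POSModel;
	import opennlp.tools.postag.POSSample;
	import opennlp.tools.postag.POSTaggerME;
	import opennlp.tools.tokenize.WhitespaceTokenizer;
	import opennlp.tools.util.ObjectStream;
	import opennlp.tools.util.PlainTextByLineStream;
	import opennlp.tools.util.Span;

	public static void chunk() throws IOException {

		// Step 1: POS-tag the input, exactly as in the POS tagger example.
		POSModel model = new POSModelLoader()
				.load(new File("en-pos-maxent.bin"));
		PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
		POSTaggerME tagger = new POSTaggerME(model);

		String input = "Hi. How are you? This is Mike.";
		ObjectStream<String> lineStream = new PlainTextByLineStream(
				new StringReader(input));

		perfMon.start();
		String line;
		// These hold the tokens and tags of the last line read; the input
		// here is a single line, so that is the whole text.
		String[] whitespaceTokenizerLine = null;
		String[] tags = null;
		while ((line = lineStream.read()) != null) {

			whitespaceTokenizerLine = WhitespaceTokenizer.INSTANCE
					.tokenize(line);
			tags = tagger.tag(whitespaceTokenizerLine);

			POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
			System.out.println(sample.toString());

			perfMon.incrementCounter();
		}
		perfMon.stopAndPrintFinalResult();

		// Step 2: run the chunker on the tokens and their POS tags.
		ChunkerModel cModel;
		try (InputStream is = new FileInputStream("en-chunker.bin")) {
			cModel = new ChunkerModel(is);
		}

		ChunkerME chunkerME = new ChunkerME(cModel);

		// One chunk tag per token.
		String[] result = chunkerME.chunk(whitespaceTokenizerLine, tags);
		for (String s : result)
			System.out.println(s);

		// The same chunks expressed as token ranges.
		Span[] span = chunkerME.chunkAsSpans(whitespaceTokenizerLine, tags);
		for (Span s : span)
			System.out.println(s.toString());

	}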
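
The chunk() call returns one tag per token. As a minimal sketch of how those tags relate to chunks, the following example groups tokens into bracketed phrases, assuming the tags follow the usual B-/I-/O scheme (B-NP starts a noun phrase, I-NP continues it, O is outside any chunk). The tokens and tags in main() are hand-written for illustration and are not actual OpenNLP output; the class name ChunkGroupingDemo is also made up.

```java
import java.util.ArrayList;
import java.util.List;

class ChunkGroupingDemo {
	// Groups tokens into phrases from BIO-style chunk tags: "B-X" starts a
	// chunk of type X, "I-X" continues the open chunk, "O" is outside.
	// Simplification: an "I-X" tag is not checked against the open chunk's type.
	static List<String> group(String[] tokens, String[] chunkTags) {
		List<String> phrases = new ArrayList<>();
		StringBuilder current = null;
		for (int i = 0; i < tokens.length; i++) {
			String tag = chunkTags[i];
			if (tag.startsWith("I-") && current != null) {
				current.append(' ').append(tokens[i]);
				continue;
			}
			// Any other tag closes the chunk that is currently open.
			if (current != null) {
				phrases.add(current.append(']').toString());
				current = null;
			}
			// "B-X" (or a stray "I-X" with no open chunk) opens a new chunk;
			// "O" is too short to carry a type and opens nothing.
			if (tag.length() > 2)
				current = new StringBuilder("[" + tag.substring(2) + " " + tokens[i]);
		}
		if (current != null)
			phrases.add(current.append(']').toString());
		return phrases;
	}

	public static void main(String[] args) {
		// Illustrative input only, not produced by the chunker model.
		String[] tokens = { "This", "is", "a", "good", "website", "." };
		String[] tags = { "B-NP", "B-VP", "B-NP", "I-NP", "I-NP", "O" };
		for (String phrase : group(tokens, tags))
			System.out.println(phrase);
	}
}
```

In the real API, chunkAsSpans() already gives you these groups as token ranges, so hand-grouping like this is mainly useful for understanding what the raw tags mean.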

6. Parser
Given the sentence “Programcreek is a very huge and useful website.”, the parser returns a parse tree. The tree printed for this sentence is shown in the comment at the end of the example code.

Example Code:

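	// Imports (at the top of the enclosing source file):
	import java.io.FileInputStream;
	import java.io.IOException;
	import java.io.InputStream;
	import opennlp.tools.cmdline.parser.ParserTool;
	import opennlp.tools.parser.Parse;
	import opennlp.tools.parser.Parser;
	import opennlp.tools.parser.ParserFactory;
	import opennlp.tools.parser.ParserModel;
	import opennlp.tools.util.InvalidFormatException;

	public static void Parse() throws InvalidFormatException, IOException {
		// http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Parser#Training_Tool
		ParserModel model;
		// The stream is closed automatically once the model is loaded.
		try (InputStream is = new FileInputStream("en-parser-chunking.bin")) {
			model = new ParserModel(is);
		}

		Parser parser = ParserFactory.create(model);

		String sentence = "Programcreek is a very huge and useful website.";
		// Ask for the single best parse of the sentence.
		Parse[] topParses = ParserTool.parseLine(sentence, parser, 1);

		// show() prints the tree in bracketed form to standard output.
		for (Parse p : topParses)
			p.show();

		/*
		 * (TOP (S (NP (NN Programcreek) ) (VP (VBZ is) (NP (DT a) (ADJP (RB
		 * very) (JJ huge) (CC and) (JJ useful) ) ) ) (. website.) ) )
		 */
	}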
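
The bracketed tree is easier to use once you can pull information out of it. As a minimal sketch, the following example extracts the leaves (each word with its POS label) from the bracketed string shown in the comment above, using only a JDK regular expression. It works on the printed text, not on OpenNLP's Parse objects, and ParseLeavesDemo is a made-up name for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParseLeavesDemo {
	// Matches an innermost "(LABEL word)" group: neither part may contain
	// parentheses or whitespace, so phrase nodes like "(NP (DT a) ...)"
	// are skipped and only the leaves match.
	private static final Pattern LEAF =
			Pattern.compile("\\(([^()\\s]+) ([^()\\s]+)\\)");

	// Returns the leaves of a bracketed parse as "word/LABEL" strings.
	static List<String> leaves(String bracketed) {
		List<String> result = new ArrayList<>();
		Matcher m = LEAF.matcher(bracketed);
		while (m.find())
			result.add(m.group(2) + "/" + m.group(1));
		return result;
	}

	public static void main(String[] args) {
		// The tree from the example above, joined onto one line.
		String tree = "(TOP (S (NP (NN Programcreek) ) (VP (VBZ is) (NP (DT a) "
				+ "(ADJP (RB very) (JJ huge) (CC and) (JJ useful) ) ) ) (. website.) ) )";
		for (String leaf : leaves(tree))
			System.out.println(leaf);
	}
}
```

For anything beyond the leaves, such as picking out particular clauses, it is probably better to walk the tree itself and apply your own rules rather than match on the printed string.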

References:
Even though the documentation is not good, parts of the official OpenNLP site are still useful.
wiki: http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Main_Page
Java Doc: http://opennlp.apache.org/documentation/1.5.2-incubating/apidocs/opennlp-tools/index.html

  • ryanlr

    This is a complicated problem if OpenNLP does not provide API to do that. This link may be helpful: http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.sentdetect.training

  • Mohamed

    How can we create our own models, or add other models such as sport, art …

  • ryanlr

    I don’t think OpenNLP can do that. But when you parse the sentence, you get a tree. Then you may form the clause you want by defining your own rules.

  • roostae

    I want to extract the subordinate clause, main clause, relative clause, restrictive relative clause, and non-restrictive relative clause from sentences, but I don’t know how to do this. For example:

    “I first saw her in Paris, where I lived in the early nineties.”
    [main clause][relative clause]

    “She held out the hand that was hurt.”
    [main clause][restrictive relative clause]

    please help me to do this work?

  • ryanlr

    Models are learned from training data, and then used to process new data.

  • vivek john

    what are the model files used for?

  • Paolo

    hi…thank you for this tutorial.
    do you know which input formats are supported by OpenNLP? for example… txt, pdf, doc, xml etc…

    thank you
    regards from Italy

  • Ron

    Yes, findName() should print range.

  • ASHISH

    findName() prints the following output , is it correct?….

    [0..2)

  • Alessandra

    I would like to provide (train) a POS tagger model for italian language. I have some questions:
    - may I use a token_tag pair list in place of a tagged sentence list? Something like
    casa_NOUN
    e_CON (that is Conjunction)

    - Do I need to provide a tag dictionary? Is there a default tag dictionary?
    thanks

  • Adam

    Thanks so much for posting this! I really appreciate it. In the first example, with sentence boundary detection, why is “Hi. How are you?” shown as 1 sentence? Is this a bug in the program?

    Thanks

  • developerSh

    The very simple and effective example code to start the work on opennlp 10x a lot

  • Carlos

    Hi there,

    Is it possible to post here the source code or email it to me? Thanks!

  • Giri

    simple and good ..

  • Admin

    thanks a lot.

  • Joern

    Official documentation for 1.5.2 is located here:
    http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html

    The documentation over at SourceForge is outdated.