ixa-pipe-pos


ixa-pipe-pos is a multilingual Part of Speech tagger and Lemmatizer, currently offering pre-trained models for eight languages: Basque, Dutch, English, French, Galician, German, Italian, and Spanish. ixa-pipe-pos is part of IXA pipes, a multilingual set of NLP tools developed by the IXA NLP Group [http://ixa2.si.ehu.es/ixa-pipes]. Current version is 1.5.2.

Please go to [http://ixa2.si.ehu.es/ixa-pipes] for general information about the IXA pipes tools but also for official releases, including source code and binary packages for all the tools in the IXA pipes toolkit.

This document is intended to be the usage guide of ixa-pipe-pos. If you really need to clone and install this repository instead of using the releases provided in [http://ixa2.si.ehu.es/ixa-pipes], please scroll down to the end of the document for the installation instructions.

TABLE OF CONTENTS

  1. Overview of ixa-pipe-pos
  2. Usage of ixa-pipe-pos
  3. API via Maven Dependency
  4. Git installation
  5. Adding your language

OVERVIEW

ixa-pipe-pos provides statistical POS tagging and lemmatization for several languages. We provide Perceptron (Collins 2002) and Maximum Entropy (Ratnaparkhi 1999) POS tagging and lemmatization models trained on the following data for each language:

To avoid duplication of efforts, we use and contribute to the machine learning API provided by the Apache OpenNLP project. Additionally, we have added other features such as dictionary-based lemmatization, multiword and clitic pronoun treatment, post-processing via tag dictionaries, etc., as described below.

ixa-pipe-pos is distributed under Apache License version 2.0 (see LICENSE.txt for details).

Models

Remember that for Galician and Spanish the output of the statistical models can be post-processed using the monosemic dictionaries provided via the --dictag CLI option.

Resources

We provide some dictionaries to modify the output of the statistical tagger and lemmatizer. To use them, please download the resources tarball and untar it into the src/main/resources/ directory inside ixa-pipe-pos before compilation (see step 4 of the installation instructions).

USAGE

ixa-pipe-pos provides the following functionalities:

  1. server: starts a TCP service loading the model and required resources.
  2. client: sends a NAF document to a running TCP server.
  3. tag: reads a NAF document containing wf elements and creates term elements with the morphological information.
  4. train: trains new models with several options available (read the trainParams.properties file for details).
  5. eval: evaluates a trained model with a given test set.
  6. cross: performs cross-validation evaluation.

Each of these functionalities is accessible by adding (tag|train|eval|cross|server|client) as a subcommand to ixa-pipe-pos-$version-exec.jar. Please read below and check the -help parameter ($version refers to the current ixa-pipe-pos version).

java -jar target/ixa-pipe-pos-1.5.2-exec.jar (tag|train|eval|cross|server|client) -help

Tagging

If you are in a hurry, download or create a plain text file and use it like this:

cat guardian.txt | java -jar ixa-pipe-tok-1.8.5-exec.jar tok -l en | java -jar ixa-pipe-pos-1.5.2-exec.jar tag -m en-pos-perceptron-autodict01-conll09.bin -lm en-lemma-perceptron-conll09.bin

If you want to know more, please keep reading.

ixa-pipe-pos reads NAF documents containing wf elements via standard input and outputs NAF through standard output. The NAF format specification is here:

(http://wordpress.let.vupr.nl/naf/)

You can get the necessary input for ixa-pipe-pos by piping the output of ixa-pipe-tok into it.

There are several options to tag with ixa-pipe-pos:

Tagging Example:

Download or create a plain text file and use it like this:

cat guardian.txt | java -jar ixa-pipe-tok-1.8.5-exec.jar tok -l en | java -jar ixa-pipe-pos-1.5.2-exec.jar tag -m en-pos-perceptron-autodict01-conll09.bin -lm en-lemma-perceptron-conll09.bin

Remember to download some models from the distributed packages!!
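A missing model file is an easy mistake to make, so it can help to check for the files before starting the pipeline. The following is a minimal sketch and not part of ixa-pipe-pos: check_models is a hypothetical helper name, and the file names in the usage comment are the ones from the example above.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not shipped with ixa-pipe-pos):
# returns 0 if every file given as an argument exists, 1 otherwise.
check_models() {
  missing=0
  for model in "$@"; do
    if [ ! -f "$model" ]; then
      # Report each missing file on stderr so stdout stays clean for NAF output.
      echo "missing model: $model" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage, assuming the models sit in the current directory:
#   check_models en-pos-perceptron-autodict01-conll09.bin en-lemma-perceptron-conll09.bin \
#     && cat guardian.txt \
#        | java -jar ixa-pipe-tok-1.8.5-exec.jar tok -l en \
#        | java -jar ixa-pipe-pos-1.5.2-exec.jar tag \
#            -m en-pos-perceptron-autodict01-conll09.bin \
#            -lm en-lemma-perceptron-conll09.bin
```

Because the check runs before the pipe is opened, nothing is tokenized or tagged unless both models are present.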

Server

We can start the TCP server as follows:

java -jar target/ixa-pipe-pos-1.5.2-exec.jar server -l en --port 2040 -m en-pos-perceptron-autodict01-conll09.bin -lm en-lemma-perceptron-conll09.bin

Once the server is running we can send NAF documents containing (at least) the text layer like this:

cat guardian.txt | java -jar ixa-pipe-tok-1.8.5-exec.jar tok -l en | java -jar ixa-pipe-pos-1.5.2-exec.jar client -p 2040
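The server may need some time to load the models before it accepts connections, so a script that starts it and immediately calls the client could fail. The sketch below polls the port first. It is an assumption-laden example rather than part of ixa-pipe-pos: wait_for_port is a hypothetical helper, it relies on the /dev/tcp feature of bash, and the port 2040 is the one used in the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ixa-pipe-pos): poll HOST:PORT until a TCP
# connection succeeds or ATTEMPTS tries have been made (default 30).
# Requires bash, because /dev/tcp is a bash feature.
wait_for_port() {
  local host=$1 port=$2 attempts=${3:-30} i=1
  while [ "$i" -le "$attempts" ]; do
    # The redirection succeeds only if something is listening on the port.
    if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep 1
    fi
    i=$((i + 1))
  done
  return 1
}

# Usage, assuming the server was started on port 2040 as shown above:
#   wait_for_port localhost 2040 60 \
#     && cat guardian.txt \
#        | java -jar ixa-pipe-tok-1.8.5-exec.jar tok -l en \
#        | java -jar ixa-pipe-pos-1.5.2-exec.jar client -p 2040
```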

Training

To train a new model, you just need to pass a training parameters file as an argument. Every training option is documented in the template trainParams.properties file.

Example:

java -jar target/ixa-pipe-pos-$version-exec.jar train -p trainParams.properties

Evaluation

To evaluate a trained model, the eval subcommand provides the following options:

Example:

java -jar target/ixa-pipe-pos-$version-exec.jar eval -c pos -m test-pos.bin -l en -t test.data

API

The easiest way to use ixa-pipe-pos programmatically is via Apache Maven. Add this dependency to your pom.xml:

<dependency>
    <groupId>eus.ixa</groupId>
    <artifactId>ixa-pipe-pos</artifactId>
    <version>1.5.2</version>
</dependency>

JAVADOC

The javadoc of the module is located here:

ixa-pipe-pos/target/ixa-pipe-pos-$version-javadoc.jar

Module contents

The contents of the module are the following:

+ formatter.xml           Apache OpenNLP code formatter for Eclipse SDK
+ pom.xml                 Maven pom file which deals with everything related to compilation and execution of the module
+ src/                    Java source code of the module and required resources
+ trainParams.properties  A template properties file containing documentation

Furthermore, the installation process, as described in the README.md, will generate another directory:

+ target/                 it contains the binary executable and other directories

INSTALLATION

Installing ixa-pipe-pos requires the following steps:

If you already have Java 1.8+ and Maven 3 installed on your machine, please go to step 3 directly. Otherwise, follow these steps:

1. Install JDK 1.8

If you do not install JDK 1.8 in a default location, you will probably need to configure the PATH in .bashrc or .bash_profile:

export JAVA_HOME=$pwd/java8
export PATH=${JAVA_HOME}/bin:${PATH}

Replace $pwd with the full path printed by running pwd inside the java directory.

If you use tcsh you will need to specify it in your .login as follows:

setenv JAVA_HOME $pwd/java8
setenv PATH ${JAVA_HOME}/bin:${PATH}

After logging into your shell again, run the command

java -version

You should now see that your JDK is version 1.8 or newer.
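The version check can also be scripted. The sketch below is a hedged example, not part of the module: java_major_version and installed_java_version are hypothetical helper names, and the sed expression assumes that the first line of the java -version output contains the version in double quotes, as common JDK builds print it.

```shell
#!/bin/sh
# Hypothetical helper: map a Java version string to its major version.
# Legacy strings such as "1.8.0_292" yield 8; modern ones such as "11.0.2" yield 11.
java_major_version() {
  version=$1
  case "$version" in
    1.*) version=${version#1.} ;;  # drop the legacy "1." prefix
  esac
  # Keep everything before the first '.', '_' or '-'.
  echo "${version%%[._-]*}"
}

# Hypothetical helper: extract the quoted version string from `java -version`,
# which writes to stderr (assumed format: ... version "X.Y.Z" ...).
installed_java_version() {
  java -version 2>&1 | sed -n 's/.* version "\([^"]*\)".*/\1/p' | head -n 1
}

# Usage:
#   major=$(java_major_version "$(installed_java_version)")
#   [ "$major" -ge 8 ] || echo "JDK 1.8 or newer is required" >&2
```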

2. Install MAVEN 3

Download MAVEN 3 from

wget http://apache.rediris.es/maven/maven-3/3.0.5/binaries/apache-maven-3.0.5-bin.tar.gz

Now you need to configure the PATH. For Bash Shell:

export MAVEN_HOME=$pwd/apache-maven-3.0.5
export PATH=${MAVEN_HOME}/bin:${PATH}

Replace $pwd with the full path printed by running pwd inside the apache maven directory.

For tcsh shell:

setenv MAVEN3_HOME $pwd/apache-maven-3.0.5
setenv PATH ${MAVEN3_HOME}/bin:${PATH}

After logging into your shell again, run the command

mvn -version

You should see a reference to the Maven version you have just installed, plus the JDK it is using.

3. Get module source code

If you must get the module source code from this repository, run:

git clone https://github.com/ixa-ehu/ixa-pipe-pos

4. Download the Resources and Models

Download the POS tagging and lemmatization models:

Additionally, we distribute dictionaries to correct the output of the statistical lemmatization. To use them, download the resources and untar the archive into the ixa-pipe-pos/src/main/resources/ directory before compilation:

cd ixa-pipe-pos/src/main/resources
wget http://ixa2.si.ehu.es/ixa-pipes/models/lemmatizer-dicts.tar.gz
tar xvzf lemmatizer-dicts.tar.gz

The lemmatizer-dicts archive contains the dictionaries required to support the statistical lemmatization.

5. Compile

cd ixa-pipe-pos
mvn clean package

This step will create a directory called target/ which contains various directories and files. Most importantly, there you will find the module executable:

ixa-pipe-pos-$version-exec.jar

This executable contains every dependency the module needs, so it is completely portable as long as you have a JVM 1.8 or newer installed.
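Since the jar name includes the version, scripts that hard-code it break after an upgrade. A minimal sketch of one way around this, assuming the jar follows the ixa-pipe-pos-$version-exec.jar naming shown above; find_exec_jar is a hypothetical helper, not something the build provides.

```shell
#!/bin/sh
# Hypothetical helper: print the path of the first ixa-pipe-pos-*-exec.jar
# found in the given directory (default: target) and return 0, or return 1
# with a message on stderr if there is none.
find_exec_jar() {
  dir=${1:-target}
  for jar in "$dir"/ixa-pipe-pos-*-exec.jar; do
    # If the glob matches nothing, the pattern stays literal and -f fails.
    if [ -f "$jar" ]; then
      echo "$jar"
      return 0
    fi
  done
  echo "no ixa-pipe-pos-*-exec.jar found in $dir" >&2
  return 1
}

# Usage, run from the ixa-pipe-pos directory after `mvn clean package`:
#   jar=$(find_exec_jar target) && java -jar "$jar" tag -help
```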

To install the module in the local maven repository, usually located in ~/.m2/, execute:

mvn clean install

Extend

To add your language to ixa-pipe-pos the following steps are required:

Contact information

Rodrigo Agerri
IXA NLP Group
University of the Basque Country (UPV/EHU)
E-20018 Donostia-San Sebastián
[email protected]