rnnmorph

Morphological analyzer (POS tagger) for Russian and English, based on neural networks and dictionary lookup (pymorphy2, nltk).

Russian language, MorphoRuEval-2017 test dataset, accuracy

Domain	Full tag	PoS tag	Full tag + lemma	Sentence full tag	Sentence full tag + lemma
Lenta (news)	96.31%	98.01%	92.96%	77.93%	52.79%
VK (social)	95.20%	98.04%	92.06%	74.30%	60.56%
JZ (lit.)	95.87%	98.71%	90.45%	73.10%	43.15%
All	95.81%	98.26%	N/A	74.92%	N/A

English language, UD EWT test, accuracy

Dataset	Full tag	PoS tag	Full tag + lemma	Sentence full tag	Sentence full tag + lemma
UD EWT test	91.57%	94.10%	87.02%	63.17%	50.99%
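
The accuracy columns in the tables above can be illustrated with a small sketch. It assumes the usual reading of these metrics: "PoS tag" and "Full tag" are per-token accuracies, while the sentence-level columns count a sentence as correct only if every token in it is correct. The function names and the `(pos, features)` token layout are hypothetical and are not part of rnnmorph or of the official evaluation script.

```python
from typing import List, Tuple

# A token is a (pos, features) pair; a sentence is a list of tokens.
# This layout is an assumption made for the illustration only.
Token = Tuple[str, str]
Sentence = List[Token]


def _token_pairs(gold: List[Sentence], pred: List[Sentence]):
    """Align gold and predicted tokens sentence by sentence."""
    return [(g, p) for gs, ps in zip(gold, pred) for g, p in zip(gs, ps)]


def pos_accuracy(gold: List[Sentence], pred: List[Sentence]) -> float:
    """Share of tokens whose PoS label matches (expects non-empty input)."""
    pairs = _token_pairs(gold, pred)
    return sum(g[0] == p[0] for g, p in pairs) / len(pairs)


def full_tag_accuracy(gold: List[Sentence], pred: List[Sentence]) -> float:
    """Share of tokens whose PoS and features both match."""
    pairs = _token_pairs(gold, pred)
    return sum(g == p for g, p in pairs) / len(pairs)


def sentence_accuracy(gold: List[Sentence], pred: List[Sentence]) -> float:
    """Share of sentences in which every token's full tag matches."""
    return sum(gs == ps for gs, ps in zip(gold, pred)) / len(gold)


gold = [
    [("NOUN", "Case=Nom"), ("VERB", "Tense=Past")],
    [("NOUN", "Case=Acc")],
]
pred = [
    [("NOUN", "Case=Nom"), ("VERB", "Tense=Pres")],  # one wrong feature
    [("NOUN", "Case=Acc")],
]

print(f"{pos_accuracy(gold, pred):.2%}")       # 100.00%
print(f"{full_tag_accuracy(gold, pred):.2%}")  # 66.67%
print(f"{sentence_accuracy(gold, pred):.2%}")  # 50.00%
```

A single wrong feature leaves the PoS score untouched but fails the whole sentence, which is why the sentence-level numbers are much lower than the per-token ones.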

Speed and memory consumption

Speed: 200 to 600 words per second on CPU.

Memory consumption: about 500-600 MB for single-sentence predictions.
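
To check the throughput figure on your own hardware, something like the sketch below may help. The `words_per_second` helper and its `tag` argument are hypothetical: `tag` stands for any callable that takes a list of words, for example a wrapper around the predictor's `predict` method. The `clock` parameter exists only so the function can be tested with a fake timer.

```python
import time
from typing import Callable, List, Sequence


def words_per_second(
    tag: Callable[[Sequence[str]], object],
    sentences: List[Sequence[str]],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Run `tag` on every sentence and return words processed per second."""
    n_words = sum(len(sentence) for sentence in sentences)
    start = clock()
    for sentence in sentences:
        tag(sentence)
    elapsed = clock() - start
    # Guard against a zero interval on very fast or very coarse clocks.
    return n_words / elapsed if elapsed > 0 else float("inf")
```

Measuring over a few hundred sentences after one warm-up call should give a steadier number, since the first call may include one-off initialization.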

Install

sudo pip3 install rnnmorph

Usage

from rnnmorph.predictor import RNNMorphPredictor

# Create a predictor for Russian
predictor = RNNMorphPredictor(language="ru")

# Tag a tokenized sentence; the result holds one form per input word
forms = predictor.predict(["мама", "мыла", "раму"])

print(forms[0].pos)          # NOUN
print(forms[0].tag)          # Case=Nom|Gender=Fem|Number=Sing
print(forms[0].normal_form)  # мама
print(forms[0].vector)
# [0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 1 0 0 1]
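
The `tag` string shown above is a `|`-separated list of `Name=Value` pairs, so it can be split into a dictionary for easier filtering. The `parse_tag` helper below is a hypothetical convenience function, not part of rnnmorph. It also treats an empty string or `_` as "no features", which is an assumption borrowed from the Universal Dependencies convention.

```python
from typing import Dict


def parse_tag(tag: str) -> Dict[str, str]:
    """Split a 'Name=Value|Name=Value' tag string into a dictionary."""
    if not tag or tag == "_":
        return {}
    # Split each pair on the first '=' only, in case a value contains '='.
    return dict(pair.split("=", 1) for pair in tag.split("|"))


features = parse_tag("Case=Nom|Gender=Fem|Number=Sing")
print(features["Case"])    # Nom
print(features["Number"])  # Sing
```

With the features in a dictionary, selecting words by a single category (for example, all forms with `Case` equal to `Nom`) becomes a plain key lookup.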

Training

Simple model training:

Acknowledgements

Anastasyev D. G., Gusev I. O., Indenbom E. M., 2018, Improving Part-of-speech Tagging Via Multi-task Learning and Character-level Word Representations
Anastasyev D. G., Andrianov A. I., Indenbom E. M., 2017, Part-of-speech Tagging with Rich Language Description, presentation
Morphological analysis track of "Dialogue-2017"
Track materials
Morphine by kmike, CRF classifier for MorphoRuEval-2017 by kmike
Universal Dependencies
Tobias Horsmann and Torsten Zesch, 2017, Do LSTMs really work so well for PoS tagging? – A replication study
Barbara Plank, Anders Søgaard, Yoav Goldberg, 2016, Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss