BiLSTM-CNN-CRF with ELMo-Representations for Sequence Tagging

This repository is an extension of my BiLSTM-CNN-CRF implementation.

It integrates the ELMo representations from the publication Deep contextualized word representations (Peters et al., 2018) into the BiLSTM-CNN-CRF architecture and can improve the performance significantly for different sequence tagging tasks.

The system is easy to use, optimized for high performance, and highly configurable.


Note: This implementation might be incompatible with different (e.g. more recent) versions of the frameworks. See docker/requirements.txt for a full list of all Python package requirements.


For an IPython Notebook with a simple example of how to use ELMo representations for sentence classification, see: Keras_ELMo_Tutorial.ipynb.

This code is an extension of the emnlp2017-bilstm-cnn-crf implementation. Most examples can be used with only slight adaptation. Please see that repository for an explanation of how to define datasets, configure hyperparameters, use multi-task learning, or create custom features.

Most aspects from emnlp2017-bilstm-cnn-crf work the same in this implementation.


This repository contains experimental software and is under active development. If you find the implementation useful, please cite the following paper: Alternative Weighting Schemes for ELMo Embeddings

  author    = {Reimers, Nils and Gurevych, Iryna},
  title     = {{Alternative Weighting Schemes for ELMo Embeddings}},
  journal   = {CoRR},
  volume    = {abs/1904.02954},
  year      = {2019},
  url       = {}

Contact person: Nils Reimers,

Don't hesitate to send us an e-mail or report an issue if something is broken (and it shouldn't be) or if you have further questions.

First two layers of the ELMo Model

In my publication Alternative Weighting Schemes for ELMo Embeddings, I show that it is often sufficient to use only the first two layers of ELMo. For various tasks, the third layer led to no significant improvement. Reducing the ELMo model from three to two layers increases the training speed by up to 50%.

You can download the reduced, pre-trained models from here:

This reduced ELMo model is also compatible with the models from AllenNLP: just replace the options_file / weight_file in your config with the provided URLs.


In order to run the code, Python 3.6 or higher is required. The code is based on Keras 2.2.0, and I recommend TensorFlow 1.8.0 as the backend. I cannot guarantee that the code works with other versions of Keras / TensorFlow or with other Keras backends.

Installation using conda / virtualenv

To get the ELMo representations, AllenNLP is required. The AllenNLP installation instructions describe a nice way to set up a virtual environment with the correct Python version.

Conda can be used to set up a virtual environment with the required version of Python (3.6).

  1. Download and install Conda.

  2. Create a Conda environment with Python 3.6

    conda create -n elmobilstm python=3.6

  3. Activate the Conda environment. You will need to activate the Conda environment in each terminal in which you want to use this code.

    source activate elmobilstm

Installing the dependencies with pip

You can use pip to install the dependencies.

pip install allennlp==0.5.1 tensorflow==1.8.0 Keras==2.2.0

In docker/requirements.txt you find a full list of all used packages. You can install them via:

pip install -r docker/requirements.txt

Installation using docker

The docker-folder contains an example of how to create a Docker image that contains all required dependencies. It can be used to run your code within that container. See the docker-folder for more details.

Test the installation

If the installation was successful, you can test the code by running:


This trains the ELMo-BiLSTM-CRF architecture on the CoNLL 2000 chunking dataset.


See for an example how to train and evaluate this implementation. The code assumes a CoNLL formatted dataset like the CoNLL 2000 dataset for chunking.

For training, you specify the datasets you want to train on:

datasets = {
    'conll2000_chunking':                                   #Name of the dataset
        {'columns': {0:'tokens', 1:'POS', 2:'chunk_BIO'},   #CoNLL format for the input data. Column 0 contains tokens, column 1 contains POS and column 2 contains chunk information using BIO encoding
         'label': 'chunk_BIO',                              #Which column we want to predict
         'evaluate': True,                                  #Should we evaluate on this task? Always set to True for single-task setups
         'commentSymbol': None}                             #Lines in the input data starting with this string will be skipped. Can be used to skip comments
}

For more details, see the emnlp2017-bilstm-cnn-crf implementation.
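To make the column mapping concrete, here is a minimal sketch of how a CoNLL-formatted file could be parsed with such a `columns` dictionary. The function `read_conll` and the sample data are hypothetical and written only for illustration; the repository ships its own reader, which may behave differently.

```python
def read_conll(lines, columns, comment_symbol=None):
    """Parse CoNLL-style lines into sentences.

    Each sentence is a dict mapping a column name to the list of values
    for that column. Sentences are separated by blank lines.
    (Hypothetical helper, not the repository's reader.)
    """
    sentences = []
    current = {name: [] for name in columns.values()}
    for line in lines:
        line = line.strip()
        # Skip comment lines if a comment symbol is configured
        if comment_symbol is not None and line.startswith(comment_symbol):
            continue
        if not line:
            # A blank line closes the current sentence
            if any(current.values()):
                sentences.append(current)
                current = {name: [] for name in columns.values()}
            continue
        fields = line.split()
        for idx, name in columns.items():
            current[name].append(fields[idx])
    # Flush the last sentence if the input does not end with a blank line
    if any(current.values()):
        sentences.append(current)
    return sentences


# Made-up sample in the three-column layout described above
sample = """Confidence NN B-NP
in IN B-PP
the DT B-NP
pound NN I-NP

He PRP B-NP
reckons VBZ B-VP
"""

columns = {0: 'tokens', 1: 'POS', 2: 'chunk_BIO'}
sentences = read_conll(sample.splitlines(), columns)
print(len(sentences))
print(sentences[0]['tokens'])
print(sentences[0]['chunk_BIO'])
```

The `label` entry of the dataset definition then simply names which of these parsed columns serves as the prediction target.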

ELMo Word Representations Computation

The ELMoWordEmbeddings-class provides methods for the efficient computation of ELMo representations. It has the following parameters:

Caching of ELMo representations

Computing ELMo representations is expensive. A CNN is used to map the characters of a token to dense vectors. These dense vectors are then fed through two BiLSTMs. The representation of each token and the two outputs of the BiLSTMs are used to form the final context-dependent word embedding.
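As an illustration of the last step, the following sketch combines three per-token layer outputs into a single vector per token using a weighted average. This is only one possible combination and an assumption made for illustration; `combine_elmo_layers` is a hypothetical helper, the random array stands in for real ELMo output, and how this repository actually combines the layers is not shown here.

```python
import numpy as np


def combine_elmo_layers(layers, weights=None):
    """Combine layer outputs into one embedding per token.

    layers:  array of shape (num_layers, num_tokens, dim)
    weights: one weight per layer; defaults to a uniform average
    returns: array of shape (num_tokens, dim)
    (Hypothetical helper for illustration only.)
    """
    layers = np.asarray(layers, dtype=np.float32)
    num_layers = layers.shape[0]
    if weights is None:
        weights = np.full(num_layers, 1.0 / num_layers, dtype=np.float32)
    weights = np.asarray(weights, dtype=np.float32)
    if weights.shape != (num_layers,):
        raise ValueError("Expected one weight per layer")
    # Weighted sum over the layer axis
    return np.tensordot(weights, layers, axes=1)


# Stand-in for ELMo output: 3 layers, 5 tokens, 1024 dimensions
rng = np.random.RandomState(0)
layers = rng.rand(3, 5, 1024).astype(np.float32)

averaged = combine_elmo_layers(layers)        # all three layers
first_two = combine_elmo_layers(layers[:2])   # only the first two layers
print(averaged.shape, first_two.shape)
```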

In order to speed up the training, the context-dependent word embeddings can be cached. Those embeddings then only have to be computed in the first epoch. In subsequent epochs, the embeddings are read from the cache.

To enable the caching, you must set embLookup.cache_computed_elmo_embeddings to True:

embLookup = ELMoWordEmbeddings(embeddings_file, elmo_options_file, elmo_weight_file, elmo_mode, elmo_cuda_device)
embLookup.cache_computed_elmo_embeddings = True

This method requires about 12 KB of memory per token. For large datasets, you will need a few gigabytes of RAM.
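The 12 KB figure can be reproduced with a short back-of-the-envelope calculation, assuming three layers of 1024 dimensions stored as 32-bit floats (the 4 bytes per float is an assumption that matches the stated number). The token count below is a made-up example, not a measurement of any dataset.

```python
def elmo_cache_bytes(num_tokens, num_layers=3, dim=1024, bytes_per_float=4):
    """Rough size of cached ELMo embeddings, ignoring container overhead."""
    return num_tokens * num_layers * dim * bytes_per_float


per_token_kb = elmo_cache_bytes(1) / 1024
print(per_token_kb)  # 12.0

# Hypothetical dataset with 250,000 tokens
example_tokens = 250000
print(elmo_cache_bytes(example_tokens) / 1024 ** 3, "GiB")
```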

Pre-compute ELMo embeddings once

The ELMoWordEmbeddings class implements a caching mechanism for a quick lookup of sentences => context dependent word representations for all tokens in the sentence.

You can run to iterate through all your sentences in your dataset and create the ELMo embeddings for those. It stores these embeddings in the file embeddings/elmo_cache_[DatasetName].pkl.

Once you create such a cache, you can load those in your experiments:

embLookup = ELMoWordEmbeddings(embeddings_file, elmo_options_file, elmo_weight_file, elmo_mode, elmo_cuda_device)

If a sentence is in the cache, the cached representations for all tokens in that sentence are used. This way, the ELMo embeddings for a dataset only have to be computed once.

Note: The cache file can become rather large, as 3*1024 float numbers per token must be stored. The cache file requires about 3.7 GB for the CoNLL 2000 chunking dataset with about 13,000 sentences.
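The idea behind such a sentence-level cache can be sketched as follows. `SentenceEmbeddingCache` and `fake_embedder` are hypothetical names used only for illustration and are not part of this repository; the real ELMoWordEmbeddings class may store and load its cache differently.

```python
import os
import pickle
import tempfile

import numpy as np


class SentenceEmbeddingCache:
    """Maps a sentence (tuple of tokens) to its per-token embeddings.

    Hypothetical sketch, not the repository's implementation.
    """

    def __init__(self, compute_fn):
        self.compute_fn = compute_fn  # expensive function: tokens -> array
        self.cache = {}
        self.misses = 0

    def get(self, tokens):
        key = tuple(tokens)
        if key not in self.cache:
            # Only compute embeddings for sentences not seen before
            self.misses += 1
            self.cache[key] = self.compute_fn(tokens)
        return self.cache[key]

    def save(self, path):
        with open(path, 'wb') as f:
            pickle.dump(self.cache, f, protocol=pickle.HIGHEST_PROTOCOL)

    def load(self, path):
        with open(path, 'rb') as f:
            self.cache = pickle.load(f)


def fake_embedder(tokens):
    # Stand-in for the expensive ELMo computation (tiny dimension for the demo)
    return np.zeros((len(tokens), 4), dtype=np.float32)


cache = SentenceEmbeddingCache(fake_embedder)
sentence = ['He', 'reckons', 'the', 'deficit']
cache.get(sentence)  # computed
cache.get(sentence)  # served from the cache
print(cache.misses)  # 1

# Persist the cache and reuse it in a later run
path = os.path.join(tempfile.mkdtemp(), 'elmo_cache_demo.pkl')
cache.save(path)
restored = SentenceEmbeddingCache(fake_embedder)
restored.load(path)
restored.get(sentence)
print(restored.misses)  # 0
```

Keying the cache on the whole sentence rather than on single tokens matters because ELMo embeddings are context dependent: the same token receives different vectors in different sentences.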

Issues, Feedback, Future Development

This repository is under active development as I'm currently running several experiments that involve ELMo embeddings.

If you have questions, feedback or find bugs, please send an email to me: