Improving Language Understanding for Low-Resource Languages and Tasks with Generative Pre-Training

Table of Contents

- [Author](#author)
- [Overview](#overview)
- [Research Overview](#research-overview)
- [Experiments](#experiments)
- [Generative Language Modeling Pre-Training](#generative-language-modeling-pre-training)
- [Language Model Training with Tensor2Tensor](#language-model-training-with-tensor2tensor)
- [Lessons Learned](#lessons-learned)
- [Neural Machine Translation Baselines](#neural-machine-translation-baselines)
- [Unsupervised Machine Translation](#unsupervised-machine-translation)
- [Language Model Datasets](#language-model-datasets)
- [Multi-task Benchmarks for English](#multi-task-benchmarks-for-english)
- [Acknowledgement](#acknowledgement)

This project is still under construction and will continue to see breaking changes.


Author

Ali Zaidi


Overview

In this project we examine how generative pre-training with language modeling objectives can improve language understanding across a variety of languages. We are particularly interested in transfer learning to low-resource languages, where labeled data is scarce.

Pre-training and generative modeling for improved language understanding in NLP remains a challenging but interesting area of research and experimentation. Currently, most SOTA NLP results are obtained by training end-to-end architectures for each language task. We examine how transformer models that rely solely on attention modules, as well as convolution-based modules such as Q-RNN and those described in QANet, can provide rich representations that are learned through generative language modeling and then fine-tuned for text classification and for general multi-task problems. Of particular interest are multi-label and hierarchical/structured-output classification problems, where graph convolutional and value networks are more effective than networks trained with binary/categorical cross-entropy.

Research Overview

In the past few months, a number of techniques have been introduced that allow faster training and greater scope for transfer learning: pre-training, generative modeling, multi-task architectures, data augmentation using back-translation, and efficiency improvements in language modeling. In particular, the six papers below tackle the problem of generative pre-training and multi-task learning in NLP and achieve SOTA results.

  1. OpenAI: Improving Language Understanding by Generative Pre-Training

    • code
    • tldr: Train an unsupervised language model using a transformer architecture, and then fine-tune on task-specific datasets.
  2. fastAI: Universal Language Model Fine-tuning for Text Classification

    • code
    • tldr: Pre-train a language model on a generic English corpus (e.g., Wikipedia). Use it to initialize a new language model on your unlabeled domain-specific corpus. Then fine-tune / initialize a new domain-specific architecture for text classification.

  3. Google Brain: QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension

    • code
    • tldr: Train a transformer-based Q&A model consisting solely of convolutions and self-attention. Convolutions model local interactions, and self-attention models global interactions. Use back-translation from Google NMT for data augmentation. Ranked #1 on SQuAD at the time of writing.
  4. AllenAI: Deep Contextualized Word Representations

  5. Salesforce Research: The Natural Language Decathlon

    • code
    • tldr: Challenge consisting of ten NLP tasks: question answering, machine translation, summarization, natural language inference, sentiment analysis, semantic role labeling, relation extraction, goal-oriented dialogue, database query generation, and pronoun resolution. Proposes MQAN (multi-task question answering network), which uses a bidirectional LSTM to encode both the question and the context document, applies dual coattention, and then compresses the result using another two BiLSTMs + self-attention + two more BiLSTMs to obtain the final representations.

  6. Trieu H. Trinh & Quoc Le: A Simple Method for Commonsense Reasoning

    • tldr: Solve difficult multiple-choice questions from the Pronoun Disambiguation Challenge and the Winograd Schema Challenge by pre-training many language models (diversity helps!), substituting the question pronoun with each answer choice, and picking the choice with the highest likelihood (lowest perplexity) under the ensemble of language models.
      • language modeling can naturally capture common sense knowledge.
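
The substitute-and-score idea from the last paper can be sketched in a few lines. This is only an illustration under loud assumptions: `BigramLM` is a toy add-one-smoothed bigram model standing in for a pre-trained neural language model, the `[PRONOUN]` placeholder, the two tiny corpora, and the `resolve_pronoun` helper are all hypothetical, and the ensemble simply sums log-likelihoods, which may differ from the scoring used in the paper.

```python
import math
from collections import Counter


class BigramLM:
    """Toy add-one-smoothed bigram LM (hypothetical stand-in for a neural LM)."""

    def __init__(self, corpus):
        self.context_counts = Counter()  # how often a token appears as left context
        self.bigram_counts = Counter()
        self.vocab = set()
        for sentence in corpus:
            tokens = ["<s>"] + sentence.lower().split()
            self.vocab.update(tokens)
            for prev, cur in zip(tokens, tokens[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, cur)] += 1

    def log_prob(self, sentence):
        """Natural-log likelihood of a whitespace-tokenized sentence."""
        tokens = ["<s>"] + sentence.lower().split()
        v = len(self.vocab) + 1  # one extra slot for unseen tokens
        total = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            numerator = self.bigram_counts[(prev, cur)] + 1
            denominator = self.context_counts[prev] + v
            total += math.log(numerator / denominator)
        return total


def resolve_pronoun(template, candidates, models):
    """Substitute each candidate for the pronoun and keep the most likely sentence."""
    scores = {}
    for candidate in candidates:
        sentence = template.replace("[PRONOUN]", candidate)
        # Ensemble by summing log-likelihoods across models (an assumption).
        scores[candidate] = sum(model.log_prob(sentence) for model in models)
    best = max(scores, key=scores.get)
    return best, scores


# Two tiny "pre-training" corpora, invented for this example.
models = [
    BigramLM(["the trophy is too big", "the suitcase is too small", "the trophy is big"]),
    BigramLM(["the trophy is too big for the shelf", "the trophy is heavy", "the suitcase is small"]),
]
template = "the trophy does not fit in the suitcase because [PRONOUN] is too big"
best, scores = resolve_pronoun(template, ["the trophy", "the suitcase"], models)
print(best, scores)
```

Both candidates here have the same number of tokens, so their total log-likelihoods are directly comparable; with candidates of different lengths, some form of length normalization would likely be needed.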


Experiments

We describe our experiments with generative modeling and transfer learning for improved language understanding, summarize our results, examine the issues we faced, and then discuss future directions.

Generative Language Modeling Pre-Training

Our experiments start with training language models on a variety of datasets. Our overall approach is similar across languages, so we discuss our English implementation first.

Language Model Training with Tensor2Tensor

We used the WikiText-103 long-term dependency dataset curated by Stephen Merity, which contains roughly 103M tokens in the training set. We used the Tensor2Tensor library to train this model; the details are summarized in wikitext103-lm.

Lessons Learned

Training on TPUs can provide significant benefits in terms of training speed. The Transformer model has no significant recurrent operations, and the tensor2tensor library includes an optimized implementation that can utilize TPUs. Other types of language models, such as bidirectional LSTMs with attention, rely on ops that are not yet available on TPUs.

TPUs do not yet support cloud_ml based hyperparameter search, so you'll have to fall back to GPUs for hyperparameter search. Training a single model across multiple TPUs is also not supported.

It took 12 hours to train to 20K steps, reaching a perplexity of 53.2, very close to the SOTA reported perplexity for this dataset.
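
For readers less familiar with the metric, the sketch below shows how perplexity relates to per-token log-likelihood, and works out two numbers implied by the figures above. It assumes perplexity is computed per token with natural logarithms; the source does not say whether the reported value is per word or per subword. The `perplexity` helper and the variable names are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative natural-log likelihood per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns every token probability 1/53.2 has perplexity 53.2,
# i.e. it is as uncertain as a uniform choice among about 53 tokens.
uniform_log_probs = [math.log(1 / 53.2)] * 10
ppl = perplexity(uniform_log_probs)

# Per-token cross-entropy (in nats) implied by a perplexity of 53.2.
loss_nats = math.log(53.2)

# Average wall-clock time per step implied by 12 hours for 20K steps.
seconds_per_step = 12 * 3600 / 20_000

print(f"perplexity={ppl:.1f} loss={loss_nats:.3f} nats step_time={seconds_per_step:.2f}s")
```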

TODO: Try out the universal transformer.

Neural Machine Translation Baselines

Unsupervised Machine Translation

Here we replicate the paper Unsupervised Machine Translation Using Monolingual Corpora Only using OpenNMT-tf. This implementation did not work on TPUs, so we instead used four V100 GPUs for training.

Language Model Datasets

Multi-task Benchmarks for English

The task-specific dataset we will examine is a corpus of scientific articles from PubMed, collected and distributed by the NLM and the BioASQ challenge.


Acknowledgement

This work was supported by Deep Learning Camp Jeju 2018, which was organized by the TensorFlow Korea User Group. I would also like to thank my wonderful mentor, Minjoon Seo, for his advice and inspiration. Lastly, many thanks to all the awesome participants for making this a super fun experience!