This project is still under construction and will continue to see breaking changes.
In this project we examine whether generative pre-training with language modeling objectives, applied across a variety of languages, can improve language understanding. We pay particular attention to transfer learning for low-resource languages, where labeled data is scarce.
Pre-training and generative modeling for improved language understanding in NLP remain a challenging but interesting area of research and experimentation. Currently, most SOTA NLP results are obtained by training end-to-end architectures for each language task. We examine how transformer models relying solely on attention modules, as well as convolution-only modules such as Q-RNN and those described in QANet, can provide rich representations that are learned through generative language modeling and then fine-tuned for text classification as well as general multi-task problems. Of particular interest will be multi-label and hierarchical/structured output label classification, where graph convolutional and value networks are more effective than binary categorical cross-entropy networks.
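To make the multi-label baseline concrete, here is a minimal sketch of a per-label binary cross-entropy loss, the kind of objective the structured-output approaches above are compared against. It is an illustration in plain Python, not this project's training code; the function names `sigmoid` and `multilabel_bce` are hypothetical, and each label is assumed to be scored independently from a single logit.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over independent labels.

    logits:  one raw score per label (hypothetical model output)
    targets: one 0/1 value per label; several labels may be 1 at once
    """
    if len(logits) != len(targets):
        raise ValueError("logits and targets must have the same length")
    if not logits:
        raise ValueError("need at least one label")
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for z, y in zip(logits, targets):
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(logits)


# Three labels, two of which apply to the same document.
targets = [1, 0, 1]
good = multilabel_bce([3.0, -3.0, 3.0], targets)
bad = multilabel_bce([-3.0, 3.0, -3.0], targets)
print(good < bad)  # confident correct scores give the lower loss
```

Because every label is treated independently, this loss cannot express relationships between labels such as a parent/child hierarchy, which is the limitation that motivates structured-output models.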
In the past few months, a number of techniques have been introduced that use pre-training, generative modeling, multi-task architectures, data augmentation with back-translation, and more efficient language modeling. Together they allow faster training and greater scope for transfer learning. In particular, the five papers below tackle the problem of generative pre-training and multi-task learning in NLP and achieve SOTA results.
We describe our experiments with generative modeling and transfer learning for improved language understanding, summarize our results, examine the issues we faced, and then discuss future directions.
Our experiments start with training language models on a variety of datasets. Our overall approach is similar across languages, so we discuss our English implementation first.
We used the Wikipedia long-term dependency dataset curated by Stephen Merity, which contains roughly 103M tokens in the training set. We used the
Tensor2Tensor library to train this model, the details of which are summarized in
Training on TPUs can significantly speed up training. The Transformer model has no significant recurrent operations, so there is an optimized implementation in the
Tensor2Tensor library that can run on TPUs. Other types of language models, such as bidirectional LSTMs with attention, rely on ops that are not yet available on TPUs.
TPUs do not yet support
cloud_ml based hyperparameter search, so you will have to fall back to GPUs for that. Training a single model across multiple TPUs is also not supported.
Training to 20K steps took 12 hours and reached a perplexity of 53.2, very close to the reported SOTA perplexity for this dataset.
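For readers unfamiliar with the metric, perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below shows that relationship in plain Python. The helper names are hypothetical and the log-probabilities are made-up inputs, not values from our runs; it also assumes the loss is measured in nats (natural log).

```python
import math


def perplexity(token_log_probs):
    """Perplexity from the natural-log probability assigned to each target token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def perplexity_from_loss(mean_nll_nats):
    """Perplexity from an already averaged cross-entropy loss in nats."""
    return math.exp(mean_nll_nats)


# A model that is uniformly unsure between 4 tokens at every position
# has a perplexity of about 4.
uniform = [math.log(0.25)] * 10
print(round(perplexity(uniform), 6))

# A perplexity of 53.2 corresponds to a mean loss of ln(53.2) nats per token.
print(round(math.log(53.2), 3))
```

If a training framework reports loss in bits (log base 2) rather than nats, the conversion is `2 ** loss` instead of `exp(loss)`.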
TODO: Try out the universal transformer.
Here we replicate the paper Unsupervised Machine Translation Using Monolingual Corpora Only using
OpenNMT-tf. This implementation did not work on TPUs, so we instead used four V100 GPUs for training.
The task-specific dataset we will examine is a corpus of scientific articles from PubMed, collected and distributed by the NLM and the BioASQ challenge.
This work was supported by Deep Learning Camp Jeju 2018, which was organized by the TensorFlow Korea User Group. I would also like to thank my wonderful mentor, Minjoon Seo, for his advice and inspiration. Lastly, lots of thanks to all the awesome participants for making this a super fun experience!