tf-adaptive-softmax-lstm-lm

This repository shows the experimental results of LSTM language models on PTB (Penn Treebank) and GBW (Google One Billion Word benchmark) using AdaptiveSoftmax on TensorFlow.

Adaptive Softmax

The adaptive softmax is a faster way to train a softmax classifier over a huge number of classes, and can be used for both training and prediction. For example, it can be used to train a language model with a very large vocabulary, and the trained language model can then be used very efficiently in speech recognition, text generation, and machine translation.

The adaptive softmax has been used in the ASR system developed by Tencent AI Lab, where it achieved about a 20x speedup over the full softmax in second-pass rescoring.

See Efficient softmax approximation for GPUs [1] for details about the adaptive softmax algorithm.
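The sketch below illustrates the two-level idea with a PTB-style cutoff [2000, vocab_size]: frequent words live in a small "head" softmax together with one extra logit for the tail cluster, and rare words are scored as the product of their cluster probability and a within-cluster probability computed through a reduced projection. This is a minimal NumPy illustration of the idea only; the variable names, shapes, and projection size are assumptions and do not reflect the API of the TensorFlow implementation linked below.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

vocab_size = 10001
cutoff = [2000, vocab_size]        # head covers word ids [0, 2000); one tail cluster for the rest
hidden_dim, proj_dim = 512, 128    # proj_dim is an assumed, smaller projection for the tail

rng = np.random.default_rng(0)
h = rng.normal(size=(1, hidden_dim))                 # stand-in for the LSTM output at one position

# Head: logits for the 2000 most frequent words plus 1 logit for the tail cluster.
W_head = rng.normal(size=(hidden_dim, cutoff[0] + 1))
head_prob = softmax(h @ W_head)                      # shape (1, 2001)

# Tail: project the hidden state down, then softmax over the 8001 remaining rare words.
W_proj = rng.normal(size=(hidden_dim, proj_dim))
W_tail = rng.normal(size=(proj_dim, vocab_size - cutoff[0]))
tail_prob = softmax((h @ W_proj) @ W_tail)           # shape (1, 8001)

def word_prob(word_id):
    """P(word): head probability for frequent words, cluster prob * within-cluster prob for rare words."""
    if word_id < cutoff[0]:
        return head_prob[0, word_id]
    return head_prob[0, cutoff[0]] * tail_prob[0, word_id - cutoff[0]]

# The probabilities over the full vocabulary still sum to 1.
total = head_prob[0, :cutoff[0]].sum() + head_prob[0, cutoff[0]] * tail_prob[0].sum()
print(word_prob(3), word_prob(9000), total)          # total is ~1.0
```

During training, each token only needs the head plus the tail cluster that contains its target word, which is where the savings over a full softmax come from.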

Implementation

The implementation of AdaptiveSoftmax on TensorFlow can be found here: TencentAILab/tensorflow

Usage

Train with AdaptiveSoftmax:

python train_lm.py --data_path=ptb_data --gpuid=0 --use_adaptive_softmax=1

Train with full softmax:

python train_lm.py --data_path=ptb_data --gpuid=0 --use_adaptive_softmax=0

Experiment results

Language Modeling on PTB

With the hyperparameters below, it takes 5min54s to train 20 epochs on the PTB corpus, and the final perplexity on the test set is 88.51. With the same parameters and the full softmax, it takes 6min57s to train 20 epochs, and the final perplexity on the test set is 89.00.

Since the PTB vocabulary contains only about 10K words, the speedup is not that significant.
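A rough way to see why: with a 512-dimensional hidden state and a 10K vocabulary, the output layer is only part of each training step, so shrinking it buys a limited amount. The back-of-the-envelope arithmetic below is a simplified sketch (per-token multiply-adds of the output matrices only, ignoring the LSTM and the data-dependent tail cost), not a measurement.

```python
# Rough per-token multiply-add counts for the output layer in the PTB setting below.
hidden, vocab, head = 512, 10001, 2000

full_softmax  = hidden * vocab        # logits for every word in the vocabulary
adaptive_head = hidden * (head + 1)   # 2000 frequent words + 1 tail-cluster logit;
                                      # the tail matrices are touched only when the
                                      # target is one of the 8001 rare words

print(full_softmax / adaptive_head)   # ~5x fewer multiply-adds in the head alone, yet
                                      # end-to-end time only drops from 6min57s to 5min54s
                                      # because the LSTM and embeddings still cost the same
```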

hyperparameters:

epoch_num = 20
train_batch_size = 128
train_step_size = 20
valid_batch_size = 128
valid_step_size = 20
test_batch_size = 20
test_step_size = 1
word_embedding_dim = 512
lstm_layers = 1
lstm_size = 512
lstm_forget_bias = 0.0
max_grad_norm = 0.25
init_scale = 0.05
learning_rate = 0.2
decay = 0.5
decay_when = 1.0
dropout_prob = 0.5
adagrad_eps = 1e-5
vocab_size = 10001
softmax_type = "AdaptiveSoftmax"
adaptive_softmax_cutoff = [2000, vocab_size]

result:

| Epoch | Elapse  | Train PPL | Valid PPL | Test PPL |
|-------|---------|-----------|-----------|----------|
| 1     | 0min18s | 376.407   | 169.152   | 164.039  |
| 2     | 0min35s | 154.324   | 132.648   | 127.494  |
| 3     | 0min53s | 117.210   | 118.547   | 113.197  |
| 4     | 1min11s | 98.662    | 111.791   | 106.373  |
| 5     | 1min28s | 87.366    | 107.808   | 102.588  |
| 6     | 1min46s | 79.448    | 105.028   | 100.024  |
| 7     | 2min04s | 73.749    | 103.705   | 98.220   |
| 8     | 2min21s | 69.392    | 102.939   | 96.931   |
| 9     | 2min39s | 62.737    | 100.174   | 94.043   |
| 10    | 2min57s | 59.423    | 99.412    | 93.153   |
| 11    | 3min15s | 56.634    | 97.600    | 91.271   |
| 12    | 3min32s | 55.036    | 97.388    | 91.061   |
| 13    | 3min50s | 54.002    | 96.127    | 89.796   |
| 14    | 4min08s | 53.232    | 96.170    | 89.805   |
| 15    | 4min25s | 52.844    | 95.461    | 89.130   |
| 16    | 4min43s | 52.488    | 95.085    | 88.788   |
| 17    | 5min01s | 52.314    | 94.905    | 88.615   |
| 18    | 5min18s | 52.172    | 94.835    | 88.553   |
| 19    | 5min36s | 52.038    | 94.806    | 88.526   |
| 20    | 5min54s | 51.998    | 94.788    | 88.510   |

Language Modeling on the Google One Billion Word corpus

hyperparameters:

word_embedding_dim = 256
train_batch_size = 256
train_step_size = 20
valid_batch_size = 256
valid_step_size = 20
test_batch_size = 128
test_step_size = 1
lstm_layers = 1
lstm_size = 2048
lstm_forget_bias = 1.0
max_grad_norm = 0.25
init_scale = 0.05
learning_rate = 0.1
decay = 0.5
decay_when = 1.0
dropout_prob = 0.01
adagrad_eps = 1e-5
vocab_size = 793471
softmax_type = "AdaptiveSoftmax"
adaptive_softmax_cutoff = [4000,40000,200000, vocab_size]
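For reference, the cutoff above implies the following cluster sizes over the 793471-word vocabulary (simple arithmetic, shown only for clarity):

```python
# Cluster sizes implied by adaptive_softmax_cutoff = [4000, 40000, 200000, 793471]
cutoff = [4000, 40000, 200000, 793471]
head_size  = cutoff[0]                                    # 4000 most frequent words in the head
tail_sizes = [hi - lo for lo, hi in zip(cutoff, cutoff[1:])]
print(head_size, tail_sizes)                              # 4000 [36000, 160000, 593471]
```

The head softmax also carries one extra logit per tail cluster, so it covers 4003 entries in total.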

result:

On the GBW corpus, we achieved a test-set perplexity of 43.24 after 5 epochs, which took about two days of training on 2 GPUs with synchronous gradient updates.

| Epoch | Elapse   | Train PPL | Valid PPL | Test PPL |
|-------|----------|-----------|-----------|----------|
| 1     | 9h56min  | 51.428    | 52.727    | 49.553   |
| 2     | 19h53min | 45.141    | 48.683    | 45.639   |
| 3     | 29h51min | 42.605    | 47.379    | 44.332   |
| 4     | 39h48min | 41.119    | 46.822    | 43.743   |
| 5     | 49h45min | 38.757    | 46.402    | 43.241   |
| 6     | 59h42min | 37.664    | 46.334    | 43.119   |
| 7     | 69h40min | 37.139    | 46.337    | 43.101   |
| 8     | 79h37min | 36.884    | 46.342    | 43.097   |

References

[1] Grave E, Joulin A, Cissé M, et al. Efficient softmax approximation for GPUs[J]. arXiv preprint arXiv:1609.04309, 2016.

[2] https://github.com/facebookresearch/adaptive-softmax