"LipNet: End-to-End Sentence-level Lipreading" in PyTorch

An unofficial PyTorch implementation of the model described in "LipNet: End-to-End Sentence-level Lipreading" by Yannis M. Assael, Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. Based on the official Torch implementation.

LipNet video

Usage

First, create symbolic links in the data folder pointing to where you store the images and alignments:

mkdir data
ln -s PATH_TO_ALIGNS data/align
ln -s PATH_TO_IMAGES data/images

Then run the program:

python3 train_lipnet.py

This trains on the "unseen speakers" split. To train on the "overlapped speakers" split:

python3 train_lipnet.py --test_overlapped
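For reference, a switch like --test_overlapped is typically a boolean argparse flag. This is an illustrative sketch, not the script's actual argument parser, which may define more options:

```python
import argparse

def parse_args(argv=None):
    """Hypothetical sketch of parsing the --test_overlapped flag."""
    parser = argparse.ArgumentParser(description="Train LipNet on GRID")
    parser.add_argument(
        "--test_overlapped",
        action="store_true",
        help="use the overlapped-speakers split instead of unseen speakers",
    )
    return parser.parse_args(argv)

if __name__ == "__main__":
    args = parse_args()
    split = "overlapped speakers" if args.test_overlapped else "unseen speakers"
    print(f"training on the {split} split")
```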

The overlapped-speakers file list we use (list_overlapped.json) is exported directly from the authors' Torch implementation release.

To monitor training progress:

tensorboard --logdir logs

The images folder should be organised as:

├── s1
│   ├── bbaf2n
│   │   ├── mouth_000.png
│   │   ├── mouth_001.png
...

And the align folder:

├── s1
│   ├── bbaf2n.align
│   ├── bbaf3s.align
│   ├── bbaf4p.align
...

That's it! To choose which GPU to use, edit the line in the program where the CUDA_VISIBLE_DEVICES environment variable is set. Feel free to experiment with the parameters.
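CUDA_VISIBLE_DEVICES is a standard CUDA environment variable, so it can also be set at the top of your own script, before any CUDA context is created. A sketch (the exact line to edit in this repo may differ):

```python
import os

# Restrict the process to GPU 0. This must run before the first CUDA call,
# which in practice means before importing torch in most training scripts.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# import torch  # torch.cuda.device_count() would now see only the listed GPU
```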

Dependencies

Results

TODO

Pending