Reconstructing faces from voices

Implementation of the paper "Reconstructing faces from voices"

Yandong Wen, Rita Singh, and Bhiksha Raj

Machine Learning for Signal Processing Group

Carnegie Mellon University


This implementation is based on Python 3.7 and PyTorch 1.1.

We recommend you use conda to install the dependencies. All the requirements are found in requirements.txt. Run the following command to create a new conda environment using all the dependencies.

$ ./

After you run the above script, you need to activate the environment where all the packages have been installed. The environment is called voice2face and can be activated with:

$ source activate voice2face

NOTE: If you get an error complaining about "webrtcvad" not being found, then you need to make sure the pip in your PATH is the one found inside your environment. This could happen if you have multiple installations of pip (inside/outside environment).
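The pip mismatch described in the note can be checked from Python. The following is a minimal sketch, not part of this repository: the helper name pip_belongs_to_env is hypothetical, and it assumes the active environment's root directory is what sys.prefix reports.

```python
import os
import shutil
import sys


def pip_belongs_to_env(pip_path, env_prefix):
    """Return True if pip_path lies inside the env_prefix directory tree.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    if pip_path is None:
        return False  # no pip was found on PATH at all
    pip_abs = os.path.abspath(pip_path)
    env_abs = os.path.abspath(env_prefix)
    try:
        return os.path.commonpath([pip_abs, env_abs]) == env_abs
    except ValueError:
        # Raised on Windows when the two paths are on different drives.
        return False


if __name__ == "__main__":
    pip_path = shutil.which("pip")  # the pip that a bare `pip` command would run
    print("active prefix: ", sys.prefix)
    print("pip on PATH:   ", pip_path)
    print("pip inside env:", pip_belongs_to_env(pip_path, sys.prefix))
```

If the last line prints False, running pip through the environment's interpreter (python -m pip install ...) avoids depending on which pip comes first in PATH.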

Processed data

The processed training data used for this paper are listed below. Please feel free to download them.

Voice data (log mel-spectrograms): google drive

Face data (aligned face images): google drive

Once downloaded, update variables voice_dir and face_dir with the corresponding paths.


See on how to change configurations.


We provide pretrained models, including a voice embedding network and a trained generator, in pretrained_models/. Alternatively, you can train your own generator by running the training script:

$ python

The trained model is saved to models/generator.pth.


We provide some examples of generated faces (in data/example_data/) produced with the model in pretrained_models/. To generate faces for your own voice recordings using the trained model, set the test_data variable (the folder containing the voice recordings) and the model_path variable (the path of the generator), then run:

$ python

Results are written to the test_data folder. For each voice recording named <filename>.wav, a face image named <filename>.png is generated.
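The naming rule above can be expressed in a few lines. This is only an illustrative sketch of the <filename>.wav to <filename>.png mapping; the function names face_path_for and planned_outputs are hypothetical and do not come from the test script.

```python
from pathlib import Path


def face_path_for(voice_path):
    """Return the image path expected for a recording: same folder, .png suffix."""
    return Path(voice_path).with_suffix(".png")


def planned_outputs(test_data):
    """Map every .wav file in the test_data folder to its expected .png path."""
    folder = Path(test_data)
    return {wav: face_path_for(wav) for wav in sorted(folder.glob("*.wav"))}
```

For example, face_path_for("recordings/alice.wav") yields a path whose file name is alice.png in the same folder.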

Note: Currently, only single-channel voice recordings with a 16 kHz sample rate are supported. Voice and face files whose names start with A-E belong to the validation or test set, while those whose names start with F-Z belong to the training set.
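Both constraints in the note can be checked before running the test script. The sketch below uses only the standard-library wave module; the function names are hypothetical, and treating the first letter case-insensitively is an assumption, since the note only mentions uppercase A-E and F-Z.

```python
import wave


def is_supported_recording(wav_path):
    """Return True if the WAV file has one channel and a 16 kHz sample rate.

    The wave module reads uncompressed PCM WAV files only and raises
    wave.Error for other encodings.
    """
    with wave.open(str(wav_path), "rb") as wav_file:
        return wav_file.getnchannels() == 1 and wav_file.getframerate() == 16000


def split_of(file_name):
    """Return the data split implied by the first letter of a file name.

    Assumption: the letter is compared case-insensitively.
    Names that do not start with a letter in A-Z return None.
    """
    first = file_name[:1].upper()
    if "A" <= first <= "E":
        return "val/test"
    if "F" <= first <= "Z":
        return "train"
    return None
```

A recording that fails is_supported_recording would need to be converted to mono at 16 kHz with an audio tool of your choice before it is used.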


  title={Reconstructing faces from voices},
  author={Yandong Wen and Rita Singh and Bhiksha Raj},
  journal={arXiv preprint arXiv:1905.10604},


We welcome contributions from everyone and are always working to make this project better. Please open a pull request or raise an issue, and we will be happy to help.


This repository is licensed under GNU GPL-3.0. Please refer to