YouTube Video Face Swap using "DeepFakes" Autoencoder-Model


The aim of this project is to perform a face swap on a YouTube video almost automatically.
The only step that requires human intervention is the quality check in step 1.5.

How does it work?

Siraj Raval explains that pretty well in his video:
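In short, the core trick can be sketched in a few lines of NumPy. This is a toy illustration with random linear maps standing in for the real convolutional networks, not the project's actual model: one shared encoder learns pose and expression from both persons, each person gets their own decoder, and swapping means decoding person A's latent code with person B's decoder.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, LATENT = 64 * 64, 128  # flattened 64x64 grayscale face, latent size

# Random linear maps stand in for the trained networks (illustration only).
W_enc = rng.normal(scale=0.01, size=(DIM, LATENT))    # shared encoder
W_dec_b = rng.normal(scale=0.01, size=(LATENT, DIM))  # decoder for person B

face_a = rng.random(DIM)        # a (fake) face image of person A
latent = face_a @ W_enc         # shared representation: pose, expression
swapped = latent @ W_dec_b      # rendered with person B's appearance

print(latent.shape, swapped.shape)  # (128,) (4096,)
```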


My Setup

I am using a desktop PC with a single GTX 1060 running Ubuntu Server 16.04.
Training the model for 100,000 epochs takes about 30 hours.

Install packages from apt

sudo apt-get install ffmpeg x264 libx264-dev

Install xvfb for virtual screen

sudo apt-get install xvfb  

Install chromedriver for image scraping

sudo sh ./

Install required libraries

pip install -r requirements.txt


Step 1: Fetch Training Data

Scrape face images of two persons from google images.

python3 --name="angela merkel" --limit=500
python3 --name="taylor swift" --limit=500

Or scrape face images from youtube videos (e.g. interviews).

python3 --url="" --start=30 --stop=200 --name="siraj raval" --limit=500
python3 --url="" --start=60 --stop=179 --name="kal penn" --limit=500
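The "--start", "--stop", and "--limit" flags can be read as: sample evenly spaced frames from the clipped time range, stopping at the limit. A hypothetical helper to make that concrete (the function name and exact sampling strategy are assumptions, not the project's code):

```python
def frame_indices(fps, start_s, stop_s, limit):
    """Evenly spaced frame indices between start_s and stop_s (in seconds)."""
    first, last = int(start_s * fps), int(stop_s * fps)
    step = max(1, (last - first) // limit)  # skip frames to respect the limit
    return list(range(first, last, step))[:limit]

# --start=30 --stop=200 --limit=500 on a 30 fps video:
idx = frame_indices(30, 30, 200, 500)
print(len(idx), idx[0], idx[-1])  # 500 900 5890
```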

Step 1.5: The Human Eye

Have a look at the extracted face images in "data/faces/"! There will be some mis-extractions; just delete the images that don't fit.

Step 2: Train Model

Train the faceswap model with the collected face images.
In this example Merkel's face will be swapped onto Taylor Swift's.

python3 --src="angela merkel" --dst="taylor swift" --epochs=100000
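Conceptually, each training step reconstructs person A through (shared encoder, decoder A) and person B through (shared encoder, decoder B), so the encoder is forced to learn features common to both faces. A minimal runnable sketch with linear layers and plain gradient descent (the real model uses convolutional networks; all names and sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
dim, latent, lr = 32, 8, 0.01
enc = rng.normal(scale=0.1, size=(dim, latent))    # shared encoder
dec_a = rng.normal(scale=0.1, size=(latent, dim))  # decoder for person A
dec_b = rng.normal(scale=0.1, size=(latent, dim))  # decoder for person B
faces_a = rng.random((16, dim))  # toy "face" batches
faces_b = rng.random((16, dim))

def recon_step(x, dec):
    """MSE reconstruction loss and gradients for one batch."""
    z = x @ enc
    err = z @ dec - x
    loss = (err ** 2).mean()
    g_dec = z.T @ err / len(x)
    g_enc = x.T @ (err @ dec.T) / len(x)
    return loss, g_enc, g_dec

losses = []
for _ in range(200):
    la, ga_enc, ga_dec = recon_step(faces_a, dec_a)
    lb, gb_enc, gb_dec = recon_step(faces_b, dec_b)
    enc -= lr * (ga_enc + gb_enc)  # the shared encoder gets BOTH gradients
    dec_a -= lr * ga_dec
    dec_b -= lr * gb_dec
    losses.append(la + lb)

print(losses[-1] < losses[0])  # True: both reconstructions improve
```

The key design choice is that only the encoder's update sums the two gradients; each decoder only ever sees its own person's faces.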

Pre-trained Model

You can download a pre-trained model for Angela Swift here
Just place the "models" folder next to the code directory.

Step 3: Apply Face Swap on YouTube Video

Perform the face swap on a YouTube video.
The "--start" and "--stop" parameters define, in seconds, where to clip the video.
Set "--gif" to "True" if you want to export the generated video as a GIF file.

python3 --url="" --start=0 --stop=60 --gif=False
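Under the hood, a stage like this typically clips the video, splits it into frames, swaps the face in each frame, and re-encodes with x264 (which is why ffmpeg and libx264 are installed above). A hypothetical sketch of the ffmpeg invocations; the command builders and paths are assumptions, not the project's code:

```python
def clip_cmd(src, dst, start, stop):
    """Clip src between start and stop (seconds)."""
    return ["ffmpeg", "-y", "-i", src, "-ss", str(start),
            "-t", str(stop - start), dst]

def frames_cmd(src, pattern="frames/%05d.png"):
    """Split the clip into numbered frames for per-frame face swapping."""
    return ["ffmpeg", "-i", src, pattern]

def rebuild_cmd(pattern, dst, fps=25):
    """Re-encode the swapped frames back into an x264 video."""
    return ["ffmpeg", "-framerate", str(fps), "-i", pattern,
            "-c:v", "libx264", "-pix_fmt", "yuv420p", dst]

print(" ".join(clip_cmd("video.mp4", "clip.mp4", 0, 60)))
# ffmpeg -y -i video.mp4 -ss 0 -t 60 clip.mp4
```

Each list can be passed to subprocess.run(); building commands as lists avoids shell-quoting issues with video titles.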


Donald Trump as Nicolas Cage:
Example GIF
Angela Merkel as Taylor Swift:
Example2 GIF
Video with better quality
Example3 GIF
Video with better quality

The first two examples were trained with images scraped from Google, which is why the swapped faces look a bit frozen.
The last one was trained using only two videos of interviews.
You can see that it can transfer facial expressions much better than the ones trained with static images.

What's coming next?

Since I am more into audio processing, I would like to transfer the concept of face swapping to music signals.
If you have any suggestions, please let me know.


Special thanks go to Siraj Raval, who inspired this project!