Realistic in silico generation and augmentation of single cell RNA-seq data using Generative Adversarial Neural Networks

This project contains the Tensorflow implementation and documentation for the training of the (c)scGAN models described in our manuscript 'Realistic in silico generation and augmentation of single cell RNA-seq data using Generative Adversarial Neural Networks'. This document describes how to set up an environment and run the code to reproduce the results of the original manuscript.

Main requirements

The code was tested with the following packages and versions:

We assume that the installation of the above-mentioned packages covers all dependencies. In case we have missed essential dependencies, please raise an issue. To allow you to reproduce our results easily, we also provide a Dockerfile that builds a working environment with all the dependencies. However, be aware that you will need to use the NVIDIA Container Runtime for Docker if you plan on using a GPU for the training (highly recommended).

Usage

The main.py script is used to pre-process the files, to start (or resume) the training, and to generate cells with a trained model. Model (hyper-)parameters can be adjusted in the parameters.json file.

Parameter file

The (hyper-)parameters of the experiments are defined in a parameters.json file. A template is provided in this repository. It contains the path to the directory where the processed data, models, logs and results will be stored, along with a list of GPU identifiers (0..n) to be used, given that each model will be trained on a single GPU. For each experiment, a custom name (which will also be used as the folder name) has to be provided. When using the main.py script, the --param flag with the path to the .json parameter file should always be used. Further details on the parameters can be found in the subsequent 'Data format / Pre-processing' and 'Training' sections.
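To illustrate the shape of such a file, here is a minimal sketch built in Python. The key names below are hypothetical, chosen only to mirror the entries described above (experiment directory, GPU list, named experiments); consult the template shipped with the repository for the actual schema.

```python
import json

# Hypothetical parameters.json structure -- key names are illustrative only,
# not the repository's real schema.
params = {
    "exp_param": {
        "experiments_dir": "/path/to/experiments",  # processed data, models, logs and results
        "GPU": [0, 1],                              # GPU identifiers; one model per GPU
    },
    "experiments": {
        "my_scGAN_run": {                           # custom name, also used as the folder name
            # pre-processing and training parameters go here (see the
            # 'Data format / Pre-processing' and 'Training' sections)
        }
    },
}

text = json.dumps(params, indent=2)
```

The resulting string can be written to parameters.json and passed to main.py via the --param flag.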

Data format / Pre-processing

Single cell RNA-seq data has to be in the .h5 or .h5ad format. In case you want to use data in mtx/tsv format, please use the provided Jupyter notebook to convert the data to .h5 or .h5ad. For instance, most of the experiments reported in our paper use the Fresh 68k PBMCs (Donor A) dataset.

For performance reasons, the data will be stored, after pre-processing and splitting, in TFRecord format, which in turn will be used for the training of the model. All the parameters for the pre-processing of the data (filtering parameters, path to the dataset, size of the training / validation / test sets, clustering resolution) should be defined in the parameters.json parameter file. Note that the validation set is used to produce the t-SNE plots (each embedding is computed using the whole validation set and an equal number of generated cells). The test set is not used; unless you want to use a test set for further custom experiments, you can set its size to 0. To run the pre-processing, run the main.py script with the --process flag:

python main.py --param parameters.json --process
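The bookkeeping behind the split sizes can be sketched as follows. This is purely illustrative code (not from the repository), showing how the training set absorbs whatever the validation and test sets do not claim, and that the test set may be sized 0 since it is unused:

```python
def split_counts(total_cells, valid_size, test_size=0):
    """Return (train, valid, test) cell counts; the remainder goes to training.

    Illustrative only: the validation set feeds the t-SNE plots, the test set
    is unused by the pipeline and may be sized 0.
    """
    train = total_cells - valid_size - test_size
    if train <= 0:
        raise ValueError("validation + test sizes exceed the dataset size")
    return train, valid_size, test_size

# e.g. a 68,000-cell dataset with a 3,000-cell validation set and no test set
print(split_counts(68000, 3000))  # -> (65000, 3000, 0)
```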

Training

All the parameters regarding the training should be defined in the .json parameter file. The type of model (scGAN or cscGAN), the size of the different layers, the normalization layer used to condition the generation (batchnorm or layernorm), lambda, the batch size and the dimension of the latent space, along with all the optimizer parameters (number of steps, learning rate, algorithm used...), are defined there. The frequencies are also defined there: how often values are logged for Tensorboard, how often the t-SNE validation plots are produced, how often the loss values are printed to the standard output, and how often the model is saved. To run the training, the --train flag is used.

python main.py --param parameters.json --train

You can use the --process and --train flags jointly, in which case pre-processing and training will be run sequentially:

python main.py --param parameters.json --process --train

Note that you can also resume the training by specifying the path to a checkpoint in the .json file.

Here is how the t-SNE plots should look (non-conditional on the left, conditional on the right):

Tensorboard

After or during the training, you can use Tensorboard to monitor how the losses (for instance) evolve. We have defined summary operations on the most important quantities in the model. To launch it, point Tensorboard at the experiment directory defined in the parameters.json file:

tensorboard --logdir path/to/experiment/dir

Please refer to the Tensorboard documentation for further details.

Generation of cells

To generate cells and write them to a file, you can use the --generate flag. In that case, also use the --cells_no flag with a list of integers corresponding to the number of cells you want to generate (one integer per cluster index). The --model_path and --save_path flags should also be used, to specify the path to the model to be used and the path where the generated data will be saved, respectively. For instance:

python main.py --param parameters.json --generate --cells_no 1000 500 0 200 --model_path path/to/my/model --save_path where_to_save.h5ad
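To clarify how the --cells_no list maps to clusters, here is a minimal sketch of how such flags could be parsed with Python's standard argparse module. The flag names are taken from this README; the parser itself is illustrative and is not the repository's actual code:

```python
import argparse

# Illustrative parser -- not the one in main.py.
parser = argparse.ArgumentParser()
parser.add_argument("--param", required=True)
parser.add_argument("--generate", action="store_true")
parser.add_argument("--cells_no", nargs="+", type=int,
                    help="cells to generate, one integer per cluster index")
parser.add_argument("--model_path")
parser.add_argument("--save_path")

args = parser.parse_args(
    "--param parameters.json --generate --cells_no 1000 500 0 200 "
    "--model_path path/to/my/model --save_path where_to_save.h5ad".split()
)
# args.cells_no -> [1000, 500, 0, 200]: 1000 cells from cluster 0,
# 500 from cluster 1, none from cluster 2, 200 from cluster 3.
```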