____  ____  ____  _  _____ _____
        / ___\/   _\/  _ \/ \/    //    /
        |    \|  /  | | \|| ||  __\|  __\
        \___ ||  \__| |_/|| || |   | |   
        \____/\____/\____/\_/\_/   \_/    

License: MIT

!!!!NEW!!!

For large single-cell datasets (e.g., > 2k cells), please use the new version of scdiff (scdiff2) at: https://github.com/phoenixding/scdiff2

SCDIFF 2.0 uses HDF5, sparse matrices, and multi-threading to reduce the program's resource requirements while improving efficiency. It also incorporates many new clustering and trajectory inference methods for more comprehensive and accurate predictions.

A few highlights:
(1) VERY EFFICIENT: Analyze 40k cells (~10k genes/cell) within 1-2 hours (--ncores 12 --maxloop 0)
(2) VERY FLEXIBLE: It is composed of many moving pieces, each of which can be customized by the user.

INTRODUCTION

Most existing single-cell trajectory inference methods rely primarily on the assumption that descendant cells are similar to their parents in terms of gene expression levels. This assumption does not always hold for in-vivo studies, which often include infrequently sampled, unsynchronized and diverse cell populations. Thus, additional information may be needed to determine the correct ordering and branching of progenitor cells and the set of transcription factors (TFs) that are active during advancing stages of organogenesis. To enable such modeling, we developed scdiff, which integrates expression similarity with regulatory information to reconstruct the dynamic developmental cell trajectories. SCDIFF is a package written in Python and JavaScript, designed to analyze cell differentiation trajectories using time-series single-cell RNA-seq data. It can predict the transcription factors and differential genes associated with the cell differentiation trajectories. It also visualizes the trajectories using an interactive tree-structure graph, in which nodes represent different cell sub-populations (clusters).

flowchart

PREREQUISITES

The Python setup.py script (or pip) will try to install these packages automatically. However, please install them manually if, for any reason, the automatic installation fails.

INSTALLATION

There are 3 options to install scdiff.

USAGE

scdiff.py [-h] -i INPUT -t TF_DNA -k CLUSTERS -o OUTPUT [-l LARGE]
                 [-s SPEEDUP] [-d DSYNC] [-a VIRTUALANCESTOR]
                 [-f LOG2FOLDCHANGECUT] [-e ETFLISTFILE] [--spcut SPCUT]

    -h, --help            show this help message and exit

    -i INPUT, --input INPUT, required 
                        input single cell RNA-seq expression data

    -t TF_DNA, --tf_dna TF_DNA, required
                        TF-DNA interactions used in the analysis

    -k CLUSTERS, --clusters CLUSTERS, required
                        how to learn the number of clusters for each time
                        point? user-defined or auto? if user-defined, please
                        specify the configuration file path. If set as "auto"
                        scdiff will learn the parameters automatically.

    -o OUTPUT, --output OUTPUT, required
                        output folder to store all results

    -s SPEEDUP, --speedup SPEEDUP(1/None), optional
                        If set as 'True' or '1', SCDIFF will speed up the run
                        by reducing the number of iterations.

    -l LARGETYPE,  --largetype LARGETYPE (1/None), optional
                        if specified as 'True' or '1', scdiff will use LargeType mode to 
                        improve the running efficiency (both memory and time). 
                        As spectral clustering does not scale to large data,
                        PCA+K-Means clustering is used instead. The running speed is improved 
                        significantly but the performance is slightly worse. If there are
                        more than 2k cells at each time point on average, it is highly 
                        recommended to use this parameter to improve time and memory efficiency.

    -d DSYNC,  --dsync DSYNC (1/None), optional
                        If specified as 'True' or '1', the cell synchronization will be disabled. 
                        Synchronization can be disabled if the users believe that cells at the 
                        same time point are similar in terms of differentiation/development.

    -a VIRTUALANCESTOR, --virtualAncestor VIRTUALANCESTOR (1/None), optional
                        scdiff requires an 'Ancestor' node (the starting node; 
                        all other nodes are descendants).  By default, 
                        the 'Ancestor' node is set as the first time point. The underlying assumption is that  
                        the cells at the first time point are not differentiated yet
                        (or are at a very early stage of differentiation, so there are no clear sub-groups 
                        and all cells at the first time point belong to the same cluster).  

                        If this is not the case, users can set -a as 'True' or '1' to enable
                        a virtual ancestor before the first time point.  The expression of the 
                        virtual ancestor is the median expression of all cells at the first time point. 

    -f LOG2FOLDCHANGECUT, --log2foldchangecut LOG2FOLDCHANGECUT (Float), optional
                        By default, scdiff uses a log2 fold change of 1 (i.e., 2^1 = 2-fold)
                        as the cutoff for differential genes (together with a t-test p-value cutoff of 0.05).
                        However, users can customize the cutoff based on their 
                        application scenario (e.g., log2 fold change 1.5). 

    -e ETFLISTFILE, --etfListFile ETFLISTFILE (String), optional  
                        By default, scdiff recognizes 1.6k
                        TFs (we collected in human and mouse). Users are able
                        to provide a customized list of TFs instead using this
                        option. It specifies the path to the TF list file, in
                        which each line is a TF name. The file does not need 
                        target information for the TFs; the listed TFs are used to infer
                        eTFs (TFs predicted based on their own expression instead of that of their targets).

    --spcut SPCUT       Float, optional  
                        By default, scdiff uses p-value=0.05
                        as the cutoff to tell whether the DistanceToAncestor
                        (DTA) of clusters are significantly different.
                        Clusters with similar DTA will be placed in the same
                        level.
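
The -a option above describes the virtual ancestor's expression as the median expression of all cells at the first time point. The following is a minimal sketch of that calculation on plain Python lists. The function name virtual_ancestor_expression and the (time_point, expression) tuple layout are assumptions made for illustration; this is not scdiff's internal implementation.

```python
import statistics

def virtual_ancestor_expression(cells):
    """Return the per-gene median expression over the earliest time point.

    cells: list of (time_point, expression_list) tuples (hypothetical layout),
    where every expression_list has the same gene order.
    """
    first_time = min(time_point for time_point, _ in cells)
    first_cells = [expr for time_point, expr in cells if time_point == first_time]
    # zip(*first_cells) regroups the values gene by gene
    return [statistics.median(values) for values in zip(*first_cells)]

toy_cells = [
    (14, [1.0, 4.0]),
    (14, [3.0, 0.0]),
    (14, [2.0, 2.0]),
    (16, [9.0, 9.0]),  # later time point, ignored
]
print(virtual_ancestor_expression(toy_cells))
```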

INPUTS AND PRE-PROCESSING

scdiff takes two required input files (-i/--input and -t/--tf_dna), two optional files (-k/--clusters, -e/--etfListFile) and a few other optional parameters.

For other scdiff optional parameters, please refer to the usage section.
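
As an illustration of one of these optional parameters, the -f cutoff is compared against a log2 fold change. The sketch below shows only the fold-change half of that test; the t-test p-value check is omitted. It assumes non-log expression values and adds a pseudocount of 1 to avoid division by zero. Both choices, and the function names, are assumptions for illustration and may differ from what scdiff does internally.

```python
import math

def log2_fold_change(group_a, group_b, pseudocount=1.0):
    """log2 ratio of mean expression (group_b over group_a), with a pseudocount."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    return math.log2((mean_b + pseudocount) / (mean_a + pseudocount))

def passes_fold_change_cut(group_a, group_b, cutoff=1.0):
    """True if the absolute log2 fold change reaches the cutoff (default 1, i.e. 2-fold)."""
    return abs(log2_fold_change(group_a, group_b)) >= cutoff

# mean 1 vs mean 7 -> (7 + 1) / (1 + 1) = 4-fold
print(log2_fold_change([1.0, 1.0], [7.0, 7.0]))
print(passes_fold_change_cut([1.0, 1.0], [2.0, 2.0]))
```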

RECOMMENDED PIPELINE

Please follow the following steps to analyze the single-cell data.

RESULTS AND VISUALIZATION

The results are written to the specified output directory. The predicted model is provided as a JSON file, which is visualized by the provided JavaScript. Please use Chrome, Firefox, or Safari for the best experience.

example_out_fig

The following is the manual for the visualization page.

Visualization Config (Left panel):

Visualization Canvas (Right Panel):

EXAMPLES

Run scdiff on given time-series single cell RNA-seq data.
An example script exampleRun.py is provided under the example directory.

1) Run with automatic config

$ scdiff -i example.E -t example.tf_dna -k auto -o example_out

2) Run with user-defined config

$ scdiff -i example.E -t example.tf_dna -k example.config -o example_out

The formats of example.E and example.tf_dna are the same as described above.
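
The parsing example in the MODULES & FUNCTIONS section reads example.E as tab-delimited text: a header row whose gene names start at the fourth column, then one row per cell holding the cell ID, time point, label and expression values. Assuming that layout, the following sketch checks an expression file before it is passed to scdiff. The function name summarize_expression_file is hypothetical and is not part of the scdiff package.

```python
import csv

def summarize_expression_file(path):
    """Return (n_cells, n_genes, sorted_time_points) for an expression file.

    Assumed layout (inferred from the parsing example, not an official spec):
    header row, then rows of: cell ID, time point, label, one value per gene.
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        n_genes = len(header) - 3
        n_cells = 0
        time_points = set()
        for row_number, row in enumerate(reader, start=2):
            if not row:
                continue  # skip blank lines
            if len(row) != len(header):
                raise ValueError(
                    f"line {row_number}: expected {len(header)} columns, got {len(row)}"
                )
            time_points.add(float(row[1]))
            for value in row[3:]:
                float(value)  # raises ValueError on a non-numeric entry
            n_cells += 1
    return n_cells, n_genes, sorted(time_points)
```

Calling summarize_expression_file("example.E") would then report how many cells, genes and time points the file contains, or raise an error pointing at the first malformed line.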

example.config specifies custom initial clustering parameters. It is used when some prior knowledge is available. For example, if we know how many sub-populations there are at each time point, we can specify the clustering parameters directly in the example.config file, which gives better performance.

example.config format (tab-delimited):

time    #_of_clusters

For example:

14  1  
16  2  
18  5  

However, if we don't have any prior knowledge about the sub-populations within each time point, we can simply use the automatic initial clustering: -k auto.
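
A user-defined config in the format shown above can be generated from a small mapping of time point to number of clusters. This is a minimal sketch: the helper name format_cluster_config is hypothetical, and it assumes the file contains no header line, matching the three-line example.

```python
def format_cluster_config(clusters_per_time):
    """Return config text: one 'time<TAB>number_of_clusters' line per time point."""
    lines = []
    for time_point, n_clusters in sorted(clusters_per_time.items()):
        if n_clusters < 1:
            raise ValueError(f"time {time_point}: need at least 1 cluster")
        lines.append(f"{time_point}\t{n_clusters}\n")
    return "".join(lines)

# Same values as the example above: times 14, 16, 18 with 1, 2, 5 clusters.
config_text = format_cluster_config({14: 1, 16: 2, 18: 5})
print(config_text, end="")
```

The returned text can be written to a file and passed to scdiff with -k.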

3) Run scdiff on a large single-cell dataset

$ scdiff -i example.E -t example.tf_dna -k auto -o example_out -l True -s True

-i, -t, -k, -o parameters were discussed above.
For very large datasets (e.g., more than 20k cells), it is recommended to filter out genes with very low variance. This significantly cuts down the memory cost and running time.
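
One possible way to do this filtering with the standard library is sketched below. It assumes the expression table has already been read into a header list and a list of rows, with three metadata columns (cell ID, time point, label) before the gene columns. The function name and the variance threshold are hypothetical; a suitable threshold depends on the dataset.

```python
import statistics

def filter_low_variance_genes(header, rows, min_variance):
    """Drop gene columns whose population variance is below min_variance.

    The first three columns are treated as metadata and always kept.
    """
    n_meta = 3
    keep = [
        col for col in range(n_meta, len(header))
        if statistics.pvariance([float(row[col]) for row in rows]) >= min_variance
    ]
    columns = list(range(n_meta)) + keep
    new_header = [header[col] for col in columns]
    new_rows = [[row[col] for col in columns] for row in rows]
    return new_header, new_rows

header = ["cell", "time", "label", "G1", "G2"]
rows = [
    ["c1", "14", "NA", "1.0", "5.0"],
    ["c2", "14", "NA", "1.0", "0.0"],
]
# G1 is constant (variance 0) and is removed; G2 is kept.
print(filter_low_variance_genes(header, rows, min_variance=0.1))
```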

4) Run scdiff on a large single-cell dataset with synchronization disabled and a virtual ancestor

$ scdiff -i example.E -t example.tf_dna -k auto -o example_out -l True -s True -d True -a True

-i, -t, -k, -o, -l, -s parameters were defined above.
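
The command lines above can also be assembled from a Python script, for example when running several datasets in a loop. The sketch below only uses the flags documented in the USAGE section and assumes the scdiff command is available on the PATH. The helper names build_scdiff_command and run_scdiff are hypothetical and are not part of the scdiff package.

```python
import subprocess

def build_scdiff_command(expression, tf_dna, clusters, output,
                         large=False, speedup=False,
                         disable_sync=False, virtual_ancestor=False):
    """Assemble the scdiff argument list from the documented flags."""
    command = ["scdiff", "-i", expression, "-t", tf_dna, "-k", clusters, "-o", output]
    optional_flags = [
        ("-l", large),
        ("-s", speedup),
        ("-d", disable_sync),
        ("-a", virtual_ancestor),
    ]
    for flag, enabled in optional_flags:
        if enabled:
            command += [flag, "True"]
    return command

def run_scdiff(**options):
    """Run scdiff; raises CalledProcessError if it exits with a non-zero status."""
    subprocess.run(build_scdiff_command(**options), check=True)

print(" ".join(build_scdiff_command(
    "example.E", "example.tf_dna", "auto", "example_out",
    large=True, speedup=True, disable_sync=True, virtual_ancestor=True)))
```

Passing the arguments as a list, rather than as a single shell string, avoids quoting problems when file paths contain spaces.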

5) Example run result

The following link presents the results of an example run.
example_out

MODULES & FUNCTIONS

scdiff module

This Python module performs the single-cell differentiation analysis and builds the differentiation graph. Users can use it by importing the scdiff package in their program. Besides the description below, we also provide a module testing example in the example directory, named moduleTestExample.py.

scdiff.Cell(Cell_ID, TimePoint, Expression,typeLabel,GeneList)
This class defines the cell.

Parameters:

Output:
A Cell class instance (with all information regarding a cell)

Attributes:

Example:

import scdiff
from scdiff.scdiff import *

# Read the example cells from the tab-delimited expression file.
AllCells=[]
print("reading cells...")
with open("example.E","r") as f:
    line_ct=0
    for line in f:
        if line_ct==0:
            # header row: gene names start at the 4th column
            GL=line.strip().split("\t")[3:]
        else:
            line=line.strip().split("\t")
            iid=line[0]          # cell ID
            ti=float(line[1])    # time point
            li=line[2]           # cell type label (typeLabel)
            ei=[round(float(item),2) for item in line[3:]]  # expression values
            ci=scdiff.Cell(iid,ti,ei,li,GL)
            AllCells.append(ci)
        line_ct+=1
        print('cell:'+str(line_ct))

scdiff.Graph(Cells, tfdna, kc, largeType=None, dsync=None, virtualAncestor=None,fChangCut=1.0, etfile=None)
This class defines the differentiation graph.

Parameters:

Output:
A graph instance with all nodes and edges, which represents the differentiation structure for given inputs.

Attributes:

Example:

import scdiff
from scdiff.scdiff import *

print("testing scdiff.Graph module ...")
# creating graph using scdiff.Graph module and examples cells build above
g1=scdiff.Graph(AllCells,"example.tf_dna",'auto')

scdiff.Clustering(Cells, kc,largeType=None)
This class represents the clustering.

Parameters:

Method: getClusteringPars()

import scdiff
from scdiff import *
Clustering_example=scdiff.Clustering(AllCells,'auto',None)
[dCK,dBS]=Clustering_example.getClusteringPars()

Method: performClustering()

import scdiff 
from scdiff import *
Clustering_example=scdiff.Clustering(AllCells,'auto',None)
Clusters=Clustering_example.performClustering()

scdiff.Cluster(Cells,TimePoint,Cluster_ID)
This class defines the node in the differentiation graph.

Parameters:

Output: List of float, this function calculates the average gene expression of all cells in cluster.

Attributes:

Example:

import scdiff 
from scdiff import *
cluster1=scdiff.Cluster([item for item in AllCells if item.T==14],14,'C1')

scdiff.Path(fromNode,toNode,Nodes,dTD,dTG,dMb)

This class defines the edge in the differentiation graph.

Parameters:

Output: Graph edge instance.

Attributes:

Example:

import scdiff 
from scdiff import *
g1=scdiff.Graph(AllCells,"example.tf_dna",'auto')
p1=scdiff.Path(g1.Nodes[0],g1.Nodes[1],g1.Nodes,g1.dTD,g1.dTG,g1.dMb)

viz module

This module is designed to visualize the differentiation graph structure using JavaScript.

scdiff.viz(exName,Graph,output)

Parameters:

Output: a visualization folder with HTML page, JavaScript Code and Graph Structure in JSON format.

Example:

import os
import scdiff
from scdiff import *
print("testing scdiff.viz module ...")
# visualizing graph using scdiff.viz module
os.makedirs("e1_out", exist_ok=True)  # do not fail if the folder already exists
scdiff.viz("example",g1,"e1_out")

You will then find the resulting HTML visualization page under the 'e1_out' directory.

CREDITS

This software was developed by the ZIV-system biology group @ Carnegie Mellon University.
Implemented by Jun Ding.

Please cite our paper: "Reconstructing differentiation networks and their regulation from time series single cell expression data."

LICENSE

This software is under the MIT license.
See the LICENSE.txt file for details.

CONTACT

zivbj at cs.cmu.edu
jund at cs.cmu.edu