miRge2.0

A Python program for comprehensive analysis of miRNA sequencing data, including miRNA annotation, A-to-I editing analysis, novel miRNA detection, isomiR analysis and tRF detection.

Documentation

Installation
How to use it
- Annotate mode
- Predict mode
Changelog
Citation

Installation

Download libraries

miRge2.0 relies on the following libraries:
1) Bowtie indexes of the genome, hairpin miRNAs, mature miRNAs in miRBase, mature miRNAs in MirGeneDB, mRNA, rRNA, snoRNA, mature tRNA, primary tRNA, other ncRNA and spike-in sequences (optional)
2) Sequences of the genome and of mature miRNAs (including SNP information) in miRBase and MirGeneDB
3) Coordinates of repetitive elements and mature miRNAs in the genome, and miRNA merge information in miRBase and MirGeneDB

Libraries for six species (human, mouse, rat, zebrafish, nematode and fruitfly) can be downloaded separately with wget:
wget -O human.tar.gz https://jh.box.com/shared/static/rj7ufy5v15uw7ytsyyrsryw99u7ml82j.gz
wget -O mouse.tar.gz https://jh.box.com/shared/static/z2bcey8j9e9nxnvpmb4fm88zzq3da4m1.gz
wget -O rat.tar.gz https://jh.box.com/shared/static/mmztv42j8h7snk0eo80o7a7t30it4q9f.gz
wget -O zebrafish.tar.gz https://jh.box.com/shared/static/nwn7jzn5ekgm51k7jlk43a6h75aasgr1.gz
wget -O nematode.tar.gz https://jh.box.com/shared/static/boubl0g28b0xj952e8bhfmbds056gdvi.gz
wget -O fruitfly.tar.gz https://jh.box.com/shared/static/ilrnq62cp06pviir5t0mh85aqet0fmjq.gz

After unpacking the downloaded .tar.gz files into a new directory named miRge.Libs, each folder named after a species contains three folders: index.Libs (libraries Part 1), fasta.Libs (libraries Part 2) and annotation.Libs (libraries Part 3). The absolute path of miRge.Libs is used as the value of the parameter ‘-lib’ in miRge2.0.
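
Before running miRge2.0, it can help to confirm that the unpacked libraries have the layout described above. The following is a minimal sketch, not part of miRge2.0: the function name missing_library_dirs and the relative path miRge.Libs in the demo are assumptions for illustration, while the three sub-folder names come from the description above.

```python
import os

# Sub-folder names as described in the README text above.
EXPECTED_SUBDIRS = ("index.Libs", "fasta.Libs", "annotation.Libs")


def missing_library_dirs(lib_root, species):
    """Return the expected sub-folders that are absent for one species."""
    species_dir = os.path.join(lib_root, species)
    return [name for name in EXPECTED_SUBDIRS
            if not os.path.isdir(os.path.join(species_dir, name))]


if __name__ == "__main__":
    # Hypothetical location; replace with your own miRge.Libs directory.
    lib_root = os.path.abspath("miRge.Libs")
    missing = missing_library_dirs(lib_root, "human")
    if missing:
        print("Missing under %s: %s"
              % (os.path.join(lib_root, "human"), ", ".join(missing)))
    else:
        print("Layout looks complete; use -lib %s" % lib_root)
```

The absolute path printed in the last line is the value to pass to -lib.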

If users want to build libraries for other species, they can use the scripts miRge_bowtie_build.py and miRge_pckls_build.py, which can be downloaded from https://github.com/mhalushka/miRge_build. miRge_bowtie_build.py builds the bowtie index files, and miRge_pckls_build.py transforms _genome_repeats.GTF and ***_genome.fa in order to speed up reading large files into memory.

Install miRge2.0

miRge2.0 is implemented as a Python program running on a Linux/Unix platform. It requires Bowtie (v1.1.1 or v1.1.2; http://bowtie-bio.sourceforge.net/index.shtml), SAMtools (v1.5; http://samtools.sourceforge.net/) and RNAfold (v2.3.5; http://www.tbi.univie.ac.at/RNA) to be installed beforehand.
It was built with the Python (v2.7.*) programming language and Python libraries, including cutadapt (v1.11 to v1.16), biopython (>= v1.68), numpy (>= v1.11.3), scipy (>= v0.17.0), matplotlib (>= v2.1.1), pandas (>= v0.21.0), sklearn (>= v0.18.1), reportlab (>= v3.3.0) and forgi (v0.20).

PLEASE NOTE: miRge2.0 is currently incompatible with cutadapt v1.18. Using v1.18 will give a "TypeError: __call__() takes exactly 3 arguments (2 given)" error.
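
A quick way to catch an unsupported cutadapt before a run is to compare the installed version with the range listed above (v1.11 to v1.16). This is a rough sketch rather than anything shipped with miRge2.0: the helper names are made up for illustration, and versions outside the listed range are simply treated as untested.

```python
def parse_version(text):
    """Turn '1.16' or '1.16.0' into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def cutadapt_version_supported(text):
    """True if the major.minor version lies in the listed range 1.11 to 1.16."""
    major_minor = parse_version(text)[:2]
    return (1, 11) <= major_minor <= (1, 16)


if __name__ == "__main__":
    try:
        import cutadapt
    except ImportError:
        print("cutadapt is not installed")
    else:
        version = getattr(cutadapt, "__version__", "")
        status = "ok" if cutadapt_version_supported(version) else "outside v1.11 to v1.16"
        print("cutadapt %s: %s" % (version, status))
```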

The source code is hosted at: https://github.com/mhalushka/miRge.

miRge2.0 is available in bioconda now and can be installed via conda. You will need to add the bioconda channel as well as the other channels bioconda depends on. It is important to add them in this order so that the priority is set correctly (that is, bioconda is highest priority).
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
Then install mirge by running:
conda install mirge

miRge2.0 can also be installed from the source code with pip (THIS VERSION IS CURRENTLY UNDER REVISION. USE BIOCONDA FOR AN ACCURATE INSTALL UNTIL THIS MESSAGE DISAPPEARS):
1) Download the miRge2.0 source code from https://github.com/mhalushka/miRge and unzip the zipped folder.
2) If the wheel package is not installed, run pip install wheel to install it.
3) Change the directory to miRge2.0's directory and run python setup.py bdist_wheel to build a wheel file for the subsequent installation via pip.
4) Run pip install ./dist/mirge-2.0-py2-none-any.whl to install miRge2.0.

Troubleshooting installation

miRge2.0 was tested with specific versions of the required software and Python packages. Please make sure the installed versions are correct.
1) If Bowtie, SAMtools or RNAfold is already installed on the system, run which bowtie, which samtools or which RNAfold to find the installation paths. If the versions are incorrect, install the right versions.
2) Run pip freeze to check the versions of the installed Python packages. If a Python package does not work, install it manually by running pip install package==*.**.
3) If the required version of a Python package can't be installed by pip or imported by Python, make sure the installed Python was compiled with 4-byte Unicode so that pip can install UCS4 wheels (supporting cp27mu, not cp27m). Start python and enter the following commands:
import sys
print sys.maxunicode
If the output is 1114111, the build is UCS4; if the output is 65535, it is UCS2.
If it is UCS2, re-compile the installed Python with 4-byte Unicode from the source code by running:
a) ./configure --enable-unicode=ucs4 --prefix=***
b) make
c) make install
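
The Unicode check in step 3 can also be kept as a small script. This sketch only wraps the two values named above; the function name unicode_build is an assumption for illustration. Note that Python 3.3 and later always report 1114111, so the distinction only matters for the Python 2.7 interpreter that miRge2.0 runs on.

```python
import sys


def unicode_build(maxunicode):
    """Map sys.maxunicode to the build type described in step 3."""
    if maxunicode == 1114111:
        return "UCS4"
    if maxunicode == 65535:
        return "UCS2"
    return "unknown"


if __name__ == "__main__":
    # Run this with the same interpreter that miRge2.0 will use.
    print("%d -> %s" % (sys.maxunicode, unicode_build(sys.maxunicode)))
```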

How to use it

miRge2.0 has two modules: annotate and predict.

Annotate mode

Usage: miRge2.0 annotate [-h] [<args>]

Example:
miRge2.0 annotate -s seq_file.fastq -d miRBase -pb /home/yin/tools/bowtie-1.1.1 -lib /home/yin/miRge.Libs -sp human -ad illumina -ai -gff -trf -cpu 4

optional arguments:
  -h, --help            show this help message and exit
  -s [sample <required> [sample <required> ...]]
                        two options: 1. A file where each row represents one
                        sample name; 2. *.fastq *.fastq ...
  -o <dir>              the directory of the outputs (default: current directory)
  -d <string required>  the miRNA database (default: miRBase. MirGeneDB is optional)
  -pb <dir required>    the path to the system's bowtie binary
  -lib <dir required>   the path to the miRge libraries
  -sp <string required>
                        the species can be human, mouse, fruitfly, nematode,
                        rat and zebrafish (novel miRNA detection is confined
                        in human and mouse)
  -ex <float>           the threshold of the proportion of canonical reads for the
                        miRNAs to determine whether keeping them or not when counting.
                        Users can set it between 0 and 0.5 (default: 0.1)
  -ad <string>          the adapter to be removed, which can be
                        illumina, ion or a defined sequence (default: none)
  -phred64              phred64 format (default: 33)
  -spikeIn              switch to annotate spike-ins if the bowtie index files
                        are located at the path of bowtie's index
                        files (default: off)
  -tcf                  switch to write trimmed and collapsed fasta file
                        (default: off)
  -di                   switch to calculate isomiR entropy (default: off)
  -cpu <int>            the number of processors to use for trimming, qc, and
                        alignment (default: 1)
  -ai                   switch to calculate A-to-I editing (default: off)
  -gff                  switch to output results in gff format (default: off)
  -trf                  switch to analyze tRNA fragments (default: off)
  --version             show program's version number and exit
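
When many samples are processed from a script, the annotate command line can be assembled as an argument list. The sketch below is an illustration only: the function build_annotate_command is hypothetical, it uses just the flags listed above, and the paths in the demo are the ones from the example.

```python
def build_annotate_command(fastq_files, bowtie_dir, lib_dir, species,
                           database="miRBase", adapter=None,
                           switches=(), cpu=1):
    """Assemble a 'miRge2.0 annotate' argument list from the documented flags."""
    cmd = ["miRge2.0", "annotate", "-s"] + list(fastq_files)
    cmd += ["-d", database, "-pb", bowtie_dir, "-lib", lib_dir, "-sp", species]
    if adapter:
        cmd += ["-ad", adapter]
    cmd += list(switches)          # on/off switches such as -ai, -gff, -trf
    cmd += ["-cpu", str(cpu)]
    return cmd


if __name__ == "__main__":
    cmd = build_annotate_command(
        ["seq_file.fastq"],
        bowtie_dir="/home/yin/tools/bowtie-1.1.1",
        lib_dir="/home/yin/miRge.Libs",
        species="human",
        adapter="illumina",
        switches=("-ai", "-gff", "-trf"),
        cpu=4,
    )
    print(" ".join(cmd))
    # To actually run it: import subprocess; subprocess.check_call(cmd)
```

Passing a list to subprocess avoids shell quoting problems when file names contain spaces.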

Predict mode

Usage: miRge2.0 predict [-h] [<args>]

Example:
miRge2.0 predict -s seq_file.fastq -d miRBase -pb /home/yin/tools/bowtie-1.1.1 -lib /home/yin/miRge.Libs -ps /usr/local/bin -pr /usr/local/bin -sp human -ad illumina -ai -gff -trf -cpu 4
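
In the example above, -pb, -ps and -pr are given as directories. If the tools are already on PATH, those directories can be looked up programmatically, much like running which by hand. This is a sketch under assumptions: the helper tool_dir is not part of miRge2.0, and shutil.which needs Python 3.3 or later, so run it as a separate helper rather than inside the Python 2.7 environment.

```python
import os
import shutil


def tool_dir(executable, search_path=None):
    """Return the directory holding `executable`, or None if it is not found.

    search_path defaults to the PATH environment variable.
    """
    found = shutil.which(executable, path=search_path)
    if found is None:
        return None
    return os.path.dirname(os.path.abspath(found))


if __name__ == "__main__":
    # Executable names as used with `which` in the troubleshooting section.
    for flag, executable in (("-pb", "bowtie"),
                             ("-ps", "samtools"),
                             ("-pr", "RNAfold")):
        print(flag, tool_dir(executable) or "(not found on PATH)")
```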

optional arguments:
  -h, --help            show this help message and exit
  -s [sample <required> [sample <required> ...]]
                        two options: 1. A file where each row represents one
                        sample name; 2. *.fastq *.fastq ...
  -o <dir>              the directory of the outputs (default: current directory)
  -d <string required>  the miRNA database (default: miRBase. MirGeneDB is optional)
  -pb <dir required>    the path to the system's bowtie binary
  -lib <dir required>   the path to the miRge libraries
  -sp <string required>
                        the species can be human, mouse, fruitfly, nematode,
                        rat and zebrafish (novel miRNA detection is confined
                        in human and mouse)
  -ps <dir required>    the path to the system's samtools binary
  -pr <dir required>    the path to the system's rnafold binary
  -ex <float>           the threshold of the proportion of canonical reads for the
                        miRNAs to determine whether keeping them or not when counting.
                        Users can set it between 0 and 0.5 (default: 0.1)
  -ad <string>          the adapter to be removed, which can be
                        illumina, ion or a defined sequence (default: none)
  -phred64              phred64 format (default: 33)
  -spikeIn              switch to annotate spike-ins if the bowtie index files
                        are located at the path of bowtie's index
                        files (default: off)
  -tcf                  switch to write trimmed and collapsed fasta file
                        (default: off)
  -di                   switch to calculate isomiR entropy (default: off)
  -cpu <int>            the number of processors to use for trimming, qc, and
                        alignment (default: 1)
  -ai                   switch to calculate A-to-I editing (default: off)
  -gff                  switch to output results in gff format (default: off)
  -trf                  switch to analyze tRNA fragments (default: off)
  -ws <file>            the file containing the overall samples to analyze
                        for novel miRNA prediction. No header, just a list of *.fastq
                        file names in a column. The list file can have any name
                        (e.g. filestochecknovel.txt)
  -minl <int>           the minimum length of the retained reads for novel miRNA
                        detection (default: 16)
  -maxl <int>           the maximum length of the retained reads for novel miRNA
                        detection (default: 25)
  -cc <int>             the maximum read count of the retained reads for novel miRNA
                        detection (default: 2)
  -ml <int>             the maximum number of mapping loci for the retained
                        reads for novel miRNA detection (default: 3)
  -sl <int>             the seed length when invoking Bowtie for novel miRNA detection
                        (default: 25)
  -olc <int>            the length of overlapped sequence when joining reads
                        into longer sequences based on the coordinate on the
                        genome for novel miRNA detection (default: 14)
  -clc <int>            the maximum length of the clustered sequences for novel miRNA
                        detection (default: 30)
  --version             show program's version number and exit
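
The file passed to -ws is a plain one-column list of *.fastq file names with no header. One possible way to generate it from a directory of reads is sketched below; the function write_sample_list is hypothetical and simply follows the format described for -ws.

```python
import os


def write_sample_list(fastq_dir, list_path):
    """Write the names of all *.fastq files in fastq_dir, one per line.

    Returns the list of names written (sorted, no header line).
    """
    names = sorted(name for name in os.listdir(fastq_dir)
                   if name.endswith(".fastq"))
    with open(list_path, "w") as handle:
        for name in names:
            handle.write(name + "\n")
    return names
```

For example, write_sample_list(".", "filestochecknovel.txt") would list the fastq files of the current directory, and the resulting file could then be passed as -ws filestochecknovel.txt.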

Changelog

06-26-2018 Add a -trf option, so that tRF detection can be performed.

05-06-2018 Rebuild the miRNA libraries from the newly released miRBase v22 and MirGeneDB v2.0.

02-22-2018 Add a -ex option. If the canonical read count is less than 2 or the proportion of canonical reads is less than the set threshold, the miRNA will not be counted in the output files: miR.Counts.csv and miR.RPM.csv.
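
The -ex rule can be written down as a small predicate. This is a sketch of the rule as stated in the changelog entry, not miRge2.0's actual code: the function name is_counted is made up, and it assumes the proportion is the canonical read count divided by all reads assigned to that miRNA (canonical plus isomiR reads).

```python
def is_counted(canonical_reads, total_reads, ex_threshold=0.1):
    """Apply the -ex rule: keep a miRNA only if it has at least 2 canonical
    reads and its canonical proportion reaches the threshold.

    Assumption: total_reads is every read assigned to this miRNA.
    """
    if canonical_reads < 2 or total_reads <= 0:
        return False
    proportion = float(canonical_reads) / total_reads
    return proportion >= ex_threshold


if __name__ == "__main__":
    # 5 canonical reads out of 100 gives 0.05, below the default 0.1.
    print(is_counted(5, 100))
    # 30 canonical reads out of 100 gives 0.3, above the default 0.1.
    print(is_counted(30, 100))
```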

Citation

Yin Lu, Alexander S. Baras, Marc K Halushka. miRge 2.0 for comprehensive analysis of microRNA sequencing data. BMC Bioinformatics. 2018.