Rebaler

Rebaler is a program for conducting reference-based assemblies using long reads. It relies mainly on minimap2 for alignment and Racon for making consensus sequences.

I made Rebaler for bacterial genomes (specifically for the task of testing basecallers). It should in principle work for non-bacterial genomes as well, but I haven't tested it.

If you have raw, signal-level data available (bax.h5 files for PacBio, fast5 files for Nanopore), then it's a good idea to run a signal-level polisher (Arrow or Nanopolish) after Rebaler.

Requirements

Rebaler runs on Python 3.4+ and uses Biopython.

It also assumes that minimap2 and racon executables are available in your PATH. If you can open a terminal and run those commands, you're good to go.

Note that as of v0.1.1, Rebaler requires Racon v1.0 or later. Check this with racon --version. If that command gives you an error, your version of Racon is too old and you need a new one.
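
If you'd rather check the PATH requirement from a script than by hand, something like the following would do. It is only a sketch and not part of Rebaler: the function name missing_tools is made up for illustration, and it only checks that the executables can be found, not that Racon is v1.0 or later.

```python
import shutil


def missing_tools(tools=("minimap2", "racon")):
    """Return the names of required executables that are not found in PATH.

    Hypothetical helper for illustration; Rebaler does not provide it.
    """
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found in PATH: " + ", ".join(missing))
    else:
        print("minimap2 and racon were both found in PATH")
```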

Installation

Install from source

Running the setup.py script will install a rebaler executable:

git clone https://github.com/rrwick/Rebaler.git
cd Rebaler
python3 setup.py install
rebaler -h

Run without installation

Rebaler can be run directly from its repository by using the rebaler-runner.py script:

git clone https://github.com/rrwick/Rebaler.git
Rebaler/rebaler-runner.py -h

Usage

Rebaler is simple to use – give it a reference and reads, and it will output its assembly to stdout when done:

rebaler reference.fasta reads.fastq.gz > assembly.fasta

Progress information is written to stderr.

Full usage

usage: rebaler [-h] [-t THREADS] [--keep] reference reads

Rebaler: reference-based long read assemblies of bacterial genomes

positional arguments:
  reference                      FASTA file of reference assembly
  reads                          FASTA/FASTQ file of long reads

optional arguments:
  -h, --help                     show this help message and exit
  -t THREADS, --threads THREADS  Number of threads to use for alignment and polishing (default: 8)
  --keep                         Do not delete temp directory of intermediate files (default: delete
                                 temp directory)
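
If you want to call Rebaler from a Python script (for example, to run it over many read sets), a thin wrapper around the command line is probably enough. The sketch below assumes a rebaler executable is in your PATH; the function names build_rebaler_command and run_rebaler are made up for illustration, and only the options shown in the usage message above are used.

```python
import subprocess


def build_rebaler_command(reference, reads, threads=None, keep=False):
    """Build the argument list for a rebaler call (hypothetical helper)."""
    command = ["rebaler"]
    if threads is not None:
        command += ["-t", str(threads)]
    if keep:
        command.append("--keep")
    command += [reference, reads]
    return command


def run_rebaler(reference, reads, assembly_path, threads=None, keep=False):
    """Run rebaler and save its stdout (the assembly) to assembly_path."""
    command = build_rebaler_command(reference, reads, threads, keep)
    with open(assembly_path, "w") as assembly_file:
        # Only stdout is redirected, so progress messages on stderr
        # still show up in the terminal.
        subprocess.check_call(command, stdout=assembly_file)
```

For example, run_rebaler("reference.fasta", "reads.fastq.gz", "assembly.fasta", threads=4) would be the equivalent of the shell command in the Usage section with -t 4 added.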

Method

1) Load in the reference contigs.
2) Use minimap2 to align long reads to the reference.
3) Remove lower quality alignments (judged by length, identity and size of indels) until the reference is just covered. Any given position in the reference should now have a coverage of 1 or 2 (or 0 if the reads failed to cover a spot).
4) Replace the reference sequence with corresponding read fragments to produce an unpolished assembly (like what miniasm would make). If parts of the reference had no read coverage, the original reference sequence will be left in place.
5) Conduct multiple rounds of Racon polishing with all reads to produce the best possible consensus sequence.
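
To give a feel for step 3, here is a toy version of the idea. It is not Rebaler's actual code. It assumes each alignment is a (start, end, score) tuple with half-open reference coordinates and a single made-up quality score (Rebaler judges quality by length, identity and size of indels), and it simply drops the worst alignments whenever doing so would not leave any base uncovered.

```python
def trim_to_minimal_cover(ref_length, alignments):
    """Greedily drop low-scoring alignments while keeping the reference covered.

    alignments is a list of (start, end, score) tuples using half-open
    [start, end) reference coordinates. The score is a stand-in for
    whatever quality measure is used; higher means better.
    Returns the kept alignments and the per-base coverage after trimming.
    """
    coverage = [0] * ref_length
    for start, end, _ in alignments:
        for pos in range(start, end):
            coverage[pos] += 1

    kept = set(range(len(alignments)))
    # Visit alignments from worst score to best.
    worst_first = sorted(range(len(alignments)), key=lambda i: alignments[i][2])
    for index in worst_first:
        start, end, _ = alignments[index]
        # Safe to remove only if every base it spans is covered by
        # at least one other remaining alignment.
        if all(coverage[pos] >= 2 for pos in range(start, end)):
            for pos in range(start, end):
                coverage[pos] -= 1
            kept.discard(index)

    return [alignments[i] for i in sorted(kept)], coverage


if __name__ == "__main__":
    toy_alignments = [(0, 6, 0.9), (4, 10, 0.8), (2, 8, 0.5), (0, 10, 0.1)]
    kept, coverage = trim_to_minimal_cover(10, toy_alignments)
    print(kept)      # [(0, 6, 0.9), (4, 10, 0.8)]
    print(coverage)  # [1, 1, 1, 1, 2, 2, 1, 1, 1, 1]
```

In the toy example the two best alignments survive and every position ends up with a coverage of 1 or 2, matching the description in step 3. Positions that no alignment reaches would stay at 0.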

Circular contigs

If the reference is made of circular contigs (as is the norm for bacterial genomes), Rebaler can take this into account during Racon polishing. Specifically, it will 'rotate' the contigs (change the starting position) between polishing rounds to ensure that all parts of the genome are well polished, including the ends.
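
Rotating a circular contig just means picking a new starting position and moving the bases before it to the end. A minimal sketch (the function name rotate_contig is made up here, and how Rebaler chooses the new starting position is not shown):

```python
def rotate_contig(sequence, new_start):
    """Return a circular sequence rewritten to begin at index new_start.

    Illustrative helper only. new_start wraps around the sequence length,
    so any integer is accepted.
    """
    if not sequence:
        return sequence
    new_start %= len(sequence)
    return sequence[new_start:] + sequence[:new_start]


if __name__ == "__main__":
    print(rotate_contig("ACGTAC", 2))  # GTACAC
```

The rotated sequence describes the same circular molecule, but what used to be the contig's ends now sit in the middle, where a polisher has reads spanning them.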

To indicate that a reference contig is circular, include circular=true in its FASTA header.
For example: >chromosome length=5138942 circular=true
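
If you want to check which contigs in a reference carry this tag before running Rebaler, a small script can do it. The sketch below is an assumption about how such a check could look, not Rebaler's own parsing: it treats the first word of the header as the contig name and looks for an exact circular=true token among the remaining words.

```python
def is_circular(header):
    """Return True if a FASTA header line contains a circular=true token.

    Hypothetical helper: the first whitespace-separated field is taken
    as the contig name and is not searched.
    """
    fields = header.lstrip(">").split()
    return "circular=true" in fields[1:]


if __name__ == "__main__":
    print(is_circular(">chromosome length=5138942 circular=true"))  # True
    print(is_circular(">plasmid_1 length=5000"))                    # False
```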

License

GNU General Public License, version 3