Raphael Müller authored
5ffca00f

Backmap Nextflow pipeline

Install

Dependencies [1]

  • nextflow
  • samtools
  • bedtools
  • bwa
  • minimap2
  • qualimap
  • multiqc
  • r-base
  • r-ggplot2
  • r-dplyr

Installation with conda

conda env create -f environment.yml
conda activate backmap
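
After activating the environment, it can help to confirm that the tools from the dependency list are actually on the PATH. The following is a minimal sketch and not part of the pipeline: `missing_tools` is a hypothetical helper, and the executable names (in particular `Rscript` for r-base) are assumptions about what the conda packages install.

```shell
# Print every given tool name that cannot be found on the PATH.
# Hypothetical helper, not part of backmap.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
  return 0
}

# Executable names are assumed from the dependency list above.
missing=$(missing_tools nextflow samtools bedtools bwa minimap2 qualimap multiqc Rscript)
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
else
  echo "All tools found."
fi
```

If anything is reported as missing, recreating the environment from environment.yml is usually the first thing to try.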

Usage

backmap v0.4

Description:
        Automatic mapping of paired, unpaired, PacBio and Nanopore reads to an
        assembly, execution of qualimap bamqc, multiqc and estimation of genome size
        from mapped nucleotides and peak coverage.
        With the provided conda environment configuration file (environment.yml),
        a conda environment with all necessary tools can be created.

        Be aware that this pipeline was ported to Nextflow, with major changes
        compared to version v0.3.

Usage:
        With mapping:
                nextflow run backmap.nf  --assembly assembly.fa
                                        [--single-end <unpaired.fq>[,<unpaired2.fq>,...]]
                                        [--paired-end <paired1.1.fq>,<paired1.2.fq>[,<paired2.1.fq>,<paired2.2.fq>,...]]
                                        [--pacbio <pacbio.fq>[,<pacbio2.fq>,...]]
                                        [--nanopore <nanopore.fq>[,<nanopore2.fq>,...]]

        Without mapping:
                nextflow run backmap.nf [--assembly assembly.fa]
                                        [--illumina-bam-files illumina.bam[,illumina2.bam,...]]
                                        [--pacbio-bam-files pacbio.bam[,pacbio2.bam,...]]
                                        [--nanopore-bam-files nanopore.bam[,nanopore2.bam,...]]


Mandatory options:
                --assembly                      STR     Assembly to which reads should be mapped, in fasta format
        AND AT LEAST ONE OF
                --single-end                    STR     Fastq files with unpaired Illumina reads, comma separated
                --paired-end                    STR     Two fastq files with paired Illumina reads, comma separated
                --pacbio                        STR     Fasta or fastq file with PacBio reads, comma separated
                --nanopore                      STR     Fasta or fastq file with Nanopore reads, comma separated
        OR
                --illumina-bam-files            STR     Premapped Illumina Reads in bam format, comma separated
                --pacbio-bam-files              STR     Premapped Pacbio Reads in bam format, comma separated
                --nanopore-bam-files            STR     Premapped Nanopore Reads in bam format, comma separated
                                                        Skips read mapping
                                                        Genome size estimation only possible if assembly is given

Optional Options: [default]
                --file-separator                STR     Overrides file separator [,]
                --output                        STR     Output directory [./Results/]
                                                        Will be created if not existing
                --prefix                        STR     Prefix of output files
                --keep-temporary                        Keep temporary, intermediate files [false]
                --threads                       INT     Maximum number of threads per process [1]
                                                        This is not the total number of available threads!

                --skip-quality-control                  Skip qualimap bamqc run [false]
                --skip-coverage-histogram               Skip creation of coverage histograms [false]
                --skip-genome-size-estimation           Skip genome size estimation [false]
                                                        Genome size estimation is only possible if an assembly is given


                --bwa-options                   STR     Options passed to bwa [-a -c 10000]
                --minimap-pacbio-options        STR     Options passed to minimap ["-H"]
                --minimap-nanopore-options      STR     Options passed to minimap for Nanopore runs [""]
                --qualimap-options              STR     Options passed to qualimap [""]
                                                        Pass options with quotes e.g. --bwa-options "<options>"

                --help                                  Print this help and exit
                --version                               Print version number and exit
                --debug                                 Print inner variables
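
The help text states that `--paired-end` takes two fastq files per library, joined by the file separator. A quick pre-flight check that such a list has an even number of entries can catch a missing mate file before a run starts. This is a minimal sketch: `is_paired_list` is a hypothetical helper, the file names are placeholders, and whether the pipeline performs a similar check itself is not stated here.

```shell
# Succeed if the separator-delimited list has a positive, even number of entries.
# Hypothetical helper; the separator defaults to "," like --file-separator.
is_paired_list() {
  list="$1"
  sep="${2:-,}"
  # awk splits the single input line on the separator and reports the field count.
  count=$(printf '%s\n' "$list" | awk -F"$sep" '{ print NF }')
  [ "$count" -gt 0 ] && [ $((count % 2)) -eq 0 ]
}

# File names below are placeholders.
if is_paired_list "lib1.1.fq,lib1.2.fq,lib2.1.fq,lib2.2.fq"; then
  echo "paired list looks complete"
fi
```

The check only counts entries; it does not verify that the files exist or that forward and reverse files are in the right order.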

Changes to the original perl pipeline

Parameters

  • all options now have a long form; short forms are currently not available
  • added options `--illumina-bam-files`, `--pacbio-bam-files`, and `--nanopore-bam-files`, which replace `-b`
  • `--debug` prints internal variables for faster development and bug investigations
  • `--threads` sets the maximum number of threads per process (not maximum number of threads of the whole workflow)
  • removed `-b`: split into three separate options
  • removed `-sort`: every bam is now sorted
  • removed `-v`: this is now a nextflow feature
  • removed `-dry-run`
  • removed `-t`: handled by nextflow

Multiple files of one kind

Multiple files of one kind, e.g., three different PacBio runs, are now passed as a single comma-separated string instead of repeating the parameter once per file.

$ nextflow run backmap.nf --assembly assembly.fasta --pacbio pacbio1.fq,pacbio2.fq,pacbio3.fq
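
When there are many files, the comma-separated string can be built in the shell rather than typed by hand. A minimal sketch, assuming the default file separator `,`; `join_by` is a hypothetical helper and the file names are placeholders. The command is only printed here, not executed.

```shell
# Join all remaining arguments into one string using the given separator character.
# Hypothetical helper; "," matches the default of --file-separator.
join_by() {
  sep="$1"
  shift
  # Run in a subshell so the IFS change does not leak into the caller.
  ( IFS="$sep"; printf '%s\n' "$*" )
}

# Placeholder file names; in practice a glob such as runs/*.fq could be passed.
pacbio_list=$(join_by , pacbio1.fq pacbio2.fq pacbio3.fq)
echo "nextflow run backmap.nf --assembly assembly.fasta --pacbio $pacbio_list"
```

File names that themselves contain the separator would break the list; in that case `--file-separator` can be set to a different character.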

Genome size estimation with bam files

Genome size estimation with bam files is now possible if an assembly is given.

$ nextflow run backmap.nf --assembly assembly.fasta --pacbio-bam-files pacbio1.bam,pacbio2.bam,pacbio3.bam
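
According to the help text, genome size is estimated from the mapped nucleotides and the peak coverage. The arithmetic can be sketched as below, assuming the estimate is simply the number of mapped nucleotides divided by the peak coverage. The exact computation used by the pipeline is not shown in this README, `estimate_genome_size` is a hypothetical helper, and the numbers are illustrative only.

```shell
# Estimate genome size as mapped nucleotides divided by peak coverage.
# Hypothetical helper; assumes the simple ratio described in the lead-in.
estimate_genome_size() {
  mapped_nt="$1"
  peak_cov="$2"
  awk -v n="$mapped_nt" -v c="$peak_cov" 'BEGIN {
    if (c <= 0) { print "peak coverage must be > 0" > "/dev/stderr"; exit 1 }
    # %.0f avoids integer overflow in awk implementations with 32-bit %d.
    printf "%.0f\n", n / c
  }'
}

# Illustrative numbers only: 3 Gbp of mapped nucleotides at 30x peak coverage.
estimate_genome_size 3000000000 30
```

With these example values the estimate is 100000000 bp (100 Mbp).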

Improved R scripts

R scripts have been improved and rewritten.

[Figure: R "All" plot from the original perl pipeline]

[Figure: R "All" plot from the new nextflow pipeline]

Resolved bugs of the perl pipeline

While running the original pipeline with various parameter combinations, some bugs appeared:

| Issue | Solved? | PR |
|---|---|---|
| keep temporary option did not work properly | yes | https://github.com/schellt/backmap/pull/1 |
| plot all did not work if illumina reads were not present | yes | https://github.com/schellt/backmap/pull/5 |
| pipeline crashed on rerun due to not enforcing symlink creation | yes | https://github.com/schellt/backmap/pull/4 |

Development/Test

Install dev/test environments

Development

The development environment includes everything needed to run the original perl pipeline, the new nextflow pipeline, and the snakemake test pipeline.

---
name: backmap-development
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - snakemake
  - mamba
  - nextflow
  - perl
  - perl-app-cpanminus
  - samtools
  - bedtools
  - bwa
  - minimap2
  - qualimap
  - multiqc
  - r-base
  - r-ggplot2
  - r-dplyr

Perl packages needed for original workflow

  • Cwd
  • IPC::Cmd
  • Number::FormatEng
  • Parallel::Loops

# create and activate conda environment
conda env create -f environment-dev.yml
conda activate backmap-development
# install perl packages and check if they are really available
env PERL5LIB="" PERL_LOCAL_LIB_ROOT="" PERL_MM_OPT="" PERL_MB_OPT="" cpanm Cwd IPC::Cmd Number::FormatEng Parallel::Loops
for i in Cwd IPC::Cmd Number::FormatEng Parallel::Loops;
do
  perl -M"$i" -e 'print "Module exists\n";'
done

Testing with Snakemake pipeline

Install

name: backmap-test
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - snakemake
  - mamba

Prepare test environments

# create and activate conda environment
conda env create -f environment-test.yml
conda activate backmap-test
# create environments for all jobs
snakemake --profile sm_profile/ --conda-create-envs-only -F
# install perl packages
snakemake --profile sm_profile/ -f install_perl_packages

Run all tests

snakemake --profile sm_profile/ -F

Linter

TODO


Original perl script

https://github.com/schellt/backmap

Citation

Schell T, Feldmeyer B, Schmidt H, Greshake B, Tills O et al. (2017). An Annotated Draft Genome for Radix auricularia (Gastropoda, Mollusca). Genome Biology and Evolution, 9(3):585–592, https://doi.org/10.1093/gbe/evx032

If you use this tool please cite the dependencies as well:


Internal notes

Directory

/vol/cb/projects/112020_tbg_backmap

Directory structure

Test files

The test dataset was delivered to us via JLUBox and is now stored at /vol/cb/projects/112020_tbg_backmap/dataset1/Test_datasets.zip

-r--r--r--  1 rmueller cb  16G Dec  1 15:13 Test_datasets.zip

Content

Archive:  Test_datasets.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
2209018648  2020-08-26 14:16   dgal_1.paired.confilter.fq.gz
2375279937  2020-08-26 14:14   dgal_2.paired.confilter.fq.gz
 38668802  2020-08-26 14:14   dgal_ra_pb-target-and-other_ill-confilter.blobfilter.rmmt_sspace-lr3_lrgc3_pg3_pilon_3.fasta.gz
1659448324  2020-08-26 14:10   pacbio_gal.target-and-other.blobfilter.fastq.gz
473770619  2020-08-26 14:11   dgal.confilter.fq.gz
2760385779  2020-08-26 14:09   dgal_ra_pb-target-and-other_ill-confilter.blobfilter.rmmt_sspace-lr3_lrgc3_pg3_pilon_3.fasta.pb.sort.bam
6667551038  2020-08-26 14:13   dgal_ra_pb-target-and-other_ill-confilter.blobfilter.rmmt_sspace-lr3_lrgc3_pg3_pilon_3.fasta.sort.bam
---------                     -------
16184123147                     7 files

The dataset contains no Nanopore reads; the snakemake pipeline simulates them.

| Purpose | File |
|---|---|
| Genome assembly | dgal_ra_pb-target-and-other_ill-confilter.blobfilter.rmmt_sspace-lr3_lrgc3_pg3_pilon_3.fasta.gz |
| Illumina forward reads | dgal_1.paired.confilter.fq.gz |
| Illumina reverse reads | dgal_2.paired.confilter.fq.gz |
| Illumina unpaired reads | dgal.confilter.fq.gz |
| PacBio reads | pacbio_gal.target-and-other.blobfilter.fastq.gz |
| Illumina mapping | dgal_ra_pb-target-and-other_ill-confilter.blobfilter.rmmt_sspace-lr3_lrgc3_pg3_pilon_3.fasta.sort.bam |
| PacBio mapping | dgal_ra_pb-target-and-other_ill-confilter.blobfilter.rmmt_sspace-lr3_lrgc3_pg3_pilon_3.fasta.pb.sort.bam |

Footnotes

  1. New dependencies appear green