### Dependencies [^green]
- {+ nextflow +}
- samtools
- bedtools
- bwa
- minimap2
- qualimap
- multiqc
- r-base
- {+ r-ggplot2 +}
- {+ r-dplyr +}
### Installation with conda
```
conda env create -f environment.yml
conda activate backmap
```
## Usage
```
backmap v0.4

Description:
    Automatic mapping of paired, unpaired, PacBio and Nanopore reads to an
    assembly, execution of qualimap bamqc, multiqc and estimation of genome size
    from mapped nucleotides and peak coverage.
    With the provided conda environment configuration file (environment.yml),
    a conda environment with all necessary tools can be created.
    Be aware that this pipeline was transferred to a nextflow pipeline with major
    changes to version v0.3.

Usage:
    With mapping:
        nextflow run backmap.nf --assembly assembly.fa
            [--single-end <unpaired.fq>[,unpaired2.fq,...]]
            [--paired-end <paired1.1.fq>,<paired1.2.fq>[,<paired2.1.fq>,<paired2.2.fq>,...]]
            [--pacbio <pacbio.fq>[,<pacbio2.fq>,...]]
            [--nanopore <nanopore.fq>[,<nanopore2.fq>,...]]
    Without mapping:
        nextflow run backmap.nf [--assembly assembly.fa]
            [--illumina-bam-files illumina.bam[,illumina2.bam,...]]
            [--pacbio-bam-files pacbio.bam[,pacbio2.bam,...]]
            [--nanopore-bam-files nanopore.bam[,nanopore2.bam,...]]

Mandatory options:
    --assembly STR                  Assembly were reads should mapped to in fasta format
    AND AT LEAST ONE OF
    --single-end STR                Fastq files with unpaired Illumina reads, comma separated
    --paired-end STR                Two fastq files with paired Illumina reads, comma separated
    --pacbio STR                    Fasta or fastq file with PacBio reads, comma separated
    --nanopore STR                  Fasta or fastq file with Nanopore reads, comma separated
    OR
    --illumina-bam-files STR        Premapped Illumina Reads in bam format, comma separated
    --pacbio-bam-files STR          Premapped Pacbio Reads in bam format, comma separated
    --nanopore-bam-files STR        Premapped Nanopore Reads in bam format, comma separated
                                    Skips read mapping
                                    Genome size estimation only possible if assembly is given

Optional Options: [default]
    --file-separator STR            Overrides file separator [,]
    --output STR                    Output directory [./Results/]
                                    Will be created if not existing
    --prefix STR                    Prefix of output files
    --keep-temporary                Keep temporary, intermediate files [false]
    --threads INT                   Maximum number of threads per process [1]
                                    This is not the total number of available threads!
    --skip-quality-control          Skip qualimap bamqc run [false]
    --skip-coverage-histogram       Skip creation of coverage histograms [false]
    --skip-genome-size-estimation   Skip genome size estimation [false]
                                    genome size estimation is only possible, if assembly is given
    --bwa-options STR               Options passed to bwa [-a -c 10000]
    --minimap-pacbio-options STR    Options passed to minimap ["-H"]
    --minimap-nanopore-options STR  Options passes to nanopore minimap runs [""]
    --qualimap-options STR          Options passed to qualimap [""]
                                    Pass options with quotes e.g. --bwa-options "<options>"
    --help                          Print this help and exit
    --version                       Print version number and exit
    --debug                         Print inner variables
```
- {+ all options now have a long version; short versions are currently not available +}
- {+ added options `--illumina-bam-files`, `--pacbio-bam-files` and `--nanopore-bam-files`, which replace `-b` +}
- {+ `--debug` prints internal variables for faster development and bug investigation +}
- {+ `--threads` sets the maximum number of threads per process (not the maximum number of threads of the whole workflow) +}
- {- removed `-b`: split into three different options -}
- {- removed `-sort`: every bam is now sorted -}
- {- removed `-v`: this is a nextflow feature -}
- {- removed `-dry-run` -}
- {- removed `-t`: handled by nextflow -}
Multiple files of one kind, e.g., three different PacBio runs, are now given as one comma-separated string instead of repeating the parameter:

    $ nextflow run backmap.nf --assembly assembly.fasta --pacbio pacbio1.fq,pacbio2.fq,pacbio3.fq
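
When the file names come from a list or a glob, the comma-separated value can be assembled in the shell. The following is only a minimal sketch: `join_by_comma` is a hypothetical helper and not part of backmap, and the file names are the placeholders from the example above.

```shell
# join_by_comma is a hypothetical helper (not part of backmap).
# It joins all of its arguments with "," and adds no trailing separator.
# The function body runs in a subshell, so changing IFS does not leak out.
join_by_comma() (
    IFS=,
    printf '%s' "$*"
)

pacbio_runs=$(join_by_comma pacbio1.fq pacbio2.fq pacbio3.fq)

# Print the command instead of executing it, so the sketch runs without nextflow.
echo "nextflow run backmap.nf --assembly assembly.fasta --pacbio $pacbio_runs"
```

If a different separator is set with `--file-separator`, the `IFS` value in the helper would have to be changed to match.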
### Genome size estimation with bam files
Genome size estimation with bam files is now possible if the assembly is given:

    $ nextflow run backmap.nf --assembly assembly.fasta --pacbio-bam-files pacbio1.bam,pacbio2.bam,pacbio3.bam
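
The help text describes the estimate as derived "from mapped nucleotides and peak coverage". As a rough illustration of that idea, the sketch below divides the number of mapped nucleotides by the most frequent non-zero coverage. The two-column histogram format, the toy numbers and the helper name `estimate_genome_size` are assumptions made for this example; the pipeline's actual implementation may differ.

```shell
# Toy coverage histogram (made-up numbers, for illustration only):
# column 1 = coverage depth, column 2 = number of positions with that depth.
histogram='0 50
1 10
2 30
3 40
4 20'

# estimate_genome_size is a hypothetical helper, not a backmap command.
# mapped nucleotides = sum(depth * positions) over all depths > 0
# peak coverage      = depth > 0 that covers the most positions
estimate_genome_size() {
    awk '
        $1 > 0 {
            mapped += $1 * $2
            if ($2 > max_count) { max_count = $2; peak = $1 }
        }
        END {
            if (peak > 0) printf "%d\n", mapped / peak
        }'
}

genome_size=$(printf '%s\n' "$histogram" | estimate_genome_size)
echo "estimated genome size: $genome_size"
```

For the toy histogram, 270 mapped nucleotides and a peak coverage of 3 give an estimate of 90.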
R scripts have been improved and rewritten.
![R All plot from the original perl pipeline][oldRall]
![R All plot from the new nextflow pipeline][newRall]
Running the original pipeline with different parameter combinations revealed some bugs:
| Issue | Solved? | PR |
| --------------------------------------------------------------- | ------- | ------------------------------------------|
| keep temporary option did not work properly | yes | https://github.com/schellt/backmap/pull/1 |
| plot all did not work, if illumina reads were not present | yes | https://github.com/schellt/backmap/pull/5 |
| pipeline crashed on rerun due to not enforcing symlink creation | yes | https://github.com/schellt/backmap/pull/4 |
The development environment includes everything needed to run the original perl pipeline, the newly created nextflow pipeline, and the snakemake test pipeline:
```yaml
---
name: backmap-development
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - snakemake
  - mamba
  - nextflow
  - perl
  - perl-app-cpanminus
  - samtools
  - bedtools
  - bwa
  - minimap2
  - qualimap
  - multiqc
  - r-base
  - r-ggplot2
  - r-dplyr
```

Perl packages needed for the original workflow:
- Cwd
- IPC::Cmd
- Number::FormatEng
- Parallel::Loops

```
# create and activate conda environment
conda activate backmap-development
# install perl packages and check that they are really available
env PERL5LIB="" PERL_LOCAL_LIB_ROOT="" PERL_MM_OPT="" PERL_MB_OPT="" cpanm Cwd IPC::Cmd Number::FormatEng Parallel::Loops
for i in Cwd IPC::Cmd Number::FormatEng Parallel::Loops;
do
    perl -M"$i" -e 'print "Module exists\n";'
done
```
#### Testing with Snakemake pipeline
##### Install
```yaml
name: backmap-test
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- snakemake
- mamba
```
##### Prepare Test environments
```
# create and activate conda environment
conda env create -f environment-test.yml
conda activate backmap-test
# create environments for all jobs
snakemake --profile sm_profile/ --conda-create-envs-only -F
snakemake --profile sm_profile/ -f install_perl_packages
```
##### Run all tests
----------------------------------------
# Original perl script
https://github.com/schellt/backmap
### Citation
Schell T, Feldmeyer B, Schmidt H, Greshake B, Tills O et al. (2017). An Annotated Draft Genome for _Radix auricularia_ (Gastropoda, Mollusca). _Genome Biology and Evolution_, 9(3):585–592, <https://doi.org/10.1093/gbe/evx032>
__If you use this tool please cite the dependencies as well:__
- samtools:
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J et al. (2009). The Sequence Alignment/Map format and SAMtools. _Bioinformatics_, 25(16):2078–2079, <https://doi.org/10.1093/bioinformatics/btp352>
- bwa mem:
Li H (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. _arXiv preprint arXiv:1303.3997_.
- minimap2:
Li H (2018). Minimap2: pairwise alignment for nucleotide sequences. _Bioinformatics_, 34:3094–3100, <https://doi.org/10.1093/bioinformatics/bty191>
- Qualimap:
Okonechnikov K, Conesa A, García-Alcalde F (2016). Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. _Bioinformatics_, 32(2):292–294, <https://doi.org/10.1093/bioinformatics/btv566>
- MultiQC:
Ewels P, Magnusson M, Lundin S, Käller M (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. _Bioinformatics_, 32(19):3047–3048, <https://doi.org/10.1093/bioinformatics/btw354>
- bedtools:
Quinlan AR, Hall IM (2010). BEDTools: a flexible suite of utilities for comparing genomic features. _Bioinformatics_, 26(6):841–842, <https://doi.org/10.1093/bioinformatics/btq033>
- Rscript:
R Core Team (2019). R: A Language and Environment for Statistical Computing. <http://www.R-project.org/>
--------------------------------------------------------------------------------------
## Directory
`/vol/cb/projects/112020_tbg_backmap`
## Directory structure
## Test files
The test dataset was provided to us via JLUBox and is now stored at
`/vol/cb/projects/112020_tbg_backmap/dataset1/Test_datasets.zip`
```
-r--r--r-- 1 rmueller cb 16G Dec 1 15:13 Test_datasets.zip
```
### Content
```
Archive: Test_datasets.zip
Length Date Time Name
--------- ---------- ----- ----
2209018648 2020-08-26 14:16 dgal_1.paired.confilter.fq.gz
2375279937 2020-08-26 14:14 dgal_2.paired.confilter.fq.gz
38668802 2020-08-26 14:14 dgal_ra_pb-target-and-other_ill-confilter.blobfilter.rmmt_sspace-lr3_lrgc3_pg3_pilon_3.fasta.gz
1659448324 2020-08-26 14:10 pacbio_gal.target-and-other.blobfilter.fastq.gz
473770619 2020-08-26 14:11 dgal.confilter.fq.gz
2760385779 2020-08-26 14:09 dgal_ra_pb-target-and-other_ill-confilter.blobfilter.rmmt_sspace-lr3_lrgc3_pg3_pilon_3.fasta.pb.sort.bam
6667551038 2020-08-26 14:13 dgal_ra_pb-target-and-other_ill-confilter.blobfilter.rmmt_sspace-lr3_lrgc3_pg3_pilon_3.fasta.sort.bam
--------- -------
16184123147 7 files
```
There are no Nanopore reads; the snakemake pipeline simulates them.

| Content | File |
| ----------------------- | ---- |
| Genome assembly | dgal_ra_pb-target-and-other_ill-confilter.blobfilter.rmmt_sspace-lr3_lrgc3_pg3_pilon_3.fasta.gz |
| Illumina forward reads | dgal_1.paired.confilter.fq.gz |
| Illumina reverse reads | dgal_2.paired.confilter.fq.gz |
| Illumina unpaired reads | dgal.confilter.fq.gz |
| PacBio reads | pacbio_gal.target-and-other.blobfilter.fastq.gz |
| Illumina mapping | dgal_ra_pb-target-and-other_ill-confilter.blobfilter.rmmt_sspace-lr3_lrgc3_pg3_pilon_3.fasta.sort.bam |
| PacBio mapping | dgal_ra_pb-target-and-other_ill-confilter.blobfilter.rmmt_sspace-lr3_lrgc3_pg3_pilon_3.fasta.pb.sort.bam |
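
As a rough guide, the read files listed above map onto the options from the Usage section as sketched below. This assumes the archive has been extracted into the current directory. Whether the pipeline reads gzip-compressed input directly is not stated here, so the files may need to be decompressed first.

```shell
# File names are taken from the table above; they are assumed to be
# extracted into the current directory.
assembly="dgal_ra_pb-target-and-other_ill-confilter.blobfilter.rmmt_sspace-lr3_lrgc3_pg3_pilon_3.fasta.gz"
paired="dgal_1.paired.confilter.fq.gz,dgal_2.paired.confilter.fq.gz"
unpaired="dgal.confilter.fq.gz"
pacbio="pacbio_gal.target-and-other.blobfilter.fastq.gz"

# Assemble the call with the options documented in the Usage section.
cmd="nextflow run backmap.nf --assembly $assembly --paired-end $paired --single-end $unpaired --pacbio $pacbio"

# Print instead of executing, so the sketch runs without nextflow or the data.
echo "$cmd"
```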
[oldRall]: img/RallOld.png "Resulting graph of the original perl pipeline"
[newRall]: img/RallNew.png "Resulting graph of the new nextflow pipeline"
[^green]: New dependencies appear green
----------------------------------------
# Footnotes