README.md 12.2 KB
Newer Older
Raphael Müller's avatar
Raphael Müller committed
1
# Backmap Nextflow pipeline
2

3
4
## Install

5
### Dependencies [^green]
6
7
8
9
10
11
12
13
14

  - {+ nextflow +}
  - samtools
  - bedtools
  - bwa
  - minimap2
  - qualimap
  - multiqc
  - r-base
15
16
  - {+ r-ggplot2 +}
  - {+ r-dplyr +}
17
18
19

### Installation with conda

20
21
22
23
24
```
conda env create -f environment.yml
conda activate backmap
```

25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
## Usage

```
backmap v0.4

Description:
        Automatic mapping of paired, unpaired, PacBio and Nanopore reads to an
        assembly, execution of qualimap bamqc, multiqc and estimation of genome size
        from mapped nucleotides and peak coverage.
        With the provided conda environment configuration file (environment.yml),
        a conda environment with all necessary tools can be created.

        Be aware that this pipeline was transferred to a nextflow pipeline with major
        changes to version v0.3.

Usage:
        With mapping:
                nextflow run backmap.nf  --assembly assembly.fa
                                        [--single-end <unpaired.fq>[,unpaired2.fq,...]]
                                        [--paired-end <paired1.1.fq>,<paired1.2.fq>[,<paired2.1.fq>,<paired2.2.fq>,...]]
                                        [--pacbio <pacbio.fq>[,<pacbio2.fq>,...]]
                                        [--nanopore <nanopore.fq>[,<nanopore2.fq>,...]]

        Without mapping:
                nextflow run backmap.nf [--assembly assembly.fa]
                                        [--illumina-bam-files illumina.bam[,illumina2.bam,...]]
                                        [--pacbio-bam-files pacbio.bam[,pacbio2.bam,...]]
                                        [--nanopore-bam-files nanopore.bam[,nanopore2.bam,...]]


Mandatory options:
                --assembly                      STR     Assembly were reads should mapped to in fasta format
        AND AT LEAST ONE OF
                --single-end                    STR     Fastq files with unpaired Illumina reads, comma separated
                --paired-end                    STR     Two fastq files with paired Illumina reads, comma separated
                --pacbio                        STR     Fasta or fastq file with PacBio reads, comma separated
                --nanopore                      STR     Fasta or fastq file with Nanopore reads, comma separated
        OR
                --illumina-bam-files            STR     Premapped Illumina Reads in bam format, comma separated
                --pacbio-bam-files              STR     Premapped Pacbio Reads in bam format, comma separated
                --nanopore-bam-files            STR     Premapped Nanopore Reads in bam format, comma separated
                                                        Skips read mapping
                                                        Genome size estimation only possible if assembly is given

Optional Options: [default]
                --file-separator                STR     Overrides file separator [,]
                --output                        STR     Output directory [./Results/]
                                                        Will be created if not existing
                --prefix                        STR     Prefix of output files
Raphael Müller's avatar
Raphael Müller committed
74
                --keep-temporary                        Keep temporary, intermediate files [false]
Raphael Müller's avatar
Raphael Müller committed
75
                --threads                       INT     Maximum number of threads per process [1]
Raphael Müller's avatar
Raphael Müller committed
76
77
                                                        This is not the total number of available threads!

78
79
80
81
                --skip-quality-control                  Skip qualimap bamqc run [false]
                --skip-coverage-histogram               Skip creation of coverage histograms [false]
                --skip-genome-size-estimation           Skip genome size estimation [false]
                                                        genome size estimation is only possible, if assembly is given
Raphael Müller's avatar
Raphael Müller committed
82
83


84
85
86
87
88
89
90
91
92
93
94
95
                --bwa-options                   STR     Options passed to bwa [-a -c 10000]
                --minimap-pacbio-options        STR     Options passed to minimap ["-H"]
                --minimap-nanopore-options      STR     Options passes to nanopore minimap runs [""]
                --qualimap-options              STR     Options passed to qualimap [""]
                                                        Pass options with quotes e.g. --bwa-options "<options>"

                --help                                  Print this help and exit
                --version                               Print version number and exit
                --debug                                 Print inner variables

```

Raphael Müller's avatar
Raphael Müller committed
96
## Changes to the original perl pipeline
97

Raphael Müller's avatar
Raphael Müller committed
98
### Parameters
99

Raphael Müller's avatar
Raphael Müller committed
100
 - {+ all options now have a long version, short version is currently not available +}
101
102
103
104
105
106
107
108
 - {+ added option \`--illumina-bam-files\`, \`--pacbio-bam-files\`, \`--nanopore-bam-files\`, which substitute \`-b\` +}
 - {+ \`--debug\` prints internal variables for faster development and bug investigations +}
 - {+ \`--threads\` sets the maximum number of threads per process (not maximum number of threads of the whole workflow) +}
 - {- removed \`-b\`:splitted into three different options -}
 - {- removed \`-sort\`: every bam is now sorted -}
 - {- removed \`-v\`: removed, because this is a nextflow feature -}
 - {- removed \`-dry-run\` -}
 - {- removed \`-t\`: handled by nextflow -}
Raphael Müller's avatar
Raphael Müller committed
109
110

### Multiple files of one kind
111

Raphael Müller's avatar
Raphael Müller committed
112
Multiple files of one kind, e.g., three different PacBio runs, are now given as one string with each run comma separated instead of giving one parameter multiple times.
113
```
Raphael Müller's avatar
Raphael Müller committed
114
$ nextflow run backmap.nf --assembly assembly.fasta --pacbio pacbio1.fq,pacbio2.fq,pacbio3.fq
115
116
```

Raphael Müller's avatar
Raphael Müller committed
117
118
119
### Genome size estimation with bam files

Genome size estimation with bam files can now be done if assembly is given
120
121

```
Raphael Müller's avatar
Raphael Müller committed
122
$ nextflow run backmap.nf --assembly assembly.fasta --pacbio-bam-files-- pacbio1.bam,pacbio2.bam,pacbio3.bam
123
124
```

Raphael Müller's avatar
Raphael Müller committed
125
### Improved R scripts
126

Raphael Müller's avatar
Raphael Müller committed
127
R scripts have been improved and rewritten.
128

129
130
#### R All plot perl

Raphael Müller's avatar
Raphael Müller committed
131
![R All plot from the original perl pipeline][oldRall]
132
133
134

#### R All plot perl

Raphael Müller's avatar
Raphael Müller committed
135
![R All plot from the new nextflow pipeline][newRall]
136

Raphael Müller's avatar
Raphael Müller committed
137
### Resolved bugs of the perl pipeline
138

Raphael Müller's avatar
Raphael Müller committed
139
140
141
142
143
144
While running the original pipeline with different kinds of parameters, some bugs appeared
|                              Issue                              | Solved? |                      PR                   |
| --------------------------------------------------------------- | ------- | ------------------------------------------|
| keep temporary option did not work properly                     |   yes   | https://github.com/schellt/backmap/pull/1 |
| plot all did not work, if illumina reads were not present       |   yes   | https://github.com/schellt/backmap/pull/5 |
| pipeline crashed on rerun due to not enforcing symlink creation |   yes   | https://github.com/schellt/backmap/pull/4 |
145

146
## Development/Test
147

Raphael Müller's avatar
Raphael Müller committed
148
### Install dev/test environments
149

Raphael Müller's avatar
Raphael Müller committed
150
#### Development
151

Raphael Müller's avatar
Raphael Müller committed
152
The development environment includes everything, which is needed for running the original perl pipeline, the newly created nextflow pipeline, and the snakemake test pipeline
153

154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
```yaml
---
name: backmap-development
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - snakemake
  - mamba
  - nextflow
  - perl
  - perl-app-cpanminus
  - samtools
  - bedtools
  - bwa
  - minimap2
  - qualimap
  - multiqc
  - r-base
  - r-ggplot2
  - r-dplyr
Raphael Müller's avatar
Raphael Müller committed
176
177
```

178
179
180
181
182
183
Perl packages needed for original workflow

 - Cwd
 - IPC::Cmd
 - Number::FormatEng
 - Parallel::Loops
Raphael Müller's avatar
Raphael Müller committed
184
185

```
186
# create and activate conda environment
Raphael Müller's avatar
Raphael Müller committed
187
conda env create -f environment-dev.yml
188
189
190
191
192
193
194
conda activate backmap-development
# install perl packages and check if they are really available
env PERL5LIB="" PERL_LOCAL_LIB_ROOT="" PERL_MM_OPT="" PERL_MB_OPT="" cpanm Cwd IPC::Cmd Number::FormatEng Parallel::Loops
for i in Cwd IPC::Cmd Number::FormatEng Parallel::Loops;
do
  perl -M"$i" -e 'print "Modul exists\\n";'
done
Raphael Müller's avatar
Raphael Müller committed
195
196
```

197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
#### Testing with Snakemake pipeline

##### Install

```yaml
name: backmap-test
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - snakemake
  - mamba
```

##### Prepare Test environments
Raphael Müller's avatar
Raphael Müller committed
213
214

```
215
# create and activate conda environment
Raphael Müller's avatar
Raphael Müller committed
216
217
conda env create -f environment-test.yml
conda activate backmap-test
218
# create environments for all jobs
Raphael Müller's avatar
Raphael Müller committed
219
snakemake --profile sm_profile/ --conda-create-envs-only -F
220
# install perl packages
Raphael Müller's avatar
Raphael Müller committed
221
snakemake --profile sm_profile/ -f install_perl_packages
222
223
224
225
226
```

##### Run all tests

```
Raphael Müller's avatar
Raphael Müller committed
227
228
snakemake --profile sm_profile/ -F 
```
229

230
231
232
#### Linter

TODO
233

234
235
236
----------------------------------------

# Original perl script
Raphael Müller's avatar
Raphael Müller committed
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259

https://github.com/schellt/backmap

### Citation
Schell T, Feldmeyer B, Schmidt H, Greshake B, Tills O et al. (2017). An Annotated Draft Genome for _Radix auricularia_ (Gastropoda, Mollusca). _Genome Biology and Evolution_, 9(3):585–592, <https://doi.org/10.1093/gbe/evx032>

__If you use this tool please cite the dependencies as well:__

- samtools:  
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J et al. (2009). The Sequence Alignment/Map format and SAMtools. _Bioinformatics_, 25(16):2078–2079, <https://doi.org/10.1093/bioinformatics/btp352>
- bwa mem:  
Li H (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. _arXiv preprint arXiv:1303.3997_.
- minimap2:  
Li H (2018). Minimap2: pairwise alignment for nucleotide sequences. _Bioinformatics_, 34:3094–3100, <https://doi.org/10.1093/bioinformatics/bty191>
- Qualimap:  
Okonechnikov K, Conesa A, García-Alcalde F (2016). Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. _Bioinformatics_, 32(2):292–294, <https://doi.org/10.1093/bioinformatics/btv566>
- MultiQC:  
Ewels P, Magnusson M, Lundin S, Käller M (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. _Bioinformatics_, 32(19):3047–3048, <https://doi.org/10.1093/bioinformatics/btw354>
- bedtools:  
Quinlan AR, Hall IM (2010). BEDTools: a flexible suite of utilities for comparing genomic features. _Bioinformatics_, 26(6):841–842, <https://doi.org/10.1093/bioinformatics/btq033>
- Rscript:  
R Core Team (2019). R: A Language and Environment for Statistical Computing. <http://www.R-project.org/>

260
261
--------------------------------------------------------------------------------------

Raphael Müller's avatar
Raphael Müller committed
262
# Internal notes
Raphael Müller's avatar
Raphael Müller committed
263

264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
## Directory

`/vol/cb/projects/112020_tbg_backmap`

## Directory structure

## Test files

The dataset for testing was brought to us via JLUBox and is now stored at
`/vol/cb/projects/112020_tbg_backmap/dataset1/Test_datasets.zip`

```
-r--r--r--  1 rmueller cb  16G Dec  1 15:13 Test_datasets.zip
```

### Content
```
Archive:  Test_datasets.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
2209018648  2020-08-26 14:16   dgal_1.paired.confilter.fq.gz
2375279937  2020-08-26 14:14   dgal_2.paired.confilter.fq.gz
 38668802  2020-08-26 14:14   dgal_ra_pb-target-and-other_ill-confilter.blobfilter.rmmt_sspace-lr3_lrgc3_pg3_pilon_3.fasta.gz
1659448324  2020-08-26 14:10   pacbio_gal.target-and-other.blobfilter.fastq.gz
473770619  2020-08-26 14:11   dgal.confilter.fq.gz
2760385779  2020-08-26 14:09   dgal_ra_pb-target-and-other_ill-confilter.blobfilter.rmmt_sspace-lr3_lrgc3_pg3_pilon_3.fasta.pb.sort.bam
6667551038  2020-08-26 14:13   dgal_ra_pb-target-and-other_ill-confilter.blobfilter.rmmt_sspace-lr3_lrgc3_pg3_pilon_3.fasta.sort.bam
---------                     -------
16184123147                     7 files
```

There are no Nanopore reads. The snakemake pipeline simulates them
Raphael Müller's avatar
Raphael Müller committed
296

297
| Purpose | File |
Raphael Müller's avatar
Raphael Müller committed
298
| ---- | ---- |
299
| Genome assembly | dgal_ra_pb-target-and-other_ill-confilter.blobfilter.rmmt_sspace-lr3_lrgc3_pg3_pilon_3.fasta.gz |
Raphael Müller's avatar
Raphael Müller committed
300
301
302
303
304
305
| Illumina forward reads | dgal_1.paired.confilter.fq.gz | 
| Illumina reverse reads | dgal_2.paired.confilter.fq.gz | 
| Illumina unpaired reads | dgal.confilter.fq.gz | 
| PacBio reads | pacbio_gal.target-and-other.blobfilter.fastq.gz | 
| Illumina mapping | dgal_ra_pb-target-and-other_ill-confilter.blobfilter.rmmt_sspace-lr3_lrgc3_pg3_pilon_3.fasta.sort.bam | 
| PacBio mapping | dgal_ra_pb-target-and-other_ill-confilter.blobfilter.rmmt_sspace-lr3_lrgc3_pg3_pilon_3.fasta.pb.sort.bam | 
Raphael Müller's avatar
Raphael Müller committed
306

Raphael Müller's avatar
Raphael Müller committed
307
308
[oldRall]: img/RallOld.png "Resulting graph of the original perl pipeline"
[newRall]: img/RallNew.png "Resulting graph of the new nextflor pipeline"
309
[^green]: New dependencies appear green
Raphael Müller's avatar
Raphael Müller committed
310
311
312
313

----------------------------------------

# Footnotes