README.md 16.2 KB
Newer Older
Raphael Müller's avatar
Raphael Müller committed
1
# Backmap Nextflow pipeline
2

3
4
## Install

5
### Dependencies [^green]
6
7
8
9
10
11
12
13
14

  - {+ nextflow +}
  - samtools
  - bedtools
  - bwa
  - minimap2
  - qualimap
  - multiqc
  - r-base
15
16
  - {+ r-ggplot2 +}
  - {+ r-dplyr +}
17
18
19

### Installation with conda

20
21
22
23
24
```
conda env create -f environment.yml
conda activate backmap
```

25
26
27
## Usage

```
Raphael Müller's avatar
Raphael Müller committed
28
backmap v0.5
29
30

Description:
Raphael Müller's avatar
Raphael Müller committed
31
32
33
34
35
    Automatic mapping of paired, unpaired, PacBio and Nanopore reads to an
    assembly, execution of qualimap bamqc, multiqc and estimation of genome size
    from mapped nucleotides and peak coverage.
    With the provided conda environment configuration file (environment.yml),
    a conda environment with all necessary tools can be created.
36

Raphael Müller's avatar
Raphael Müller committed
37
38
    Be aware that this pipeline was transferred to a nextflow pipeline with major
    changes from version v0.3 to v.0.4.
39
40

Usage:
Raphael Müller's avatar
Raphael Müller committed
41
42
43
44
45
46
47
48
49
50
51
52
53
54
    With mapping:
        nextflow run backmap.nf  --assembly assembly.fa
                                [--single-end <unpaired.fq>[,unpaired2.fq,...]]
                                [--paired-end <paired1.1.fq>,<paired1.2.fq>[,<paired2.1.fq>,<paired2.2.fq>,...]]
                                [--pacbio <pacbio.fq>[,<pacbio2.fq>,...]]
                                [--pacbio-ccs <pacbio_ccs.fq>[,<pacbio_ccs2.fq>,...]]
                                [--nanopore <nanopore.fq>[,<nanopore2.fq>,...]]

    Without mapping:
        nextflow run backmap.nf [--assembly assembly.fa]
                                [--illumina-bam-files illumina.bam[,illumina2.bam,...]]
                                [--pacbio-bam-files pacbio.bam[,pacbio2.bam,...]]
                                [--pacbio-ccs-bam-files pacbio_ccs.bam[,pacbio_ccs2.bam,...]]
                                [--nanopore-bam-files nanopore.bam[,nanopore2.bam,...]]
55
56

Mandatory options:
Raphael Müller's avatar
Raphael Müller committed
57
58
59
60
61
62
63
64
65
66
67
68
69
70
        --assembly                    STR    Assembly were reads should mapped to in fasta format
    AND AT LEAST ONE OF
        --single-end                  STR    Fastq files with unpaired Illumina reads, comma separated
        --paired-end                  STR    Two fastq files with paired Illumina reads, comma separated
        --pacbio                      STR    Fasta or fastq file with PacBio reads (CLR), comma separated
        --pacbio-ccs                  STR    Fasta or fastq file with PacBio CCS reads, comma separated
        --nanopore                    STR    Fasta or fastq file with Nanopore reads, comma separated
    OR
        --illumina-bam-files          STR    Premapped Illumina Reads in bam format, comma separated
        --pacbio-bam-files            STR    Premapped Pacbio Reads in bam format, comma separated
        --pacbio-ccs-bam-files        STR    Premapped Pacbio CCS Reads in bam formac, comma separated 
        --nanopore-bam-files          STR    Premapped Nanopore Reads in bam format, comma separated
                                             Skips read mapping
                                             Genome size estimation only possible if assembly is given
71
72

Optional Options: [default]
Raphael Müller's avatar
Raphael Müller committed
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
        --file-separator              STR    Overrides file separator [,]
        --output                      STR    Output directory [./Results/]
                                             Will be created if not existing
        --prefix                      STR    Prefix of output files [tbg_backmap]
        --keep-temporary                     Keep temporary, intermediate files [false]
        --threads                     INT    Maximum number of threads per process [1]
                                             This is not the total number of available threads!

        --skip-quality-control               Skip qualimap bamqc run [false]
        --skip-coverage-histogram            Skip creation of coverage histograms [false]
        --skip-genome-size-estimation        Skip genome size estimation [false]
                                             genome size estimation is only possible, if assembly is given

        --bwa-options                 STR    Options passed to bwa [-a -c 10000]
        --minimap-pacbio-options      STR    Options passed to pacbio CLR minimap runs ["-H"]
        --minimap-pacbio-ccs-options  STR    Options passed to pacbio CCS minimap runs [""]
        --minimap-nanopore-options    STR    Options passes to nanopore minimap runs [""]
        --qualimap-options            STR    Options passed to qualimap [""]
                                             Pass options with quotes e.g. --bwa-options "<options>"

        --help                               Print this help and exit
        --version                            Print version number and exit
        --debug                              Print inner variables
96
97
98

```

Raphael Müller's avatar
Raphael Müller committed
99
100
101
## Graphical description

```mermaid
Raphael Müller's avatar
Raphael Müller committed
102
graph TD;
Raphael Müller's avatar
Raphael Müller committed
103
  Assembly[[Assembly]] --> BWAIndex{bwa index};
Raphael Müller's avatar
Raphael Müller committed
104
  Assembly --> MinimapP{minimap2};
Raphael Müller's avatar
Raphael Müller committed
105
  Assembly --> MinimapPCCS{minimap2};
Raphael Müller's avatar
Raphael Müller committed
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
  Assembly --> MinimapN{minimap2};
  BWAI --> BWAMemSE{bwa mem};
  BWAI --> BWAMemPE{bwa mem};
  
  subgraph Illumina
    BWAIndex{bwa index} --> BWAI[BWA Index Files];
    SER[[Single-end Reads]] --> BWAMemSE;
    PER[[Paired-end Reads]] --> BWAMemPE;
    BWAMemSE --> SEBam[[Single-end Mapped Reads]] --> SMI{samtools merge};
    BWAMemPE --> PEBam[[Paired-end Mapped Reads]] --> SMI{samtools merge};
    SMI --> IMer[Illumina Merged Bam Files] --> SSI{samtools sort};
    SSI --> ISor[Illumina Sorted Bam Files];
  end

  subgraph PacBio
Raphael Müller's avatar
Raphael Müller committed
121
122
    PB[[PacBio CLR Reads]] --> MinimapP;
    PBCCS[[PacBio CCS Reads]] --> MinimapPCCS;
Raphael Müller's avatar
Raphael Müller committed
123
    MinimapP --> MBam[[Pacbio Mapped Reads]] --> SMP{samtools merge};
Raphael Müller's avatar
Raphael Müller committed
124
    MinimapPCCS --> MPCCSBam[[Pacbio CCS Mapped Reads]] --> SMPCCS{samtools merge};
Raphael Müller's avatar
Raphael Müller committed
125
    SMP --> PMer[Pacbio Merged Bam Files] --> SSP{samtools sort};
Raphael Müller's avatar
Raphael Müller committed
126
    SMPCCS --> PCCSMer[Pacbio CCS Merged Bam Files] --> SSPCCS{samtools sort};
Raphael Müller's avatar
Raphael Müller committed
127
    SSP --> PSor[Pacbio Sorted Bam Files];
Raphael Müller's avatar
Raphael Müller committed
128
    SSPCCS --> PCCSSor[Pacbio CCS Sorted Bam Files];
Raphael Müller's avatar
Raphael Müller committed
129
130
131
132
133
134
135
136
137
  end

  subgraph Nanopore
    ONT[[Nanopore Reads]] --> MinimapN; 
    MinimapN --> NBam[[Nanopore Mapped Reads]] --> SMN{samtools merge};
    SMN --> NMer[Nanopore Merged Bam Files] --> SSN{samtools sort};
    SSN --> NSor[Nanopore Sorted Bam Files];
  end

Raphael Müller's avatar
Raphael Müller committed
138
139
  ISor --> QMI{qualimap} --> MultiQC{{multiqc}}
  PSor --> QMI
Raphael Müller's avatar
Raphael Müller committed
140
  PCCSSor --> QMI
Raphael Müller's avatar
Raphael Müller committed
141
142
  NSor --> QMI
  
Raphael Müller's avatar
Raphael Müller committed
143

Raphael Müller's avatar
Raphael Müller committed
144
145
  ISor --> bedI{bedtools genomecov} --> RCovSingleI{{"R Coverage Plot (single)"}}
  PSor --> bedI
Raphael Müller's avatar
Raphael Müller committed
146
  PCCSSor --> bedI
Raphael Müller's avatar
Raphael Müller committed
147
  NSor --> bedI
Raphael Müller's avatar
Raphael Müller committed
148

Raphael Müller's avatar
Raphael Müller committed
149
150
151
152
  bedI --> RCovMulti{{"R Coverage Plot (multi)"}}
  
  ISor --> SSTI{{samtools stat}}
  PSor --> SSTI
Raphael Müller's avatar
Raphael Müller committed
153
  PCCSSor --> SSTI
Raphael Müller's avatar
Raphael Müller committed
154
155
156
157
  NSor --> SSTI
  
  bedI --> GSEI{{Genome Size Estimation}}
  Assembly --> GSEI
Raphael Müller's avatar
Raphael Müller committed
158
159
160

```

Raphael Müller's avatar
Raphael Müller committed
161
## Changes to the original perl pipeline
162

Raphael Müller's avatar
Raphael Müller committed
163
### Parameters
164

Raphael Müller's avatar
Raphael Müller committed
165
 - {+ all options now have a long version, short version is currently not available +}
166
 - {+ added option \`--illumina-bam-files\`, \`--pacbio-bam-files\`, \`--nanopore-bam-files\`, which substitute \`-b\` +}
Raphael Müller's avatar
Raphael Müller committed
167
 - {+ added option \`--pacbio-ccs-bam-files\`, \`--pacbio-ccs\` +}
168
169
170
171
172
173
174
 - {+ \`--debug\` prints internal variables for faster development and bug investigations +}
 - {+ \`--threads\` sets the maximum number of threads per process (not maximum number of threads of the whole workflow) +}
 - {- removed \`-b\`:splitted into three different options -}
 - {- removed \`-sort\`: every bam is now sorted -}
 - {- removed \`-v\`: removed, because this is a nextflow feature -}
 - {- removed \`-dry-run\` -}
 - {- removed \`-t\`: handled by nextflow -}
Raphael Müller's avatar
Raphael Müller committed
175
176

### Multiple files of one kind
177

Raphael Müller's avatar
Raphael Müller committed
178
Multiple files of one kind, e.g., three different PacBio runs, are now given as one string with each run comma separated instead of giving one parameter multiple times.
179
```
Raphael Müller's avatar
Raphael Müller committed
180
$ nextflow run backmap.nf --assembly assembly.fasta --pacbio pacbio1.fq,pacbio2.fq,pacbio3.fq
181
182
```

Raphael Müller's avatar
Raphael Müller committed
183
184
185
### Genome size estimation with bam files

Genome size estimation with bam files can now be done if assembly is given
186
187

```
Raphael Müller's avatar
Raphael Müller committed
188
$ nextflow run backmap.nf --assembly assembly.fasta --pacbio-bam-files pacbio1.bam,pacbio2.bam,pacbio3.bam
189
190
```

Raphael Müller's avatar
Raphael Müller committed
191
### Improved R scripts
192

Raphael Müller's avatar
Raphael Müller committed
193
R scripts have been improved and rewritten.
194

195
196
#### R All plot perl

Raphael Müller's avatar
Raphael Müller committed
197
![R All plot from the original perl pipeline][oldRall]
198

Raphael Müller's avatar
typo    
Raphael Müller committed
199
#### R All plot nextflow
200

Raphael Müller's avatar
Raphael Müller committed
201
![R All plot from the new nextflow pipeline][newRall]
202

Raphael Müller's avatar
Raphael Müller committed
203
### Resolved bugs of the perl pipeline
204

Raphael Müller's avatar
Raphael Müller committed
205
206
207
208
209
210
While running the original pipeline with different kinds of parameters, some bugs appeared
|                              Issue                              | Solved? |                      PR                   |
| --------------------------------------------------------------- | ------- | ------------------------------------------|
| keep temporary option did not work properly                     |   yes   | https://github.com/schellt/backmap/pull/1 |
| plot all did not work, if illumina reads were not present       |   yes   | https://github.com/schellt/backmap/pull/5 |
| pipeline crashed on rerun due to not enforcing symlink creation |   yes   | https://github.com/schellt/backmap/pull/4 |
211

212
## Development/Test
213

Raphael Müller's avatar
Raphael Müller committed
214
### Install dev/test environments
215

Raphael Müller's avatar
Raphael Müller committed
216
#### Development
217

Raphael Müller's avatar
Raphael Müller committed
218
The development environment includes everything, which is needed for running the original perl pipeline, the newly created nextflow pipeline, and the snakemake test pipeline
219

220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
```yaml
---
name: backmap-development
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - snakemake
  - mamba
  - nextflow
  - perl
  - perl-app-cpanminus
  - samtools
  - bedtools
  - bwa
  - minimap2
  - qualimap
  - multiqc
  - r-base
  - r-ggplot2
  - r-dplyr
242
  - nodejs
Raphael Müller's avatar
Raphael Müller committed
243
244
```

245
246
247
248
249
250
Perl packages needed for original workflow

 - Cwd
 - IPC::Cmd
 - Number::FormatEng
 - Parallel::Loops
Raphael Müller's avatar
Raphael Müller committed
251
252

```
253
# create and activate conda environment
Raphael Müller's avatar
Raphael Müller committed
254
conda env create -f environment-dev.yml
255
256
257
258
259
260
261
conda activate backmap-development
# install perl packages and check if they are really available
env PERL5LIB="" PERL_LOCAL_LIB_ROOT="" PERL_MM_OPT="" PERL_MB_OPT="" cpanm Cwd IPC::Cmd Number::FormatEng Parallel::Loops
for i in Cwd IPC::Cmd Number::FormatEng Parallel::Loops;
do
  perl -M"$i" -e 'print "Modul exists\\n";'
done
Raphael Müller's avatar
Raphael Müller committed
262
263
```

264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
#### Testing with Snakemake pipeline

##### Install

```yaml
name: backmap-test
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - snakemake
  - mamba
```

##### Prepare Test environments
Raphael Müller's avatar
Raphael Müller committed
280
281

```
282
# create and activate conda environment
Raphael Müller's avatar
Raphael Müller committed
283
284
conda env create -f environment-test.yml
conda activate backmap-test
285
# create environments for all jobs
Raphael Müller's avatar
Raphael Müller committed
286
snakemake --profile sm_profile/ --conda-create-envs-only -F
287
# install perl packages
Raphael Müller's avatar
Raphael Müller committed
288
snakemake --profile sm_profile/ -f install_perl_packages
289
290
291
292
293
```

##### Run all tests

```
Raphael Müller's avatar
Raphael Müller committed
294
295
snakemake --profile sm_profile/ -F 
```
296

297
298
#### Linter

299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
For this pipeline, we use [npm-groovy-lint](https://www.npmjs.com/package/npm-groovy-lint) as the linter for the nextflow pipeline. Also, check out its [ruleset](https://codenarc.org/)

##### Install

```
conda activate backmap-development
npm install -g npm-groovy-lint
```

##### Run

```
npm-groovy-lint --output "txt" --no-insight --verbose -p nextflow/
```

##### Notes

Consider changing the file `.groovylintrc.json`, if the linter shows unexpected behavior.
317

318
319
320
----------------------------------------

# Original perl script
Raphael Müller's avatar
Raphael Müller committed
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343

https://github.com/schellt/backmap

### Citation
Schell T, Feldmeyer B, Schmidt H, Greshake B, Tills O et al. (2017). An Annotated Draft Genome for _Radix auricularia_ (Gastropoda, Mollusca). _Genome Biology and Evolution_, 9(3):585–592, <https://doi.org/10.1093/gbe/evx032>

__If you use this tool please cite the dependencies as well:__

- samtools:  
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J et al. (2009). The Sequence Alignment/Map format and SAMtools. _Bioinformatics_, 25(16):2078–2079, <https://doi.org/10.1093/bioinformatics/btp352>
- bwa mem:  
Li H (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. _arXiv preprint arXiv:1303.3997_.
- minimap2:  
Li H (2018). Minimap2: pairwise alignment for nucleotide sequences. _Bioinformatics_, 34:3094–3100, <https://doi.org/10.1093/bioinformatics/bty191>
- Qualimap:  
Okonechnikov K, Conesa A, García-Alcalde F (2016). Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. _Bioinformatics_, 32(2):292–294, <https://doi.org/10.1093/bioinformatics/btv566>
- MultiQC:  
Ewels P, Magnusson M, Lundin S, Käller M (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. _Bioinformatics_, 32(19):3047–3048, <https://doi.org/10.1093/bioinformatics/btw354>
- bedtools:  
Quinlan AR, Hall IM (2010). BEDTools: a flexible suite of utilities for comparing genomic features. _Bioinformatics_, 26(6):841–842, <https://doi.org/10.1093/bioinformatics/btq033>
- Rscript:  
R Core Team (2019). R: A Language and Environment for Statistical Computing. <http://www.R-project.org/>

344
345
--------------------------------------------------------------------------------------

Raphael Müller's avatar
Raphael Müller committed
346
# Internal notes
Raphael Müller's avatar
Raphael Müller committed
347

348
349
350
351
## Directory

`/vol/cb/projects/112020_tbg_backmap`

Raphael Müller's avatar
Raphael Müller committed
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
## Directory structure (as of date 01/06/2021)

```
.
├── dataset
│   ├── dgal_1.paired.confilter.fq
│   ├── dgal_1.paired.confilter.fq.gz
│   ├── dgal_2.paired.confilter.fq
│   ├── dgal_2.paired.confilter.fq.gz
│   ├── dgal.confilter.fq
│   ├── dgal.confilter.fq.gz
│   ├── dgal_ra_pb-target-and-other_ill-confilter.blobfilter.rmmt_sspace-lr3_lrgc3_pg3_pilon_3.fasta
│   ├── dgal_ra_pb-target-and-other_ill-confilter.blobfilter.rmmt_sspace-lr3_lrgc3_pg3_pilon_3.fasta.gz
│   ├── dgal_ra_pb-target-and-other_ill-confilter.blobfilter.rmmt_sspace-lr3_lrgc3_pg3_pilon_3.fasta.pb.sort.bam
│   ├── dgal_ra_pb-target-and-other_ill-confilter.blobfilter.rmmt_sspace-lr3_lrgc3_pg3_pilon_3.fasta.sort.bam
│   ├── nohup.out
│   ├── pacbio_gal.target-and-other.blobfilter.fastq
│   ├── pacbio_gal.target-and-other.blobfilter.fastq.gz
│   └── Test_datasets.zip
├── environment-dev.yml
├── environment-test.yml
├── environment.yml
├── img
│   ├── RallNew.png
│   └── RallOld.png
├── nextflow
│   ├── backmap.nf
│   └── nextflow.config
├── original_wf
│   └── backmap
├── README.md
└── test
    ├── benchmarks
    ├── config.yml
    ├── data_prep
    ├── data_prep_bams
    ├── envs
    ├── logs
    ├── no_backup
    ├── params_files
    ├── results
    ├── sm_profile
    ├── Snakefile
    └── snakemake_workflows

19 directories, 25 files
```
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425

## Test files

The dataset for testing was brought to us via JLUBox and is now stored at
`/vol/cb/projects/112020_tbg_backmap/dataset1/Test_datasets.zip`

```
-r--r--r--  1 rmueller cb  16G Dec  1 15:13 Test_datasets.zip
```

### Content
```
Archive:  Test_datasets.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
2209018648  2020-08-26 14:16   dgal_1.paired.confilter.fq.gz
2375279937  2020-08-26 14:14   dgal_2.paired.confilter.fq.gz
 38668802  2020-08-26 14:14   dgal_ra_pb-target-and-other_ill-confilter.blobfilter.rmmt_sspace-lr3_lrgc3_pg3_pilon_3.fasta.gz
1659448324  2020-08-26 14:10   pacbio_gal.target-and-other.blobfilter.fastq.gz
473770619  2020-08-26 14:11   dgal.confilter.fq.gz
2760385779  2020-08-26 14:09   dgal_ra_pb-target-and-other_ill-confilter.blobfilter.rmmt_sspace-lr3_lrgc3_pg3_pilon_3.fasta.pb.sort.bam
6667551038  2020-08-26 14:13   dgal_ra_pb-target-and-other_ill-confilter.blobfilter.rmmt_sspace-lr3_lrgc3_pg3_pilon_3.fasta.sort.bam
---------                     -------
16184123147                     7 files
```

There are no Nanopore reads. The snakemake pipeline simulates them
Raphael Müller's avatar
Raphael Müller committed
426

427
| Purpose | File |
Raphael Müller's avatar
Raphael Müller committed
428
| ---- | ---- |
429
| Genome assembly | dgal_ra_pb-target-and-other_ill-confilter.blobfilter.rmmt_sspace-lr3_lrgc3_pg3_pilon_3.fasta.gz |
Raphael Müller's avatar
Raphael Müller committed
430
431
432
433
434
435
| Illumina forward reads | dgal_1.paired.confilter.fq.gz | 
| Illumina reverse reads | dgal_2.paired.confilter.fq.gz | 
| Illumina unpaired reads | dgal.confilter.fq.gz | 
| PacBio reads | pacbio_gal.target-and-other.blobfilter.fastq.gz | 
| Illumina mapping | dgal_ra_pb-target-and-other_ill-confilter.blobfilter.rmmt_sspace-lr3_lrgc3_pg3_pilon_3.fasta.sort.bam | 
| PacBio mapping | dgal_ra_pb-target-and-other_ill-confilter.blobfilter.rmmt_sspace-lr3_lrgc3_pg3_pilon_3.fasta.pb.sort.bam | 
Raphael Müller's avatar
Raphael Müller committed
436

Raphael Müller's avatar
Raphael Müller committed
437
438
[oldRall]: img/RallOld.png "Resulting graph of the original perl pipeline"
[newRall]: img/RallNew.png "Resulting graph of the new nextflor pipeline"
439
[^green]: New dependencies appear green
Raphael Müller's avatar
Raphael Müller committed
440
441
442
443

----------------------------------------

# Footnotes