# (new) Nextflow pipeline

## Install

```
conda env create -f environment.yml

conda activate backmap
```

## Usage

```
backmap v0.4

Description:
        Automatic mapping of paired, unpaired, PacBio and Nanopore reads to an
        assembly, execution of qualimap bamqc, multiqc and estimation of genome size
        from mapped nucleotides and peak coverage.
        With the provided conda environment configuration file (environment.yml),
        a conda environment with all necessary tools can be created.

        Be aware that this pipeline was transferred to a nextflow pipeline with major
        changes to version v0.3.

Usage:
        With mapping:
                nextflow run backmap.nf  --assembly assembly.fa
                                        [--single-end <unpaired.fq>[,unpaired2.fq,...]]
                                        [--paired-end <paired1.1.fq>,<paired1.2.fq>[,<paired2.1.fq>,<paired2.2.fq>,...]]
                                        [--pacbio <pacbio.fq>[,<pacbio2.fq>,...]]
                                        [--nanopore <nanopore.fq>[,<nanopore2.fq>,...]]

        Without mapping:
                nextflow run backmap.nf [--assembly assembly.fa]
                                        [--illumina-bam-files illumina.bam[,illumina2.bam,...]]
                                        [--pacbio-bam-files pacbio.bam[,pacbio2.bam,...]]
                                        [--nanopore-bam-files nanopore.bam[,nanopore2.bam,...]]


Mandatory options:
                --assembly                      STR     Assembly that reads should be mapped to, in fasta format
        AND AT LEAST ONE OF
                --single-end                    STR     Fastq files with unpaired Illumina reads, comma separated
                --paired-end                    STR     Two fastq files with paired Illumina reads, comma separated
                --pacbio                        STR     Fasta or fastq file with PacBio reads, comma separated
                --nanopore                      STR     Fasta or fastq file with Nanopore reads, comma separated
        OR
                --illumina-bam-files            STR     Premapped Illumina Reads in bam format, comma separated
                --pacbio-bam-files              STR     Premapped Pacbio Reads in bam format, comma separated
                --nanopore-bam-files            STR     Premapped Nanopore Reads in bam format, comma separated
                                                        Skips read mapping
                                                        Genome size estimation only possible if assembly is given

Optional Options: [default]
                --file-separator                STR     Overrides file separator [,]
                --output                        STR     Output directory [./Results/]
                                                        Will be created if not existing
                --prefix                        STR     Prefix of output files
                --keep-temporary                        Keep temporary, intermediate files [false]
                --threads                       INT     Maximum number of threads per process [1]
                                                        This is not the total number of available threads!

                --skip-quality-control                  Skip qualimap bamqc run [false]
                --skip-coverage-histogram               Skip creation of coverage histograms [false]
                --skip-genome-size-estimation           Skip genome size estimation [false]
                                                        Genome size estimation is only possible if assembly is given


                --bwa-options                   STR     Options passed to bwa [-a -c 10000]
                --minimap-pacbio-options        STR     Options passed to minimap ["-H"]
                --minimap-nanopore-options      STR     Options passed to nanopore minimap runs [""]
                --qualimap-options              STR     Options passed to qualimap [""]
                                                        Pass options with quotes e.g. --bwa-options "<options>"

                --help                                  Print this help and exit
                --version                               Print version number and exit
                --debug                                 Print inner variables

```
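
The two modes from the usage text can be sketched as follows. This is only an illustration: the file names are hypothetical placeholders, and the script prints the commands for review instead of running them.

```shell
#!/bin/sh
# Hypothetical input files; replace them with your own paths.
assembly="assembly.fa"
paired="lib1_1.fq,lib1_2.fq"   # one pair: forward,reverse
pacbio="pacbio.fq"

# With mapping: reads are aligned to the assembly first.
map_cmd="nextflow run backmap.nf --assembly $assembly --paired-end $paired --pacbio $pacbio --threads 4"

# Without mapping: premapped bam files are passed instead of reads.
# The assembly is still given so that genome size can be estimated.
bam_cmd="nextflow run backmap.nf --assembly $assembly --pacbio-bam-files pacbio.bam"

# Print both commands for review before running them.
echo "$map_cmd"
echo "$bam_cmd"
```

Once the printed commands look right, they can be pasted into a shell in which the `backmap` conda environment is active.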

## Test

### Install dev/test environments

#### Development

```
conda env create -f environment-dev.yml
conda activate backmap-dev
```

#### Snakemake

```
conda env create -f environment-test.yml
conda activate backmap-test
snakemake --profile sm_profile/ --conda-create-envs-only -F
snakemake --profile sm_profile/ -f install_perl_packages
snakemake --profile sm_profile/ -F 
```

## What's new? What changed?

### new options

1. all options now have a long version; short versions are currently not available
1. `--illumina-bam-files`, `--pacbio-bam-files`, `--nanopore-bam-files` instead of `-b`
1. `--debug` prints internal variables for faster development and bug investigations
1. `--threads` sets the maximum number of threads per process (not maximum number of threads of the whole workflow)
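
Because `--threads` is a per-process limit, the total number of threads in use can be higher. A rough back-of-the-envelope sketch, assuming a hypothetical number of processes that Nextflow runs at the same time (the real number depends on the executor and its configuration):

```shell
#!/bin/sh
threads_per_process=4     # value passed as --threads
parallel_processes=3      # hypothetical: processes Nextflow runs at once

# Upper bound on threads in use at the same time.
max_total_threads=$((threads_per_process * parallel_processes))
echo "Up to $max_total_threads threads may be in use at the same time"
```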

### options not available anymore

1. `-b` -> split into three different options
1. `-sort` -> every bam is now sorted
1. `-v` -> removed, because this is a nextflow feature
1. `-dry-run` -> removed
1. `-t` -> removed, because this is handled by nextflow

### multifile call

Multiple files of one kind, e.g., three different PacBio runs, are now passed as a single comma-separated string instead of repeating the parameter once per file.
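
A minimal sketch of building such a comma-separated string in the shell. The helper `join_by` and the file names are hypothetical and not part of the pipeline:

```shell
#!/bin/sh
# join_by SEP ITEM...: print all items joined with SEP.
join_by() {
    sep=$1; shift
    [ "$#" -gt 0 ] || return 0      # nothing to join
    out=$1; shift
    for item in "$@"; do
        out="$out$sep$item"
    done
    printf '%s\n' "$out"
}

# Three hypothetical PacBio runs become one argument.
pacbio=$(join_by , run1.fq run2.fq run3.fq)
echo "nextflow run backmap.nf --assembly assembly.fa --pacbio $pacbio"
```

For paired-end reads the usage text lists the files pair by pair, i.e. `<paired1.1.fq>,<paired1.2.fq>,<paired2.1.fq>,<paired2.2.fq>`, so the order of the items matters. If a file name itself contains a comma, `--file-separator` should allow a different delimiter; that combination is untested here.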

### Genome size estimation with bam files

Genome size estimation from bam files is now possible if the assembly is given.
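
The usage text describes the estimate as derived from mapped nucleotides and peak coverage. A worked example of the arithmetic, assuming the estimate is simply mapped nucleotides divided by peak coverage (the exact formula used by the pipeline is not stated here) and using made-up numbers:

```shell
#!/bin/sh
# estimate_genome_size MAPPED_NUCLEOTIDES PEAK_COVERAGE
# Assumed formula: mapped nucleotides / peak coverage (integer division).
estimate_genome_size() {
    echo $(( $1 / $2 ))
}

# Hypothetical values: 12 Gbp mapped, coverage peak at 30x.
genome_size=$(estimate_genome_size 12000000000 30)
echo "Estimated genome size: $genome_size bp"
```

With these numbers the estimate is 400000000 bp, i.e. 400 Mbp.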

### Improved R scripts

R scripts have been improved and rewritten

# original perl script

https://github.com/schellt/backmap

### Citation
Schell T, Feldmeyer B, Schmidt H, Greshake B, Tills O et al. (2017). An Annotated Draft Genome for _Radix auricularia_ (Gastropoda, Mollusca). _Genome Biology and Evolution_, 9(3):585–592, <https://doi.org/10.1093/gbe/evx032>

__If you use this tool please cite the dependencies as well:__

- samtools:  
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J et al. (2009). The Sequence Alignment/Map format and SAMtools. _Bioinformatics_, 25(16):2078–2079, <https://doi.org/10.1093/bioinformatics/btp352>
- bwa mem:  
Li H (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. _arXiv preprint arXiv:1303.3997_.
- minimap2:  
Li H (2018). Minimap2: pairwise alignment for nucleotide sequences. _Bioinformatics_, 34:3094–3100, <https://doi.org/10.1093/bioinformatics/bty191>
- Qualimap:  
Okonechnikov K, Conesa A, García-Alcalde F (2016). Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. _Bioinformatics_, 32(2):292–294, <https://doi.org/10.1093/bioinformatics/btv566>
- MultiQC:  
Ewels P, Magnusson M, Lundin S, Käller M (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. _Bioinformatics_, 32(19):3047–3048, <https://doi.org/10.1093/bioinformatics/btw354>
- bedtools:  
Quinlan AR, Hall IM (2010). BEDTools: a flexible suite of utilities for comparing genomic features. _Bioinformatics_, 26(6):841–842, <https://doi.org/10.1093/bioinformatics/btq033>
- Rscript:  
R Core Team (2019). R: A Language and Environment for Statistical Computing. <http://www.R-project.org/>

# testdata

The test data do not include Nanopore reads.

| what | file |
| ---- | ---- |
| genome assembly | dgal_ra_pb-target-and-other_ill-confilter.blobfilter.rmmt_sspace-lr3_lrgc3_pg3_pilon_3.fasta.gz |
| Illumina forward reads | dgal_1.paired.confilter.fq.gz | 
| Illumina reverse reads | dgal_2.paired.confilter.fq.gz | 
| Illumina unpaired reads | dgal.confilter.fq.gz | 
| PacBio reads | pacbio_gal.target-and-other.blobfilter.fastq.gz | 
| Illumina mapping | dgal_ra_pb-target-and-other_ill-confilter.blobfilter.rmmt_sspace-lr3_lrgc3_pg3_pilon_3.fasta.sort.bam | 
| PacBio mapping | dgal_ra_pb-target-and-other_ill-confilter.blobfilter.rmmt_sspace-lr3_lrgc3_pg3_pilon_3.fasta.pb.sort.bam |