Commit 08666c0d authored by Lukas Jelonek's avatar Lukas Jelonek

Add quality control exercise

# Exercise: Quality control pipeline
In this exercise you will create a workflow that runs fastqc on a set of fastq
files and aggregates the reports with multiqc. Hint: Run your workflow after
each step of the exercise.
### Setup
Create a directory for your pipeline, e.g. qc-wf. Create a file inside
this directory. Download the example data into the directory via:
```shell
mkdir data
cd data
curl | bash
```
### Initial channel creation
Define a channel that contains all fastq.gz files from the data directory and
give it a name, e.g. `ch_fastqs`. Check its content with the `.println()` or the
`.view()` method (be sure to remove it when proceeding to the next assignment).
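A minimal sketch of such a channel, assuming the example data was downloaded into `data/` and the channel name `ch_fastqs` suggested above:

```nextflow
// all fastq.gz files in the data directory
ch_fastqs = Channel.fromPath('data/*.fastq.gz')
ch_fastqs.view()  // inspect the channel content; remove before the next step
```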
### First process template
Add the general structure for your first process to your file. Give it a name.
We will fill in the input, script, output and directives in the subsequent
process <name> {
### Define input channel
Add the input block. Be sure to use the right qualifier for files.
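One way to write the input block, using the `file` qualifier the exercise mentions later (variable and channel names are only suggestions):

```nextflow
input:
file fastq from ch_fastqs
```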
### Script template
Fastqc requires at least one unnamed parameter in batch mode: the fastq file.
It then creates two files, an HTML report and a ZIP file with the raw qc-data.
Both have the same prefix as the fastq file.
As a task is executed in its own task directory and the input files are staged
into that directory (be sure to use the file qualifier!), we can rely on the
default fastqc behaviour and simply call it with the file.
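With the input staged as, say, `file fastq` (a hypothetical name), the script block can be as simple as:

```nextflow
script:
"""
fastqc ${fastq}
"""
```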
### Define output channels
Fastqc creates two files: an HTML and a ZIP file. For the aggregated report
with multiqc only the ZIP files are needed. Define an output channel for the
ZIP files.
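A sketch of the output block, matching the `file` qualifier used for the input (the channel name is only a suggestion):

```nextflow
output:
file "*.zip" into ch_fastqc_zips
```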
### Copy data to an output directory
Although the html files are not required for multiqc, we want to save them in a
results directory. Therefore we need to define a second output channel that
contains only the html files. To copy them from the task work directory to the
results directory you have to define the publishDir directive.
The publishDir directive copies all files from all output channels to the
results directory, but it can be configured to include only files that match a
specific pattern. Check the documentation to find out how.
By default nextflow creates symbolic links in the results directory to the
files. This can also be configured. Look it up in the documentation.
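Putting both configuration options together, the directive could look like this (the results path is only an example):

```nextflow
// copy (instead of symlink) only the html reports into results/
publishDir 'results', mode: 'copy', pattern: '*.html'
```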
(Optional) Define a tag for the execution, e.g. the file name.
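Assuming the input was staged as `file fastq` (a hypothetical name), the tag directive could reference it like this:

```nextflow
tag "${fastq.name}"  // shows the file name next to the task in the execution log
```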
### (Optional) Command line parameters
Add command line parameters to your pipeline, at least the input files and the
target directories. Write some code for sanity checks of the passed parameters,
e.g. are the files available, is the result directory already there, ...
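A possible sketch, with hypothetical parameter names `input` and `outdir`:

```nextflow
// defaults, overridable on the command line: --input '...' --outdir '...'
params.input  = 'data/*.fastq.gz'
params.outdir = 'results'

// basic sanity checks before the workflow starts
if (file(params.input).isEmpty()) {
    exit 1, "No input files found matching: ${params.input}"
}
if (file(params.outdir).exists()) {
    println "Warning: result directory ${params.outdir} already exists"
}
```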
### Add multiqc process
MultiQC is a tool that aggregates statistics from several bioinformatics tools
for multiple datasets into a single html report. These reports help in quickly
assessing the results. To use it you specify a folder in which it will search
for statistics. It then creates a multiqc_report.html file and a directory
containing the qc-data.
Create a second process for multiqc. Define the input, output and script. The
multiqc-html file should be copied to the results directory.
In an earlier step you already created a channel that contains all
fastqc zip files. If you simply consume the channel data, a separate multiqc
task will be executed for each zip file. In order to process all files in a
single task you have to modify the channel so that it contains a single list of
files instead of individual files.
Check the documentation for the right channel operator and apply it to the
channel in the input block of your multiqc process. Be sure to use the right
qualifier for files.
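One way to sketch the whole process: the `collect` operator turns the channel's individual files into a single list, so multiqc runs exactly once (channel and directory names are only suggestions):

```nextflow
process multiqc {
    // copy the aggregated report into the results directory
    publishDir 'results', mode: 'copy', pattern: '*.html'

    input:
    file zips from ch_fastqc_zips.collect()

    output:
    file "multiqc_report.html"

    script:
    """
    multiqc .
    """
}
```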