SARS CoV2 Variant Analysis workflow
This is the workflow to analyse sars-cov2 variants from the raw samples, as well as to compare the effect of different mapping strategies and variant callers on our analysis.
The two mapping strategies used are: bowtie2 and bwa, and the two variant callers gatk and lofreq.
All in all the workflow generates results for all the 4 combinations of the above mentioned strategies i.e. bowtie2-gatk, bowtie2-lofreq, bwa-gatk and bwa-lofreq.
Getting Started:
Step 1: Copy the git repository
You can use the command:
git clone <link_to_repository>
Step 2: install the dependencies
Create a new environment using the conda_env.yaml file.
Use the command:
conda env create -f conda_env.yaml
Step 3: download the reference genome
Create a genome folder and download the fasta, gtf and other files for the reference genome you want to use there.
In our case we have used the following genome assembly for our analysis: ASM985889v3 (https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_009858895.2/)
Download command:
wget -r --no-parent -e robots=off -nd https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/
NOTE: We use the unziped genome files in our Snakefile.
Step 4: setup database for snpEff
In our workflow we use the package snpEff to annotate our variants. In order to run this operation, a an snpEff database entry must be present for the respective reference genome.
You can check whether a database entry for your reference genome exists or not using the following command:
snpEff databases | grep <your_reference_genome>
You can try out the different names for zour reference genome, in case the command returns nothing for any of those, then that means that an snpEff database for this genome does not exist, and you need to create an snpEff database entry for the genome yourself.
In our case, we have already created a database entry for our reference genome assembly ASM985889v3
In case you want to use the same genome as ours you can do it in just 2 steps.
First, you need to add a few lines to the snpEff.config file. The path to your config file looks like this:
<path_to_conda>/share/snpeff-4.3.1t-5/snpEff.config
NOTE: you can find the <path_to_conda> using the command:
conda env list
NOTE: the name of the 'snpeff-4.3.1t-5' folder can differ depending on your version.
Once you find the config file, open it and paste the following lines around line number 127 of the file:
# Sars-cov2:
ASM985889v3.genome : Sars_cov2
and save the file.
Second, you need to copy the data folder from the snppEff folder in your local clone of this git repo, to the snpeff folder where your snpEff.config file lies:
cp -r snpEff/data <path_to_conda>/share/snpeff-4.3.1t-5/
Now run the command:
snpEff databases | grep ASM985889v3
it should now return an entry as follows
ASM985889v3 Sars_cov2 OK http://downloads.sourceforge.net/project/snpeff/databases/v4_3/snpEff_v4_3_ASM985889v3.zip
and that means that you are ready for the next step.
NOTE: in case you need to create the database on your own, then follow the instructions on the following link: http://pcingola.github.io/SnpEff/snpeff/build_db/
Step 5: Downloading samples (Reads):
Download all your samples in a samples folder.
NOTE: Our workflow only works with illumuna paired end reads!
We used the following samples for our analysis:
- Omikron (ERR6494491) https://www.ebi.ac.uk/ena/browser/view/ERR6494491
- Gamma (ERR4165100) https://www.ebi.ac.uk/ena/browser/view/ERR4165100
- Delta (ERR5777093) https://www.ebi.ac.uk/ena/browser/view/ERR5777093
- Beta (ERR5063609) https://www.ebi.ac.uk/ena/browser/view/ERR5063609
- Theta (ERR5628794) https://www.ebi.ac.uk/ena/browser/view/ERR5628794
The commands to download the forward and reverse reads for the sample ERR5628794 are as follows:
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR562/004/ERR5628794/ERR5628794_2.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR562/004/ERR5628794/ERR5628794_1.fastq.gz
NOTE: Depending on the samples or source of your samples, the download scripts can differ.
NOTE: In the snakemake workflow, the names of the samples are restricted to have only alphanumeric characters (regex: '[A-Za-z0-9]+'). Make sure that your sample names follow this pattern. It should not be a problem because all the sequence databases have alphanumeric sample names.
Step 6: Read Groups Infos:
In order to run gatk HaplotypeCaller, we need our bam files to include the read group infos. We do so with our picard AddOrReplaceReadGroups command. But, as parameter we require an argument file that contains the read groups infos.
The argument files should lie inside the 'readgroups' folder and should be named as follows:
<SAMPLE_NAME>.txt
And inside the file there should be written as follows.
--RGID <ID> --RGLB <LIBRARY_SOURCE> --RGPL <PLATFORM> --RGPU <MODEL> --RGSM <SAMPLE_ACCESSION>
This information can be found on the webpade of the sample.
For example the read groups argument file for the sample ERR5628794 is named as follows:
ERR5628794.txt
And its contents are as follows:
--RGID ERR5628794 --RGLB VIRAL_RNA --RGPL ILLUMINA --RGPU NextSeq_550 --RGSM SAMEA8463382
Once you have created the read groups arguments files for all the samples, you are ready to move on to the next step. The read groups files for our samples are present under the readgroups folder in this repository.
Step 7: Create Working Directory:
You need to create a working directory, where all the results and intermediate files will be generated. The folder structure should be as follows:
.
├── annotations
├── bcftools
├── fasta
├── mappings
├── phylogeny
├── picard
├── readgroups
├── reports
│ ├── mapping
│ ├── qc
│ │ ├── fastqc
│ │ └── multiqc
│ ├── snpeff
│ └── trees
├── samtools
└── variants
The command to generate these is as follows:
mkdir -p mappings samtools bcftools reports/{qc/{fastqc,multiqc},mapping,snpeff,trees} picard fasta variants annotations phylogeny readgroups
NOTE: The read groups argument files should lie in the readgroups folder before running the snakemake.
NOTE: The folders for samples, genome and Snakefile can lie within or outside the working directory. This is configurable in the Snakefile.
Step 8: Configure Snakefile:
Configure all the paths in the snakefile_config.
Add all the files that you want to generate in the rule_all in the Snakefile.
Make sure you use the correct variable names, format and config!
Step 9: Run snakemake:
Use the following command to run snakemake with cores:
snakemake -j 6
Off you go!
Step 10: View reports:
You can generate a snakemake report (html) for your run using the command:
snakemake --report <path_to_report_file>
Furthermore, the Snakefile also generates the following reports:
- fastqc (html)
- multiqc (html)
- mapping statistics (text)
- snpeff annotation report (html)
- phylogenetic trees (image)
These are present under the reports folder as described below:
├── reports
│ ├── mapping
│ ├── qc
│ │ ├── fastqc
│ │ └── multiqc
│ ├── snpeff
│ └── trees
Authors:
Aviral Jain
aviral.jain@bioinfsys.uni-giessen.de
Parisa Golbargtaklimi
parisa.golbargtaklimi@bioinfsys.uni-giessen.de
Vanessa Chaptchet Tchapda
vanessa.chaptchet.tchapda@bioinfsys.uni-giessen.de