Extended workflow
Warning
This tutorial shows a full run of the workflow with all options activated. For testing, we ran this example locally on a large cloud instance. The data is likely too large for running locally on an average laptop.
We show a run of the full workflow, including deepribo predictions and differential expression analysis, on data available from NCBI.
For this purpose, we use a salmonella enterica
dataset available under the accession number PRJNA421559
[PGAR19].
Warning
Ensure that you have miniconda3
and singularity
installed and a snakemake environment
set-up. Please refer to the overview for details on the installation.
Setup
First of all, we start by creating the project directory and changing to it. (you can choose any directory name)
mkdir project
cd project
We then download the latest version of HRIBO into the newly created project folder and unpack it.
wget https://github.com/RickGelhausen/HRIBO/archive/1.7.0.tar.gz
tar -xzf 1.7.0.tar.gz; mv HRIBO-1.7.0 HRIBO; rm 1.7.0.tar.gz;
Retrieve and prepare input files
Before starting the workflow, we have to acquire and prepare several input files. These files are the annotation file, the genome file, the fastq files, the configuration file and the sample sheet.
Annotation and genome files
First, we want to retrieve the annotation file and the genome file. In this case, we can find both on NCBI using the accession number NC_016856.1
.
Note
Ensure that you download the annotation for the correct strain str. 14028S
.
On this page, we can directly retrieve both files by clicking on the according download links next to the file descriptions. Alternatively, you can directly download them using the following commands:
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/022/165/GCF_000022165.1_ASM2216v1/GCF_000022165.1_ASM2216v1_genomic.gff.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/022/165/GCF_000022165.1_ASM2216v1/GCF_000022165.1_ASM2216v1_genomic.fna.gz
Then, we unpack and rename both files.
gunzip GCF_000022165.1_ASM2216v1_genomic.gff.gz && mv GCF_000022165.1_ASM2216v1_genomic.gff annotation.gff
gunzip GCF_000022165.1_ASM2216v1_genomic.fna.gz && mv GCF_000022165.1_ASM2216v1_genomic.fna genome.fa
.fastq files
Next, we want to acquire the fastq files. The fastq files are available under the accession number PRJNA421559
on NCBI.
The files have to be downloaded using the Sequence Read Archive (SRA).
There are multiple ways of downloading files from SRA as explained here.
As we already have conda
installed, the easiest way is to install the sra-tools
:
conda create -n sra-tools -c bioconda -c conda-forge sra-tools pigz
This will create a conda environment containing the sra-tools
and pigz
. Using these, we can simply pass the SRA identifiers and download the data:
conda activate sra-tools;
fasterq-dump SRR6359966; pigz -p 2 SRR6359966.fastq; mv SRR6359966.fastq.gz RIBO-WT-1.fastq.gz
fasterq-dump SRR6359967; pigz -p 2 SRR6359967.fastq; mv SRR6359967.fastq.gz RIBO-WT-2.fastq.gz
fasterq-dump SRR6359974; pigz -p 2 SRR6359974.fastq; mv SRR6359974.fastq.gz RNA-WT-1.fastq.gz
fasterq-dump SRR6359975; pigz -p 2 SRR6359975.fastq; mv SRR6359975.fastq.gz RNA-WT-2.fastq.gz
fasterq-dump SRR6359970; pigz -p 2 SRR6359970.fastq; mv SRR6359970.fastq.gz RIBO-csrA-1.fastq.gz
fasterq-dump SRR6359971; pigz -p 2 SRR6359971.fastq; mv SRR6359971.fastq.gz RIBO-csrA-2.fastq.gz
fasterq-dump SRR6359978; pigz -p 2 SRR6359978.fastq; mv SRR6359978.fastq.gz RNA-csrA-1.fastq.gz
fasterq-dump SRR6359979; pigz -p 2 SRR6359979.fastq; mv SRR6359979.fastq.gz RNA-csrA-2.fastq.gz
conda deactivate;
Note
we will use two conditions and two replicates for each condition. There are 4 replicates available for each condition, we run it with two as this is just an example. If you run an analysis always try to use as many replicates as possible.
Warning
If you have a bad internet connection, this step might take some time. It is advised to run this workflow on a cluster or cloud instance.
This will download compressed files for each of the required .fastq
files. We will move them into a folder called fastq
.
mkdir fastq;
mv *.fastq.gz fastq;
Sample sheet and configuration file
Finally, we will prepare the configuration file (config.yaml
) and the sample sheet (samples.tsv
). We start by copying templates for both files from the HRIBO/templates/
into the HRIBO/
folder.
cp HRIBO/templates/samples.tsv HRIBO/
The sample file looks as follows:
method |
condition |
replicate |
fastqFile |
fastqFile2 |
---|---|---|---|---|
RIBO |
A |
1 |
fastq/RIBO-A-1.fastq.gz |
|
RIBO |
A |
2 |
fastq/RIBO-A-2.fastq.gz |
|
RIBO |
B |
1 |
fastq/RIBO-B-1.fastq.gz |
|
RIBO |
B |
2 |
fastq/RIBO-B-2.fastq.gz |
|
RNA |
A |
1 |
fastq/RNA-A-1.fastq.gz |
|
RNA |
A |
2 |
fastq/RNA-A-2.fastq.gz |
|
RNA |
B |
1 |
fastq/RNA-B-1.fastq.gz |
|
RNA |
B |
2 |
fastq/RNA-B-2.fastq.gz |
Note
When using your own data, use any editor (vi(m), gedit, nano, atom, …) to customize the sample sheet.
Warning
Please ensure not to replace any tabulator symbols with spaces while changing this file.
We will rewrite this file to fit the previously downloaded .fastq.gz files.
method |
condition |
replicate |
fastqFile |
fastqFile2 |
---|---|---|---|---|
RIBO |
WT |
1 |
fastq/RIBO-WT-1.fastq.gz |
|
RIBO |
WT |
2 |
fastq/RIBO-WT-2.fastq.gz |
|
RIBO |
csrA |
1 |
fastq/RIBO-csrA-1.fastq.gz |
|
RIBO |
csrA |
2 |
fastq/RIBO-csrA-2.fastq.gz |
|
RNA |
WT |
1 |
fastq/RNA-WT-1.fastq.gz |
|
RNA |
WT |
2 |
fastq/RNA-WT-2.fastq.gz |
|
RNA |
csrA |
1 |
fastq/RNA-csrA-1.fastq.gz |
|
RNA |
csrA |
2 |
fastq/RNA-csrA-1.fastq.gz |
Next, we are going to set up the config.yaml
.
cp HRIBO/templates/config.yaml HRIBO/
The config file can be used to easily change the parameters of HRIBO
.
Note
For a detailed overview of the available options please refer to our workflow-configuration
In our small example, we will adjust the adapter sequence which will lead to the following changes in the config.yaml
file:
biologySettings:
# Adapter sequence used
adapterS3: "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
Running the workflow
Now that all the required files are prepared, we can start running the workflow, either locally or in a cluster environment.
Warning
if you have problems running deepribo
, please refer to Activating DeepRibo.
Warning
before you start using snakemake
remember to activate the environment first.
conda activate snakemake
Run the workflow locally
Use the following steps when you plan to execute the workflow on a single server, cloud instance or workstation.
Warning
Please be aware that some steps of the workflow require a lot of memory or time, depending on the size of your input data. To get a better idea about the memory consumption, you can have a look at the provided sge.yaml
or torque.yaml
files.
Navigate to the project folder containing your annotation and genome files, as well as the HRIBO folder. Start the workflow locally from this folder by running:
snakemake --use-conda --use-singularity --singularity-args " -c " --greediness 0 -s HRIBO/Snakefile --directory ${PWD} -j 10 --latency-wait 60
This will start the workflow locally.
--use-conda
: instruct snakemake to download tool dependencies from conda.-s
: specifies the Snakefile to be used.--directory
: specifies your current path.-j
: specifies the maximum number of cores snakemake is allowed to use.--latency-wait
: specifies how long (in seconds) snakemake will wait for filesystem latencies until declaring a file to be missing.
Run Snakemake in a cluster environment
Use the following steps if you are executing the workflow via a queuing system. Edit the configuration file <cluster>.yaml
according to your queuing system setup and cluster hardware.
Navigate to the project folder on your cluster system. Start the workflow from this folder by running (The following system call shows the usage with Grid Engine):
snakemake --use-conda --use-singularity --singularity-args " -c " -s HRIBO/Snakefile --directory ${PWD} -j 20 --cluster-config HRIBO/sge.yaml
Note
Ensure that you use an appropriate <cluster>.yaml
for your cluster system. We provide one for SGE
and TORQUE
based systems.
Example: Run Snakemake in a cluster environment
Warning
Be advised that this is a specific example, the required options may change depending on your system.
We ran the tutorial workflow in a cluster environment, specifically a TORQUE cluster environment.
Therefore, we created a bash script torque.sh
in our project folder.
vi torque.sh
Note
Please note that all arguments enclosed in <> have to be customized. This script will only work if your cluster uses the TORQUE queuing system.
We proceeded by writing the queuing script:
#!/bin/bash
#PBS -N <ProjectName>
#PBS -S /bin/bash
#PBS -q "long"
#PBS -d <PATH/ProjectFolder>
#PBS -l nodes=1:ppn=1
#PBS -o <PATH/ProjectFolder>
#PBS -j oe
cd <PATH/ProjectFolder>
source activate HRIBO
snakemake --latency-wait 600 --use-conda --use-singularity --singularity-args " -c " -s HRIBO/Snakefile --directory ${PWD} -j 20 --cluster-config HRIBO/torque.yaml --cluster "qsub -N {cluster.jobname} -S /bin/bash -q {cluster.qname} -d <PATH/ProjectFolder> -l {cluster.resources} -o {cluster.logoutputdir} -j oe"
We then simply submitted this job to the cluster:
qsub torque.sh
Using any of the presented methods, this will run the workflow on the tutorial dataset and create the desired output files.
Results
The last step will be to aggregate all the results once the workflow has finished running.
In order to do this, we provided a script in the scripts folder of HRIBO called makereport.sh
.
bash HRIBO/scripts/makereport.sh <reportname>
Running this will create a folder where all the results are collected from the workflows final output, it will additionally create compressed file in .zip
format.
The <reportname>
will be extended by report_HRIBOX.X.X_dd-mm-yy
.
Note
A detailed explanation of the result files can be found in the result section.
Note
The final result of this example workflow, can be found here .
Warning
As many browsers stopped the support for viewing ftp files, you might have to use a ftp viewer instead.
Runtime
The total runtime of the extended workflow, using 12 cores of an AMD EPYC Processor (with IBPB), 1996 MHz CPUs and 64 GB RAM, was 5h51m14s.
The runtime contains the automatic download and installation time of all dependencies by conda and singularity. This step is mainly dependent on the available network bandwidth. In this case it took about 12 minutes.
The runtime difference compared to the example workflow is explained by the additional libraries and tools used.
References
- GruningDSjodin+17
Björn Grüning, Ryan Dale, Andreas Sjödin, Jillian Rowe, Brad A. Chapman, Christopher H. Tomkins-Tinch, Renan Valieris, and Johannes Köster. Bioconda: a sustainable and comprehensive software distribution for the life sciences. bioRxiv, 2017. URL: https://www.biorxiv.org/content/early/2017/10/27/207092, arXiv:https://www.biorxiv.org/content/early/2017/10/27/207092.full.pdf, doi:10.1101/207092.
- KosterR18
Johannes Köster and Sven Rahmann. Snakemake—a scalable bioinformatics workflow engine. Bioinformatics, ():bty350, 2018. URL: http://dx.doi.org/10.1093/bioinformatics/bty350, arXiv:/oup/backfile/content_public/journal/bioinformatics/pap/10.1093_bioinformatics_bty350/2/bty350.pdf, doi:10.1093/bioinformatics/bty350.
- PGAR19
Anastasia H. Potts, Yinping Guo, Brian M. M. Ahmer, and Tony Romeo. Role of csra in stress responses and metabolism important for salmonella virulence revealed by integrated transcriptomics. PLOS ONE, 14(1):1–30, 01 2019. URL: https://doi.org/10.1371/journal.pone.0211430, doi:10.1371/journal.pone.0211430.
- ZAA+18
Daniel R Zerbino, Premanand Achuthan, Wasiu Akanni, M Ridwan Amode, Daniel Barrell, Jyothish Bhai, Konstantinos Billis, Carla Cummins, Astrid Gall, Carlos García Girón, Laurent Gil, Leo Gordon, Leanne Haggerty, Erin Haskell, Thibaut Hourlier, Osagie G Izuogu, Sophie H Janacek, Thomas Juettemann, Jimmy Kiang To, Matthew R Laird, Ilias Lavidas, Zhicheng Liu, Jane E Loveland, Thomas Maurel, William McLaren, Benjamin Moore, Jonathan Mudge, Daniel N Murphy, Victoria Newman, Michael Nuhn, Denye Ogeh, Chuang Kee Ong, Anne Parker, Mateus Patricio, Harpreet Singh Riat, Helen Schuilenburg, Dan Sheppard, Helen Sparrow, Kieron Taylor, Anja Thormann, Alessandro Vullo, Brandon Walts, Amonida Zadissa, Adam Frankish, Sarah E Hunt, Myrto Kostadima, Nicholas Langridge, Fergal J Martin, Matthieu Muffato, Emily Perry, Magali Ruffier, Dan M Staines, Stephen J Trevanion, Bronwen L Aken, Fiona Cunningham, Andrew Yates, and Paul Flicek. Ensembl 2018. Nucleic Acids Research, 46(D1):D754–D761, 2018. URL: http://dx.doi.org/10.1093/nar/gkx1098, arXiv:/oup/backfile/content_public/journal/nar/46/d1/10.1093_nar_gkx1098/2/gkx1098.pdf, doi:10.1093/nar/gkx1098.