HRIBO¶
Introduction¶
HRIBO is a workflow for the analysis of prokaryotic Ribo-Seq data. HRIBO is available on GitHub. Among other analyses, it includes the prediction of novel open reading frames (ORFs), metagene profiling, quality control and differential expression analysis. The workflow is based on the workflow management system snakemake and handles the installation of all dependencies via bioconda [GruningDSjodin+17] and docker, as well as all processing steps. The source code of HRIBO is open source and available under the GNU General Public License 3. Installation and basic usage are described below.
Note
For a detailed step by step tutorial on how to use this workflow on a sample dataset, please refer to our example-workflow.
Requirements¶
In the following, we describe all the required files and tools needed to run our workflow.
Warning
HRIBO was tested on a linux system. We cannot guarantee that the workflow will run on other systems.
Tools¶
miniconda3¶
As this workflow is based on the workflow management system snakemake [KosterR18], Snakemake will download all necessary dependencies via conda.
We strongly recommend installing miniconda3 with python3.7.
After downloading the miniconda3 version suiting your linux system, execute the downloaded bash file and follow the instructions given.
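For example, on a 64-bit x86 Linux system, the installer can be downloaded and run as follows (this uses the generic latest-release installer URL; pick a different installer from the conda download archive if your system or desired python version differs):
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh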
HRIBO¶
Using the workflow requires HRIBO. The latest version is available on our GitHub page.
In order to run the workflow, we suggest that you download HRIBO into your project directory.
The following command creates an example directory and changes into it:
mkdir project
cd project
Now, download and unpack the latest version of HRIBO by entering the following commands:
wget https://github.com/RickGelhausen/HRIBO/archive/1.7.0.tar.gz
tar -xzf 1.7.0.tar.gz; mv HRIBO-1.7.0 HRIBO; rm 1.7.0.tar.gz;
HRIBO is now in a subdirectory of your project directory.
snakemake¶
Note
HRIBO was tested using snakemake (version >= 7.24.2).
In order to support docker containers, snakemake requires singularity.
HRIBO requires snakemake and singularity to be installed on your system.
We suggest installing them using conda. To this end, we provide an environment.yaml file in the HRIBO directory.
conda env create --file HRIBO/environment.yaml
This creates a new conda environment called hribo_env and installs snakemake and singularity into the environment. The environment can be activated using:
conda activate hribo_env
and deactivated using:
conda deactivate
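You can verify that both tools are available inside the activated environment, for example by printing their versions:
snakemake --version
singularity --version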
Input files¶
Several input files are required in order to run the workflow: a genome file (.fa), an annotation file (.gff/.gtf) and compressed sequencing files (.fastq.gz).
| File name | Description |
|---|---|
| annotation.gff | user-provided annotation file with genomic features |
| genome.fa | user-provided genome file containing the genome sequence |
| <method>-<condition>-<replicate>.fastq.gz | user-provided compressed sequencing files |
| config.yaml | configuration file to customize the workflow |
| samples.tsv | sample file describing the relation between the input fastq files |
annotation.gff and genome.fa¶
We recommend retrieving both the genome and the annotation files for your organism from National Center for Biotechnology Information (NCBI) or Ensembl Genomes [ZAA+18].
Warning
If you use custom annotation files, ensure that they adhere to the gtf/gff standard. Incorrectly formatted files are likely to cause problems with downstream tools.
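As a purely hypothetical illustration of the expected format, a GFF3 feature line consists of nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes); the identifiers and coordinates below are made up:
NC_002516.2	RefSeq	CDS	483	2027	.	+	0	ID=cds0;locus_tag=PA0001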
Note
For detailed information about downloading and unpacking these files, please refer to our example-workflow.
input .fastq files¶
These are the input files provided by the user. Both single-end and paired-end data are supported.
Note
As most downstream tools do not support paired-end data, we combine the paired-end data into single-end data using flash2. For more information about how to use paired-end data, please refer to the workflow-configuration.
Note
Please ensure that you compress your files in .gz format.
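If your files are still uncompressed, they can be compressed with gzip, or in parallel with pigz:
gzip *.fastq
# or, using 4 threads:
pigz -p 4 *.fastq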
Please ensure that you move all input .fastq.gz files into a folder called fastq (located in your project folder):
mkdir fastq
cp *.fastq.gz fastq/
Sample sheet and configuration file¶
In order to run HRIBO, you have to provide a sample sheet and a configuration file.
Templates for both files are available in the HRIBO folder, in the subfolder templates.
The configuration file is used to allow the user to easily customize certain settings, like the adapter sequence.
The sample sheet is used to specify the relation of the input .fastq files (condition / replicate etc.).
Copy the templates of the sample sheet and the configuration file into the HRIBO folder:
cp HRIBO/templates/samples.tsv HRIBO/
cp HRIBO/templates/config.yaml HRIBO/
Customize the config.yaml using your preferred editor.
Note
For a detailed overview of the available options, please refer to our workflow-configuration.
Edit the sample sheet corresponding to your project. It contains the following variables:
method: indicates the method used for this project; here RIBO for ribosome profiling and RNA for RNA-seq.
condition: indicates the applied condition (e.g. A, B, …).
replicate: ID used to distinguish between the different replicates (e.g. 1, 2, …).
fastqFile: indicates the corresponding fastq file for a given sample.
Note
If you have paired-end data, please ensure that you use the samples_pairedend.tsv file.
As seen in the samples.tsv template:
| method | condition | replicate | fastqFile |
|---|---|---|---|
| RIBO | A | 1 | fastq/RIBO-A-1.fastq.gz |
| RIBO | A | 2 | fastq/RIBO-A-2.fastq.gz |
| RIBO | B | 1 | fastq/RIBO-B-1.fastq.gz |
| RIBO | B | 2 | fastq/RIBO-B-2.fastq.gz |
| RNA | A | 1 | fastq/RNA-A-1.fastq.gz |
| RNA | A | 2 | fastq/RNA-A-2.fastq.gz |
| RNA | B | 1 | fastq/RNA-B-1.fastq.gz |
| RNA | B | 2 | fastq/RNA-B-2.fastq.gz |
Note
This is just an example; please refer to our example-workflow for another one.
cluster.yaml¶
Warning
As we are currently unable to test HRIBO on cluster systems, support for cluster systems is experimental. Since HRIBO is based on snakemake, cluster support is still possible and we added an example SLURM profile. Please check the snakemake documentation for more details.
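A minimal sketch of such a cluster invocation via a snakemake profile (the profile path is an assumption; point --profile at wherever the example SLURM profile lives in your HRIBO checkout):
snakemake -s HRIBO/Snakefile --configfile HRIBO/config.yaml --profile <path/to/slurm_profile> -j 20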
Output files¶
In the following tables all important output files of the workflow are listed.
Note
Files created as intermediate steps of the workflow are omitted from this list (e.g. .bam files).
Note
For more details about the output files, please refer to the analysis results.
Single-file Output¶
| File name | Description |
|---|---|
| samples.xlsx | Excel version of the input samples file. |
| manual.pdf | A PDF file describing the analysis. |
| annotation_total.xlsx | Excel file containing detailed measures for every feature in the input annotation using read counts containing multi-mapping reads. |
| annotation_unique.xlsx | Excel file containing detailed measures for every feature in the input annotation using read counts containing no multi-mapping reads. |
| total_read_counts.xlsx | Excel file containing read counts with multi-mapping reads. |
| unique_read_counts.xlsx | Excel file containing read counts without multi-mapping reads. |
| multiqc_report.html | Quality control report combining all findings of the individual fastQC reports into a well-structured overview file. |
| heatmap_SpearmanCorr_readCounts.pdf | PDF file showing the Spearman correlation between all samples. |
| predictions_reparation.xlsx | Excel file containing detailed measures for every ORF detected by reparation. |
| predictions_reparation.gff | GFF file containing ORFs detected by reparation, for genome browser visualization. |
| potentialStartCodons.gff | GFF file for genome browser visualization containing all potential start codons in the input genome. |
| potentialStopCodons.gff | GFF file for genome browser visualization containing all potential stop codons in the input genome. |
| potentialRibosomeBindingSite.gff | GFF file for genome browser visualization containing all potential ribosome binding sites in the input genome. |
| potentialAlternativeStartCodons.gff | GFF file for genome browser visualization containing all potential alternative start codons in the input genome. |
Multi-file Output¶
| File name | Description |
|---|---|
| riborex/<contrast>_sorted.csv | Differential expression results by Riborex, sorted by pvalue. |
| riborex/<contrast>_significant.csv | Differential expression results by Riborex, only significant results (pvalue < 0.05). |
| xtail/<contrast>_sorted.csv | Differential expression results by xtail, sorted by pvalue. |
| xtail/<contrast>_significant.csv | Differential expression results by xtail, only significant results (pvalue < 0.05). |
| xtail/r_<contrast>.pdf | Differential expression results by xtail, plot with RPF-to-mRNA ratios. |
| xtail/fc_<contrast>.pdf | Differential expression results by xtail, plot with log2 fold change of both mRNA and RPF. |
| <method>-<condition>-<replicate>.X.Y.Z.bw | BigWig file for genome browser visualization, containing a single nucleotide mapping around certain regions. |
| <accession>_Z.Y_profiling.xlsx/tsv | Excel and tsv files containing raw data of the metagene analysis. |
| <accession>_Z.Y_profiling.pdf | Visualization of the metagene analysis. |
Note
<contrast> represents a pair of conditions that are being compared.
Note
The BigWig files are available for different normalization methods, strands and regions: X=(min/mil), Y=(forward/reverse), Z=(fiveprime, threeprime, global, centered).
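As a hypothetical example of this naming scheme, the min-normalized, forward-strand, five-prime coverage file for the Ribo-seq sample of condition A, replicate 1 would be named:
RIBO-A-1.min.forward.fiveprime.bw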
Tool Parameters¶
The tools used in our workflow are listed below, with their version and a short description.
| Tool | Version | Description |
|---|---|---|
| cutadapt | 4.1 | Adapter removal and quality trimming |
| FastQC | 0.11.9 | Quality control |
| MultiQC | 1.13 | Quality control report |
| segemehl | 0.3.4 | Mapping of reads |
| flash2 | 2.2.00 | Merging paired-end samples into single end |
| cufflinks | 2.2.1 | Used to convert gff to gtf |
| bedtools | 2.30.0 | Collection of useful processing tools (e.g. read counting) |
| reparation_blast | 1.0.9 | Prediction of novel open reading frames |
| DeepRibo | 1.1 | Prediction of novel open reading frames |
| riborex | 2.4.0 | Differential expression analysis |
| xtail | 1.1.5 | Differential expression analysis |
Report¶
In order to aggregate the final results into a single folder structure and receive a date-tagged .zip file, you can use the makereport.sh script.
The current date will be used as a tag for the report folder.
bash HRIBO/scripts/makereport.sh <reportname>
Example-workflow¶
A detailed step by step tutorial is available at: example-workflow.
References¶
- GruningDSjodin+17
Björn Grüning, Ryan Dale, Andreas Sjödin, Jillian Rowe, Brad A. Chapman, Christopher H. Tomkins-Tinch, Renan Valieris, and Johannes Köster. Bioconda: a sustainable and comprehensive software distribution for the life sciences. bioRxiv, 2017. URL: https://www.biorxiv.org/content/early/2017/10/27/207092, arXiv:https://www.biorxiv.org/content/early/2017/10/27/207092.full.pdf, doi:10.1101/207092.
- KosterR18
Johannes Köster and Sven Rahmann. Snakemake—a scalable bioinformatics workflow engine. Bioinformatics, ():bty350, 2018. URL: http://dx.doi.org/10.1093/bioinformatics/bty350, arXiv:/oup/backfile/content_public/journal/bioinformatics/pap/10.1093_bioinformatics_bty350/2/bty350.pdf, doi:10.1093/bioinformatics/bty350.
- PGAR19
Anastasia H. Potts, Yinping Guo, Brian M. M. Ahmer, and Tony Romeo. Role of csra in stress responses and metabolism important for salmonella virulence revealed by integrated transcriptomics. PLOS ONE, 14(1):1–30, 01 2019. URL: https://doi.org/10.1371/journal.pone.0211430, doi:10.1371/journal.pone.0211430.
- ZAA+18
Daniel R Zerbino, Premanand Achuthan, Wasiu Akanni, M Ridwan Amode, Daniel Barrell, Jyothish Bhai, Konstantinos Billis, Carla Cummins, Astrid Gall, Carlos García Girón, Laurent Gil, Leo Gordon, Leanne Haggerty, Erin Haskell, Thibaut Hourlier, Osagie G Izuogu, Sophie H Janacek, Thomas Juettemann, Jimmy Kiang To, Matthew R Laird, Ilias Lavidas, Zhicheng Liu, Jane E Loveland, Thomas Maurel, William McLaren, Benjamin Moore, Jonathan Mudge, Daniel N Murphy, Victoria Newman, Michael Nuhn, Denye Ogeh, Chuang Kee Ong, Anne Parker, Mateus Patricio, Harpreet Singh Riat, Helen Schuilenburg, Dan Sheppard, Helen Sparrow, Kieron Taylor, Anja Thormann, Alessandro Vullo, Brandon Walts, Amonida Zadissa, Adam Frankish, Sarah E Hunt, Myrto Kostadima, Nicholas Langridge, Fergal J Martin, Matthieu Muffato, Emily Perry, Magali Ruffier, Dan M Staines, Stephen J Trevanion, Bronwen L Aken, Fiona Cunningham, Andrew Yates, and Paul Flicek. Ensembl 2018. Nucleic Acids Research, 46(D1):D754–D761, 2018. URL: http://dx.doi.org/10.1093/nar/gkx1098, arXiv:/oup/backfile/content_public/journal/nar/46/d1/10.1093_nar_gkx1098/2/gkx1098.pdf, doi:10.1093/nar/gkx1098.
Workflow configuration¶
HRIBO allows different customizations to handle different types of input data and to adapt the analysis. On this page, we explain the different options that can be set to easily customize the workflow.
Default workflow¶
We provide a default config.yaml with generally well-defined default values in the template folder of the HRIBO GitHub repository.
Biology Settings¶
This section contains general settings for the workflow.
adapter sequence¶
adapter: ""
The adapter sequence that will be removed using cutadapt. It is important to add it if you know that your data was not trimmed before.
genome file¶
genome: "genome.fa"
The path to the genome file in fasta format. Only fasta format is supported. Make sure that the genome file has the same identifiers as the reference annotation file.
annotation file¶
annotation: "annotation.gff"
The path to the annotation file in gff format. Only gff format is supported. Make sure that the annotation file has the same identifiers as the reference genome file.
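A quick sanity check for matching identifiers, assuming standard fasta headers and the sequence identifier in the first GFF column, is to compare the output of the following two commands:
grep ">" genome.fa | cut -d " " -f1 | tr -d ">" | sort -u
grep -v "^#" annotation.gff | cut -f1 | sort -u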
sample sheet¶
samples: "HRIBO/samples.tsv"
The path to the sample sheet (example provided in the templates folder). The sample sheet describes the relation between the different samples used for the experiment.
alternative start codons¶
alternativestartcodons: ["GTG","TTG"]
A list of alternative start codons that will be used to predict ORFs. The default is ["GTG","TTG"].
Differential Expression Settings¶
differentialexpression¶
differentialexpression: "on"
This option allows you to turn on or off differential expression analysis. If you do not have multiple conditions defined in the sample sheet and differential expression is activated, you will receive an error message. Options are “on / off”.
features¶
features: ["CDS", "sRNA"]
This option allows you to specify which features should be used for differential expression analysis. Any feature that appears in your reference annotation is allowed. We suggest using CDS and sRNA features.
contrasts¶
contrasts: ["treated1-untreated1", "treated2-untreated2"]
This option allows you to specify which contrasts should be used for differential expression analysis. The order affects the directionality of the log2FC values in the output files; for a contrast "treated1-untreated1", positive log2FC values typically indicate higher values in treated1.
adjusted pvalue cutoff¶
padjCutoff: 0.05
This option allows you to specify the adjusted pvalue cutoff for differential expression analysis. The default is 0.05. All results will be present in the output; this cutoff is solely used to add additional pre-filtered lists to the output excel tables.
log2 fold change cutoff¶
log2fcCutoff: 1.0
This option allows you to specify the log2 fold change cutoff for differential expression analysis. The default is 1.0. All results will be present in the output; this cutoff is solely used to add additional pre-filtered lists to the output excel tables. Only positive values are allowed; the respective negative values used to determine down-regulation are generated automatically by multiplying by -1.
ORF predictions¶
deepribo: "on"
Activating DeepRibo predictions will give you a separate file with ORF predictions. In our experience, the top DeepRibo results tend to be better than those of reparation. For archaea, where reparation performs very poorly, DeepRibo is the preferred option.
Warning
DeepRibo cannot cope with genomes containing special IUPAC symbols; ensure that your genome file contains only A, G, C, T, N symbols.
Read statistics Settings¶
readLengths: "10-80"
This option allows you to specify the read lengths that should be used for read statistics analysis. The default is 10-80.
We allow combinations of intervals and single values, e.g. 10-40,55,70.
Metagene Profiling Settings¶
There exist multiple options to customize the metagene profiling analysis.
Positions outside of the ORF¶
positionsOutsideORF: 50
This option allows you to specify the number of positions outside of the ORF that should be considered for metagene profiling analysis. The default is 50.
Positions inside of the ORF¶
positionsInsideORF: 150
This option allows you to specify the number of positions inside of the ORF that should be considered for metagene profiling analysis. The default is 150.
Genes that are shorter than the specified number of positions inside of the ORF will be ignored.
Filtering Methods¶
filteringMethods: ["overlap", "length", "rpkm"]
This option allows you to specify which filtering methods should be used for metagene profiling analysis. The default is ["overlap", "length", "rpkm"].
These methods are used to filter out genes that would cause artifacts in the metagene profiling analysis.
- The overlap method filters out genes that overlap with other genes.
- The length method filters out genes that are shorter than the threshold.
- The rpkm method filters out genes below the rpkm threshold.
Neighboring Genes Distance¶
neighboringGenesDistance: 50
This option allows you to specify the distance to neighboring genes that will be considered when filtering out overlapping genes.
The default is 50. This means that genes that are closer than 50nt to another gene will be filtered out.
RPKM Threshold¶
rpkmThreshold: 10.0
This option allows you to specify the RPKM threshold; genes below this threshold will be filtered out.
The default is 10.0.
Length Cutoff¶
lengthCutoff: 50
This option allows you to specify the length cutoff; genes shorter than this cutoff will be filtered out. Be aware that genes shorter than positionsInsideORF will be filtered out anyway.
Mapping Methods¶
mappingMethods: ["fiveprime", "threeprime", "centered", "global"]
This option allows you to specify which mapping methods should be used for metagene profiling analysis. The default is ["fiveprime", "threeprime"].
Multiple mappings can be used and result in multiple output files/folders.
- The fiveprime method will use the 5’ end of the reads to count the number of reads per position.
- The threeprime method will use the 3’ end of the reads to count the number of reads per position.
- The centered method will use the center nucleotides of each read to count the number of reads per position.
- The global method will use the entire read to count the number of reads per position.
Read Lengths¶
readLengths: "25-34"
This option allows you to specify the read lengths that should be used for metagene profiling analysis. The default is 25-34.
We allow combinations of intervals and single values, e.g. 25-34,40,50.
Normalization Methods¶
normalizationMethods: ["raw", "cpm"]
This option allows you to specify which normalization methods should be used for metagene profiling analysis. The default is ["raw", "cpm"].
Output Formats¶
outputFormats: ["svg", "pdf", "png", "jpg", "interactive"]
This option allows you to specify which output formats should be used for metagene profiling analysis. The default is ["svg", "interactive"].
The interactive option will generate an interactive html file that can be used to explore the metagene profiling results.
PlotlyJS¶
This option is only used when the interactive output format is selected.
includePlotlyJS: "integrated"
This option allows you to specify how the plotly.js library should be included in the interactive html file. The default is integrated.
- The integrated option will include the plotly.js library in the html file. The file will be larger, but can be used offline. (+3.7mb/file)
- The online option will include a link to the plotly.js library in the html file.
- The local option will include a link to a local plotly.js library in the html file. This option requires the plotly.js library to be available in the same folder as the html file.
Colors¶
colorList: []
This option allows you to specify a list of colors that should be used for the metagene profiling analysis. The default is [].
By default, a list of colorblind-friendly colors will be used.
If you want to change the colors, make sure that the number of colors matches the number of samples.
We also suggest using hex colors, e.g. #ff0000.
Paired-end support¶
Until full paired-end support is added, we allow paired-end data by merging it into single-end data.
To this end, we use the tool flash2 to convert paired-end data to single-end data.
In order to use paired-end data, simply replace the Snakefile with the Snakefile_pairedend.
This requires a special samples_pairedend.tsv, which is also available in the HRIBO templates folder.
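A minimal sketch of the corresponding local run, assuming your config and sample sheet are set up as described above (this mirrors the single-end invocation shown in the example workflow):
snakemake --use-conda -s HRIBO/Snakefile_pairedend --configfile HRIBO/config.yaml --directory ${PWD} -j 10 --latency-wait 60
The samples_pairedend.tsv template looks as follows: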
| method | condition | replicate | fastqFile | fastqFile2 |
|---|---|---|---|---|
| RIBO | A | 1 | fastq/RIBO-A-1_R1.fastq.gz | fastq/RIBO-A-1_R2.fastq.gz |
| RIBO | A | 2 | fastq/RIBO-A-2_R1.fastq.gz | fastq/RIBO-A-2_R2.fastq.gz |
| RIBO | B | 1 | fastq/RIBO-B-1_R1.fastq.gz | fastq/RIBO-B-1_R2.fastq.gz |
| RIBO | B | 2 | fastq/RIBO-B-2_R1.fastq.gz | fastq/RIBO-B-2_R2.fastq.gz |
| RNA | A | 1 | fastq/RNA-A-1_R1.fastq.gz | fastq/RNA-A-1_R2.fastq.gz |
| RNA | A | 2 | fastq/RNA-A-2_R1.fastq.gz | fastq/RNA-A-2_R2.fastq.gz |
| RNA | B | 1 | fastq/RNA-B-1_R1.fastq.gz | fastq/RNA-B-1_R2.fastq.gz |
| RNA | B | 2 | fastq/RNA-B-2_R1.fastq.gz | fastq/RNA-B-2_R2.fastq.gz |
Analysis result files¶
The important files in this workflow are listed and explained below.
ORF Predictions¶
The output files containing information about predicted Open Reading Frames; these also include novel predictions.
predictions_reparation.xlsx¶
This file contains all reparation ORF predictions.
| Column name | Description |
|---|---|
| Identifier | Unique identifier describing the entry. |
| Genome | The genome accession identifier. |
| Source | The source of the ORF. (here reparation) |
| Feature | The feature of the ORF. (here CDS) |
| Start | The start position of the ORF. |
| Stop | The stop position of the ORF. |
| Strand | The strand of the ORF. (+/-) |
| Pred_probability | The probability of this ORF. (0.5-1, 1 being the best value) |
| Locus_tag | If the detected ORF is already in the annotation, this gives its locus tag. |
| Old_locus_tag | The old locus tag of a gene. (if available in the annotation) |
| Name | If the detected ORF is already in the annotation, this gives its name. |
| Length | The length of the ORF. |
| Codon_count | The number of codons in the ORF. (length/3) |
| <method>-<condition>-<replicate>_TE | The translational efficiency for the given sample. |
| <method>-<condition>-<replicate>_rpkm | The RPKM for the given sample. |
| Evidence | The <condition>-<replicate> sample in which this ORF was predicted. |
| Start_codon | The start codon of the ORF. |
| Stop_codon | The stop codon of the ORF. |
| 15nt_upstream | The 15nt upstream of the start codon. |
| Nucleotide_seq | The nucleotide sequence of the ORF. |
| Aminoacid_seq | The amino acid sequence of the ORF. |
predictions_reparation.gff¶
An annotation file in .gff format containing all predictions of reparation for visualization in a genome browser.
predictions_deepribo.xlsx¶
Note
These files are only available when activating DeepRibo predictions in the config.yaml (see configurations).
This file contains all DeepRibo ORF predictions.
| Column name | Description |
|---|---|
| Identifier | Unique identifier describing the entry. |
| Genome | The genome accession identifier. |
| Source | The source of the ORF. (here deepribo) |
| Feature | The feature of the ORF. (here CDS) |
| Start | The start position of the ORF. |
| Stop | The stop position of the ORF. |
| Strand | The strand of the ORF. (+/-) |
| Pred_value | The value DeepRibo attributes to the given prediction. |
| Pred_rank | The rank calculated from the prediction value. (the best prediction has rank 1) |
| Novel_rank | A special ranking involving only novel ORFs that are not in the annotation. |
| Locus_tag | If the detected ORF is already in the annotation, this gives its locus tag. |
| Old_locus_tag | The old locus tag of a gene. (if available in the annotation) |
| Name | If the detected ORF is already in the annotation, this gives its name. |
| Length | The length of the ORF. |
| Codon_count | The number of codons in the ORF. (length/3) |
| <method>-<condition>-<replicate>_TE | The translational efficiency for the given sample. |
| <method>-<condition>-<replicate>_rpkm | The RPKM for the given sample. |
| Evidence | The <condition>-<replicate> sample in which this ORF was predicted. |
| Start_codon | The start codon of the ORF. |
| Stop_codon | The stop codon of the ORF. |
| 15nt_upstream | The 15nt upstream of the start codon. |
| Nucleotide_seq | The nucleotide sequence of the ORF. |
| Aminoacid_seq | The amino acid sequence of the ORF. |
predictions_deepribo.gff¶
Note
These files are only available when activating DeepRibo predictions in the config.yaml (see configurations).
An annotation file in .gff3 format containing all predictions of DeepRibo for visualization in a genome browser.
Quality control¶
This comprises all files that can help to perform quality control on all input samples.
multiqc_report.html¶
The multiQC report collects information from different tools, including fastQC and subread featurecounts.
The general statistics give an overview over:
- the number of duplicates
- the GC content
- the average read lengths
- the number of reads (in millions)
These statistics are collected after each processing step of our pipeline:
- raw: the unprocessed data
- trimmed: the data after trimming the adapter sequences
- mapped: the data after mapping with segemehl
- unique: the data after removing multi-mapping reads
- norRNA: the data after filtering out the rRNA
Further, feature counts are provided for different features from the annotation file (i.e. how many reads map to each feature). This includes all (featurecount), rRNA, norRNA (after filtering), tRNA and ncRNA. This is followed by a fastQC report including sequence counts, sequence quality histograms, per sequence quality scores, per base sequence content, per sequence GC content, per base N content, sequence length distribution, sequence duplication levels, overrepresented features, adapter content and a status overview.
heatmap_SpearmanCorr_readCounts.pdf¶
Spearman correlation coefficients of read counts. The dendrogram indicates which samples' read counts are most similar to each other. Since the correlation should always be higher between experiments of the same condition and experiment type (e.g. replicates) than between others, this is a rapid way to quality-control the labeling and consistency of the input data.
annotation_total.xlsx¶
This file contains detailed measures for every feature in the input annotation using read counts including multi-mapping reads.
| Column name | Description |
|---|---|
| Identifier | Unique identifier describing the entry. |
| Genome | The genome accession identifier. |
| Source | The source of the annotated feature. |
| Feature | The feature type of the annotated feature. |
| Start | The start position of the annotated feature. |
| Stop | The stop position of the annotated feature. |
| Strand | The strand of the annotated feature. (+/-) |
| Locus_tag | The locus tag of the annotated feature. (if available) |
| Old_locus_tag | The old locus tag of a gene. (if available in the annotation) |
| Name | The name of the annotated feature. (if available) |
| Length | The length of the annotated feature. |
| Codon_count | The number of codons in the annotated feature. (length/3) |
| <method>-<condition>-<replicate>_TE | The translational efficiency for the given sample. |
| <method>-<condition>-<replicate>_rpkm | The RPKM for the given sample. (ReadsPerKilobaseMillion) |
| Start_codon | The start codon of the annotated feature. |
| Stop_codon | The stop codon of the annotated feature. |
| 15nt_upstream | The 15nt upstream of the start codon. |
| Nucleotide_seq | The nucleotide sequence of the annotated feature. |
| Aminoacid_seq | The amino acid sequence of the annotated feature. |
| Product | The product of the annotated feature. (if available) |
| Note | The note of the annotated feature. (if available) |
total_read_counts.xlsx¶
This file shows the overall read-counts for each feature annotated in the user-provided annotation, after mapping and before removal of multi-mapping reads.
annotation_unique.xlsx¶
This file contains detailed measures for every feature in the input annotation using read counts after removal of multi-mapping reads.
| Column name | Description |
|---|---|
| Identifier | Unique identifier describing the entry. |
| Genome | The genome accession identifier. |
| Source | The source of the annotated feature. |
| Feature | The feature type of the annotated feature. |
| Start | The start position of the annotated feature. |
| Stop | The stop position of the annotated feature. |
| Strand | The strand of the annotated feature. (+/-) |
| Locus_tag | The locus tag of the annotated feature. (if available) |
| Old_locus_tag | The old locus tag of a gene. (if available in the annotation) |
| Name | The name of the annotated feature. (if available) |
| Length | The length of the annotated feature. |
| Codon_count | The number of codons in the annotated feature. (length/3) |
| <method>-<condition>-<replicate>_TE | The translational efficiency for the given sample. |
| <method>-<condition>-<replicate>_rpkm | The RPKM for the given sample. (ReadsPerKilobaseMillion) |
| Start_codon | The start codon of the annotated feature. |
| Stop_codon | The stop codon of the annotated feature. |
| 15nt_upstream | The 15nt upstream of the start codon. |
| Nucleotide_seq | The nucleotide sequence of the annotated feature. |
| Aminoacid_seq | The amino acid sequence of the annotated feature. |
| Product | The product of the annotated feature. (if available) |
| Note | The note of the annotated feature. (if available) |
unique_read_counts.xlsx¶
This file shows the overall read-counts for each feature annotated in the user-provided annotation, after mapping and after removal of multi-mapping reads.
genome-browser¶
The files that can be used for visualization in a genome browser.
updated_annotation.gff¶
A gff track containing the original annotation together with the new predictions by reparation.
potentialStartCodons.gff¶
A genome browser track with all possible start codons.
potentialStopCodons.gff¶
A genome browser track with all possible stop codons.
potentialRibosomeBindingSite.gff¶
A genome browser track with possible ribosome binding sites.
potentialAlternativeStartCodons.gff¶
A genome browser track with alternative start codons.
BigWig coverage files¶
We offer many different single-nucleotide mapping bigwig files for genome browser visualization. These files are available for different regions and mapping methods.
- global: the full read is mapped
- centered: the region around the center of the read
- threeprime: the region around the three prime end
- fiveprime: the region around the five prime end
These are all available with the following normalization methods:
- raw: raw, unprocessed files. This should only be used to check the coverage of a single file; it should not be used to compare against other files.
- min: normalized by the minimal number of total reads per sample (factor = min. number of reads / number of reads). This is the recommended normalization when comparing different samples from the same experiment.
- mil: normalized to reads per million (factor = 1000000 / number of reads). This is the recommended normalization when comparing samples from different experiments.
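As a hypothetical worked example of both factors, assume two samples with 10,000,000 and 20,000,000 mapped reads:
# min: smallest library = 10,000,000 reads
#   factor(sample 1) = 10000000 / 10000000 = 1.0
#   factor(sample 2) = 10000000 / 20000000 = 0.5
# mil:
#   factor(sample 1) = 1000000 / 10000000 = 0.1
#   factor(sample 2) = 1000000 / 20000000 = 0.05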
Differential Expression¶
Files related to the differential expression analysis. All output tables contain multiple pre-filtered sheets using the padj cutoff (0.05) and showing up- or down-regulated genes.
deltate/<contrast>_sorted.xlsx¶
Table containing all differential expression results from deltate.
riborex/<contrast>_sorted.xlsx¶
Table containing all differential expression results from riborex.
xtail/<contrast>_sorted.xlsx¶
Table containing all differential expression results from xtail.
xtail/r_<contrast>.pdf¶
This figure shows the RPF-to-mRNA ratios in two conditions, where the position of each gene is determined by its RPF-to-mRNA ratio (log2R) in the two conditions, represented on the x-axis and y-axis respectively. The points are color-coded by the final pvalue obtained with xtail (more significant p values having a darker color):
- blue: genes with log2R larger in the first condition than in the second condition.
- red: genes with log2R larger in the second condition than in the first condition.
- green: genes with log2R changing homodirectionally in the two conditions.
- yellow: genes with log2R changing antidirectionally in the two conditions.
xtail/fc_<contrast>.pdf¶
This figure shows the result of the differential expression analysis at the two expression levels, where each gene is a dot whose position is determined by its log2 fold change (log2FC) at the transcriptional level (mRNA log2FC), represented on the x-axis, and the log2FC at the translational level (RPF log2FC), represented on the y-axis. The points are color-coded by the final pvalue obtained with xtail (more significant p values having a darker color):
- blue: genes whose mRNA log2FC is larger than 1 (transcriptional level).
- red: genes whose RPF log2FC is larger than 1 (translational level).
- green: genes changing homodirectionally at both levels.
- yellow: genes changing antidirectionally at the two levels.
output tables¶
All output tables were transferred into excel format for easier handling. Additional information was added.
| Column name | Description |
|---|---|
| Genome | The genome accession identifier. |
| Start | The start position of the analysed feature. |
| Stop | The stop position of the analysed feature. |
| Strand | The strand orientation of the analysed feature. (+/-) |
| Locus_tag | The locus tag of the analysed feature. (if available) |
| Old_locus_tag | The old locus tag of a gene. (if available in the annotation) |
| Identifier | Unique identifier <Genome>:<start>-<stop>:<strand> |
| Name | The name of the annotated feature. (if available) |
| RIBO_baseMean | The average of normalized count values, divided by size factors, for all RIBO samples. |
| RIBO_log2FoldChange | The log2 fold change between the conditions of the different RIBO samples. |
| RIBO_lfcSE | The standard error of the log2FC for the RIBO samples. |
| RIBO_pvalue | The unadjusted pvalue for the RIBO samples. |
| RIBO_padj | The adjusted pvalue (adjusted for multiple testing) for the RIBO samples. |
| RNA_baseMean | The average of normalized count values, divided by size factors, for all RNA samples. |
| RNA_log2FoldChange | The log2 fold change between the conditions of the different RNA samples. |
| RNA_lfcSE | The standard error of the log2FC for the RNA samples. |
| RNA_pvalue | The unadjusted pvalue for the RNA samples. |
| RNA_padj | The adjusted pvalue (adjusted for multiple testing) for the RNA samples. |
| TE_baseMean | The average of normalized count values, divided by size factors, for all TE samples. |
| TE_log2FoldChange | The log2 fold change between the conditions of the different TE samples. |
| TE_lfcSE | The standard error of the log2FC for the TE samples. |
| TE_pvalue | The unadjusted pvalue for the TE samples. |
| TE_padj | The adjusted pvalue (adjusted for multiple testing) for the TE samples. |
| Length | The length of the analysed feature. |
| Codon_count | The number of codons in the analysed feature. |
| Start_codon | The start codon of the analysed feature. |
| Stop_codon | The stop codon of the analysed feature. |
| 15nt_upstream | The 15nt upstream of the start codon. |
| Nucleotide_seq | The nucleotide sequence of the analysed feature. |
| Aminoacid_seq | The amino acid sequence of the analysed feature. |
Note
RIBO and RNA columns are only available for deltaTE.
Each table has several sheets:
- all: contains all entries.
- up: contains all entries with a log2FC > 1 that are considered significant (padj < 0.05).
- down: contains all entries with a log2FC < -1 that are considered significant (padj < 0.05).
Note
There are slight differences between the output tables of the different tools.
Metagene Profiling Analysis¶
Please refer to the metagene profiling page for further details.
Additional output¶
samples.xlsx¶
An excel representation of the input sample file.
manual.pdf¶
A PDF format file giving some explanations about the output files, contained in the final result report.
overview.xlsx¶
An overview table containing all information gathered from the prediction tools and the differential expression analysis. The contents of this table change depending on which options are set. The overview table for the default workflow will contain annotation, reparation, deepribo and differential expression output.
| Column name | Description |
|---|---|
| Identifier | Unique identifier describing the entry. |
| Genome | The genome accession identifier. |
| Start | The start position of the ORF. |
| Stop | The stop position of the ORF. |
| Strand | The strand of the ORF. (+/-) |
| Locus_tag | The locus tag of the ORF. (if not novel) |
| Overlapping_genes | Genes that overlap with the predicted ORF. |
| Old_locus_tag | The old locus tag of a gene. (if available in the annotation) |
| Name | The name of the ORF. (if not novel) |
| Gene_name | The name of the ORF's associated gene feature. (if not novel) |
| Length | The length of the ORF. |
| Codon_count | The number of codons in the ORF. (length/3) |
| Start_codon | The start codon of the annotated feature. |
| Stop_codon | The stop codon of the annotated feature. |
| 15nt_upstream | The 15nt upstream of the start codon. |
| Nucleotide_seq | The nucleotide sequence of the annotated feature. |
| Aminoacid_seq | The amino acid sequence of the annotated feature. |
| <method>-<condition>-<replicate>_TE | The translational efficiency for the given sample. |
| <method>-<condition>-<replicate>_rpkm | The RPKM for the given sample. (ReadsPerKilobaseMillion) |
| Evidence_reparation | The sample this ORF was predicted in. (for reparation) |
| Reparation_probability | The probability of this ORF. (0.5-1, 1 being the best value) |
| Evidence_deepribo | The sample this ORF was predicted in. (for deepribo) |
| Deepribo_rank | The deepribo rank for this ORF. (1 being the best value, 999999 undefined) |
| Deepribo_score | The score the deepribo rank is based on. |
| riborex_pvalue | The pvalue. (determined by riborex) |
| riborex_pvalue_adjusted | The adjusted pvalue. (determined by riborex) |
| riborex_log2FC | The log2FC. (determined by riborex) |
| xtail_pvalue | The pvalue. (determined by xtail) |
| xtail_pvalue_adjusted | The adjusted pvalue. (determined by xtail) |
| xtail_log2FC | The log2FC. (determined by xtail) |
Metagene profiling¶
Note
Be aware that we are still actively working on our metagene profiling scripts. The main functionality is available, but the plotting scripts may be subject to further changes. Result tables in .tsv and .xlsx format are provided if you prefer using your own plotting scripts.
In this section, we describe the metagene profiling data provided by HRIBO.
What is metagene profiling?¶
Metagene profiling analyses the distribution of mapped reads around annotated start codons and stop codons.
For Ribo-seq, it is expected that the ribosome protects a specific range of read lengths, often typical for the investigated group of organisms, from digestion by nuclease. These reads should show a characteristic peak around the start codon, corresponding to the high frequency with which ribosomes are bound there.
We output and plot the metagene profiling for each individual fragment length as a quality control for the Ribo-seq protocol. If the distribution for all read lengths is atypical, arresting the ribosomes failed.
Metagene output¶
We provide the possibility to analyse the metagene profiling for different coverage mapping methods. For each sample, both the start and stop codon profiling will be performed and plotted.
Furthermore, we provide normalized (cpm, window) and unnormalized (raw) versions for each sample analysed. The normalized version takes into account the total value for each position, as well as the current window length.
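As a rough sketch, the usual counts-per-million definition is as follows (the exact window-length scaling HRIBO applies on top of this is not spelled out here, so treat this as an assumption):
# cpm(position) = counts(position) / total_mapped_reads * 1000000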
The metagene profiling is done for each sample and each coverage mapping method:
- global: the full read is mapped
- centered: the region around the center is mapped
- threeprime: the region around the three prime end is mapped
- fiveprime: the region around the five prime end is mapped
For each of these sample/mapping combinations, result tables in excel .xlsx format are provided, as well as an interactive plot or several static plots.
Note
New filtering methods were added. For details check the options section.
Example output¶
Note
This section contains older output figures, but these are still valid as the overall method did not change.
In the following, we show a few example figures of how the results can look.

This example shows a very sharp peak around +17nt. (mapping: threeprime)
| 30 | 31 | 32 | 33 | 34 | sum | coordinates |
|---|---|---|---|---|---|---|
| … | … | … | … | … | … | … |
| 1321 | 1242 | 1269 | 773 | 92 | 4697 | 11 |
| 1564 | 1193 | 1358 | 624 | 172 | 4911 | 12 |
| 4272 | 2317 | 1659 | 906 | 125 | 9279 | 13 |
| 6870 | 3009 | 1814 | 738 | 91 | 12522 | 14 |
| 6214 | 7498 | 2733 | 1265 | 140 | 17850 | 15 |
| 17797 | 27399 | 28398 | 13322 | 1245 | 88161 | 16 |
| 7686 | 22687 | 62042 | 25575 | 5427 | 123417 | 17 |
| 1272 | 5200 | 11156 | 11202 | 1382 | 30212 | 18 |
| 1584 | 3520 | 10027 | 22530 | 5325 | 42986 | 19 |
| 792 | 2166 | 2636 | 5044 | 1969 | 12607 | 20 |
| 176 | 568 | 1082 | 931 | 307 | 3064 | 21 |
| 233 | 321 | 930 | 1102 | 219 | 2805 | 22 |
| 204 | 211 | 270 | 378 | 89 | 1152 | 23 |
| … | … | … | … | … | … | … |
The tables give a better overview of the data, showing that the highest peak is found at +17 for a read length of 32nt.

In this example, the peak is less sharp, but it is still clearly distinguishable and located around +13nt. (mapping: threeprime)
| 26 | 27 | 28 | 29 | 30 | sum | coordinates |
|---|---|---|---|---|---|---|
| … | … | … | … | … | … | … |
| 294 | 308 | 198 | 297 | 321 | 1418 | 9 |
| 177 | 338 | 204 | 170 | 286 | 1175 | 10 |
| 217 | 236 | 248 | 164 | 162 | 1027 | 11 |
| 2545 | 804 | 976 | 1379 | 548 | 6252 | 12 |
| 8593 | 10334 | 5286 | 4359 | 5519 | 34091 | 13 |
| 6120 | 7662 | 3238 | 2403 | 3590 | 23013 | 14 |
| 3416 | 4134 | 3807 | 2633 | 1643 | 15633 | 15 |
| 1311 | 4097 | 5079 | 5528 | 3020 | 19035 | 16 |
| 270 | 340 | 1086 | 1682 | 1222 | 4600 | 17 |
| 202 | 282 | 269 | 632 | 761 | 2146 | 18 |
| … | … | … | … | … | … | … |
This table shows a less clear-cut situation: the offset at +13 for read length 27nt is the highest, but it is very close to the surrounding values.
The metagene analysis is very dependent on the analysed data. We have observed cases with multiple peaks, e.g. one large peak and a few slightly smaller peaks.

This shows a case with a strong peak at around -15nt, with multiple smaller peaks at -22, +112, +219, and +291. These kinds of peaks can have multiple causes. Causes we have observed so far were:
- tRNA or rRNA that was not filtered out correctly
- singular genes that have a large number of reads attributed to them (due to a biological reason or faulty sequencing)
Example workflow¶
The retrieval of input files and the execution of the workflow, both locally and on a server cluster via a queuing system, are demonstrated using an example with data available from NCBI.
This dataset is available under the accession number PRJNA379630.
Note
In this tutorial, we will show the basic functionalities of our workflow; for information about additional options, please refer to: workflow-configuration.
Note
Ensure that you have miniconda3 installed and a snakemake environment set up. Please refer to the overview for details on the installation.
Setup¶
First of all, we start by creating the project directory and changing to it.
mkdir project
cd project
We then download the latest version of HRIBO into the newly created project folder and unpack it.
wget https://github.com/RickGelhausen/HRIBO/archive/1.7.0.tar.gz
tar -xzf 1.7.0.tar.gz; mv HRIBO-1.7.0 HRIBO; rm 1.7.0.tar.gz;
Retrieve and prepare input files¶
Before starting the workflow, we have to acquire and prepare several input files. These files are the annotation file, the genome file, the fastq files, the configuration file and the sample sheet.
Annotation and genome files¶
First, we want to retrieve the annotation file and the genome file. In this case, we can find both on NCBI using the accession number NC_002516.2.

On this page, we can directly retrieve both files by clicking on the according download links next to the file descriptions. Alternatively, you can directly download them using the following commands:
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/765/GCF_000006765.1_ASM676v1/GCF_000006765.1_ASM676v1_genomic.gff.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/765/GCF_000006765.1_ASM676v1/GCF_000006765.1_ASM676v1_genomic.fna.gz
Then, we unpack and rename both files.
gunzip GCF_000006765.1_ASM676v1_genomic.gff.gz && mv GCF_000006765.1_ASM676v1_genomic.gff annotation.gff
gunzip GCF_000006765.1_ASM676v1_genomic.fna.gz && mv GCF_000006765.1_ASM676v1_genomic.fna genome.fa
.fastq files¶
Next, we want to acquire the fastq files. The fastq files are available under the accession number PRJNA379630 on NCBI.
The files have to be downloaded using the Sequence Read Archive (SRA).
There are multiple ways of downloading files from SRA as explained here.
As we already have conda installed, the easiest way is to install the sra-tools:
conda create -n sra-tools -c bioconda -c conda-forge sra-tools pigz
This will create a conda environment containing the sra-tools. Using these, we can simply pass the SRA identifiers and download the data:
conda activate sra-tools;
fasterq-dump SRR5356907; pigz -p 2 SRR5356907.fastq; mv SRR5356907.fastq.gz RIBO-PAO1-gly-1.fastq.gz;
fasterq-dump SRR5356908; pigz -p 2 SRR5356908.fastq; mv SRR5356908.fastq.gz RNA-PAO1-gly-1.fastq.gz;
conda deactivate;
Note
Due to the runtime of several tools, especially the mapping by segemehl, this tutorial only uses one condition and replicate. If available, it is advisable to use as many replicates as possible.
Warning
If you have a bad internet connection, this step might take some time. If you prefer, you can also use your own .fastq files, but ensure that you use the correct annotation and genome files and that you compress them in .gz format (using gzip, pigz, etc.).
This will download compressed files for each of the required .fastq files. We will move them into a folder called fastq.
mkdir fastq;
mv *.fastq.gz fastq;
Sample sheet and configuration file¶
Finally, we will prepare the configuration file (config.yaml) and the sample sheet (samples.tsv). We start by copying templates for both files from HRIBO/templates/ into the HRIBO/ folder.
cp HRIBO/templates/samples.tsv HRIBO/
The sample file looks as follows:
| method | condition | replicate | fastqFile |
|---|---|---|---|
| RIBO | A | 1 | fastq/RIBO-A-1.fastq.gz |
| RIBO | A | 2 | fastq/RIBO-A-2.fastq.gz |
| RIBO | B | 1 | fastq/RIBO-B-1.fastq.gz |
| RIBO | B | 2 | fastq/RIBO-B-2.fastq.gz |
| RNA | A | 1 | fastq/RNA-A-1.fastq.gz |
| RNA | A | 2 | fastq/RNA-A-2.fastq.gz |
| RNA | B | 1 | fastq/RNA-B-1.fastq.gz |
| RNA | B | 2 | fastq/RNA-B-2.fastq.gz |
Note
When using your own data, use any editor (vi(m), gedit, nano, atom, …) to customize the sample sheet.
Warning
Please ensure not to replace any tabulator symbols with spaces while changing this file.
We will rewrite this file to fit the previously downloaded .fastq.gz files.
| method | condition | replicate | fastqFile |
|---|---|---|---|
| RIBO | GLY | 1 | fastq/RIBO-PAO1-gly-1.fastq.gz |
| RNA | GLY | 1 | fastq/RNA-PAO1-gly-1.fastq.gz |
Next, we are going to set up the config.yaml.
cp HRIBO/templates/config.yaml HRIBO/
The config file can be used to easily change the parameters of HRIBO.
Note
For a detailed overview of the available options, please refer to our workflow-configuration.
In our small example, we will adjust the adapter sequence and, to reduce the runtime, deactivate deepribo, which leads to the following changes in the config.yaml file:
biologySettings:
    # Adapter sequence used
    adapter: "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"

predictionSettings:
    # Deepribo predictions: on / off
    deepribo: "off"
Running the workflow¶
Now that all the required files are prepared, we can start running the workflow, either locally or in a cluster environment.
Warning
Before you start using snakemake, remember to activate the environment first.
conda activate hribo_env
Run the workflow locally¶
Use the following steps when you plan to execute the workflow on a single server, cloud instance or workstation.
Warning
Please be aware that some steps of the workflow require a lot of memory or time, depending on the size of your input data. To get a better idea about the memory consumption, you can have a look at the provided sge.yaml or torque.yaml files.
Navigate to the project folder containing your annotation and genome files, as well as the HRIBO folder. Start the workflow locally from this folder by running:
snakemake --use-conda -s HRIBO/Snakefile --configfile HRIBO/config.yaml --directory ${PWD} -j 10 --latency-wait 60
This will start the workflow locally.
- --use-conda: instructs snakemake to download tool dependencies from conda.
- -s: specifies the Snakefile to be used.
- --configfile: specifies the config file to be used.
- --directory: specifies your current path.
- -j: specifies the maximum number of cores snakemake is allowed to use.
- --latency-wait: specifies how long (in seconds) snakemake will wait for filesystem latencies before declaring a file to be missing.
Run Snakemake in a cluster environment¶
Use the following steps if you are executing the workflow via a queuing system. Edit the configuration file <cluster>.yaml according to your queuing system setup and cluster hardware.
Navigate to the project folder on your cluster system. Start the workflow from this folder by running the following (this system call shows the usage with Grid Engine):
snakemake --use-conda -s HRIBO/Snakefile --configfile HRIBO/config.yaml --directory ${PWD} -j 20 --cluster-config HRIBO/sge.yaml
Note
Ensure that you use an appropriate <cluster>.yaml for your cluster system. We provide one for SGE and TORQUE based systems.
Example: Run Snakemake in a cluster environment¶
Warning
Be advised that this is a specific example; the required options may change depending on your system.
We ran the tutorial workflow in a cluster environment, specifically a TORQUE cluster environment.
Therefore, we created a bash script torque.sh in our project folder.
vi torque.sh
Note
Please note that all arguments enclosed in <> have to be customized. This script will only work if your cluster uses the TORQUE queuing system.
We proceeded by writing the queuing script:
#!/bin/bash
#PBS -N <ProjectName>
#PBS -S /bin/bash
#PBS -q "long"
#PBS -d <PATH/ProjectFolder>
#PBS -l nodes=1:ppn=1
#PBS -o <PATH/ProjectFolder>
#PBS -j oe
cd <PATH/ProjectFolder>
source activate HRIBO
snakemake --latency-wait 600 --use-conda -s HRIBO/Snakefile --configfile HRIBO/config.yaml --directory ${PWD} -j 20 --cluster-config HRIBO/torque.yaml --cluster "qsub -N {cluster.jobname} -S /bin/bash -q {cluster.qname} -d <PATH/ProjectFolder> -l {cluster.resources} -o {cluster.logoutputdir} -j oe"
We then simply submitted this job to the cluster:
qsub torque.sh
Using any of the presented methods, this will run the workflow on the tutorial dataset and create the desired output files.
Results¶
The last step will be to aggregate all the results once the workflow has finished running.
In order to do this, we provide a script called makereport.sh in the scripts folder of HRIBO.
bash HRIBO/scripts/makereport.sh <reportname_prefix>
Running this command will create a folder in which all results from the workflow's final output are collected; it will additionally create a compressed file in .zip format.
Note
A detailed explanation of the result files can be found in the result section.
Note
The final result of this example workflow can be found here.
Warning
As many browsers have stopped supporting the viewing of ftp files, you might have to use an ftp viewer instead.
Runtime¶
The total runtime of the example workflow, using 12 cores of an AMD EPYC processor (1996 MHz) and 64 GB RAM, was 4h04m37s.
The runtime includes the automatic download and installation of all dependencies by conda and singularity. This step mainly depends on the available network bandwidth; in our case it took about 7 minutes.
Extended workflow¶
Warning
This tutorial shows a full run of the workflow with all options activated. For testing, we ran this example locally on a large cloud instance. The data is likely too large for running locally on an average laptop.
We show a run of the full workflow, including deepribo predictions and differential expression analysis, on data available from NCBI.
For this purpose, we use a Salmonella enterica dataset available under the accession number PRJNA421559 [PGAR19].
Warning
Ensure that you have miniconda3 and singularity installed and a snakemake environment set up. Please refer to the overview for details on the installation.
Setup¶
First of all, we start by creating the project directory and changing to it. (you can choose any directory name)
mkdir project
cd project
We then download the latest version of HRIBO into the newly created project folder and unpack it.
wget https://github.com/RickGelhausen/HRIBO/archive/1.7.0.tar.gz
tar -xzf 1.7.0.tar.gz; mv HRIBO-1.7.0 HRIBO; rm 1.7.0.tar.gz;
Retrieve and prepare input files¶
Before starting the workflow, we have to acquire and prepare several input files. These files are the annotation file, the genome file, the fastq files, the configuration file and the sample sheet.
Annotation and genome files¶
First, we want to retrieve the annotation file and the genome file. In this case, we can find both on NCBI using the accession number NC_016856.1.

Note
Ensure that you download the annotation for the correct strain, str. 14028S.
On this page, we can directly retrieve both files by clicking on the according download links next to the file descriptions. Alternatively, you can directly download them using the following commands:
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/022/165/GCF_000022165.1_ASM2216v1/GCF_000022165.1_ASM2216v1_genomic.gff.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/022/165/GCF_000022165.1_ASM2216v1/GCF_000022165.1_ASM2216v1_genomic.fna.gz
Then, we unpack and rename both files.
gunzip GCF_000022165.1_ASM2216v1_genomic.gff.gz && mv GCF_000022165.1_ASM2216v1_genomic.gff annotation.gff
gunzip GCF_000022165.1_ASM2216v1_genomic.fna.gz && mv GCF_000022165.1_ASM2216v1_genomic.fna genome.fa
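Before proceeding, a quick sanity check of both files can save trouble later; the following standard shell commands (not part of HRIBO) inspect the downloads:
grep -c ">" genome.fa     # number of sequence records in the FASTA file
head -n 4 annotation.gff  # should show the ##gff-version header and the first feature lines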
.fastq files¶
Next, we want to acquire the fastq files. They are available on NCBI under the accession number PRJNA421559.
The files have to be downloaded using the Sequence Read Archive (SRA).
There are multiple ways of downloading files from SRA as explained here.
As we already have conda installed, the easiest way is to install the sra-tools:
conda create -n sra-tools -c bioconda -c conda-forge sra-tools pigz
This will create a conda environment containing the sra-tools and pigz. Using these, we can simply pass the SRA identifiers and download the data:
conda activate sra-tools;
fasterq-dump SRR6359966; pigz -p 2 SRR6359966.fastq; mv SRR6359966.fastq.gz RIBO-WT-1.fastq.gz
fasterq-dump SRR6359967; pigz -p 2 SRR6359967.fastq; mv SRR6359967.fastq.gz RIBO-WT-2.fastq.gz
fasterq-dump SRR6359974; pigz -p 2 SRR6359974.fastq; mv SRR6359974.fastq.gz RNA-WT-1.fastq.gz
fasterq-dump SRR6359975; pigz -p 2 SRR6359975.fastq; mv SRR6359975.fastq.gz RNA-WT-2.fastq.gz
fasterq-dump SRR6359970; pigz -p 2 SRR6359970.fastq; mv SRR6359970.fastq.gz RIBO-csrA-1.fastq.gz
fasterq-dump SRR6359971; pigz -p 2 SRR6359971.fastq; mv SRR6359971.fastq.gz RIBO-csrA-2.fastq.gz
fasterq-dump SRR6359978; pigz -p 2 SRR6359978.fastq; mv SRR6359978.fastq.gz RNA-csrA-1.fastq.gz
fasterq-dump SRR6359979; pigz -p 2 SRR6359979.fastq; mv SRR6359979.fastq.gz RNA-csrA-2.fastq.gz
conda deactivate;
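As an alternative to the eight explicit calls above, the same downloads can be scripted. The following sketch (assuming bash 4+ for associative arrays) uses the same SRA-to-sample mapping:
# map SRA run accessions to the HRIBO sample names used above
declare -A samples=(
    [SRR6359966]=RIBO-WT-1   [SRR6359967]=RIBO-WT-2
    [SRR6359974]=RNA-WT-1    [SRR6359975]=RNA-WT-2
    [SRR6359970]=RIBO-csrA-1 [SRR6359971]=RIBO-csrA-2
    [SRR6359978]=RNA-csrA-1  [SRR6359979]=RNA-csrA-2
)
conda activate sra-tools
for run in "${!samples[@]}"; do
    fasterq-dump "$run"                             # fetch the run from SRA
    pigz -p 2 "$run.fastq"                          # compress with 2 threads
    mv "$run.fastq.gz" "${samples[$run]}.fastq.gz"  # rename to the HRIBO naming scheme
done
conda deactivate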
Note
We use two conditions and two replicates per condition. Four replicates are available for each condition; we run with two because this is just an example. When running your own analysis, always try to use as many replicates as possible.
Warning
If you have a bad internet connection, this step might take some time. It is advised to run this workflow on a cluster or cloud instance.
This will download compressed files for each of the required .fastq files. We move them into a folder called fastq:
mkdir fastq;
mv *.fastq.gz fastq;
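At this point, the fastq folder should contain the eight files that the sample sheet below refers to:
ls fastq
# RIBO-WT-1.fastq.gz   RIBO-WT-2.fastq.gz   RIBO-csrA-1.fastq.gz  RIBO-csrA-2.fastq.gz
# RNA-WT-1.fastq.gz    RNA-WT-2.fastq.gz    RNA-csrA-1.fastq.gz   RNA-csrA-2.fastq.gz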
Sample sheet and configuration file¶
Finally, we will prepare the configuration file (config.yaml) and the sample sheet (samples.tsv) by copying templates from the HRIBO/templates/ folder into the HRIBO/ folder. We start with the sample sheet:
cp HRIBO/templates/samples.tsv HRIBO/
The sample file looks as follows:
| method | condition | replicate | fastqFile |
|--------|-----------|-----------|-----------|
| RIBO | A | 1 | fastq/RIBO-A-1.fastq.gz |
| RIBO | A | 2 | fastq/RIBO-A-2.fastq.gz |
| RIBO | B | 1 | fastq/RIBO-B-1.fastq.gz |
| RIBO | B | 2 | fastq/RIBO-B-2.fastq.gz |
| RNA | A | 1 | fastq/RNA-A-1.fastq.gz |
| RNA | A | 2 | fastq/RNA-A-2.fastq.gz |
| RNA | B | 1 | fastq/RNA-B-1.fastq.gz |
| RNA | B | 2 | fastq/RNA-B-2.fastq.gz |
Note
When using your own data, use any editor (vi(m), gedit, nano, atom, …) to customize the sample sheet.
Warning
Please ensure not to replace any tab characters with spaces when changing this file.
We now rewrite this file to match the previously downloaded .fastq.gz files:
| method | condition | replicate | fastqFile |
|--------|-----------|-----------|-----------|
| RIBO | WT | 1 | fastq/RIBO-WT-1.fastq.gz |
| RIBO | WT | 2 | fastq/RIBO-WT-2.fastq.gz |
| RIBO | csrA | 1 | fastq/RIBO-csrA-1.fastq.gz |
| RIBO | csrA | 2 | fastq/RIBO-csrA-2.fastq.gz |
| RNA | WT | 1 | fastq/RNA-WT-1.fastq.gz |
| RNA | WT | 2 | fastq/RNA-WT-2.fastq.gz |
| RNA | csrA | 1 | fastq/RNA-csrA-1.fastq.gz |
| RNA | csrA | 2 | fastq/RNA-csrA-2.fastq.gz |
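Since the warning above concerns tab characters silently turning into spaces, one way to sidestep the problem is to generate the sample sheet from the shell instead of editing it by hand. This sketch writes exactly the table above with guaranteed literal tabs:
# write samples.tsv with printf so the column separators are real tabs (\t)
{
    printf "method\tcondition\treplicate\tfastqFile\n"
    for m in RIBO RNA; do
        for c in WT csrA; do
            for r in 1 2; do
                printf "%s\t%s\t%s\tfastq/%s-%s-%s.fastq.gz\n" "$m" "$c" "$r" "$m" "$c" "$r"
            done
        done
    done
} > HRIBO/samples.tsv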
Next, we are going to set up the config.yaml:
cp HRIBO/templates/config.yaml HRIBO/
The config file can be used to easily change the parameters of HRIBO.
Note
For a detailed overview of the available options, please refer to our workflow-configuration.
In our small example, we only adjust the adapter sequence, which leads to the following changes in the config.yaml file:
biologySettings:
# Adapter sequence used
adapter: "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
Running the workflow¶
Now that all the required files are prepared, we can start running the workflow, either locally or in a cluster environment.
Warning
If you have problems running deepribo, please refer to Activating DeepRibo.
Warning
Before you start using snakemake, remember to activate the environment first:
conda activate snakemake
Run the workflow locally¶
Use the following steps when you plan to execute the workflow on a single server, cloud instance or workstation.
Warning
Please be aware that some steps of the workflow require a lot of memory or time, depending on the size of your input data. To get a better idea of the memory consumption, you can have a look at the provided sge.yaml or torque.yaml files.
Navigate to the project folder containing your annotation and genome files, as well as the HRIBO folder. Start the workflow locally from this folder by running:
snakemake --use-conda --use-singularity --singularity-args " -c " --greediness 0 -s HRIBO/Snakefile --configfile HRIBO/config.yaml --directory ${PWD} -j 10 --latency-wait 60
This will start the workflow locally. The options used are:
- --use-conda: instructs snakemake to download tool dependencies from conda.
- -s: specifies the Snakefile to be used.
- --configfile: specifies the config file to be used.
- --directory: specifies your current path.
- -j: specifies the maximum number of cores snakemake is allowed to use.
- --latency-wait: specifies how long (in seconds) snakemake will wait for filesystem latencies before declaring a file to be missing.
Run Snakemake in a cluster environment¶
Use the following steps if you are executing the workflow via a queuing system. Edit the configuration file <cluster>.yaml according to your queuing system setup and cluster hardware.
Navigate to the project folder on your cluster system and start the workflow from there (the following call shows the usage with Grid Engine):
snakemake --use-conda --use-singularity --singularity-args " -c " -s HRIBO/Snakefile --configfile HRIBO/config.yaml --directory ${PWD} -j 20 --cluster-config HRIBO/sge.yaml
Note
Ensure that you use an appropriate <cluster>.yaml for your cluster system. We provide one for SGE and TORQUE based systems.
Example: Run Snakemake in a cluster environment¶
Warning
Be advised that this is a specific example; the required options may change depending on your system.
We ran the tutorial workflow in a TORQUE cluster environment. To submit it, we created a bash script torque.sh in our project folder:
vi torque.sh
Note
Please note that all arguments enclosed in <> have to be customized. This script will only work if your cluster uses the TORQUE queuing system.
We proceeded by writing the queuing script:
#!/bin/bash
#PBS -N <ProjectName>
#PBS -S /bin/bash
#PBS -q "long"
#PBS -d <PATH/ProjectFolder>
#PBS -l nodes=1:ppn=1
#PBS -o <PATH/ProjectFolder>
#PBS -j oe
cd <PATH/ProjectFolder>
source activate HRIBO
snakemake --latency-wait 600 --use-conda --use-singularity --singularity-args " -c " -s HRIBO/Snakefile --configfile HRIBO/config.yaml --directory ${PWD} -j 20 --cluster-config HRIBO/torque.yaml --cluster "qsub -N {cluster.jobname} -S /bin/bash -q {cluster.qname} -d <PATH/ProjectFolder> -l {cluster.resources} -o {cluster.logoutputdir} -j oe"
We then simply submitted this job to the cluster:
qsub torque.sh
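The progress of the submitted job can then be followed with the usual TORQUE tools, for example:
qstat -u $USER    # list your queued and running jobs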
Using any of the presented methods, this will run the workflow on the tutorial dataset and create the desired output files.
Results¶
The last step is to aggregate all results once the workflow has finished running. For this, we provide a script called makereport.sh in the scripts folder of HRIBO:
bash HRIBO/scripts/makereport.sh <reportname>
Running this will create a folder collecting all results from the workflow's final output; it will additionally create a compressed file in .zip format. The <reportname> will be extended by report_HRIBOX.X.X_dd-mm-yy.
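For example, a call on this tutorial data might look as follows; the report name salmonella_tutorial is just an illustrative choice:
bash HRIBO/scripts/makereport.sh salmonella_tutorial
# creates a results folder and a .zip archive, named by extending the given report name as described above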
Note
A detailed explanation of the result files can be found in the result section.
Note
The final result of this example workflow can be found here.
Warning
As many browsers have stopped supporting the viewing of FTP files, you might have to use an FTP client instead.
Runtime¶
The total runtime of the extended workflow, using 12 cores of an AMD EPYC processor (with IBPB, 1996 MHz) and 64 GB RAM, was 5h51m14s. This includes the automatic download and installation of all dependencies by conda and singularity, a step that depends mainly on the available network bandwidth and took about 12 minutes in this case. The runtime difference compared to the example workflow is explained by the additional libraries and tools used.
References¶
- GruningDSjodin+17
Björn Grüning, Ryan Dale, Andreas Sjödin, Jillian Rowe, Brad A. Chapman, Christopher H. Tomkins-Tinch, Renan Valieris, and Johannes Köster. Bioconda: a sustainable and comprehensive software distribution for the life sciences. bioRxiv, 2017. URL: https://www.biorxiv.org/content/early/2017/10/27/207092, doi:10.1101/207092.
- KosterR18
Johannes Köster and Sven Rahmann. Snakemake—a scalable bioinformatics workflow engine. Bioinformatics, bty350, 2018. URL: http://dx.doi.org/10.1093/bioinformatics/bty350, doi:10.1093/bioinformatics/bty350.
- PGAR19
Anastasia H. Potts, Yinping Guo, Brian M. M. Ahmer, and Tony Romeo. Role of CsrA in stress responses and metabolism important for Salmonella virulence revealed by integrated transcriptomics. PLOS ONE, 14(1):1–30, 2019. URL: https://doi.org/10.1371/journal.pone.0211430, doi:10.1371/journal.pone.0211430.
- ZAA+18
Daniel R Zerbino, Premanand Achuthan, Wasiu Akanni, M Ridwan Amode, Daniel Barrell, Jyothish Bhai, Konstantinos Billis, Carla Cummins, Astrid Gall, Carlos García Girón, Laurent Gil, Leo Gordon, Leanne Haggerty, Erin Haskell, Thibaut Hourlier, Osagie G Izuogu, Sophie H Janacek, Thomas Juettemann, Jimmy Kiang To, Matthew R Laird, Ilias Lavidas, Zhicheng Liu, Jane E Loveland, Thomas Maurel, William McLaren, Benjamin Moore, Jonathan Mudge, Daniel N Murphy, Victoria Newman, Michael Nuhn, Denye Ogeh, Chuang Kee Ong, Anne Parker, Mateus Patricio, Harpreet Singh Riat, Helen Schuilenburg, Dan Sheppard, Helen Sparrow, Kieron Taylor, Anja Thormann, Alessandro Vullo, Brandon Walts, Amonida Zadissa, Adam Frankish, Sarah E Hunt, Myrto Kostadima, Nicholas Langridge, Fergal J Martin, Matthieu Muffato, Emily Perry, Magali Ruffier, Dan M Staines, Stephen J Trevanion, Bronwen L Aken, Fiona Cunningham, Andrew Yates, and Paul Flicek. Ensembl 2018. Nucleic Acids Research, 46(D1):D754–D761, 2018. URL: http://dx.doi.org/10.1093/nar/gkx1098, doi:10.1093/nar/gkx1098.