HRIBO

Introduction

HRIBO is a workflow for the analysis of prokaryotic Ribo-Seq data. It is available on GitHub. Among other things, it includes prediction of novel open reading frames (ORFs), metagene profiling, quality control and differential expression analysis. The workflow is based on the workflow management system snakemake and handles the installation of all dependencies via bioconda [GruningDSjodin+17] and docker, as well as all processing steps. HRIBO is open source and available under the GNU General Public License 3. Installation and basic usage are described below.

Note

For a detailed step by step tutorial on how to use this workflow on a sample dataset, please refer to our example-workflow.

Requirements

In the following, we describe all the required files and tools needed to run our workflow.

Warning

HRIBO was tested on a Linux system. We cannot guarantee that the workflow will run on other systems.

Tools

miniconda3

This workflow is based on the workflow management system snakemake [KosterR18], which downloads all necessary dependencies via conda.

We strongly recommend installing miniconda3 with python3.7.

After downloading the miniconda3 version suiting your Linux system, execute the downloaded bash file and follow the instructions given.

HRIBO

Using the workflow requires HRIBO. The latest version is available on our GitHub page.

In order to run the workflow, we suggest that you download HRIBO into your project directory. The following commands create an example directory and change into it:

mkdir project
cd project

Now, download and unpack the latest version of HRIBO by entering the following commands:

wget https://github.com/RickGelhausen/HRIBO/archive/1.7.0.tar.gz
tar -xzf 1.7.0.tar.gz; mv HRIBO-1.7.0 HRIBO; rm 1.7.0.tar.gz;

HRIBO is now in a subdirectory of your project directory.

snakemake

Note

HRIBO was tested using snakemake (version>=7.24.2).

In order to support docker containers, snakemake requires singularity. HRIBO therefore requires both snakemake and singularity to be installed on your system.

We suggest installing them using conda. To this end, we provide an environment.yaml file in the HRIBO directory.

conda env create --file HRIBO/environment.yaml

This creates a new conda environment called hribo_env and installs snakemake and singularity into it. The environment can be activated using:

conda activate hribo_env

and deactivated using:

conda deactivate
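Before starting the workflow, it can help to confirm that both tools are visible in the active environment and that snakemake meets the tested version. The following is a minimal sketch, not part of HRIBO: the helper names missing_tools and version_ge are hypothetical, and the version comparison assumes purely numeric, dot-separated version strings.

```shell
# Print the name of every tool from the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Return 0 if dotted version $1 is greater than or equal to $2.
# Fields are compared numerically from left to right; missing fields count as 0.
version_ge() {
    a=$1
    b=$2
    while [ -n "$a" ] || [ -n "$b" ]; do
        x=${a%%.*}
        y=${b%%.*}
        # Drop the field just read; clear the string when it was the last one.
        if [ "$a" = "$x" ]; then a=""; else a=${a#*.}; fi
        if [ "$b" = "$y" ]; then b=""; else b=${b#*.}; fi
        x=${x:-0}
        y=${y:-0}
        if [ "$x" -gt "$y" ]; then return 0; fi
        if [ "$x" -lt "$y" ]; then return 1; fi
    done
    return 0
}

missing=$(missing_tools snakemake singularity)
if [ -n "$missing" ]; then
    # Word splitting is intended here so the names appear on one line.
    echo "Not found (is hribo_env activated?):" $missing
elif version_ge "$(snakemake --version)" 7.24.2; then
    echo "snakemake and singularity found, snakemake version is sufficient"
else
    echo "snakemake is older than the tested version 7.24.2"
fi
```

If the check reports missing tools, activating hribo_env and running it again is usually the first thing to try.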

Input files

Several input files are required in order to run the workflow: a genome file (.fa), an annotation file (.gff/.gtf) and compressed sequencing files (.fastq.gz).

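=========================================  =================================================================
File name                                  Description
=========================================  =================================================================
annotation.gff                             user-provided annotation file with genomic features
genome.fa                                  user-provided genome file containing the genome sequence
<method>-<condition>-<replicate>.fastq.gz  user-provided compressed sequencing files
config.yaml                                configuration file to customize the workflow
samples.tsv                                sample file describing the relation between the input fastq files
=========================================  =================================================================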

annotation.gff and genome.fa

We recommend retrieving both the genome and the annotation files for your organism from the National Center for Biotechnology Information (NCBI) or Ensembl Genomes [ZAA+18].

Warning

If you use custom annotation files, ensure that you adhere to the gtf/gff standard. Wrongly formatted files are likely to cause problems with downstream tools.
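A quick structural check can catch the most common formatting problems before the workflow starts. The sketch below is not part of HRIBO and is no replacement for a full validator: the function name check_gff_columns is hypothetical, and it only verifies that every feature line has the nine tab-separated columns required by the gff/gtf formats and that start and end are integers with start not larger than end.

```shell
# Report feature lines that do not have 9 tab-separated columns or whose
# start/end coordinates (columns 4 and 5) are not ordered integers.
# Comment lines, blank lines and an embedded ##FASTA section are skipped.
check_gff_columns() {
    awk -F'\t' '
        /^##FASTA/ { fasta = 1 }
        fasta || /^#/ || /^[ \t]*$/ { next }
        NF != 9 {
            printf "line %d: expected 9 columns, found %d\n", NR, NF
            bad = 1
            next
        }
        $4 !~ /^[0-9]+$/ || $5 !~ /^[0-9]+$/ || ($4 + 0) > ($5 + 0) {
            printf "line %d: invalid start/end (%s, %s)\n", NR, $4, $5
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Only run the check when an annotation file is present in the current folder.
if [ -f annotation.gff ]; then
    if check_gff_columns annotation.gff; then
        echo "annotation.gff passed the basic column check"
    else
        echo "annotation.gff needs fixing before running the workflow"
    fi
fi
```

The function exits with a non-zero status when at least one line is reported, so it can also be used in scripts.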

Note

For detailed information about downloading and unpacking these files, please refer to our example-workflow.

input .fastq files

These are the input files provided by the user. Both single-end and paired-end data are supported.

Note

As most downstream tools do not support paired-end data, we merge the paired-end reads into single-end reads using flash2. For more information about how to use paired-end data, please refer to the workflow-configuration.

Note

Please ensure that you compress your files in .gz format.

Please ensure that you move all input .fastq.gz files into a folder called fastq (located in your project folder):

mkdir fastq
cp *.fastq.gz fastq/
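To confirm that the copied files really are gzip-compressed, and not just renamed, the archives can be tested with gzip -t. This is a minimal sketch assuming the folder layout above; the function name check_fastq_dir is hypothetical.

```shell
# Test every *.fastq.gz file in a folder (default: fastq) with "gzip -t".
# Returns non-zero if at least one file is not a valid gzip archive.
check_fastq_dir() {
    dir=${1:-fastq}
    status=0
    for f in "$dir"/*.fastq.gz; do
        # Skip the unexpanded pattern when the folder holds no matching files.
        [ -e "$f" ] || continue
        if gzip -t "$f" 2>/dev/null; then
            echo "ok: $f"
        else
            echo "not a valid gzip file: $f"
            status=1
        fi
    done
    return "$status"
}

check_fastq_dir fastq || echo "Please re-compress the files listed above."
```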

Sample sheet and configuration file

In order to run HRIBO, you have to provide a sample sheet and a configuration file. Templates for both files are available in the templates subfolder of the HRIBO folder. The configuration file lets you customize certain settings, such as the adapter sequence. The sample sheet specifies how the input .fastq files relate to each other (condition, replicate, etc.).

Copy the templates of the sample sheet and the configuration file into the HRIBO folder:

cp HRIBO/templates/samples.tsv HRIBO/
cp HRIBO/templates/config.yaml HRIBO/

Customize the config.yaml using your preferred editor.

Note

For a detailed overview of the available options, please refer to our workflow-configuration.

Edit the sample sheet corresponding to your project. It contains the following variables:

  • method: indicates the method used for the sample, e.g. RIBO for ribosome profiling and RNA for RNA-seq.

  • condition: indicates the applied condition (e.g. A, B, …).

  • replicate: ID used to distinguish between the different replicates (e.g. 1, 2, …).

  • fastqFile: indicates the corresponding fastq file for a given sample.

Note

If you have paired end data, please ensure that you use the samples_pairedend.tsv file.

As seen in the samples.tsv template:

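======  =========  =========  =======================
method  condition  replicate  fastqFile
======  =========  =========  =======================
RIBO    A          1          fastq/RIBO-A-1.fastq.gz
RIBO    A          2          fastq/RIBO-A-2.fastq.gz
RIBO    B          1          fastq/RIBO-B-1.fastq.gz
RIBO    B          2          fastq/RIBO-B-2.fastq.gz
RNA     A          1          fastq/RNA-A-1.fastq.gz
RNA     A          2          fastq/RNA-A-2.fastq.gz
RNA     B          1          fastq/RNA-B-1.fastq.gz
RNA     B          2          fastq/RNA-B-2.fastq.gz
======  =========  =========  =======================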

Note

This is just an example; please refer to our example-workflow for another one.
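When the input files follow the <method>-<condition>-<replicate>.fastq.gz naming scheme, the rows of the sample sheet can be derived from the file names instead of being typed by hand. The sketch below is not part of HRIBO; the function name make_sample_sheet is hypothetical, and it assumes a tab-separated file with the header shown in the template and a method that contains no hyphen.

```shell
# Print a tab-separated sample sheet for all *.fastq.gz files in a folder
# (default: fastq). File names that do not contain at least two hyphens are
# skipped with a message on stderr.
make_sample_sheet() {
    dir=${1:-fastq}
    printf 'method\tcondition\treplicate\tfastqFile\n'
    for path in "$dir"/*.fastq.gz; do
        [ -e "$path" ] || continue
        name=${path##*/}            # strip the folder
        name=${name%.fastq.gz}      # strip the extension
        case "$name" in
            *-*-*) ;;
            *) echo "skipping unexpected file name: $path" >&2; continue ;;
        esac
        method=${name%%-*}          # text before the first hyphen
        rest=${name#*-}
        condition=${rest%%-*}       # text between first and second hyphen
        replicate=${rest#*-}        # everything after the second hyphen
        printf '%s\t%s\t%s\t%s\n' "$method" "$condition" "$replicate" "$path"
    done
}

# Print the sheet for the fastq folder of the current project directory.
make_sample_sheet fastq
```

The output can be redirected to a file, for instance HRIBO/samples.tsv, and should be reviewed before use, since conditions that themselves contain hyphens are split at the first one.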

cluster.yaml

Warning

As we are currently unable to test HRIBO on cluster systems, support for cluster systems is experimental. Because HRIBO is based on snakemake, cluster execution is still possible, and we added an example SLURM profile. Please check the snakemake documentation for more details.

Output files

The following tables list all important output files of the workflow.

Note

Files created as intermediate steps of the workflow (e.g. .bam files) are omitted from this list.

Note

For more details about the output files, please refer to the analysis results.

Single-file Output

===================================  ====================================================================================================================================
File name                            Description
===================================  ====================================================================================================================================
samples.xlsx                         Excel version of the input samples file.
manual.pdf                           A PDF file describing the analysis.
annotation_total.xlsx                Excel file containing detailed measures for every feature in the input annotation using read counts containing multi-mapping reads.
annotation_unique.xlsx               Excel file containing detailed measures for every feature in the input annotation using read counts containing no multi-mapping reads.
total_read_counts.xlsx               Excel file containing read counts with multi-mapping reads.
unique_read_counts.xlsx              Excel file containing read counts without multi-mapping reads.
multiqc_report.html                  Quality control report combining all findings of the individual fastQC reports into a well-structured overview file.
heatmap_SpearmanCorr_readCounts.pdf  PDF file showing the Spearman correlation between all samples.
predictions_reparation.xlsx          Excel file containing detailed measures for every ORF detected by reparation.
predictions_reparation.gff           GFF file containing ORFs detected by reparation, for genome browser visualization.
potentialStartCodons.gff             GFF file for genome browser visualization containing all potential start codons in the input genome.
potentialStopCodons.gff              GFF file for genome browser visualization containing all potential stop codons in the input genome.
potentialRibosomeBindingSite.gff     GFF file for genome browser visualization containing all potential ribosome binding sites in the input genome.
potentialAlternativeStartCodons.gff  GFF file for genome browser visualization containing all potential alternative start codons in the input genome.
===================================  ====================================================================================================================================

Multi-file Output

=========================================  ===========================================================================================================
File name                                  Description
=========================================  ===========================================================================================================
riborex/<contrast>_sorted.csv              Differential expression results by Riborex, sorted by pvalue.
riborex/<contrast>_significant.csv         Differential expression results by Riborex, only significant results (pvalue < 0.05).
xtail/<contrast>_sorted.csv                Differential expression results by xtail, sorted by pvalue.
xtail/<contrast>_significant.csv           Differential expression results by xtail, only significant results (pvalue < 0.05).
xtail/r_<contrast>.pdf                     Differential expression results by xtail, plot with RPF-to-mRNA ratios.
xtail/fc_<contrast>.pdf                    Differential expression results by xtail, plot with log2 fold change of both mRNA and RPF.
<method>-<condition>-<replicate>.X.Y.Z.bw  BigWig file for genome browser visualization, containing a single nucleotide mapping around certain regions.
<accession>_Z.Y_profiling.xlsx/tsv         Excel and tsv files containing raw data of the metagene analysis.
<accession>_Z.Y_profiling.pdf              Visualization of the metagene analysis.
=========================================  ===========================================================================================================

Note

<contrast> represents a pair of conditions that are being compared.
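The significance filter described in the table (pvalue < 0.05) can be reproduced on a sorted result file, for example to try a different cutoff. The sketch below is not part of HRIBO and rests on assumptions: the function name filter_significant is hypothetical, the file is assumed to be comma-separated with a header column named pvalue, and fields are assumed not to contain embedded commas.

```shell
# Print the header and all rows whose "pvalue" column is below a cutoff.
# $1: CSV file with a header row, $2: cutoff (default 0.05).
# Rows with a pvalue of NA are dropped.
filter_significant() {
    awk -F',' -v cutoff="${2:-0.05}" '
        NR == 1 {
            # Locate the pvalue column by name, ignoring surrounding quotes.
            for (i = 1; i <= NF; i++) {
                name = $i
                gsub(/"/, "", name)
                if (name == "pvalue") col = i
            }
            if (!col) {
                print "no pvalue column found" > "/dev/stderr"
                exit 1
            }
            print
            next
        }
        $col != "NA" && ($col + 0) < (cutoff + 0) { print }
    ' "$1"
}

# Small made-up table, used only to demonstrate the call.
demo=$(mktemp)
printf 'gene,log2FoldChange,pvalue\n' > "$demo"
printf 'geneA,1.8,0.001\ngeneB,-0.2,0.6\ngeneC,2.4,NA\ngeneD,-1.1,0.049\n' >> "$demo"
filter_significant "$demo"
rm -f "$demo"
```

For the made-up table, only the header and the rows for geneA and geneD remain. The real result files may use different column names, so the header should be checked first.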

Note

The BigWig files are available for different normalization methods, strands and regions, X=(min/mil) Y=(forward/reverse) Z=(fiveprime, threeprime, global, centered).
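To see which BigWig files to expect for one sample, the combinations of X, Y and Z can be enumerated. This is a small sketch based only on the naming pattern given above; the function name list_bigwig_names is hypothetical, and the output folder is not included because it is not specified here.

```shell
# Print all BigWig file names for one sample, following the pattern
# <method>-<condition>-<replicate>.X.Y.Z.bw with
# X = min/mil, Y = forward/reverse, Z = fiveprime/threeprime/global/centered.
list_bigwig_names() {
    sample=$1
    for norm in min mil; do
        for strand in forward reverse; do
            for region in fiveprime threeprime global centered; do
                printf '%s.%s.%s.%s.bw\n' "$sample" "$norm" "$strand" "$region"
            done
        done
    done
}

list_bigwig_names RIBO-A-1
```

Each sample yields 2 x 2 x 4 = 16 file names under this pattern.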

Tool Parameters

The tools used in our workflow are listed below, with their version and a short description.

================  =======  ====================================================
Tool              Version  Description
================  =======  ====================================================
cutadapt          4.1      Adapter removal and quality trimming
fastQC            0.11.9   Quality control
multiQC           1.13     Quality control report
segemehl          0.3.4    Mapping of reads
flash2            2.2.00   Merging paired-end samples into single-end
cufflinks         2.2.1    Used to convert gff to gtf
bedtools          2.30.0   Collection of useful processing tools (e.g. read counting)
reparation_blast  1.0.9    Prediction of novel open reading frames
deepribo          1.1      Prediction of novel open reading frames
riborex           2.4.0    Differential expression analysis
xtail             1.1.5    Differential expression analysis
================  =======  ====================================================

Report

In order to aggregate the final results into a single folder structure and receive a date-tagged .zip file, you can use the makereport.sh script. The current date will be used as a tag for the report folder.

bash HRIBO/scripts/makereport.sh <reportname>

Example-workflow

A detailed step by step tutorial is available at: example-workflow.

References

GruningDSjodin+17

Björn Grüning, Ryan Dale, Andreas Sjödin, Jillian Rowe, Brad A. Chapman, Christopher H. Tomkins-Tinch, Renan Valieris, and Johannes Köster. Bioconda: a sustainable and comprehensive software distribution for the life sciences. bioRxiv, 2017. URL: https://www.biorxiv.org/content/early/2017/10/27/207092, doi:10.1101/207092.

KosterR18

Johannes Köster and Sven Rahmann. Snakemake—a scalable bioinformatics workflow engine. Bioinformatics, bty350, 2018. URL: http://dx.doi.org/10.1093/bioinformatics/bty350, doi:10.1093/bioinformatics/bty350.

PGAR19

Anastasia H. Potts, Yinping Guo, Brian M. M. Ahmer, and Tony Romeo. Role of CsrA in stress responses and metabolism important for Salmonella virulence revealed by integrated transcriptomics. PLOS ONE, 14(1):1–30, 2019. URL: https://doi.org/10.1371/journal.pone.0211430, doi:10.1371/journal.pone.0211430.

ZAA+18

Daniel R Zerbino, Premanand Achuthan, Wasiu Akanni, M Ridwan Amode, Daniel Barrell, Jyothish Bhai, Konstantinos Billis, Carla Cummins, Astrid Gall, Carlos García Girón, Laurent Gil, Leo Gordon, Leanne Haggerty, Erin Haskell, Thibaut Hourlier, Osagie G Izuogu, Sophie H Janacek, Thomas Juettemann, Jimmy Kiang To, Matthew R Laird, Ilias Lavidas, Zhicheng Liu, Jane E Loveland, Thomas Maurel, William McLaren, Benjamin Moore, Jonathan Mudge, Daniel N Murphy, Victoria Newman, Michael Nuhn, Denye Ogeh, Chuang Kee Ong, Anne Parker, Mateus Patricio, Harpreet Singh Riat, Helen Schuilenburg, Dan Sheppard, Helen Sparrow, Kieron Taylor, Anja Thormann, Alessandro Vullo, Brandon Walts, Amonida Zadissa, Adam Frankish, Sarah E Hunt, Myrto Kostadima, Nicholas Langridge, Fergal J Martin, Matthieu Muffato, Emily Perry, Magali Ruffier, Dan M Staines, Stephen J Trevanion, Bronwen L Aken, Fiona Cunningham, Andrew Yates, and Paul Flicek. Ensembl 2018. Nucleic Acids Research, 46(D1):D754–D761, 2018. URL: http://dx.doi.org/10.1093/nar/gkx1098, doi:10.1093/nar/gkx1098.