HRIBO

Introduction

HRIBO is a workflow for the analysis of prokaryotic Ribo-Seq data, available on GitHub. Among other analyses, it includes the prediction of novel open reading frames (ORFs), metagene profiling, quality control and differential expression analysis. The workflow is based on the workflow management system snakemake and handles the installation of all dependencies via bioconda [GruningDSjodin+17] and docker, as well as all processing steps. HRIBO is open source and available under the GNU General Public License 3. Installation and basic usage are described below.

Note

For a detailed step by step tutorial on how to use this workflow on a sample dataset, please refer to our example-workflow.

Requirements

In the following, we describe all the required files and tools needed to run our workflow.

Warning

HRIBO was tested on a linux system. We cannot guarantee that the workflow will run on other systems.

Tools

miniconda3

This workflow is based on the workflow management system snakemake [KosterR18]; snakemake will download all necessary dependencies via conda.

We strongly recommend installing miniconda3 with python3.7.

After downloading the miniconda3 version suited to your Linux system, execute the downloaded bash file and follow the instructions given.
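For example, on a 64-bit Linux system, the download and installation could look as follows (installer URL taken from the official Miniconda download page; adjust it to your system):

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh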

HRIBO

Using the workflow requires HRIBO. The latest version is available on our GitHub page.

In order to run the workflow, we suggest downloading HRIBO into your project directory. The following commands create an example directory and change into it:

mkdir project
cd project

Now, download and unpack the latest version of HRIBO by entering the following commands:

wget https://github.com/RickGelhausen/HRIBO/archive/1.7.0.tar.gz
tar -xzf 1.7.0.tar.gz; mv HRIBO-1.7.0 HRIBO; rm 1.7.0.tar.gz;

HRIBO is now in a subdirectory of your project directory.

snakemake

Note

HRIBO was tested using snakemake (version >= 7.24.2).

In order to run docker containers, snakemake requires singularity. HRIBO therefore requires both snakemake and singularity to be installed on your system.

We suggest installing them using conda. To this end, we provide an environment.yaml file in the HRIBO directory.

conda env create -f HRIBO/environment.yaml

This creates a new conda environment called hribo_env and installs snakemake and singularity into it. The environment can be activated using:

conda activate hribo_env

and deactivated using:

conda deactivate

Input files

Several input files are required in order to run the workflow: a genome file (.fa), an annotation file (.gff/.gtf) and compressed sequencing files (.fastq.gz).

File name | Description
--------- | -----------
annotation.gff | user-provided annotation file with genomic features
genome.fa | user-provided genome file containing the genome sequence
<method>-<condition>-<replicate>.fastq.gz | user-provided compressed sequencing files
config.yaml | configuration file to customize the workflow
samples.tsv | sample file describing the relation between the input fastq files

annotation.gff and genome.fa

We recommend retrieving both the genome and the annotation file for your organism from the National Center for Biotechnology Information (NCBI) or Ensembl Genomes [ZAA+18].

Warning

If you use custom annotation files, ensure that they adhere to the gtf/gff standard. Wrongly formatted files are likely to cause problems with downstream tools.

Note

For detailed information about downloading and unpacking these files, please refer to our example-workflow.

input .fastq files

These are the input files provided by the user. Both single-end and paired-end data are supported.

Note

As most downstream tools do not support paired-end data, we merge the paired-end reads into single-end reads using flash2. For more information about how to use paired-end data, please refer to the workflow-configuration.

Note

Please ensure that you compress your files in .gz format.
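If your files are still uncompressed, you can compress them, for example with gzip (or pigz for multi-threaded compression):

gzip *.fastq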

Please ensure that you move all input .fastq.gz files into a folder called fastq (Located in your project folder):

mkdir fastq
cp *.fastq.gz fastq/

Sample sheet and configuration file

In order to run HRIBO, you have to provide a sample sheet and a configuration file. Templates for both files are available in the templates subfolder of the HRIBO folder. The configuration file allows you to easily customize certain settings, like the adapter sequence. The sample sheet specifies the relation between the input .fastq files (condition, replicate, etc.).

Copy the templates of the sample sheet and the configuration file into the HRIBO folder:

cp HRIBO/templates/samples.tsv HRIBO/
cp HRIBO/templates/config.yaml HRIBO/

Customize the config.yaml using your preferred editor.
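As a minimal sketch, the most commonly changed entries look as follows (option names as documented in the workflow-configuration; the exact grouping may differ in your template version):

biologySettings:
    # Adapter sequence used
    adapter: ""
    genome: "genome.fa"
    annotation: "annotation.gff"
    samples: "HRIBO/samples.tsv"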

Note

For a detailed overview of the available options please refer to our workflow-configuration

Edit the sample sheet corresponding to your project. It contains the following variables:

  • method: indicates the method used for this project, here RIBO for ribosome profiling and RNA for RNA-seq.

  • condition: indicates the applied condition (e.g. A, B, …).

  • replicate: ID used to distinguish between the different replicates (e.g. 1, 2, …).

  • fastqFile: indicates the corresponding fastq file for a given sample.

Note

If you have paired end data, please ensure that you use the samples_pairedend.tsv file.
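The paired-end template can be copied in the same way as the single-end one:

cp HRIBO/templates/samples_pairedend.tsv HRIBO/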

As seen in the samples.tsv template:

method | condition | replicate | fastqFile
------ | --------- | --------- | ---------
RIBO | A | 1 | fastq/RIBO-A-1.fastq.gz
RIBO | A | 2 | fastq/RIBO-A-2.fastq.gz
RIBO | B | 1 | fastq/RIBO-B-1.fastq.gz
RIBO | B | 2 | fastq/RIBO-B-2.fastq.gz
RNA | A | 1 | fastq/RNA-A-1.fastq.gz
RNA | A | 2 | fastq/RNA-A-2.fastq.gz
RNA | B | 1 | fastq/RNA-B-1.fastq.gz
RNA | B | 2 | fastq/RNA-B-2.fastq.gz

Note

This is just an example, please refer to our example-workflow for another example.

cluster.yaml

Warning

As we are currently unable to test HRIBO on cluster systems, support for cluster systems is experimental. Since HRIBO is based on snakemake, cluster execution is nevertheless possible, and we added an example SLURM profile; a sketch of a profile-based call is shown below. Please check the snakemake documentation for more details.
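As a sketch, a profile-based run could look as follows (the profile path HRIBO/profiles/slurm is an assumption; point --profile to wherever the example SLURM profile is located in your checkout):

snakemake --use-conda -s HRIBO/Snakefile --configfile HRIBO/config.yaml --profile HRIBO/profiles/slurm -j 20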

Output files

The following tables list all important output files of the workflow.

Note

Files created as intermediate steps of the workflow (e.g. .bam files) are omitted from this list.

Note

For more details about the output files, please refer to the analysis results.

Single-file Output

File name | Description
--------- | -----------
samples.xlsx | Excel version of the input samples file.
manual.pdf | A PDF file describing the analysis.
annotation_total.xlsx | Excel file containing detailed measures for every feature in the input annotation, using read counts including multi-mapping reads.
annotation_unique.xlsx | Excel file containing detailed measures for every feature in the input annotation, using read counts excluding multi-mapping reads.
total_read_counts.xlsx | Excel file containing read counts with multi-mapping reads.
unique_read_counts.xlsx | Excel file containing read counts without multi-mapping reads.
multiqc_report.html | Quality control report combining all findings of the individual fastQC reports into a well-structured overview file.
heatmap_SpearmanCorr_readCounts.pdf | PDF file showing the Spearman correlation between all samples.
predictions_reparation.xlsx | Excel file containing detailed measures for every ORF detected by reparation.
predictions_reparation.gff | GFF file containing ORFs detected by reparation, for genome browser visualization.
potentialStartCodons.gff | GFF file for genome browser visualization, containing all potential start codons in the input genome.
potentialStopCodons.gff | GFF file for genome browser visualization, containing all potential stop codons in the input genome.
potentialRibosomeBindingSite.gff | GFF file for genome browser visualization, containing all potential ribosome binding sites in the input genome.
potentialAlternativeStartCodons.gff | GFF file for genome browser visualization, containing all potential alternative start codons in the input genome.

Multi-file Output

File name | Description
--------- | -----------
riborex/<contrast>_sorted.csv | Differential expression results by Riborex, sorted by pvalue.
riborex/<contrast>_significant.csv | Differential expression results by Riborex, only significant results (pvalue < 0.05).
xtail/<contrast>_sorted.csv | Differential expression results by xtail, sorted by pvalue.
xtail/<contrast>_significant.csv | Differential expression results by xtail, only significant results (pvalue < 0.05).
xtail/r_<contrast>.pdf | Differential expression results by xtail, plot with RPF-to-mRNA ratios.
xtail/fc_<contrast>.pdf | Differential expression results by xtail, plot with log2 fold change of both mRNA and RPF.
<method>-<condition>-<replicate>.X.Y.Z.bw | BigWig file for genome browser visualization, containing a single nucleotide mapping around certain regions.
<accession>_Z.Y_profiling.xlsx/tsv | Excel and tsv files containing the raw data of the metagene analysis.
<accession>_Z.Y_profiling.pdf | Visualization of the metagene analysis.

Note

<contrast> represents a pair of conditions that are being compared.

Note

The BigWig files are available for different normalization methods, strands and regions: X = (min/mil), Y = (forward/reverse), Z = (fiveprime, threeprime, global, centered), e.g. RIBO-A-1.min.forward.fiveprime.bw.

Tool Parameters

The tools used in our workflow are listed below, with links to their respective webpage and a short description.

Tool | Version | Description
---- | ------- | -----------
cutadapt | 4.1 | Adapter removal and quality trimming
fastQC | 0.11.9 | Quality control
multiQC | 1.13 | Quality control report
segemehl | 0.3.4 | Mapping of reads
flash2 | 2.2.00 | Merging paired-end samples into single-end
cufflinks | 2.2.1 | Used to convert gff to gtf
bedtools | 2.30.0 | Collection of useful processing tools (e.g. read counting)
reparation_blast | 1.0.9 | Prediction of novel open reading frames
deepribo | 1.1 | Prediction of novel open reading frames
riborex | 2.4.0 | Differential expression analysis
xtail | 1.1.5 | Differential expression analysis

Report

In order to aggregate the final results into a single folder structure and receive a date-tagged .zip file, you can use the makereport.sh script. The current date will be used as a tag for the report folder.

bash HRIBO/scripts/makereport.sh <reportname>

Example-workflow

A detailed step by step tutorial is available at: example-workflow.

References

GruningDSjodin+17

Björn Grüning, Ryan Dale, Andreas Sjödin, Jillian Rowe, Brad A. Chapman, Christopher H. Tomkins-Tinch, Renan Valieris, and Johannes Köster. Bioconda: a sustainable and comprehensive software distribution for the life sciences. bioRxiv, 2017. URL: https://www.biorxiv.org/content/early/2017/10/27/207092, doi:10.1101/207092.

KosterR18

Johannes Köster and Sven Rahmann. Snakemake—a scalable bioinformatics workflow engine. Bioinformatics, bty350, 2018. URL: http://dx.doi.org/10.1093/bioinformatics/bty350, doi:10.1093/bioinformatics/bty350.

PGAR19

Anastasia H. Potts, Yinping Guo, Brian M. M. Ahmer, and Tony Romeo. Role of CsrA in stress responses and metabolism important for Salmonella virulence revealed by integrated transcriptomics. PLOS ONE, 14(1):1–30, 2019. URL: https://doi.org/10.1371/journal.pone.0211430, doi:10.1371/journal.pone.0211430.

ZAA+18

Daniel R Zerbino, Premanand Achuthan, Wasiu Akanni, M Ridwan Amode, Daniel Barrell, Jyothish Bhai, Konstantinos Billis, Carla Cummins, Astrid Gall, Carlos García Girón, Laurent Gil, Leo Gordon, Leanne Haggerty, Erin Haskell, Thibaut Hourlier, Osagie G Izuogu, Sophie H Janacek, Thomas Juettemann, Jimmy Kiang To, Matthew R Laird, Ilias Lavidas, Zhicheng Liu, Jane E Loveland, Thomas Maurel, William McLaren, Benjamin Moore, Jonathan Mudge, Daniel N Murphy, Victoria Newman, Michael Nuhn, Denye Ogeh, Chuang Kee Ong, Anne Parker, Mateus Patricio, Harpreet Singh Riat, Helen Schuilenburg, Dan Sheppard, Helen Sparrow, Kieron Taylor, Anja Thormann, Alessandro Vullo, Brandon Walts, Amonida Zadissa, Adam Frankish, Sarah E Hunt, Myrto Kostadima, Nicholas Langridge, Fergal J Martin, Matthieu Muffato, Emily Perry, Magali Ruffier, Dan M Staines, Stephen J Trevanion, Bronwen L Aken, Fiona Cunningham, Andrew Yates, and Paul Flicek. Ensembl 2018. Nucleic Acids Research, 46(D1):D754–D761, 2018. URL: http://dx.doi.org/10.1093/nar/gkx1098, doi:10.1093/nar/gkx1098.

Workflow configuration

HRIBO offers several customization options in order to handle different types of input data and to tailor the analysis. On this page, we explain the different options that can be set to easily customize the workflow.

Default workflow

We provide a default config.yaml with generally sensible default values in the templates folder of the HRIBO GitHub repository.

Biology Settings

This section contains general settings for the workflow.

adapter sequence

adapter: ""

The adapter sequence that will be removed using cutadapt. It is important to set this if you know that your data has not been trimmed beforehand.

genome file

genome: "genome.fa"

The path to the genome file in fasta format. Only fasta format is supported. Make sure that the genome file has the same identifiers as the reference annotation file.
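A quick sanity check with standard shell tools is to compare the fasta headers with the sequence identifiers used in the annotation:

grep ">" genome.fa
grep -v "^#" annotation.gff | cut -f1 | sort -u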

annotation file

annotation: "annotation.gff"

The path to the annotation file in gff format. Only gff format is supported. Make sure that the annotation file has the same identifiers as the reference genome file.

sample sheet

samples: "HRIBO/samples.tsv"

The path to the sample sheet (example provided in the templates folder). The sample sheet describes the relation between the different samples used for the experiment.

alternative start codons

alternativestartcodons: ["GTG","TTG"]

A list of alternative start codons that will be used to predict ORFs. The default is ["GTG","TTG"].

Differential Expression Settings

differentialexpression

differentialexpression: "on"

This option allows you to turn on or off differential expression analysis. If you do not have multiple conditions defined in the sample sheet and differential expression is activated, you will receive an error message. Options are “on / off”.

features

features: ["CDS", "sRNA"]

This option allows you to specify which features should be used for differential expression analysis. Any feature that appears in your reference annotation is allowed. We suggest using CDS and sRNA features.

contrasts

contrasts: ["treated1-untreated1", "treated2-untreated2"]

This option allows you to specify which contrasts should be used for differential expression analysis. The order will affect the directionality of the log2FC values in the output files, i.e. which condition of the pair is treated as the reference.

adjusted pvalue cutoff

padjCutoff: 0.05

This option allows you to specify the adjusted pvalue cutoff for differential expression analysis. The default is 0.05. All results will be present in the output; this cutoff is solely used to add additional pre-filtered lists to the output Excel tables.

log2 fold change cutoff

log2fcCutoff: 1.0

This option allows you to specify the log2 fold change cutoff for differential expression analysis. The default is 1.0. All results will be present in the output; this cutoff is solely used to add additional pre-filtered lists to the output Excel tables. Only positive values are allowed; the respective negative values used to determine down-regulation are generated automatically by multiplying by -1.

ORF predictions

deepribo: "on"

Activating DeepRibo predictions will give you an additional file with ORF predictions. In our experience, the top DeepRibo results tend to be better than those of reparation. For archaea, where reparation performs very poorly, DeepRibo is the preferred option.

Warning

DeepRibo cannot cope with genomes containing special IUPAC symbols; ensure that your genome file contains only A, G, C, T and N symbols.

Read statistics Settings

readLengths: "10-80"

This option allows you to specify the read lengths that should be used for read statistics analysis. The default is 10-80. We allow combinations of intervals and single values, e.g. 10-40,55,70.

Metagene Profiling Settings

There exist multiple options to customize the metagene profiling analysis.

Positions outside of the ORF

positionsOutsideORF: 50

This option allows you to specify the number of positions outside of the ORF that should be considered for metagene profiling analysis. The default is 50.

Positions inside of the ORF

positionsInsideORF: 150

This option allows you to specify the number of positions inside of the ORF that should be considered for metagene profiling analysis. The default is 150. Genes that are shorter than the specified number of positions inside of the ORF will be ignored.

Filtering Methods

filteringMethods: ["overlap", "length", "rpkm"]

This option allows you to specify which filtering methods should be used for metagene profiling analysis. The default is ["overlap", "length", "rpkm"].

These methods are used to filter out genes that would cause artifacts in the metagene profiling analysis.

  • The overlap method filters out genes that overlap with other genes.

  • The length method filters out genes that are shorter than the threshold.

  • The rpkm method filters out genes below the rpkm threshold.

Neighboring Genes Distance

neighboringGenesDistance: 50

This option allows you to specify the distance to neighboring genes that will be considered to filter out overlapping genes. The default is 50. This means that genes that are closer than 50nt to another gene will be filtered out.

RPKM Threshold

rpkmThreshold: 10.0

This option allows you to specify the RPKM threshold that will be used to filter out genes below the threshold. The default is 10.0.

Length Cutoff

lengthCutoff: 50

This option allows you to specify the length cutoff used to filter out genes shorter than the threshold. Be aware that genes shorter than positionsInsideORF will be filtered out anyway.

Mapping Methods

mappingMethods: ["fiveprime", "threeprime", "centered", "global"]

This option allows you to specify which mapping methods should be used for metagene profiling analysis. The default is ["fiveprime", "threeprime"]. Multiple mapping methods can be used, resulting in multiple output files/folders.

  • The fiveprime method will use the 5’ end of the reads to count the number of reads per position.

  • The threeprime method will use the 3’ end of the reads to count the number of reads per position.

  • The centered method will use the center nucleotides of each read to count the number of reads per position.

  • The global method will use the entire read to count the number of reads per position.

Read Lengths

readLengths: "25-34"

This option allows you to specify the read lengths that should be used for metagene profiling analysis. The default is 25-34. We allow combinations of intervals and single values, e.g. 25-34,40,50.

Normalization Methods

normalizationMethods: ["raw", "cpm"]

This option allows you to specify which normalization methods should be used for metagene profiling analysis. The default is ["raw", "cpm"].

Output Formats

outputFormats: ["svg", "pdf", "png", "jpg", "interactive"]

This option allows you to specify which output formats should be used for metagene profiling analysis. The default is ["svg", "interactive"]. The interactive option will generate an interactive html file that can be used to explore the metagene profiling results.

PlotlyJS

This option is only used when the interactive output format is selected.

includePlotlyJS: "integrated"

This option allows you to specify how the plotly.js library should be included in the interactive html file. The default is integrated.

  • The integrated option will include the plotly.js library in the html file. The file will be larger, but can be used offline. (+3.7mb/file)

  • The online option will include a link to the plotly.js library in the html file.

  • The local option will include a link to a local plotly.js library in the html file. This option requires the plotly.js library to be available in the same folder as the html file.

Colors

colorList: []

This option allows you to specify a list of colors to be used for the metagene profiling plots. The default is [], in which case a list of colorblind-friendly colors is used. If you change the colors, make sure that the number of colors matches the number of samples. We also suggest using hex colors, e.g. #ff0000.
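For example, for an experiment with two samples (the specific colors here are just an illustration):

colorList: ["#1b9e77", "#d95f02"]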

Paired-end support

Until full paired-end support is added, we handle paired-end data by merging it into single-end data, using the tool flash2.

In order to use paired-end data, simply replace the Snakefile with the Snakefile_pairedend. This requires a special samples_pairedend.tsv, which is also available in the HRIBO templates folder. A sketch of the resulting invocation is shown after the table below.

method | condition | replicate | fastqFile | fastqFile2
------ | --------- | --------- | --------- | ----------
RIBO | A | 1 | fastq/RIBO-A-1_R1.fastq.gz | fastq/RIBO-A-1_R2.fastq.gz
RIBO | A | 2 | fastq/RIBO-A-2_R1.fastq.gz | fastq/RIBO-A-2_R2.fastq.gz
RIBO | B | 1 | fastq/RIBO-B-1_R1.fastq.gz | fastq/RIBO-B-1_R2.fastq.gz
RIBO | B | 2 | fastq/RIBO-B-2_R1.fastq.gz | fastq/RIBO-B-2_R2.fastq.gz
RNA | A | 1 | fastq/RNA-A-1_R1.fastq.gz | fastq/RNA-A-1_R2.fastq.gz
RNA | A | 2 | fastq/RNA-A-2_R1.fastq.gz | fastq/RNA-A-2_R2.fastq.gz
RNA | B | 1 | fastq/RNA-B-1_R1.fastq.gz | fastq/RNA-B-1_R2.fastq.gz
RNA | B | 2 | fastq/RNA-B-2_R1.fastq.gz | fastq/RNA-B-2_R2.fastq.gz
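As referenced above, a sketch of the paired-end invocation (assuming the paired-end Snakefile accepts the same command-line options as the default Snakefile):

snakemake --use-conda -s HRIBO/Snakefile_pairedend --configfile HRIBO/config.yaml --directory ${PWD} -j 10 --latency-wait 60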

Analysis result files

The important files in this workflow are listed and explained below.

ORF Predictions

These output files contain information about predicted open reading frames, including novel predictions.

predictions_reparation.xlsx

This file contains all reparation ORF predictions.

Column name | Description
----------- | -----------
Identifier | Unique identifier describing the entry.
Genome | The genome accession identifier.
Source | The source of the ORF (here: reparation).
Feature | The feature of the ORF (here: CDS).
Start | The start position of the ORF.
Stop | The stop position of the ORF.
Strand | The strand of the ORF (+/-).
Pred_probability | The probability of this ORF (0.5-1, 1 being the best value).
Locus_tag | If the detected ORF is already in the annotation, this gives its locus tag.
Old_locus_tag | The old locus tag of a gene (if available in the annotation).
Name | If the detected ORF is already in the annotation, this gives its name.
Length | The length of the ORF.
Codon_count | The number of codons in the ORF (length/3).
<method>-<condition>-<replicate>_TE | The translational efficiency for the given sample.
<method>-<condition>-<replicate>_rpkm | The RPKM for the given sample.
Evidence | The <condition>-<replicate> sample in which this ORF was predicted.
Start_codon | The start codon of the ORF.
Stop_codon | The stop codon of the ORF.
15nt_upstream | The 15nt upstream of the start codon.
Nucleotide_seq | The nucleotide sequence of the ORF.
Aminoacid_seq | The amino acid sequence of the ORF.

predictions_reparation.gff

An annotation file in .gff format containing all predictions of reparation for visualization in a genome browser.

predictions_deepribo.xlsx

Note

These files are only available when activating DeepRibo predictions in the config.yaml. (see configurations)

This file contains all DeepRibo ORF predictions.

Column name | Description
----------- | -----------
Identifier | Unique identifier describing the entry.
Genome | The genome accession identifier.
Source | The source of the ORF (here: deepribo).
Feature | The feature of the ORF (here: CDS).
Start | The start position of the ORF.
Stop | The stop position of the ORF.
Strand | The strand of the ORF (+/-).
Pred_value | The value DeepRibo attributes to the given prediction.
Pred_rank | The rank calculated from the prediction value (the best prediction has rank 1).
Novel_rank | A special ranking involving only novel ORFs that are not in the annotation.
Locus_tag | If the detected ORF is already in the annotation, this gives its locus tag.
Old_locus_tag | The old locus tag of a gene (if available in the annotation).
Name | If the detected ORF is already in the annotation, this gives its name.
Length | The length of the ORF.
Codon_count | The number of codons in the ORF (length/3).
<method>-<condition>-<replicate>_TE | The translational efficiency for the given sample.
<method>-<condition>-<replicate>_rpkm | The RPKM for the given sample.
Evidence | The <condition>-<replicate> sample in which this ORF was predicted.
Start_codon | The start codon of the ORF.
Stop_codon | The stop codon of the ORF.
15nt_upstream | The 15nt upstream of the start codon.
Nucleotide_seq | The nucleotide sequence of the ORF.
Aminoacid_seq | The amino acid sequence of the ORF.

predictions_deepribo.gff

Note

These files are only available when activating DeepRibo predictions in the config.yaml. (see configurations)

An annotation file in .gff3 format containing all predictions of DeepRibo for visualization in a genome browser.

Quality control

This comprises all files that can help to perform quality control on all input samples.

multiqc_report.html

The multiQC report collects information from different tools, including fastQC and subread featureCounts. The general statistics give an overview of:

  • the number of duplicates

  • the GC content

  • the average read lengths

  • the number of reads (in millions)

These statistics are collected after each processing step of our pipeline.

  • raw: the unprocessed data

  • trimmed: the data after trimming the adapter sequences

  • mapped: the data after mapping with Segemehl

  • unique: the data after removing multi-mapping reads

  • norRNA: the data after filtering out the rRNA

Furthermore, feature counts are provided for the different features of the annotation file (i.e. how many reads map to each feature). This includes all (featurecount), rRNA, norRNA (after filtering), tRNA and ncRNA. This is followed by a fastQC report including sequence counts, sequence quality histograms, per sequence quality scores, per base sequence content, per sequence GC content, per base N content, sequence length distribution, sequence duplication levels, overrepresented features, adapter content and a status overview.

heatmap_SpearmanCorr_readCounts.pdf

Spearman correlation coefficients of read counts. The dendrogram indicates which samples' read counts are most similar to each other. Since the correlation should always be higher between experiments of the same condition and experiment type (e.g. replicates) than between others, this is a rapid way to quality-control the labeling and consistency of the input data.

annotation_total.xlsx

This file contains detailed measures for every feature in the input annotation using read counts including multi-mapping reads.

Column name | Description
----------- | -----------
Identifier | Unique identifier describing the entry.
Genome | The genome accession identifier.
Source | The source of the annotated feature.
Feature | The feature type of the annotated feature.
Start | The start position of the annotated feature.
Stop | The stop position of the annotated feature.
Strand | The strand of the annotated feature (+/-).
Locus_tag | The locus tag of the annotated feature (if available).
Old_locus_tag | The old locus tag of a gene (if available in the annotation).
Name | The name of the annotated feature (if available).
Length | The length of the annotated feature.
Codon_count | The number of codons in the annotated feature (length / 3).
<method>-<condition>-<replicate>_TE | The translational efficiency for the given sample.
<method>-<condition>-<replicate>_rpkm | The RPKM for the given sample (Reads Per Kilobase Million).
Start_codon | The start codon of the annotated feature.
Stop_codon | The stop codon of the annotated feature.
15nt_upstream | The 15nt upstream of the start codon.
Nucleotide_seq | The nucleotide sequence of the annotated feature.
Aminoacid_seq | The amino acid sequence of the annotated feature.
Product | The product of the annotated feature (if available).
Note | The note of the annotated feature (if available).

total_read_counts.xlsx

This file shows the overall read-counts for each feature annotated in the user-provided annotation, after mapping and before removal of multi-mapping reads.

annotation_unique.xlsx

This file contains detailed measures for every feature in the input annotation using read counts after removal of multi-mapping reads.

Column name | Description
----------- | -----------
Identifier | Unique identifier describing the entry.
Genome | The genome accession identifier.
Source | The source of the annotated feature.
Feature | The feature type of the annotated feature.
Start | The start position of the annotated feature.
Stop | The stop position of the annotated feature.
Strand | The strand of the annotated feature (+/-).
Locus_tag | The locus tag of the annotated feature (if available).
Old_locus_tag | The old locus tag of a gene (if available in the annotation).
Name | The name of the annotated feature (if available).
Length | The length of the annotated feature.
Codon_count | The number of codons in the annotated feature (length / 3).
<method>-<condition>-<replicate>_TE | The translational efficiency for the given sample.
<method>-<condition>-<replicate>_rpkm | The RPKM for the given sample (Reads Per Kilobase Million).
Start_codon | The start codon of the annotated feature.
Stop_codon | The stop codon of the annotated feature.
15nt_upstream | The 15nt upstream of the start codon.
Nucleotide_seq | The nucleotide sequence of the annotated feature.
Aminoacid_seq | The amino acid sequence of the annotated feature.
Product | The product of the annotated feature (if available).
Note | The note of the annotated feature (if available).

unique_read_counts.xlsx

This file shows the overall read-counts for each feature annotated in the user-provided annotation, after mapping and after removal of multi-mapping reads.

genome-browser

The files that can be used for visualization in a genome browser.

updated_annotation.gff

A gff track containing the original annotation together with the new predictions by reparation.

potentialStartCodons.gff

A genome browser track with all possible start codons.

potentialStopCodons.gff

A genome browser track with all possible stop codons.

potentialRibosomeBindingSite.gff

A genome browser track with possible ribosome binding sites.

potentialAlternativeStartCodons.gff

A genome browser track with alternative start codons.

BigWig coverage files

We offer many different single nucleotide mapping BigWig files for genome browser visualization. These files are available for different regions, computed with different mapping methods:

  • global: full read is mapped

  • centered: region around the center.

  • threeprime: region around the three prime end.

  • fiveprime: region around the five prime end.

These are all available with the following normalization methods:

  • raw: raw, unprocessed coverage. This should only be used to check the coverage of a single file; it is not suitable for comparisons between files.

  • min: normalized by the minimal total number of reads over all samples (factor = min. number of reads / number of reads). This is the recommended normalization when comparing different samples from the same experiment.

  • mil: normalized to one million reads (factor = 1000000 / number of reads). This is the recommended normalization when comparing samples from different experiments.
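As a small worked example with assumed library sizes (sample A: 2,000,000 reads; sample B: 4,000,000 reads; smallest sample: 2,000,000 reads):

min: factor_A = 2000000 / 2000000 = 1.0;  factor_B = 2000000 / 4000000 = 0.5
mil: factor_A = 1000000 / 2000000 = 0.5;  factor_B = 1000000 / 4000000 = 0.25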

Differential Expression

Files related to the differential expression analysis. All output tables contain multiple pre-filtered sheets, using the padj cutoff (0.05) to show up- or down-regulated genes.

deltate/<contrast>_sorted.xlsx

Table containing all differential expression results from deltate.

riborex/<contrast>_sorted.xlsx

Table containing all differential expression results from riborex.

xtail/<contrast>_sorted.xlsx

Table containing all differential expression results from xtail.

xtail/r_<contrast>.pdf

This figure shows the RPF-to-mRNA ratios in two conditions, where the position of each gene is determined by its RPF-to-mRNA ratio (log2R) in the two conditions, represented on the x-axis and y-axis respectively. The points are color-coded by the final pvalue obtained with xtail (more significant pvalues having darker colors):

  • blue: genes with a log2R larger in the first condition than in the second.

  • red: genes with a log2R larger in the second condition than in the first.

  • green: genes whose log2R changes homodirectionally in the two conditions.

  • yellow: genes whose log2R changes antidirectionally in the two conditions.

xtail/fc_<contrast>.pdf

This figure shows the result of the differential expression analysis at the two expression levels, where each gene is a dot whose position is determined by its log2 fold change (log2FC) at the transcriptional level (mRNA log2FC), represented on the x-axis, and its log2FC at the translational level (RPF log2FC), represented on the y-axis. The points are color-coded by the final pvalue obtained with xtail (more significant pvalues having darker colors):

  • blue: genes whose mRNA log2FC is larger than 1 (transcriptional level).

  • red: genes whose RPF log2FC is larger than 1 (translational level).

  • green: genes changing homodirectionally at both levels.

  • yellow: genes changing antidirectionally at the two levels.

output tables

All output tables are transferred into Excel format for easier handling, with additional information added.

Column name | Description
----------- | -----------
Genome | The genome accession identifier.
Start | The start position of the analysed feature.
Stop | The stop position of the analysed feature.
Strand | The strand orientation of the analysed feature (+/-).
Locus_tag | The locus tag of the analysed feature (if available).
Old_locus_tag | The old locus tag of a gene (if available in the annotation).
Identifier | Unique identifier <Genome>:<start>-<stop>:<strand>.
Name | The name of the annotated feature (if available).
RIBO_baseMean | The average of the normalized count values, divided by size factors, over all RIBO samples.
RIBO_log2FoldChange | The log2 fold change between the conditions of the different RIBO samples.
RIBO_lfcSE | The standard error of the log2FC for the RIBO samples.
RIBO_pvalue | The unadjusted pvalue for the RIBO samples.
RIBO_padj | The adjusted pvalue (adjusted for multiple testing) for the RIBO samples.
RNA_baseMean | The average of the normalized count values, divided by size factors, over all RNA samples.
RNA_log2FoldChange | The log2 fold change between the conditions of the different RNA samples.
RNA_lfcSE | The standard error of the log2FC for the RNA samples.
RNA_pvalue | The unadjusted pvalue for the RNA samples.
RNA_padj | The adjusted pvalue (adjusted for multiple testing) for the RNA samples.
TE_baseMean | The average of the normalized count values, divided by size factors, over all TE samples.
TE_log2FoldChange | The log2 fold change between the conditions of the different TE samples.
TE_lfcSE | The standard error of the log2FC for the TE samples.
TE_pvalue | The unadjusted pvalue for the TE samples.
TE_padj | The adjusted pvalue (adjusted for multiple testing) for the TE samples.
Length | The length of the analysed feature.
Codon_count | The number of codons in the analysed feature.
Start_codon | The start codon of the analysed feature.
Stop_codon | The stop codon of the analysed feature.
15nt_upstream | The 15nt upstream of the start codon.
Nucleotide_seq | The nucleotide sequence of the analysed feature.
Aminoacid_seq | The amino acid sequence of the analysed feature.

Note

RIBO and RNA columns are only available for deltaTE.

Each table has several sheets:

  • all: contains all entries

  • up: contains all entries with a log2FC > 1 that are considered significant (padj < 0.05).

  • down: contains all entries with a log2FC < -1 that are considered significant (padj < 0.05).

Note

There are slight differences between the output tables of the different tools.

Metagene Profiling Analysis

Please refer to the metagene profiling page for further details.

Additional output

samples.xlsx

An excel representation of the input sample file.

manual.pdf

A PDF file giving some explanations about the output files; it is contained in the final result report.

overview.xlsx

An overview table containing all information gathered from the prediction tools and the differential expression analysis. The contents of this table change depending on which options are set. The overview table for the default workflow contains the annotation, reparation, deepribo and differential expression output.

Column name | Description
----------- | -----------
Identifier | Unique identifier describing the entry.
Genome | The genome accession identifier.
Start | The start position of the ORF.
Stop | The stop position of the ORF.
Strand | The strand of the ORF (+/-).
Locus_tag | The locus tag of the ORF (if not novel).
Overlapping_genes | Genes that overlap with the predicted ORF.
Old_locus_tag | The old locus tag of a gene (if available in the annotation).
Name | The name of the ORF (if not novel).
Gene_name | The name of the ORF's associated gene feature (if not novel).
Length | The length of the ORF.
Codon_count | The number of codons in the ORF (length / 3).
Start_codon | The start codon of the annotated feature.
Stop_codon | The stop codon of the annotated feature.
15nt_upstream | The 15nt upstream of the start codon.
Nucleotide_seq | The nucleotide sequence of the annotated feature.
Aminoacid_seq | The amino acid sequence of the annotated feature.
<method>-<condition>-<replicate>_TE | The translational efficiency for the given sample.
<method>-<condition>-<replicate>_rpkm | The RPKM for the given sample (Reads Per Kilobase Million).
Evidence_reparation | The samples this ORF was predicted in (by reparation).
Reparation_probability | The probability of this ORF (0.5-1, 1 being the best value).
Evidence_deepribo | The samples this ORF was predicted in (by deepribo).
Deepribo_rank | The deepribo rank for this ORF (1 being the best value, 999999 meaning undefined).
Deepribo_score | The score the deepribo rank is based on.
riborex_pvalue | The pvalue (determined by riborex).
riborex_pvalue_adjusted | The adjusted pvalue (determined by riborex).
riborex_log2FC | The log2FC (determined by riborex).
xtail_pvalue | The pvalue (determined by xtail).
xtail_pvalue_adjusted | The adjusted pvalue (determined by xtail).
xtail_log2FC | The log2FC (determined by xtail).

Metagene profiling

Note

Be aware that we are still actively working on our metagene profiling scripts. The main functionality is available, but the plotting scripts may be subject to further changes. Result tables in .tsv and .xlsx format are provided if you prefer using your own plotting scripts.

In this section, we describe the metagene profiling data provided by HRIBO.

What is metagene profiling?

Metagene profiling analyses the distribution of mapped reads around annotated start codons and stop codons.

For Ribo-seq, it is expected that the ribosome protects a specific range of read lengths, often typical for the investigated group of organisms, from digestion by nuclease. These reads should show a characteristic peak around the start codon, corresponding to the high frequency with which ribosomes are bound there.

We output and plot the metagene profiling for each individual fragment length as a quality control for the Ribo-seq protocol. If the distribution for all read lengths is atypical, the arrest of the ribosomes likely failed.

Metagene output

We provide the possibility to analyse the metagene profiling for different coverage mapping methods. For each sample, both the start and stop codon profiling will be performed and plotted.

Furthermore, we provide normalized (cpm, window) and unnormalized (raw) versions for each analysed sample. The normalized version takes into account the total value for each position, as well as the current window length.

The metagene profiling is done for each sample and each coverage mapping method.

  • global: full read is mapped

  • centered: region around the center is mapped

  • threeprime: region around the three prime end is mapped

  • fiveprime: region around the five prime end is mapped

For each of these sample/mapping combinations, result tables in excel .xlsx format are provided, as well as an interactive plot or several static plots.

Note

New filtering methods were added. For details check the options section.

Example output

Note

This section contains old output figures, but these are still valid as the overall method did not change.

In the following, we show a few example figures of how the results can look.

[Figure: _images/example_threeprime.png]

This example shows a very sharp peak around +17nt. (mapping: threeprime)

Read counts per position (coordinates) and read length:

coordinates | 30 | 31 | 32 | 33 | 34 | sum
----------- | -- | -- | -- | -- | -- | ---
11 | 1321 | 1242 | 1269 | 773 | 92 | 4697
12 | 1564 | 1193 | 1358 | 624 | 172 | 4911
13 | 4272 | 2317 | 1659 | 906 | 125 | 9279
14 | 6870 | 3009 | 1814 | 738 | 91 | 12522
15 | 6214 | 7498 | 2733 | 1265 | 140 | 17850
16 | 17797 | 27399 | 28398 | 13322 | 1245 | 88161
17 | 7686 | 22687 | 62042 | 25575 | 5427 | 123417
18 | 1272 | 5200 | 11156 | 11202 | 1382 | 30212
19 | 1584 | 3520 | 10027 | 22530 | 5325 | 42986
20 | 792 | 2166 | 2636 | 5044 | 1969 | 12607
21 | 176 | 568 | 1082 | 931 | 307 | 3064
22 | 233 | 321 | 930 | 1102 | 219 | 2805
23 | 204 | 211 | 270 | 378 | 89 | 1152

The tables give a better overview of the data, showing that the highest peak is found at +17 for a read length of 32nt.

[Figure: _images/ecoli_example.png]

In this example the peak is less sharp, but it is still clearly distinguishable and around +13nt. (mapping: threeprime)

Read counts per position (coordinates) and read length:

coordinates | 26 | 27 | 28 | 29 | 30 | sum
----------- | -- | -- | -- | -- | -- | ---
9 | 294 | 308 | 198 | 297 | 321 | 1418
10 | 177 | 338 | 204 | 170 | 286 | 1175
11 | 217 | 236 | 248 | 164 | 162 | 1027
12 | 2545 | 804 | 976 | 1379 | 548 | 6252
13 | 8593 | 10334 | 5286 | 4359 | 5519 | 34091
14 | 6120 | 7662 | 3238 | 2403 | 3590 | 23013
15 | 3416 | 4134 | 3807 | 2633 | 1643 | 15633
16 | 1311 | 4097 | 5079 | 5528 | 3020 | 19035
17 | 270 | 340 | 1086 | 1682 | 1222 | 4600
18 | 202 | 282 | 269 | 632 | 761 | 2146

This table shows a less clear-cut situation: the offset at +13 for read length 27nt is the highest, but it is very close to the surrounding values.

The metagene analysis is very dependent on the analysed data. We have observed cases with multiple peaks, e.g. one large peak and a few slightly smaller ones.

[Figure: _images/example_fiveprime.png]

This shows a case with a strong peak around -15nt and multiple smaller peaks at -22, +112, +219, and +291. These kinds of peaks can have multiple reasons. Reasons we have observed so far were:

  • tRNA or rRNA that was not filtered out correctly

  • singular genes that have a large number of reads attributed to them (due to a biological reason or a faulty sequencing)

Example workflow

We demonstrate the retrieval of input files and the execution of the workflow, both locally and on a server cluster via a queuing system, using an example with data available from NCBI. This dataset is available under the accession number PRJNA379630.

Note

In this tutorial, we will show the basic functionalities of our workflow, for information about additional options please refer to: workflow-configuration.

Note

Ensure that you have miniconda3 installed and a snakemake environment set-up. Please refer to the overview for details on the installation.

Setup

First of all, we start by creating the project directory and changing to it.

mkdir project
cd project

We then download the latest version of HRIBO into the newly created project folder and unpack it.

wget https://github.com/RickGelhausen/HRIBO/archive/1.7.0.tar.gz
tar -xzf 1.7.0.tar.gz; mv HRIBO-1.7.0 HRIBO; rm 1.7.0.tar.gz;

Retrieve and prepare input files

Before starting the workflow, we have to acquire and prepare several input files. These files are the annotation file, the genome file, the fastq files, the configuration file and the sample sheet.

Annotation and genome files

First, we want to retrieve the annotation file and the genome file. In this case, we can find both on NCBI using the accession number NC_002516.2.

[Figure: _images/NCBI-1.png]

On this page, we can directly retrieve both files by clicking on the corresponding download links next to the file descriptions. Alternatively, you can download them directly using the following commands:

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/765/GCF_000006765.1_ASM676v1/GCF_000006765.1_ASM676v1_genomic.gff.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/765/GCF_000006765.1_ASM676v1/GCF_000006765.1_ASM676v1_genomic.fna.gz

Then, we unpack and rename both files.

gunzip GCF_000006765.1_ASM676v1_genomic.gff.gz && mv GCF_000006765.1_ASM676v1_genomic.gff annotation.gff
gunzip GCF_000006765.1_ASM676v1_genomic.fna.gz && mv GCF_000006765.1_ASM676v1_genomic.fna genome.fa

.fastq files

Next, we want to acquire the fastq files. The fastq files are available under the accession number PRJNA379630 on NCBI. The files have to be downloaded using the Sequence Read Archive (SRA). There are multiple ways of downloading files from SRA as explained here.

As we already have conda installed, the easiest way is to install the sra-tools:

conda create -n sra-tools -c bioconda -c conda-forge sra-tools pigz

This will create a conda environment containing the sra-tools. Using these, we can simply pass the SRA identifiers and download the data:

conda activate sra-tools;
fasterq-dump SRR5356907; pigz -p 2 SRR5356907.fastq; mv SRR5356907.fastq.gz RIBO-PAO1-gly-1.fastq.gz;
fasterq-dump SRR5356908; pigz -p 2 SRR5356908.fastq; mv SRR5356908.fastq.gz RNA-PAO1-gly-1.fastq.gz;
conda deactivate;

Note

Due to the runtime of several tools, especially the mapping by segemehl, this tutorial uses only one condition and one replicate. If available, it is advisable to use as many replicates as possible.

Warning

If you have a bad internet connection, this step might take some time. If you prefer, you can also use your own .fastq files, but ensure that you use the correct annotation and genome files and that you compress them in .gz format (using gzip, pigz, etc.).

This will produce compressed files for each of the required .fastq files. We will move them into a folder called fastq.

mkdir fastq;
mv *.fastq.gz fastq;

Sample sheet and configuration file

Finally, we will prepare the configuration file (config.yaml) and the sample sheet (samples.tsv). We start by copying templates for both files from the HRIBO/templates/ into the HRIBO/ folder.

cp HRIBO/templates/samples.tsv HRIBO/

The sample file looks as follows:

method | condition | replicate | fastqFile
------ | --------- | --------- | ---------
RIBO | A | 1 | fastq/RIBO-A-1.fastq.gz
RIBO | A | 2 | fastq/RIBO-A-2.fastq.gz
RIBO | B | 1 | fastq/RIBO-B-1.fastq.gz
RIBO | B | 2 | fastq/RIBO-B-2.fastq.gz
RNA | A | 1 | fastq/RNA-A-1.fastq.gz
RNA | A | 2 | fastq/RNA-A-2.fastq.gz
RNA | B | 1 | fastq/RNA-B-1.fastq.gz
RNA | B | 2 | fastq/RNA-B-2.fastq.gz

Note

When using your own data, use any editor (vi(m), gedit, nano, atom, …) to customize the sample sheet.

Warning

Please ensure not to replace any tabulator symbols with spaces while changing this file.

We will rewrite this file to fit the previously downloaded .fastq.gz files.

method | condition | replicate | fastqFile
------ | --------- | --------- | ---------
RIBO | GLY | 1 | fastq/RIBO-PAO1-gly-1.fastq.gz
RNA | GLY | 1 | fastq/RNA-PAO1-gly-1.fastq.gz

Next, we are going to set up the config.yaml.

cp HRIBO/templates/config.yaml HRIBO/

The config file can be used to easily change the parameters of HRIBO.

Note

For a detailed overview of the available options please refer to our workflow-configuration

In our small example, we will adjust the adapter sequence and, to reduce the runtime, deactivate deepribo. This leads to the following changes in the config.yaml file:

biologySettings:
    # Adapter sequence used
    adapter: "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"

predictionSettings:
    # Deepribo predictions: on / off
    deepribo: "off"

Running the workflow

Now that all the required files are prepared, we can start running the workflow, either locally or in a cluster environment.

Warning

Before you start using snakemake, remember to activate the environment first.

conda activate hribo_env

Run the workflow locally

Use the following steps when you plan to execute the workflow on a single server, cloud instance or workstation.

Warning

Please be aware that some steps of the workflow require a lot of memory or time, depending on the size of your input data. To get a better idea about the memory consumption, you can have a look at the provided sge.yaml or torque.yaml files.

Navigate to the project folder containing your annotation and genome files, as well as the HRIBO folder. Start the workflow locally from this folder by running:

snakemake --use-conda -s HRIBO/Snakefile --configfile HRIBO/config.yaml --directory ${PWD} -j 10 --latency-wait 60

This will start the workflow locally.

  • --use-conda: instruct snakemake to download tool dependencies from conda.

  • -s: specifies the Snakefile to be used.

  • --configfile: specifies the config file to be used.

  • --directory: specifies your current path.

  • -j: specifies the maximum number of cores snakemake is allowed to use.

  • --latency-wait: specifies how long (in seconds) snakemake will wait for filesystem latencies until declaring a file to be missing.
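Before launching the full analysis, you can preview the scheduled jobs with a dry run (standard snakemake behaviour; no jobs are executed):

snakemake -n -s HRIBO/Snakefile --configfile HRIBO/config.yaml --directory ${PWD}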

Run Snakemake in a cluster environment

Use the following steps if you are executing the workflow via a queuing system. Edit the configuration file <cluster>.yaml according to your queuing system setup and cluster hardware.

Navigate to the project folder on your cluster system and start the workflow from this folder (the following system call shows the usage with Grid Engine):

snakemake --use-conda -s HRIBO/Snakefile --configfile HRIBO/config.yaml --directory ${PWD} -j 20 --cluster-config HRIBO/sge.yaml

Note

Ensure that you use an appropriate <cluster>.yaml for your cluster system. We provide one for SGE and TORQUE based systems.

Example: Run Snakemake in a cluster environment

Warning

Be advised that this is a specific example, the required options may change depending on your system.

We ran the tutorial workflow in a TORQUE cluster environment. For this, we created a bash script torque.sh in our project folder.

vi torque.sh

Note

Please note that all arguments enclosed in <> have to be customized. This script will only work if your cluster uses the TORQUE queuing system.

We proceeded by writing the queuing script:

#!/bin/bash
#PBS -N <ProjectName>
#PBS -S /bin/bash
#PBS -q "long"
#PBS -d <PATH/ProjectFolder>
#PBS -l nodes=1:ppn=1
#PBS -o <PATH/ProjectFolder>
#PBS -j oe
cd <PATH/ProjectFolder>
source activate HRIBO
snakemake --latency-wait 600 --use-conda -s HRIBO/Snakefile --configfile HRIBO/config.yaml --directory ${PWD} -j 20 --cluster-config HRIBO/torque.yaml --cluster "qsub -N {cluster.jobname} -S /bin/bash -q {cluster.qname} -d <PATH/ProjectFolder> -l {cluster.resources} -o {cluster.logoutputdir} -j oe"

We then simply submitted this job to the cluster:

qsub torque.sh

Using any of the presented methods, this will run the workflow on the tutorial dataset and create the desired output files.

Results

The last step is to aggregate all results once the workflow has finished running. For this purpose, we provide a script called makereport.sh in the scripts folder of HRIBO.

bash HRIBO/scripts/makereport.sh <reportname_prefix>

Running this command will create a folder in which all results from the workflow's final output are collected; it will additionally create a compressed file in .zip format.

Note

A detailed explanation of the result files can be found in the result section.

Note

The final result of this example workflow can be found here.

Warning

As many browsers have dropped support for viewing FTP files, you might have to use an FTP client instead.

Runtime

The total runtime of the example workflow, using 12 cores of an AMD EPYC processor (with IBPB, 1996 MHz) and 64 GB RAM, was 4h04m37s.

The runtime includes the automatic download and installation of all dependencies by conda and singularity. This step mainly depends on the available network bandwidth; in our case, it took about 7 minutes.


Extended workflow

Warning

This tutorial shows a full run of the workflow with all options activated. For testing, we ran this example locally on a large cloud instance. The data is likely too large to be run locally on an average laptop.

We show a run of the full workflow, including deepribo predictions and differential expression analysis, on data available from NCBI. For this purpose, we use a Salmonella enterica dataset available under the accession number PRJNA421559 [PGAR19].

Warning

Ensure that you have miniconda3 and singularity installed and a snakemake environment set-up. Please refer to the overview for details on the installation.

Setup

First of all, we start by creating the project directory and changing into it (you can choose any directory name).

mkdir project
cd project

We then download the latest version of HRIBO into the newly created project folder and unpack it.

wget https://github.com/RickGelhausen/HRIBO/archive/1.7.0.tar.gz
tar -xzf 1.7.0.tar.gz; mv HRIBO-1.7.0 HRIBO; rm 1.7.0.tar.gz;

Retrieve and prepare input files

Before starting the workflow, we have to acquire and prepare several input files. These files are the annotation file, the genome file, the fastq files, the configuration file and the sample sheet.

Annotation and genome files

First, we want to retrieve the annotation file and the genome file. In this case, we can find both on NCBI using the accession number NC_016856.1.

[Figure: _images/HRIBO-2.png]

Note

Ensure that you download the annotation for the correct strain str. 14028S.

On this page, we can directly retrieve both files by clicking on the corresponding download links next to the file descriptions. Alternatively, you can download them directly using the following commands:

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/022/165/GCF_000022165.1_ASM2216v1/GCF_000022165.1_ASM2216v1_genomic.gff.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/022/165/GCF_000022165.1_ASM2216v1/GCF_000022165.1_ASM2216v1_genomic.fna.gz

Then, we unpack and rename both files.

gunzip GCF_000022165.1_ASM2216v1_genomic.gff.gz && mv GCF_000022165.1_ASM2216v1_genomic.gff annotation.gff
gunzip GCF_000022165.1_ASM2216v1_genomic.fna.gz && mv GCF_000022165.1_ASM2216v1_genomic.fna genome.fa

.fastq files

Next, we want to acquire the fastq files. They are available under the accession number PRJNA421559 on NCBI and have to be downloaded from the Sequence Read Archive (SRA). There are multiple ways of downloading files from SRA, as explained here.

As we already have conda installed, the easiest way is to install sra-tools:

conda create -n sra-tools -c bioconda -c conda-forge sra-tools pigz

This will create a conda environment containing sra-tools and pigz. Using these, we can simply pass the SRA run accessions and download the data:

conda activate sra-tools;
fasterq-dump SRR6359966; pigz -p 2 SRR6359966.fastq; mv SRR6359966.fastq.gz RIBO-WT-1.fastq.gz
fasterq-dump SRR6359967; pigz -p 2 SRR6359967.fastq; mv SRR6359967.fastq.gz RIBO-WT-2.fastq.gz
fasterq-dump SRR6359974; pigz -p 2 SRR6359974.fastq; mv SRR6359974.fastq.gz RNA-WT-1.fastq.gz
fasterq-dump SRR6359975; pigz -p 2 SRR6359975.fastq; mv SRR6359975.fastq.gz RNA-WT-2.fastq.gz
fasterq-dump SRR6359970; pigz -p 2 SRR6359970.fastq; mv SRR6359970.fastq.gz RIBO-csrA-1.fastq.gz
fasterq-dump SRR6359971; pigz -p 2 SRR6359971.fastq; mv SRR6359971.fastq.gz RIBO-csrA-2.fastq.gz
fasterq-dump SRR6359978; pigz -p 2 SRR6359978.fastq; mv SRR6359978.fastq.gz RNA-csrA-1.fastq.gz
fasterq-dump SRR6359979; pigz -p 2 SRR6359979.fastq; mv SRR6359979.fastq.gz RNA-csrA-2.fastq.gz
conda deactivate;
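Alternatively, the download, compression, and renaming can be expressed as a single loop. The following sketch is equivalent to the commands above; it requires bash 4 or newer for the associative array and uses the same run accessions and sample names:

conda activate sra-tools
# map each SRA run accession to its HRIBO sample name
declare -A samples=(
    [SRR6359966]=RIBO-WT-1   [SRR6359967]=RIBO-WT-2
    [SRR6359974]=RNA-WT-1    [SRR6359975]=RNA-WT-2
    [SRR6359970]=RIBO-csrA-1 [SRR6359971]=RIBO-csrA-2
    [SRR6359978]=RNA-csrA-1  [SRR6359979]=RNA-csrA-2
)
for run in "${!samples[@]}"; do
    fasterq-dump "$run"                      # download the run as .fastq
    pigz -p 2 "$run.fastq"                   # compress with 2 threads
    mv "$run.fastq.gz" "${samples[$run]}.fastq.gz"
done
conda deactivate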

Note

We will use two conditions with two replicates each. Four replicates are available per condition, but we use only two here, as this is just an example. When running your own analysis, always use as many replicates as possible.

Warning

Depending on your internet connection, this step might take some time. It is advisable to run this workflow on a cluster or cloud instance.

You should now have a compressed .fastq.gz file for each of the required libraries. We move them into a folder called fastq:

mkdir fastq;
mv *.fastq.gz fastq;

Sample sheet and configuration file

Finally, we will prepare the configuration file (config.yaml) and the sample sheet (samples.tsv). We start by copying the sample sheet template from HRIBO/templates/ into the HRIBO/ folder; the configuration file follows further below.

cp HRIBO/templates/samples.tsv HRIBO/

The sample file looks as follows:

method    condition    replicate    fastqFile
RIBO      A            1            fastq/RIBO-A-1.fastq.gz
RIBO      A            2            fastq/RIBO-A-2.fastq.gz
RIBO      B            1            fastq/RIBO-B-1.fastq.gz
RIBO      B            2            fastq/RIBO-B-2.fastq.gz
RNA       A            1            fastq/RNA-A-1.fastq.gz
RNA       A            2            fastq/RNA-A-2.fastq.gz
RNA       B            1            fastq/RNA-B-1.fastq.gz
RNA       B            2            fastq/RNA-B-2.fastq.gz

Note

When using your own data, use any editor (vi(m), gedit, nano, atom, …) to customize the sample sheet.

Warning

Please ensure that you do not replace any tab characters with spaces while editing this file.

We now rewrite this file to match the previously downloaded .fastq.gz files:

method    condition    replicate    fastqFile
RIBO      WT           1            fastq/RIBO-WT-1.fastq.gz
RIBO      WT           2            fastq/RIBO-WT-2.fastq.gz
RIBO      csrA         1            fastq/RIBO-csrA-1.fastq.gz
RIBO      csrA         2            fastq/RIBO-csrA-2.fastq.gz
RNA       WT           1            fastq/RNA-WT-1.fastq.gz
RNA       WT           2            fastq/RNA-WT-2.fastq.gz
RNA       csrA         1            fastq/RNA-csrA-1.fastq.gz
RNA       csrA         2            fastq/RNA-csrA-2.fastq.gz
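If you prefer generating the sample sheet over editing the template, the following sketch writes the same content to HRIBO/samples.tsv with guaranteed tab separators (avoiding the tab-versus-space issue from the warning above):

{
    printf "method\tcondition\treplicate\tfastqFile\n"
    printf "RIBO\tWT\t1\tfastq/RIBO-WT-1.fastq.gz\n"
    printf "RIBO\tWT\t2\tfastq/RIBO-WT-2.fastq.gz\n"
    printf "RIBO\tcsrA\t1\tfastq/RIBO-csrA-1.fastq.gz\n"
    printf "RIBO\tcsrA\t2\tfastq/RIBO-csrA-2.fastq.gz\n"
    printf "RNA\tWT\t1\tfastq/RNA-WT-1.fastq.gz\n"
    printf "RNA\tWT\t2\tfastq/RNA-WT-2.fastq.gz\n"
    printf "RNA\tcsrA\t1\tfastq/RNA-csrA-1.fastq.gz\n"
    printf "RNA\tcsrA\t2\tfastq/RNA-csrA-2.fastq.gz\n"
} > HRIBO/samples.tsv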

Next, we are going to set up the config.yaml.

cp HRIBO/templates/config.yaml HRIBO/

The config file can be used to easily change the parameters of HRIBO.

Note

For a detailed overview of the available options, please refer to our workflow-configuration.

In our small example, we only adjust the adapter sequence, which leads to the following change in the config.yaml file:

biologySettings:
    # Adapter sequence used
    adapter: "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
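To verify that the change was applied, you can search for the adapter key in the configuration file (the exact surrounding lines depend on your template version):

grep -n "adapter" HRIBO/config.yaml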

Running the workflow

Now that all the required files are prepared, we can start running the workflow, either locally or in a cluster environment.

Warning

If you have problems running DeepRibo, please refer to Activating DeepRibo.

Warning

Before you start using snakemake, remember to activate the environment first.

conda activate snakemake

Run the workflow locally

Use the following steps when you plan to execute the workflow on a single server, cloud instance or workstation.

Warning

Please be aware that some steps of the workflow require a lot of memory or time, depending on the size of your input data. To get a better idea about the memory consumption, you can have a look at the provided sge.yaml or torque.yaml files.

Navigate to the project folder containing your annotation and genome files, as well as the HRIBO folder. Start the workflow locally from this folder by running:

snakemake --use-conda --use-singularity --singularity-args " -c " --greediness 0 -s HRIBO/Snakefile --configfile HRIBO/config.yaml --directory ${PWD} -j 10 --latency-wait 60

This will start the workflow locally, with the following options:

  • --use-conda: instructs snakemake to install tool dependencies via conda.

  • --use-singularity: instructs snakemake to run jobs inside the provided containers.

  • --singularity-args: passes additional arguments to singularity; here, -c (contain) gives the container a minimal, isolated filesystem.

  • --greediness: sets the greediness of the job scheduler (a value between 0 and 1).

  • -s: specifies the Snakefile to be used.

  • --configfile: specifies the config file to be used.

  • --directory: specifies your current path.

  • -j: specifies the maximum number of cores snakemake is allowed to use.

  • --latency-wait: specifies how long (in seconds) snakemake waits for filesystem latencies before declaring a file to be missing.
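Before launching the full run, you can let snakemake print the planned jobs without executing anything by using its standard dry-run flag -n:

snakemake -n -s HRIBO/Snakefile --configfile HRIBO/config.yaml --directory ${PWD}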

Run Snakemake in a cluster environment

Use the following steps if you are executing the workflow via a queuing system. Edit the configuration file <cluster>.yaml according to your queuing system setup and cluster hardware.

Navigate to the project folder on your cluster system. Start the workflow from this folder by running (the following call shows the usage with Grid Engine):

snakemake --use-conda --use-singularity --singularity-args " -c " -s HRIBO/Snakefile --configfile HRIBO/config.yaml --directory ${PWD} -j 20 --cluster-config HRIBO/sge.yaml

Note

Ensure that you use an appropriate <cluster>.yaml for your cluster system. We provide one for SGE- and TORQUE-based systems.

Example: Run Snakemake in a cluster environment

Warning

Be advised that this is a specific example; the required options may change depending on your system.

We ran the tutorial workflow on a TORQUE cluster. To this end, we created a bash script torque.sh in our project folder.

vi torque.sh

Note

Please note that all arguments enclosed in <> have to be customized. This script will only work if your cluster uses the TORQUE queuing system.

We proceeded by writing the queuing script:

#!/bin/bash
#PBS -N <ProjectName>
#PBS -S /bin/bash
#PBS -q "long"
#PBS -d <PATH/ProjectFolder>
#PBS -l nodes=1:ppn=1
#PBS -o <PATH/ProjectFolder>
#PBS -j oe
cd <PATH/ProjectFolder>
source activate HRIBO
snakemake --latency-wait 600 --use-conda --use-singularity --singularity-args " -c " -s HRIBO/Snakefile --configfile HRIBO/config.yaml --directory ${PWD} -j 20 --cluster-config HRIBO/torque.yaml --cluster "qsub -N {cluster.jobname} -S /bin/bash -q {cluster.qname} -d <PATH/ProjectFolder> -l {cluster.resources} -o {cluster.logoutputdir} -j oe"

We then simply submitted this job to the cluster:

qsub torque.sh
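The status of the submitted job can then be monitored with the usual TORQUE commands, for example:

qstat -u $USER    # list all of your queued and running jobs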

Using any of the presented methods, this will run the workflow on the tutorial dataset and create the desired output files.

Results

The last step is to aggregate all results once the workflow has finished running. For this purpose, we provide a script called makereport.sh in the scripts folder of HRIBO.

bash HRIBO/scripts/makereport.sh <reportname>

Running this will create a folder collecting all results from the workflow's final output; it will additionally create a compressed archive in .zip format. The <reportname> will be extended by report_HRIBOX.X.X_dd-mm-yy.
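For example, with the arbitrary report name salmonella_example (the version and date in the resulting name depend on your run):

bash HRIBO/scripts/makereport.sh salmonella_example
# creates e.g. salmonella_example_report_HRIBO1.7.0_<dd-mm-yy> and a matching .zip archive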

Note

A detailed explanation of the result files can be found in the result section.

Note

The final result of this example workflow can be found here.

Warning

As many browsers have dropped support for viewing FTP files, you might have to use an FTP client instead.

Runtime

The total runtime of the extended workflow, using 12 cores of an AMD EPYC processor (with IBPB, 1996 MHz) and 64 GB RAM, was 5h51m14s.

This runtime includes the automatic download and installation of all dependencies by conda and singularity, which depends mainly on the available network bandwidth; in this case, it took about 12 minutes.

The runtime difference compared to the example workflow is explained by the additional libraries and tools used.

References

GruningDSjodin+17

Björn Grüning, Ryan Dale, Andreas Sjödin, Jillian Rowe, Brad A. Chapman, Christopher H. Tomkins-Tinch, Renan Valieris, and Johannes Köster. Bioconda: a sustainable and comprehensive software distribution for the life sciences. bioRxiv, 2017. URL: https://www.biorxiv.org/content/early/2017/10/27/207092, doi:10.1101/207092.

KosterR18

Johannes Köster and Sven Rahmann. Snakemake—a scalable bioinformatics workflow engine. Bioinformatics, 34(20):3600, 2018. URL: http://dx.doi.org/10.1093/bioinformatics/bty350, doi:10.1093/bioinformatics/bty350.

PGAR19

Anastasia H. Potts, Yinping Guo, Brian M. M. Ahmer, and Tony Romeo. Role of CsrA in stress responses and metabolism important for Salmonella virulence revealed by integrated transcriptomics. PLOS ONE, 14(1):1–30, 01 2019. URL: https://doi.org/10.1371/journal.pone.0211430, doi:10.1371/journal.pone.0211430.

ZAA+18

Daniel R Zerbino, Premanand Achuthan, Wasiu Akanni, M Ridwan Amode, Daniel Barrell, Jyothish Bhai, Konstantinos Billis, Carla Cummins, Astrid Gall, Carlos García Girón, Laurent Gil, Leo Gordon, Leanne Haggerty, Erin Haskell, Thibaut Hourlier, Osagie G Izuogu, Sophie H Janacek, Thomas Juettemann, Jimmy Kiang To, Matthew R Laird, Ilias Lavidas, Zhicheng Liu, Jane E Loveland, Thomas Maurel, William McLaren, Benjamin Moore, Jonathan Mudge, Daniel N Murphy, Victoria Newman, Michael Nuhn, Denye Ogeh, Chuang Kee Ong, Anne Parker, Mateus Patricio, Harpreet Singh Riat, Helen Schuilenburg, Dan Sheppard, Helen Sparrow, Kieron Taylor, Anja Thormann, Alessandro Vullo, Brandon Walts, Amonida Zadissa, Adam Frankish, Sarah E Hunt, Myrto Kostadima, Nicholas Langridge, Fergal J Martin, Matthieu Muffato, Emily Perry, Magali Ruffier, Dan M Staines, Stephen J Trevanion, Bronwen L Aken, Fiona Cunningham, Andrew Yates, and Paul Flicek. Ensembl 2018. Nucleic Acids Research, 46(D1):D754–D761, 2018. URL: http://dx.doi.org/10.1093/nar/gkx1098, doi:10.1093/nar/gkx1098.

Frequently asked questions

Q: When using singularity, I get ERROR  : Failed to set effective UID to 0.

Prior to installing singularity, change with_suid=1 to with_suid=0 in the mconfig file in the singularity source folder. This should not be necessary for newer versions of singularity.