Workflow configuration

HRIBO allows different customization to be able to handle different types of input data and customize the analysis. On this page we explain the different options that can be set to easily customize the workflow.

Default workflow

We provide a default config.yaml with generally well-defined default values in the template folder of the HRIBO GitHub repository.

Biology Settings

This section contains general settings for the workflow.

adapter sequence

The adapter section is split into single end and paired end adapters. If you have single-end data, you only need to specify the single-end adatpers.

adapterS3: "" #3prime
adapterS5: "" #5prime

If you have mixed or paired-end data, you need to specify the appropriate paired-end adapters.

adapterP3R1: "" #3prime mate1
adapterP5R1: "" #5prime mate1

adapterP3R2: "" #3prime mate2
adapterP5R2: "" #5prime mate2

The adapter sequences will be removed using cutadapt. It is important to add them if you know that your data was not trimmed before.

genome file

genome: "genome.fa"

The path to the genome file in fasta format. Only fasta format is supported. Make sure that the genome file has the same identifiers as the reference annotation file.

annotation file

annotation: "annotation.gff"

The path to the annotation file in gff format. Only gff format is supported. Make sure that the annotation file has the same identifiers as the reference genome file.

sample sheet

samples: "HRIBO/samples.tsv"

The path to the sample sheet (example provided in the templates folder). The sample sheet describes the relation between the different samples used for the experiment.

alternative start codons

alternativestartcodons: ["GTG","TTG"]

A list of alternative start codons that will be used to predict ORFs. The default is ["GTG","TTG"].

Differential Expression Settings

differentialexpression

differentialexpression: "on"

This option allows you to turn on or off differential expression analysis. If you do not have multiple conditions defined in the sample sheet and differential expression is activated, you will receive an error message. Options are “on / off”.

features

features: ["CDS", "sRNA"]

This option allows you to specify which features should be used for differential expression analysis. Any feature that appears in your reference annotation is allowed. We suggest using CDS and sRNA features.

contrasts

contrasts: ["treated1-untreated1", "treated2-untreated2"]

This option allows you to specify which contrasts should be used for differential expression analysis. The order will affect the directionality of the log2FC values in the output files.

adjusted pvalue cutoff

padjCutoff: 0.05

This option allows you to specify the adjusted pvalue cutoff for differential expression analysis. The default is 0.05. All results will be present in the output, this will is soley used to add additional pre-filtered list in the output excel tables.

log2 fold change cutoff

log2fcCutoff: 1.0

This option allows you to specify the log2 fold change cutoff for differential expression analysis. The default is 1.0. All results will be present in the output, this will is soley used to add additional pre-filtered list in the output excel tables. Only positive values are allowed. Respective negative values to determine down-regulation will be generated automatically by multiplying by -1.

ORF predictions

deepribo: "on"

Activating DeepRibo predictions will give you a different file with ORF predictions. By experience, the top DeepRibo results tend to be better than those of reparation. For archea, where reparation performs very poorly, DeepRibo is the preferred option.

Warning

DeepRibo cannot cope with genomes containing special IUPAC symbols, ensure that your genome file contains only A, G, C, T, N symbols.

Read statistics Settings

readLengths: "10-80"

This option allows you to specify the read lengths that should be used for read statistics analysis. The default is 10-80. We allow combinations of intervals and single values, e.g. 10-40,55,70.

Metagene Profiling Settings

There exist multiple options to customize the metagene profiling analysis.

Positions outside of the ORF

positionsOutsideORF: 50

This option allows you to specify the number of positions outside of the ORF that should be considered for metagene profiling analysis. The default is 50.

Positions inside of the ORF

positionsInsideORF: 150

This option allows you to specify the number of positions inside of the ORF that should be considered for metagene profiling analysis. The default is 150. Genes that are shorter than the specified number of positions inside of the ORF will be ignored.

Filtering Methods

filteringMethods: ["overlap", "length", "rpkm"]

This option allows you to specify which filtering methods should be used for metagene profiling analysis. The default is ["overlap", "length", "rpkm"].

These methods are used to filter out genes that would cause artifacts in the metagene profiling analysis.

The overlap method filters out genes that overlap with other genes.
The length method filters out genes that are shorter than the threshold.
The rpkm method filters out genes below the rpkm threshold.

Neighboring Genes Distance

neighboringGenesDistance: 50

This option allows you to specify the distance to neighboring genes that will be considered to filter out overlapping genes. The default is 50. This means that genes that are closer than 50nt to another gene will be filtered out.

RPKM Threshold

rpkmThreshold: 10.0

This option allows you to specify the RPKM threshold that will be used to filter out genes below the threshold. The default is 10.0.

Length Cutoff

lengthCutoff: 50

This option allows you to specify the length cutoff that will be used to filter out genes below the threshold. Be aware that genes that are smaller than the positionsInsideORF will be filtered out anyway.

Mapping Methods

mappingMethods: ["fiveprime", "threeprime", "centered", "global"]

This option allows you to specify which mapping methods should be used for metagene profiling analysis. The default is ["fiveprime", "threeprime"]. Multiple mappings can be used and result in multiple output files/folders.

The fiveprime method will use the 5’ end of the reads to count the number of reads per position.
The threeprime method will use the 3’ end of the reads to count the number of reads per position.
The centered method will use the center nucleotides of each read to count the number of reads per position.
The global method will use the entire read to count the number of reads per position.

Read Lengths:

readLengths: "25-34"

This option allows you to specify the read lengths that should be used for metagene profiling analysis. The default is 25-34. We allow combinations of intervals and single values, e.g. 25-34,40,50.

Normalization Methods

normalizationMethods: ["raw", "cpm"]

This option allows you to specify which normalization methods should be used for metagene profiling analysis. The default is ["raw", "cpm"].

Output Formats

outputFormats: ["svg", "pdf", "png", "jpg", "interactive"]

This option allows you to specify which output formats should be used for metagene profiling analysis. The default is ["svg", "interactive"]. The interactive option will generate an interactive html file that can be used to explore the metagene profiling results.

PlotlyJS

This option is only used when the interactive output format is selected.

includePlotlyJS: "integrated"

This option allows you to specify how the plotly.js library should be included in the interactive html file. The default is integrated.

The integrated option will include the plotly.js library in the html file. The file will be larger, but can be used offline. (+3.7mb/file)
The online option will include a link to the plotly.js library in the html file.
The local option will include a link to a local plotly.js library in the html file. This option requires the plotly.js library to be available in the same folder as the html file.

Colors

colorList: []

This option allows you to specify a list of colors that should be used for the metagene profiling analysis. The default is []. Per default a list of colorblind-friendly colors will be used. If you want to change the colors, make sure that the number of colors matches the number of samples. We also suggest to use hex colors, e.g. #ff0000.

workflow

workflow: "full"

This option allows you to specify which workflow should be used:

The full workflow will the full HRIBO workflow.
The preprocessing workflow will only run the preprocessing steps up to the generation of the .bam files.

Paired-end support

Paired-end files can now be specified in the samples.tsv file. It is possible to mix paired-end and single-end data in the same analysis.

As most downstream tools do not support the alignment flags for paired-end data, the workflow will transform the paired-end data to single-end data after the trimming step.

This is an example sample.tsv file, which is also available in the HRIBO templates folder.

method	condition	replicate	fastqFile	fastqFile2
RIBO	A	1	fastq/RIBO-A-1_R1.fastq.gz	fastq/RIBO-A-1_R2.fastq.gz
RIBO	A	2	fastq/RIBO-A-2_R1.fastq.gz	fastq/RIBO-A-2_R2.fastq.gz
RIBO	B	1	fastq/RIBO-B-1_R1.fastq.gz	fastq/RIBO-B-1_R2.fastq.gz
RIBO	B	2	fastq/RIBO-B-2_R1.fastq.gz	fastq/RIBO-B-2_R2.fastq.gz
RNA	A	1	fastq/RNA-A-1_R1.fastq.gz	fastq/RNA-A-1_R2.fastq.gz
RNA	A	2	fastq/RNA-A-2_R1.fastq.gz	fastq/RNA-A-2_R2.fastq.gz
RNA	B	1	fastq/RNA-B-1_R1.fastq.gz	fastq/RNA-A-1_R2.fastq.gz
RNA	B	2	fastq/RNA-B-2_R1.fastq.gz	fastq/RNA-A-1_R2.fastq.gz