
Overview

Jeffrey Barrick edited this page Jan 26, 2026 · 11 revisions

brefito is a wrapper to make running several related Snakemake pipelines easier!

Usage

usage: brefito <workflow> [sample1 sample2, ...]

If you run brefito with no options or with the --help option, it displays all of the valid <workflow> options. Later sections of the manual explain how to use specific workflows.

Providing any sample parameters will restrict running the workflow to just those samples.

brefito has several options that are pass-throughs to Snakemake.

  --config NAME=VALUE
  --resources NAME=VALUE
  --rerun-incomplete
  --unlock
  --keep-going
  --dry-run
  --notemp

These options let you globally change resources or configuration variables that are specific to a certain workflow. They also let you resume and control execution after failed or interrupted runs. They are explained in the Snakemake documentation.

Specifying Input Data Using a CSV file

Each line in the input file specifies a sample and a reference file, read file, or program option associated with that sample.

By default, brefito looks for and uses the file data.csv in the current directory, but you can ask it to use a different CSV settings file.

Here's an example of a data.csv file:

sample,type,setting
Ara-1_50000gen_11331,reference,https://raw.githubusercontent.com/barricklab/LTEE/7da91974eafac0c5a8f903ae57275795d4395af2/reference/REL606.gbk
Ara-1_50000gen_11331,illumina-R1,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR259/007/SRR2591047/SRR2591047_1.fastq.gz
Ara-1_50000gen_11331,illumina-R2,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR259/007/SRR2591047/SRR2591047_2.fastq.gz

Now, let's dig in on what goes in each column.

Sample

This is just the name of the sample associated with the specified input files and/or options. You will typically have multiple lines for each sample to specify all of its settings.

Currently, you can use the * wildcard as the sample name when you want a file to be associated with all samples, but these lines must come at the end of the file. The wildcard is only applied to samples that brefito already knows about when it reads that line.
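The grouping and wildcard behavior described above can be illustrated with a minimal Python sketch. This is not brefito's actual parser. The function name group_by_sample and the sample names and paths in EXAMPLE are hypothetical, chosen only to show why a * line placed before a sample does not reach it.

```python
import csv
import io

# Hypothetical data.csv contents; the "*" line comes before sampleC on purpose.
EXAMPLE = """\
sample,type,setting
sampleA,illumina,input/sampleA.fastq.gz
sampleB,illumina,input/sampleB.fastq.gz
*,reference,input/reference.gbk
sampleC,illumina,input/sampleC.fastq.gz
"""


def group_by_sample(csv_text):
    """Return {sample: [(type, setting), ...]}, expanding '*' rows."""
    samples = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        entry = (row["type"], row["setting"])
        if row["sample"] == "*":
            # Only samples already seen receive the wildcard entry,
            # mirroring the behavior described in the text.
            for entries in samples.values():
                entries.append(entry)
        else:
            samples.setdefault(row["sample"], []).append(entry)
    return samples


grouped = group_by_sample(EXAMPLE)
for name, entries in grouped.items():
    print(name, entries)
```

In this sketch sampleA and sampleB receive the reference, but sampleC does not, which is why wildcard lines belong at the end of the file.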

Type

The value in the type column tells brefito how to interpret the setting column.

Valid values for type are:

Type                             Description
reference                        Location of reference sequence file.
illumina or illumina-SE          Location of a file with single-end (unpaired) Illumina reads.
illumina-R1 or illumina-R2       Location of a file with read 1 or read 2, respectively, of a set of paired-end Illumina reads. If you use these types, a sample must have both an R1 and an R2 entry.
illumina-paired or illumina-PE   Locations of two files with read 1 and read 2, respectively. Put {1|2} in the filename at the position where the read 1 file has "1" and the read 2 file has "2". This is not required when downloading an SRA run using sra-tools.
nanopore                         Location of a file with Oxford Nanopore long sequencing reads.
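As an illustration of the paired-read rules in the table, here is a small Python sketch. It is an assumption about how the {1|2} placeholder could be expanded and how an R1/R2 check could look, not brefito's own code. The helper names expand_paired and has_matched_mates are hypothetical.

```python
PAIR_TOKEN = "{1|2}"


def expand_paired(setting):
    """Expand an illumina-paired setting into (read 1 path, read 2 path)."""
    if PAIR_TOKEN not in setting:
        raise ValueError(f"expected {PAIR_TOKEN} in paired setting: {setting}")
    # Only the placeholder is replaced; other digits in the name are untouched.
    return setting.replace(PAIR_TOKEN, "1"), setting.replace(PAIR_TOKEN, "2")


def has_matched_mates(types):
    """True if illumina-R1 and illumina-R2 are both present or both absent."""
    return ("illumina-R1" in types) == ("illumina-R2" in types)


r1, r2 = expand_paired("input/sample1_R{1|2}.fastq.gz")
print(r1)
print(r2)
print(has_matched_mates(["reference", "illumina-R1"]))
```

The last call returns False because the sample lists an R1 file without its R2 mate.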

Settings

File location settings are handled differently depending on their prefixes:

Prefix/Path                   Handling
http://<url>                  Downloaded using wget.
https://<url>                 Downloaded using wget.
sra://SRRXXXXXX               Downloaded from the NCBI SRA using sra-tools. You must provide the accession number of a run in the NCBI SRA. Ex: SRR2588645
ncbi-genome://GCAXXXXXX       Downloaded from NCBI GenBank using ncbi-datasets-cli. You must provide the accession number of a genome database record. Ex: GCA_004006375.1
ncbi-nt://XXXXXXXXX           Downloaded from NCBI GenBank using esearch. You must provide the accession number of a nucleotide database record. Ex: CP050855.1
ftp://<url>                   Downloaded using wget. Can only be used when anonymous login is allowed.
lftp@<bookmark>://<path>      Downloaded using lftp. Use this to download files from a private server if you have set up a bookmark that has access.
rclone@<bookmark>://<path>    Downloaded using rclone. Use this to download files from a private server if you have set up a bookmark that has access.
relative/path                 Local symlink created to the file.
/absolute/path                Local symlink created to the file.
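The prefix rules in the table can be summarized with a short Python sketch. It only classifies a setting string by its prefix and does not download anything. The function name classify_setting and the returned labels are hypothetical, and the bookmark pattern is an assumption about what a bookmark name may contain.

```python
import re

# Prefixes that the table says are fetched with a specific tool.
SIMPLE_PREFIXES = {
    "http://": "wget",
    "https://": "wget",
    "ftp://": "wget",
    "sra://": "sra-tools",
    "ncbi-genome://": "ncbi-datasets-cli",
    "ncbi-nt://": "esearch",
}

# lftp@<bookmark>://<path> or rclone@<bookmark>://<path>
BOOKMARK_PATTERN = re.compile(r"^(lftp|rclone)@[^:/]+://")


def classify_setting(setting):
    """Return a label for how a file location setting would be handled."""
    for prefix, tool in SIMPLE_PREFIXES.items():
        if setting.startswith(prefix):
            return tool
    match = BOOKMARK_PATTERN.match(setting)
    if match:
        return match.group(1)
    # Anything else is treated as a relative or absolute local path.
    return "symlink"


for example in [
    "sra://SRR2588645",
    "ncbi-nt://CP050855.1",
    "lftp@myserver://data/reads.fastq.gz",
    "input/reference.gbk",
]:
    print(example, "->", classify_setting(example))
```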

See the examples directory of the repository for data.csv files you can use for testing and as templates.

For the common use case in which all of the input files are already on your machine, you can put them in an input directory within your main run directory and then specify relative paths such as input/reference.gbff and input/sample1.fastq.gz in the data.csv file located in your run directory.

File names and types

Read files must be in gzipped FASTQ format and have filenames of the form *.fastq.gz.

Reference files can be in FASTA (*.fasta, *.fna, *.fa filenames), GenBank (*.genbank, *.gbk, *.gb filenames), or GFF3 (*.gff, *.gff3 filenames) format. Workflows that require FASTA files will generally automatically convert from the other formats.
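To check filenames against these rules before a run, a sketch like the following may help. It encodes only the extensions listed above and matches them exactly. The helper names is_read_file and reference_format are hypothetical, and brefito's own checks may differ.

```python
import os

# Extensions listed in the text for each reference format.
REFERENCE_EXTENSIONS = {
    ".fasta": "FASTA",
    ".fna": "FASTA",
    ".fa": "FASTA",
    ".genbank": "GenBank",
    ".gbk": "GenBank",
    ".gb": "GenBank",
    ".gff": "GFF3",
    ".gff3": "GFF3",
}


def is_read_file(filename):
    """True if the filename has the *.fastq.gz form required for reads."""
    return filename.endswith(".fastq.gz")


def reference_format(filename):
    """Return 'FASTA', 'GenBank', or 'GFF3', or None if unrecognized."""
    extension = os.path.splitext(filename)[1]
    return REFERENCE_EXTENSIONS.get(extension)


print(is_read_file("SRR2591047_1.fastq.gz"))
print(reference_format("REL606.gbk"))
```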

Using a different directory of reference sequences

Many workflows that operate, by default, on the reference sequences in the references directory can be used on a different directory of reference files. This is mainly useful when re-running analyses on assemblies or mutants generated by other workflows to check them for accuracy.

In this case you can use the normal command with another dash and the desired directory of reference files appended. For example: predict-mutations-breseq-assemblies or predict-mutations-breseq-mutants, coverage-plots-breseq-assemblies or coverage-plots-breseq-mutants, align-reads-assemblies, check-soft-clipping-assemblies, etc.

These commands also work on input directories other than assemblies or mutants. Just replace that part of the command with the name of your input folder.

For example, you can "chain" execution of annotate-genomes-assemblies and predict-mutations-breseq-annotated-assemblies to use annotated assemblies as the reference sequences for running breseq.
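The naming convention above (base workflow, a dash, then the directory name) can be sketched in Python as follows. The helper names workflow_for_directory and brefito_command are hypothetical. The sketch only builds the argument list following the usage line brefito <workflow> [samples] and does not run anything.

```python
def workflow_for_directory(base_workflow, reference_dir):
    """Append a reference directory name to a base workflow name."""
    return f"{base_workflow}-{reference_dir}"


def brefito_command(base_workflow, reference_dir, samples=()):
    """Build an argument list of the form: brefito <workflow> [samples]."""
    return ["brefito", workflow_for_directory(base_workflow, reference_dir), *samples]


# Names taken from the examples in the text.
print(workflow_for_directory("predict-mutations-breseq", "assemblies"))
print(workflow_for_directory("predict-mutations-breseq", "annotated-assemblies"))
print(brefito_command("coverage-plots-breseq", "mutants", ["sample1"]))
```

An argument list like this could be passed to subprocess.run once you have confirmed the workflow name against the output of brefito --help.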
