BactSeq is a Nextflow pipeline for performing bacterial RNA-Seq analysis. The pipeline supports both BWA alignment and Kallisto pseudo-alignment approaches, with optional differential expression and functional enrichment analysis.
# Run with test data
nextflow run BactSeq -profile test,docker
# Run with your own data
nextflow run BactSeq \
--data_dir /path/to/fastq/files \
--sample_file samples.tsv \
--ref_genome genome.fasta \
--ref_ann genome.gff3 \
-profile docker
The pipeline performs the following steps:
- Quality Control & Trimming
  - Trim adaptors from reads (Trim Galore!)
  - Read QC (FastQC)
- Read Alignment (choose one)
- Expression Quantification & Normalization
- Exploratory Analysis
  - Principal component analysis (PCA) of normalized expression values
  - Sample clustering and visualization
- Differential Expression (DESeq2)
  - Pairwise comparisons based on provided contrasts
  - Volcano plots and summary statistics
- Functional Enrichment (topGO)
  - GO term enrichment analysis of differentially expressed genes
  - Enrichment plots and gene lists
You will need to install Nextflow (version 21.10.3+).
# Install Nextflow
curl -s https://get.nextflow.io | bash
# Or via conda
conda install -c bioconda nextflow
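After installing, it can be useful to confirm that the installed Nextflow meets the 21.10.3 minimum. The following is a minimal sketch, not part of BactSeq: the `version_ge` helper is a hypothetical name, and it assumes a `sort` that supports `-V` (version sort), as GNU coreutils does.

```shell
#!/usr/bin/env bash
# Minimum Nextflow version stated in this README.
required="21.10.3"

# version_ge A B: succeeds when version A >= version B.
# Hypothetical helper; relies on `sort -V` (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Take the first dotted version number printed by `nextflow -version`.
if command -v nextflow >/dev/null 2>&1; then
    installed="$(nextflow -version 2>/dev/null | grep -Eo '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1)"
    if version_ge "$installed" "$required"; then
        echo "Nextflow $installed is recent enough"
    else
        echo "Nextflow $installed is older than $required" >&2
    fi
else
    echo "nextflow not found on PATH" >&2
fi
```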
nextflow run BactSeq \
--data_dir /path/to/fastq/files \
--sample_file samples.tsv \
--ref_genome genome.fasta \
--ref_ann genome.gff3 \
-profile docker
Parameter | Description |
---|---|
--data_dir | Path to directory containing FastQ files |
--sample_file | Path to file containing sample information |
--ref_genome | Path to FASTA file containing reference genome sequence (BWA) or coding gene sequences (Kallisto) |
-profile | Configuration profile to use: conda, docker, singularity |
Parameter | Default | Description |
---|---|---|
--aligner | bwa | Aligner to use: bwa, kallisto |
--ref_ann | - | Path to GFF file containing reference genome annotation (required for BWA) |
--contrast_file | - | Path to TSV file containing contrasts for differential expression |
--func_file | - | Path to functional annotation file for enrichment analysis |
--strandedness | reverse | Data strandedness: unstranded, forward, reverse |
--p_thresh | 0.05 | Adjusted p-value threshold for differential expression |
--l2fc_thresh | 1 | Absolute log2(FoldChange) threshold for differential expression |
--fragment_len | 150 | Average fragment length for Kallisto (single-end only) |
--fragment_sd | 20 | Fragment length standard deviation for Kallisto (single-end only) |
--skip_trimming | false | Skip adapter trimming |
--outdir | ./results | Output directory for results |
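To illustrate how the two thresholds combine, the sketch below filters a differential expression table by adjusted p-value and absolute log2 fold change. It is an illustration only: the column names `log2FoldChange` and `padj` are DESeq2's usual result names and are assumed here, the toy rows are invented for the demonstration, and whether the pipeline applies strict or inclusive comparisons is not stated in this README. `filter_dge` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Toy table with invented values; the real DGE_*.tsv layout may differ.
dge="$(mktemp)"
{
    printf 'gene\tlog2FoldChange\tpadj\n'
    printf 'geneA\t2.5\t0.001\n'
    printf 'geneB\t-1.8\t0.03\n'
    printf 'geneC\t0.4\t0.0001\n'
    printf 'geneD\t3.0\t0.2\n'
    printf 'geneE\t-2.1\tNA\n'
} > "$dge"

# Defaults taken from the parameter table.
p_thresh=0.05
l2fc_thresh=1

# filter_dge P L FILE: keep the header plus rows with padj < P and |log2FC| >= L.
# Columns are located by header name, so their order does not matter.
filter_dge() {
    awk -F'\t' -v p="$1" -v l="$2" '
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i
            print
            next
        }
        {
            fc = $(col["log2FoldChange"])
            padj = $(col["padj"])
            if (fc == "NA" || padj == "NA") next   # skip genes without a test result
            if (fc + 0 < 0) fc = -fc               # absolute value
            if (padj + 0 < p + 0 && fc + 0 >= l + 0) print
        }
    ' "$3"
}

filter_dge "$p_thresh" "$l2fc_thresh" "$dge"
```

With the toy rows, only `geneA` and `geneB` pass both thresholds: `geneC` has too small a fold change, `geneD` is not significant, and `geneE` has no adjusted p-value.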
Parameter | Description |
---|---|
-resume | Resume a previous run |
-name | Name for the pipeline run |
-work-dir | Work directory for temporary files |
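As a sketch of how the pipeline parameters (double dash) and Nextflow options (single dash) combine in one invocation, the snippet below assembles a BWA run with differential expression and enrichment enabled, then prints the command for review. The file names follow the examples in this README, and the `--p_thresh 0.01` value is an arbitrary illustration; the final execution line is left commented out.

```shell
#!/usr/bin/env bash
# Assemble the arguments as positional parameters so each one stays a separate word.
set -- run BactSeq \
    --data_dir /path/to/fastq/files \
    --sample_file samples.tsv \
    --ref_genome genome.fasta \
    --ref_ann genome.gff3 \
    --contrast_file contrasts.tsv \
    --func_file functional_annotation.csv \
    --p_thresh 0.01 \
    --outdir ./results \
    -profile docker \
    -resume

# Print the full command for review before running it.
cmd="nextflow $*"
echo "$cmd"

# Uncomment to execute:
# nextflow "$@"
```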
Note: See the test data folder for example inputs.
- Sample Sheet (samples.tsv)
  - TSV file containing sample information with the following columns:
    - sample: Sample ID
    - file1: Name of R1 FastQ file
    - file2: Name of R2 FastQ file (leave blank for single-end)
    - group: Grouping factor for differential expression
    - rep_no: Replicate number
  - Example:

    sample | file1 | file2 | group | rep_no |
    ---|---|---|---|---|
    AS_1 | SRX1607051_T1.fastq.gz | | Artificial_Sputum | 1 |
    AS_2 | SRX1607052_T1.fastq.gz | | Artificial_Sputum | 2 |
    MB_1 | SRX1607054_T1.fastq.gz | | Middlebrook | 1 |
    MB_2 | SRX1607055_T1.fastq.gz | | Middlebrook | 2 |
- Reference Genome (genome.fasta)
  - FASTA file containing the reference genome sequence
  - Can be downloaded from NCBI RefSeq
- Gene Annotation (genome.gff3) - Required for BWA only
  - GFF3 file containing gene annotations
  - Can be downloaded from NCBI RefSeq
- Contrasts Table (contrasts.tsv) - For differential expression
  - TSV file with 2 columns defining comparisons to perform
  - Column names: Condition1, Condition2
  - Example:

    Condition1 | Condition2 |
    ---|---|
    Artificial_Sputum | Middlebrook |
    Artificial_Sputum | Kanamycin |
    Middlebrook | Kanamycin |
- Functional Annotation File (functional_annotation.csv) - For enrichment analysis
  - CSV file containing GO terms for genes
  - Column 1: Gene ID (must match locus_tag in GFF)
  - Column 2: GO terms (comma-separated)
  - Example:

        Gene,GO_terms
        MAB_0001,"GO:0005737,GO:0008150"
        MAB_0002,"GO:0016020,GO:0006810"
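Because the contrasts table refers to values of the sample sheet's group column, a quick consistency check before launching a run can catch mismatches early. The sketch below is not part of the pipeline: it writes the example sample sheet and two of the example contrasts to a temporary directory, then checks the field count of every sample row and reports any contrast condition that has no matching group. The helper names `bad_rows` and `missing_groups` are hypothetical.

```shell
#!/usr/bin/env bash
workdir="$(mktemp -d)"
samples="$workdir/samples.tsv"
contrasts="$workdir/contrasts.tsv"

# Example sample sheet; file2 is left empty because the reads are single-end.
{
    printf 'sample\tfile1\tfile2\tgroup\trep_no\n'
    printf 'AS_1\tSRX1607051_T1.fastq.gz\t\tArtificial_Sputum\t1\n'
    printf 'AS_2\tSRX1607052_T1.fastq.gz\t\tArtificial_Sputum\t2\n'
    printf 'MB_1\tSRX1607054_T1.fastq.gz\t\tMiddlebrook\t1\n'
    printf 'MB_2\tSRX1607055_T1.fastq.gz\t\tMiddlebrook\t2\n'
} > "$samples"

# Two of the example contrasts.
{
    printf 'Condition1\tCondition2\n'
    printf 'Artificial_Sputum\tMiddlebrook\n'
    printf 'Artificial_Sputum\tKanamycin\n'
} > "$contrasts"

# bad_rows FILE: print line numbers that do not have exactly 5 tab-separated fields.
bad_rows() {
    awk -F'\t' 'NF != 5 { print NR }' "$1"
}

# missing_groups SAMPLES CONTRASTS: print contrast conditions absent from column 4 (group).
missing_groups() {
    awk -F'\t' '
        NR == FNR { if (FNR > 1) seen[$4] = 1; next }   # first file: collect groups
        FNR > 1 {                                       # second file: check both conditions
            for (i = 1; i <= 2; i++) if (!($i in seen)) print $i
        }
    ' "$1" "$2" | sort -u
}

echo "Malformed sample rows: $(bad_rows "$samples" | tr '\n' ' ')"
echo "Conditions without samples: $(missing_groups "$samples" "$contrasts" | tr '\n' ' ')"
```

The four-sample excerpt contains only the Artificial_Sputum and Middlebrook groups, so the check flags Kanamycin; a complete sample sheet would need rows for that group before those contrasts could be evaluated.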
The pipeline generates the following output directories:
Directory | Contents |
---|---|
trim_galore/ | Adapter-trimmed FastQ files and FastQC reports |
read_counts/ | Raw and normalized gene count matrices |
PCA_samples/ | Principal component analysis plots and coordinates |
diff_expr/ | Differential expression results and volcano plots |
func_enrich/ | Functional enrichment analysis results |
pipeline_info/ | Pipeline execution reports and metadata |
Count Matrices:
- gene_counts.tsv: Raw read counts per gene
- deseq_counts.tsv: DESeq2 normalized counts (log2 transformed)
- cpm_counts.tsv: Counts per million (CPM) normalized
- rpkm_counts.tsv: RPKM normalized counts

Analysis Results:
- DGE_*.tsv: Differential expression results for each contrast
- volcano_plot_*.png: Volcano plots for each contrast
- pca_grouped.png: PCA plot colored by sample groups
- *_enrich.tsv: GO enrichment results (if functional annotation provided)
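For readers unfamiliar with CPM, the sketch below shows the textbook calculation: each raw count is divided by its sample's total count and multiplied by one million. It is an illustration under assumptions, not the pipeline's own code: the table layout (gene ID in the first column, one count column per sample) and the toy values are invented, and the values in cpm_counts.tsv may be computed differently. `cpm` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Toy count matrix with invented values; the real gene_counts.tsv layout may differ.
counts="$(mktemp)"
{
    printf 'gene\tS1\tS2\n'
    printf 'geneA\t10\t30\n'
    printf 'geneB\t30\t50\n'
    printf 'geneC\t60\t20\n'
} > "$counts"

# cpm FILE: read the file twice, first to sum each sample column,
# then to rescale every count to counts per million.
cpm() {
    awk -F'\t' -v OFS='\t' '
        NR == FNR {                       # pass 1: column totals
            if (FNR > 1) for (i = 2; i <= NF; i++) total[i] += $i
            next
        }
        FNR == 1 { print; next }          # pass 2: keep the header as-is
        {
            for (i = 2; i <= NF; i++)
                $i = (total[i] > 0) ? sprintf("%.2f", $i / total[i] * 1e6) : "0.00"
            print
        }
    ' "$1" "$1"
}

cpm "$counts"
```

RPKM extends the same idea by also dividing by gene length in kilobases, which requires gene lengths from the annotation and is omitted here.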
If you use BactSeq in your research, please cite:
@software{bactseq,
author = {Adam Dinan},
title = {BactSeq: A Nextflow pipeline for bacterial RNA-seq analysis},
url = {https://github.com/adamd3/BactSeq},
version = {dev},
year = {2024}
}
- 2025-07-19: Docker image is now compatible with Nextflow 25.04.0+
- 2023-10-06: Docker and Singularity support restored
- 2023-09-28: Added example contrasts table and functional enrichment file