A workflow for quantifying bacterial flagellins in human gut microbiome sequencing data, annotating their human TLR5 interaction phenotypes, and performing statistical analysis and visualization.
```bash
git clone https://github.com/leylabmpi/human-fla-profiling.git
cd ./human-fla-profiling/snakemake/bin/
git submodule add https://github.com/leylabmpi/ll_pipeline_utils.git
git submodule update --remote --init --recursive
cd ../
conda env create -f snakemake8_min.yaml
```
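To verify the installation, activate the environment and check the Snakemake version (the environment name is defined by the `name:` field inside `snakemake8_min.yaml` and is assumed here to be `snakemake8_min`; adjust if it differs):

```bash
# Hypothetical env name; check the `name:` field in snakemake8_min.yaml.
conda activate snakemake8_min
snakemake --version  # should report Snakemake 8.x
```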
If needed, download USEARCH and place the binary at the following path:
```
bin/scripts/usearch/usearch11.0.667_i86linux32
```
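For example (a sketch; USEARCH binaries are distributed via https://www.drive5.com/usearch/, and the download step itself is not shown):

```bash
# After obtaining usearch11.0.667_i86linux32 (see https://www.drive5.com/usearch/):
mkdir -p bin/scripts/usearch/
mv usearch11.0.667_i86linux32 bin/scripts/usearch/
chmod +x bin/scripts/usearch/usearch11.0.667_i86linux32
```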
An example of running FlaPro is provided as a bash script:
```bash
./runLLHFP.sh
```
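If you prefer to call Snakemake directly instead of the wrapper script, a hypothetical invocation might look like the following (the Snakefile location, job count, and config file name are assumptions; check `runLLHFP.sh` for the exact command the pipeline uses):

```bash
# Hypothetical direct invocation; see runLLHFP.sh for the authoritative command.
# Assumes the Snakefile lives under snakemake/ and the standard Snakemake CLI.
snakemake --snakefile snakemake/Snakefile --configfile config.yaml --jobs 8
```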
User-provided
- Metagenomic or Metatranscriptomic samples - reads in FASTQ or FASTA format (optionally compressed)
Reference data
- Taxonomic annotation of flagellins
- Functional annotation of flagellins
- Marker sequences from human gut microbiome-derived flagellins
The config.yaml file is organized into several main sections:
- Input data
- Output directory
- Workflow control settings
- Parameters
Specify the path to a sample file containing your metagenomic or metatranscriptomic sample-to-read-file mappings:
```yaml
# format example file:
samples_file: datatest/input_MTG4_nano.txt
```
The sample sheet file should include (a hypothetical example follows this list):
- Sample ID
- Relative or absolute path to the forward reads (R1)
- Relative or absolute path to the reverse reads (R2), for paired-end sequencing
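For instance, a sample sheet might look like this (the column names and tab-delimited layout are assumptions; see the example file referenced above, `datatest/input_MTG4_nano.txt`, for the exact format):

```
Sample	Read1	Read2
sampleA	reads/sampleA_R1.fastq.gz	reads/sampleA_R2.fastq.gz
sampleB	reads/sampleB_R1.fastq.gz	reads/sampleB_R2.fastq.gz
```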
Define the root folder corresponding to the relative paths above:
```yaml
read_file_path: None # when the paths are absolute
# read_file_path: /path/to/your/reads/ # when the paths are relative
```
Provide the destination directory for the primary analysis output, for example:
```yaml
output_dir: out/test_ibd_MTG4_nano_test/
```
Specify the location for temporary files (ensure sufficient space for large datasets):
```yaml
tmp_dir: tmp/ # Adjust based on your system's temp directory
```
Enable/disable major pipeline components:
```yaml
run_pipeline_steps:
  alpha_div: True # or False; enable alpha diversity calculations
```
Configure the Snakemake workflow execution:
```yaml
pipeline:
  snakemake_folder: ./ # Path to Snakemake files
  export_conda: True # Export conda environment
  name: LLHFP # Pipeline name identifier
  #just_read1: True # uncomment when only R1 reads are available
```
```yaml
params:
  shortbred_quantify:
    aligner: diamond # Options: diamond, usearch
    # usearch_path: bin/scripts/usearch/usearch11.0.667_i86linux32 # uncomment if using USEARCH
    markers: ref/Curated_fla_markers_4_04-12-24.fasta # Flagellin marker database
    pct_length: 0.3 # Minimum alignment length (30%)
```
Aligner options:
- `diamond`: faster, fewer false positives; recommended for large datasets
- `usearch`: more sensitive; the freely available version might not work with large datasets
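For example, to switch to USEARCH, the relevant part of the config would change as follows (nesting as assumed above, using the `usearch_path` shown earlier):

```yaml
params:
  shortbred_quantify:
    aligner: usearch
    usearch_path: bin/scripts/usearch/usearch11.0.667_i86linux32
```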
```yaml
merge_realcounts:
  merge_script: snakemake/llhfp_demo/bin/scripts/merge_realcounts.R
```
See `config_dmnd.yaml` (for DIAMOND), `config_usearch.yaml` (for USEARCH), and `config.yaml`.
Example output structure:

- `real_counts/`
  - `SRR5935740.txt` - per-sample output with Family (Cluster), Hits, …
  - `merged_realcounts.txt` - merged output for all samples by real counts
  - `psq.RData` - `psq` object with taxonomy and abundance table
- `diversity/`
  - `alpha_div.txt` - calculated alpha diversity tables
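A minimal sketch for inspecting these outputs in R (assumptions: the object saved in `psq.RData` is named `psq` and is a phyloseq-class object, the tables are tab-delimited, and the phyloseq package is installed; adjust the paths to your `output_dir`):

```r
# Minimal sketch; assumes tab-delimited tables and that psq.RData
# stores a phyloseq object named `psq`.
library(phyloseq)

out_dir <- "out/test_ibd_MTG4_nano_test"              # your output_dir
load(file.path(out_dir, "real_counts", "psq.RData"))  # loads `psq`
psq                                                   # print the phyloseq summary

counts <- read.delim(file.path(out_dir, "real_counts", "merged_realcounts.txt"))
head(counts)

alpha <- read.delim(file.path(out_dir, "diversity", "alpha_div.txt"))
head(alpha)
```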
After the primary analysis has successfully produced the annotated flagellin relative abundance tables, you can add your sample metadata and perform exploratory analysis with the secondary analysis code, provided as R Jupyter notebooks (.ipynb files).
To set up the environment for the secondary analysis, you will need:
- Conda (https://docs.conda.io/en/latest/)
- Visual Studio Code (or an alternative integrated development environment that supports running R notebooks via a defined Conda environment)
Create a specific Conda environment using the YAML file provided in the envs/ folder:
```bash
conda env create -f r_433_nb.yaml
conda activate r_433_nb
```
Then install the following non-Conda packages into it:
```r
devtools::install_github("tpq/balance")
devtools::install_github("malucalle/selbal")
devtools::install_bitbucket("knomics/nearestbalance")
devtools::install_github("leylabmpi/LeyLabRMisc")
```
(These instructions can also be run using the provided envs/...postBuild.sh script.)
Open the notebook in VS Code, select the R Jupyter kernel of the installed environment, and run the notebook. Further information on how to generate your own notebooks that can easily be synchronized across multiple projects is provided in a separate readme file.
Note: while the main input files for the secondary analysis are generated during the primary analysis, you have to prepare additional files with the number of reads per sample (sample coverage), for example using the `scripts/count_reads.sh` script.
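If you need to produce such counts manually, a rough stand-in might look like this (a sketch, assuming gzipped FASTQ files with four lines per record; `scripts/count_reads.sh` remains the supported route, and its exact output format should be matched):

```bash
# Rough stand-in for per-sample read counting; scripts/count_reads.sh
# is the supported route. Assumes gzipped FASTQ, 4 lines per record.
for f in /path/to/your/reads/*_R1.fastq.gz; do
  n=$(( $(zcat "$f" | wc -l) / 4 ))
  printf '%s\t%d\n' "$(basename "$f")" "$n"
done
```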