SNP2Cluster is a K-means clustering-based method that identifies transmission clusters by integrating genomic and epidemiological data.
Four types of cluster analyses are possible using SNP2Cluster:
- Core SNP cluster analysis - using SNP data only
- Transmission cluster analysis, which integrates genomic and epi data:
  - Facility level analysis - STs and clusters visualized strictly per facility
  - Area/location level analysis - STs and facilities grouped by location
  - Community level analysis - facilities grouped by ST
To perform enhanced transmission cluster analysis, paths to the following information/data sources should be provided in a configuration file:
- Epidemiological data file including collection dates and facility information
- Pairwise single nucleotide polymorphism (SNP) distance matrix
- Multi-locus sequence type (MLST) profiles
- Output directory
Closely related isolates are first grouped into clusters using an enhanced K-means clustering method that uses silhouette scores and 500 bootstraps to determine the optimal K, which defines the maximum number of clusters.
[Figure: K-means clusters]
Custom R functions are then applied to each pre-grouped cluster to generate SNP cluster chains based on the provided SNP cut-off: isolates are chained into one SNP cluster, and the chain resets to start the next cluster whenever the SNP threshold is exceeded. Each isolate is assigned to only one cluster.
[Figure: Core SNP cluster analysis]
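The two steps described above can be sketched in R roughly as follows. This is a minimal illustration under stated assumptions, not SNP2Cluster's own code: `pick_k` and `chain_clusters` are hypothetical names, K-means is run here on classical MDS coordinates of the SNP distances, the 500 bootstraps are omitted, and the chain is assumed to compare each isolate with the previous one in the order supplied.

```r
library(cluster)  # silhouette(); a recommended package bundled with most R installs

# Hypothetical helper: pick the K with the highest mean silhouette width.
# Assumption: K-means runs on 2-D classical MDS coordinates of the SNP distances.
pick_k <- function(snp_dist, k_max = 5, seed = 1) {
  set.seed(seed)
  d <- as.dist(snp_dist)
  coords <- cmdscale(d, k = 2)
  ks <- 2:k_max
  widths <- sapply(ks, function(k) {
    km <- kmeans(coords, centers = k, nstart = 25)
    mean(silhouette(km$cluster, d)[, "sil_width"])
  })
  ks[which.max(widths)]
}

# Hypothetical helper: walk isolates in the given order and start a new
# SNP cluster whenever the distance to the previous isolate exceeds snpco.
chain_clusters <- function(snp_dist, ids, snpco = 20) {
  if (length(ids) == 0) return(integer(0))
  cluster <- integer(length(ids))
  current <- 1L
  cluster[1] <- current
  for (i in seq_along(ids)[-1]) {
    if (snp_dist[ids[i - 1], ids[i]] > snpco) current <- current + 1L
    cluster[i] <- current
  }
  setNames(cluster, ids)
}

# Toy SNP distance matrix: three well-separated groups of three isolates
xy <- cbind(x = c(0, 3, 1, 60, 62, 58, 30, 33, 31),
            y = c(0, 2, 4, 1, 3, 0, 50, 52, 49))
snp_dist <- round(as.matrix(dist(xy)))
rownames(snp_dist) <- colnames(snp_dist) <- paste0("iso", 1:9)

best_k <- pick_k(snp_dist)
chains <- chain_clusters(snp_dist, ids = rownames(snp_dist), snpco = 20)
```

On this toy matrix the silhouette criterion favours three clusters, and the chain breaks at the two large jumps between groups. Real SNP data would rarely separate this cleanly, which is presumably why the tool adds bootstrapping.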
If MLST profiles and epidemiological data are provided, the final transmission clusters are generated in the context of sequence types and the epidemiological timeline.
[Figure: Transmission clusters]
Publication-ready graphs are automatically generated to visualize the integrated SNP/Epi transmission clusters, including a heatmap, a minimum spanning tree and scatter plots.
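As an illustration of how an epidemiological timeline can refine SNP clusters, the sketch below splits each facility/ST/SNP-cluster group whenever consecutive collection dates are more than `daysco` days apart. The grouping rule, the column names and the function `split_by_time` are assumptions made for this example and are not taken from the SNP2Cluster source.

```r
# Hypothetical helper: within each (facility, ST, SNP cluster) group, order
# isolates by collection date and start a new transmission cluster whenever
# the gap to the previous isolate exceeds daysco days.
split_by_time <- function(df, daysco = 45) {
  df <- df[order(df$facility, df$st, df$snp_cluster, df$date), ]
  group <- paste(df$facility, df$st, df$snp_cluster, sep = "|")
  gap <- c(Inf, as.numeric(diff(df$date)))  # days since the previous row
  new_cluster <- c(TRUE, group[-1] != group[-nrow(df)]) | gap > daysco
  df$trans_cluster <- cumsum(new_cluster)
  df
}

# Toy epi table: five isolates of one ST and one SNP cluster in two facilities
isolates <- data.frame(
  sample_id   = c("s1", "s2", "s3", "s4", "s5"),
  facility    = c("H1", "H1", "H1", "H2", "H1"),
  st          = c("ST307", "ST307", "ST307", "ST307", "ST307"),
  snp_cluster = c(1, 1, 1, 1, 1),
  date        = as.Date(c("2023-01-01", "2023-01-20", "2023-04-15",
                          "2023-01-10", "2023-02-10"))
)
result <- split_by_time(isolates, daysco = 45)
```

Here s1, s2 and s5 stay together in facility H1, s3 is split off because it was collected 64 days after s5, and s4 forms its own cluster because it comes from a different facility.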
1. Provide paths to the input files in the configuration file - refer to the config_file_template in the conf folder
Refer to the example-data folder to see examples of input files
# Set paths to files ------------------------------------------------------
dates_path = "./example-data/example_metadata.csv" # Epidemiological data file including collection dates and facility information
filepath = "./example-data/" # Pairwise single nucleotide polymorphism (SNP) distance matrix
mlst_profile = "./example-data/05.mlst.xlsx" # Multi-locus sequence type (MLST) profiles
out_dir <- "./example-output" # Output directory
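Before running an analysis it can help to confirm that the configured paths actually exist. The check below is a minimal sketch and not part of SNP2Cluster; `check_inputs` is a hypothetical helper, and temporary files stand in for the real inputs.

```r
# Hypothetical pre-flight check (not part of SNP2Cluster): stop with a clear
# message if any configured input path does not exist.
check_inputs <- function(paths) {
  absent <- names(paths)[!file.exists(unlist(paths))]
  if (length(absent) > 0) {
    stop("Missing input(s): ", paste(absent, collapse = ", "))
  }
  invisible(TRUE)
}

# Example with a temporary file and directory standing in for real inputs
tmp <- tempfile(fileext = ".csv")
file.create(tmp)
ok <- check_inputs(list(dates_path = tmp, filepath = tempdir()))
```

With the variables from the configuration file, the call would look like `check_inputs(list(dates_path = dates_path, filepath = filepath, mlst_profile = mlst_profile))`.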
2. The following variables should be defined in the configuration file as well
# Define variables --------------------------------------------------------
# The first column of the metadata file (dates_path) should have the sample_ids
# The other variables such as Facility, collection dates etc should be assigned to the variables below:
Main_var = "" # Mandatory main variable e.g. Hospital or Facility etc.
Var_01 = "" # Optional second variable e.g. Ward_name, Ward_type etc.
Var_02 = "" # Mandatory variable for specimen collection dates
clust_type = "" # "Core" or "Transmission"
snpco = 20 # SNP cut-off (default: 20)
if (clust_type == "Transmission") {
  trans_lvl = "Facility" # "Facility" (default) or "Community"
  daysco = 45 # Time interval in days (default: 45 days)
}
3. Specify format of collection dates in the epi data file
# Specify format of collection dates --------------------------------------
lubri_fmt <- "ymd"
# collection date format
# options include: dym, dmy, ymd, ydm, etc., based on the lubridate package
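To show what the order code means in practice, the sketch below parses day-first dates with lubridate's `parse_date_time()`. How SNP2Cluster applies `lubri_fmt` internally is not shown in this document, so treat this as one possible use of the setting; the example dates are made up.

```r
library(lubridate)

lubri_fmt <- "dmy"  # order code matching the dates in the epi data file
raw_dates <- c("15/01/2023", "03-02-2023")

# parse_date_time() takes the order code as a string and tolerates
# different separators; as.Date() drops the time component.
collection_dates <- as.Date(parse_date_time(raw_dates, orders = lubri_fmt))
```

If the code does not match the file, for example `"ymd"` for day-first dates, parsing fails or yields `NA`, so it is worth checking a few parsed values before running the full analysis.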
Save the configuration file in the conf folder
4. Set additional parameters in the execution file
a. Override the defaults for the SNP threshold and days interval by providing your preferred values
snpco=20 # Preferred SNP threshold
daysco=45 # Time interval in days
# Optionally, a vector of SNP thresholds can be provided together with matching
# day intervals in a second vector if you want to perform multiple comparisons
# snpco=c(11,20,20,25,11)
# daysco=c(60,14,60,45,14)
b. Provide name of the configuration file saved in the conf folder
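When vectors of thresholds are supplied in step 4a, each SNP threshold is paired by position with one day interval. The sketch below shows that pairing; `run_label` is a hypothetical placeholder for a single analysis run, and whether SNP2Cluster iterates in exactly this way is an assumption.

```r
snpco  <- c(11, 20, 20, 25, 11)  # SNP thresholds
daysco <- c(60, 14, 60, 45, 14)  # matching time intervals in days

# The two vectors are paired by position, so they must have equal length
stopifnot(length(snpco) == length(daysco))

# Hypothetical placeholder for one analysis run; it only labels the settings
run_label <- function(snp, days) sprintf("snp%d_days%d", snp, days)

# Map() walks both vectors in parallel, one (snp, days) pair per comparison
labels <- unlist(Map(run_label, snpco, daysco))
```

The first comparison therefore uses 11 SNPs with a 60-day window, the second 20 SNPs with a 14-day window, and so on.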
5. Run the analysis
Kwenda, S., Shuping, L., Mashau, R., Ismail, H., & Govender, N. P. (2024). SNP2Cluster: A core SNP and K-means clustering-based tool for enhanced transmission cluster detection in outbreak scenarios (v0.5.3). Core SNP-based clustering for enhanced transmission cluster detection in outbreak scenarios (SNP2Cluster), Klebsiella Epidemiology and Biology Symposium 2024, Institut Pasteur, Paris, France. Zenodo.