Code associated with the first publication describing the Human Microbiome Compendium

blekhmanlab/compendium_v1

Human Microbiome Compendium v1

Code supplement

This repository contains code related to the manuscript "Integration of 168,000 samples reveals global patterns of the human gut microbiome" by Abdill, Graham et al.

Processing

  • The /pipeline directory contains code used for retrieving, processing and consolidating the raw data from the Sequence Read Archive.
  • /visualization/setup.R contains helper functions used in the scripts below.
  • /analysis/make_filtered_data.R requires one input file, taxonomic_table.csv. This can be downloaded from Zenodo (as taxonomic_table.csv.gz), decompressed, and used without modification. This script generates the data files used in all the other scripts.
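
The decompression step mentioned above can be done with any gzip tool. As a minimal sketch, assuming the archive has already been downloaded from Zenodo into the working directory (the function name `decompress_gzip` and the file location are illustrative, not part of this repository):

```python
import gzip
import shutil
from pathlib import Path


def decompress_gzip(src: Path, dest: Path) -> Path:
    """Stream-decompress a .gz file to dest without loading it all into memory."""
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest


if __name__ == "__main__":
    # Assumed location: the archive sits in the current working directory.
    archive = Path("taxonomic_table.csv.gz")
    if archive.exists():
        # with_suffix("") drops the trailing ".gz", giving taxonomic_table.csv
        decompress_gzip(archive, archive.with_suffix(""))
```

The resulting taxonomic_table.csv is the file that make_filtered_data.R expects; no further editing should be needed.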

Analysis

The /analysis directory contains code used to generate and evaluate data for the project.

  • pcoa.R contains the code used for the principal coordinates analysis in Figure 2.
  • rarefaction.R contains the code used for the taxonomic discovery rate analysis in Figure 1.
  • rarefaction_diversity.R contains the code used for the rarefaction analysis of Shannon diversity described in Figure 2.
  • cluster_evaluation.R contains the code used for the bootstrap analysis of clustering strength described in the manuscript.
  • pca.R contains the code used for the principal components analysis used to determine regional signatures described in the manuscript. It relies on one external file, sample_metadata.tsv, available in the paper's associated Zenodo repository.
  • country_inference_check.R contains the code used for the manual evaluation of the accuracy of the world region inference steps. The script begins with the power calculation, followed by the procedure used to randomly select the samples to validate.
  • phylogenetic.sh describes how the Greengenes2-based classifications were generated.
  • The gain analysis illustrated in Supplementary Figure 5 has several files:
    • gain_setup.sh does the data preparation.
    • gain_iteration.sh performs a subset of the permutations.
    • gain.R plots the data as seen in the figure.
  • evident.R contains the script used to calculate the effect sizes illustrated in Figure 3G.
  • Several files show the process for the PERMANOVA analysis described in the results section about Figure 3:
    • filter_dist.py does the data preparation.
    • permanova.R is the script used to run the analysis.
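
To illustrate the general shape of the validation workflow in country_inference_check.R (decide how many samples to check by hand, then draw them at random), here is a rough sketch. It is not the method used in that script: it substitutes a simple margin-of-error sample-size formula for the power calculation, and the function names, default confidence level, and seed are all assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist


def sample_size_for_proportion(expected_accuracy, margin, confidence=0.95):
    """Samples needed to estimate a proportion to within +/- margin.

    Uses the normal-approximation formula n = z^2 * p * (1 - p) / margin^2.
    This is a stand-in, not the calculation in country_inference_check.R.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = (z ** 2) * expected_accuracy * (1 - expected_accuracy) / margin ** 2
    return math.ceil(n)


def pick_samples_to_validate(sample_ids, n, seed=0):
    """Draw n distinct sample IDs at random; a fixed seed keeps the draw reproducible."""
    rng = random.Random(seed)
    ids = list(sample_ids)
    return rng.sample(ids, min(n, len(ids)))


if __name__ == "__main__":
    # Hypothetical inputs: worst-case accuracy of 0.5 and a 5-point margin.
    n = sample_size_for_proportion(expected_accuracy=0.5, margin=0.05)
    ids = [f"sample_{i}" for i in range(10_000)]  # placeholder IDs
    to_check = pick_samples_to_validate(ids, n)
    print(n, len(to_check))
```

Fixing the seed means the same samples are selected on every run, which matters when the manual review is spread over several sessions.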

Visualization

The /visualization directory contains the R code used to generate the figures in our manuscript.

  • map_setup.R lists the steps for installing the dependencies for generating the map in Figure 2A.
  • setup.R loads helper functions used in the generation of several figures.
  • figure1.R generates the panels in Figure 1 and associated supplementary material.
    • It requires one external file, rarefaction.rds, that is stored in the /data directory.
  • figure2.R generates the panels in Figure 2 and associated supplementary material. It requires several external files:
    • In the data/ directory:
      • rarefaction_diversity.rds
      • regions.csv
    • From the paper's associated Zenodo repository:
      • sample_metadata.tsv
    • Generated by pcoa.R:
      • nmds.rds
      • pcoa_points.rds
  • figure3.R generates Figure 3 and its supplements. It requires several external files:
    • sra_samples.tsv, available from the publication as Supplementary Table 7.
    • tech.txt, in the /data directory.
    • diff_abundance_results_20240705.tsv, in the /data directory.
  • Figure 4 and its supplements are generated by code across several files:
    • figure4A.R generates the panels in Figure 4A, and calls the code for generating figures 4B and 4C. It requires several external files:
      • Figure 4A requires one external file, unfiltered_rarefaction_by_read.rds, that is stored in the /data directory.
      • Figure 4C requires sample_metadata.tsv from the paper's associated Zenodo repository.
      • Figures 4C–F require metadata_from_rpackage.rds from the /data directory.
    • figure4D.R and figure4EF.R generate the remaining panels.
      • These require taxa_names.txt from the /data directory.
  • figure5.R generates the panels in Figure 5. It requires several external files, all available in the /data directory:
    • diff_taxa_counts_for_5A.rds
    • fig5A_labels.rds
    • diff_abundant_pvalues_for_5B.rds
    • metadata_for_diffAbundance.rds
    • taxon_names.tsv
  • figure6.R generates the panels in Figure 6 and its supplements. It requires several external files, all available in the /data directory:
    • compendium_metadata.csv
    • compendium_pca.csv
    • country_cluster_bootstrap.100min.rds
    • country_cluster_bootstrap.100min.REAL.rds

If you have any questions, please contact the corresponding author, Ran Blekhman, at blekhman (at) uchicago.edu.
