This repository contains code related to the manuscript "Integration of 168,000 samples reveals global patterns of the human gut microbiome" by Abdill, Graham et al.
- The /pipeline directory contains code used for retrieving, processing and consolidating the raw data from the Sequence Read Archive.
- /visualization/setup.R contains helper functions used in the scripts below.
- /analysis/make_filtered_data.R requires one input file, taxonomic_table.csv. This can be downloaded from Zenodo (as taxonomic_table.csv.gz), decompressed, and used without modification. This script generates the data files used in all the other scripts.
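
The decompression step for the Zenodo download can be done with any gzip tool. As a minimal sketch in Python, assuming the archive has already been downloaded into the current working directory (the helper name `decompress_gzip` is hypothetical and not part of this repository):

```python
import gzip
import shutil
from pathlib import Path


def decompress_gzip(src: Path, dest: Path) -> Path:
    """Stream-decompress a .gz file to dest without loading it all into memory."""
    with gzip.open(src, "rb") as compressed, open(dest, "wb") as out:
        shutil.copyfileobj(compressed, out)
    return dest


if __name__ == "__main__":
    # Assumption: the archive was downloaded into the current directory.
    archive = Path("taxonomic_table.csv.gz")
    if archive.exists():
        decompress_gzip(archive, Path("taxonomic_table.csv"))
        print("Wrote taxonomic_table.csv")
    else:
        print(f"{archive} not found; download it from Zenodo first")
```

The resulting taxonomic_table.csv is the file that make_filtered_data.R expects. Where that script looks for it is not stated here, so check the path used inside the script before running it.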

The /analysis directory contains code used to generate and evaluate data for the project.
- pcoa.R contains the code used for the principal coordinates analysis in Figure 2.
- rarefaction.R contains the code used for the taxonomic discovery rate analysis in Figure 1.
- rarefaction_diversity.R contains the code used for the rarefaction analysis of Shannon diversity described in Figure 2.
- cluster_evaluation.R contains the code used for the bootstrap analysis of clustering strength described in the manuscript.
- pca.R contains the code used for the principal components analysis used to determine regional signatures described in the manuscript. It relies on one external file, sample_metadata.tsv, available in the paper's associated Zenodo repository.
- country_inference_check.R contains the code used for the manual evaluation of the accuracy of the world region inference steps. The power calculation is first, followed by the procedure used to generate the randomly selected samples to validate.
- phylogenetic.sh describes generating the Greengenes2-based classifications.
- The gain analysis illustrated in Supplementary Figure 5 has several files:
  - gain_setup.sh does the data preparation.
  - gain_iteration.sh performs a proportion of the permutations.
  - gain.R plots the data as seen in the figure.
- evident.R shows the script used to calculate the effect sizes illustrated in Figure 3G.
- Several files show the process for the PERMANOVA analysis described in the results section about Figure 3:
  - filter_dist.py does the data preparation.
  - permanova.R is the script used to run the analysis.

The /visualization directory contains the R code used to generate the figures in our manuscript.
- map_setup.R lists the steps for installing the dependencies for generating the map in Figure 2A.
- setup.R loads helper functions used in the generation of several figures.
- figure1.R generates the panels in Figure 1 and associated supplementary material.
  - It requires one external file, rarefaction.rds, that is stored in the /data directory.
- figure2.R generates the panels in Figure 2 and associated supplementary material. It requires several external files:
  - In the data/ directory:
    - rarefaction_diversity.rds
    - regions.csv
  - From the paper's associated Zenodo repository:
    - sample_metadata.tsv
  - Generated by pcoa.R:
    - nmds.rds
    - pcoa_points.rds
- figure3.R generates Figure 3 and its supplements. It requires several external files:
  - sra_samples.tsv, available from the publication as Supplementary Table 7.
  - tech.txt in the /data directory
  - diff_abundance_results_20240705.tsv in the /data directory
- Figure 4 and its supplements are generated by code across several files:
  - figure4A.R generates the panels in Figure 4A, and calls the code for generating figures 4B and 4C. It requires several external files:
    - Figure 4A requires one external file, unfiltered_rarefaction_by_read.rds, that is stored in the /data directory.
    - Figure 4C requires sample_metadata.tsv from Zenodo.
    - Figures 4C–F require metadata_from_rpackage.rds from the /data directory.
  - figure4D.R and figure4EF.R generate the remaining panels.
    - These require taxa_names.txt from the /data directory.
- figure5.R generates the panels in Figure 5. It requires several external files, all available in the /data directory:
  - diff_taxa_counts_for_5A.rds
  - fig5A_labels.rds
  - diff_abundant_pvalues_for_5B.rds
  - metadata_for_diffAbundance.rds
  - taxon_names.tsv
- figure6.R generates the panels in Figure 6 and its supplements. It requires several external files, all available in the /data directory:
  - compendium_metadata.csv
  - compendium_pca.csv
  - country_cluster_bootstrap.100min.rds
  - country_cluster_bootstrap.100min.REAL.rds
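
Because several figure scripts depend on files in the /data directory, it can help to confirm the inputs are present before running them. The following is a minimal sketch, not part of the repository: the file lists are copied from the figure5.R and figure6.R entries above, while the function name `missing_inputs` and the assumption that the check runs from the repository root (so that `data` resolves to the /data directory) are illustrative.

```python
from pathlib import Path

# File lists copied from the README entries for figure5.R and figure6.R.
REQUIRED_DATA_FILES = {
    "figure5.R": [
        "diff_taxa_counts_for_5A.rds",
        "fig5A_labels.rds",
        "diff_abundant_pvalues_for_5B.rds",
        "metadata_for_diffAbundance.rds",
        "taxon_names.tsv",
    ],
    "figure6.R": [
        "compendium_metadata.csv",
        "compendium_pca.csv",
        "country_cluster_bootstrap.100min.rds",
        "country_cluster_bootstrap.100min.REAL.rds",
    ],
}


def missing_inputs(script: str, data_dir: Path) -> list[str]:
    """Return the required file names for `script` that are absent from data_dir."""
    return [
        name
        for name in REQUIRED_DATA_FILES[script]
        if not (data_dir / name).is_file()
    ]


if __name__ == "__main__":
    data_dir = Path("data")  # assumption: run from the repository root
    for script in REQUIRED_DATA_FILES:
        missing = missing_inputs(script, data_dir)
        status = "ready" if not missing else "missing: " + ", ".join(missing)
        print(f"{script}: {status}")
```

The same pattern could be extended to the other figure scripts, but files that come from Zenodo or from pcoa.R would need their own locations, which are not specified here.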

If you have any questions, please contact corresponding author Ran Blekhman at blekhman (at) uchicago.edu. Thanks.