CÓDIGO is the Consortium for Genomic Diversity, Ancestry, and Health in Colombia. It aims to build a community of Colombian researchers in support of local capacity in genomics, bioinformatics, and precision health. The CÓDIGO web browser currently contains 95 million genetic variants, and this repository aims to provide more: it merges and analyzes individual samples from various regions of Colombia so they can be uploaded to the database.
CÓDIGO browser: https://codigo.biosci.gatech.edu/
CÓDIGO flagship paper: https://www.nature.com/articles/s42003-025-08496-9
(Datasets ingested into CODIGOv2)
| Name | Code | Location |
|---|---|---|
| Magdalena | BAQ | BAQ_Magdalena |
- Installation
- Preprocessing Sample Data
- Summary Statistics for CODIGO
- Ancestry Estimates
The following base files are required for this workflow:
| File | Located in |
|---|---|
| Target reference genome (FASTA file) | /home/sharedFolder/referenceData/<Reference_Genome>/<Reference_Genome>.fasta |
| Target reference genome (dict file) | /home/sharedFolder/referenceData/<Reference_Genome>/<Reference_Genome>.dict |
| Region samples | /home/sharedFolder/referenceData/<Region>/ |
Notes: The above files are located on the teralab server. Here's how to copy them to your current directory if you'd like to use the hg38 reference genome.
cp /home/sharedFolder/referenceData/hg38/hg38.fasta .
cp /home/sharedFolder/referenceData/hg38/hg38.dict .
cp -r /home/sharedFolder/referenceData/<Region>/ .
Chain file (Download desired one for conversion): https://hgdownload.cse.ucsc.edu/goldenpath/<current_build>/liftOver/<conversion>.over.chain.gz.
For example, to convert from hg19 to hg38, the link would be: Hg19ToHg38
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz
environment.yaml is used to create the 'CODIGO_analysis_env' conda environment. This environment is required to run the notebooks in this repository.
# Clone the repository in your working directory
git clone https://github.com/jordanlab/CODIGO.git
# Copy the .yaml file to your current working directory
cp CODIGO/environment.yaml .
# Create the conda environment from the .yaml file
conda env create -f environment.yaml
# Activate the environment
conda activate CODIGO_analysis_env
These are the instructions to prepare population-specific genomic variant data for inclusion in the CÓDIGO database. <SampleName> is the region the samples come from (e.g. Magdalena, Cordoba, etc.). If any step uses PLINK version 1.9 instead of 2.0, be sure to include the --keep-allele-order flag (see the example below).
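For example, a PLINK 1.9 call in this workflow would carry the flag like this (a minimal sketch; the input and output prefixes are placeholders):
# Example only: any PLINK 1.9 command used in place of PLINK 2.0 should include
# --keep-allele-order so REF/ALT alleles are not silently swapped to major/minor order
plink --bfile <input_prefix> --keep-allele-order --make-bed --out <output_prefix>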
Ensure your kernel is able to access the 'CODIGO_analysis_env'. Once the kernel is active, make sure the following commands all run successfully.
| Tool | Command |
|---|---|
| Bcftools | bcftools --version |
| tabix | tabix --version |
| KING | king |
| PLINK 2.0 | plink2 --version |
| GATK | gatk |
Notes: The above tools are run throughout the notebook. Please ensure your environment is configured correctly so all of them run without issues.
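If you prefer to check everything in one pass, a loop like the following (a minimal sketch) can be run in a terminal or notebook cell; inspect each tool's output manually:
# Run each command from the table above and print which tool is being checked
for cmd in "bcftools --version" "tabix --version" "king" "plink2 --version" "gatk"; do
    echo ">>> $cmd"
    $cmd
done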
- Gather all sample VCFs from your dataset and merge them using bcftools
# Place all pathways of the sample vcf files into <SampleName>_vcf_list.txt file
find <SampleName/Folders_Containing_Sample_VCFs> -type f -name "*.vcf" > <SampleName>_vcf_list.txt
# Ensure the command above executed correctly and that the list contains one path per sample in the study
wc -l <SampleName>_vcf_list.txt
# Create bgzipped, indexed versions of your region's vcf files
mkdir vcf_gzipped
xargs -I {} -P 4 sh -c 'bgzip -c "{}" > "vcf_gzipped/$(basename "{}").gz" && tabix -p vcf "vcf_gzipped/$(basename "{}").gz"' < <SampleName>_vcf_list.txt
# Merge the files
ls --color=never vcf_gzipped/*.vcf.gz > vcf_gzipped_list.txt
bcftools merge -Oz -o merged.vcf.gz -l vcf_gzipped_list.txt
tabix -p vcf merged.vcf.gz
# Ensure that there are contig headers in the merged file to verify that the merge happened successfully
zgrep "##contig" merged.vcf.gz | head -n 5
- Create a clean bed/bim/fam dataset (keeping only biallelic variants from the samples)
# Grab the sampleIDs from all samples
zgrep '^#CHROM' merged.vcf.gz | head -n 1 | cut -f10- | tr '\t' '\n' > temp_ids.txt
bcftools query -l merged.vcf.gz > ids.txt
# Unzip the merged vcf for later analysis
gunzip merged.vcf.gz
# Create a psam file that contains important information regarding FamilyID, SampleID, Sex, & Phenotype
awk 'BEGIN{OFS="\t";print "#FID","IID","SEX","PHENOTYPE"}{print $1,$1,0,-9}' ids.txt > <SampleName>.psam
# Create a temp fam file that may be used downstream in the analysis (potentially not needed)
awk '{print $1, $1, 0, 0, 0, 0}' temp_ids.txt > temp.fam
# Keep only biallelic variants (--max-alleles 2) and write them to <SampleName>_clean
plink2 --vcf merged.vcf --psam <SampleName>.psam --split-par b37 --max-alleles 2 --make-bed --out <SampleName>_clean
- Rename the samples based on a convention that's agreed upon.
# Rename the FamilyID($1) and SampleID($2) to the new convention -> It usually is 'Convention_SampleID'; if it's different, change this command accordingly
cut -f 1-2 -d ' ' <SampleName>_clean.fam | awk '{print $1, $2, "<Your Convention>_"$1, "<Your Convention>_"$2}' > sample_rename.txt
# Make the bed files of the renamed samples
plink2 --bfile <SampleName>_clean --update-ids sample_rename.txt --make-bed --out <SampleName>_renamed_samples
- Rename variants to gnomAD format (chr:pos:ref:alt) using PLINK 2.0
# Reformatting the variants into gnomad format; b37 is the agreed upon convention for grch37
# @ = chromosome number
# # = base-pair position on chr
# $r = ref allele
# $a = alt allele
plink2 --bfile <SampleName>_renamed_samples --new-id-max-allele-len 1791 --set-all-var-ids chr@:#:\$r:\$a --make-bed --out <SampleName>_renamed_variants
- Remove samples with more than 99% of their genotyping data missing
# Remove samples that are missing more than 99% of their genotyping data
plink2 --bfile <SampleName>_renamed_variants --mind 0.99 --make-bed --out <SampleName>_qc1
- Remove any duplicate samples and first degree relatives present in the samples
# Identify duplicate samples and first-degree relatives with KING
king -b <SampleName>_qc1.bed --duplicate --degree 1 --prefix duplicates
# Grab the flagged sample pairs, place them in a .txt file, and remove them from the analysis
tail -n +2 duplicates.con | awk '{print $3, $4}' > duplicate_samples_to_remove.txt
plink2 --bfile <SampleName>_qc1 --remove duplicate_samples_to_remove.txt --make-bed --out <SampleName>_qc2
- If the current build is GRCh37, perform liftover to GRCh38 (gets harmonized automatically using GATK4)
# Convert our bed/bim/fam files to vcf
plink2 --bfile <SampleName>_qc2 \
--recode vcf-iid \
--out <SampleName>_qc2
# Annotate the vcf files to have 'chr' before the chromosome number as it's required for the gatk command to perform the proper liftover
bcftools annotate \
--rename-chrs <(for i in {1..22} X Y MT; do echo -e "$i\tchr$i"; done) \
<SampleName>_qc2.vcf -Oz -o <SampleName>_qc2_chr.vcf.gz
# Liftover Command
java -jar /home/<Your_teralab_Username>/miniconda3/envs/CODIGO_analysis_env/share/gatk4-4.6.2.0-0/gatk-package-4.6.2.0-local.jar LiftoverVcf \
-I <SampleName>_qc2_chr.vcf.gz \
-O <SampleName>_qc2_grch38.vcf \
-CHAIN hg19ToHg38.over.chain.gz \
-WARN_ON_MISSING_CONTIG true \
-RECOVER_SWAPPED_REF_ALT true \
-R hg38.fasta \
-REJECT rejected_variants.vcf
# See if the output file contains any info
tail -n10 /home/<Your_Teralab_Username>/CODIGOv2/practice_run/<SampleName>_qc2_grch38.vcf
Notes: If the build is already GRCh38, lift over to GRCh37 and then re-lift over to GRCh38 so that both directions get harmonized (a sketch of the first leg is shown below).
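A minimal sketch of the GRCh38 -> GRCh37 leg, assuming the hg38ToHg19.over.chain.gz chain file and an hg19.fasta reference have been downloaded as described earlier (output names are placeholders):
# Sketch only: reverse liftover for datasets that are already on GRCh38
java -jar /home/<Your_teralab_Username>/miniconda3/envs/CODIGO_analysis_env/share/gatk4-4.6.2.0-0/gatk-package-4.6.2.0-local.jar LiftoverVcf \
-I <SampleName>_qc2_chr.vcf.gz \
-O <SampleName>_qc2_grch37.vcf \
-CHAIN hg38ToHg19.over.chain.gz \
-WARN_ON_MISSING_CONTIG true \
-RECOVER_SWAPPED_REF_ALT true \
-R hg19.fasta \
-REJECT rejected_to_grch37.vcf
# Then run the GRCh37 -> GRCh38 liftover shown above on <SampleName>_qc2_grch37.vcf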
- Rescue missing GRCh38 variants by matching with CODIGO chr:pos:ref:alt
- Combine the rescued variants with the lifted-over set using bcftools (see the sketch below)
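The matching logic for the rescue step depends on the CODIGO GRCh38 lookup table and is not reproduced here. Assuming the rescue produces a bgzipped, indexed VCF of recovered variants (called rescued.vcf.gz below as a placeholder), the combine step could be sketched as:
# Sketch only: 'rescued.vcf.gz' is a placeholder for the output of the rescue step
bgzip -c <SampleName>_qc2_grch38.vcf > <SampleName>_qc2_grch38.vcf.gz
tabix -p vcf <SampleName>_qc2_grch38.vcf.gz
tabix -p vcf rescued.vcf.gz
# Combine the lifted-over variants with the rescued variants into the final GRCh38 set
bcftools concat -a <SampleName>_qc2_grch38.vcf.gz rescued.vcf.gz -Oz -o <SampleName>_qc2_grch38_final.vcf.gz
tabix -p vcf <SampleName>_qc2_grch38_final.vcf.gz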
The following information should be provided:
- Group Code:
- Ancestry Group:
- Department:
# Allele Frequencies
plink2 --bfile <SampleName>_qc2 \
--freq \
--out <SampleName>_allele_freq
# Allele Counts
plink2 --bfile <SampleName>_qc2 \
--freq counts \
--out <SampleName>_allele_counts
# Genotype Count
plink2 --bfile <SampleName>_qc2 \
--geno-counts \
--out <SampleName>_genotype_count
# Genotype Frequencies
awk 'BEGIN {OFS="\t"} NR==1 {print $0, "HOM_A1_FREQ", "HET_FREQ", "HOM_A2_FREQ"; next} {total=$6+$7+$8; if(total>0) {print $0, $6/total, $7/total, $8/total} else {print $0, 0, 0, 0}}' <SampleName>_genotype_count.gcount > <SampleName>_genotype_freq.gfreq
Note: See the Ensembl 1000 Genomes browser for 8 & 9
Run a Python script to create a file containing the summary statistics from the files in step 8, which will be uploaded to CÓDIGO.
import pandas as pd
# Please change the following paths to your own paths for the files
ALLELE_COUNT_PATH = "/home/<Teralab_Username>/CODIGOv2/CODIGO_<SampleName>/<SampleName>_allele_counts.acount"
ALLELE_FREQ_PATH = "/home/<Teralab_Username>/CODIGOv2/CODIGO_<SampleName>/<SampleName>_allele_freq.afreq"
OUTPUT_PATH = "/home/<Teralab_Username>/CODIGOv2/CODIGO_<SampleName>/BATCH_VARIANT_STATS.tsv"
COLUMNS_TO_CHECK = ['ID', 'REF', 'ALT']
def check_columns() -> None:
'''
Takes a look at all the columns in allele counts and allele frequencies and checks to see that we have matches in all values.
If there aren't, we need to fix something previously done in the workflow.
params:
None: Files required are global.
Returns:
None: Prints whether all values match or shows differences.
'''
# Load tab-delimited files
try:
df1 = pd.read_csv(ALLELE_COUNT_PATH, sep='\t')
df2 = pd.read_csv(ALLELE_FREQ_PATH, sep='\t')
except FileNotFoundError as e:
print(f"Error: {e}")
return
    # Get the (ID, REF, ALT) values from each file as sets of row tuples
    set1 = set(map(tuple, df1[COLUMNS_TO_CHECK].values))
    set2 = set(map(tuple, df2[COLUMNS_TO_CHECK].values))
# Find values that are different
only_in_file1 = set1 - set2
only_in_file2 = set2 - set1
# Check if all values are the same
if not only_in_file1 and not only_in_file2:
print("All values in the column are the same in both files!")
else:
print("Values only in file1:", only_in_file1)
print("Values only in file2:", only_in_file2)
def create_summary_stats_file() -> None:
'''
Creates final summary stats file using allele counts and allele frequencies in the population file.
params:
None: Files required are global.
Returns:
None: Saves the BATCH_VARIANT_STATS.tsv file to the output file path.
'''
try:
df1 = pd.read_csv(ALLELE_COUNT_PATH, sep='\t')
df2 = pd.read_csv(ALLELE_FREQ_PATH, sep='\t')
except FileNotFoundError as e:
print(f"Error: {e}")
return
# Create a new dataframe that will contain our new file information
return_df = pd.DataFrame()
# Populate this new dataframe with the values that we need from the .acount and .afreq files
    # The variant IDs already carry the 'chr' prefix from the --set-all-var-ids step, so use the first field directly
    return_df['Chromosome'] = df1['ID'].astype(str).str.split(':').str[0]
return_df['Position'] = df1['ID'].astype(str).str.split(':').str[1]
return_df['SNP'] = df1['ID']
return_df['Ref'] = df1['REF']
return_df['Alt'] = df1['ALT']
return_df['Alt_Count'] = df1['ALT_CTS']
return_df['Total_Count'] = df1['OBS_CT']
return_df['Alt_Frequency'] = df2['ALT_FREQS']
# Output the file
return_df.to_csv(OUTPUT_PATH, sep='\t', index=False)
print("Summary Stats File Created!")
check_columns()
create_summary_stats_file()
This step is used to infer the relative ancestries of our samples. For more information regarding the ancestry estimates and the specific steps chosen, please refer to the Jordan Lab Docs manual: JordanLab
The reference data can be merged with the sample data only on the overlapping variants. Below, we identify the overlapping variants in each of the files and then merge over them (the merge itself is sketched after these commands).
# For BAQ samples
awk '{print $2}' <SampleName>_qc2_grch38_final.bim > <SampleName>_qc2_grch38_final_snps.txt
# Shivam Reference Sample
awk '{print $2}' extractedChrAll.bim > extractedChrAll_snps.txt
# Sort the SNP lists from both the samples and the reference
sort <SampleName>_qc2_grch38_final_snps.txt -o <SampleName>_qc2_grch38_final_snps.txt
sort extractedChrAll_snps.txt -o extractedChrAll_snps.txt
# Find the common SNPs between reference & samples
grep -Fxf <SampleName>_qc2_grch38_final_snps.txt extractedChrAll_snps.txt > common_snps_extractedChr_<SampleName>.txt
# Get the variant counts for each of the files
wc -l <SampleName>_qc2_grch38_final_snps.txt
wc -l extractedChrAll_snps.txt
wc -l common_snps_extractedChr_<SampleName>.txt
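The merge command itself is not shown above; a minimal sketch using PLINK 1.9 (hence --keep-allele-order, per the note in the preprocessing section) that extracts the common SNPs from each dataset and produces the Merged_Sample<SampleName>_ReferenceExtractedChr prefix used below could look like this (the intermediate prefixes are placeholders):
# Sketch only: subset each dataset to the overlapping SNPs, then merge the two subsets
plink --bfile <SampleName>_qc2_grch38_final --keep-allele-order --allow-extra-chr --extract common_snps_extractedChr_<SampleName>.txt --make-bed --out <SampleName>_common
plink --bfile extractedChrAll --keep-allele-order --allow-extra-chr --extract common_snps_extractedChr_<SampleName>.txt --make-bed --out reference_common
# Merge the sample and reference subsets into a single bed/bim/fam dataset
plink --bfile <SampleName>_common --bmerge reference_common --keep-allele-order --allow-extra-chr --make-bed --out Merged_Sample<SampleName>_ReferenceExtractedChr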
Mind Filter
plink2 --bfile Merged_Sample<SampleName>_ReferenceExtractedChr --allow-extra-chr --mind 0.99 --make-bed --out step1_mind
Geno Filter
plink2 --bfile step1_mind --allow-extra-chr --geno 0.02 --make-bed --out step2_geno
Indel Filter
plink2 --bfile step2_geno --allow-extra-chr --snps-only just-acgt --make-bed --out step3_no_indels
Hardy-Weinberg Filter
plink2 --bfile step3_no_indels --allow-extra-chr --hwe 1e-6 --make-bed --out step4_hwe
Minor Allele Frequency Filter
plink2 --bfile step4_hwe --allow-extra-chr --maf 0.01 --make-bed --out step5_maf
LD Pruning Filters
plink --bfile step5_maf --allow-extra-chr --keep-allele-order --indep-pairwise 50 5 0.2 --out step6_ld_pruned
plink --bfile step5_maf --allow-extra-chr --keep-allele-order --extract step6_ld_pruned.prune.in --make-bed --out final_qc_for_pca
In this substep, we obtain the eigenvectors and eigenvalues for our PCA plot.
Note: It's important to obtain these eigenvectors and eigenvalues because they are required to create a PCA plot that will give the ancestry estimates for each sample.
plink --bfile final_qc_for_pca --keep-allele-order --pca 20 --allow-extra-chr --out <SampleName>_PCA_results
import pandas as pd
from plotnine import *
# --- 1. Load and Prepare the Data ---
# Define the path to the eigenvector file
eigenvec_file = "<SampleName>_PCA_results.eigenvec"
# Load the eigenvector data using pandas.
# PLINK's .eigenvec files are space-separated and don't have a header.
# We'll use a regular expression `\s+` to handle one or more spaces as a delimiter.
df = pd.read_csv(eigenvec_file, sep=r'\s+', header=None)
# The first two columns are Family ID (FID) and Individual ID (IID). The rest are the PCs.
# We'll create column names for all 20 principal components.
num_pcs = df.shape[1] - 2
pc_columns = [f'PC{i}' for i in range(1, num_pcs + 1)]
df.columns = ['FID', 'IndividualID'] + pc_columns
# Create the 'PopGroup' column by taking the first 3 characters of the 'IndividualID'
df['PopGroup'] = df['IndividualID'].str[:3]
pop_to_remove = ['ACB', 'ASW', 'MXL', 'PUR']
df = df[~df['PopGroup'].isin(pop_to_remove)]
# For clarity and efficiency, create a final DataFrame with only the columns needed for plotting
plot_df = df[['IndividualID', 'PopGroup', 'PC1', 'PC2']]
print(plot_df[['IndividualID', 'PopGroup']].head())
# --- 2. Create the PCA Plot ---
# Define the color mapping for population groups as a dictionary
colors = {
# Our Sample
"<SampleName>": "Gray",
# African
"ESN": "blue",
"GWD": "blue",
"YRI": "blue",
# European
"GBR": "orange",
"IBS": "orange",
"TSI": "orange",
# Indigenous American
"Kar": "aqua",
"May": "aqua",
"PEL": "aqua",
"Pim": "aqua",
"Sur": "aqua"
}
# Generate the plot using plotnine's ggplot
# The syntax mirrors R's ggplot2 very closely
pca_plot = (
ggplot(plot_df, aes(x='PC1', y='PC2', fill='PopGroup'))
+ scale_fill_manual(values=colors)
+ geom_point(size=5, shape='o', color="black") # shape='o' with a fill is equivalent to R's pch=21
+ theme_classic()
+ theme(
figure_size=(10, 10), # Set plot dimensions (width, height in inches)
axis_title=element_text(size=23),
axis_text=element_text(size=14),
legend_key_size=20, # Controls height and width of legend keys
legend_title=element_text(size=14),
legend_text=element_text(size=14)
)
)
# Print the plot object to display it
print(pca_plot)
pca_plot.draw()