CÓDIGO is the Consortium for Genomic Diversity, Ancestry, and Health in Colombia. It aims to build a community of Colombian researchers in support of local capacity in genomics, bioinformatics, and precision health. The CÓDIGO web browser currently contains 95 million genetic variants, and this repository aims to provide more: it merges and analyzes individual samples from various regions of Colombia so they can be uploaded to the database.
CÓDIGO browser: https://codigo.biosci.gatech.edu/
CÓDIGO flagship paper: https://www.nature.com/articles/s42003-025-08496-9
(Datasets ingested into CODIGOv2)
| Name | Code | Location |
|---|---|---|
| Magdalena | BAQ | BAQ_Magdalena |
- Installation
- Preprocessing Sample Data
- Summary Statistics for CODIGO
- Ancestry Estimates
The following base files are required for this workflow:
| File | Located in |
|---|---|
| Target reference genome (FASTA file) | /home/sharedFolder/referenceData/<Reference_Genome>/<Reference_Genome>.fasta |
| Target reference genome (dict file) | /home/sharedFolder/referenceData/<Reference_Genome>/<Reference_Genome>.dict |
| Region samples | /home/sharedFolder/referenceData/<Region>/ |
Notes: The above files are located on the teralab server. Here's how to copy them to your current directory if you'd like to use the hg38 reference genome.
cp /home/sharedFolder/referenceData/hg38/hg38.fasta .
cp /home/sharedFolder/referenceData/hg38/hg38.dict .
cp -r /home/sharedFolder/referenceData/<Region>/ .
Chain file (Download desired one for conversion): https://hgdownload.cse.ucsc.edu/goldenpath/<current_build>/liftOver/<conversion>.over.chain.gz.
For example, to convert from hg19 to hg38, the link would be: Hg19ToHg38
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz
environment.yaml is used to create the 'CODIGO_analysis_env' conda environment. This environment is required to run the notebooks in this repository.
# Clone the repository in your working directory
git clone https://github.com/jordanlab/CODIGO.git
# Copy the .yaml file to your current working directory
cp CODIGO/environment.yaml .
# Create the conda environment from the .yaml file
conda env create -f environment.yaml
# Activate the environment
conda activate CODIGO_analysis_env
These are the instructions to prepare population-specific genomic variant data for inclusion in the CÓDIGO database. <SampleName> is the region the samples come from (e.g. Magdalena, Cordoba, etc.). If any step uses PLINK version 1.9 instead of 2.0, be sure to include the --keep-allele-order flag (see the example below).
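For example, a PLINK 1.9 call in this workflow would carry the flag like this (a minimal sketch; the input and output prefixes are placeholders):
# Example only: any PLINK 1.9 command used in place of PLINK 2.0 should include
# --keep-allele-order so REF/ALT alleles are not silently swapped to major/minor order
plink --bfile <input_prefix> --keep-allele-order --make-bed --out <output_prefix>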
Ensure your kernel is able to access the 'CODIGO_analysis_env'. Once the kernel is active, make sure the following commands all run successfully.
| Tool | Command |
|---|---|
| Bcftools | bcftools --version |
| tabix | tabix --version |
| KING | king |
| PLINK 2.0 | plink2 --version |
| GATK | gatk |
Notes: The above tools are run throughout the notebook. Please ensure your environment is configured correctly so all of them run without issues.
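If you prefer to check everything in one pass, a loop like the following (a minimal sketch) can be run in a terminal or notebook cell; inspect each tool's output manually:
# Run each command from the table above and print which tool is being checked
for cmd in "bcftools --version" "tabix --version" "king" "plink2 --version" "gatk"; do
    echo ">>> $cmd"
    $cmd
done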
- Gather all sample VCFs from your dataset and merge them using bcftools
# Place all pathways of the sample vcf files into <SampleName>_vcf_list.txt file
find <SampleName/Folders_Containing_Sample_VCFs> -type f -name "*.vcf" > <SampleName>_vcf_list.txt
# Ensure the command above executed correctly and that the list contains one path per sample in the study
wc -l <SampleName>_vcf_list.txt
# Create bgzipped, indexed versions of your region's vcf files
mkdir vcf_gzipped
xargs -I {} -P 4 sh -c 'bgzip -c "{}" > "vcf_gzipped/$(basename "{}").gz" && tabix -p vcf "vcf_gzipped/$(basename "{}").gz"' < <SampleName>_vcf_list.txt
# Merge the files
ls --color=never vcf_gzipped/*.vcf.gz > vcf_gzipped_list.txt
bcftools merge -Oz -o merged.vcf.gz -l vcf_gzipped_list.txt
tabix -p vcf merged.vcf.gz
# Ensure that there are contig headers in the merged file to verify that the merge happened successfully
zgrep "##contig" merged.vcf.gz | head -n 5
- Create a clean bed/bim/fam dataset (keeping only biallelic variants from the samples)
# Grab the sampleIDs from all samples
zgrep '^#CHROM' merged.vcf.gz | head -n 1 | cut -f10- | tr '\t' '\n' > temp_ids.txt
bcftools query -l merged.vcf.gz > ids.txt
# Unzip the merged vcf for later analysis
gunzip merged.vcf.gz
# Create a psam file that contains important information regarding FamilyID, SampleID, Sex, & Phenotype
awk 'BEGIN{OFS="\t";print "#FID","IID","SEX","PHENOTYPE"}{print $1,$1,0,-9}' ids.txt > <SampleName>.psam
# Create a temp fam file that may be used downstream in the analysis (potentially not needed)
awk '{print $1, $1, 0, 0, 0, 0}' temp_ids.txt > temp.fam
# Keep only biallelic variants (--max-alleles 2) and write them to <SampleName>_clean
plink2 --vcf merged.vcf --psam <SampleName>.psam --split-par b37 --max-alleles 2 --make-bed --out <SampleName>_clean
- Rename the samples based on a convention that's agreed upon.
# Rename the FamilyID($1) and SampleID($2) to the new convention -> It usually is 'Convention_SampleID'; if it's different, change this command accordingly
cut -f 1-2 -d ' ' <SampleName>_clean.fam | awk '{print $1, $2, "<Your Convention>_"$1, "<Your Convention>_"$2}' > sample_rename.txt
# Make the bed files of the renamed samples
plink2 --bfile <SampleName>_clean --update-ids sample_rename.txt --make-bed --out <SampleName>_renamed_samples
- Rename variants to gnomAD format (chr:pos:ref:alt) using PLINK 2.0
# Reformatting the variants into gnomad format; b37 is the agreed upon convention for grch37
# @ = chromosome number
# # = base-pair position on chr
# $r = ref allele
# $a = alt allele
plink2 --bfile <SampleName>_renamed_samples --new-id-max-allele-len 1791 --set-all-var-ids chr@:#:\$r:\$a --make-bed --out <SampleName>_renamed_variants
- Remove samples with more than 99% of their genotyping data missing
# Remove samples that are missing more than 99% of their genotyping data
plink2 --bfile <SampleName>_renamed_variants --mind 0.99 --make-bed --out <SampleName>_qc1
- Remove any duplicate samples and first degree relatives present in the samples
# Identify duplicate samples and first-degree relatives with KING
king -b <SampleName>_qc1.bed --duplicate --degree 1 --prefix duplicates
# Grab the flagged sample pairs, place them in a .txt file, and remove them from the analysis
tail -n +2 duplicates.con | awk '{print $3, $4}' > duplicate_samples_to_remove.txt
plink2 --bfile <SampleName>_qc1 --remove duplicate_samples_to_remove.txt --make-bed --out <SampleName>_qc2
- If the current build is GRCh37, perform liftover to GRCh38 (gets harmonized automatically using GATK4)
# Convert our bed/bim/fam files to vcf
plink2 --bfile <SampleName>_qc2 \
--recode vcf-iid \
--out <SampleName>_qc2
# Annotate the vcf files to have 'chr' before the chromosome number as it's required for the gatk command to perform the proper liftover
bcftools annotate \
--rename-chrs <(for i in {1..22} X Y MT; do echo -e "$i\tchr$i"; done) \
<SampleName>_qc2.vcf -Oz -o <SampleName>_qc2_chr.vcf.gz
# Liftover Command
java -jar /home/<Your_teralab_Username>/miniconda3/envs/CODIGO_analysis_env/share/gatk4-4.6.2.0-0/gatk-package-4.6.2.0-local.jar LiftoverVcf \
-I <SampleName>_qc2_chr.vcf.gz \
-O <SampleName>_qc2_grch38.vcf \
-CHAIN hg19ToHg38.over.chain.gz \
-WARN_ON_MISSING_CONTIG true \
-RECOVER_SWAPPED_REF_ALT true \
-R hg38.fasta \
-REJECT rejected_variants.vcf
# See if the output file contains any info
tail -n10 /home/<Your_Teralab_Username>/CODIGOv2/practice_run/<SampleName>_qc2_grch38.vcf
Notes: If the build is already GRCh38, lift over to GRCh37 and then re-lift over to GRCh38 so that both directions get harmonized (a sketch of the first leg is shown below).
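A minimal sketch of the GRCh38 -> GRCh37 leg, assuming the hg38ToHg19.over.chain.gz chain file and an hg19.fasta reference have been downloaded as described earlier (output names are placeholders):
# Sketch only: reverse liftover for datasets that are already on GRCh38
java -jar /home/<Your_teralab_Username>/miniconda3/envs/CODIGO_analysis_env/share/gatk4-4.6.2.0-0/gatk-package-4.6.2.0-local.jar LiftoverVcf \
-I <SampleName>_qc2_chr.vcf.gz \
-O <SampleName>_qc2_grch37.vcf \
-CHAIN hg38ToHg19.over.chain.gz \
-WARN_ON_MISSING_CONTIG true \
-RECOVER_SWAPPED_REF_ALT true \
-R hg19.fasta \
-REJECT rejected_to_grch37.vcf
# Then run the GRCh37 -> GRCh38 liftover shown above on <SampleName>_qc2_grch37.vcf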
- Rescue missing GRCh38 variants by matching with CODIGO chr:pos:ref:alt
- Combine the rescued variants with the lifted-over set using bcftools (see the sketch below)
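The matching logic for the rescue step depends on the CODIGO GRCh38 lookup table and is not reproduced here. Assuming the rescue produces a bgzipped, indexed VCF of recovered variants (called rescued.vcf.gz below as a placeholder), the combine step could be sketched as:
# Sketch only: 'rescued.vcf.gz' is a placeholder for the output of the rescue step
bgzip -c <SampleName>_qc2_grch38.vcf > <SampleName>_qc2_grch38.vcf.gz
tabix -p vcf <SampleName>_qc2_grch38.vcf.gz
tabix -p vcf rescued.vcf.gz
# Combine the lifted-over variants with the rescued variants into the final GRCh38 set
bcftools concat -a <SampleName>_qc2_grch38.vcf.gz rescued.vcf.gz -Oz -o <SampleName>_qc2_grch38_final.vcf.gz
tabix -p vcf <SampleName>_qc2_grch38_final.vcf.gz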
The following information should be provided:
- Group Code:
- Ancestry Group:
- Department:
# Allele Frequencies
plink2 --bfile <SampleName>_qc2 \
--freq \
--out <SampleName>_allele_freq
# Allele Counts
plink2 --bfile <SampleName>_qc2 \
--freq counts \
--out <SampleName>_allele_counts
# Genotype Count
plink2 --bfile <SampleName>_qc2 \
--geno-counts \
--out <SampleName>_genotype_count
# Genotype Frequencies
awk 'BEGIN {OFS="\t"} NR==1 {print $0, "HOM_A1_FREQ", "HET_FREQ", "HOM_A2_FREQ"; next} {total=$6+$7+$8; if(total>0) {print $0, $6/total, $7/total, $8/total} else {print $0, 0, 0, 0}}' <SampleName>_genotype_count.gcount > <SampleName>_genotype_freq.gfreq
Note: See the Ensembl 1000 Genomes browser for 8 & 9
Run a Python script to create a file containing the summary statistics from the files in step 8, which will be uploaded to CÓDIGO.
import pandas as pd
# Please change the following paths to your own paths for the files
ALLELE_COUNT_PATH = "/home/<Teralab_Username>/CODIGOv2/CODIGO_<SampleName>/<SampleName>_allele_counts.acount"
ALLELE_FREQ_PATH = "/home/<Teralab_Username>/CODIGOv2/CODIGO_<SampleName>/<SampleName>_allele_freq.afreq"
OUTPUT_PATH = "/home/<Teralab_Username>/CODIGOv2/CODIGO_<SampleName>/BATCH_VARIANT_STATS.tsv"
COLUMNS_TO_CHECK = ['ID', 'REF', 'ALT']
def check_columns() -> None:
'''
Takes a look at all the columns in allele counts and allele frequencies and checks to see that we have matches in all values.
If there aren't, we need to fix something previously done in the workflow.
params:
None: Files required are global.
Returns:
None: Prints whether all values match or shows differences.
'''
# Load tab-delimited files
try:
df1 = pd.read_csv(ALLELE_COUNT_PATH, sep='\t')
df2 = pd.read_csv(ALLELE_FREQ_PATH, sep='\t')
except FileNotFoundError as e:
print(f"Error: {e}")
return
    # Get the (ID, REF, ALT) values from each file as sets of row tuples
    set1 = set(map(tuple, df1[COLUMNS_TO_CHECK].values))
    set2 = set(map(tuple, df2[COLUMNS_TO_CHECK].values))
# Find values that are different
only_in_file1 = set1 - set2
only_in_file2 = set2 - set1
# Check if all values are the same
if not only_in_file1 and not only_in_file2:
print("All values in the column are the same in both files!")
else:
print("Values only in file1:", only_in_file1)
print("Values only in file2:", only_in_file2)
def create_summary_stats_file() -> None:
'''
Creates final summary stats file using allele counts and allele frequencies in the population file.
params:
None: Files required are global.
Returns:
None: Saves the BATCH_VARIANT_STATS.tsv file to the output file path.
'''
try:
df1 = pd.read_csv(ALLELE_COUNT_PATH, sep='\t')
df2 = pd.read_csv(ALLELE_FREQ_PATH, sep='\t')
except FileNotFoundError as e:
print(f"Error: {e}")
return
# Create a new dataframe that will contain our new file information
return_df = pd.DataFrame()
# Populate this new dataframe with the values that we need from the .acount and .afreq files
    # The variant IDs already carry the 'chr' prefix from the --set-all-var-ids step, so use the first field directly
    return_df['Chromosome'] = df1['ID'].astype(str).str.split(':').str[0]
return_df['Position'] = df1['ID'].astype(str).str.split(':').str[1]
return_df['SNP'] = df1['ID']
return_df['Ref'] = df1['REF']
return_df['Alt'] = df1['ALT']
return_df['Alt_Count'] = df1['ALT_CTS']
return_df['Total_Count'] = df1['OBS_CT']
return_df['Alt_Frequency'] = df2['ALT_FREQS']
# Output the file
return_df.to_csv(OUTPUT_PATH, sep='\t', index=False)
print("Summary Stats File Created!")
check_columns()
create_summary_stats_file()
This step is used to infer the relative ancestries of our samples. For more information regarding the ancestry estimates and the specific steps chosen, please refer to the Jordan Lab Docs manual: JordanLab
The reference data can be merged with the sample data only on the overlapping variants. Below, we identify the overlapping variants in each of the files and then merge over them (the merge itself is sketched after these commands).
# For BAQ samples
awk '{print $2}' <SampleName>_qc2_grch38_final.bim > <SampleName>_qc2_grch38_final_snps.txt
# Shivam Reference Sample
awk '{print $2}' extractedChrAll.bim > extractedChrAll_snps.txt
# Sort the SNP lists from both the samples and the reference
sort <SampleName>_qc2_grch38_final_snps.txt -o <SampleName>_qc2_grch38_final_snps.txt
sort extractedChrAll_snps.txt -o extractedChrAll_snps.txt
# Find the common SNPs between reference & samples
grep -Fxf <SampleName>_qc2_grch38_final_snps.txt extractedChrAll_snps.txt > common_snps_extractedChr_<SampleName>.txt
# Get the variant counts for each of the files
wc -l <SampleName>_qc2_grch38_final_snps.txt
wc -l extractedChrAll_snps.txt
wc -l common_snps_extractedChr_<SampleName>.txt
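The merge command itself is not shown above; a minimal sketch using PLINK 1.9 (hence --keep-allele-order, per the note in the preprocessing section) that extracts the common SNPs from each dataset and produces the Merged_Sample<SampleName>_ReferenceExtractedChr prefix used below could look like this (the intermediate prefixes are placeholders):
# Sketch only: subset each dataset to the overlapping SNPs, then merge the two subsets
plink --bfile <SampleName>_qc2_grch38_final --keep-allele-order --allow-extra-chr --extract common_snps_extractedChr_<SampleName>.txt --make-bed --out <SampleName>_common
plink --bfile extractedChrAll --keep-allele-order --allow-extra-chr --extract common_snps_extractedChr_<SampleName>.txt --make-bed --out reference_common
# Merge the sample and reference subsets into a single bed/bim/fam dataset
plink --bfile <SampleName>_common --bmerge reference_common --keep-allele-order --allow-extra-chr --make-bed --out Merged_Sample<SampleName>_ReferenceExtractedChr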
Mind Filter
plink2 --bfile Merged_Sample<SampleName>_ReferenceExtractedChr --allow-extra-chr --mind 0.99 --make-bed --out step1_mind
Geno Filter
plink2 --bfile step1_mind --allow-extra-chr --geno 0.02 --make-bed --out step2_geno
Indel Filter
plink2 --bfile step2_geno --allow-extra-chr --snps-only just-acgt --make-bed --out step3_no_indels
Hardy-Weinberg Filter
plink2 --bfile step3_no_indels --allow-extra-chr --hwe 1e-6 --make-bed --out step4_hwe
Minor Allele Frequency Filter
plink2 --bfile step4_hwe --allow-extra-chr --maf 0.01 --make-bed --out step5_maf
LD Pruning Filters
plink --bfile step5_maf --allow-extra-chr --keep-allele-order --indep-pairwise 50 5 0.2 --out step6_ld_pruned
plink --bfile step5_maf --allow-extra-chr --keep-allele-order --extract step6_ld_pruned.prune.in --make-bed --out final_qc_for_pca
In this substep, we obtain the eigenvectors and eigenvalues for our PCA plot.
Note: It's important to obtain these eigenvectors and eigenvalues because they are required to create a PCA plot that will give the ancestry estimates for each sample.
plink --bfile final_qc_for_pca --keep-allele-order --pca 20 --allow-extra-chr --out <SampleName>_PCA_results
import pandas as pd
from plotnine import *
# --- 1. Load and Prepare the Data ---
# Define the path to the eigenvector file
eigenvec_file = "<SampleName>_PCA_results.eigenvec"
# Load the eigenvector data using pandas.
# PLINK's .eigenvec files are space-separated and don't have a header.
# We'll use a regular expression `\s+` to handle one or more spaces as a delimiter.
df = pd.read_csv(eigenvec_file, sep=r'\s+', header=None)
# The first two columns are Family ID (FID) and Individual ID (IID). The rest are the PCs.
# We'll create column names for all 20 principal components.
num_pcs = df.shape[1] - 2
pc_columns = [f'PC{i}' for i in range(1, num_pcs + 1)]
df.columns = ['FID', 'IndividualID'] + pc_columns
# Create the 'PopGroup' column by taking the first 3 characters of the 'IndividualID'
df['PopGroup'] = df['IndividualID'].str[:3]
pop_to_remove = ['ACB', 'ASW', 'MXL', 'PUR']
df = df[~df['PopGroup'].isin(pop_to_remove)]
# For clarity and efficiency, create a final DataFrame with only the columns needed for plotting
plot_df = df[['IndividualID', 'PopGroup', 'PC1', 'PC2']]
print(plot_df[['IndividualID', 'PopGroup']].head())
# --- 2. Create the PCA Plot ---
# Define the color mapping for population groups as a dictionary
colors = {
# Our Sample
"<SampleName>": "Gray",
# African
"ESN": "blue",
"GWD": "blue",
"YRI": "blue",
# European
"GBR": "orange",
"IBS": "orange",
"TSI": "orange",
# Indigenous American
"Kar": "aqua",
"May": "aqua",
"PEL": "aqua",
"Pim": "aqua",
"Sur": "aqua"
}
# Generate the plot using plotnine's ggplot
# The syntax mirrors R's ggplot2 very closely
pca_plot = (
ggplot(plot_df, aes(x='PC1', y='PC2', fill='PopGroup'))
+ scale_fill_manual(values=colors)
+ geom_point(size=5, shape='o', color="black") # shape='o' with a fill is equivalent to R's pch=21
+ theme_classic()
+ theme(
figure_size=(10, 10), # Set plot dimensions (width, height in inches)
axis_title=element_text(size=23),
axis_text=element_text(size=14),
legend_key_size=20, # Controls height and width of legend keys
legend_title=element_text(size=14),
legend_text=element_text(size=14)
)
)
# Print the plot object to display it
print(pca_plot)
pca_plot.draw()