Skip to content

QingrunZhangLab/EBSAS

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

40 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

EBSAS: Expression-Bridged Somatic Association Studies

Despite the success of genotype–phenotype association studies in many complex traits, their application to cancers has largely focused on germline variants, leaving the role of somatic mutations underexplored. This gap arises partly due to the non-inheritable nature of somatic mutations and technical challenges caused by their heterogeneity. Motivated by the gene-centric approach of transcriptome-wide association studies (TWAS) and evidence from somatic expression Quantitative Trait Loci (eQTLs), we developed EBSAS (Expression-Bridged Somatic Association Studies) to identify somatic mutations underlying cancer traits through their aggregated effects on gene expression.

EBSAS shows well-controlled type I error and improved power compared to existing methods. Applied to cancer transcriptomic and proteomic datasets, EBSAS revealed context-specific expression of cancer-associated proteins (e.g., PAX6 in breast cancer, IDO in prostate cancer) that modulate the tumor immune microenvironment in distinct ways. These findings illuminate mechanisms by which somatic mutations influence immune evasion and immunotherapy response.


This repository demonstrates the main analysis pipeline—from raw inputs including germline mutation data, mRNA expression, protein expression, and corresponding sample information (all sourced and integrated from the TCGA data portal)—to identifying protein-pathway associations.

Demo Files Description

Below are the demo files included in this repository, each showing the structure of input data used in the analyses. These files are for illustration purposes only.

  • Annotated_snp_file.csv
    Annotated germline mutation file.

  • gdc_sample_SimpleGermlineMutation.tsv
    Germline sample information file.

  • gdc_sample_GeneExpression.tsv
    Gene/mRNA expression sample information file.

  • combined_gene_expression_tpm.csv
    PEER corrected gene expression file.

  • protein_expression_matrix.csv
    Protein expression data matrix.

  • gdc_sample_Protein.tsv
    Protein sample information file.

  • Somatic_mutation_matrix.csv
    Somatic mutation matrix file.

  • allnameidlist_v43GTF_hg38_GRCh38.txt
    Gene annotation file containing gene ID, name, start/end positions, etc.

  • Gene_weights.txt
    Example output file from Jawamix5.sh.

  • allpathway_geneid.txt
    Pathway gene file listing all genes within each pathway.

  • TCGA_antibodies_descriptions.gencode.v36.tsv
    Protein annotation details.

  • Pathways_GeneIDs.csv
    Pathway names, IDs, and associated genes.


Note: All demo files are provided only to illustrate input file structure.

Code Files Description

Germline SNP Filtering for Tumor Samples

Code name: GermlineMutation_filter.py

Description

This script filters germline SNP data to retain only tumor-related samples, including Primary Tumor and Metastatic tissues. It:

  1. Reads an annotated SNP matrix and a sample metadata file.
  2. Selects only tumor-related samples.
  3. Restricts SNPs to autosomal chromosomes (chr1chr22).
  4. Outputs a filtered SNP matrix.

Inputs

Argument Description
--snp_file Path to the input SNP matrix (.csv) with rows as SNPs and columns as sample IDs.
--mapping_file Sample metadata file (.tsv) containing tissue type and sample ID information.
--output_file Path to save the filtered SNP matrix containing only tumor/metastatic samples and SNPs from chr1chr22.

Usage Example

python filter_germline_tumor_snps.py \
  --snp_file Annotated_snp_file.csv \
  --mapping_file gdc_sample_SimpleGermlineMutation.tsv \
  --output_file PrimaryTumor_snp_data.csv

Gene Expression Filtering for Tumor Samples

Code name: Expression_filter.py

Description

This script filters a gene expression matrix to retain only tumor-related samples, including Primary Tumor and Metastatic tissues. It:

  1. Reads a sample metadata file and a gene expression file.
  2. Selects only tumor-related samples based on tissue type.
  3. Outputs a filtered gene expression matrix.

Inputs

Argument Description
--mapping_file Sample metadata file (.tsv) containing sample IDs and tissue type information.
--expression_file Gene expression file (.csv) where columns are TCGA sample IDs and rows are genes.
--output_file Filtered expression matrix containing only tumor/metastatic samples.

Usage Example

python filter_expression_tumor_samples.py \
  --mapping_file gdc_sample_GeneExpression.tsv \
  --expression_file combined_gene_expression_tpm.csv \
  --output_file filtered_expression_primary_tumor.csv

Protein Expression Filtering for Tumor Samples

Code name: filter_protein_tumor_samples.py

Description

This script filters a protein expression matrix to retain only tumor-related samples, including Primary Tumor and Metastatic tissues. It is designed for preparing protein-level data for downstream association tests such as SKAT.

  1. Reads a sample metadata file and a protein expression file.
  2. Selects only tumor-related samples based on tissue type.
  3. Outputs a filtered protein expression matrix.

Inputs

Argument Description
--mapping_file Sample metadata file (.tsv) containing Sample ID and Tissue Type columns.
--protein_file Protein expression file (.csv) where samples are in columns, proteins are in rows, and AGID is used as the identifier column.
--output_file Path to save the filtered protein matrix containing only tumor/metastatic samples.

Usage Example

python filter_protein_tumor_samples.py \
  --mapping_file gdc_sample_Protein.tsv \
  --protein_file protein_expression_matrix.csv \
  --output_file filtered_protein_primary_tumor.csv

Somatic Mutation Matrix Rescaling by Mutation Frequency

Code name: rescale_somatic_mutation_by_maf.py

Description

This script rescales values in a somatic mutation matrix based on the number of non-zero entries per row, which serves as a proxy for mutation frequency or Minor Allele Frequency (MAF).

For each row:

  • Count the number of non-zero elements.
  • Multiply each value in that row by this count.

Inputs

Argument Description
--input_file Path to the somatic mutation matrix (.csv). The first column should contain variant identifiers (e.g., Chrom_Pos), and the remaining columns are sample values.
--output_file Path to save the output CSV with values rescaled by row-wise mutation frequency.

Usage Example

python rescale_somatic_mutation_by_maf.py \
  --input_file Somatic_mutation_matrix.csv \
  --output_file Somatic_mutation_matrix_rescaled.csv

Gene-Level Aggregation of Somatic Mutations

Code name: aggregate_somatic_mutations_by_gene.py

Description

This script aggregates rescaled somatic mutation data to the gene level by mapping SNPs to nearby genes. SNPs located within ±50 kb of a gene’s coordinates are assigned to that gene, and their weighted mutation values are combined to create a gene-level mutation matrix suitable for downstream association analyses.

Method

  1. For each SNP, check if it lies within ±50 kb of a gene’s coordinates.
  2. If so, add the SNP’s mutation value to the corresponding gene’s total for each sample.
  3. Retain only genes with at least one non-zero entry across samples in the final output.

Inputs

Argument Description
--annotation_file Tab-separated gene annotation file with columns: chr, start, end, gene_id.
--somatic_file CSV file containing rescaled somatic mutations (Chrom_Pos followed by sample columns).
--output_file CSV file to save the aggregated gene-level somatic mutation matrix.

Usage Example

python aggregate_somatic_mutations_by_gene.py \
  --annotation_file allnameidlist_v43GTF_hg38_GRCh38.txt \
  --somatic_file Somatic_mutation_matrix_rescaled.csv \
  --output_file Somatic_genecomb_sp_50k.csv

Sample Consistency Across Multi-Omics Data

Code name: harmonize_tcga_samples_across_omics.py

Description

This script ensures sample consistency across different omics data types in TCGA studies, including:

  • Gene Expression
  • SNP (Germline)
  • Somatic Mutation
  • Protein Expression

It performs filtering and reordering steps to keep only shared samples (and genes when applicable) for downstream integrative analyses.


Input/Output Overview

Step Input Files Output Files Description
Expression + SNP filtered_expression_primary_tumor.csv, PrimaryTumor_snp_data.csv filtered_gene_expression.csv, filtered_snp_data.csv Keep only shared samples
Expression + Somatic filtered_gene_expression.csv, Somatic_genecomb_sp_50k.csv filtered_expression_file2.csv, filtered_reordered_mutation_file.csv Keep shared samples & genes, reorder
Protein + Somatic filtered_protein_primary_tumor.csv, filtered_reordered_mutation_file.csv filtered_protein_SKAT.csv, filtered_somatic_SKAT.csv Match sample columns in protein and somatic files

Features

  • Retains only samples shared across multiple omics datasets
  • Maintains gene and sample alignment for downstream analysis
  • Modular design for flexible use

Usage

Update file paths inside the script as needed, then run:

python harmonize_tcga_samples_across_omics.py

Gene Filtering Based on Expression and Mutation Properties

Code name: filter_genes_by_properties.py

Description

This script filters genes from expression and somatic mutation matrices based on gene-level properties and hyperparameters. It applies thresholds to remove potentially uninformative or problematic genes by considering:

  • Gene length
  • Expression variance
  • Mutation frequency (mutation sparsity as a proxy for MAF)

Filtering Criteria

  • Gene length: Removes the shortest 1% of genes.
  • Expression variance: Removes genes with zero variance (adjustable).
  • Mutation sparsity: Removes genes in the lowest 1% of non-zero mutation proportion.

Inputs

Argument Description
--gene_info_file Gene annotation file with chromosome, start, end, and gene ID.
--expression_file Filtered gene expression matrix (.csv, genes × samples).
--mutation_file Filtered somatic mutation matrix (.csv, genes × samples).

Outputs

Argument Description
--output_expression Expression matrix after gene filtering.
--output_mutation Mutation matrix after gene filtering.
--output_transposed_expression Transposed expression matrix (samples × genes) for downstream analysis.

Usage Example

python filter_genes_by_properties.py \
  --gene_info_file allnameidlist_v43GTF_hg38_GRCh38.txt \
  --expression_file filtered_gene_expression.csv \
  --mutation_file filtered_reordered_mutation_file.csv \
  --output_expression filtered_gene_expression_Association.csv \
  --output_mutation filtered_reordered_mutation_Association.csv \
  --output_transposed_expression filtered_gene_Tranposed_expression_Association.csv

SNP Preprocessing for Kernel-Based Association Models

Code name: prepare_snp_for_kernel.py

Description

This script preprocesses SNP data to prepare it for kernel-based association models such as JAWAMIX5. The processing includes:

  • Renaming columns
  • Filtering autosomal chromosomes (chr1chr22) only
  • Removing the rsID column
  • Sorting rows by genomic location (CHR and POS)

Processing Steps

  1. Rename the first two columns to CHR and POS.
  2. Drop the third column (rsID).
  3. Filter to keep only SNPs on autosomal chromosomes (122).
  4. Sort rows by chromosome and position.

Inputs

Argument Description
--input_file Raw SNP matrix (.csv) with first three columns: chromosome, position, rsID, followed by genotype columns.

Outputs

Argument Description
--cleaned_output SNP matrix after column renaming and autosome filtering.
--sorted_output Final SNP matrix sorted by CHR and POS, suitable for kernel input.

Usage Example

python prepare_snp_for_kernel.py \
  --input_file filtered_snp_data.csv \
  --cleaned_output filtered_snp_data_clean.csv \
  --sorted_output filtered_snp_data_Kernel_Re.csv

JAWAMIX5 Kernel Preprocessing Script

Code name: JAWA_kernal.sh

Description

This script runs JAWAMIX5 to preprocess germline SNP data for kernel-based genetic analysis. It performs two main steps:

  1. Converts the SNP matrix from .csv format to .hdf5 format.
  2. Computes the kinship matrix using the Realized Relationship Matrix (RRM) method.

Inputs

File Description
filtered_snp_data_Kernel_Re.csv Cleaned and sorted germline SNP matrix in CSV format.
filtered_snp_data_Kernel.hdf5 Intermediate SNP matrix in HDF5 format.

Outputs

File Description
filtered_snp_data_Kernel.hdf5 SNP matrix in HDF5 format for downstream kinship calculation.
filtered_snp_RRM Kinship matrix generated using the RRM method.

Automated Gene-Wise Association Testing with JAWAMIX5

Code name: JAWA.sh

Description

This shell script automates gene-wise association testing using JAWAMIX5 over a specified range of gene indices. It is designed for high-throughput execution on a cluster or compute node. For each gene in the range, it performs the following steps:

  1. Generate mutation matrix for the gene using JAWA_prep.py.
  2. Extract the gene name from the output file.
  3. Convert the mutation matrix to HDF5 format via Jawamix5.jar import.
  4. Run association testing using Jawamix5.jar emmax with the expression data and GRM.
  5. Concatenate results using Gene_concatenate.py.
  6. Clean up temporary mutation files to save disk space.

Extract Single Gene Mutation Profile for JAWAMIX5

Code name: JAWA_prep.py

Description

This script extracts the mutation profile of a single gene (one row) from a full somatic mutation matrix and converts it into a format compatible with JAWAMIX5 for gene-based association testing.


Inputs

Argument Description
--mutation_file Full mutation matrix file (.csv). Default: filtered_reordered_mutation_Association.csv.
--gene_index Row index of the gene to extract.
--output_dir Directory to save the output file. Default: current working directory.

Output

  • mutation_gene_<GENE_ID>_index_<INDEX>.csv: One-row mutation matrix with columns: CHR, POS, sample1, sample2, ...

Example Usage

python JAWA_prep.py --gene_index 3570 \
  --mutation_file filtered_reordered_mutation_Association.csv \
  --output_dir ./JAWA_parallel/

Aggregate Gene-Wise Association Results

Code name: Gene_concatenate.py

Description

This script parses and aggregates gene-wise association results from JAWAMIX5 emmax output files (EMMAX.*.top) into a summary file.

Functionality:

  • Extracts gene index and gene name from filenames formatted as EMMAX.<index>_<gene>.top.
  • Reads key statistics: p-value, adjusted R², coefficient, standard error, and MAF count from each result file.
  • Appends each gene’s results as a line in Gene_summary.txt.
  • Deletes processed .top files to keep the working directory clean.

Input

  • A directory containing multiple EMMAX.*.top files (output from Jawamix5.jar emmax).

Output

  • Gene_summary.txt (tab-delimited) with columns:
    index, gene_name, pvalue, adjusted_r2, coefficient, sd_err, maf_count

Example Usage

python Gene_concatenate.py

Pathway-Level Filtering of Somatic Mutation Data for SKAT

Code name: SKAT_Somatic_prep.py

Description

This script filters gene-level somatic mutation data by pathway and exports a mutation matrix for each pathway. It prepares input data for pathway-level association tests such as SKAT.


Inputs

File Description
allpathway_geneid.txt Plain text file where each line contains a pathway ID followed by gene IDs in that pathway.
Example format:
Pathway1 GENE0001 GENE0002 GENE0003
Pathway2 GENE0004 GENE0005
filtered_somatic_SKAT.csv Gene-by-sample somatic mutation matrix including a GeneID column.

Outputs

  • One CSV file per pathway saved under Result/Pathway_Somatic/, named {pathway_id}_mutation.csv, containing only genes from that pathway.

Example Usage

python SKAT_Somatic_prep.py

Pathway-Level Association Testing Using SKAT in R

Code name: run_skat_pathway_association.R

Description

This R script performs pathway-level association tests between somatic mutation profiles and protein expression levels using the SKAT (Sequence Kernel Association Test) method. Each pathway is tested using gene-level mutation data weighted by coefficients from JAWAMIX5.


Inputs

File/Pattern Description
Result/Pathway_Somatic/*.csv Gene-by-sample somatic mutation matrices for each pathway.
filtered_protein_SKAT.csv Protein expression matrix with samples as columns.
Gene_weights.txt Gene-level weights (coefficients) derived from JAWAMIX5 results.

Output

  • pathway_protein_pvalues.txt: Tab-delimited file summarizing SKAT p-values with columns:

    pathway_id protein_id pvalue
    hsa00010 Q9H0H5 0.00392
    hsa00020 Q96A11 0.00001
    ... ... ...

Each row reflects the significance of the association between a pathway’s mutation profile and a protein’s expression.


Usage

Run the script for a range of pathway files (useful for parallelization):

Rscript run_skat_pathway_association.R 1 100

Protein-Pathway Association Summary and Metadata Integration

Code name: SKAT_Summary.py
New name: summarize_protein_pathway_associations.py

Description

This script identifies significantly associated protein-pathway pairs using Bonferroni-corrected p-values, merges metadata from antibody and pathway files, summarizes protein involvement across pathways, and quantifies gene overlaps between pathways.


Processing Steps

  1. Bonferroni Correction: Apply correction to p-values and retain significant associations (p < 0.05).
  2. Metadata Merging: Incorporate protein descriptions and pathway names.
  3. Protein Summary: Group by protein ID to summarize all linked pathways, genes, and targets.
  4. Pathway Gene Overlap: For proteins involved in multiple pathways, calculate intersection sizes of pathway gene sets and pathway sizes.

Inputs

File Description
pathway_protein_pvalues.txt Association p-values between proteins and pathways.
TCGA_antibodies_descriptions.gencode.v36.tsv Metadata for TCGA proteins (AGIDs, gene names, targets).
Pathways_GeneIDs.csv List of gene IDs for each pathway.
allpathway_geneid.txt Mapping of pathway IDs to gene ID sets (used for gene overlap).
allnameidlist_hg19.txt Optional gene ID to gene name mapping file.

Output

  • updated_protein_summary.csv: Protein-level summary table including pathway memberships, overlapping gene counts between pathways, and pathway sizes.

Usage Example

run_pipeline(
    pval_file='pathway_protein_pvalues.txt',
    antibody_file='TCGA_antibodies_descriptions.gencode.v36.tsv',
    pathway_file='Pathways_GeneIDs.csv',
    pathway_gene_file='allpathway_geneid.txt',
    gene_id_mapping_file='allnameidlist_hg19.txt',  # Optional if not used
    output_file='updated_protein_summary.csv'
)

About

Repository for the EBSAS project

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published