EBSAS: Expression-Bridged Somatic Association Studies

Despite the success of genotype–phenotype association studies in many complex traits, their application to cancers has largely focused on germline variants, leaving the role of somatic mutations underexplored. This gap arises partly due to the non-inheritable nature of somatic mutations and technical challenges caused by their heterogeneity. Motivated by the gene-centric approach of transcriptome-wide association studies (TWAS) and evidence from somatic expression Quantitative Trait Loci (eQTLs), we developed EBSAS (Expression-Bridged Somatic Association Studies) to identify somatic mutations underlying cancer traits through their aggregated effects on gene expression.

EBSAS shows well-controlled type I error and improved power compared to existing methods. Applied to cancer transcriptomic and proteomic datasets, EBSAS revealed context-specific expression of cancer-associated proteins (e.g., PAX6 in breast cancer, IDO in prostate cancer) that modulate the tumor immune microenvironment in distinct ways. These findings illuminate mechanisms by which somatic mutations influence immune evasion and immunotherapy response.

This repository demonstrates the main analysis pipeline—from raw inputs including germline mutation data, mRNA expression, protein expression, and corresponding sample information (all sourced and integrated from the TCGA data portal)—to identifying protein-pathway associations.

Demo Files Description

Below are the demo files included in this repository, each showing the structure of input data used in the analyses. These files are for illustration purposes only.

Annotated_snp_file.csv
Annotated germline mutation file.
gdc_sample_SimpleGermlineMutation.tsv
Germline sample information file.
gdc_sample_GeneExpression.tsv
Gene/mRNA expression sample information file.
combined_gene_expression_tpm.csv
PEER corrected gene expression file.
protein_expression_matrix.csv
Protein expression data matrix.
gdc_sample_Protein.tsv
Protein sample information file.
Somatic_mutation_matrix.csv
Somatic mutation matrix file.
allnameidlist_v43GTF_hg38_GRCh38.txt
Gene annotation file containing gene ID, name, start/end positions, etc.
Gene_weights.txt
Example output file from Jawamix5.sh.
allpathway_geneid.txt
Pathway gene file listing all genes within each pathway.
TCGA_antibodies_descriptions.gencode.v36.tsv
Protein annotation details.
Pathways_GeneIDs.csv
Pathway names, IDs, and associated genes.

Note: All demo files are provided only to illustrate input file structure.

Code Files Description

Germline SNP Filtering for Tumor Samples

Code name: GermlineMutation_filter.py

Description

This script filters germline SNP data to retain only tumor-related samples, including Primary Tumor and Metastatic tissues. It:

Reads an annotated SNP matrix and a sample metadata file.
Selects only tumor-related samples.
Restricts SNPs to autosomal chromosomes (chr1–chr22).
Outputs a filtered SNP matrix.

Inputs

Argument	Description
`--snp_file`	Path to the input SNP matrix (`.csv`) with rows as SNPs and columns as sample IDs.
`--mapping_file`	Sample metadata file (`.tsv`) containing tissue type and sample ID information.
`--output_file`	Path to save the filtered SNP matrix containing only tumor/metastatic samples and SNPs from `chr1`–`chr22`.

Usage Example

python filter_germline_tumor_snps.py \
  --snp_file Annotated_snp_file.csv \
  --mapping_file gdc_sample_SimpleGermlineMutation.tsv \
  --output_file PrimaryTumor_snp_data.csv

Gene Expression Filtering for Tumor Samples

Code name: Expression_filter.py

Description

This script filters a gene expression matrix to retain only tumor-related samples, including Primary Tumor and Metastatic tissues. It:

Reads a sample metadata file and a gene expression file.
Selects only tumor-related samples based on tissue type.
Outputs a filtered gene expression matrix.

Inputs

Argument	Description
`--mapping_file`	Sample metadata file (`.tsv`) containing sample IDs and tissue type information.
`--expression_file`	Gene expression file (`.csv`) where columns are TCGA sample IDs and rows are genes.
`--output_file`	Filtered expression matrix containing only tumor/metastatic samples.

Usage Example

python filter_expression_tumor_samples.py \
  --mapping_file gdc_sample_GeneExpression.tsv \
  --expression_file combined_gene_expression_tpm.csv \
  --output_file filtered_expression_primary_tumor.csv

Protein Expression Filtering for Tumor Samples

Code name: filter_protein_tumor_samples.py

Description

This script filters a protein expression matrix to retain only tumor-related samples, including Primary Tumor and Metastatic tissues. It is designed for preparing protein-level data for downstream association tests such as SKAT.

Reads a sample metadata file and a protein expression file.
Selects only tumor-related samples based on tissue type.
Outputs a filtered protein expression matrix.

Inputs

Argument	Description
`--mapping_file`	Sample metadata file (`.tsv`) containing Sample ID and Tissue Type columns.
`--protein_file`	Protein expression file (`.csv`) where samples are in columns, proteins are in rows, and `AGID` is used as the identifier column.
`--output_file`	Path to save the filtered protein matrix containing only tumor/metastatic samples.

Usage Example

python filter_protein_tumor_samples.py \
  --mapping_file gdc_sample_Protein.tsv \
  --protein_file protein_expression_matrix.csv \
  --output_file filtered_protein_primary_tumor.csv

Somatic Mutation Matrix Rescaling by Mutation Frequency

Code name: rescale_somatic_mutation_by_maf.py

Description

This script rescales values in a somatic mutation matrix based on the number of non-zero entries per row, which serves as a proxy for mutation frequency or Minor Allele Frequency (MAF).

For each row:

Count the number of non-zero elements.
Multiply each value in that row by this count.

Inputs

Argument	Description
`--input_file`	Path to the somatic mutation matrix (`.csv`). The first column should contain variant identifiers (e.g., `Chrom_Pos`), and the remaining columns are sample values.
`--output_file`	Path to save the output CSV with values rescaled by row-wise mutation frequency.

Usage Example

python rescale_somatic_mutation_by_maf.py \
  --input_file Somatic_mutation_matrix.csv \
  --output_file Somatic_mutation_matrix_rescaled.csv

Gene-Level Aggregation of Somatic Mutations

Code name: aggregate_somatic_mutations_by_gene.py

Description

This script aggregates rescaled somatic mutation data to the gene level by mapping SNPs to nearby genes. SNPs located within ±50 kb of a gene’s coordinates are assigned to that gene, and their weighted mutation values are combined to create a gene-level mutation matrix suitable for downstream association analyses.

Method

For each SNP, check if it lies within ±50 kb of a gene’s coordinates.
If so, add the SNP’s mutation value to the corresponding gene’s total for each sample.
Retain only genes with at least one non-zero entry across samples in the final output.

Inputs

Argument	Description
`--annotation_file`	Tab-separated gene annotation file with columns: `chr`, `start`, `end`, `gene_id`.
`--somatic_file`	CSV file containing rescaled somatic mutations (`Chrom_Pos` followed by sample columns).
`--output_file`	CSV file to save the aggregated gene-level somatic mutation matrix.

Usage Example

python aggregate_somatic_mutations_by_gene.py \
  --annotation_file allnameidlist_v43GTF_hg38_GRCh38.txt \
  --somatic_file Somatic_mutation_matrix_rescaled.csv \
  --output_file Somatic_genecomb_sp_50k.csv

Sample Consistency Across Multi-Omics Data

Code name: harmonize_tcga_samples_across_omics.py

Description

This script ensures sample consistency across different omics data types in TCGA studies, including:

Gene Expression
SNP (Germline)
Somatic Mutation
Protein Expression

It performs filtering and reordering steps to keep only shared samples (and genes when applicable) for downstream integrative analyses.

Input/Output Overview

Step	Input Files	Output Files	Description
Expression + SNP	`filtered_expression_primary_tumor.csv`, `PrimaryTumor_snp_data.csv`	`filtered_gene_expression.csv`, `filtered_snp_data.csv`	Keep only shared samples
Expression + Somatic	`filtered_gene_expression.csv`, `Somatic_genecomb_sp_50k.csv`	`filtered_expression_file2.csv`, `filtered_reordered_mutation_file.csv`	Keep shared samples & genes, reorder
Protein + Somatic	`filtered_protein_primary_tumor.csv`, `filtered_reordered_mutation_file.csv`	`filtered_protein_SKAT.csv`, `filtered_somatic_SKAT.csv`	Match sample columns in protein and somatic files

Features

Retains only samples shared across multiple omics datasets
Maintains gene and sample alignment for downstream analysis
Modular design for flexible use

Usage

Update file paths inside the script as needed, then run:

python harmonize_tcga_samples_across_omics.py

Gene Filtering Based on Expression and Mutation Properties

Code name: filter_genes_by_properties.py

Description

This script filters genes from expression and somatic mutation matrices based on gene-level properties and hyperparameters. It applies thresholds to remove potentially uninformative or problematic genes by considering:

Gene length
Expression variance
Mutation frequency (mutation sparsity as a proxy for MAF)

Filtering Criteria

Gene length: Removes the shortest 1% of genes.
Expression variance: Removes genes with zero variance (adjustable).
Mutation sparsity: Removes genes in the lowest 1% of non-zero mutation proportion.

Inputs

Argument	Description
`--gene_info_file`	Gene annotation file with chromosome, start, end, and gene ID.
`--expression_file`	Filtered gene expression matrix (`.csv`, genes × samples).
`--mutation_file`	Filtered somatic mutation matrix (`.csv`, genes × samples).

Outputs

Argument	Description
`--output_expression`	Expression matrix after gene filtering.
`--output_mutation`	Mutation matrix after gene filtering.
`--output_transposed_expression`	Transposed expression matrix (samples × genes) for downstream analysis.

Usage Example

python filter_genes_by_properties.py \
  --gene_info_file allnameidlist_v43GTF_hg38_GRCh38.txt \
  --expression_file filtered_gene_expression.csv \
  --mutation_file filtered_reordered_mutation_file.csv \
  --output_expression filtered_gene_expression_Association.csv \
  --output_mutation filtered_reordered_mutation_Association.csv \
  --output_transposed_expression filtered_gene_Tranposed_expression_Association.csv

SNP Preprocessing for Kernel-Based Association Models

Code name: prepare_snp_for_kernel.py

Description

This script preprocesses SNP data to prepare it for kernel-based association models such as JAWAMIX5. The processing includes:

Renaming columns
Filtering autosomal chromosomes (chr1–chr22) only
Removing the rsID column
Sorting rows by genomic location (CHR and POS)

Processing Steps

Rename the first two columns to CHR and POS.
Drop the third column (rsID).
Filter to keep only SNPs on autosomal chromosomes (1–22).
Sort rows by chromosome and position.

Inputs

Argument	Description
`--input_file`	Raw SNP matrix (`.csv`) with first three columns: chromosome, position, rsID, followed by genotype columns.

Outputs

Argument	Description
`--cleaned_output`	SNP matrix after column renaming and autosome filtering.
`--sorted_output`	Final SNP matrix sorted by `CHR` and `POS`, suitable for kernel input.

Usage Example

python prepare_snp_for_kernel.py \
  --input_file filtered_snp_data.csv \
  --cleaned_output filtered_snp_data_clean.csv \
  --sorted_output filtered_snp_data_Kernel_Re.csv

JAWAMIX5 Kernel Preprocessing Script

Code name: JAWA_kernal.sh

Description

This script runs JAWAMIX5 to preprocess germline SNP data for kernel-based genetic analysis. It performs two main steps:

Converts the SNP matrix from .csv format to .hdf5 format.
Computes the kinship matrix using the Realized Relationship Matrix (RRM) method.

Inputs

File	Description
`filtered_snp_data_Kernel_Re.csv`	Cleaned and sorted germline SNP matrix in CSV format.
`filtered_snp_data_Kernel.hdf5`	Intermediate SNP matrix in HDF5 format.

Outputs

File	Description
`filtered_snp_data_Kernel.hdf5`	SNP matrix in HDF5 format for downstream kinship calculation.
`filtered_snp_RRM`	Kinship matrix generated using the RRM method.

Automated Gene-Wise Association Testing with JAWAMIX5

Code name: JAWA.sh

Description

This shell script automates gene-wise association testing using JAWAMIX5 over a specified range of gene indices. It is designed for high-throughput execution on a cluster or compute node. For each gene in the range, it performs the following steps:

Generate mutation matrix for the gene using JAWA_prep.py.
Extract the gene name from the output file.
Convert the mutation matrix to HDF5 format via Jawamix5.jar import.
Run association testing using Jawamix5.jar emmax with the expression data and GRM.
Concatenate results using Gene_concatenate.py.
Clean up temporary mutation files to save disk space.

Extract Single Gene Mutation Profile for JAWAMIX5

Code name: JAWA_prep.py

Description

This script extracts the mutation profile of a single gene (one row) from a full somatic mutation matrix and converts it into a format compatible with JAWAMIX5 for gene-based association testing.

Inputs

Argument	Description
`--mutation_file`	Full mutation matrix file (`.csv`). Default: `filtered_reordered_mutation_Association.csv`.
`--gene_index`	Row index of the gene to extract.
`--output_dir`	Directory to save the output file. Default: current working directory.

Output

mutation_gene_<GENE_ID>_index_<INDEX>.csv: One-row mutation matrix with columns: CHR, POS, sample1, sample2, ...

Example Usage

python JAWA_prep.py --gene_index 3570 \
  --mutation_file filtered_reordered_mutation_Association.csv \
  --output_dir ./JAWA_parallel/

Aggregate Gene-Wise Association Results

Code name: Gene_concatenate.py

Description

This script parses and aggregates gene-wise association results from JAWAMIX5 emmax output files (EMMAX.*.top) into a summary file.

Functionality:

Extracts gene index and gene name from filenames formatted as EMMAX.<index>_<gene>.top.
Reads key statistics: p-value, adjusted R², coefficient, standard error, and MAF count from each result file.
Appends each gene’s results as a line in Gene_summary.txt.
Deletes processed .top files to keep the working directory clean.

Input

A directory containing multiple EMMAX.*.top files (output from Jawamix5.jar emmax).

Output

Gene_summary.txt (tab-delimited) with columns:
index, gene_name, pvalue, adjusted_r2, coefficient, sd_err, maf_count

Example Usage

python Gene_concatenate.py

Pathway-Level Filtering of Somatic Mutation Data for SKAT

Code name: SKAT_Somatic_prep.py

Description

This script filters gene-level somatic mutation data by pathway and exports a mutation matrix for each pathway. It prepares input data for pathway-level association tests such as SKAT.

Inputs

File	Description
`allpathway_geneid.txt`	Plain text file where each line contains a pathway ID followed by gene IDs in that pathway. Example format: Pathway1 GENE0001 GENE0002 GENE0003 Pathway2 GENE0004 GENE0005
`filtered_somatic_SKAT.csv`	Gene-by-sample somatic mutation matrix including a `GeneID` column.

Outputs

One CSV file per pathway saved under Result/Pathway_Somatic/, named {pathway_id}_mutation.csv, containing only genes from that pathway.

Example Usage

python SKAT_Somatic_prep.py

Pathway-Level Association Testing Using SKAT in R

Code name: run_skat_pathway_association.R

Description

This R script performs pathway-level association tests between somatic mutation profiles and protein expression levels using the SKAT (Sequence Kernel Association Test) method. Each pathway is tested using gene-level mutation data weighted by coefficients from JAWAMIX5.

Inputs

File/Pattern	Description
`Result/Pathway_Somatic/*.csv`	Gene-by-sample somatic mutation matrices for each pathway.
`filtered_protein_SKAT.csv`	Protein expression matrix with samples as columns.
`Gene_weights.txt`	Gene-level weights (coefficients) derived from JAWAMIX5 results.

Output

pathway_protein_pvalues.txt: Tab-delimited file summarizing SKAT p-values with columns:

pathway_id protein_id pvalue

hsa00010 Q9H0H5 0.00392

hsa00020 Q96A11 0.00001

... ... ...

Each row reflects the significance of the association between a pathway’s mutation profile and a protein’s expression.

Usage

Run the script for a range of pathway files (useful for parallelization):

Rscript run_skat_pathway_association.R 1 100

Protein-Pathway Association Summary and Metadata Integration

Code name: SKAT_Summary.py
New name: summarize_protein_pathway_associations.py

Description

This script identifies significantly associated protein-pathway pairs using Bonferroni-corrected p-values, merges metadata from antibody and pathway files, summarizes protein involvement across pathways, and quantifies gene overlaps between pathways.

Processing Steps

Bonferroni Correction: Apply correction to p-values and retain significant associations (p < 0.05).
Metadata Merging: Incorporate protein descriptions and pathway names.
Protein Summary: Group by protein ID to summarize all linked pathways, genes, and targets.
Pathway Gene Overlap: For proteins involved in multiple pathways, calculate intersection sizes of pathway gene sets and pathway sizes.

Inputs

File	Description
`pathway_protein_pvalues.txt`	Association p-values between proteins and pathways.
`TCGA_antibodies_descriptions.gencode.v36.tsv`	Metadata for TCGA proteins (AGIDs, gene names, targets).
`Pathways_GeneIDs.csv`	List of gene IDs for each pathway.
`allpathway_geneid.txt`	Mapping of pathway IDs to gene ID sets (used for gene overlap).
`allnameidlist_hg19.txt`	Optional gene ID to gene name mapping file.

Output

updated_protein_summary.csv: Protein-level summary table including pathway memberships, overlapping gene counts between pathways, and pathway sizes.

Usage Example

run_pipeline(
    pval_file='pathway_protein_pvalues.txt',
    antibody_file='TCGA_antibodies_descriptions.gencode.v36.tsv',
    pathway_file='Pathways_GeneIDs.csv',
    pathway_gene_file='allpathway_geneid.txt',
    gene_id_mapping_file='allnameidlist_hg19.txt',  # Optional if not used
    output_file='updated_protein_summary.csv'
)

Name		Name	Last commit message	Last commit date
Latest commit History 40 Commits
Codes		Codes
Demo		Demo
LICENSE		LICENSE
README.md		README.md
jawamix5.jar		jawamix5.jar

pathway_id	protein_id	pvalue
hsa00010	Q9H0H5	0.00392
hsa00020	Q96A11	0.00001
...	...	...

License

QingrunZhangLab/EBSAS

Folders and files

Latest commit

History

Repository files navigation

EBSAS: Expression-Bridged Somatic Association Studies

Demo Files Description

Code Files Description

Germline SNP Filtering for Tumor Samples

Description

Inputs

Usage Example

Gene Expression Filtering for Tumor Samples

Description

Inputs

Usage Example

Protein Expression Filtering for Tumor Samples

Description

Inputs

Usage Example

Somatic Mutation Matrix Rescaling by Mutation Frequency

Description

Inputs

Usage Example

Gene-Level Aggregation of Somatic Mutations

Description

Method

Inputs

Usage Example

Sample Consistency Across Multi-Omics Data

Description

Input/Output Overview

Features

Usage

Gene Filtering Based on Expression and Mutation Properties

Description

Filtering Criteria

Inputs

Outputs

Usage Example

SNP Preprocessing for Kernel-Based Association Models

Description

Processing Steps

Inputs

Outputs

Usage Example

JAWAMIX5 Kernel Preprocessing Script

Description

Inputs

Outputs

Automated Gene-Wise Association Testing with JAWAMIX5

Description

Extract Single Gene Mutation Profile for JAWAMIX5

Description

Inputs

Output

Example Usage

Aggregate Gene-Wise Association Results

Description

Input

Output

Example Usage

Pathway-Level Filtering of Somatic Mutation Data for SKAT

Description

Inputs

Outputs

Example Usage

Pathway-Level Association Testing Using SKAT in R

Description

Inputs

Output

Usage

Protein-Pathway Association Summary and Metadata Integration

Description

Processing Steps

Inputs

Output

Usage Example

About

Packages