Despite the success of genotype–phenotype association studies in many complex traits, their application to cancers has largely focused on germline variants, leaving the role of somatic mutations underexplored. This gap arises partly due to the non-inheritable nature of somatic mutations and technical challenges caused by their heterogeneity. Motivated by the gene-centric approach of transcriptome-wide association studies (TWAS) and evidence from somatic expression Quantitative Trait Loci (eQTLs), we developed EBSAS (Expression-Bridged Somatic Association Studies) to identify somatic mutations underlying cancer traits through their aggregated effects on gene expression.
EBSAS shows well-controlled type I error and improved power compared to existing methods. Applied to cancer transcriptomic and proteomic datasets, EBSAS revealed context-specific expression of cancer-associated proteins (e.g., PAX6 in breast cancer, IDO in prostate cancer) that modulate the tumor immune microenvironment in distinct ways. These findings illuminate mechanisms by which somatic mutations influence immune evasion and immunotherapy response.
This repository demonstrates the main analysis pipeline—from raw inputs including germline mutation data, mRNA expression, protein expression, and corresponding sample information (all sourced and integrated from the TCGA data portal)—to identifying protein-pathway associations.
Below are the demo files included in this repository, each showing the structure of input data used in the analyses. These files are for illustration purposes only.
-
Annotated_snp_file.csv
Annotated germline mutation file. -
gdc_sample_SimpleGermlineMutation.tsv
Germline sample information file. -
gdc_sample_GeneExpression.tsv
Gene/mRNA expression sample information file. -
combined_gene_expression_tpm.csv
PEER corrected gene expression file. -
protein_expression_matrix.csv
Protein expression data matrix. -
gdc_sample_Protein.tsv
Protein sample information file. -
Somatic_mutation_matrix.csv
Somatic mutation matrix file. -
allnameidlist_v43GTF_hg38_GRCh38.txt
Gene annotation file containing gene ID, name, start/end positions, etc. -
Gene_weights.txt
Example output file from Jawamix5.sh. -
allpathway_geneid.txt
Pathway gene file listing all genes within each pathway. -
TCGA_antibodies_descriptions.gencode.v36.tsv
Protein annotation details. -
Pathways_GeneIDs.csv
Pathway names, IDs, and associated genes.
Note: All demo files are provided only to illustrate input file structure.
Code name: GermlineMutation_filter.py
This script filters germline SNP data to retain only tumor-related samples, including Primary Tumor and Metastatic tissues. It:
- Reads an annotated SNP matrix and a sample metadata file.
- Selects only tumor-related samples.
- Restricts SNPs to autosomal chromosomes (
chr1–chr22). - Outputs a filtered SNP matrix.
| Argument | Description |
|---|---|
--snp_file |
Path to the input SNP matrix (.csv) with rows as SNPs and columns as sample IDs. |
--mapping_file |
Sample metadata file (.tsv) containing tissue type and sample ID information. |
--output_file |
Path to save the filtered SNP matrix containing only tumor/metastatic samples and SNPs from chr1–chr22. |
python filter_germline_tumor_snps.py \
--snp_file Annotated_snp_file.csv \
--mapping_file gdc_sample_SimpleGermlineMutation.tsv \
--output_file PrimaryTumor_snp_data.csvCode name: Expression_filter.py
This script filters a gene expression matrix to retain only tumor-related samples, including Primary Tumor and Metastatic tissues. It:
- Reads a sample metadata file and a gene expression file.
- Selects only tumor-related samples based on tissue type.
- Outputs a filtered gene expression matrix.
| Argument | Description |
|---|---|
--mapping_file |
Sample metadata file (.tsv) containing sample IDs and tissue type information. |
--expression_file |
Gene expression file (.csv) where columns are TCGA sample IDs and rows are genes. |
--output_file |
Filtered expression matrix containing only tumor/metastatic samples. |
python filter_expression_tumor_samples.py \
--mapping_file gdc_sample_GeneExpression.tsv \
--expression_file combined_gene_expression_tpm.csv \
--output_file filtered_expression_primary_tumor.csvCode name: filter_protein_tumor_samples.py
This script filters a protein expression matrix to retain only tumor-related samples, including Primary Tumor and Metastatic tissues. It is designed for preparing protein-level data for downstream association tests such as SKAT.
- Reads a sample metadata file and a protein expression file.
- Selects only tumor-related samples based on tissue type.
- Outputs a filtered protein expression matrix.
| Argument | Description |
|---|---|
--mapping_file |
Sample metadata file (.tsv) containing Sample ID and Tissue Type columns. |
--protein_file |
Protein expression file (.csv) where samples are in columns, proteins are in rows, and AGID is used as the identifier column. |
--output_file |
Path to save the filtered protein matrix containing only tumor/metastatic samples. |
python filter_protein_tumor_samples.py \
--mapping_file gdc_sample_Protein.tsv \
--protein_file protein_expression_matrix.csv \
--output_file filtered_protein_primary_tumor.csvCode name: rescale_somatic_mutation_by_maf.py
This script rescales values in a somatic mutation matrix based on the number of non-zero entries per row, which serves as a proxy for mutation frequency or Minor Allele Frequency (MAF).
For each row:
- Count the number of non-zero elements.
- Multiply each value in that row by this count.
| Argument | Description |
|---|---|
--input_file |
Path to the somatic mutation matrix (.csv). The first column should contain variant identifiers (e.g., Chrom_Pos), and the remaining columns are sample values. |
--output_file |
Path to save the output CSV with values rescaled by row-wise mutation frequency. |
python rescale_somatic_mutation_by_maf.py \
--input_file Somatic_mutation_matrix.csv \
--output_file Somatic_mutation_matrix_rescaled.csvCode name: aggregate_somatic_mutations_by_gene.py
This script aggregates rescaled somatic mutation data to the gene level by mapping SNPs to nearby genes. SNPs located within ±50 kb of a gene’s coordinates are assigned to that gene, and their weighted mutation values are combined to create a gene-level mutation matrix suitable for downstream association analyses.
- For each SNP, check if it lies within ±50 kb of a gene’s coordinates.
- If so, add the SNP’s mutation value to the corresponding gene’s total for each sample.
- Retain only genes with at least one non-zero entry across samples in the final output.
| Argument | Description |
|---|---|
--annotation_file |
Tab-separated gene annotation file with columns: chr, start, end, gene_id. |
--somatic_file |
CSV file containing rescaled somatic mutations (Chrom_Pos followed by sample columns). |
--output_file |
CSV file to save the aggregated gene-level somatic mutation matrix. |
python aggregate_somatic_mutations_by_gene.py \
--annotation_file allnameidlist_v43GTF_hg38_GRCh38.txt \
--somatic_file Somatic_mutation_matrix_rescaled.csv \
--output_file Somatic_genecomb_sp_50k.csvCode name: harmonize_tcga_samples_across_omics.py
This script ensures sample consistency across different omics data types in TCGA studies, including:
- Gene Expression
- SNP (Germline)
- Somatic Mutation
- Protein Expression
It performs filtering and reordering steps to keep only shared samples (and genes when applicable) for downstream integrative analyses.
| Step | Input Files | Output Files | Description |
|---|---|---|---|
| Expression + SNP | filtered_expression_primary_tumor.csv, PrimaryTumor_snp_data.csv |
filtered_gene_expression.csv, filtered_snp_data.csv |
Keep only shared samples |
| Expression + Somatic | filtered_gene_expression.csv, Somatic_genecomb_sp_50k.csv |
filtered_expression_file2.csv, filtered_reordered_mutation_file.csv |
Keep shared samples & genes, reorder |
| Protein + Somatic | filtered_protein_primary_tumor.csv, filtered_reordered_mutation_file.csv |
filtered_protein_SKAT.csv, filtered_somatic_SKAT.csv |
Match sample columns in protein and somatic files |
- Retains only samples shared across multiple omics datasets
- Maintains gene and sample alignment for downstream analysis
- Modular design for flexible use
Update file paths inside the script as needed, then run:
python harmonize_tcga_samples_across_omics.pyCode name: filter_genes_by_properties.py
This script filters genes from expression and somatic mutation matrices based on gene-level properties and hyperparameters. It applies thresholds to remove potentially uninformative or problematic genes by considering:
- Gene length
- Expression variance
- Mutation frequency (mutation sparsity as a proxy for MAF)
- Gene length: Removes the shortest 1% of genes.
- Expression variance: Removes genes with zero variance (adjustable).
- Mutation sparsity: Removes genes in the lowest 1% of non-zero mutation proportion.
| Argument | Description |
|---|---|
--gene_info_file |
Gene annotation file with chromosome, start, end, and gene ID. |
--expression_file |
Filtered gene expression matrix (.csv, genes × samples). |
--mutation_file |
Filtered somatic mutation matrix (.csv, genes × samples). |
| Argument | Description |
|---|---|
--output_expression |
Expression matrix after gene filtering. |
--output_mutation |
Mutation matrix after gene filtering. |
--output_transposed_expression |
Transposed expression matrix (samples × genes) for downstream analysis. |
python filter_genes_by_properties.py \
--gene_info_file allnameidlist_v43GTF_hg38_GRCh38.txt \
--expression_file filtered_gene_expression.csv \
--mutation_file filtered_reordered_mutation_file.csv \
--output_expression filtered_gene_expression_Association.csv \
--output_mutation filtered_reordered_mutation_Association.csv \
--output_transposed_expression filtered_gene_Tranposed_expression_Association.csvCode name: prepare_snp_for_kernel.py
This script preprocesses SNP data to prepare it for kernel-based association models such as JAWAMIX5. The processing includes:
- Renaming columns
- Filtering autosomal chromosomes (
chr1–chr22) only - Removing the
rsIDcolumn - Sorting rows by genomic location (
CHRandPOS)
- Rename the first two columns to
CHRandPOS. - Drop the third column (
rsID). - Filter to keep only SNPs on autosomal chromosomes (
1–22). - Sort rows by chromosome and position.
| Argument | Description |
|---|---|
--input_file |
Raw SNP matrix (.csv) with first three columns: chromosome, position, rsID, followed by genotype columns. |
| Argument | Description |
|---|---|
--cleaned_output |
SNP matrix after column renaming and autosome filtering. |
--sorted_output |
Final SNP matrix sorted by CHR and POS, suitable for kernel input. |
python prepare_snp_for_kernel.py \
--input_file filtered_snp_data.csv \
--cleaned_output filtered_snp_data_clean.csv \
--sorted_output filtered_snp_data_Kernel_Re.csvCode name: JAWA_kernal.sh
This script runs JAWAMIX5 to preprocess germline SNP data for kernel-based genetic analysis. It performs two main steps:
- Converts the SNP matrix from
.csvformat to.hdf5format. - Computes the kinship matrix using the Realized Relationship Matrix (RRM) method.
| File | Description |
|---|---|
filtered_snp_data_Kernel_Re.csv |
Cleaned and sorted germline SNP matrix in CSV format. |
filtered_snp_data_Kernel.hdf5 |
Intermediate SNP matrix in HDF5 format. |
| File | Description |
|---|---|
filtered_snp_data_Kernel.hdf5 |
SNP matrix in HDF5 format for downstream kinship calculation. |
filtered_snp_RRM |
Kinship matrix generated using the RRM method. |
Code name: JAWA.sh
This shell script automates gene-wise association testing using JAWAMIX5 over a specified range of gene indices. It is designed for high-throughput execution on a cluster or compute node. For each gene in the range, it performs the following steps:
- Generate mutation matrix for the gene using
JAWA_prep.py. - Extract the gene name from the output file.
- Convert the mutation matrix to HDF5 format via
Jawamix5.jarimport. - Run association testing using
Jawamix5.jar emmaxwith the expression data and GRM. - Concatenate results using
Gene_concatenate.py. - Clean up temporary mutation files to save disk space.
Code name: JAWA_prep.py
This script extracts the mutation profile of a single gene (one row) from a full somatic mutation matrix and converts it into a format compatible with JAWAMIX5 for gene-based association testing.
| Argument | Description |
|---|---|
--mutation_file |
Full mutation matrix file (.csv). Default: filtered_reordered_mutation_Association.csv. |
--gene_index |
Row index of the gene to extract. |
--output_dir |
Directory to save the output file. Default: current working directory. |
mutation_gene_<GENE_ID>_index_<INDEX>.csv: One-row mutation matrix with columns:CHR,POS,sample1,sample2, ...
python JAWA_prep.py --gene_index 3570 \
--mutation_file filtered_reordered_mutation_Association.csv \
--output_dir ./JAWA_parallel/Code name: Gene_concatenate.py
This script parses and aggregates gene-wise association results from JAWAMIX5 emmax output files (EMMAX.*.top) into a summary file.
Functionality:
- Extracts gene index and gene name from filenames formatted as
EMMAX.<index>_<gene>.top. - Reads key statistics: p-value, adjusted R², coefficient, standard error, and MAF count from each result file.
- Appends each gene’s results as a line in
Gene_summary.txt. - Deletes processed
.topfiles to keep the working directory clean.
- A directory containing multiple
EMMAX.*.topfiles (output fromJawamix5.jar emmax).
Gene_summary.txt(tab-delimited) with columns:
index,gene_name,pvalue,adjusted_r2,coefficient,sd_err,maf_count
python Gene_concatenate.pyCode name: SKAT_Somatic_prep.py
This script filters gene-level somatic mutation data by pathway and exports a mutation matrix for each pathway. It prepares input data for pathway-level association tests such as SKAT.
| File | Description |
|---|---|
allpathway_geneid.txt |
Plain text file where each line contains a pathway ID followed by gene IDs in that pathway. Example format: Pathway1 GENE0001 GENE0002 GENE0003 Pathway2 GENE0004 GENE0005 |
filtered_somatic_SKAT.csv |
Gene-by-sample somatic mutation matrix including a GeneID column. |
- One CSV file per pathway saved under
Result/Pathway_Somatic/, named{pathway_id}_mutation.csv, containing only genes from that pathway.
python SKAT_Somatic_prep.pyCode name: run_skat_pathway_association.R
This R script performs pathway-level association tests between somatic mutation profiles and protein expression levels using the SKAT (Sequence Kernel Association Test) method. Each pathway is tested using gene-level mutation data weighted by coefficients from JAWAMIX5.
| File/Pattern | Description |
|---|---|
Result/Pathway_Somatic/*.csv |
Gene-by-sample somatic mutation matrices for each pathway. |
filtered_protein_SKAT.csv |
Protein expression matrix with samples as columns. |
Gene_weights.txt |
Gene-level weights (coefficients) derived from JAWAMIX5 results. |
-
pathway_protein_pvalues.txt: Tab-delimited file summarizing SKAT p-values with columns:pathway_id protein_id pvalue hsa00010 Q9H0H5 0.00392 hsa00020 Q96A11 0.00001 ... ... ...
Each row reflects the significance of the association between a pathway’s mutation profile and a protein’s expression.
Run the script for a range of pathway files (useful for parallelization):
Rscript run_skat_pathway_association.R 1 100Code name: SKAT_Summary.py
New name: summarize_protein_pathway_associations.py
This script identifies significantly associated protein-pathway pairs using Bonferroni-corrected p-values, merges metadata from antibody and pathway files, summarizes protein involvement across pathways, and quantifies gene overlaps between pathways.
- Bonferroni Correction: Apply correction to p-values and retain significant associations (
p < 0.05). - Metadata Merging: Incorporate protein descriptions and pathway names.
- Protein Summary: Group by protein ID to summarize all linked pathways, genes, and targets.
- Pathway Gene Overlap: For proteins involved in multiple pathways, calculate intersection sizes of pathway gene sets and pathway sizes.
| File | Description |
|---|---|
pathway_protein_pvalues.txt |
Association p-values between proteins and pathways. |
TCGA_antibodies_descriptions.gencode.v36.tsv |
Metadata for TCGA proteins (AGIDs, gene names, targets). |
Pathways_GeneIDs.csv |
List of gene IDs for each pathway. |
allpathway_geneid.txt |
Mapping of pathway IDs to gene ID sets (used for gene overlap). |
allnameidlist_hg19.txt |
Optional gene ID to gene name mapping file. |
updated_protein_summary.csv: Protein-level summary table including pathway memberships, overlapping gene counts between pathways, and pathway sizes.
run_pipeline(
pval_file='pathway_protein_pvalues.txt',
antibody_file='TCGA_antibodies_descriptions.gencode.v36.tsv',
pathway_file='Pathways_GeneIDs.csv',
pathway_gene_file='allpathway_geneid.txt',
gene_id_mapping_file='allnameidlist_hg19.txt', # Optional if not used
output_file='updated_protein_summary.csv'
)