This repository provides the implementation of a decomposition-driven framework for identifying cancer-specific and tissue-of-origin (TOO) DNA methylation biomarkers from bulk tissue data using Non-negative Matrix Factorization (NMF). A detailed description of the methodology and its applications can be found in our manuscript: "Decomposition-Driven Biomarker Identification Enhances non-Invasive Early Cancer Detection".
A Python 3 environment is required, with commonly used packages including numpy, scipy, scikit-learn, and pandas.
The framework consists of several steps, each implemented as an independent script. The workflow and required data formats for each step are described below.
Code: 01.CancerMarker.NMF.py In this step, NMF is applied to bulk DNA methylation data from cancer and matched normal (e.g., para-carcinoma) tissue samples to decompose composite methylation signals into multiple latent components. These components are subsequently used to identify cancer-associated components and enhance cancer-specific marker selection.
Two data files are used in the step: (1) Bulk DNA methylation matrix; (2) Sample annotation table.
- Row: CpGs
- Column: Samples
- Value: DNA methylation levels (range: 0-1)
| ID | Sample:1 | Sample:2 | Sample:3 | ... | Sample:N |
|---|---|---|---|---|---|
| CpG:1 | 0.0 | 0.23 | 0.86 | ... | 1.0 |
| CpG:2 | 0.08 | 0.07 | 0.34 | ... | 0.98 |
| CpG:3 | 0.10 | 0.32 | 0.19 | ... | 0.65 |
| ... | ... | ... | ... | ... | ... |
| CpG:M | 0.55 | 0.0 | 0.62 | ... | 0.04 |
- Must include a sample ID column and a pathological status column (e.g., Cancer or Normal).
- Used for CpG pre-filtering prior to NMF.
| Sample_ID | Pathological |
|---|---|
| Sample:1 | Cancer |
| Sample:2 | Cancer |
| Sample:3 | Cancer |
| ... | ... |
| Sample:N-1 | Normal |
| Sample:N | Normal |
- CpG pre-selection based on cancer–normal differences.
- CpGs with negligible differences between cancer and normal samples are filtered out.
- ROC-AUC is calculated for each CpG; CpGs with AUC > 0.65 or < 0.35 are retained.
- Coverage filtering
- CpGs with ≥10× sequencing depth in ≥70% of samples are retained.
- Missing value imputation
- KNN-based imputation is applied.
- Internal control
- An artificial CpG with 100% methylation is added as an internal control.
- NMF decomposition
- The bulk methylation matrix is factorized into component-specific methylation profiles and sample-wise component fractions.
- The number of components (K) is typically set to ~10, reflecting the cellular and sub-tissue heterogeneity of cancer tissues (e.g., malignant cells, immune infiltration, cancer-associated fibroblasts, stromal cells, endothelial cells, and residual normal cells).
- Alternative values of K (e.g., 3, 5, 20, 25, 30) are encouraged for sensitivity exploration.
- Post-processing
- The artificial CpG is used to rescale the H matrix to approximately 0–1. Correspondingly, component fractions in the W matrix are normalized such that fractions per sample sum to ~1.
- Values exceeding 1 in the H matrix are truncated.
- Rows: CpGs
- Columns: Components
- Values: Methylation levels (0-1)
| ID | Component:1 | ... | Component:K |
|---|---|---|---|
| CpG:1 | 0.0 | ... | 1.0 |
| CpG:2 | 0.25 | ... | 0.84 |
| CpG:3 | 0.16 | ... | 0.65 |
| ... | ... | ... | ... |
| CpG:M | 0.77 | ... | 0.02 |
- Rows: Components
- Columns: Samples
- Values: Relative contribution of each component per sample
| ID | Sample:1 | Sample:2 | Sample:3 | ... | Sample:N |
|---|---|---|---|---|---|
| Component:1 | 0.21 | 0.43 | 0.16 | ... | 0.17 |
| ... | ... | ... | ... | ... | ... |
| Component:K | 0.34 | 0.06 | 0.32 | ... | 0.25 |
Code: 02.WmatrixAnalysis.py The W matrix is reformatted and visualized as a heatmap to illustrate the distribution of components across samples.
- Input:
- W matrix
- Sample annotation table
- Samples can be sorted by phenotype (e.g., cancer vs. normal).
- This step enables intuitive identification of cancer- and normal-associated components.
Code: 03.cfDNAsignalDistribution.py This step evaluates the effectiveness of CpG marker selection by comparing cfDNA methylation changes between cancer patients and healthy individuals for CpGs selected using the NMF framework versus conventional bulk-level comparisons.
- Preparation
- cfDNA methylation data for each CpG were obtained from the CFEA database. Cancer–healthy cfDNA methylation changes for each CpG were quantified using multiple metrics, including:
- Mean distance between cancer and healthy samples.
- Log fold change between the mean methylation levels of cancer and healthy samples.
- Median distance between cancer and healthy samples.
- Log fold change between the median methylation levels of cancer and healthy samples.
- Difference in average methylation fraction between cancer and healthy samples.
- Log fold change of the average methylation fraction between cancer and healthy samples.
- NMF-based CpG marker selection. The methylation profile of the cancer-specific component was compared with that of the healthy blood cell background by subtracting the corresponding CpG methylation signals. CpGs were then ranked by their differential methylation relative to the blood background for NMF-based marker selection.
- Conventional CpG marker selection. Bulk tissue methylation signals were compared between cancer and normal samples for each CpG using multiple metrics, including center distance, log fold change, ROC AUC, and p-values from the Wilcoxon rank-sum (U) test.
- cfDNA methylation data for each CpG were obtained from the CFEA database. Cancer–healthy cfDNA methylation changes for each CpG were quantified using multiple metrics, including:
- Evaluation
- For each selection strategy, the top 1,000 hyper-methylated CpGs and the top 1,000 hypo-methylated CpGs were selected based on:
- NMF-derived cancer-specific component vs. healthy blood cells.
- Bulk cancer vs. normal tissue, CpGs ranked by center distance.
- Bulk cancer vs. normal tissue, CpGs ranked by log fold change.
- Bulk cancer vs. normal tissue, CpGs ranked by ROC AUC.
- Bulk cancer vs. normal tissue, CpGs ranked by p-value from the U-test.
- For hyper- and hypo-methylated CpGs separately, CpGs selected by the NMF framework were compared with those selected by each conventional approach based on their cancer–healthy cfDNA methylation changes.
- For each selection strategy, the top 1,000 hyper-methylated CpGs and the top 1,000 hypo-methylated CpGs were selected based on:
This table is used to select top CpGs with hyper- or hypo-methylation signals based on the cancer-specific component in comparison with the healthy blood cell background. Each row corresponds to a single CpG. The table should include the following columns:
- CpG genomic coordinate
- Methylation level in the cancer-specific component
- Average methylation level in healthy blood cells
- Methylation difference between the cancer-specific component and healthy blood cells
- Cancer–healthy cfDNA methylation change quantified by mean distance
- Cancer–healthy cfDNA methylation change quantified by log fold change of the mean
- Cancer–healthy cfDNA methylation change quantified by median distance
- Cancer–healthy cfDNA methylation change quantified by log fold change of the median
- Cancer–healthy cfDNA methylation change quantified by the difference in average methylation fraction across samples
- Cancer–healthy cfDNA methylation change quantified by log fold change of the average methylation fraction across samples
This table is used to select top CpGs with hyper- or hypo-methylation signals in cancer tissues compared with normal tissue samples. Each row corresponds to a single CpG. The table should include the following columns:
- CpG genomic coordinate
- Mean methylation level in cancer samples
- Mean methylation level in normal samples
- Median methylation level in cancer samples
- Median methylation level in normal samples
- Center distance between cancer and normal samples
- Log fold change between cancer and normal samples
- ROC AUC distinguishing cancer and normal samples
- P-value from the Wilcoxon rank-sum (U) test between cancer and normal samples
- Cancer–healthy cfDNA methylation change quantified by mean distance
- Cancer–healthy cfDNA methylation change quantified by log fold change of the mean
- Cancer–healthy cfDNA methylation change quantified by median distance
- Cancer–healthy cfDNA methylation change quantified by log fold change of the median
- Cancer–healthy cfDNA methylation change quantified by the difference in average methylation fraction across samples
- Cancer–healthy cfDNA methylation change quantified by log fold change of the average methylation fraction across samples
Code: 04.TOOMarker.NMF.py In this step, NMF is performed jointly on bulk DNA methylation profiles from cancer tissues and their corresponding normal tissues (e.g., para-carcinoma samples) across multiple cancer types. The objective is to decompose bulk methylation signals into a set of latent components, which are subsequently used to identify cancer type–specific components and improve the effectiveness of cancer type–specific marker discovery.
It is strongly recommended to include healthy blood cell samples in the NMF decomposition, so that blood-derived background signals can be separated from tissue-derived methylation signals.
The cancer type–specific markers identified in this step will be used in the tissue-of-origin localization module of a multi-cancer early detection (MCED) assay, enabling accurate cancer type determination in positive cases.
Three input data files are required for this step: (1) Raw bulk DNA methylation matrix (2) Sample annotation table (3) CpG-level statistical summary table.
- Rows: CpGs
- Columns: Samples
- Values: Bulk DNA methylation levels ranging from 0 to 1
| ID | Sample:1 | Sample:2 | Sample:3 | ... | Sample:N |
|---|---|---|---|---|---|
| CpG:1 | 0.07 | 0.54 | 0.66 | ... | 1.0 |
| CpG:2 | 0.21 | 0.17 | 0.31 | ... | 0.83 |
| CpG:3 | 0.10 | 0.0 | 0.18 | ... | 0.57 |
| ... | ... | ... | ... | ... | ... |
| CpG:M | 0.35 | 0.05 | 0.92 | ... | 1.0 |
This table must contain:
- A sample ID column
- A pathological status column (e.g., Cancer or Normal)
- A tissue type column This table is used for CpG pre-filtering prior to NMF decomposition.
| Sample_ID | Pathological | Tissue_type |
|---|---|---|
| Sample:1 | Cancer | Esophageal |
| Sample:2 | Cancer | Esophageal |
| ... | ... | ... |
| Sample:# | Normal | Esophageal |
| Sample:# | Normal | Esophageal |
| ... | ... | ... |
| Sample:# | Cancer | Gastric |
| Sample:# | Cancer | Gastric |
| ... | ... | ... |
| Sample:# | Normal | Gastric |
| Sample:# | Normal | Gastric |
| ... | ... | ... |
| Sample:# | Cancer | Colorectal |
| Sample:# | Cancer | Colorectal |
| ... | ... | ... |
| Sample:# | Normal | Colorectal |
| Sample:# | Normal | Colorectal |
| ... | ... | ... |
| Sample:# | Cancer | Liver |
| Sample:# | Cancer | Liver |
| ... | ... | ... |
| Sample:# | Normal | Liver |
| Sample:# | Normal | Liver |
| ... | ... | ... |
| Sample:# | Cancer | Pancreatic |
| Sample:# | Cancer | Pancreatic |
| ... | ... | ... |
| Sample:# | Normal | Pancreatic |
| Sample:# | Normal | Pancreatic |
| ... | ... | ... |
| Sample:# | Normal | Blood_cell |
| Sample:# | Normal | Blood_cell |
| ... | ... | ... |
| Sample:N-1 | Normal | Blood_cell |
| Sample:N | Normal | Blood_cell |
This table contains precomputed statistical metrics for each CpG site and is used in the CpG pre-selection step. Each row corresponds to a single CpG site. The table should include the following columns:
- ANOVA p-values
- The p-value obtained from ANOVA performed across cancer samples from multiple cancer types.
- The p-value obtained from ANOVA performed across normal (para-carcinoma) samples from multiple tissue types.
- Sequencing coverage statistics for cancer samples
- For each cancer type, the percentage of samples in which the corresponding CpG site has a sequencing coverage of ≥10×.
- Sequencing coverage statistics for normal samples
- For each normal tissue type, the percentage of samples in which the corresponding CpG site has a sequencing coverage of ≥10×. These metrics collectively enable the identification of CpG sites that exhibit tissue- or cancer-type–associated methylation patterns while maintaining sufficient sequencing coverage for robust downstream analysis.
- Preparation
- For each CpG site, perform:
- ANOVA across cancer samples from different cancer types
- ANOVA across normal (para-carcinoma) samples from different tissue types
- This evaluates whether the CpG harbors tissue-type– or cancer-type–associated methylation signals.
- For each CpG site, perform:
- CpG Pre-selection
- CpGs are retained if they satisfy either of the following:
- ANOVA p < 0.05 among cancer samples across cancer types
- ANOVA p < 0.05 among normal samples across tissue types
- Additionally, CpGs must have:
- ≥10× sequencing coverage in ≥50% of cancer samples and ≥50% of normal samples
- CpGs are retained if they satisfy either of the following:
- Missing Value Imputation
- Missing methylation values are imputed using the K-nearest neighbors (KNN) method.
- Internal Control CpG Addition
- An artificial CpG with 100% methylation is added as an internal control for downstream normalization.
- NMF Decomposition
- NMF is applied to the filtered bulk DNA methylation matrix.
- Parameter K (number of components) should be selected based on the number of cancer types. If there are X cancer types, it is recommended to set K ≥ 2X + 3, accounting for:
- X cancer type–specific components
- X corresponding normal tissue–specific components
- One shared cancer component
- One shared normal tissue component
- One immune-related component
- Testing alternative K values is encouraged to gain intuitive understanding of the decomposition.
- Initial NMF Outputs
- H matrix (methylation profile matrix):
- Rows: CpGs
- Columns: Components
- W matrix (fraction matrix):
- Rows: Components
- Columns: Samples
- H matrix (methylation profile matrix):
- Normalization
- The artificial CpG is used to rescale: Methylation values in the H matrix to approximately the range 0–1
- Component fractions in the W matrix so that the sum per sample is approximately 1
- Final Adjustment
- Values in the H matrix exceeding 1 are truncated to 1.
- The resulting H and W matrices are treated as the final NMF outputs.
- Rows: CpGs
- Columns: Components
- Values: Methylation levels (0–1)
| ID | Component:1 | ... | Component:K |
|---|---|---|---|
| CpG:1 | 1.0 | ... | 0.0 |
| CpG:2 | 0.55 | ... | 0.89 |
| CpG:3 | 0.24 | ... | 0.35 |
| ... | ... | ... | ... |
| CpG:M | 0.37 | ... | 0.08 |
- Rows: Components
- Columns: Samples
- Values: Component fractions
| ID | Sample:1 | Sample:2 | Sample:3 | ... | Sample:N |
|---|---|---|---|---|---|
| Component:1 | 0.15 | 0.23 | 0.11 | ... | 0.16 |
| ... | ... | ... | ... | ... | ... |
| Component:K | 0.14 | 0.03 | 0.28 | ... | 0.05 |
Code: 05.WmatrixStat.py In this step, statistical tests are performed on the component fractions in the W matrix to determine whether individual components are significantly enriched in specific sample groups, such as a particular cancer type. This analysis enables the identification of cancer type–specific components, which will be used in subsequent downstream analyses.
- W matrix generated from multi-cancer NMF
- Please refer to Section 4.3 for a detailed description of the W matrix format.
- Sample table
- Please refer to Section 4.1 for the description of the sample table.
- Group annotations in the sample table (e.g., cancer type and healthy blood cells) are used to define sample groups for statistical testing.
- For each NMF component, pairwise t-tests are performed on the component fraction values between each cancer type and all other cancer types, as well as healthy blood cell samples.
- A component is considered a candidate cancer type–specific component if it is significantly enriched in one cancer type compared with all other cancer types and healthy blood cells.
Code: 06.HmatrixAnalysis.py In this step, the H matrix is reformatted and organized to facilitate downstream visualization of methylation profiles and the identification of cancer type–specific CpG markers.
- The H matrix generated from the multi-cancer NMF decomposition is required as the input for this step.