The results of this study are used to standardize the tumor names in the CT database so that they can be integrated with other biomedical databases for downstream analysis and for understanding the therapeutic agents and drug-target landscape of a given tumor.
Following are the steps for running the pipeline:
- Clone this GitHub repository to your local machine
- Navigate to the following Open Science Framework (OSF) website to find the embedding data files needed to run CANTOS: https://doi.org/10.17605/OSF.IO/DBGWN
- Under the Files section, navigate to the Embeddings folder, which contains several embedding files that need to be downloaded.
- Click the button with three vertical dots on the right side of each embedding file and select Download to initiate the download.
- Locate the downloaded files on your machine and store them in the data directory of the cloned GitHub repository:
# | File Name | Directory |
---|---|---|
1 | CT_Embeddings_ADA2.csv | CANTOS/data |
2 | CT_Embeddings_V3.csv | CANTOS/data |
3 | NCIT_Embeddings_V3.csv | CANTOS/data |
4 | WHO_Aggregate_ADA2.csv | CANTOS/data |
5 | WHO_Terms_All_V3.csv | CANTOS/data |
6 | all-MiniLM-L12-v2.csv | CANTOS/data |
7 | all_MiniLM_L6_v2.csv | CANTOS/data |
8 | all_embedding_llama_33_70b.csv | CANTOS/data |
9 | all_mpnet_base_v2.csv | CANTOS/data |
10 | biobert_embedding.csv | CANTOS/data |
11 | cohere_embeddings_embed_english_v2.csv | CANTOS/data |
12 | deepseek_8b.csv | CANTOS/data |
13 | e5-large-v2.csv | CANTOS/data |
14 | e5-large.csv | CANTOS/data |
15 | embeddings_llama.csv | CANTOS/data |
16 | gtr-t5-large.csv | CANTOS/data |
17 | llama32_3B.csv | CANTOS/data |
18 | medllama-13b.csv | CANTOS/data |
19 | medllama-7b.csv | CANTOS/data |
20 | mordernbert_embeddings.csv | CANTOS/data |
21 | nomic-embed-text.csv | CANTOS/data |
22 | output_tumor_embeddings_biogpt.csv | CANTOS/data |
23 | output_tumor_embeddings_clinicalBERT.csv | CANTOS/data |
24 | phi4.csv | CANTOS/data |
25 | pubmedbert-base-embeddings.csv | CANTOS/data |
26 | tumor_embeddings_labse.csv | CANTOS/data |
27 | tumor_embeddings_sapbert.csv | CANTOS/data |
28 | tumor_embeddings_scibert.csv | CANTOS/data |
Please note that the ADA-002 embeddings file for NCIT is located at the following path:
CANTOS/data/dt_input_file_6_dec/NCIT_Neoplasm_Core_terms_text-embedding-ada-002_embeddings.csv
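A quick way to verify that everything is in place is to check for the files from R. The snippet below is a minimal sketch, assuming the cloned repository root is the working directory; extend the vector with the remaining file names from the table above.

```r
# Minimal check that the downloaded embedding files are present in CANTOS/data
# (run from the repository root; file names are taken from the table above)
expected <- c(
  "CT_Embeddings_ADA2.csv", "CT_Embeddings_V3.csv", "NCIT_Embeddings_V3.csv",
  "WHO_Aggregate_ADA2.csv", "WHO_Terms_All_V3.csv", "all-MiniLM-L12-v2.csv"
  # ... add the remaining file names from the table ...
)
missing <- expected[!file.exists(file.path("data", expected))]
if (length(missing) > 0) {
  stop("Missing embedding files: ", paste(missing, collapse = ", "))
}
```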
Before running CANTOS, please ensure that:
- R is installed on your machine. It can be downloaded from the following website: https://www.r-project.org/
- The libraries listed in the Library section below are installed (an installation sketch follows that list).
- In the following scripts, the makeCluster argument is set to the number of cores available on your machine. The number of available cores can be found using the detectCores() function from the parallel library (see the sketch after this list of scripts):
02-calculate-edit-distance-5thed.R
02-calculate-edit-distance.R
07A-annotate-cluster-result-NCIT-WHO-5thed.R
07A-annotate-cluster-result-NCIT-WHO.R
07B-annotate-cluster-result-V3-NCIT-WHO-5thed.R
07B-annotate-cluster-result-V3-NCIT-WHO.R
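For example, the available cores can be checked and the cluster size adjusted as in the minimal sketch below; the actual variable names inside the scripts may differ.

```r
library(parallel)

n_cores <- detectCores()   # total logical cores reported for this machine
n_cores

# The listed scripts create a cluster along these lines; edit the argument
# so it does not exceed the cores available on your machine.
cl <- makeCluster(max(1, n_cores - 1))   # leave one core free
# ... parallel work performed by the script ...
stopCluster(cl)
```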
We ran CANTOS in RStudio version 2023.09.1+494 using R version 4.4.0 (2024-04-24). Users can also run CANTOS from the command line from the following directory
CANTOS/analysis
using the following command:
bash CANTOS.sh
This repository contains the code, tables, and plots associated with the CT Embedding paper. The pipeline built in this repository performs the following tasks:
- Extract tumor names from the CT database if they are associated with an NCT ID and have an associated drug belonging to the intervention categories Drug, Biological, Combination Product, or Genetic. A total of 50,410 condition names are extracted.
- These 50,410 condition names are flagged as tumors or non-tumors by the pipeline and are then further manually annotated as pediatric or adult tumors. A total of 13,230 tumors are identified from the 50,410 conditions, and of these, 6,324 were classified as pediatric tumors.
- Compute the distances between each of the 13,230 clinical trial tumors and the 4,720 WHO tumors and 1,395 NCIT tumors. The distance metrics used are Levenshtein, Cosine, and Jaro-Winkler.
- Find the closest matching WHO term for each tumor under each distance metric, and also group the tumors in the top 0.05% of closest matches. Each of the closest matching terms is standardized to its closest matching WHO term.
- Use the computed distance matrices to perform three levels of nested affinity clustering and group the tumors. After grouping, the tumors are standardized to their closest matching WHO term.
- Generate embeddings for each tumor term (CT, WHO, NCIT) using OpenAI's models text-embedding-3-large (LTE-3) and text-embedding-ada-002 (ADA-002). We then identify the closest matching (Euclidean distance) WHO terms for each tumor name from the CTR.
- Perform PCA on each of the embedding types and then run K-means and affinity clustering to group the tumors. We refine the clusters by filtering outliers using isolation forest and local outlier factor.
- After cluster refinement, each cluster is standardized to the WHO term that matches the majority of the members of that cluster.
- We generate embeddings for each tumor term (CT, WHO, NCIT) using Llama 3.3, Llama 3.2, Llama 3.0, MedLLama2, MedLLama13B, BioBERT, PubMedBERT-Abstract (PubMedBERT), ModernBERT-Large (ModernBERT), Phi-4, e5-large, e5-large-v2, all-mpnet-base-v2, gtr-t5-large, all-MiniLM-L12-v2, all-MiniLM-L6-v2, all-roberta-large-v1, SapBERT, ClinicalBERT, LaBSE, BioGPT, DeepSeek_8B, SciBERT, nomic-embed-text, and Cohere embed-english-v2.0.
- These embeddings are provided as inputs to CANTOS, which then identifies the closest matching (Euclidean distance) WHO terms for each tumor name from the CTR using these embeddings (see the sketch after this list).
- We randomly sampled 1,600 tumor names from the CTR and manually annotated their ground truths obtained from the WHO system. We observed that the methods LTE-3 + Euclidean distance and all-MiniLM-L12-v2 + Euclidean distance had the highest standardization accuracy against WHO all editions and WHO 5th edition, respectively. When evaluating standardization accuracy, we filtered out any CTR tumor names (from the 1,600 sampled) that did not have a ground truth. We then plotted the distributions of the Euclidean distances of the remaining CTR terms to their respective WHO terms as identified by these two methods and segregated the distributions based on correct and incorrect standardization.
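As an illustration of the embedding-based matching step, the sketch below finds the closest WHO term for each CTR term by Euclidean distance, using toy matrices in place of the real embedding files and the pdist package from the Library section. It is not the exact pipeline code.

```r
library(pdist)

# Toy embedding matrices standing in for the real CSV files
# (assumption: one row per term, one column per embedding dimension)
set.seed(42)
ct_emb  <- matrix(rnorm(5 * 8), nrow = 5, dimnames = list(paste0("ctr_term_", 1:5), NULL))
who_emb <- matrix(rnorm(3 * 8), nrow = 3, dimnames = list(paste0("who_term_", 1:3), NULL))

# Euclidean distances between every CTR term and every WHO term
d <- as.matrix(pdist(ct_emb, who_emb))

# Standardize each CTR term to its nearest WHO term
closest <- apply(d, 1, which.min)
data.frame(ctr_term  = rownames(ct_emb),
           who_match = rownames(who_emb)[closest],
           distance  = d[cbind(seq_along(closest), closest)])
```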
00-generate-ct-disease-file.R
This script loads data from clinical trials and selects only the diseases that have an NCT ID and are associated with the intervention types Drug, Biological, Combination Product, and Genetic. In total, 50,410 diseases are extracted.
01-generate-disease-annotation-for-manual-review.R
This script automatically annotates the 50,410 diseases as cancer or non-cancer.
02-calculate-edit-distance.R
This script loads the manually annotated disease file with pediatric and adult cancer annotations and computes the edit distance matrices. WHO all editions was used in this script.
02-calculate-edit-distance-5thed.R
This script loads the manually annotated disease file with pediatric and adult cancer annotations and computes the edit distance matrices. WHO 5th edition was used in this script.
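A minimal sketch of the kind of computation these scripts perform, using the stringdist package (the actual scripts read the annotated files and parallelize the work):

```r
library(stringdist)

ct_terms  <- c("acute myeloid leukemia", "breast carcinoma")        # CTR tumor names
who_terms <- c("Acute myeloid leukaemia", "Breast carcinoma NOS")   # WHO terms

# One distance matrix per metric (rows = CTR terms, columns = WHO terms)
d_lev <- stringdistmatrix(tolower(ct_terms), tolower(who_terms), method = "lv")
d_jw  <- stringdistmatrix(tolower(ct_terms), tolower(who_terms), method = "jw", p = 0.1)
d_cos <- stringdistmatrix(tolower(ct_terms), tolower(who_terms), method = "cosine", q = 2)
```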
03-edit-distance-clustering.R
This script performs affinity propagation clustering using edit distances. WHO all editions was used in this script.
03-edit-distance-clustering-5thed.R
This script performs affinity propagation clustering using edit distances. WHO 5th edition was used in this script.
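A minimal sketch of affinity propagation on edit distances with the apcluster package (the scripts run this at a much larger scale):

```r
library(apcluster)
library(stringdist)

terms <- c("acute myeloid leukemia", "acute myeloid leukaemia",
           "breast carcinoma", "breast cancer", "glioblastoma")

d   <- stringdistmatrix(terms, terms, method = "lv")  # pairwise edit distances
sim <- -as.matrix(d)                                  # similarities = negated distances
rownames(sim) <- colnames(sim) <- terms

ap <- apcluster(sim)   # affinity propagation on the similarity matrix
ap@clusters            # members of each cluster
ap@exemplars           # exemplar term for each cluster
```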
04A-preprocess-embedding-pca.R
This script loads ADA-002 embeddings for CT, WHO, and NCIT tumors and then performs PCA. WHO all editions was used in this script.
04A-preprocess-embedding-pca-ADA2-5thed.R
This script loads ADA-002 embeddings for CT, WHO, and NCIT tumors and then performs PCA. WHO 5th edition was used in this script.
04B-preprocess-embedding-pca-v3.R
This script loads LTE-3 embeddings for CT, WHO, and NCIT tumors and then performs PCA. WHO database all editions was used in this script.
04B-preprocess-embedding-pca-v3-5thed.R
This script loads LTE-3 embeddings for CT, WHO, and NCIT tumors and then performs PCA. WHO database 5th edition was used in this script.
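The core of the PCA step looks roughly like the sketch below (toy data stands in for the embedding CSVs; the scripts choose the number of retained components themselves):

```r
set.seed(1)
emb <- matrix(rnorm(200 * 50), nrow = 200)   # stand-in for an embedding matrix

pca <- prcomp(emb, center = TRUE, scale. = TRUE)
summary(pca)$importance[, 1:5]   # variance explained by the first components
scores <- pca$x[, 1:10]          # reduced representation passed on to clustering
```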
05A-cluster-on-ADA2-embedding-Kmeans.R
This script computes K-means clusters using ADA-002 embeddings and also computes the silhouette index. WHO database all editions was used in this script.
05A-ADA2-embedding-Kmeans-5thed.R
This script computes K-means clusters using ADA-002 embeddings and also computes the silhouette index. WHO database 5th edition was used in this script.
05B-v3-embedding-Kmeans.R
This script computes K-means clusters using LTE-3 embeddings and also computes the silhouette index. WHO database all editions was used in this script.
05B-v3-embedding-Kmeans-5thed.R
This script computes K-means clusters using LTE-3 embeddings and also computes the silhouette index. WHO database 5th edition was used in this script.
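For reference, K-means plus the silhouette index can be computed as in the sketch below (toy data; the scripts sweep over multiple values of k):

```r
library(cluster)

set.seed(1)
emb_pca <- matrix(rnorm(300 * 10), nrow = 300)   # stand-in for the PCA scores

km  <- kmeans(emb_pca, centers = 25, nstart = 10)
sil <- silhouette(km$cluster, dist(emb_pca))
mean(sil[, "sil_width"])                         # average silhouette index for k = 25
```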
06A-cluster-on-ADA-embedding-affinity.R
This script computes affinity propagation clustering using ADA-002 embeddings. Nested clustering is performed on large clusters; a cluster is deemed large using Z-scores on cluster membership. WHO database all editions was used in this script.
06A-cluster-on-ADA-embedding-affinity-5thed.R
This script computes affinity propagation clustering using ADA-002 embeddings. Nested clustering is performed on large clusters; a cluster is deemed large using Z-scores on cluster membership. WHO database 5th edition was used in this script.
06B-cluster-on-V3-embedding-affinity.R
This script computes affinity propagation clustering using LTE-3 embeddings. Nested clustering is performed on large clusters; a cluster is deemed large using Z-scores on cluster membership. WHO database all editions was used in this script.
06B-cluster-on-V3-embedding-affinity-5thed.R
This script computes affinity propagation clustering using LTE-3 embeddings. Nested clustering is performed on large clusters; a cluster is deemed large using Z-scores on cluster membership. WHO database 5th edition was used in this script.
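Continuing the affinity propagation sketch above, large clusters can be flagged by a Z-score on cluster size and re-clustered. The threshold below is an assumption for illustration only, not the value used in the scripts.

```r
# Flag unusually large clusters by Z-score of their size, then re-cluster them
sizes <- sapply(ap@clusters, length)
z     <- (sizes - mean(sizes)) / sd(sizes)
large <- which(z > 2)                             # illustrative threshold only

for (i in large) {
  members <- ap@clusters[[i]]
  ap_sub  <- apcluster(sim[members, members])     # nested affinity propagation
  print(ap_sub@clusters)
}
```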
07A-annotate-cluster-result-NCIT-WHO.R
This script annotates the affinity propagation cluster results of ADA-002 embeddings. WHO database all editions was used in this script.
07A-annotate-cluster-result-NCIT-WHO-5thed.R
This script annotates the affinity propagation cluster results of ADA-002 embeddings. WHO database 5th edition was used in this script.
07B-annotate-cluster-result-V3-NCIT-WHO.R
This script annotates the affinity propagation cluster results of LTE-3 embeddings. WHO database all editions was used in this script.
07B-annotate-cluster-result-V3-NCIT-WHO-5thed.R
This script annotates the affinity propagation cluster results of LTE-3 embeddings. WHO database 5th edition was used in this script.
08-outlier-detection-embeddings.R
This script detects whether affinity propagation cluster members are outliers using LOF and Isolation Forest. We perform this for clusters formed using both ADA-002 and LTE-3 embeddings. WHO database all editions was used in this script.
08-outlier-detection-embeddings-5thed.R
This script detects whether affinity propagation cluster members are outliers using LOF and Isolation Forest. We perform this for clusters formed using both ADA-002 and LTE-3 embeddings. WHO database 5th edition was used in this script.
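A minimal sketch of the two outlier detectors using the dbscan and isotree packages, applied here to toy data; the scripts run them per cluster and use their own thresholds.

```r
library(dbscan)
library(isotree)

set.seed(1)
emb <- matrix(rnorm(100 * 10), nrow = 100)   # stand-in for one cluster's embeddings

# Local outlier factor: values well above 1 suggest outliers
lof_scores <- lof(emb, minPts = 10)

# Isolation forest: scores closer to 1 are more anomalous
iso        <- isolation.forest(as.data.frame(emb), ntrees = 100)
iso_scores <- predict(iso, as.data.frame(emb))

outliers <- which(lof_scores > 1.5 & iso_scores > 0.6)   # illustrative thresholds only
```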
09-cluster-reassignment-outlier.R
This script reannotates the affinity clusters after outlier detection. We perform this for clusters formed using both ADA-002 and LTE-3 embeddings. WHO database all editions was used in this script.
09-cluster-reassignment-outlier-5thed.R
This script reannotates the affinity clusters after outlier detection. We perform this for clusters formed using both ADA-002 and LTE-3 embeddings. WHO database 5th edition was used in this script.
10-assign-who-ncit-outlier-kmeans-editdistance-clustering.R
This script detects outliers for embedding-based K-means and edit-distance-based standardization. WHO database all editions was used in this script.
10-assign-who-ncit-outlier-kmeans-editdistance-clustering-5thed.R
This script detects outliers for embedding-based K-means and edit-distance-based standardization. WHO database 5th edition was used in this script.
11-os-embedding-euclidean-dist.R
This script computes Euclidean distance matrices between tumor names in the CTR, WHO, and NCIT using embeddings obtained from non-OpenAI models. For each embedding type, we identify the closest matching WHO 5th edition, WHO all editions, and NCIT terms for every CTR tumor name.
12-sample-CT-tumors-validation.R
This script randomly samples 1,600 tumors from the CTR tumor names.
13-summarize-results.R
This script computes, for each standardization method, its standardization accuracy against the sampled tumor names for which at least one ground truth was found in the WHO system.
14-compute-euclidean-v3-threshold.R
This script generates box plots comparing the standardization results of the methods LTE-3 + Euclidean distance and all-MiniLM-L12-v2 + Euclidean distance.
15-compute-euclidean-v3-threshold.R
This script generates the plot of average silhouette score versus number of clusters shown in Figure 5.
16-generate-heatmaps-mi-analysis.R
This script performs a mutual information analysis to identify three high-accuracy standardization methods whose majority vote selects the standardization for a given CTR tumor name. The script also generates heatmaps (Figures 6 and 7) of the pairwise mutual information among the high-accuracy methods.
17-all-method-voting.R
This script predicts CTR tumor name standardizations by performing majority voting over the high-accuracy methods (standardization accuracy >= 60%) identified in script 13.
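A minimal sketch of the majority-vote idea (the column names and terms below are hypothetical; the actual methods and thresholds come from scripts 13 and 16):

```r
# Each column holds the WHO term predicted by one high-accuracy method
preds <- data.frame(
  ctr_term = c("aml", "breast ca"),
  method_1 = c("Acute myeloid leukaemia", "Breast carcinoma"),
  method_2 = c("Acute myeloid leukaemia", "Invasive breast carcinoma"),
  method_3 = c("Acute leukaemia", "Breast carcinoma")
)

# Majority vote: take the most frequent predicted term per CTR tumor name
majority <- apply(preds[, -1], 1, function(x) names(sort(table(x), decreasing = TRUE))[1])
cbind(preds["ctr_term"], standardized = majority)
```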
18-majority-vote-predictions.R
This script applies the majority-vote-based standardization methods identified in script 16 to all the CTR terms. Since two different combinations of methods were identified for WHO 5th edition and WHO all editions, we use each combination to also standardize the CTR terms with respect to NCIT.
- apcluster
- biomaRt
- cluster
- data.table
- dbscan
- DescTools
- doParallel
- dplyr
- factoextra
- foreach
- ggplot2
- ggpubr
- ghql
- httr
- isotree
- jsonlite
- magrittr
- qdapRegex
- readxl
- stringdist
- stringi
- stringr
- tidyverse
- pdist
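One possible way to install these libraries is sketched below; biomaRt is distributed through Bioconductor, while the remaining packages should be available from CRAN.

```r
# Install the CRAN packages listed above (skipping any that are already installed)
cran_pkgs <- c("apcluster", "cluster", "data.table", "dbscan", "DescTools",
               "doParallel", "dplyr", "factoextra", "foreach", "ggplot2",
               "ggpubr", "ghql", "httr", "isotree", "jsonlite", "magrittr",
               "qdapRegex", "readxl", "stringdist", "stringi", "stringr",
               "tidyverse", "pdist")
to_install <- setdiff(cran_pkgs, rownames(installed.packages()))
if (length(to_install) > 0) install.packages(to_install)

# biomaRt comes from Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("biomaRt")
```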