TaylorResearchLab/CANTOS

CT-Embedding-Paper

The results of this study are used to standardize the tumor names in the CT database so that they can be integrated with other biomedical databases for downstream analysis and for understanding the therapeutic agents and drug-target landscape of a given tumor.

Embeddings Data Download

Follow these steps to download the embeddings data needed to run the pipeline:

  1. Clone this GitHub repository to your local machine.
  2. Navigate to the following Open Science Framework (OSF) page to find the embeddings data files needed to run CANTOS: https://doi.org/10.17605/OSF.IO/DBGWN
  3. Under the files section, navigate to the Embeddings folder. It contains several embedding files that need to be downloaded.
  4. Click the button with three vertical dots on the right side of each embedding file and select the download option to start the download.
  5. Locate the downloaded files on your machine and store them in the data directory of the cloned GitHub repository:
     #    File Name                                   Directory
     1    CT_Embeddings_ADA2.csv                      CANTOS/data
     2    CT_Embeddings_V3.csv                        CANTOS/data
     3    NCIT_Embeddings_V3.csv                      CANTOS/data
     4    WHO_Aggregate_ADA2.csv                      CANTOS/data
     5    WHO_Terms_All_V3.csv                        CANTOS/data
     6    all-MiniLM-L12-v2.csv                       CANTOS/data
     7    all_MiniLM_L6_v2.csv                        CANTOS/data
     8    all_embedding_llama_33_70b.csv              CANTOS/data
     9    all_mpnet_base_v2.csv                       CANTOS/data
     10   biobert_embedding.csv                       CANTOS/data
     11   cohere_embeddings_embed_english_v2.csv      CANTOS/data
     12   deepseek_8b.csv                             CANTOS/data
     13   e5-large-v2.csv                             CANTOS/data
     14   e5-large.csv                                CANTOS/data
     15   embeddings_llama.csv                        CANTOS/data
     16   gtr-t5-large.csv                            CANTOS/data
     17   llama32_3B.csv                              CANTOS/data
     18   medllama-13b.csv                            CANTOS/data
     19   medllama-7b.csv                             CANTOS/data
     20   mordernbert_embeddings.csv                  CANTOS/data
     21   nomic-embed-text.csv                        CANTOS/data
     22   output_tumor_embeddings_biogpt.csv          CANTOS/data
     23   output_tumor_embeddings_clinicalBERT.csv    CANTOS/data
     24   phi4.csv                                    CANTOS/data
     25   pubmedbert-base-embeddings.csv              CANTOS/data
     26   tumor_embeddings_labse.csv                  CANTOS/data
     27   tumor_embeddings_sapbert.csv                CANTOS/data
     28   tumor_embeddings_scibert.csv                CANTOS/data

Please note that the ADA-002 embeddings file for NCIT is located at the following path:

CANTOS/data/dt_input_file_6_dec/NCIT_Neoplasm_Core_terms_text-embedding-ada-002_embeddings.csv
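
Once the files are in place, a quick check along the following lines may help confirm that nothing is missing. This is a minimal sketch and not part of the CANTOS scripts: the helper name find_missing_files is hypothetical, only three of the file names from the table are listed for brevity, and a temporary directory stands in for CANTOS/data.

```r
# Hypothetical helper (not part of CANTOS): report which expected
# embedding files are absent from a data directory.
find_missing_files <- function(data_dir, expected_files) {
  present <- file.exists(file.path(data_dir, expected_files))
  expected_files[!present]
}

# A few file names from the table above; extend with the full list as needed.
expected <- c("CT_Embeddings_ADA2.csv",
              "CT_Embeddings_V3.csv",
              "NCIT_Embeddings_V3.csv")

# Demonstrate on a temporary directory that holds only one of the files.
demo_dir <- file.path(tempdir(), "cantos_data_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, "CT_Embeddings_ADA2.csv")))

missing_files <- find_missing_files(demo_dir, expected)
print(missing_files)
```

In practice, data_dir would point at the data directory of the cloned repository.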

Run Instructions for CANTOS

Before running CANTOS, please ensure the following:

  1. R is installed on your machine. It can be downloaded from the following website: https://www.r-project.org/
  2. The libraries listed in the Libraries section below are installed.
  3. In the following scripts, edit the makeCluster argument to match the number of cores available on your machine. The number of available cores can be found using the R command detectCores() from the parallel library:
    02-calculate-edit-distance-5thed.R
    02-calculate-edit-distance.R
    07A-annotate-cluster-result-NCIT-WHO-5thed.R
    07A-annotate-cluster-result-NCIT-WHO.R
    07B-annotate-cluster-result-V3-NCIT-WHO-5thed.R
    07B-annotate-cluster-result-V3-NCIT-WHO.R
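
As a rough illustration of item 3, the following sketch shows one way to pick a core count and pass it to makeCluster. It is an assumption about typical usage, not a copy of the CANTOS scripts: leaving one core free and capping the demo at two workers are choices made here for illustration only.

```r
library(parallel)  # ships with R

# detectCores() can return NA on some platforms, so fall back to 1.
available <- detectCores()
if (is.na(available)) available <- 1L

# Leaving one core free is a common convention, not a CANTOS requirement.
n_cores <- max(1L, available - 1L)

# The demo caps the cluster at 2 workers to stay lightweight; in the
# scripts listed above you would pass your chosen core count instead.
cl <- makeCluster(min(n_cores, 2L))
squares <- parLapply(cl, 1:4, function(i) i^2)
stopCluster(cl)  # always release the workers when done

print(n_cores)
print(unlist(squares))
```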

We ran CANTOS in RStudio version 2023.09.1+494 using R version 4.4.0 (2024-04-24). Users can also run CANTOS from the command line by running the following command from the CANTOS/analysis directory:

bash CANTOS.sh

Description

This repository contains the code, tables, and plots associated with the CT Embedding paper. The pipeline built in this repository performs the following tasks:

  1. Extract tumor names from the CT database if they are associated with an NCT ID and have an associated drug belonging to one of the categories Drug, Biological, Combination Product, or Genetic. A total of 50,410 condition names are extracted.

  2. These 50,410 condition names are flagged as tumors or non-tumors by the pipeline, and the tumors are then manually annotated as pediatric or adult tumors. A total of 13,230 tumors are identified from the 50,410 conditions, of which 6,324 were classified as pediatric tumors.

  3. Compute the distances between each of the 13,230 clinical trials tumors, the 4,720 WHO tumors, and the 1,395 NCIT tumors. The distance metrics used are Levenshtein, Cosine, and Jaro-Winkler.

  4. Find the closest matching WHO term for each tumor under each distance metric, and also group the top 0.05% closest matching tumors. Each of the closest matching terms is standardized to its closest matching WHO term.

  5. Use the computed distance matrices to perform 3 levels of nested affinity propagation clustering to group the tumors. After grouping, the tumors are standardized to their closest matching WHO term.

  6. Generate embeddings for each tumor term (CT, WHO, NCIT) using OpenAI's models text-embedding-3-large (LTE-3) and text-embedding-ada-002 (ADA-002). We then identify the closest matching WHO term (by Euclidean distance) for each tumor name from the CTR.

  7. Perform PCA on each of the embedding types and then run K-means and Affinity Clustering to group the tumors together. We refine the clusters by filtering outliers using isolation forest and local outlier factor.

  8. After cluster refinement, each cluster is standardized to the WHO term that matches a majority of the members of that cluster.

  9. We generate embeddings for each tumor term (CT, WHO, NCIT) using Llama 3.3, Llama 3.2, Llama 3.0, MedLLama2, MedLLama13B, BioBERT, PubMedBERT-Abstract (PubMedBERT), ModernBERT-Large (ModernBERT), Phi-4, e5-large, e5-large-v2, all-mpnet-base-v2, gtr-t5-large, all-MiniLM-L12-v2, all-MiniLM-L6-v2, all-roberta-large-v1, SapBERT, ClinicalBERT, LaBSE, BioGPT, DeepSeek_8B, SciBERT, nomic-embed-text, and Cohere embed-english-v2.0.

  10. These embeddings are provided as inputs to CANTOS, which then identifies the closest matching WHO term (by Euclidean distance) for each tumor name from the CTR.

  11. We randomly sampled 1,600 tumor names from the CTR and manually annotated their ground truths obtained from the WHO system. To evaluate standardization accuracy, we filtered out any of the 1,600 sampled CTR tumor names that did not have a ground truth. We observed that the methods LTE-3+Euclidean Dist and all-MiniLM-L12-v2+Euclidean Dist had the highest standardization accuracy against WHO all editions and WHO 5th edition, respectively. We then plotted the distributions of the Euclidean distances of the remaining CTR terms to their respective WHO terms as identified by these two methods, separating the distributions by correct and incorrect standardization.
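
To make step 8 more concrete, the sketch below labels each cluster with the WHO term that is most common among its members. It is an illustration under assumptions rather than the pipeline's code: the data frame, its column names, and the helper most_common are hypothetical, and ties are resolved by taking the first term in alphabetical order, which the source does not specify.

```r
# Toy cluster assignments with each member's closest WHO term (made-up data).
members <- data.frame(
  cluster = c(1, 1, 1, 2, 2),
  closest_who = c("Glioblastoma", "Glioblastoma", "Astrocytoma",
                  "Nephroblastoma", "Nephroblastoma"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the most frequent term in a vector
# (ties go to the first term in alphabetical order).
most_common <- function(terms) names(which.max(table(terms)))

# One label per cluster, then mapped back onto every member.
cluster_label <- tapply(members$closest_who, members$cluster, most_common)
members$standardized <- as.vector(cluster_label[as.character(members$cluster)])

print(members)
```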

Scripts

00-generate-ct-disease-file.R
This script loads data from clinical trials and selects only the diseases that have an NCT ID and are associated with the intervention types Drug, Biological, Combination Product, or Genetic. In total, 50,410 diseases are extracted.

01-generate-disease-annotation-for-manual-review.R
This script automatically annotates the 50,410 diseases as cancer or non-cancer.

02-calculate-edit-distance.R
This script loads the manually annotated disease file with pediatric and adult cancer annotations and computes the edit distance matrices. WHO all editions was used in this script.
02-calculate-edit-distance-5thed.R
This script loads the manually annotated disease file with pediatric and adult cancer annotations and computes the edit distance matrices. WHO 5th edition was used in this script.
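
The idea behind an edit distance matrix can be sketched with base R's adist(), which computes a generalized Levenshtein distance. This is a minimal illustration with made-up term lists, not the code in the scripts above; lower-casing before comparison is an assumption made here.

```r
# Made-up example terms; the real inputs are the annotated CT tumors and WHO terms.
ct_terms  <- c("acute myeloid leukaemia", "glioblastoma multiforme", "wilms tumor")
who_terms <- c("Acute myeloid leukemia", "Glioblastoma", "Nephroblastoma", "Wilms tumour")

# Rows are CT terms, columns are WHO terms; entries are Levenshtein distances.
d <- adist(tolower(ct_terms), tolower(who_terms))
dimnames(d) <- list(ct_terms, who_terms)

# Closest WHO term for each CT term (smallest distance in each row).
closest <- who_terms[apply(d, 1, which.min)]

print(d)
print(closest)
```

The stringdist package listed under Libraries offers further metrics, including Jaro-Winkler through stringdistmatrix() with method = "jw".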

03-edit-distance-clustering.R
This script performs affinity propagation clustering using edit distances. WHO all editions was used in this script.
03-edit-distance-clustering-5thed.R
This script performs affinity propagation clustering using edit distances. WHO 5th edition was used in this script.

04A-preprocess-embedding-pca.R
This script loads ADA-002 embeddings for CT, WHO, and NCIT tumors and then performs PCA. WHO all editions was used in this script.
04A-preprocess-embedding-pca-ADA2-5thed.R
This script loads ADA-002 embeddings for CT, WHO, and NCIT tumors and then performs PCA. WHO 5th edition was used in this script.

04B-preprocess-embedding-pca-v3.R
This script loads LTE-3 embeddings for CT, WHO, and NCIT tumors and then performs PCA. WHO database all editions was used in this script.
04B-preprocess-embedding-pca-v3-5thed.R
This script loads LTE-3 embeddings for CT, WHO, and NCIT tumors and then performs PCA. WHO database 5th edition was used in this script.
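
A PCA step of this kind might look like the sketch below, which uses base R's prcomp() on a small random matrix standing in for an embeddings table. The 80% cumulative-variance cutoff and the toy dimensions are assumptions for illustration; the source does not state how many components the scripts retain.

```r
set.seed(1)

# Stand-in for an embeddings table: 20 terms x 8 embedding dimensions.
emb <- matrix(rnorm(20 * 8), nrow = 20,
              dimnames = list(paste0("tumor_", 1:20), paste0("dim_", 1:8)))

# Centered PCA; embeddings share a scale, so no rescaling is applied here.
pca <- prcomp(emb, center = TRUE, scale. = FALSE)

# Cumulative share of variance explained by the leading components.
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Keep the fewest components reaching an assumed 80% threshold.
n_keep  <- which(cum_var >= 0.80)[1]
reduced <- pca$x[, seq_len(n_keep), drop = FALSE]

print(n_keep)
print(dim(reduced))
```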

05A-cluster-on-ADA2-embedding-Kmeans.R
This script computes K-means clusters using ADA-002 embeddings and also computes the silhouette index. WHO database all editions was used in this script.
05A-ADA2-embedding-Kmeans-5thed.R
This script computes K-means clusters using ADA-002 embeddings and also computes the silhouette index. WHO database 5th edition was used in this script.

05B-v3-embedding-Kmeans.R
This script computes K-means clusters using LTE-3 embeddings and also computes the silhouette index. WHO database all editions was used in this script.
05B-v3-embedding-Kmeans-5thed.R
This script computes K-means clusters using LTE-3 embeddings and also computes the silhouette index. WHO database 5th edition was used in this script.
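
The following sketch shows how K-means and the average silhouette score can be combined to compare cluster counts. It is an illustration on toy 2-D data with a hand-written silhouette helper (avg_silhouette is a hypothetical name), so it should not be read as the scripts' actual implementation.

```r
set.seed(42)

# Toy 2-D stand-in for PCA-reduced embeddings: three well-separated groups.
x <- rbind(
  cbind(rnorm(20, 0, 0.3), rnorm(20, 0, 0.3)),
  cbind(rnorm(20, 5, 0.3), rnorm(20, 5, 0.3)),
  cbind(rnorm(20, 0, 0.3), rnorm(20, 5, 0.3))
)
d <- as.matrix(dist(x))

# Average silhouette width: for each point, a = mean distance to its own
# cluster, b = mean distance to the nearest other cluster, s = (b - a) / max(a, b).
avg_silhouette <- function(d, cl) {
  s <- numeric(nrow(d))
  for (i in seq_len(nrow(d))) {
    own <- cl == cl[i]
    if (sum(own) == 1) next            # singleton clusters score 0
    a <- sum(d[i, own]) / (sum(own) - 1)
    b <- min(tapply(d[i, !own], cl[!own], mean))
    s[i] <- (b - a) / max(a, b)
  }
  mean(s)
}

# Score several candidate cluster counts and keep the best one.
ks <- 2:5
scores <- sapply(ks, function(k) {
  avg_silhouette(d, kmeans(x, centers = k, nstart = 25)$cluster)
})
best_k <- ks[which.max(scores)]

print(round(scores, 3))
print(best_k)
```

The cluster package listed under Libraries provides silhouette() for the same purpose.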

06A-cluster-on-ADA-embedding-affinity.R
This script performs affinity propagation clustering using ADA-002 embeddings. Nested clustering is performed on large clusters. Clusters are identified as large using Z-scores on cluster membership. WHO database all editions was used in this script.
06A-cluster-on-ADA-embedding-affinity-5thed.R
This script performs affinity propagation clustering using ADA-002 embeddings. Nested clustering is performed on large clusters. Clusters are identified as large using Z-scores on cluster membership. WHO database 5th edition was used in this script.

06B-cluster-on-V3-embedding-affinity.R
This script performs affinity propagation clustering using LTE-3 embeddings. Nested clustering is performed on large clusters. Clusters are identified as large using Z-scores on cluster membership. WHO database all editions was used in this script.
06B-cluster-on-V3-embedding-affinity-5thed.R
This script performs affinity propagation clustering using LTE-3 embeddings. Nested clustering is performed on large clusters. Clusters are identified as large using Z-scores on cluster membership. WHO database 5th edition was used in this script.
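
Flagging large clusters by Z-score could be sketched as follows. The cluster assignments are made up, and the cutoff of 1.5 is an assumption chosen for this example; the source does not state the threshold used by the scripts.

```r
# Made-up cluster assignments: one cluster is much larger than the others.
cluster_id <- c(rep(1, 50), rep(2, 5), rep(3, 6), rep(4, 4), rep(5, 7), rep(6, 5))

# Number of members per cluster, as a named integer vector.
tab   <- table(cluster_id)
sizes <- setNames(as.integer(tab), names(tab))

# Z-score of each cluster's size relative to all cluster sizes.
z <- (sizes - mean(sizes)) / sd(sizes)

# Clusters above the assumed cutoff would be re-clustered in a nested pass.
z_cutoff <- 1.5
large_clusters <- names(z)[z > z_cutoff]

print(round(z, 2))
print(large_clusters)
```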

07A-annotate-cluster-result-NCIT-WHO.R
This script annotates the affinity propagation cluster results of ADA-002 embeddings. WHO database all editions was used in this script.
07A-annotate-cluster-result-NCIT-WHO-5thed.R
This script annotates the affinity propagation cluster results of ADA-002 embeddings. WHO database 5th edition was used in this script.

07B-annotate-cluster-result-V3-NCIT-WHO.R
This script annotates the affinity propagation cluster results of LTE-3 embeddings. WHO database all editions was used in this script.
07B-annotate-cluster-result-V3-NCIT-WHO-5thed.R
This script annotates the affinity propagation cluster results of LTE-3 embeddings. WHO database 5th edition was used in this script.

08-outlier-detection-embeddings.R
This script detects whether affinity propagation cluster members are outliers using LOF and Isolation Forest. We perform this for clusters formed using both ADA-002 and LTE-3 embeddings. WHO database all editions was used in this script.
08-outlier-detection-embeddings-5thed.R
This script detects whether affinity propagation cluster members are outliers using LOF and Isolation Forest. We perform this for clusters formed using both ADA-002 and LTE-3 embeddings. WHO database 5th edition was used in this script.
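
To show what a local outlier factor (LOF) score measures, here is a small base R implementation applied to toy data. It is a hand-written sketch for illustration only: the function lof_scores is hypothetical, it ignores ties at the k-th neighbor distance, and it requires k of at least 2. The Libraries section lists dbscan and isotree, which presumably supply LOF and Isolation Forest in the actual scripts.

```r
# Hypothetical helper: LOF score for every row of x using k nearest neighbors.
lof_scores <- function(x, k = 5) {
  d <- as.matrix(dist(x))
  n <- nrow(d)
  diag(d) <- Inf                                   # a point is not its own neighbor
  nn <- t(apply(d, 1, function(r) order(r)[seq_len(k)]))  # n x k neighbor indices
  k_dist <- d[cbind(seq_len(n), nn[, k])]          # distance to the k-th neighbor
  # Local reachability density: inverse mean reachability distance.
  lrd <- sapply(seq_len(n), function(i) {
    reach <- pmax(k_dist[nn[i, ]], d[i, nn[i, ]])
    1 / mean(reach)
  })
  # LOF: neighbors' average density relative to the point's own density.
  sapply(seq_len(n), function(i) mean(lrd[nn[i, ]]) / lrd[i])
}

set.seed(7)
# Twenty points around the origin plus one far-away point (row 21).
x <- rbind(matrix(rnorm(40), ncol = 2), c(10, 10))

scores <- lof_scores(x, k = 5)
print(which.max(scores))
```

Scores near 1 indicate a point whose neighborhood is about as dense as its neighbors' neighborhoods, while much larger scores suggest an outlier.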

09-cluster-reassignment-outlier.R
This script reannotates affinity propagation clusters after outlier detection. We perform this for clusters formed using both ADA-002 and LTE-3 embeddings. WHO database all editions was used in this script.
09-cluster-reassignment-outlier-5thed.R
This script reannotates affinity propagation clusters after outlier detection. We perform this for clusters formed using both ADA-002 and LTE-3 embeddings. WHO database 5th edition was used in this script.

10-assign-who-ncit-outlier-kmeans-editdistance-clustering.R
This script detects outliers for embedding-based K-means and edit-distance-based standardization. WHO database all editions was used in this script.
10-assign-who-ncit-outlier-kmeans-editdistance-clustering-5thed.R
This script detects outliers for embedding-based K-means and edit-distance-based standardization. WHO database 5th edition was used in this script.

11-os-embedding-euclidean-dist.R
This script computes Euclidean distance matrices between tumor names in CTR, WHO, and NCIt using embeddings obtained from non-OpenAI models. For each embedding type, we identify the closest matching WHO 5th edition, WHO all editions, and NCIt terms for every CTR tumor name.
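
Nearest-term matching by Euclidean distance can be sketched in base R as shown below. The tiny 3-dimensional vectors and the helper cross_euclid are made up for illustration; real embeddings have hundreds or thousands of dimensions, and the pdist package listed under Libraries may be what the script uses for this step.

```r
# Hypothetical helper: Euclidean distances between every row of a and every row of b.
cross_euclid <- function(a, b) {
  sq <- outer(rowSums(a^2), rowSums(b^2), "+") - 2 * tcrossprod(a, b)
  sq[sq < 0] <- 0                      # guard against tiny negative rounding errors
  d <- sqrt(sq)
  dimnames(d) <- list(rownames(a), rownames(b))
  d
}

# Made-up embeddings: rows are terms, columns are embedding dimensions.
ct_emb <- rbind(ct_a = c(1, 0, 0),
                ct_b = c(0, 1, 0.1))
who_emb <- rbind(who_x = c(0.9, 0.1, 0),
                 who_y = c(0, 1, 0),
                 who_z = c(0, 0, 1))

d <- cross_euclid(ct_emb, who_emb)

# Closest WHO term for each CT term.
closest <- colnames(d)[apply(d, 1, which.min)]

print(round(d, 3))
print(closest)
```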

12-sample-CT-tumors-validation.R
This script randomly samples 1,600 tumors from the CTR tumor names.
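
A reproducible random sample of this size can be drawn as in the sketch below. The seed value and the placeholder tumor names are assumptions; the source does not say which seed, if any, the script sets.

```r
# Placeholder names standing in for the 13,230 CTR tumor names.
ctr_tumors <- paste0("tumor_", seq_len(13230))

# Fixing a seed (value assumed here) makes the sample reproducible.
set.seed(2024)
sampled <- sample(ctr_tumors, size = 1600, replace = FALSE)

print(length(sampled))
print(head(sampled, 3))
```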

13-summarize-results.R
For each standardization method, this script computes the standardization accuracy against the sampled tumor names for which at least one ground truth was found in the WHO system.
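
The accuracy calculation can be illustrated as follows. The table layout, column names, and values are hypothetical, and the sketch assumes a single ground truth per term, whereas the pipeline allows for more than one.

```r
# Made-up validation table: one row per sampled CTR term.
results <- data.frame(
  ctr_term     = c("term_1", "term_2", "term_3", "term_4"),
  ground_truth = c("A", "B", NA, "C"),      # NA: no ground truth was found
  method_1     = c("A", "X", "B", "C"),
  method_2     = c("A", "B", "B", "C"),
  stringsAsFactors = FALSE
)

# Only terms with a ground truth are evaluable.
evaluable <- results[!is.na(results$ground_truth), ]

# Share of evaluable terms each method standardized correctly.
methods  <- c("method_1", "method_2")
accuracy <- sapply(methods, function(m) mean(evaluable[[m]] == evaluable$ground_truth))

print(round(accuracy, 3))
```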

14-compute-euclidean-v3-threshold.R
This script generates box plots comparing the standardization results of the methods LTE-3+Euclidean Dist and all-MiniLM-L12-v2+Euclidean Dist.

15-compute-euclidean-v3-threshold.R
This script generates the average silhouette score vs. number of clusters plot in Figure 5.

16-generate-heatmaps-mi-analysis.R
This script performs mutual information analysis to identify three high-accuracy standardization methods to use in a majority vote that selects the standardization for a given CTR tumor name. The script also generates heatmaps (Figures 6 and 7) of pairwise mutual information among the high-accuracy methods.
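
Pairwise mutual information between two methods can be computed from their joint frequency table, as in the sketch below. It assumes the inputs are per-term correct/incorrect indicators for each method, which is a guess made for illustration; the source does not specify which variables the analysis uses. The function name mutual_information is hypothetical.

```r
# Hypothetical helper: mutual information (in bits) between two discrete vectors.
mutual_information <- function(x, y) {
  joint <- unclass(table(x, y)) / length(x)            # joint probabilities
  expected <- outer(rowSums(joint), colSums(joint))    # product of the marginals
  nz <- joint > 0                                      # empty cells contribute 0
  sum(joint[nz] * log2(joint[nz] / expected[nz]))
}

# Made-up correctness indicators for three methods on four terms.
method_a <- c(TRUE, TRUE, FALSE, FALSE)
method_b <- c(TRUE, FALSE, TRUE, FALSE)   # unrelated to method_a
method_c <- method_a                      # identical to method_a

mi_ab <- mutual_information(method_a, method_b)
mi_ac <- mutual_information(method_a, method_c)

print(c(a_vs_b = mi_ab, a_vs_c = mi_ac))
```

Methods with low mutual information agree less predictably, which is one reason such methods may be useful to combine in a vote.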

17-all-method-voting.R
This script predicts the standardization of CTR tumor names by majority voting among the high-accuracy methods (standardization accuracy >= 60%) identified in script 13.

18-majority-vote-predictions.R
This script generates majority-vote standardization predictions for all CTR terms using the methods identified in script 16. Since two different combinations of methods were identified for WHO 5th edition and WHO all editions, we use each combination to also standardize the CTR terms with respect to NCIt.
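
Majority voting across methods can be sketched as follows. The method names and predictions are made up, the helper majority_vote is hypothetical, and returning NA when no single term wins is an assumption; the source does not describe how the scripts handle ties.

```r
# Hypothetical helper: the term predicted most often, or NA if there is a tie.
majority_vote <- function(votes) {
  votes <- votes[!is.na(votes)]
  if (length(votes) == 0) return(NA_character_)
  counts  <- table(votes)
  winners <- names(counts)[as.vector(counts) == max(counts)]
  if (length(winners) == 1) winners else NA_character_
}

# Made-up predictions from three methods for three CTR terms (one row per term).
predictions <- data.frame(
  method_a = c("Glioblastoma", "Nephroblastoma", "Osteosarcoma"),
  method_b = c("Glioblastoma", "Wilms tumour", "Ewing sarcoma"),
  method_c = c("Astrocytoma", "Nephroblastoma", "Chondrosarcoma"),
  stringsAsFactors = FALSE
)

# Apply the vote row by row.
consensus <- unname(apply(predictions, 1, majority_vote))

print(consensus)
```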

Libraries

  1. apcluster
  2. biomaRt
  3. cluster
  4. data.table
  5. dbscan
  6. DescTools
  7. doParallel
  8. dplyr
  9. factoextra
  10. foreach
  11. ggplot2
  12. ggpubr
  13. ghql
  14. httr
  15. isotree
  16. jsonlite
  17. magrittr
  18. qdapRegex
  19. readxl
  20. stringdist
  21. stringi
  22. stringr
  23. tidyverse
  24. pdist
