Initial Manuscript: Analysis and Evaluation Plan

PheKnowVec: A Novel Approach to Computational Phenotyping

Manuscript Milestones

Project Milestone

Proposed Journal:

Documents Linked to Outline

Overview

Motivation
Approach

Motivation

Computational phenotyping (CP) leverages predefined sets of clinical concepts (e.g., diagnoses, medications, procedures, and laboratory test codes) to identify patients with and without a condition. CP approaches have great potential to aid in diagnosis, prognosis, therapeutic decision-making, and identification of mechanisms or novel biomarkers.

Existing methods face three unsolved barriers:

CP definitions may have limited generalizability because they are tailored to specific source vocabularies or hospital systems.
CP definitions may lack translational relevance because they primarily rely on clinical data requiring additional mapping to incorporate, for example, molecular or physiologic data.
CP definitions may lack scalability because the current process for creating definitions is a time-consuming, iterative process requiring both domain expertise and external validation.

How can we solve these problems?
PheKnowVec is a novel method for deriving, implementing, and validating CPs. It addresses these barriers by:

Mapping source vocabularies to standardized clinical terminology concepts, like those in the Observational Medical Outcomes Partnership (OMOP) common data model.
Mapping standardized clinical terminology concepts to linked open data, such as biomedical ontologies, which has been shown to significantly improve the integration of non-clinical data sources.
Using embedding methods, which convert large, complex, heterogeneous data into compact vectors that retain semantic information and have been applied successfully to a wide range of problems in the biomedical domain.

Approach

We define the following terms, which will be used throughout this section:

Phenotype Definition Rules: the logic that underlies the definitions of a clinical phenotype (e.g., must have two occurrences of abnormally elevated eosinophil counts within 1 month).
Phenotype Code Sets: the clinical concepts (i.e., diagnosis, medication, laboratory test, and procedure codes) that are used in the definition rules of each phenotype. There are two types of code sets (see the Code Sets section for more information):
- Clinical code set: derived by mapping the source vocabulary (SV) codes defined in an eMERGE phenotype to codes in an OMOP common data model standardized terminology (ST).
- Ontology code set: derived by mapping codes from the OMOP standardized terminologies to codes in an Open Biomedical Ontology (OBO).
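
To make the idea of a definition rule concrete, the eosinophil example above can be sketched in Python. This is only an illustration: the function name `meets_rule`, the threshold value, and the reading of "1 month" as 30 days are assumptions, not part of any eMERGE definition.

```python
from datetime import date, timedelta

def meets_rule(measurements, threshold, window_days=30):
    """Return True if two abnormal results fall within `window_days` of each other.

    measurements: list of (date, value) tuples for one patient.
    threshold: value above which a result counts as abnormally elevated
               (hypothetical; a real cut-off would come from the phenotype).
    """
    abnormal_dates = sorted(d for d, value in measurements if value > threshold)
    # Adjacent abnormal dates are the closest pairs, so checking neighbours suffices.
    return any(
        later - earlier <= timedelta(days=window_days)
        for earlier, later in zip(abnormal_dates, abnormal_dates[1:])
    )

# Hypothetical lab results for two patients.
patient_a = [(date(2018, 1, 3), 0.9), (date(2018, 1, 20), 1.1), (date(2018, 6, 1), 0.2)]
patient_b = [(date(2018, 1, 3), 0.9), (date(2018, 4, 20), 1.1)]
print(meets_rule(patient_a, threshold=0.5))  # True
print(meets_rule(patient_b, threshold=0.5))  # False
```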

Data Sources

All of the experiments described below will be evaluated using two independent datasets:

The first dataset contains pediatric data and was extracted from a de-identified database built using data from the Children’s Hospital Colorado (CHCO). The CHCO De-ID data conform to the structure defined by the Pediatric Learning Health System OMOP common data model version 5.
The second dataset contains adult intensive care data from the MIMIC-III database, which has also been standardized to the OMOP common data model (version 5.3).

All diagnosis, medication, and laboratory test data in the current build of each dataset were considered for analysis. All patients whose record contained at least 1 code in the defined code sets were included in the analysis. Use of this data was approved by the Colorado Multiple Institutional Review Board (protocol # 15-0445).
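
The inclusion criterion above (at least one code from the defined code sets) amounts to a set intersection per patient. A minimal sketch, assuming patient records have already been loaded into a dictionary; the names `eligible_patients`, `records`, and the placeholder codes are hypothetical:

```python
def eligible_patients(patient_codes, code_set):
    """Return IDs of patients whose record holds at least one code from code_set.

    patient_codes: dict mapping patient ID -> set of codes in that patient's record.
    """
    return {pid for pid, codes in patient_codes.items() if codes & code_set}

# Hypothetical records and code set, for illustration only.
records = {
    "p1": {"CODE_A", "CODE_X"},
    "p2": {"CODE_Y"},
    "p3": {"CODE_B"},
}
phenotype_codes = {"CODE_A", "CODE_B"}
print(sorted(eligible_patients(records, phenotype_codes)))  # ['p1', 'p3']
```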

Phenotype Definitions

We will use all phenotypes appropriate for implementation in pediatric and adult populations from the eMERGE network's Phenotype KnowledgeBase (n=9). Table 1 provides an overview of the clinical domains and source vocabularies that are used in each of the phenotypes.

Additional information on the phenotypes listed below can be found within these documents:

TABLE 1. The eMERGE phenotypes selected for use in the PheKnowVec experiments.

Phenotypes	Diagnoses	Medications	Lab Tests	Procedures	Problem Lists	Diagnoses Vocab	Medications Vocab	Lab Tests Vocab	Procedures Vocab	Problem List Vocab	NLP Required	Phenotype Definition	Case Definition	Control Definition
ADHD	TRUE	TRUE	FALSE	FALSE	FALSE	ICD9-CM	String	---	---	---	FALSE	FALSE	TRUE	TRUE
Appendicitis	TRUE	TRUE	FALSE	TRUE	FALSE	ICD9-CM	String	---	CPT	---	FALSE	TRUE	TRUE	TRUE
Crohn's Disease	TRUE	TRUE	FALSE	FALSE	FALSE	ICD9-CM	String	---	---	---	FALSE	FALSE	TRUE	TRUE
Hypothyroidism	TRUE	TRUE	TRUE	TRUE	FALSE	ICD9-CM	String	String	CPT	---	FALSE	TRUE	TRUE	TRUE
Peanut allergy	TRUE	FALSE	TRUE	TRUE	FALSE	String	---	String	String	---	FALSE	FALSE	TRUE	FALSE
Sickle Cell Disease	TRUE	FALSE	FALSE	FALSE	FALSE	ICD9-CM	---	---	---	---	FALSE	TRUE	TRUE	FALSE
Sleep Apnea	TRUE	FALSE	FALSE	FALSE	FALSE	ICD9-CM	---	---	---	---	FALSE	FALSE	TRUE	FALSE
Steroid-Induced Osteonecrosis	TRUE	TRUE	FALSE	TRUE	FALSE	ICD9-CM	String	---	CPT	---	FALSE	TRUE	TRUE	TRUE
Systemic lupus erythematosus	TRUE	TRUE	TRUE	FALSE	FALSE	ICD9-CM	String	String	---	---	FALSE	FALSE	TRUE	TRUE

Code Sets

Code sets will be created for condition, medication, laboratory, and procedure concepts wherever a phenotype defines them. The SV, ST, and OBO code sets that will be used in the PheKnowVec definitions are described in Table 2. Our clinical code sets are similar to those defined in recent work by Hripcsak et al., 2018.

TABLE 2. The SV, ST, and OBO code sets used in the PheKnowVec experiments.

Code Set	Description
SV	Clinical code sets composed of the original concepts defined by the eMERGE phenotype authors.
ST No Descendants	Clinical code sets created by mapping SV clinical code sets to OMOP common data model ST clinical concepts.
ST Descendants	Clinical code sets created by mapping SV clinical code sets to OMOP common data model ST clinical concepts. This code set also includes all of the descendants of each clinical code if none of the code's children were already part of the code set.
OBO-EM No Descendants	Ontology code sets composed of all OBO codes with an exact match by database cross-reference (DbXRef) or an exact string match between a clinical concept's label and an ontology concept's label, definition, or synonym.
OBO-EM Descendants	Ontology code sets composed of all OBO codes with an exact match by database cross-reference (DbXRef) or an exact string match between a clinical concept's label and an ontology concept's label, definition, or synonym. This code set also includes all of the descendants of each OBO code if none of the code's children were already part of the code set.
OBO-EM+MM No Descendants	Ontology code sets composed of all OBO-Exact Match concepts and all ontology concepts that were manually mapped (i.e. a manual mapping created between one ST clinical concept and one ontology concept) or manually constructed (i.e. a manual mapping created between one ST clinical concept and two or more ontology concepts using an "AND", "OR", or "NOT" constructor).
OBO-EM+MM Descendants	Ontology code sets composed of all OBO-Exact Match concepts and all ontology concepts that were manually mapped (i.e. a manual mapping created between one ST clinical concept and one ontology concept) or manually constructed (i.e. a manual mapping created between one ST clinical concept and two or more ontology concepts using an "AND", "OR", or "NOT" constructor). This code set also includes all of the descendants of each OBO code if none of the code's children were already part of the code set.

Acronyms: SV (source vocabulary); ST (standardized terminology); OBO (open biomedical ontology); EM (exact match); MM (manually mapped).
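
The descendant rule used by the "Descendants" code sets in Table 2 (expand a code only when none of its children are already listed) can be sketched as follows. The hierarchy is represented here as a plain dictionary of direct children; that representation, the function name `expand_with_descendants`, and the placeholder codes are assumptions made for illustration.

```python
def expand_with_descendants(code_set, children):
    """Add descendants of each code unless one of its children is already listed.

    children: dict mapping a code to a list of its direct child codes
              (a hypothetical stand-in for a terminology hierarchy).
    """
    original = set(code_set)

    def descendants(code):
        found, stack = set(), list(children.get(code, []))
        while stack:
            node = stack.pop()
            if node not in found:
                found.add(node)
                stack.extend(children.get(node, []))
        return found

    expanded = set(original)
    for code in original:
        direct_children = set(children.get(code, []))
        # Expand only when the original code set names none of this code's children.
        if not direct_children & original:
            expanded |= descendants(code)
    return expanded

hierarchy = {"A": ["A1", "A2"], "A1": ["A1a"], "B": ["B1", "B2"]}
print(sorted(expand_with_descendants({"A", "B", "B1"}, hierarchy)))
# ['A', 'A1', 'A1a', 'A2', 'B', 'B1']
```

In this example "A" is expanded because none of its children were listed, while "B" is left alone because "B1" was already part of the code set.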

Code Set Mapping

To generate the clinical and ontology code sets described in Table 2, two types of mappings will be created:

Mapping Source Vocabulary Code Sets to Clinical Code Sets: The OMOP CONCEPT table will be used to map SV codes to ST codes:

Diagnoses/Problem Lists:
- If an SV code is provided, it will be mapped to SNOMED CT concepts.
- If no SV code is provided, the provided string will be queried against ST code labels (i.e. condition_source_value or concept_name) to find a matching ST code.
Medications:
- If an SV code is provided, it will be mapped to RxNorm concepts.
- If no SV code is provided, the provided string will be queried against ST code labels (i.e. drug_source_value or concept_name) to find a matching ST code.
Laboratory Tests:
- If an SV code is provided, it will be mapped to LOINC concepts.
- If no SV code is provided, the provided string will be queried against ST code labels (i.e. measurement_source_value or concept_name) to find a matching ST code.
Procedures:
- If an SV code is provided, it will be mapped to CPT4 concepts.
- If no SV code is provided, the provided string will be queried against ST code labels (i.e. procedure_source_value or concept_name) to find a matching ST code.
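
The two-branch logic above (use the SV code when one exists, otherwise fall back to a string query against ST labels) can be sketched with in-memory lookups. Everything here is a simplification: the dictionaries stand in for OMOP queries, the identifiers are placeholders, and treating the string query as a case-insensitive substring match is an assumption, since the plan does not specify the matching strategy.

```python
def map_sv_entry(entry, code_to_st, st_labels):
    """Map one source-vocabulary entry to a sorted list of ST concept IDs.

    entry: dict with an optional 'code' and an optional 'label'.
    code_to_st: dict mapping an SV code -> set of ST concept IDs (hypothetical lookup).
    st_labels: dict mapping an ST concept ID -> its label.
    """
    code = entry.get("code")
    if code:
        # SV codes missing from the lookup are excluded (see ASSUMPTIONS).
        return sorted(code_to_st.get(code, set()))
    query = entry.get("label", "").strip().lower()
    if not query:
        return []
    # Assumed matching strategy: case-insensitive substring match on the label.
    return sorted(st_id for st_id, label in st_labels.items() if query in label.lower())

# Placeholder identifiers, for illustration only.
code_to_st = {"SRC-001": {"ST-100"}}
st_labels = {
    "ST-100": "Example disorder",
    "ST-200": "Example disorder, severe",
    "ST-300": "Unrelated finding",
}
print(map_sv_entry({"code": "SRC-001"}, code_to_st, st_labels))            # ['ST-100']
print(map_sv_entry({"label": "example disorder"}, code_to_st, st_labels))  # ['ST-100', 'ST-200']
print(map_sv_entry({"code": "SRC-999"}, code_to_st, st_labels))            # []
```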

Mapping Clinical Code Sets to Ontology Code Sets: The following mappings between ST clinical code sets and OBO ontology code sets (see the BioLater project for more information) were developed, by clinical domain:

Diagnoses and Problem Lists: Over 29,000 unique ST clinical codes (SNOMED CT) were mapped to codes in the Human Phenotype Ontology and the Human Disease Ontology. We verified 1,000 manually mapped or constructed mappings with a clinician who had over 5 years of professional medical coding experience.
Medications: Over 8,000 unique ST clinical codes (RxNorm) were mapped to codes in the Chemical Entities of Biological Interest Ontology, the Vaccine Ontology, the Protein Ontology, and the NCBITaxon. We verified 25% of the ingredients that were manually mapped or constructed with a professional pharmacist.
Laboratory Tests: Over 2,500 unique ST clinical codes (LOINC) were manually mapped to codes in the Human Phenotype Ontology. We verified 1,500 of the mappings with a professional ontologist. In addition, a subset of 270 mappings was manually verified by three clinicians.
Procedures: ST clinical codes (CPT4) were mapped to codes in the National Cancer Institute Thesaurus. STILL DECIDING IF WE WILL PURSUE MAPPING PROCEDURES AS WELL.
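
The exact-match rule behind the OBO-EM code sets (a database cross-reference hit, or an exact string match between the clinical label and an ontology label, definition, or synonym) can be sketched as below. The record layout, the identifiers, and the case-insensitive comparison are assumptions for illustration; they are not the project's actual mapping code.

```python
def exact_matches(clinical, ontology_terms):
    """Return IDs of ontology terms that exactly match one clinical concept.

    clinical: dict with 'code' and 'label'.
    ontology_terms: list of dicts with 'id' and optional 'label', 'definition',
                    'synonyms' (list), and 'xrefs' (list of clinical codes).
    """
    target = clinical["label"].strip().lower()
    hits = []
    for term in ontology_terms:
        strings = [term.get("label", ""), term.get("definition", ""), *term.get("synonyms", [])]
        by_xref = clinical["code"] in term.get("xrefs", [])
        # Assumption: string comparison ignores case and surrounding whitespace.
        by_string = any(target == s.strip().lower() for s in strings)
        if by_xref or by_string:
            hits.append(term["id"])
    return hits

# Placeholder ontology records, for illustration only.
ontology_terms = [
    {"id": "ONT:0001", "label": "Example disorder", "synonyms": ["Sample disease"], "xrefs": ["ST-100"]},
    {"id": "ONT:0002", "label": "Other condition", "definition": "Example finding", "synonyms": []},
    {"id": "ONT:0003", "label": "Unrelated term", "synonyms": []},
]
print(exact_matches({"code": "ST-100", "label": "Something else"}, ontology_terms))  # ['ONT:0001']
print(exact_matches({"code": "ST-200", "label": "sample disease"}, ontology_terms))  # ['ONT:0001']
print(exact_matches({"code": "ST-300", "label": "Example finding"}, ontology_terms))  # ['ONT:0002']
```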

ASSUMPTIONS:

If an SV concept is not present in the OMOP CONCEPT table, it will be excluded from all mappings and analyses
Codes in problem lists are treated like diagnosis codes

Experiments

To demonstrate the generalizability, translatability, and scalability of PheKnowVec, we will perform several experiments drawing from the following four categories:

Phenotype Cohort Assignment: patients will be assigned to cohorts in two ways:
1. If they have at least one occurrence of one or more codes specified by the phenotype definition (Phenotype Codes)
2. If they meet the criteria specified in the phenotype definition (Phenotype Definitions)
Cohort Group: cases vs. controls
Code Sets: each of the code sets defined in Table 2
Clinical Data Types: Only Conditions vs. All Clinical Domains (conditions, medications, laboratory tests, and procedures)

The generalizability and translatability experiments will be performed using both the CHCO OMOP DeID and the OMOP MIMIC 3 data. In the scalability task, the OMOP MIMIC 3 data will be used only to validate the representations built from the CHCO OMOP DeID data.

Generalizability
For each phenotype, we will examine what information is gained and/or lost when deriving pediatric and adult patient cohorts using different clinical code sets (Figure). For all comparisons, the SV clinical code set will be used as the gold standard.

For each phenotype, the comparisons listed in Table 3 (shown below) will be performed and evaluated using:

False Negative and False Positive Error Rate: the number of incorrectly included (false positive or FP) or missed (false negative or FN) patients using each ST clinical code set versus the SV clinical code set (i.e. gold standard patient cohort) for cases and controls.
Cohort Verification: randomly select a small subset of FP and FN patients assigned to one or two phenotypes for review by a clinician.
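
The error counts and the random selection for clinician review described above can be sketched with set operations. The fractions reported here are one possible normalization (FP as a share of the mapped cohort, FN as a share of the gold standard cohort); that choice, the function names, and the fixed random seed are assumptions rather than part of the plan.

```python
import random

def cohort_errors(gold, candidate):
    """Compare a candidate cohort against the gold standard cohort."""
    gold, candidate = set(gold), set(candidate)
    false_pos = candidate - gold   # included by the mapped code set only
    false_neg = gold - candidate   # missed by the mapped code set
    return {
        "fp": false_pos,
        "fn": false_neg,
        # Assumed denominators; other definitions of "error rate" are possible.
        "fp_fraction": len(false_pos) / len(candidate) if candidate else 0.0,
        "fn_fraction": len(false_neg) / len(gold) if gold else 0.0,
    }

def sample_for_review(patient_ids, k, seed=0):
    """Pick up to k patients at random for clinician review (seeded for repeatability)."""
    ids = sorted(patient_ids)
    return random.Random(seed).sample(ids, min(k, len(ids)))

# Hypothetical patient IDs.
gold_cohort = {1, 2, 3, 4}
mapped_cohort = {2, 3, 4, 5, 6}
result = cohort_errors(gold_cohort, mapped_cohort)
print(sorted(result["fp"]), sorted(result["fn"]))    # [5, 6] [1]
print(result["fp_fraction"], result["fn_fraction"])  # 0.4 0.25
```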

TABLE 3. Comparisons to evaluate the generalizability of mapping clinical code sets.

Phenotype Cohort Assignment	Cases	Controls	Code Sets	Clinical Data Types
Phenotype Codes	X	X	SV	Only Condition
Phenotype Definitions	X	X	SV	All Clinical Domains
Phenotype Codes	X	X	SV	Only Condition
Phenotype Definitions	X	X	SV	All Clinical Domains

Phenotype Codes	X	X	ST	Only Condition
Phenotype Definitions	X	X	ST	All Clinical Domains
Phenotype Codes	X	X	ST	Only Condition
Phenotype Definitions	X	X	ST	All Clinical Domains

Phenotype Codes	X	X	ST Children	Only Condition
Phenotype Definitions	X	X	ST Children	All Clinical Domains
Phenotype Codes	X	X	ST Children	Only Condition
Phenotype Definitions	X	X	ST Children	All Clinical Domains

Translatability
For each phenotype, we will examine what information is gained and/or lost when deriving pediatric and adult patient cohorts when ST clinical code sets are mapped to ontology code sets (Figure). For all comparisons, the ST clinical code set will be used as the gold standard.

For each phenotype, the comparisons listed in Table 4 (shown below) will be performed and evaluated using:

False Negative and False Positive Error Rate: the number of incorrectly included (false positive or FP) or missed (false negative or FN) patients using each ontology code set versus the ST clinical code set (i.e. gold standard patient cohort) for cases and controls.
Cohort Verification: randomly select a small subset of FP and FN patients assigned to one or two phenotypes for review by a clinician.

To illustrate the translational ability of this approach, we will take 1 or 2 of the phenotypes and extend their ontology code sets to include additional open data sources that are already manually annotated to the ontologies (e.g. DOID and HPO contain hand-annotated gene list mappings). We will then perform basic clustering within each phenotype to derive sub-groups, identify the most important codes within each cluster, and ask a domain expert to help interpret the results.
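
The plan does not say how the "most important" codes in a cluster will be chosen. One simple possibility, shown here purely as an assumption, is to rank codes by the number of patients in the cluster who carry them. The sketch takes cluster labels as given (from whatever clustering method is used); all names and data are hypothetical.

```python
from collections import Counter, defaultdict

def top_codes_per_cluster(patient_codes, cluster_of, n=2):
    """Rank codes within each cluster by how many patients carry them.

    patient_codes: dict mapping patient ID -> iterable of codes.
    cluster_of: dict mapping patient ID -> cluster label (assumed precomputed).
    """
    counts = defaultdict(Counter)
    for pid, codes in patient_codes.items():
        counts[cluster_of[pid]].update(set(codes))  # count each code once per patient
    return {
        cluster: [code for code, _ in sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))[:n]]
        for cluster, c in counts.items()
    }

# Hypothetical patients, codes, and cluster labels.
codes_by_patient = {"p1": ["X", "Y"], "p2": ["X", "Z"], "p3": ["W"]}
clusters = {"p1": 0, "p2": 0, "p3": 1}
print(top_codes_per_cluster(codes_by_patient, clusters))  # {0: ['X', 'Y'], 1: ['W']}
```

Ties are broken alphabetically so that the output is deterministic.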

TABLE 4. Comparisons to evaluate the translatability of mapping clinical code sets to ontology code sets.

Phenotype Cohort Assignment	Cases	Controls	Code Sets	Clinical Data Types
Phenotype Codes	X	X	ST	All Clinical Domains
Phenotype Definitions	X	X	ST	All Clinical Domains

Phenotype Codes	X	X	ST Children	All Clinical Domains
Phenotype Definitions	X	X	ST Children	All Clinical Domains

Phenotype Codes	X	X	OBO-EM	All Clinical Domains
Phenotype Definitions	X	X	OBO-EM	All Clinical Domains

Phenotype Codes	X	X	OBO-EM Children	All Clinical Domains
Phenotype Definitions	X	X	OBO-EM Children	All Clinical Domains

Phenotype Codes	X	X	OBO-EM+MM	All Clinical Domains
Phenotype Definitions	X	X	OBO-EM+MM	All Clinical Domains

Phenotype Codes	X	X	OBO-EM+MM Children	All Clinical Domains
Phenotype Definitions	X	X	OBO-EM+MM Children	All Clinical Domains

Scalability
For each phenotype, we will create patient-level embeddings for each of the cohorts that were derived using the clinical and ontology code sets from the translatability experiments (Figure). Two types of patient-level embeddings will be built. The first type of embedding will include only the clinical codes explicitly outlined by the phenotype definition. The second type of embedding will be built using all available data. For all comparisons, the clinical and ontology code sets (without descendants) from the translatability experiments will be used as gold standards.
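
The embedding method itself is not specified in this plan. As a placeholder only, the sketch below builds a patient-level vector by averaging pre-existing code vectors, which is one common and simple choice, and shows how the two embedding types differ: one is restricted to the phenotype's codes and the other uses every code in the record. The function name, the `restrict_to` parameter, and the toy vectors are all assumptions.

```python
def patient_vector(codes, code_vectors, restrict_to=None):
    """Average the vectors of a patient's codes (assumed aggregation, not the plan's method).

    codes: codes found in the patient's record.
    code_vectors: dict mapping code -> list of floats (hypothetical code embeddings).
    restrict_to: optional set of phenotype codes; when given, other codes are ignored.
    """
    used = [
        c for c in codes
        if c in code_vectors and (restrict_to is None or c in restrict_to)
    ]
    if not used:
        return None  # no usable codes for this patient
    dim = len(code_vectors[used[0]])
    return [sum(code_vectors[c][i] for c in used) / len(used) for i in range(dim)]

# Toy two-dimensional code vectors, for illustration only.
code_vectors = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
record = ["A", "B", "C"]
only_phenotype = patient_vector(record, code_vectors, restrict_to={"A", "B"})
all_data = patient_vector(record, code_vectors)
print(only_phenotype)                     # [0.5, 0.5]
print([round(x, 3) for x in all_data])    # [0.667, 0.667]
```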

For each pediatric phenotype, the comparisons listed in Table 5 (shown below) will be performed and evaluated using the approaches described below:

Leave-One-Patient-Out (LOPO) Cross Validation with Logistic Regression and Youden Index Thresholding: within each phenotype, the cosine similarity between each patient's vector and every other patient's vector will be calculated. The Youden Index will then be used to select a cut-off that converts the continuous cosine similarity scores from the pairwise patient comparisons into binary classifications. This task will be performed for cases and controls, within each phenotype, as well as pooled so that we can determine how well the definition vectors identify the correct patients within each case and control group, across all phenotypes.
- Performance metrics for each case and control group by phenotype: accuracy, precision, recall, ROC curves and counts of TP for most similar 1, 5, 10, 25, 50, 75, and 100 patients.
Aggregated Case and Control Phenotype Definition Vectors: We will apply the same approach described above, but will compare all patients within each case and control group for each phenotype to the aggregated case and control phenotype definition vectors. The best performing code set phenotype definition vectors (1 clinical and 1 ontology code set) will be applied to the OMOP MIMIC data.
- Performance will be evaluated through domain expert-review of patient groups returned for each aggregated cohort group within each phenotype.
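
The scoring steps named above can be sketched in plain Python: cosine similarity between two vectors, a cut-off chosen by maximizing Youden's J (sensitivity + specificity - 1), and the count of true positives among the k most similar patients. This is a minimal illustration under stated assumptions; the function names and the toy scores are hypothetical, the rule "predict positive when score >= threshold" is one possible convention, and the logistic regression step is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    labels: 1 for a true match, 0 otherwise; both classes must be present.
    A score is classified positive when it is >= the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best = (None, -1.0)
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        j = tp / positives + tn / negatives - 1
        if j > best[1]:
            best = (t, j)
    return best

def true_positives_at_k(scores, labels, k):
    """Count true matches among the k highest-scoring patients."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(y for _, y in ranked[:k])

# Toy similarity scores and labels, for illustration only.
scores = [0.9, 0.8, 0.6, 0.5, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(cosine([1.0, 0.0], [1.0, 0.0]))          # 1.0
print(youden_threshold(scores, labels))        # (0.6, 1.0)
print(true_positives_at_k(scores, labels, 5))  # 3
```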

TABLE 5. Comparisons to evaluate the scalability of embedded clinical and ontology code sets.

Phenotype Cohort Assignment	Cases	Controls	Code Sets	Clinical Data Types	Embedded Data
Phenotype Definitions	X	X	ST	All Clinical Domains	Only phenotype codes
Phenotype Definitions	X	X	ST	All Clinical Domains	All available data

Phenotype Definitions	X	X	ST Children	All Clinical Domains	Only phenotype codes
Phenotype Definitions	X	X	ST Children	All Clinical Domains	All available data

Phenotype Definitions	X	X	OBO-EM	All Clinical Domains	Only phenotype codes
Phenotype Definitions	X	X	OBO-EM	All Clinical Domains	All available data

Phenotype Definitions	X	X	OBO-EM Children	All Clinical Domains	Only phenotype codes
Phenotype Definitions	X	X	OBO-EM Children	All Clinical Domains	All available data

Phenotype Definitions	X	X	OBO-EM+MM	All Clinical Domains	Only phenotype codes
Phenotype Definitions	X	X	OBO-EM+MM	All Clinical Domains	All available data

Phenotype Definitions	X	X	OBO-EM+MM Children	All Clinical Domains	Only phenotype codes
Phenotype Definitions	X	X	OBO-EM+MM Children	All Clinical Domains	All available data
