This repository contains feature extraction definitions that process patient data represented in the Common Data Model for Heart Failure Research and transform it into a tabular, AI-ready dataset format used to train ML models. The suite is shared by both the DataTools4Heart (DT4H) and AI4HF projects. Feature extraction is realized via four main concepts: populations, feature groups, feature sets, and pipelines.
Broadly, the feature extraction suite extracts patients' data from the FHIR patient data repository based on a population definition.
Feature groups then extract groups of raw features for specific healthcare resources such as conditions, medications, and lab measurements. For each feature group, a timeseries table is created such that:
- Each record matching the feature group's FHIR query is mapped to a row in the table
- Each feature defined in the feature group is converted to a column in the table
In the next step, feature sets operate on the timeseries data generated by the feature groups to produce the final tabular dataset. Feature sets allow the following dataset manipulations:
- Identification of the reference time points that become the data points of the final dataset
- Grouping data into configurable time periods relative to the reference time points
- Applying aggregations to the grouped data
Pipelines associate feature sets with populations: a dataset, as configured by the feature sets, is generated for the population specified in the pipeline.
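All of this is driven through a single HTTP call once the suite is running. The sketch below is illustrative only — `<featureSetId>` and `<populationId>` are placeholders — and mirrors the `$extract` commands shown later in this README:

```sh
# Illustrative: a pipeline binds a feature set to a population; the feature set
# references feature groups whose FHIR queries fill the timeseries tables that
# are then aggregated into the final tabular dataset.
curl -X POST 'http://<hostname>/<basePath>/feast/api/DataSource/myFhirServer/FeatureSet/<featureSetId>/Population/<populationId>/$extract?reset=true'
```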
- definitions/featuregroup/: Feature group definitions aligned with the Heart Failure CDM (encounters, vital signs, meds, labs, etc.); matching FHIR pipelines live in definitions/featuregroup/pipeline/.
- definitions/featureset/: Final feature sets including care-heart, maggic-mlp, study1-fs, study2-fs, study3-fs, and synthetic-risk-score.
- definitions/population/: Cohort definitions (care-heart, maggic, maggic_cprd, study1, study1-vhir, study2, study3) with FHIR pipelines under definitions/population/pipeline/.
- definitions/datasetqualitycriteria/: Dataset QA rules (e.g., maggic-quality-criteria.json).
- definitions/valuesets/: Reserved for value set catalogues (currently empty).
- docker/: Compose file and helper scripts (pull.sh, run.sh, clean-and-stop.sh, server configs).
- readme-assets/: Logos referenced in this README.
- Completing the deployment instructions of either of the following
After mapping the data source to the common data model, the feature extraction process can be started. Feature extraction configurations for both DT4H and AI4HF are maintained in this repository.
Navigate into a working directory to run the tools: `<workspaceDir>`

```sh
git clone https://github.com/DataTools4Heart/feature-extraction-suite
```

Run the following scripts in the `<workspaceDir>`:

```sh
sh ./feature-extraction-suite/docker/pull.sh
sh ./feature-extraction-suite/docker/run.sh
```

- For `feature-extraction-suite` deployment, the matching `data-ingestion-suite` (DT4H or AI4HF) must first be deployed successfully and mapping must be run. If you used the Nginx Docker container during the `data-ingestion-suite` deployment, update the Nginx config for `feature-extraction-suite` by following these steps:
Navigate into the working directory:

```sh
cd <workspaceDir>
```

Edit the `./data-ingestion-suite/docker/proxy/nginx.conf` file and uncomment the following lines:

```nginx
# location /<basePath>/feast {
#     proxy_pass http://<feast-service-name>:8085/onfhir-feast;
#     proxy_set_header Host $host;
#     proxy_set_header X-Real-IP $remote_addr;
# }
```

Restart the Nginx container:

```sh
./data-ingestion-suite/docker/proxy/restart.sh
```

- Or, if your host machine is already running Nginx, insert the following proxy configuration and restart Nginx:
```nginx
location /<basePath>/feast {
    proxy_pass http://<hostname>:<port>/onfhir-feast;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
}
```
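If Nginx runs directly on the host, you can validate the edited configuration before restarting; this is standard Nginx tooling rather than anything specific to this suite:

```sh
# Check the configuration for syntax errors, then reload without downtime
sudo nginx -t && sudo nginx -s reload
```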
- To start the feature extraction process for a specific study, use the following cURL commands. Replace `<hostname>` with your server hostname and `<basePath>` with the path segment you configured in Nginx (e.g., `dt4h` or `ai4hf`).
```sh
curl -X POST 'http://<hostname>/<basePath>/feast/api/DataSource/myFhirServer/FeatureSet/study1-fs/Population/study1_cohort/$extract?entityMatching=pid|pid,encounterId|encounterId&reset=true'
curl -X POST 'http://<hostname>/<basePath>/feast/api/DataSource/myFhirServer/FeatureSet/study2-fs/Population/study2_cohort/$extract?entityMatching=pid|pid,encounterId|encounterId&reset=true'
curl -X POST 'http://<hostname>/<basePath>/feast/api/DataSource/myFhirServer/FeatureSet/study3-fs/Population/study3_cohort/$extract?entityMatching=pid|pid&reset=true'
```

- The extraction process may take a long time to complete, depending on the size of the data.
- After completion, the dataset will be available in the following location, for example:

  ```
  <workspaceDir>/feature-extraction-suite/output-data-cli/myFhirServer/<entityType>/<entityId>/...
  ```
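To verify that files were actually produced, you can list the output directory; a minimal check using the example path above:

```sh
# Recursively list the extracted dataset files for the myFhirServer data source
ls -R <workspaceDir>/feature-extraction-suite/output-data-cli/myFhirServer/
```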
- For statistics (metadata) about the datasets:
  `https://<hostname>/<basePath>/feast/api/Dataset`
- For statistics (metadata) about a specific dataset:
  `https://<hostname>/<basePath>/feast/api/Dataset/<datasetId>`
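These endpoints can also be queried from the command line; a minimal sketch, assuming plain HTTP GET access through the proxy configured above:

```sh
# Metadata for all extracted datasets
curl -X GET 'https://<hostname>/<basePath>/feast/api/Dataset'

# Metadata for a single dataset
curl -X GET 'https://<hostname>/<basePath>/feast/api/Dataset/<datasetId>'
```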
Use this section to completely remove all `feature-extraction-suite` containers, volumes, and data, then perform a fresh installation.
Run the clean-and-stop script to stop all containers and remove associated volumes:
Warning: This will permanently delete all persisted data, including extracted datasets, metadata, and feature extraction history.
```sh
sh ./feature-extraction-suite/docker/clean-and-stop.sh
```

If you also want to perform a clean installation of the `data-ingestion-suite`, follow the instructions in the data-ingestion-suite README's "Clean Installation from Scratch" section before proceeding.
```sh
# Pull the latest feature extraction suite code
cd feature-extraction-suite
git pull
cd ..

# Pull the latest images
sh ./feature-extraction-suite/docker/pull.sh
```

After completing the above steps (and ensuring `data-ingestion-suite` is running if you cleaned it), start the feature extraction suite:

```sh
sh ./feature-extraction-suite/docker/run.sh
sh ./data-ingestion-suite/docker/proxy/restart.sh  # Optional
```
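To confirm the suite came back up, a quick check with standard Docker tooling (container names depend on the compose file, so inspect the list rather than relying on a specific name):

```sh
# List running containers; the feature extraction services should appear here
docker ps
```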