This project demonstrates a robust, end-to-end medallion architecture for healthcare payer data analytics using Databricks, Delta Lake, and PySpark/Spark SQL. The pipeline covers data ingestion, cleansing, modeling, and analytics, following the Bronze/Silver/Gold (Medallion) pattern.
The notebook and supporting scripts guide you through:
- Structuring a data lakehouse for payer claims data
- Building a layered ETL process (Bronze → Silver → Gold)
- Parameterizing locations and schemas for reusable, production-ready pipeline runs
- Running scalable analytics on healthcare claims, members, providers, diagnosis, and procedure data
- Bronze Layer: Raw, minimally processed data ingested from CSV files into Delta tables
- Silver Layer: Cleaned, deduplicated, type-corrected, and joined data ready for analysis
- Gold Layer: Curated analytics tables (e.g., claims enrichment, member claim summaries)
This modular pattern ensures data lineage, scalability, and easy extensibility for additional healthcare analytics use cases.
- PySpark + Spark SQL: Hybrid approach for transformation logic, enabling easy customization and automation
- Completely Parameterized: All paths, database, and table names are provided as widgets/variables for effortless reuse (see the widget sketch after this list)
- Synthetic Demo Data: Simulated payer, claims, diagnostic, procedure, member, and provider tables for training and testing
- Production-Grade Practices: Explicit schema definitions, robust error handling, clear layering, and portability
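As a rough illustration, widget-based parameterization might look like the following; the widget names and defaults here are placeholders, not necessarily the notebook's exact ones:

```python
# Placeholder widget names/defaults; the notebook's actual parameters may differ.
dbutils.widgets.text("catalog", "main", "Target catalog")
dbutils.widgets.text("schema", "payer_demo", "Target schema")
dbutils.widgets.text("volume_path", "/Volumes/main/payer_demo/raw", "Source CSV path")

catalog = dbutils.widgets.get("catalog")
schema = dbutils.widgets.get("schema")
volume_path = dbutils.widgets.get("volume_path")

# Fully qualified table names are then derived from the parameters, e.g.:
bronze_claims_table = f"{catalog}.{schema}.claims_raw"
```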
- Clone or import this notebook/project to your Databricks workspace
- Ensure you have access to Unity Catalog volumes (see the setup sketch after this list)
- Upload the provided sample CSV files into your configured volume locations (per parameter values)
- (Optional) Edit the default widget parameters at the top for your target catalog, schema, and data paths
- Run the notebook top-to-bottom
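If the target catalog, schema, or volume does not exist yet, a cell like the following can create them, assuming you have the required Unity Catalog privileges (the names are placeholders matching the widget defaults above):

```python
# Placeholder names; requires CREATE CATALOG / CREATE SCHEMA / CREATE VOLUME privileges.
spark.sql("CREATE CATALOG IF NOT EXISTS main")
spark.sql("CREATE SCHEMA IF NOT EXISTS main.payer_demo")
spark.sql("CREATE VOLUME IF NOT EXISTS main.payer_demo.raw")
```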
- Bronze: Ingest CSVs into raw Delta tables using `COPY INTO` and ETL (see the sketch after this list)
- Silver: Transform and clean raw tables into analytics-ready models
- Gold: Build enrichment and summary tables for reporting
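A minimal sketch of a Bronze-layer load for one table; the table name, path, and options are illustrative, and the notebook's actual statements may differ (for example, by defining explicit schemas):

```python
# Illustrative Bronze ingestion for the claims table; names/paths are placeholders,
# not the notebook's exact parameters. Assumes the CSV has a header row.
catalog, schema = "main", "payer_demo"                  # placeholder names
src = "/Volumes/main/payer_demo/raw/claims.csv"         # placeholder path

# Create an empty Delta table, then let COPY INTO infer and merge the schema.
spark.sql(f"CREATE TABLE IF NOT EXISTS {catalog}.{schema}.claims_raw")
spark.sql(f"""
  COPY INTO {catalog}.{schema}.claims_raw
  FROM '{src}'
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
  COPY_OPTIONS ('mergeSchema' = 'true')
""")
```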
Tables included (Bronze): `claims_raw`, `diagnosis_raw`, `procedures_raw`, `providers_raw`, `members_raw`
- Silver & Gold layers build on top, implementing best practices for date/double casting, data cleaning, deduplication, and joins.
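A hedged sketch of a Silver cleansing pass and a Gold member-claim summary; column names such as `claim_id`, `member_id`, `claim_amount`, and `claim_date` are assumptions for illustration and may differ from the sample data's actual schema:

```python
from pyspark.sql import functions as F

catalog, schema = "main", "payer_demo"   # placeholder names

# Silver: cast types, deduplicate, and drop rows missing a member (columns are assumed).
claims_silver = (
    spark.table(f"{catalog}.{schema}.claims_raw")
    .withColumn("claim_date", F.to_date("claim_date"))
    .withColumn("claim_amount", F.col("claim_amount").cast("double"))
    .dropDuplicates(["claim_id"])
    .filter(F.col("member_id").isNotNull())
)
claims_silver.write.mode("overwrite").saveAsTable(f"{catalog}.{schema}.claims_silver")

# Gold: per-member claim summary joined to member attributes.
members = spark.table(f"{catalog}.{schema}.members_raw")
member_claim_summary = (
    claims_silver.groupBy("member_id")
    .agg(
        F.count("claim_id").alias("claim_count"),
        F.sum("claim_amount").alias("total_claim_amount"),
    )
    .join(members, on="member_id", how="left")
)
member_claim_summary.write.mode("overwrite").saveAsTable(
    f"{catalog}.{schema}.member_claim_summary"
)
```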
- Healthcare claims analytics and visualization
- Payer data quality and pipeline testing
- Data platform engineering training (Databricks focused)
- Accelerating migration to medallion/lakehouse in real-world payer environments
- Databricks workspace (with permissions for Unity Catalog)
- Databricks Runtime with Delta Lake support
- Upload access for CSV source files
git clone https://github.com/bigdatavik/databricksfirststeps.git
# or import the Databricks notebook directly via UI
In Databricks:
- Open the notebook in your Workspace or Repo folder.
- Edit the top parameter cells for your environment (optional).
- Upload your sample data files to `/Volumes/<catalog>/<schema>/<volume>/...` as needed (see the listing sketch below).
- Run the notebook and explore your new lakehouse!
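To sanity-check that the sample files are visible to the notebook, a quick listing of the volume path can help (the path below is a placeholder):

```python
# Placeholder path; substitute your own catalog/schema/volume.
display(dbutils.fs.ls("/Volumes/main/payer_demo/raw"))
```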
.
├── payer_medallion_etl_notebook.py
├── data/
│ ├── claims.csv
│ ├── diagnosis.csv
│ ├── procedures.csv
│ ├── providers.csv
│ └── members.csv
├── README.md
└── LICENSE
Pull requests and discussions are welcome! For bug reports or suggestions, please open a GitHub issue.