A pharmaceutical commercial analytics platform built on two free, public CMS datasets — answering the central question that pharma commercial teams, regulators, and health-economics researchers debate:
Do physicians who receive payments from a pharma company prescribe more of that company's drugs?
Built with the modern analytics-engineering stack: PostgreSQL · dbt · Streamlit · Python, with an LLM-powered "Explain This Chart" feature on every visualization.
→ Open the deployed Streamlit app
Six views: Executive Dashboard, KOL Finder, Company Intelligence, Payment vs. Prescribing, Market Opportunity Map, and About this project. Every chart has an LLM-powered "Explain This Chart" button.
Five clearly bounded layers. The dbt mart is the contract between the data team and the application team; the Streamlit app does not know — and does not need to know — that raw CMS CSVs ever existed.
flowchart TB
subgraph SRC["1 · Data Sources"]
OP["CMS Open Payments<br/>(2022)"]
PD["CMS Medicare Part D PUF<br/>(2022)"]
end
subgraph STORAGE["2 · Storage (Neon Postgres)"]
RAW["raw schema<br/>(filtered: top 10 mfrs,<br/>≥ $50 payments)"]
STG["raw_staging (views)"]
INT["raw_intermediate (views)"]
MART["raw_mart<br/>★ star schema:<br/>3 dims + 3 facts + 1 bridge"]
end
subgraph TRANSFORM["3 · Transformation (dbt)"]
DBT["13 models<br/>57 passing tests"]
end
subgraph APP["4 · Application (Streamlit)"]
APPPY["app.py · st.navigation hub"]
VIEWS["6 views<br/>(Dashboard, KOL Finder,<br/>Company Intel, Payment-Rx,<br/>Market Map, About)"]
end
subgraph AI["5 · AI Augmentation"]
BTN["Explain this chart"]
GROQ["Groq · Llama 3.3 70B"]
end
OP -->|"download + load"| RAW
PD -->|"download + load"| RAW
DBT -.->|"builds"| STG
DBT -.->|"builds"| INT
DBT -.->|"builds"| MART
RAW --> STG --> INT --> MART
MART --> APPPY --> VIEWS
VIEWS --> BTN -->|"chart context"| GROQ
GROQ -->|"insight"| BTN
Three dimension tables (orange) and three fact tables (red).
fact_payment_prescribing is the headline analytical table —
a FULL OUTER JOIN of payments and prescribing on
(physician × company), capturing every relationship type:
paid_and_prescribed, paid_no_rx, and rx_no_payment.
erDiagram
DIM_PHYSICIAN {
text physician_npi PK
text physician_display_id "Anonymized"
text specialty
text state
bool received_pharma_payments
}
DIM_COMPANY {
text company_name PK
text company_display_name "Pfizer, AbbVie..."
text therapeutic_areas
}
DIM_DRUG {
text drug_key PK
text drug_brand_name
text drug_generic_name
text therapeutic_class
}
FACT_PAYMENTS {
text payment_id
text physician_npi FK
text company_name FK
date payment_date
numeric payment_amount_usd
text payment_category
}
FACT_PRESCRIPTIONS {
text prescription_id
text physician_npi FK
text company_name FK
text therapeutic_class
int total_claim_count
}
FACT_PAYMENT_PRESCRIBING {
text physician_npi FK
text company_name FK
numeric total_payment_usd
int company_claim_count
text relationship_type
}
DIM_PHYSICIAN ||--o{ FACT_PAYMENTS : "has many"
DIM_COMPANY ||--o{ FACT_PAYMENTS : "pays"
DIM_PHYSICIAN ||--o{ FACT_PRESCRIPTIONS : "prescribes"
DIM_COMPANY ||--o{ FACT_PRESCRIPTIONS : "manufactures"
DIM_PHYSICIAN ||--o{ FACT_PAYMENT_PRESCRIBING : "paired with"
DIM_COMPANY ||--o{ FACT_PAYMENT_PRESCRIBING : "paired with"
| Document | Covers |
|---|---|
| docs/ARCHITECTURE.md | dbt model lineage (all 13 models), user-interaction sequence, LLM "Explain this chart" flow |
| docs/DATA_MODEL.md | Star schema ER diagram, data-quality enforcement (57 tests), privacy-by-design |
| docs/TABLE_REFERENCE.md | Plain-English table tour — every schema/table/view, sample questions, how to read dbt's compiled SQL |
| METHODOLOGY.md | Statistical formulas (lift ratio, HHI, investment ratio), the BMS reporting quirk, what we can and cannot conclude |
The U.S. government collects two public datasets that, taken together, allow this question to be asked rigorously:
| Dataset | What it tracks | Source |
|---|---|---|
| CMS Open Payments (Sunshine Act) | Every dollar a pharma / med-device company paid a U.S. physician — speaking fees, consulting, meals, travel, research | openpaymentsdata.cms.gov |
| CMS Medicare Part D Prescribers | Every drug prescribed under Medicare Part D, by physician, with claim volume and cost | data.cms.gov |
Neither dataset answers the question alone. Joining them on physician NPI — and attributing each prescribed drug to a manufacturer (a curated mapping; CMS doesn't publish one) — is the analytics-engineering work this project does.
The same join sits behind every pharma commercial analytics function: ZS Associates, IQVIA, Komodo Health, Veeva, and every Big Pharma in-house brand analytics team builds variations of it.
.
├── dbt_project/ # Analytics warehouse (dbt + Postgres)
│ ├── models/
│ │ ├── staging/ # Type casting, deduplication, name canonicalization
│ │ ├── intermediate/ # Business logic, drug→company attribution
│ │ └── mart/ # Star schema: facts + dimensions
│ ├── analyses/ # 5 showcase SQL analyses (see below)
│ └── seeds/ # Curated drug→manufacturer mapping
├── streamlit_app/ # 6-page interactive dashboard
│ ├── app.py # Navigation hub + sidebar (st.navigation entry point)
│ ├── views/ # Dashboard, KOL Finder, Company, Payment-Rx, Market, About
│ └── utils/ # styles + charts + DB + LLM helpers
├── scripts/ # Data download + load pipeline
├── .github/workflows/ # CI: ruff (format + lint) + dbt parse
└── .pre-commit-config.yaml # Auto-format on every git commit
Each is a standalone SQL file in dbt_project/analyses/ written as a
demonstration of advanced SQL across a realistic business question.
| # | Analysis | Key SQL techniques | Business decision it informs |
|---|---|---|---|
| 1 | KOL Identification | ROW_NUMBER() OVER, percentile windows, cumulative density |
Which physicians should the field team prioritize? |
| 2 | Payment-to-Prescribing Correlation | FILTER aggregates, PERCENTILE_CONT, CORR(), specialty controls |
Are paid physicians prescribing more (controlled for specialty)? |
| 3 | Company Spend Efficiency | Pivot-via-FILTER, ratio metrics, null-safe division | Which companies get the highest associated prescribing per dollar paid? |
| 4 | Geographic Market Analysis | Cross-joins for shares-of-totals, FULL OUTER JOIN |
Where is pharma over- or under-invested vs. prescribing demand? |
| 5 | Physician Loyalty (HHI Scoring) | Window functions for shares, sum-of-squared-shares (Herfindahl) | Which physicians prescribe single-source vs. multi-brand within a class? |
| Layer | Tool | Why |
|---|---|---|
| Database | Neon (free-tier Postgres) | Cloud Postgres, no Docker needed, generous free tier, familiar |
| Transformations | dbt Core + dbt-postgres |
Industry standard for analytics engineering. Lineage, testing, docs built-in |
| App framework | Streamlit + Plotly | Fast to build, free Streamlit Cloud deployment |
| LLM ("Explain This Chart") | Groq Llama 3.3 70B | Free tier with generous limits, sub-second inference |
| CI | GitHub Actions | Free for public repos. Runs Ruff format + lint + dbt parse on every push |
| Data quality | dbt-utils, dbt-expectations |
Generic + Great-Expectations-style tests |
| Local hygiene | pre-commit (Ruff hook) | Auto-formats Python on every git commit, so CI can't fail on style |
The full CMS files are large (Open Payments ≈ 7GB unzipped CSV; Part D by Drug ≈ 3.7GB). To stay well within Neon's 512 MB free tier and leave room for future projects, the load script applies:
- Open Payments:
- Keep only the top 10 manufacturers by 2022 spend (these alone represent ~60% of total industry physician spend)
- Drop payments under $50 — this removes ~70% of records (mostly $5–20 individual meal payments) while preserving all consulting, speaking, travel, research, and meaningful meal events
- Part D Prescribers: filter to ~15 specialties that drive the bulk of pharma commercial focus, drop providers with < 10 total claims
- Part D by Drug: keep only the ~60 brand-name drugs in our curated drug → manufacturer seed file
Resulting Postgres footprint: ~150 MB.
See scripts/config.py for the exact filter parameters.
- Python 3.12 specifically (dbt-core's
mashumarodependency has issues on Python 3.13+; olderpydanticversions don't work on 3.10) - A free Neon Postgres database
- A free Groq API key (for the LLM feature)
# 1. Clone the repo
git clone https://github.com/mateoportillo1900/physician-pharma-analytics.git
cd physician-pharma-analytics
# 2. Create a Python 3.12 virtual env and install dependencies
py -3.12 -m venv .venv # Windows; use `python3.12` on macOS / Linux
.venv\Scripts\activate # Windows
# source .venv/bin/activate # macOS / Linux
pip install -r requirements.txt
# 3. Install the pre-commit hook (so commits auto-format)
pre-commit install
# 4. Configure secrets
cp .env.example .env
# Edit .env and fill in DATABASE_URL, NEON_HOST, NEON_USER, NEON_PASSWORD,
# NEON_DBNAME, and GROQ_API_KEY
# 5. Download CMS data (~5 GB total; takes 10-30 min depending on bandwidth)
python scripts/download_data.py
# 6. Filter and load into Neon (~5 min)
python scripts/load_data.py
# 7. Build the dbt warehouse
cd dbt_project
dbt deps # install dbt packages
dbt seed --profiles-dir . # load the drug→company mapping
dbt run --profiles-dir . # build all models
dbt test --profiles-dir . # run all 57 data-quality tests
cd ..
# 8. Run the Streamlit app
streamlit run streamlit_app/app.pycd dbt_project
dbt compile --profiles-dir . # compiles SQL but doesn't run analyses
# Then copy the compiled SQL from target/compiled/.../analyses/ and run it
# against your warehouse directly using psql, DBeaver, or the VS Code
# SQLTools extension.Detailed methodology — including the NPI matching strategy, drug
attribution rules, sample-size thresholds, and what we can and cannot
conclude from observational data — is in METHODOLOGY.md.
Headline caveats:
- All analyses are descriptive, not causal. Observational data cannot distinguish payment-influences-prescribing from prescribers-attract-payments.
- The drug → manufacturer mapping is curated to ~60 high-volume drugs across 9 therapeutic classes. Drugs outside this set are excluded.
- Company name canonicalization consolidates known subsidiaries (Janssen → Johnson & Johnson, etc.) but small mis-attributions are inevitable.
- The Part D file is restricted to the top prescribing specialties to fit within free-tier limits — generalist primary-care prescribers are underrepresented.
This project uses only publicly released federal data and processes no patient information of any kind.
| Concern | Status | Why |
|---|---|---|
| HIPAA / PHI | ✅ Not applicable | No patient identifiers, diagnoses, or encounter records are anywhere in this project. The Medicare Part D file is CMS's de-identified, aggregated Public Use File — cell counts under 11 are pre-suppressed by CMS for residual privacy |
| Physician privacy | ✅ Strengthened by design | Physician NPIs and names are professional identifiers (not PHI) and are intentionally published by CMS — but for this public portfolio demo, the mart layer surfaces only deterministic surrogate IDs (e.g., Physician #4837), not real names. Underlying analytics are unchanged |
| CMS terms of use | ✅ Compliant | CMS data is in the public domain. CMS explicitly encourages re-analysis and re-publication of these files |
| Sunshine Act compliance | ✅ Foundational | The Physician Payments Sunshine Act (§ 6002 of the Affordable Care Act, 2010) was passed specifically to make this data publicly available |
Precedents: ProPublica's Dollars for Docs (online continuously since 2014), peer-reviewed papers in JAMA Internal Medicine and NEJM, and every major pharma commercial analytics firm (ZS Associates, IQVIA, Komodo Health, Veeva) all build on these same public datasets.
What this project is: a descriptive analytics platform demonstrating the commercial-analytics tradecraft used across the pharmaceutical industry. It is not investigative journalism, does not make causal claims about individual physicians, and does not suggest any specific payment-prescribing relationship is improper.
- DeJong et al. (2016), JAMA Internal Medicine: "Pharmaceutical Industry-Sponsored Meals and Physician Prescribing Patterns for Medicare Beneficiaries" — the foundational paper.
- Yeh et al. (2016), JAMA Internal Medicine: extends to specific drug classes.
- ProPublica's Dollars for Docs: ongoing journalistic coverage of CMS Open Payments at the individual-physician level.
Code: MIT. Data: public domain (CMS).
This project was built as part of a portfolio demonstrating healthcare-domain analytics engineering. Questions, feedback, or hiring inquiries — LinkedIn.