Description
We are using OFFLINE dimension tables in Apache Pinot and are facing duplicate rows with the same primary key during batch ingestion.
Currently:
- APPEND ingestion is not supported for dimension tables
- REFRESH ingestion keeps re-reading the same input files
- This results in duplicate primary keys in the dimension table
Pinot recently added support to detect and error on duplicate primary keys using:
"dimensionTableConfig": {
  "errorOnDuplicatePrimaryKey": true
}
(PR: #12290)
While this helps catch the issue, it does not solve the core use case, where we want to overwrite existing rows by primary key (UPSERT semantics) instead of failing ingestion.
Current Behavior
- OFFLINE dimension tables do not support UPSERT
- Duplicate primary keys are either:
  - silently allowed (default), or
  - rejected using errorOnDuplicatePrimaryKey=true
- There is no way to overwrite an existing row for a primary key during OFFLINE ingestion
Expected Behavior
Support UPSERT semantics for OFFLINE dimension tables, similar to REALTIME upsert tables:
- If a record with an existing primary key is ingested:
  - overwrite the existing row
  - do not create duplicate records
- Allow deterministic, idempotent batch ingestion
- Enable safe reprocessing and reruns
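To make the requested behavior concrete, here is a minimal Python sketch of overwrite-by-primary-key ingestion. It is only a model of the semantics, not Pinot code: `upsert_batch`, the `store_id` key, and the `area` column are hypothetical names chosen for illustration, and it assumes that the last row seen for a key wins.

```python
from typing import Iterable

Row = dict[str, object]


def upsert_batch(
    table: dict[tuple, Row], batch: Iterable[Row], pk_cols: list[str]
) -> dict[tuple, Row]:
    """Apply a batch with overwrite-by-primary-key semantics.

    Assumption: the last row seen for a given primary key wins.
    """
    result = dict(table)  # copy so the input table is left untouched
    for row in batch:
        key = tuple(row[c] for c in pk_cols)
        result[key] = row  # overwrite instead of appending a duplicate
    return result


# Hypothetical input with a repeated primary key (store_id = 1).
batch = [
    {"store_id": 1, "area": "north"},
    {"store_id": 2, "area": "south"},
    {"store_id": 1, "area": "east"},
]

once = upsert_batch({}, batch, ["store_id"])
twice = upsert_batch(once, batch, ["store_id"])  # simulate rerunning the same input

print(len(once))            # 2
print(once[(1,)]["area"])   # east
print(once == twice)        # True
```

Rerunning the same input leaves the table unchanged, which is the idempotency property asked for above.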
Why this is needed
Offline dimension tables are commonly used for:
- Slowly changing dimensions (store, area, category, mappings)
- Periodic full refreshes or partial backfills
- Reference data that naturally evolves over time
Without upsert support:
- Pipelines are fragile
- Reruns cause duplicates
- Users are forced to move dimension data to REALTIME ingestion, which is not always desirable
Workarounds today
- Enable strict validation:
  "dimensionTableConfig": {
    "errorOnDuplicatePrimaryKey": true
  }
  → Prevents bad data but breaks ingestion
- Move dimension data to a REALTIME upsert table
→ Works, but adds operational complexity and is not ideal for batch-managed dimensions
Proposal
Add support for OFFLINE UPSERT dimension tables, where:
- Primary key uniqueness is enforced
- Latest record overwrites the previous one
- Behavior is deterministic and rerun-safe
This would align OFFLINE dimension tables with REALTIME upsert capabilities and significantly simplify batch ingestion workflows.
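One open question in the proposal is what "latest" means when a batch is read from several files in no guaranteed order. REALTIME upsert tables resolve this with a comparison column; whether an OFFLINE variant would do the same is an assumption here. The sketch below illustrates that option in Python, with `upsert_latest` and the `updated_at` column as hypothetical names, and shows that the result does not depend on arrival order as long as comparison values are unique per key.

```python
from typing import Iterable


def upsert_latest(
    rows: Iterable[dict], pk_cols: list[str], comparison_col: str
) -> list[dict]:
    """Keep one row per primary key: the one with the largest comparison value.

    Ties keep the first row seen, so fully order-independent output
    requires comparison values to be unique within each primary key.
    """
    latest: dict[tuple, dict] = {}
    for row in rows:
        key = tuple(row[c] for c in pk_cols)
        current = latest.get(key)
        # Replace only when the incoming row is strictly newer.
        if current is None or row[comparison_col] > current[comparison_col]:
            latest[key] = row
    # Sort by primary key so the output order is stable across runs.
    return [latest[k] for k in sorted(latest)]


# Hypothetical rows; updated_at plays the role of the comparison column.
rows = [
    {"store_id": 1, "area": "north", "updated_at": 100},
    {"store_id": 1, "area": "east", "updated_at": 200},
    {"store_id": 2, "area": "south", "updated_at": 150},
]

forward = upsert_latest(rows, ["store_id"], "updated_at")
backward = upsert_latest(reversed(rows), ["store_id"], "updated_at")

print(forward == backward)   # True
print(forward[0]["area"])    # east
```

Without such a column, "latest" would have to fall back to file or row order, which would need to be defined explicitly for reruns to stay deterministic.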
Related Issues / PRs
- Duplicate primary key handling for dimension tables: Able to add duplicate rows with same primary key in dimension table #12284
- Disallow duplicate primary keys (error-only):