Support UPSERT semantics for OFFLINE dimension tables (overwrite on primary key) #17535

@u-ranjith-kumar

Description

We are using OFFLINE dimension tables in Apache Pinot and are facing duplicate rows with the same primary key during batch ingestion.

Currently:

  • APPEND ingestion is not supported for dimension tables
  • REFRESH ingestion keeps re-reading the same input files
  • This results in duplicate primary keys in the dimension table

Pinot recently added support to detect and error on duplicate primary keys using:

"dimensionTableConfig": {
  "errorOnDuplicatePrimaryKey": true
}

(PR: #12290)

While this helps catch the issue, it does not solve the core use case: we want to overwrite existing rows by primary key (UPSERT semantics) instead of failing ingestion.

Current Behavior

  • OFFLINE dimension tables do not support UPSERT

  • Duplicate primary keys are either:

    • silently allowed (default), or
    • rejected using errorOnDuplicatePrimaryKey=true
  • There is no way to overwrite an existing row for a primary key during OFFLINE ingestion
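
To make the two outcomes above concrete, here is a toy model in Python. It is not Pinot code: `find_duplicate_keys`, `load_rows`, and the `error_on_duplicate` argument are hypothetical names that only mirror the behavior described in this issue (keep every row by default, or fail when a duplicate key is present).

```python
from collections import Counter


def find_duplicate_keys(rows, pk_columns):
    """Return the primary-key tuples that occur more than once in rows."""
    counts = Counter(tuple(row[c] for c in pk_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]


def load_rows(rows, pk_columns, error_on_duplicate=False):
    """Toy loader (hypothetical): reject duplicates if asked, else keep every row."""
    duplicates = find_duplicate_keys(rows, pk_columns)
    if error_on_duplicate and duplicates:
        raise ValueError(f"duplicate primary keys: {duplicates}")
    # Default path: nothing is overwritten, so duplicate keys survive.
    return list(rows)
```

Neither path overwrites the older row for a key, which is the gap this issue is about.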


Expected Behavior

Support UPSERT semantics for OFFLINE dimension tables, similar to REALTIME upsert tables:

  • If a record with an existing primary key is ingested:

    • overwrite the existing row
    • do not create duplicate records
  • Allow deterministic, idempotent batch ingestion

  • Enable safe reprocessing and reruns
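
The semantics we are asking for can be sketched in a few lines of Python. This is only an illustration of "last record wins per primary key", not a proposed Pinot implementation; `upsert_rows` and the sample column names are made up for the example.

```python
def upsert_rows(existing, incoming, pk_columns):
    """Merge incoming rows into existing ones; the last row seen for a key wins."""
    merged = {}
    for row in list(existing) + list(incoming):
        key = tuple(row[c] for c in pk_columns)
        merged[key] = row  # overwrites any earlier row with the same key
    return list(merged.values())


existing = [
    {"store_id": 1, "area": "north"},
    {"store_id": 2, "area": "south"},
]
batch = [
    {"store_id": 2, "area": "east"},  # same key as an existing row
    {"store_id": 3, "area": "west"},  # new key
]

once = upsert_rows(existing, batch, ["store_id"])
twice = upsert_rows(once, batch, ["store_id"])  # simulate a rerun of the same batch
```

Applying the same batch a second time leaves the result unchanged (`once == twice`), which is the idempotent, rerun-safe behavior described above.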


Why this is needed

Offline dimension tables are commonly used for:

  • Slowly changing dimensions (store, area, category, mappings)
  • Periodic full refreshes or partial backfills
  • Reference data that naturally evolves over time

Without upsert support:

  • Pipelines are fragile
  • Reruns cause duplicates
  • Users are forced to move dimension data to REALTIME ingestion, which is not always desirable

Workarounds today

  1. Enable strict validation:
"dimensionTableConfig": {
  "errorOnDuplicatePrimaryKey": true
}

    → Prevents bad data, but ingestion fails whenever a duplicate key is present

  2. Move dimension data to a REALTIME upsert table
    → Works, but adds operational complexity and is not ideal for batch-managed dimensions
Proposal

Add support for OFFLINE UPSERT dimension tables, where:

  • Primary key uniqueness is enforced
  • Latest record overwrites the previous one
  • Behavior is deterministic and rerun-safe

This would align OFFLINE dimension tables with REALTIME upsert capabilities and significantly simplify batch ingestion workflows.

