
[Umbrella] Support Aggregation Merge Engine #2133

@platinumhamburg

Description

Search before asking

  • I searched in the issues and found nothing similar.

Motivation

When the Aggregation Merge Engine is used with multiple concurrent Flink jobs writing to the same primary-key table, the following issues occur:

  1. Concurrent Conflicts: Multiple writers updating the same aggregation columns simultaneously cause data inconsistency
  2. No Exactly-Once Guarantee: Job failover cannot guarantee aggregation accuracy
  3. Duplicate Calculations: Data replay after job restart leads to duplicate aggregation
  4. State Loss: Lack of state management to track committed data

These issues make the Aggregation Merge Engine unsuitable for production scenarios requiring strict data accuracy.

Solution

The complete solution is proposed in FIP-21: Aggregation Merge Engine.

1. State Management

  • WriterState: Tracks the maximum committed offset per TableBucket
  • BucketOffsetTracker: Client-side offset tracking (see the sketch after this list)
  • Integration: Uses Flink's operator state for persistence across checkpoints
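A minimal sketch of what such client-side offset tracking could look like, assuming a simplified `TableBucket` stand-in and hypothetical method names (`commit`, `snapshot`, `restore`); this is not the actual FIP-21 API, and in practice the snapshot would be stored in Flink operator state on checkpoint and restored on failover:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical client-side tracker of the maximum committed offset per bucket. */
public class BucketOffsetTracker {

    /** Simplified bucket key; the real Fluss TableBucket may carry more fields. */
    public record TableBucket(long tableId, int bucketId) {}

    private final Map<TableBucket, Long> committedOffsets = new ConcurrentHashMap<>();

    /** Records a newly committed offset, keeping only the maximum per bucket. */
    public void commit(TableBucket bucket, long offset) {
        committedOffsets.merge(bucket, offset, Math::max);
    }

    /** Returns the highest committed offset for a bucket, if any. */
    public Optional<Long> committedOffset(TableBucket bucket) {
        return Optional.ofNullable(committedOffsets.get(bucket));
    }

    /** Immutable snapshot, e.g. to persist in Flink operator state on checkpoint. */
    public Map<TableBucket, Long> snapshot() {
        return Map.copyOf(committedOffsets);
    }

    /** Restores offsets from a previously snapshotted state after failover. */
    public void restore(Map<TableBucket, Long> state) {
        state.forEach(this::commit);
    }
}
```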

2. Undo Recovery

  • Mechanism: On failover, undoes uncommitted data written after the last checkpoint
  • Implementation: Reads the old values at the committed offset and overwrites the uncommitted new values (see the sketch below)
  • Guarantee: Ensures data consistency after job restart
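A rough sketch of the undo step under the assumptions above; the `LogReader`, `TableWriter`, and `Row` interfaces are hypothetical placeholders for illustration, not Fluss APIs:

```java
import java.util.Set;

/** Hypothetical sketch of rolling back uncommitted writes after a failover. */
public class UndoRecovery {

    public interface Row {}

    /** Hypothetical reader over a bucket's change log. */
    public interface LogReader {
        /** Keys that were written after the given committed offset. */
        Set<byte[]> keysWrittenAfter(long committedOffset);
        /** The value of a key as of the committed offset, or null if it did not exist yet. */
        Row valueAt(byte[] key, long committedOffset);
    }

    /** Hypothetical writer back to the primary-key table. */
    public interface TableWriter {
        void upsert(byte[] key, Row value);
        void delete(byte[] key);
    }

    /**
     * For each key touched after the last committed offset, restore the value it had
     * at that offset (or delete it if it did not exist), so replayed data is
     * re-aggregated from a consistent baseline instead of being applied twice.
     */
    public static void undoUncommitted(LogReader reader, TableWriter writer, long committedOffset) {
        for (byte[] key : reader.keysWrittenAfter(committedOffset)) {
            Row oldValue = reader.valueAt(key, committedOffset);
            if (oldValue != null) {
                writer.upsert(key, oldValue);
            } else {
                writer.delete(key);
            }
        }
    }
}
```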

Anything else?

No response

Willingness to contribute

  • I'm willing to submit a PR!
