
Reduce snapshot history validation in Flink IcebergFilesCommitter when committing to a table #3102

Closed
@Reo-LEI

Description


Currently, IcebergFilesCommitter validates the entire snapshot history every time it commits a new snapshot in commitDeltaTxn. As a result, the same snapshots are verified over and over, and each commit spends a lot of time reading manifest files. This is why IcebergFilesCommitter needs to open multiple Avro metadata files and takes several minutes in #2900 (comment). In more detail: Flink calls notifyCheckpointComplete(ckptId) immediately after calling snapshotState(ckptId), and the committer then traverses the whole snapshot history to verify that the data files referenced by pos-delete files still exist. This blocks the committer thread and makes snapshotState(ckptId+1) time out if HDFS responds slowly or the table has too many manifest files to traverse.

I think IcebergFilesCommitter does not need to validate the entire snapshot history on every commit; it only needs to validate the snapshots between the last committed snapshot ID and the current snapshot ID. For the first commit of an IcebergFilesCommitter, we still need to traverse the whole snapshot history to ensure that the referenced data files still exist.
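
To make the proposal concrete, here is a minimal, self-contained sketch of the range selection. It does not use the Iceberg API: the snapshot history is modeled as a plain map from snapshot ID to parent snapshot ID, and the names `SnapshotRangeSketch`, `snapshotsToValidate`, and `parentOf` are hypothetical. It only illustrates which snapshots would still need validation, not how the committer would check the data files.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model, not Iceberg code: a snapshot is just an ID with a parent ID.
final class SnapshotRangeSketch {

    /**
     * Walks back from currentId through the parent links and returns the IDs of
     * the snapshots that still need validation, newest first.
     *
     * The walk stops when it reaches lastCommittedId (exclusive). If
     * lastCommittedId is null (first commit of this committer), the walk
     * continues to the root, so the whole history is returned.
     */
    static List<Long> snapshotsToValidate(
            Map<Long, Long> parentOf, Long currentId, Long lastCommittedId) {
        List<Long> result = new ArrayList<>();
        Long cursor = currentId;
        while (cursor != null && !cursor.equals(lastCommittedId)) {
            result.add(cursor);
            cursor = parentOf.get(cursor); // null for the root or an unknown ID
        }
        return result;
    }

    public static void main(String[] args) {
        // Linear history 1 <- 2 <- 3 <- 4; snapshot 1 has no parent.
        Map<Long, Long> parentOf = new HashMap<>();
        parentOf.put(1L, null);
        parentOf.put(2L, 1L);
        parentOf.put(3L, 2L);
        parentOf.put(4L, 3L);

        // Later commit: only the snapshots after the last committed one (2).
        System.out.println(snapshotsToValidate(parentOf, 4L, 2L));   // prints [4, 3]

        // First commit: no last committed snapshot, so the full history.
        System.out.println(snapshotsToValidate(parentOf, 4L, null)); // prints [4, 3, 2, 1]
    }
}
```

With this shape, the cost of each commit depends on the number of snapshots added since the previous commit rather than on the total length of the history, which is the saving the proposal is after.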
