@brishi19791
Contributor

This PR removes redundant DeltaLog.getSnapshotAt(version) calls in the Delta source conversion path that were happening for every commit. getSnapshotAt can internally trigger an expensive Spark job and associated network I/O (e.g., listing/reading Delta log metadata from remote storage) to resolve the snapshot for a given version. We now fetch the snapshot once per commit/version and reuse it to construct the InternalTable (via a DeltaTableExtractor.table(Snapshot, tableName) overload), instead of re-resolving the same snapshot multiple times.

Impact

  • Avoids redundant snapshot resolution work per commit/version (and the Spark job + network calls it may trigger).
  • Reduces end-to-end conversion latency, especially for large commit backlogs.
  • No intended functional behavior change; performance optimization only.
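The change described above can be sketched as follows. This is a minimal illustration, not the code from this PR: `FakeDeltaLog`, `Snapshot`, `InternalTable`, the static `table` helpers, and the table name `orders` are hypothetical stand-ins, and the assumption that the old path resolved the snapshot exactly twice per version is made only for the example. The counter stands in for the Spark job and remote I/O that a real `getSnapshotAt` call may trigger.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins; none of these are the real Delta or XTable classes.
class SnapshotReuseSketch {
    static final class Snapshot {
        final long version;
        Snapshot(long version) { this.version = version; }
    }

    static final class InternalTable {
        final String name;
        final long version;
        InternalTable(String name, long version) { this.name = name; this.version = version; }
    }

    /** Fake log that counts how often a snapshot is resolved. */
    static final class FakeDeltaLog {
        int resolutions = 0;
        Snapshot getSnapshotAt(long version) {
            resolutions++; // in a real system this may cost a Spark job plus remote I/O
            return new Snapshot(version);
        }
    }

    // Old shape (assumed): the extractor resolves the snapshot itself.
    static InternalTable table(FakeDeltaLog log, String tableName, long version) {
        return table(log.getSnapshotAt(version), tableName);
    }

    // New shape: an overload that accepts an already-resolved snapshot.
    static InternalTable table(Snapshot snapshot, String tableName) {
        return new InternalTable(tableName, snapshot.version);
    }

    static int resolutionsBefore(List<Long> versions) {
        FakeDeltaLog log = new FakeDeltaLog();
        for (long v : versions) {
            log.getSnapshotAt(v);    // resolved once while processing the commit
            table(log, "orders", v); // resolved a second time for the same version
        }
        return log.resolutions;
    }

    static int resolutionsAfter(List<Long> versions) {
        FakeDeltaLog log = new FakeDeltaLog();
        for (long v : versions) {
            Snapshot snapshot = log.getSnapshotAt(v); // resolved once per version
            table(snapshot, "orders");                // reused, no second lookup
        }
        return log.resolutions;
    }

    public static void main(String[] args) {
        List<Long> versions = Arrays.asList(1L, 2L, 3L);
        System.out.println("before=" + resolutionsBefore(versions)); // before=6
        System.out.println("after=" + resolutionsAfter(versions));   // after=3
    }
}
```

Under these assumptions the number of resolutions drops from two per version to one, which is where the latency saving on large commit backlogs would come from.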

@kevinjqliu
Contributor

> We now fetch the snapshot once per commit/version and reuse it to construct the InternalTable (via a DeltaTableExtractor.table(Snapshot, tableName) overload), instead of re-resolving the same snapshot multiple times.

That's awesome, thanks for adding the perf improvement. It might be a good idea to add a test (either here or in a follow-up PR) to verify that the entire conversion process reads the DeltaLog only once.
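One way such a test could look, as a rough sketch rather than a test against the real classes: wrap the snapshot lookup in a counting decorator and check the count after a conversion run. `SnapshotSource`, `CountingSource`, `convert`, and the string snapshots are all hypothetical, and "once" is interpreted here as once per version, which is an assumption about the intent of the suggestion.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical test scaffolding; not the real Delta or XTable API.
class SnapshotCallCountSketch {
    interface SnapshotSource {
        String snapshotAt(long version);
    }

    /** Decorator that records how many times each version is resolved. */
    static final class CountingSource implements SnapshotSource {
        private final SnapshotSource delegate;
        final Map<Long, Integer> calls = new HashMap<>();

        CountingSource(SnapshotSource delegate) { this.delegate = delegate; }

        @Override
        public String snapshotAt(long version) {
            calls.merge(version, 1, Integer::sum);
            return delegate.snapshotAt(version);
        }
    }

    // Stand-in for the conversion loop under test: one lookup, two consumers.
    static void convert(SnapshotSource source, long from, long to) {
        for (long v = from; v <= to; v++) {
            String snapshot = source.snapshotAt(v);
            buildTable(snapshot);
            buildCommitChanges(snapshot);
        }
    }

    static String buildTable(String snapshot) { return "table:" + snapshot; }

    static String buildCommitChanges(String snapshot) { return "changes:" + snapshot; }

    /** True when every version in [from, to] was resolved exactly once and nothing else was. */
    static boolean eachVersionResolvedOnce(CountingSource source, long from, long to) {
        for (long v = from; v <= to; v++) {
            if (source.calls.getOrDefault(v, 0) != 1) {
                return false;
            }
        }
        return source.calls.size() == (to - from + 1);
    }

    public static void main(String[] args) {
        CountingSource source = new CountingSource(v -> "snapshot-" + v);
        convert(source, 5, 8);
        if (!eachVersionResolvedOnce(source, 5, 8)) {
            throw new AssertionError("a version was resolved more than once: " + source.calls);
        }
        System.out.println("each version resolved once");
    }
}
```

In a real test suite the same check could likely be expressed with a Mockito spy and `verify(..., times(1))` on the snapshot lookup instead of a hand-written counter.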
