Do we need to introduce a new concept, Snapshot, here? I think for systems like Iceberg, snapshot and table version are two different things and are tracked separately, hence two concepts. But here they are the same thing. I feel it would be nice to keep the number of concepts to a minimum if possible. Curious what others think, cc @wjones127. Also, I feel there is a mix of bad user experience and performance concerns.
There seem to be two parts to this proposal (but I may be misunderstanding):
I think I understand (and agree with) the first point. I'm not sure about the second. Let's say you issue a range query and get back 20 matching locations. Is the goal here to reduce 40 reads (manifest + transaction) into 20 reads (combined manifest/transaction)?
Motivation
Case 1: No Time-Based Version Access
Currently, there is no convenient way to access historical versions by timestamp. Users must first list all versions, find the one matching the desired timestamp, then explicitly check out that version. This workflow is cumbersome for time travel use cases.
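The manual lookup step described above amounts to roughly the following. This is a minimal sketch under stated assumptions: the records are hypothetical stand-ins for a version listing (a version number plus a commit timestamp), timestamps are assumed to increase with version number, and `version_as_of` is an illustrative helper, not an existing API. The final checkout call is omitted because it depends on the dataset API in use.

```python
from bisect import bisect_right
from datetime import datetime, timezone

# Hypothetical version records, sorted by version number.
# Assumption: commit timestamps increase monotonically with version.
versions = [
    {"version": 1, "timestamp": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"version": 2, "timestamp": datetime(2024, 1, 5, tzinfo=timezone.utc)},
    {"version": 3, "timestamp": datetime(2024, 1, 9, tzinfo=timezone.utc)},
]


def version_as_of(versions, ts):
    """Return the latest version committed at or before ts, or None."""
    timestamps = [v["timestamp"] for v in versions]
    # bisect_right gives the count of timestamps <= ts.
    idx = bisect_right(timestamps, ts)
    return versions[idx - 1]["version"] if idx > 0 else None


# The caller still has to list every version before it can do this lookup.
target = version_as_of(versions, datetime(2024, 1, 6, tzinfo=timezone.utc))
```

Note that the binary search itself is cheap; the cost in this workflow comes from having to materialize the full version list first.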
Case 2: Full manifest scan for version listing
The existing implementation (Dataset.versions()) reads manifests sequentially when listing versions. For large tables with many versions, this causes significant memory and time overhead. Additionally, there is no support for range-based queries, so every request loads all manifests even when only one page of version history is needed (as illustrated at the bottom).
Case 3: No unified metadata entry point for the metadata table
Version and Transaction are separate types with no unified interface for accessing snapshot-level metadata. This forces upper layers to handle multiple types and understand their relationships.
Proposed Solution
Introduce a new Snapshot concept as a unified, lightweight entry point for all snapshot-level metadata.
Key Goals
Interface
A prototype for this: #5886
Snapshot
Snapshots API
Snapshot API
Benefit Scenarios
Time Travel
Metadata Table
This is a real use case from integrating Lance with Apache Amoro:

Current implementation:
This approach loads all manifests upfront and cannot be paginated server-side.
With Snapshot, we can implement efficient pagination:
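As a rough illustration of the difference in read cost (not the prototype's code), the sketch below counts manifest reads for both patterns. Everything in it is hypothetical: the `Snapshot` fields, the `FakeManifestStore` stand-in for object storage, and both listing functions. It also assumes version numbers are contiguous and start at 1.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    # Hypothetical fields; the proposed Snapshot type may differ.
    version: int
    timestamp_ms: int


class FakeManifestStore:
    """Stand-in for object storage: one manifest per version, reads counted."""

    def __init__(self, latest_version):
        self.latest_version = latest_version
        self.reads = 0

    def read_manifest(self, version):
        self.reads += 1
        return Snapshot(version, 1_700_000_000_000 + version * 1000)


def list_all_then_slice(store, offset, limit):
    """Load-everything pattern: read every manifest, paginate in memory."""
    everything = [
        store.read_manifest(v) for v in range(1, store.latest_version + 1)
    ]
    return everything[offset:offset + limit]


def list_page(store, offset, limit):
    """Range-based pattern: read only the manifests on the requested page."""
    start = offset + 1  # versions are assumed contiguous, starting at 1
    stop = min(start + limit, store.latest_version + 1)
    return [store.read_manifest(v) for v in range(start, stop)]


full_store = FakeManifestStore(latest_version=1000)
full_page = list_all_then_slice(full_store, offset=40, limit=20)

paged_store = FakeManifestStore(latest_version=1000)
paged_page = list_page(paged_store, offset=40, limit=20)
```

Both calls return the same 20 snapshots, but under these assumptions the first performs one read per version in the table while the second performs one read per row on the page.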