[POC] Alternate implementation of using snapshot schema when reading snapshot #3314
Conversation
According to Edwin Choi, in order to get the schema for a snapshot, the only safe option is to scan the metadata files to find the one whose current-snapshot-id matches the target snapshot id.
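A minimal sketch of that scan, assuming access to the table's current TableMetadata and a FileIO; the helper name and error handling are illustrative, not this PR's actual code:

```java
import org.apache.iceberg.Schema;
import org.apache.iceberg.TableMetadata;
import org.apache.iceberg.TableMetadataParser;
import org.apache.iceberg.io.FileIO;

class SnapshotSchemas {
  // Walk the metadata log until we find the metadata file that was current
  // when the target snapshot was the table's current snapshot.
  static Schema schemaForSnapshot(TableMetadata current, FileIO io, long targetSnapshotId) {
    if (current.currentSnapshot() != null
        && current.currentSnapshot().snapshotId() == targetSnapshotId) {
      return current.schema();
    }
    for (TableMetadata.MetadataLogEntry entry : current.previousFiles()) {
      TableMetadata previous = TableMetadataParser.read(io, entry.file());
      if (previous.currentSnapshot() != null
          && previous.currentSnapshot().snapshotId() == targetSnapshotId) {
        return previous.schema();
      }
    }
    throw new IllegalArgumentException("Cannot find metadata file for snapshot " + targetSnapshotId);
  }
}
```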
The changes are mostly in spark3. They are necessitated by the catalog support introduced in apache#1783. As the spark3 IcebergSource now implements SupportsCatalogOptions, DataFrameReader#load no longer calls IcebergSource#getTable but calls SparkCatalog#loadTable directly. In order for the SparkTable returned by SparkCatalog#loadTable(Identifier) to be aware of the snapshot, the information about the snapshot needs to be present in the Identifier. For this reason, we introduce a SnapshotAwareIdentifier interface extending Identifier. As SupportsCatalogOptions does not allow a schema to be specified (requested), SparkTable no longer needs a requestedSchema field, so some dead code is removed from it.
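For illustration, a SnapshotAwareIdentifier along these lines would carry the selection into SparkCatalog#loadTable; the accessor names here are assumptions, not necessarily the PR's:

```java
import org.apache.spark.sql.connector.catalog.Identifier;

interface SnapshotAwareIdentifier extends Identifier {
  // Snapshot selected via "snapshot-id", or null if none was requested (assumed accessor)
  Long snapshotId();

  // Timestamp in millis selected via "as-of-timestamp", or null (assumed accessor)
  Long asOfTimestamp();
}
```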
Rebased on master. Use constants from SparkReadOptions. Implement snapshotSchema() in SparkFilesScan as it extends SparkBatchScan.
Avoid introducing new methods to BaseTable. Add helper methods to SnapshotUtil instead. Move recovery of the schema from previous metadata files, in the event that the snapshot does not have an associated schema id, to a new PR. Remove the snapshotSchema method from SparkBatchScan and its subclasses, as it is not needed. Adjust schema in BaseTableScan when useSnapshot is called.
Use the existing CatalogAndIdentifier and swap out the Identifier for a snapshot-aware TableIdentifier if snapshotId or asOfTimestamp is set.
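Roughly, the swap could look like the following; the SparkTableIdentifier constructor shown is hypothetical:

```java
import org.apache.spark.sql.connector.catalog.Identifier;

class Identifiers {
  static Identifier maybeSnapshotAware(Identifier ident, Long snapshotId, Long asOfTimestamp) {
    if (snapshotId == null && asOfTimestamp == null) {
      return ident; // no time travel requested; keep the plain Identifier
    }
    // Hypothetical constructor: wraps the original namespace/name plus the snapshot selection
    return new SparkTableIdentifier(ident.namespace(), ident.name(), snapshotId, asOfTimestamp);
  }
}
```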
Fix a bug in BaseTableScan#useSnapshot. Some cleanup in SnapshotUtil. Some streamlining in the added unit tests. Refactor the spark2 Reader to configure the TableScan on construction, and let the TableScan get the schema for the snapshot. Rename the new TableIdentifier to SparkTableIdentifier to avoid confusion with the existing TableIdentifier (in a different package). Add a convenience constructor to PathIdentifier to avoid modifying tests for it.
…ableScan. Use SnapshotUtil.snapshotIdAsOfTime in BaseTableScan#asOfTime. Move formatTimestampMillis from BaseTableScan to SnapshotUtil in order to use it there (BaseTableScan is package-private, not a public class). Fix some error messages.
Incorporate the approach shown in apache#3269 by Ryan Blue. That defines a syntax for selecting a snapshot or timestamp through the table name. Use that instead of a SnapshotAwareIdentifier to load the SparkTable.
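A hedged example of that table-name syntax; the selector spellings below follow what was later merged in #3722, and the POC's exact spelling may differ:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.active();
// Select a snapshot by id, or the snapshot current as of a timestamp (millis)
Dataset<Row> bySnapshot = spark.sql("SELECT * FROM db.table.snapshot_id_10963874102873");
Dataset<Row> byTime = spark.sql("SELECT * FROM db.table.at_timestamp_1633516400000");
```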
Force-pushed from f7fadf3 to 129cd44.
Remove two unused imports left behind by a removed test.
```java
CatalogManager catalogManager = spark.sessionState().catalogManager();

if (path.contains("/")) {
  // contains a path. Return iceberg default catalog and a PathIdentifier
  String newPath = selector.equals("") ? path : path + "#" + selector;
```
This is one area I'm not sure about. I am not too familiar with Hadoop tables and PathIdentifier. Is the only thing that can go after `#` the name of a metadata table (and now the snapshot/timestamp selector)?
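For context, with path-based (Hadoop) tables the fragment after `#` today selects a metadata table; a typical load looks like the following, with an illustrative path:

```java
// Load the snapshots metadata table for a path-based table
Dataset<Row> snapshots = spark.read()
    .format("iceberg")
    .load("hdfs://nn:8020/warehouse/db/table#snapshots");
```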
```java
// If the table is loaded using the Spark DataFrame API, and option("snapshot-id", <snapshot_id>)
// or option("as-of-timestamp", <timestamp>) is applied to the DataFrameReader, SparkTable will be
// constructed with a non-null snapshotId. Subsequently SparkTable#newScanBuilder will be called
// with the options, which will include "snapshot-id" or "as-of-timestamp".
// On the other hand, if the table is loaded using SQL, with the table suffixed with a snapshot
// or timestamp selector, then SparkTable will be constructed with a non-null snapshotId, but
// SparkTable#newScanBuilder will be called without the "snapshot-id" or "as-of-timestamp" option.
// We therefore add a "snapshot-id" option here in this latter case.
// As a consistency check, if "snapshot-id" is in the options, the id must match what we already
// have.
```
This took me some figuring out. We only need to add the snapshot id if the SparkTable is loaded from SQL via the table name syntax, not when it is loaded using the DataFrame API.
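A sketch of the two loading paths described above; the table name, snapshot id, and selector spelling are illustrative:

```java
// 1) DataFrame API: the option reaches SparkTable#newScanBuilder directly
Dataset<Row> viaOption = spark.read()
    .format("iceberg")
    .option("snapshot-id", 10963874102873L)
    .load("db.table");

// 2) SQL with the table-name selector: SparkTable is built with a non-null snapshotId,
//    but newScanBuilder sees no "snapshot-id" option, so one is added internally
Dataset<Row> viaSql = spark.sql("SELECT * FROM db.table.snapshot_id_10963874102873");
```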
Obsolete. Superseded by #3722, which has been merged.
This is an alternate approach to #1508, based on #3269. It handles the support for Spark 3.