Spark: Support time travel through table names #3269
Conversation
@wypoon, this is a prototype of how we can cleanly pass snapshot and timestamp selectors through identifiers. Could you take a look?
In Hive we enabled this in HIVE-25344:
Having different ways to do time travel with different engines will cause confusion for users. Would it be good to keep the standard SQL syntax, as we did in Hive?
@pvary, that would be great. But Spark is unlikely to add that syntax any time soon. I proposed it years ago and had people push back against it. I think it is more important to support time travel in SQL soon than to keep SQL syntax uniform. I know that would be ideal, though.
I agree with @rdblue that we should allow time travel queries using the table name. It's not likely for Spark to add the standard syntax any time soon.
@@ -80,6 +87,9 @@
 */
public class SparkCatalog extends BaseCatalog {
  private static final Set<String> DEFAULT_NS_KEYS = ImmutableSet.of(TableCatalog.PROP_OWNER);
  private static final Splitter COMMA = Splitter.on(",");
  private static final Pattern AT_TIME = Pattern.compile("at(?:_(?:time(?:stamp)?)?)?_?(\\d+)");
What's the reason for supporting so many different ways to specify the time and snapshot id? My opinion is we should just support 1 shortcut and 1 full name for each, for example at_ and at_timestamp_ for timestamp travel, s_ and snapshot_id_ for snapshot ID travel. Maybe even the full name ones are not necessary.
I may have gone a bit too far here. I think that we want to be able to use the full "at_timestamp_12938471" version because that is the least likely to conflict with existing table names. Similarly, I think it is valuable to have a short version, like at_<timestamp> and snap_<id>, so you don't have to type at_timestamp_ or the full snapshot_id_ every time. Since we were already testing multiple prefixes, I added a few that I thought would be valuable to avoid confusion:
- Make the final _ optional because that's an easy mistake to make
- Allow time and not just timestamp because people may not remember
- Allow omitting _id and allow just snap_ or s_ as shortened forms

The logic made sense at every step but there are quite a few variations. I'm up for defining the full version and one shortening if you think it's best to have just a couple.
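To make the accepted spellings concrete, here is a minimal, self-contained sketch that runs the AT_TIME pattern from the diff above against the variants discussed; the class name and sample timestamp are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AtTimePatternDemo {
  // the pattern from the SparkCatalog diff above
  private static final Pattern AT_TIME =
      Pattern.compile("at(?:_(?:time(?:stamp)?)?)?_?(\\d+)");

  public static void main(String[] args) {
    // the spellings discussed: full form, "time" instead of "timestamp",
    // the short form, and the final underscore made optional
    String[] variants = {
        "at_timestamp_12938471",
        "at_time_12938471",
        "at_12938471",
        "at_timestamp12938471"
    };
    for (String variant : variants) {
      Matcher m = AT_TIME.matcher(variant);
      if (m.matches()) {
        // every variant captures the same timestamp: 12938471
        System.out.println(variant + " -> " + m.group(1));
      }
    }
  }
}
```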
} catch (org.apache.iceberg.exceptions.NoSuchTableException e) {
  // if the original load didn't work, the identifier may be extended and include a snapshot selector
  TableIdentifier namespaceAsIdent = buildIdentifier(namespaceToIdentifier(ident.namespace()));
So we are also using . as the delimiter for table names, similar to how system tables are parsed today. However, system table parsing logic is in core; should we also move this logic to core so all engines obey the same table name format?
I debated that as well. I'm not sure where it would be most clean to put this, which is why I added it just to Spark for now. I think we will also want to extend this to work for branch or tag names, at which point we may want to reconsider moving everything into core.
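For illustration, a sketch of the dot-delimited fallback described here: if loading the full identifier fails, the last name part is tested as a selector. The class and variable names are hypothetical, not the PR's actual helpers.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SelectorSplitDemo {
  private static final Pattern AT_TIME =
      Pattern.compile("at(?:_(?:time(?:stamp)?)?)?_?(\\d+)");

  public static void main(String[] args) {
    // Spark hands the catalog "db.table.at_timestamp_12938471" as name parts
    String[] parts = {"db", "table", "at_timestamp_12938471"};

    Matcher m = AT_TIME.matcher(parts[parts.length - 1]);
    if (m.matches()) {
      // the leading parts form the real table identifier
      String table = String.join(".", Arrays.copyOf(parts, parts.length - 1));
      System.out.println("table: " + table);           // db.table
      System.out.println("timestamp: " + m.group(1));  // 12938471
    }
  }
}
```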
I forgot one other thing. This is a more reasonable change if we do it in Spark because SparkTable is a pretty thin wrapper around Iceberg's Table. If we do this in core, then we would have to update BaseTable to select a version -- which is pretty strange with the API -- or to introduce a wrapper table type that only allows reading. So since we already have the wrapper table, it seemed easiest just to update that for now and not over-complicate the core Table API.
I think we may want to support the timestamp as part of the TableIdentifier, as that would align better with what we're planning to do anyway for Snapshot Branching/Tagging, where the TableIdentifier needs to understand what reference we're currently at. Similarly, it would also understand the timestamp.
I'll update https://docs.google.com/document/d/1KSgkVYnIMlWEbAT1qSnnnLS-gc0kdgHlWR6Ud08HOhA/edit#heading=h.zf5ulr1b1ytv to also respect timestamps.
FWIW, Hive and Impala added this feature for Iceberg tables using the standard format. I understand the special way Spark handles features. What other engines have problems with implementing the standard and need to use the dot-based naming format?
For Hive it was much harder to add the metadata table support, where the metadata table name is delimited by a dot after the real table name. I was always envious of the Spark SQL extension feature. Could we add our own extension to implement time travel with the format defined by the standard? I fear that if we start using … If we must use identifier names to express these features, I would prefer to use something specific for the feature, like …
I agree with @pvary that it would be better if Spark supported the same SQL syntax for time travel queries as Hive and Impala, especially if the syntax in question is the SQL standard. I understand the difficulty of getting the support into Spark itself. Is it not possible to implement the support using SQL extensions in Iceberg? Orthogonal to that, I think that metadata tables are conceptually different from snapshots, so while I think something like …
@huaxingao was kind enough to look into supporting the AS OF syntax in Spark.
@aokolnychyi @huaxingao do you mean in the Spark project or in the Iceberg project in spark3-extensions?
As for the suggestion to use …
Yeah, the problem with using …
I am communicating with the Spark folks to see if I can add AS OF syntax in the Spark project. Still waiting for their reply.
Incorporate the approach shown in apache#3269 by Ryan Blue. That defines a syntax for selecting a snapshot or timestamp through the table name. Use that instead of a SnapshotAwareIdentifier to load the SparkTable.
@@ -21,6 +21,7 @@
import java.util.Map;
import java.util.Set;
import org.apache.arrow.util.Preconditions;
Is there a reason to use this instead of com.google.common.base.Preconditions?
No. And this shouldn't be allowed. We have a rule that checks for the wrong one in imports. Looks like we now need to add the Arrow package.
@huaxingao, any update on getting AS OF syntax in Spark?
@rdblue …
@rdblue …
@huaxingao please confirm the SQL syntax that is agreed to in Spark (I'm looking at some Databricks blog for Delta Lake time travel): …
@rdblue given that support for AS OF in Spark will only be in a future Spark release, what do you think we can do to support identifying the snapshot (or timestamp) for loading a table in Iceberg now? We need this to support reading the snapshot using the correct schema (as is now done for Spark 2.4).
@wypoon I plan to do exactly the same as above.
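For reference, the Delta-style clauses being discussed would look roughly like this from Spark (a sketch only; the exact grammar had not been agreed in Spark at this point, and the table name and values are made up):

```java
import org.apache.spark.sql.SparkSession;

public class AsOfSyntaxDemo {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().getOrCreate();

    // time travel to a wall-clock time (timestamp value is illustrative)
    spark.sql("SELECT * FROM db.table TIMESTAMP AS OF '2021-10-01 00:00:00'").show();

    // time travel to a specific version/snapshot (id is illustrative)
    spark.sql("SELECT * FROM db.table VERSION AS OF 10963874102873").show();
  }
}
```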
I think adding a method to the catalog that can load different versions is the way to go. Until the change to this interface makes it into a released Spark 3 version, we could add this extension to Iceberg and use this internal extension of the catalog interface.
@wypoon @rdblue @huaxingao was there any consideration of the upcoming branching feature? Based on the current syntax, it looks like we are assuming a linear history for time travel. Users can only: …
There is no way to select a different branch for time travel, unless I do something like: …
Any thoughts?
@jackye1995 I have not been following the proposal/project to add snapshot tagging and branching, so I had not considered this question. I only just read the proposal document and your doc PR #3425. I did not see any discussion in them about the semantics of time travel by timestamp. What do you think the semantics should be? One possible option is: …
This assumes the history is really a tree. (Am I mistaken about it being a tree?) In addition, we need a syntax for specifying a tag or a branch; I don't know if one is proposed -- I saw above an allusion to using … Do you think we need the …
I think that if we use the …
@wypoon Seems I can't change …
@huaxingao I'm guessing that if there is an AS OF clause in the SQL query, you will call … The problem is that we need to support using the DataFrame API with …
@rdblue what are your thoughts on this?
@wypoon I understand the problem, but my proposal to add …
@huaxingao, I agree with the direction to not add a version to Identifier. Adding a method to the catalog that can load different versions is a good idea. If we have two different … I agree that VERSION should be passed as a string (hopefully an arbitrary one!) and TIMESTAMP as a long in micros from epoch, just like the other timestamps.
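A minimal sketch of what two such catalog methods could look like, assuming a hypothetical VersionedTableCatalog name (this is not the final Spark API; it only illustrates one load method per selector, with VERSION as a string and TIMESTAMP as a long in micros):

```java
import org.apache.spark.sql.catalyst.analysis.NoSuchTableException;
import org.apache.spark.sql.connector.catalog.Identifier;
import org.apache.spark.sql.connector.catalog.Table;
import org.apache.spark.sql.connector.catalog.TableCatalog;

// hypothetical interface name; sketches the two load methods being discussed
public interface VersionedTableCatalog extends TableCatalog {

  // load a table as of a version string (a snapshot id, or later a tag name)
  Table loadTable(Identifier ident, String version) throws NoSuchTableException;

  // load a table as of a timestamp, in microseconds from epoch
  Table loadTable(Identifier ident, long timestampMicros) throws NoSuchTableException;
}
```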
For the branching and time travel discussion, I think it is valuable to be able to use … Time travel by timestamp would then select from the main branch, as was suggested. To support both time travel by timestamp and branches at the same time, I think it is natural to use another level of multi-part identifier, like … The reason why I think we should consider supporting a branch name in …
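To make the layering concrete, hypothetical identifier spellings might look like the following; the branch_ prefix is an assumption, since the exact form was not settled in this thread:

```java
public class BranchIdentifierDemo {
  public static void main(String[] args) {
    // main-branch time travel, as in the rest of this PR
    String mainAsOf = "db.table.at_timestamp_1633072800000";

    // selecting a branch head (the "branch_" prefix is hypothetical)
    String branchHead = "db.table.branch_audit";

    // another level of multi-part identifier: branch plus time travel
    String branchAsOf = "db.table.branch_audit.at_timestamp_1633072800000";

    System.out.println(mainAsOf + "\n" + branchHead + "\n" + branchAsOf);
  }
}
```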
@wypoon and others: for our next release, should we add something like this PR so that we can fix schemas with time travel in Spark 3.0? We don't have to document that we support time travel through table identifiers, so that we can remove it later and replace it with the …
@huaxingao, one more thing to consider on your PR is how to update …
I was thinking about this a bit more and I think that the read options are still passed through. All we should need to do is to standardize the time travel property names and pass the values through to the new catalog methods.
I think that should work. What do you think, @huaxingao?
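For context, the existing Iceberg read options being referred to look like this from the DataFrame API (option names as documented by Iceberg; the table name and values are illustrative):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadOptionsDemo {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().getOrCreate();

    // time travel by timestamp: "as-of-timestamp" is milliseconds since epoch
    Dataset<Row> byTime = spark.read()
        .format("iceberg")
        .option("as-of-timestamp", "499162860000")
        .load("db.table");

    // time travel by snapshot: "snapshot-id" selects a specific snapshot
    Dataset<Row> bySnapshot = spark.read()
        .format("iceberg")
        .option("snapshot-id", "10963874102873")
        .load("db.table");

    byTime.show();
    bySnapshot.show();
  }
}
```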
@rdblue …
@huaxingao, that looks great to me! You might want to just pass options into …
Sure! Sounds good to me.
It is still strange to me why we deviate from the standard SQL for time travel when we are developing a new feature, but I have heard about the Spark community, and I can accept that we have constraints there. If we do not have the same constraints for handling branches, I would go for having …
Thanks,
@rdblue @huaxingao I agree with your core suggestion: …
May I suggest "as-of-version" and "as-of-timestamp" for the options? We already use "as-of-timestamp" in Iceberg. (Side note: in the Iceberg documentation, the "as-of-timestamp" option is shown being used with a String rather than a long value, but we actually use the value as a long.) I support being able to use a tag instead of the snapshot-id for the value of VERSION AS OF, but I'm less convinced of the value of being able to use a branch. Perhaps it is enough to support using the branch as part of the table identifier (whether delimited by …). An orthogonal question for @huaxingao: is it possible to have FOR SYSTEM_TIME AS OF as a synonym for TIMESTAMP AS OF, and FOR SYSTEM_VERSION AS OF as a synonym for VERSION AS OF, in Spark SQL? While they are not what is used in Delta Lake, I'm just suggesting them as synonyms. This would allow compatibility with Hive and Impala.
I'm ok with that idea. (You mean Spark 3.0, 3.1 and 3.2, right?) I would like to get the schema for snapshot done for Spark 3!
I missed that. Can Trino support a new syntax like IN BRANCH? But we might have difficulty supporting IN BRANCH in Spark. Can we do it in an extension in Iceberg?
@wypoon Spark has very strict control over adding new SQL syntax; normally only ANSI-standard SQL syntax can be added. I will give it a try and propose adding …
I like the idea of supporting SYSTEM_TIME and SYSTEM_VERSION for compatibility. That sounds reasonable to me. Do the keywords have underscores in them? If so, is that needed for some reason? The IN BRANCH syntax is fine with me, but it would mean another addition to Spark SQL syntax and I'm not confident that it would be accepted. I also think that it's more natural to use …
Probably. I think it's more likely that Trino supports that than multi-part identifiers, but I'm not sure if they would be open to it.
@huaxingao, can we support both …
@huaxingao thanks. For the synonym proposal, you can cite HIVE-25344. @pvary can you comment on the question about the underscores in SYSTEM_TIME and SYSTEM_VERSION?
@rdblue Yes, I will try to add …
AFAIK the SQL:2011 standard defines them this way: https://sigmodrecord.org/?smd_process_download=1&download_id=3338
@rdblue do you intend to go forward with some form of this PR as a way to enable implementing schema-for-snapshot for Spark 3 (until what Huaxin proposes is implemented in a released Spark version)? Do we have to wait for a decision on @nastra's proposal in https://docs.google.com/document/d/1KSgkVYnIMlWEbAT1qSnnnLS-gc0kdgHlWR6Ud08HOhA/edit# to change …
This has been implemented for Spark 2. For Spark 3, Ryan Blue proposed a syntax for adding the snapshot id or timestamp to the table identifier in apache#3269. Here we implement the Spark 3 support for using the snapshot schema by using the proposed table identifier syntax. This is until a new Spark 3 is released with support for AS OF in Spark SQL. Note: The table identifier syntax is for internal use only (as in this implementation) and not meant to be exposed as a publicly supported syntax in SQL. However, for testing, we do test its use from SQL.
This was replaced by #3722 so I'll close this. Thanks for the great discussion and for all the work on this problem, everyone!
This adds support to Spark for time travel queries using extended table names, like db.table.snapshot_19082734 or db.table.at_109872341. This is in support of #1508, which adds a more narrow feature to pass snapshot selectors through identifiers.
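For example, the extended names from this description could be used wherever a table name is accepted (a sketch; the snapshot id and timestamp are the placeholder values from the description above):

```java
import org.apache.spark.sql.SparkSession;

public class ExtendedNameDemo {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().getOrCreate();

    // select a specific snapshot by id through the table name
    spark.sql("SELECT * FROM db.table.snapshot_19082734").show();

    // select the table state as of a timestamp through the table name
    spark.sql("SELECT * FROM db.table.at_109872341").show();
  }
}
```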