
Spark: Support time travel through table names #3269

Closed
wants to merge 1 commit

Conversation

rdblue (Contributor) commented Oct 10, 2021

This adds support to Spark for time travel queries using extended table names, like db.table.snapshot_19082734 or db.table.at_109872341. It is in support of #1508, which adds a narrower feature to pass snapshot selectors through identifiers.
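
For example (illustrative queries only; the selector forms are discussed below):

SELECT * FROM db.table.snapshot_19082734;  -- time travel to snapshot ID 19082734
SELECT * FROM db.table.at_109872341;       -- time travel to the state as of timestamp 109872341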

rdblue (Contributor, Author) commented Oct 10, 2021

@wypoon, this is a prototype of how we can cleanly pass snapshot and timestamp selectors through identifiers. Could you take a look?

pvary (Contributor) commented Oct 11, 2021

In Hive we enabled this in HIVE-25344:

SELECT * FROM t FOR SYSTEM_TIME AS OF <timestamp>;
SELECT * FROM t FOR SYSTEM_VERSION AS OF <version>;

Having different ways to do time travel in different engines will confuse users. Would it be better to keep the standard SQL syntax, as we did in Hive?

rdblue (Contributor, Author) commented Oct 11, 2021

@pvary, that would be great. But Spark is unlikely to add that syntax any time soon. I proposed it years ago and had people push back against it. I think it is more important to support time travel in SQL soon than to keep SQL syntax uniform. I know that would be ideal, though.

jackye1995 (Contributor) commented

I agree with @rdblue that we should allow time travel queries using the table name. It's not likely that AS OF will be supported by all query engines and platforms in the near future; we can only add that syntax once an engine supports it. But there is always a use case for time travel through the table name, because some applications give the user control over only parts of the SQL, such as the table name, without control over the entire statement. For those applications there is no way to add an AS OF clause even if we support one, and the user has to embed the information in the table name. This is also one of the reasons there is not much motivation in the various communities to support AS OF.

@@ -80,6 +87,9 @@
*/
public class SparkCatalog extends BaseCatalog {
private static final Set<String> DEFAULT_NS_KEYS = ImmutableSet.of(TableCatalog.PROP_OWNER);
private static final Splitter COMMA = Splitter.on(",");
private static final Pattern AT_TIME = Pattern.compile("at(?:_(?:time(?:stamp)?)?)?_?(\\d+)");

What's the reason for supporting so many different ways to specify the time and snapshot ID? My opinion is that we should support just one shortcut and one full name for each: for example, at_ and at_timestamp_ for timestamp travel, s_ and snapshot_id_ for snapshot ID travel. Maybe even the full names are not necessary.

rdblue (Contributor, Author) commented Oct 11, 2021

I may have gone a bit too far here. I think that we want to be able to use the full "at_timestamp_12938471" version because that is the least likely to conflict with existing table names. Similarly, I think it is valuable to have a short version, like at_<timestamp> and snap_<id> so you don't have to type at_timestamp_ or the full snapshot_id_ every time. Since we were already testing multiple prefixes, I added a few that I thought would be valuable to avoid confusion:

  • Make the final _ optional because that's an easy mistake to make
  • Allow time and not just timestamp because people may not remember
  • Allow omitting _id and allow just snap_ or s_ as shortened forms

The logic made sense at every step, but there are quite a few variations. I'm up for defining just the full version and one shortened form if you think it's best to have only a couple.
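
For reference, a minimal standalone sketch of what the AT_TIME pattern above accepts (the class name is illustrative, not part of the PR):

import java.util.regex.Pattern;

public class AtTimeDemo {
  public static void main(String[] args) {
    // The AT_TIME pattern from the diff; group(1) captures the timestamp value.
    Pattern atTime = Pattern.compile("at(?:_(?:time(?:stamp)?)?)?_?(\\d+)");
    System.out.println(atTime.matcher("at_timestamp_12938471").matches()); // true: full form
    System.out.println(atTime.matcher("at_time_12938471").matches());     // true: "time" instead of "timestamp"
    System.out.println(atTime.matcher("at_12938471").matches());          // true: short form
    System.out.println(atTime.matcher("at12938471").matches());          // true: final "_" omitted
  }
}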


} catch (org.apache.iceberg.exceptions.NoSuchTableException e) {
// if the original load didn't work, the identifier may be extended and include a snapshot selector
TableIdentifier namespaceAsIdent = buildIdentifier(namespaceToIdentifier(ident.namespace()));

So we are also using . as the delimiter for table names, similar to how system tables are parsed today. However, the system table parsing logic is in core; should we also move this logic to core so that all engines obey the same table name format?

rdblue (Contributor, Author) commented

I debated that as well. I'm not sure where it would be cleanest to put this, which is why I added it just to Spark for now. I think we will also want to extend this to work for branch or tag names, at which point we may want to reconsider moving everything into core.

rdblue (Contributor, Author) commented

I forgot one other thing. This change is more reasonable in Spark because SparkTable is a pretty thin wrapper around Iceberg's Table. If we did this in core, we would have to update BaseTable to select a version -- which is pretty strange with the API -- or introduce a wrapper table type that only allows reading. Since we already have the wrapper table in Spark, it seemed easiest to just update it for now and not over-complicate the core Table API.

nastra (Contributor) commented Nov 18, 2021

I think we may want to support the timestamp as part of the TableIdentifier as that would align better with what we're planning to do anyway for Snapshot Branching/Tagging, where the TableIdentifier needs to understand what reference we're currently at. Similarly, it would also understand the timestamp.
I'll update https://docs.google.com/document/d/1KSgkVYnIMlWEbAT1qSnnnLS-gc0kdgHlWR6Ud08HOhA/edit#heading=h.zf5ulr1b1ytv to also respect timestamps

pvary (Contributor) commented Oct 12, 2021

> I agree with @rdblue that we should allow time travel queries using the table name. It's not likely that AS OF will be supported by all query engines and platforms in the near future.

FWIW, Hive and Impala added this feature for Iceberg tables using the standard format. I understand the special way Spark handles features. Which other engines have problems implementing the standard and need to use the dot-based naming format?

> But there is always a use case for time travel through the table name, because some applications give the user control over only parts of the SQL, such as the table name, without control over the entire statement.

For Hive it was much harder to add the metadata table support, where the metadata table name is delimited by a dot after the real table name.

I was always envious of the Spark SQL extension feature. Could we add our own extension to implement time travel with the format defined by the standard?

I fear that if we start using . to separate the metadata table name, the snapshot, the branch, and the tags, it will be hard to follow which one the user wants to refer to. We would have to limit the usable branch and tag names, and it also forces the user to always use the full path to tables and limits the naming hierarchy to the same depth.

If we must use identifier names to express these features, I would prefer something specific for each feature, like @ for time travel and # for tags, so it is easier to decipher the meaning for the user and for the parser as well.

wypoon (Contributor) commented Oct 13, 2021

I agree with @pvary that it would be better if Spark supported the same SQL syntax for time travel queries as Hive and Impala, especially if the syntax in question is the SQL standard. I understand the difficulty of getting the support into Spark itself. Is it not possible to implement the support using SQL extensions in Iceberg?

Orthogonal to that, I think that metadata tables are conceptually different from snapshots, so while something like <catalog>.<database>.<table>.history or <catalog>.<database>.<table>.snapshots has a certain naturalness, <catalog>.<database>.<table>.snapshot_id_<id> is a different kind of beast than the metadata tables, and using the same syntax for snapshots lumps two different kinds of things into the same category. Pardon my ignorance, but does the code for table identifiers limit us to using . for delimiting the name? Is it possible to use @ for time travel, as @pvary suggests? (I expect we'd need something like @snapshot(<id>) and @time(<timestamp>) to distinguish the two different ways of specifying time travel.)

aokolnychyi (Contributor) commented

@huaxingao was kind enough to look into supporting the AS OF syntax in Spark.

wypoon (Contributor) commented Oct 16, 2021

> @huaxingao was kind enough to look into supporting the AS OF syntax in Spark.

@aokolnychyi @huaxingao do you mean in the Spark project or in the Iceberg project in spark3-extensions?

wypoon (Contributor) commented Oct 16, 2021

As for the suggestion to use @ for time travel, it seems that @ is not a valid character in a table identifier in Spark SQL. (Otherwise, we would get <table>@... as the table name in SparkCatalog, and we could split it there, extract the snapshot ID/timestamp, and load the table.)

rdblue (Contributor, Author) commented Oct 17, 2021

Yeah, the problem with using @ is that it must be quoted. And because Spark and Trino use different characters to quote identifiers, there is confusion for users who often don't know about identifier quoting at all.

huaxingao (Contributor) commented

@wypoon

> do you mean in the Spark project or in the Iceberg project in spark3-extensions?

I am communicating with the Spark folks to see if I can add AS OF syntax in Spark project. Still waiting for their reply.

wypoon added a commit to wypoon/iceberg that referenced this pull request Oct 19, 2021
Incorporate the approach shown in apache#3269 by Ryan Blue.
That defines a syntax for selecting a snapshot or timestamp through
the table name. Use that instead of a SnapshotAwareIdentifier to
load the SparkTable.
@@ -21,6 +21,7 @@

import java.util.Map;
import java.util.Set;
import org.apache.arrow.util.Preconditions;

Is there a reason to use this instead of com.google.common.base.Preconditions?

rdblue (Contributor, Author) commented

No. And this shouldn't be allowed. We have a rule that checks for the wrong one in imports. Looks like we now need to add the Arrow package.

rdblue (Contributor, Author) commented Nov 3, 2021

@huaxingao, any update on getting AS OF syntax in Spark?

huaxingao (Contributor) commented

@rdblue
They are OK with adding the AS OF syntax as long as I make it exactly the same as Delta Lake's syntax. I updated the internal Slack channel but forgot to update here; sorry about that. I will start working on this.

huaxingao (Contributor) commented

@rdblue
I plan to pass the as-of info using CatalogV2Util.loadTable(catalog: CatalogPlugin, ident: Identifier).
Does it work if I just append the info after Identifier.name, for example default.table.12345678, or is it better to change the Identifier interface to add something like this:

  static Identifier of(String[] namespace, String name, long versionInfo) {
    return new IdentifierImpl(namespace, name, versionInfo);
  }

wypoon (Contributor) commented Nov 3, 2021

@huaxingao please confirm the SQL syntax that has been agreed to in Spark (I'm looking at a Databricks blog on Delta Lake time travel):

SELECT * FROM t TIMESTAMP AS OF <timestamp>;
SELECT * FROM t VERSION AS OF <version>;

where timestamp is something like "2019-01-01 01:30:00.000" and version for us would be the snapshot id. Is that correct?
@pvary unfortunately that will not match the syntax supported by Hive and Impala, but it will be similar.

@rdblue given that support for AS OF in Spark will only be in a future Spark release, what do you think we can do to support identifying the snapshot (or timestamp) for loading a table in Iceberg now? We need this to support reading the snapshot using the correct schema (as is now done for Spark 2.4).

huaxingao (Contributor) commented

@wypoon
Here is Delta Lake syntax https://docs.databricks.com/delta/quick-start.html#query-an-earlier-version-of-the-table-time-travel

SELECT * FROM default.people10m VERSION AS OF 0;

SELECT * FROM default.people10m TIMESTAMP AS OF '2019-01-29 00:37:58';

I plan to do exactly the same as above.
Yes, timestamp is something like "2019-01-01 01:30:00.000" and version would be the snapshot id.

wypoon (Contributor) commented Nov 3, 2021

I think adding a Long or Optional<Long> to the org.apache.spark.sql.connector.catalog.Identifier interface could work for us:

  static Identifier of(String[] namespace, String name, Long version) {
    return new IdentifierImpl(namespace, name, version);
  }
  ...
  Long version();

Until the change to this interface makes it into a released Spark 3 version, we could add this extension to Iceberg and use this internal extension of Identifier to load the table with the snapshot id, when reading it using .option("snapshot-id", ...) or .option("as-of-timestamp", ...). When a Spark version is released with this change to Identifier, we can switch from our internal extension to the upstream Identifier.

jackye1995 (Contributor) commented

@wypoon @rdblue @huaxingao was there any consideration of the upcoming branching feature?

Based on the current syntax, it looks like we are assuming a linear history for time travel. Users can only:

  1. time travel on the main branch
  2. first switch the current branch to point to a different branch and then time travel in that branch

There is no way to point at a different branch for time travel, unless I do something like:

SELECT * FROM table@my_branch TIMESTAMP AS OF '2019-01-29 00:37:58';

Any thoughts?

wypoon (Contributor) commented Nov 4, 2021

@jackye1995 I have not been following the proposal/project to add snapshot tagging and branching, so I had not considered this question. I only just read the proposal document and your doc PR #3425. I did not see any discussion in them about the semantics of time travel by timestamp. What do you think the semantics should be? One possible option is:

  1. No branch is specified (I'm not sure that is possible, but if it is): we default to the main branch, and time travel behaves as though only the main branch exists.
  2. A branch is specified: we walk up the tree from the head of the branch until we reach a snapshot at or before the timestamp, and select that snapshot. There are two ways this might fail: (a) we pass the beginning of the branch -- we can either say there is no snapshot, or continue up the tree along the branch ours was forked from; (b) we do not find any snapshot at or before the timestamp -- we say there is no snapshot.

This assumes the history is really a tree. (Am I mistaken about it being a tree?)
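
To make option 2 concrete, here is a hypothetical sketch of the walk it describes (the Snapshot type and all names are illustrative, not Iceberg's API):

import java.util.Map;

public class BranchTimeTravel {
  record Snapshot(long id, Long parentId, long timestampMillis) {}

  // Walk parents from the branch head until we reach a snapshot at or before the timestamp.
  static Snapshot asOf(Snapshot head, long asOfMillis, Map<Long, Snapshot> snapshotsById) {
    Snapshot current = head;
    while (current != null) {
      if (current.timestampMillis() <= asOfMillis) {
        return current; // first ancestor at or before the timestamp
      }
      // a null parent means we passed the beginning of the branch (failure mode a)
      current = current.parentId() == null ? null : snapshotsById.get(current.parentId());
    }
    return null; // no snapshot at or before the timestamp (failure mode b)
  }
}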

In addition, we need a syntax for specifying a tag or a branch; I don't know if one is proposed -- I saw above an allusion to using . for specifying branch and tag as well. Or will we support a way to say "SET BRANCH <branch>" in SQL or in the DataFrame API using .option("branch", <branch>)? It sounds like we would, since you consider the option to "first switch the current branch to point to a different branch".

Do you think we need the org.apache.spark.sql.connector.catalog.Identifier to have enough information to load a table using the necessary branch/tag/snapshot-id? I think the answer depends on what syntax we intend to support for specifying branch and tag. Here are some things I can think of:

  1. Support using a tag instead of a snapshot id in the VERSION AS OF clause: SELECT * from t VERSION AS OF <tag>. Is that feasible technically, or do we need a different variant of AS OF?
  2. Support specifying a tag or a branch as part of the table identifier: SELECT * from t.<tag> or SELECT * FROM t.<branch> or SELECT * from t.<branch> TIMESTAMP AS OF <timestamp>. (Here I'm using . just for illustration.)

I think that if we use the . syntax, then we do not need to pass a branch or tag field in the Identifier. We would resolve the branch/tag on the Iceberg side along the lines shown in this PR for snapshot-id. But we do need to pass the snapshot-id or timestamp that comes from the AS OF clause in the Identifier.
@huaxingao without considering the branch/tag question, do you think we need the ability to pass either snapshot-id or timestamp in the Identifier? How do you propose to distinguish the two?

huaxingao (Contributor) commented

@wypoon
How about we use type String for version and have different prefixes to differentiate between snapshot-id and timestamp?

huaxingao (Contributor) commented

It seems I can't change Identifier because it is used by other code paths as well.
I will overload the loadTable method to pass in the version:

CatalogV2Util
def loadTable(catalog: CatalogPlugin, ident: Identifier, version: String): Option[Table]

TableCatalog
default Table loadTable(Identifier ident, String version) {
    throw new UnsupportedOperationException("Load table with version is not supported.");
}

wypoon (Contributor) commented Nov 4, 2021

@huaxingao I'm guessing that if there is an AS OF clause in the SQL query, you will call CatalogV2Util.loadTable(CatalogPlugin, Identifier, String) but if there isn't an AS OF clause, CatalogV2Util.loadTable(CatalogPlugin, Identifier) will be called. Is that right?

The problem is that we need to support using the DataFrame API with .option("snapshot-id", ...) or .option("as-of-timestamp", ...) as well. In those cases, CatalogV2Util.loadTable(CatalogPlugin, Identifier) will be called. In this code path, the Identifier is provided by IcebergSource (which implements SupportsCatalogOptions) via extractIdentifier. Ideally, Identifier has a version field, so we can set it in the Identifier we return in IcebergSource#extractIdentifier. But if Identifier cannot be changed to support version, then this is less convenient for us. What I did previously in earlier iterations of #1508 is to create a SnapshotAwareIdentifier in Iceberg that extends Identifier, and return that in IcebergSource#extractIdentifier. (Then in SparkCatalog#loadTable(Identifier) use the SnapshotAwareIdentifier to identify the snapshot.)

@rdblue what are your thoughts on this?

huaxingao (Contributor) commented

@wypoon I understand the problem, but my proposal to add version in Identifier was not accepted. There are some concerns that the change may affect other code paths.

rdblue (Contributor, Author) commented Nov 4, 2021

@huaxingao, I agree with the direction to not add a version to Identifier. Adding a method to the catalog that can load different versions is a good idea. If we have two different AS OF clauses, I think that we should have two separate ways of passing the version so that we know whether VERSION AS OF or TIMESTAMP AS OF was used.

I agree that VERSION should be passed as a string (hopefully an arbitrary one!) and TIMESTAMP as a long in micros from epoch, just like the other timestamps.

rdblue (Contributor, Author) commented Nov 4, 2021

For the branching and time travel discussion, I think it is valuable to be able to use VERSION AS OF to select a branch, like SELECT * FROM db.table VERSION AS OF branch in addition to SELECT * FROM db.table VERSION AS OF 39874172309487 with a snapshot ID.

Time travel by timestamp would then select from the main branch, as was suggested.

To support both time travel by timestamp and branches at the same time, I think it is natural to use another level of multi-part identifier, like db.table.branch. I think that putting the branch name in the identifier is a more natural way to manage branches in the first place, so it works well with AS OF like this: db.table.branch TIMESTAMP AS OF 2021-11-04T10:00:00.

The reason why I think we should consider supporting a branch name in VERSION AS OF is so that we have something that may be portable across systems that don't support more than db and table in table identifiers, like Trino.

rdblue (Contributor, Author) commented Nov 4, 2021

@wypoon and others. For our next release, should we add something like this PR so that we can fix schemas with time travel in Spark 3.0? We don't have to document that we support time travel through table identifiers so that we can remove it later and replace it with the AS OF syntax. What do you think?

@huaxingao, one more thing to consider on your PR is how to update SupportsCatalogOptions to be able to pass the time travel information. Otherwise, I don't think that @wypoon would be able to make the DataFrameReader API work with time travel and the snapshot's schema.

rdblue (Contributor, Author) commented Nov 5, 2021

> one more thing to consider on your PR is how to update SupportsCatalogOptions to be able to pass the time travel information.

I was thinking about this a bit more and I think that the read options are still passed through. All we should need to do is to standardize the time travel property names and pass the values to the new loadTable method. Something like this:

  • User calls spark.read.format("iceberg").option("version", 123245).load("db.table")
  • Spark uses SupportsCatalogOptions to get (catalog, Identifier.of("db", "table"))
  • Spark extracts version from the read options and calls catalog.loadTable(ident, version)

I think that should work. What do you think, @huaxingao?
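
In user code, that flow would look roughly like this (Java Spark API; "version" is the placeholder option name used in this discussion, not a finalized name):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TimeTravelRead {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("time-travel-demo").getOrCreate();
    // Spark resolves (catalog, identifier) via SupportsCatalogOptions, then extracts
    // the version option and calls catalog.loadTable(ident, version) as described above.
    Dataset<Row> df = spark.read()
        .format("iceberg")
        .option("version", "123245")
        .load("db.table");
    df.show();
  }
}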

huaxingao (Contributor) commented

@rdblue
Yes, I think that should work.
So there is no need to modify SupportsCatalogOptions. All I need to do is add the new overloaded loadTable methods to pass in the time travel info, as follows, right?

CatalogV2Util
def loadTable(catalog: CatalogPlugin, ident: Identifier, version: String): Option[Table]
def loadTable(catalog: CatalogPlugin, ident: Identifier, timestamp: Long): Option[Table]

TableCatalog
default Table loadTable(Identifier ident, String version) {
    throw new UnsupportedOperationException("Load table with version is not supported.");
}

default Table loadTable(Identifier ident, long timestamp) {
    throw new UnsupportedOperationException("Load table with timestamp is not supported.");
}

rdblue (Contributor, Author) commented Nov 5, 2021

@huaxingao, that looks great to me!

You might want to just pass options into CatalogV2Util and have it load the table from there. That way, it will always go through the same path to decide which loadTable to call.

huaxingao (Contributor) commented

@rdblue

> You might want to just pass options into CatalogV2Util and have it load the table from there...

Sure! Sounds good to me.

pvary (Contributor) commented Nov 5, 2021

For branches: SELECT * from t.<branch> TIMESTAMP AS OF <timestamp> is also what I am considering, although it means we cannot use a metadata table name as a branch name; we can discuss that. The bigger question I have is whether we want more "formal" support, like SELECT * from table TIMESTAMP AS OF <timestamp> IN BRANCH <branch>, just to make something up.

It is still strange to me to deviate from standard SQL for time travel when we are developing a new feature, but I have heard about the Spark community, and I can accept that we have constraints there.

If we do not have the same constraints for handling branches, I would go for having IN BRANCH in the SQL query; using t@branch would be acceptable as well. I would suggest avoiding overloading . again. Using . to separate branches/tags/metadata tables would cause complications and ambiguities later, which we can avoid with very specific SQL syntax or specific separators.

Thanks,
Peter

wypoon (Contributor) commented Nov 5, 2021

@rdblue @huaxingao I agree with your core suggestion:

  • Overload TableCatalog#loadTable with variants that take a version (String) or a timestamp (long, millis since the Unix epoch) parameter. Then standardize the names of the options that correspond to VERSION AS OF and TIMESTAMP AS OF, and if one of them is set, call the appropriate TableCatalog#loadTable.

May I suggest "as-of-version" and "as-of-timestamp" for the options? We already use "as-of-timestamp" in Iceberg. (Side note: In Iceberg documentation, the "as-of-timestamp" option is shown being used with a String rather than a long value, but we actually use the value as a long.)
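
For example, assuming a SparkSession named spark (the option value is passed as a string through the API but interpreted as a long, milliseconds since the epoch):

// Iceberg's existing read option; the string value is parsed as a long (millis).
Dataset<Row> df = spark.read()
    .format("iceberg")
    .option("as-of-timestamp", "1636500000000")
    .load("db.table");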

I support being able to use a tag instead of the snapshot ID as the value of VERSION AS OF. But I'm less convinced of the value of being able to use a branch. Perhaps it is enough to support using the branch as part of the table identifier (whether delimited by . or another symbol)? It would be nice to support specifying the branch in SQL using something like IN BRANCH, as @jackye1995 suggested (and @pvary finds acceptable); correct me if I'm wrong, but is the problem that it would be opposed if it is not in Delta Lake?

An orthogonal question for @huaxingao: is it possible to have FOR SYSTEM_TIME AS OF as a synonym for TIMESTAMP AS OF and FOR SYSTEM_VERSION AS OF as a synonym for VERSION AS OF in Spark SQL? While they are not what is used in Delta Lake, I'm just suggesting them as synonyms. This would allow compatibility with Hive and Impala.

wypoon (Contributor) commented Nov 5, 2021

> @wypoon and others. For our next release, should we add something like this PR so that we can fix schemas with time travel in Spark 3.0? We don't have to document that we support time travel through table identifiers so that we can remove it later and replace it with the AS OF syntax. What do you think?

I'm ok with that idea. (You mean Spark 3.0, 3.1 and 3.2, right?) I would like to get the schema for snapshot done for Spark 3!
Since the special identifiers will be just for the use of the implementation (not for public use), we just need one syntax for each of "snapshot-id" and "as-of-timestamp".

wypoon (Contributor) commented Nov 5, 2021

> The reason why I think we should consider supporting a branch name in VERSION AS OF is so that we have something that may be portable across systems that don't support more than db and table in table identifiers, like Trino.

I missed that. Can Trino support a new syntax like IN BRANCH? We might have difficulty supporting IN BRANCH in Spark, though. Can we do it in an extension in Iceberg?

huaxingao (Contributor) commented

@wypoon
Are timestampAsOf and versionAsOf OK with you? I saw the following in the Delta Lake doc:

df1 = spark.read.format('delta').option('timestampAsOf', '2019-01-01').load('/mnt/delta/people-10m')
df2 = spark.read.format('delta').option('versionAsOf', 2).load('/mnt/delta/people-10m')

Spark has very strict control over adding new SQL syntax; normally only ANSI-standard SQL syntax can be added. I will give it a try and propose adding FOR SYSTEM_TIME AS OF and FOR SYSTEM_VERSION AS OF as synonyms.

rdblue (Contributor, Author) commented Nov 6, 2021

I like the idea to support SYSTEM_TIME and SYSTEM_VERSION for compatibility. That sounds reasonable to me. Do the keywords have underscores in them? If so, is that needed for some reason?

The IN BRANCH syntax is fine with me, but it would mean another addition to Spark SQL syntax, and I'm not confident that it would be accepted. I also think that it's more natural to use db.table.branch, but I don't think that will be supported everywhere. The easiest way to get this in is probably to implement IN BRANCH elsewhere and cite that as precedent.

rdblue (Contributor, Author) commented Nov 6, 2021

> Can Trino support a new syntax like IN BRANCH? We might have difficulty supporting IN BRANCH in Spark, though. Can we do it in an extension in Iceberg?

Probably. I think it's more likely that Trino supports that than multi-part identifiers, but I'm not sure if they would be open to it.

rdblue (Contributor, Author) commented Nov 6, 2021

@huaxingao, can we support both timestampAsOf and as-of-timestamp / versionAsOf and as-of-version since there are multiple precedents?

wypoon (Contributor) commented Nov 6, 2021

@huaxingao thanks. For the synonym proposal, you can cite HIVE-25344.
For the option names, I'm fine with "timestampAsOf" and "versionAsOf" if that is what the Spark community agrees to. I expected that we have to break with the existing "snapshot-id" anyway. But if the community will accept "as-of-timestamp" and (for consistency) "as-of-version" as alternatives, that would be great.

@pvary can you comment on the question about the underscores in SYSTEM_TIME and SYSTEM_VERSION?

huaxingao (Contributor) commented

@rdblue Yes, I will try to add SYSTEM_TIME and SYSTEM_VERSION, and also as-of-timestamp / as-of-version. Hopefully, I won't get any objections.

pvary (Contributor) commented Nov 8, 2021

> I like the idea to support SYSTEM_TIME and SYSTEM_VERSION for compatibility. That sounds reasonable to me. Do the keywords have underscores in them? If so, is that needed for some reason?

AFAIK the SQL:2011 standard defines them this way: https://sigmodrecord.org/?smd_process_download=1&download_id=3338

huaxingao (Contributor) commented

@rdblue @wypoon

> I agree that VERSION should be passed as a string (hopefully an arbitrary one!)

So this VERSION is NOT digits only and should be defined as version=STRING instead of version=INTEGER_VALUE, right?

wypoon (Contributor) commented Nov 30, 2021

@rdblue do you intend to go forward with some form of this PR as a way to enable implementing schema for snapshot for Spark 3 (until what Huaxin proposes is implemented in a released Spark version)? Do we have to wait for a decision on @nastra's proposal in https://docs.google.com/document/d/1KSgkVYnIMlWEbAT1qSnnnLS-gc0kdgHlWR6Ud08HOhA/edit# to change TableIdentifier? From your comments on that document, I have the impression that you do not think TableIdentifier should be changed.
If you do think a form of this PR is the way forward but are currently very busy, I'm happy to incorporate some form of this PR in a single PR to support schema for snapshot in Spark 3 (basically an update of #3314).

wypoon added a commit to wypoon/iceberg that referenced this pull request Dec 12, 2021
This has been implemented for Spark 2. For Spark 3, Ryan Blue proposed
a syntax for adding the snapshot id or timestamp to the table identifier
in apache#3269. Here we implement the Spark 3 support for using the snapshot
schema by using the proposed table identifier syntax. This is until a
new Spark 3 is released with support for AS OF in Spark SQL.
Note: The table identifier syntax is for internal use only (as in this
implementation) and not meant to be exposed as a publicly supported
syntax in SQL. However, for testing, we do test its use from SQL.
rdblue (Contributor, Author) commented Dec 21, 2021

This was replaced by #3722 so I'll close this.

Thanks for the great discussion and for all the work on this problem, everyone!
