
Spark: Inject DataSourceV2Relation when missing #7910

Closed
wants to merge 9 commits

Conversation

@Fokko (Contributor) commented Jun 26, 2023

When you start a structured streaming query using `.start()`, there is no `DataSourceV2Relation` reference. As a result, the catalog functions aren't available, and the query fails with:

Exception in thread "main" org.apache.spark.sql.streaming.StreamingQueryException: days(ts) is not currently supported

When the relation is missing, we simply create one ourselves, since we know the table.

Resolves #7226
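
For context, a minimal reproduction sketch (not from the PR itself): it assumes an Iceberg table `demo.db.logs` created with `PARTITIONED BY (days(ts))`; all table names, catalogs, and paths below are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("repro")
  .getOrCreate()

// The built-in rate source continuously emits (timestamp, value) rows.
val stream = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()
  .selectExpr("timestamp AS ts", "value AS data")

// Assumption: demo.db.logs is an Iceberg table partitioned by days(ts).
// Before this change, starting the query left the DataSourceV2Relation unset,
// so resolving the days(ts) transform failed with the exception above.
val query = stream.writeStream
  .format("iceberg")
  .outputMode("append")
  .option("checkpointLocation", "/tmp/checkpoints/repro")
  .option("path", "demo.db.logs")
  .start()
```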

@Fokko requested a review from @aokolnychyi on June 26, 2023 06:43
@Fokko force-pushed the fd-structured-streaming branch 3 times, most recently from de721ad to caa73de on June 26, 2023 07:46
@Marcus-Rosti

Is this fix only for Spark 3.4+?

@Fokko (Contributor, Author) commented Aug 8, 2023

@Marcus-Rosti I first wanted to get some feedback on this before backporting it to older versions of Spark.

@aokolnychyi (Contributor) commented:

I should have some time to review this week. Sorry for the delay!

@aokolnychyi (Contributor) commented:

It seems like we are trying to fix a bug in Spark, which is beyond Iceberg's control. While I don't mind that, since we would otherwise have to wait for a new Spark release, it would be nice to look for a proper fix in Spark that would work without Iceberg extensions.

Let me take a closer look at the Spark side.

```scala
// When the micro-batch write is missing its relation, recover the table
// identifier by parsing table.name() with the session's SQL parser.
case p: WriteToMicroBatchDataSource if p.relation.isEmpty =>
  import spark.sessionState.analyzer.CatalogAndIdentifier
  val originalMultipartIdentifier = spark.sessionState.sqlParser
    .parseMultipartIdentifier(p.table.name())
```
@aokolnychyi (Contributor) commented Aug 22, 2023
I have doubts that this is generally safe, as we assume `table.name()` will include the catalog name. That's currently the case, but it feels fragile. I mentioned a potential fix on the Spark side here. Let me come back with fresh eyes tomorrow.

@Fokko, let me know what you think about fixing it in Spark.
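
To illustrate the concern, a sketch with hypothetical identifiers: `parseMultipartIdentifier` only splits the string, so whether a usable catalog name comes out depends entirely on what `table.name()` returns.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val parser = spark.sessionState.sqlParser

// If table.name() happens to include the catalog, resolution works as intended:
parser.parseMultipartIdentifier("my_catalog.db.tbl") // Seq("my_catalog", "db", "tbl")

// If it ever returns only namespace and table, the CatalogAndIdentifier extractor
// would fall back to the session's current catalog, which may not be the intended one:
parser.parseMultipartIdentifier("db.tbl") // Seq("db", "tbl")
```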

@Fokko (Contributor, Author) commented:
I see your concern, but as the test shows, it works quite well. The smoke test passes for `table`, `catalog.schema.table`, and `s3://bucket/wh/path`. Fixing it in Spark could also work, but then I'd need more pointers on where to start; I looked into the comment on the issue, but it wasn't directly obvious to me. Given how many people are bumping into this, I think we should fix it.

@aokolnychyi (Contributor) commented:
I'd say let's just use the new API in Spark and not worry about it. I think you already updated the docs to cover that.

@Fokko closed this on Oct 12, 2023