Spark: Inject DataSourceV2Relation when missing #7910

Conversation
When you start a structured streaming query using `.start()`, there will be no `DataSourceV2Relation` reference. Without it, the catalog functions aren't available and the query fails. When the relation is missing, we'll just create one, since we know the table.

Resolves #7226
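For context, a minimal sketch of the kind of streaming write that hits this path, given a streaming DataFrame `df` (the table identifier and checkpoint location below are hypothetical):

```scala
import org.apache.spark.sql.streaming.Trigger

// Writing via .start() is what leaves the relation unset on the plan;
// the equivalent .toTable("catalog.schema.table") call resolves it up front.
val query = df.writeStream
  .format("iceberg")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .option("checkpointLocation", "/tmp/checkpoints") // hypothetical location
  .option("path", "catalog.schema.table")           // hypothetical identifier
  .start()

query.awaitTermination()
```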
caa73de
to
566fbcb
Compare
130a85f
to
fb9bd54
Compare
Is this fix only for 3.4+?

@Marcus-Rosti I first wanted to get some feedback on this before also backporting it to older versions of Spark.

I should have some time to review this week. Sorry for the delay!

It seems like we are trying to fix a bug in Spark, which is beyond Iceberg's control. While I don't mind that, because we would have to wait for a new Spark release otherwise, it would be nice to look for a proper fix in Spark that would work without Iceberg extensions. Let me take a closer look at the Spark side.
```scala
case p: WriteToMicroBatchDataSource if p.relation.isEmpty =>
  import spark.sessionState.analyzer.CatalogAndIdentifier
  // Recover the catalog and identifier by re-parsing the table's name.
  val originalMultipartIdentifier = spark.sessionState.sqlParser
    .parseMultipartIdentifier(p.table.name())
```
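A rough sketch of how the rule could continue from here: resolve the parsed name to a catalog and identifier, load the table, and inject the missing relation. The calls below are illustrative assumptions, not the PR's verbatim code:

```scala
import org.apache.spark.sql.connector.catalog.TableCatalog
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation

// Sketch: split the multipart name into (catalog, identifier), load the
// table from that catalog, and attach a freshly created relation.
originalMultipartIdentifier match {
  case CatalogAndIdentifier(catalog, identifier) =>
    val tableCatalog = catalog.asInstanceOf[TableCatalog]
    val table = tableCatalog.loadTable(identifier)
    val relation = DataSourceV2Relation.create(
      table, Some(tableCatalog), Some(identifier))
    p.copy(relation = Some(relation))
}
```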
I have doubts that this is generally safe, as we assume `table.name()` would include the catalog name. That's currently the case, but it feels fragile. I mentioned a potential fix on the Spark side here. Let me come back with fresh eyes tomorrow.

@Fokko, let me know what you think about fixing it in Spark.
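To illustrate the concern with hypothetical names: the rule only recovers the right catalog when `table.name()` happens to be fully qualified.

```scala
// Hypothetical illustration: parseMultipartIdentifier just splits the string,
// so the catalog is only recovered when it is present in the name.
spark.sessionState.sqlParser.parseMultipartIdentifier("cat.db.tbl")
// => Seq("cat", "db", "tbl")  -- catalog recovered from the name
spark.sessionState.sqlParser.parseMultipartIdentifier("db.tbl")
// => Seq("db", "tbl")         -- would resolve against the current catalog
```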
I see your concern, but as you can see in the test, it works quite well. In the smoke test it passes for `table`, `catalog.schema.table`, and `s3://bucket/wh/path`. Fixing it in Spark could also work, but then I need more pointers on where to start. I looked into the comment on the issue, but it wasn't directly obvious to me. I think we should fix this, given how many people are bumping into it.
I'd say let's just use the new API in Spark and not worry about it. I think you already updated the docs to cover that.