[SPARK-51172][SS][3.5] Rename to spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan #49905
Conversation
…treamingSubplan

This PR is a followup of apache#48149 that proposes to rename `spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan` to `spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan`, for consistent naming.

No, the main PR has not been released yet. CI should test it out. No.

Closes apache#49897 from HyukjinKwon/SPARK-49699.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@dongjoon-hyun @HyukjinKwon |
Although this is an internal configuration, it seems that we need to follow a deprecation procedure because it is already exposed to users.
It means we still need the following in branch-3.5:
.withAlternative("spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan")
And the following:
DeprecatedConfig(PRUNE_FILTERS_CAN_PRUNE_STREAMING_SUBPLAN.head, ...
WDYT, @HeartSaVioR and @HyukjinKwon ?
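For readers following along, here is a minimal sketch of what that deprecation wiring could look like inside SQLConf.scala (so that `buildConf` and `DeprecatedConfig` are in scope); the doc text and the "3.5.5" deprecation version are illustrative assumptions, not the actual patch:

```scala
// Sketch only: keep the mistakenly released key working as an alternative in branch-3.5.
// The builder chain follows the usual SQLConf pattern; doc/version strings are illustrative.
val PRUNE_FILTERS_CAN_PRUNE_STREAMING_SUBPLAN =
  buildConf("spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan")
    .internal()
    .doc("Whether PruneFilters is allowed to prune a streaming subplan. (doc text illustrative)")
    .version("3.5.4")
    // The old key released in 3.5.4 stays resolvable as an alternative.
    .withAlternative("spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan")
    .booleanConf
    .createWithDefault(true)

// ...and register the old key as deprecated, e.g. as an entry of SQLConf.deprecatedSQLConfigs,
// so users setting it get a warning pointing at the new name:
DeprecatedConfig(
  "spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan",
  "3.5.5",
  "Use spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan instead.")
```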
This config is not a user-facing one - this is a very advanced topic and I don't expect any user to try configuring it by themselves. Technically we should have just fixed the issue, but the config is there to avoid breaking existing queries. |
What do we get from that deprecation procedure when we no longer have the config in the very next (technically) minor release? If you think we need to follow that deprecation, the config has to stay for a couple of minor releases; otherwise it sounds to me like just adding more work without outcome. Users do not upgrade to every single release. |
Are you aware of the meaning of |
What I meant is, don't we just want to remove the config ASAP? That was my impression from the previous discussion thread. If I was wrong, I feel like we overreacted. We should have just filed a JIRA ticket marked as a blocker and followed the full process; even in master and 4.0 we should not just remove it. If I wasn't wrong, I believe this is the fastest way with very small impact. We rename the config, release the new version, and guide users on how to mitigate if they hit the combination of "started the query on 3.5.4, upgraded to this version, and somehow hit some weird issue". This is just the same as where we were in Spark 3.5.3 and earlier; the whole effect is that they go back in time to when there was no fix. |
Apache Spark 4.0.0 RC1 is scheduled in three days (2025-02-15). So, we are on the same page on
For this part, you are more accurate with the rule.
You can block the Apache Spark 4.0.0 release with this, if you want, and then do the Apache Spark 3.5.5 release. I'm with you, @HeartSaVioR. |
cc @cloud-fan |
Technically, my aim is not to block the 4.0.0 release, and to do the 3.5.5 release immediately after 4.0.0. That was my best compromise idea. But you are right, @HeartSaVioR. |
I could actually promise that the chance of case 1 is very small, but if we want to follow the practice regardless, I'll file one. I am just seeking a pragmatic way forward if everyone is OK with it. |
Please note that following the full deprecation procedure is not just adding an alternative; it means we are committing to support both case 1 and case 2, and I need to come up with testing. Please understand that this could delay the first RC of 4.0.0. I'm trying to prioritize, but it's not something I have full control over. |
I actually also prefer to just remove this .... |
I'm technically not okay with covering up our mistake silently in the community. At the same time, I agree with you guys that I really don't want to block the Apache Spark 4.0.0 release. For the record, we have well-known internal configs like the following:
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (lines 1080 to 1081 in f749742)
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (lines 1639 to 1640 in f749742)
The bottom line is how we can handle this properly in the Apache way. Since this is our mistake, we should be responsible for correcting it instead of ignoring any possibility. As I mentioned earlier, we cannot delete the exposed config in Apache Spark 3.5.x. We can delete it in Apache Spark 4.0.0+ only. |
You don't seem to understand my explanation clearly. If we want to avoid breakage in case 1, we can never delete the config in 4.0 - probably a couple of minor releases later, maybe 4.2, or even 5.0? There must be no case where a user upgrades Spark from 3.5.4 directly to the version where we remove the config. Please let me know if we are OK with this. I'm ready to file a blocker ticket. |
https://issues.apache.org/jira/browse/SPARK-51187 Here it is. I can't guarantee that this can be done before the proposed date of the first RC. If we want to make a graceful fix, this ticket is definitely a blocker for Spark 4.0.0. |
To avoid trouble, we can always call |
That's not the case, because the streaming engine has the ability to persist the value of the config and set it back into the session config when we restart the query. The incorrect config name is already in the offset log if they have ever run the query on Spark 3.5.4. It's not a single-line change if we really want to fix this without any trouble. |
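To illustrate the mechanism described here, below is a minimal, self-contained Scala sketch of how a streaming engine can snapshot a config value into its offset metadata when a query starts and replay it on restart; `OffsetMetadata`, `persist`, and `restore` are hypothetical names, not Spark's actual classes:

```scala
// Hypothetical sketch of the persist-and-restore behavior described above.
case class OffsetMetadata(conf: Map[String, String])

object ConfRoundTrip {
  val Key = "spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"

  // On the first batch of a new query: snapshot the session value into the offset log.
  def persist(sessionConf: Map[String, String]): OffsetMetadata =
    OffsetMetadata(Map(Key -> sessionConf.getOrElse(Key, "true")))

  // On restart: the persisted value wins over whatever the session currently says,
  // so the query keeps the semantics it started with, even across upgrades.
  def restore(metadata: OffsetMetadata, sessionConf: Map[String, String]): Map[String, String] =
    sessionConf ++ metadata.conf

  def main(args: Array[String]): Unit = {
    val meta = persist(Map.empty)                 // new query: default "true" is captured
    println(restore(meta, Map(Key -> "false")))   // restart: persisted "true" overrides the session
  }
}
```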
I'm thinking about the fix. I'm trying to prioritize, but I might not come up with the fix this weekend. Could you give me a bit more time to address this properly? Probably early next week. |
To @HeartSaVioR, @HyukjinKwon, @cloud-fan,
According to the vote result,
I believe we can close this PR in favor of
I'll address the recent discussion in the other PRs. Closing this. |
Thank you for closing this. |
Credit to @HyukjinKwon. This PR is just to ensure I've gotten sign-off from other committers before proceeding.
What changes were proposed in this pull request?
This PR is a followup of #48149 that proposes to rename
spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan
to spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan
Why are the changes needed?
For consistent naming.
Does this PR introduce any user-facing change?
Yes, as the config was released in Spark 3.5.4 and we are making a change to the config.
The config itself is internal and users are not expected to deal with it. The problem comes from the fact that the config is coupled with how we avoid making a breaking change against "the streaming query which had run with prior Spark versions".
There are two cases:
1>
When they start a new query in Spark 3.5.4, there is no offset log to read the static config back from, so the value of the config spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan will follow the default value, true.
This value will be written to the offset log to ensure it is kept for the lifecycle of the streaming query.
When they upgrade to the Spark version in which we renamed the config, there is an offset log to read the static config back from, but it has no entry for spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan, hence we enable backward compatibility mode and set the value false for the config spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan.
This could break the query if the rule impacts it, because the effectiveness of the fix is flipped.
2>
When they upgrade their existing streaming query from an older version to Spark 3.5.4, there is an offset log to read the static config back from, but it has no entry for spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan, hence we enable backward compatibility mode and set the value false for the config spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan. So the fix is disabled.
When they further upgrade this query to the Spark version in which we renamed the config, the same thing applies and the fix remains disabled. So no change.
Ideally we should take care of case 1, but the only way we could ensure this is to make an alias so both configs work the same, and this would force us to keep the problematic config forever. (3.5.4 users don't only bump to the next 3.5.x; they can jump straight to 4.0.0, hence we would have to keep the alias for 4.x as well. That's against the rationale for fixing this.)
So the only option seems to be to tolerate case 1. Hopefully the failure from that rule does not happen frequently, and users also have to have started their query on 3.5.4 (exactly this version) and then upgraded, so there is less and less chance to hit it.
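As a purely hypothetical sketch of the two cases above (the object and method names below are illustrative, not Spark's actual restart code): when the key that a given Spark version looks up is missing from the offset log, the engine falls back to the backward-compatible default false, which is what flips the fix in case 1 and leaves it disabled in case 2.

```scala
// Illustrative only: how the backward-compatibility fallback plays out for the two cases.
object RenameCases {
  val OldKey = "spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"
  val NewKey = "spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"

  // If the key this Spark version looks for is absent from the offset log,
  // assume the query predates the fix and disable it (compat default "false").
  def effectiveValue(offsetLogConf: Map[String, String], lookupKey: String): Boolean =
    offsetLogConf.get(lookupKey).map(_.toBoolean).getOrElse(false)

  def main(args: Array[String]): Unit = {
    // Case 1: a query started on 3.5.4 wrote the OLD key with the default "true";
    // after upgrading to the renamed version, the NEW key is looked up -> flips to false.
    println(effectiveValue(Map(OldKey -> "true"), NewKey))   // false: fix effectively disabled

    // Case 2: a query started before 3.5.4 has no entry at all; both 3.5.4 and the
    // renamed version fall back to false -> behavior unchanged across the upgrade.
    println(effectiveValue(Map.empty, OldKey))               // false
    println(effectiveValue(Map.empty, NewKey))               // false
  }
}
```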
How was this patch tested?
CI should test it out.
Was this patch authored or co-authored using generative AI tooling?
No.