
[SPARK-51172][SS][3.5] Rename to spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan #49905


Closed

Conversation

@HeartSaVioR (Contributor) commented Feb 12, 2025

Credit to @HyukjinKwon. This PR is just to ensure I've synced up with the other committers before proceeding.

What changes were proposed in this pull request?

This PR is a followup of #48149 that proposes to rename `spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan` to `spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan`.
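
For reference, the renamed entry in SQLConf would look roughly like the sketch below; the doc string and version shown are illustrative assumptions, not quoted source:

val PRUNE_FILTERS_CAN_PRUNE_STREAMING_SUBPLAN =
  buildConf("spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan")
    .internal()
    // Controls whether the PruneFilters rule may prune a streaming subplan.
    .doc("When true, PruneFilters is allowed to prune the streaming subplan.")
    .version("3.5.4")
    .booleanConf
    .createWithDefault(true)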

Why are the changes needed?

For consistent naming.

Does this PR introduce any user-facing change?

Yes, as the config was released in Spark 3.5.4 and we are making a change to the config.

The config itself is internal and users are not expected to deal with it. The problem comes from the fact that the config is coupled with how we avoid making a breaking change against "streaming queries which had run on prior Spark versions".

There are two cases:

  1. The streaming query started on Spark 3.5.4
  2. The streaming query started before Spark 3.5.4 and was migrated to Spark 3.5.4

Case 1:
When they start a new query on Spark 3.5.4, there is no offset log to read the static config back from, so the value of the config spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan follows the default value, true.

This value is written back to the offset log to ensure it is kept across the streaming query's lifecycle.

When they upgrade to the Spark version in which we renamed the config, there is an offset log to read the static config back from, but there is no entry for spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan, hence we enable backward compatibility mode and set the config spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan to false.

This could break the query if the rule impacts its plan, because the effectiveness of the fix is flipped.

Case 2:
When they upgrade an existing streaming query from an older version to Spark 3.5.4, there is an offset log to read the static config back from, but there is no entry for spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan, hence we enable backward compatibility mode and set the config spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan to false. So the fix is disabled.

When they further upgrade this query to the Spark version in which we renamed the config, the same thing applies and the fix stays disabled. So, no change.
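
To make the mechanism behind both cases concrete, here is a simplified sketch of the restore-on-restart logic; restoreConf is a hypothetical helper for illustration, not the actual OffsetSeqMetadata code:

import org.apache.spark.sql.SparkSession

// Sketch: restore a per-query config from the offset log when a query restarts.
def restoreConf(confsFromOffsetLog: Map[String, String], session: SparkSession): Unit = {
  val key = "spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"
  confsFromOffsetLog.get(key) match {
    // The query already ran on a version that persisted this key: keep its value.
    case Some(value) => session.conf.set(key, value)
    // No entry: assume the query predates the fix, so enable backward
    // compatibility mode and disable the fix, ignoring the default (true).
    case None => session.conf.set(key, "false")
  }
}

Under this logic, a query started on 3.5.4 persists only the old key, so after the rename the lookup for the new key misses and the fix silently flips off, which is exactly the case 1 breakage described above.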

Ideally we should take care of case 1, but the only way we could ensure this is to make an alias so that both configs work the same, and that would force us to keep the problematic config forever. (3.5.4 users don't only bump to the next 3.5.x; they can jump straight to 4.0.0, hence we would have to keep the alias for 4.x as well. That's against the rationale for fixing this.)

So the only option seems to be to tolerate case 1. Hopefully failures from that rule do not happen frequently, and users also must have started their query on 3.5.4 (exactly this version) before upgrading, so there is less and less chance of hitting it.

How was this patch tested?

CI should test it out.

Was this patch authored or co-authored using generative AI tooling?

No.

Commit message:
[SPARK-51172][SS] Rename to spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan

This PR is a followup of apache#48149 that proposes to rename `spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan` to `spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan`

For consistent naming.

No, the main PR has not been released yet.

CI should test it out.

No.

Closes apache#49897 from HyukjinKwon/SPARK-49699.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@github-actions bot added the SQL label on Feb 12, 2025
@HeartSaVioR (Contributor, Author) commented Feb 12, 2025

@dongjoon-hyun @HyukjinKwon
Please take a look. I've described the user-facing change; it actually applies to the prior PR as well (master/4.0), but let's put it here instead.
Please let me know if we want to solve the issue in a different way. Thanks!

@HeartSaVioR HeartSaVioR changed the title [SPARK-51172][SS] Rename to spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan [SPARK-51172][SS][3.5] Rename to spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan Feb 12, 2025
@dongjoon-hyun (Member) left a review comment:

Although this is an internal configuration, it seems that we need to follow a deprecation procedure because it is already exposed to users.

It means that we still need the following in branch-3.5.

.withAlternative("spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan")

And, the following.

DeprecatedConfig(PRUNE_FILTERS_CAN_PRUNE_STREAMING_SUBPLAN.head, ...

WDYT, @HeartSaVioR and @HyukjinKwon ?
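
Put together, the suggested deprecation path would look roughly like the sketch below; the deprecation version and message are assumptions for illustration:

val PRUNE_FILTERS_CAN_PRUNE_STREAMING_SUBPLAN =
  buildConf("spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan")
    .internal()
    // Keep honoring the old (mistakenly named) key as a fallback.
    .withAlternative("spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan")
    .booleanConf
    .createWithDefault(true)

// Plus an entry registering the old name as deprecated:
DeprecatedConfig("spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan",
  "3.5.5", "Use spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan instead.")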

@HeartSaVioR (Contributor, Author) commented
This config is not a user-facing one; this is a very advanced topic and I don't expect any user to try configuring it themselves. Technically we should have just fixed the issue, but the config is there to avoid breaking existing queries.

@HeartSaVioR (Contributor, Author) commented Feb 12, 2025

What do we get from that deprecation procedure when we no longer have the config in the very next (technically) minor release? If you think we need to follow that deprecation, the config has to stay for a couple of minor releases; otherwise it sounds to me like just adding more work without outcome. Users do not upgrade on every single release.

@dongjoon-hyun (Member) commented

> this has to be there for a couple minor releases, otherwise it sounds to me just adding more work without outcome.

Are you aware of the meaning of .withAlternative("spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"), @HeartSaVioR ?

@HeartSaVioR (Contributor, Author) commented Feb 12, 2025

What I meant is, don't we just want to remove the config ASAP? I was under that impression from the previous discussion thread. If I was wrong, I feel like we overreacted; we should have just filed a JIRA ticket marked as a blocker and followed the full process. Even in master and 4.0 we should not have just removed it.

If I wasn't wrong, I believe this is the fastest way with very small impact: we rename the config, release the new version, and guide users on a way to mitigate if they hit the combination of "started the query on 3.5.4, upgraded to this version, and somehow hit some weird issue". This is just the same as where we were in Spark 3.5.3 and earlier; the whole effect is that they go back in time to before the fix existed.

@dongjoon-hyun (Member) commented

Apache Spark 4.0.0 RC1 is scheduled in three days (2025-02-15). So, we are on the same page on master and branch-4.0.

> What I meant is, don't we just want to remove the config ASAP

(screenshot attached)

For this part, you are more accurate with the rule.

> Even master and 4.0 we should not just remove that.

You can block the Apache Spark 4.0.0 release with this, if you want. Then, we do the Apache Spark 3.5.5 release. I'm with you, @HeartSaVioR.

@dongjoon-hyun (Member) commented

cc @cloud-fan

@dongjoon-hyun (Member) commented Feb 13, 2025

Technically, I aimed not to block the 4.0.0 release and to do 3.5.5 immediately after the 4.0.0 release. That was my best compromise idea. But you are right, @HeartSaVioR.

@HeartSaVioR (Contributor, Author) commented Feb 13, 2025

I could actually promise that the chance of case 1 is super small, but if we want to follow the practice regardless, I'll file a blocker ticket. I am just seeking a pragmatic way forward if everyone is OK with it.

@HeartSaVioR (Contributor, Author) commented Feb 13, 2025

Please note that following the full deprecation is not just adding an alternative; it means we are committing to support both case 1 and case 2, and I need to come up with testing. Please understand that this could delay the first RC of 4.0.0. I'm trying to prioritize, but it's not something I have full control over.

@HyukjinKwon (Member) commented

I actually also prefer to just remove this ....

@dongjoon-hyun (Member) commented

I'm technically not okay with covering up our mistake silently in the community. At the same time, I agree with you guys that I really don't want to block the Apache Spark 4.0.0 release.

For the record, given that we have well-known internal configs like the following, internal or not is not a criterion for the decision here.

val CASE_SENSITIVE = buildConf(SqlApiConfHelper.CASE_SENSITIVE_KEY)
.internal()

val CONVERT_CTAS = buildConf("spark.sql.hive.convertCTAS")
.internal()

The bottom line is how we can handle this properly in the Apache way. Since this is our mistake, we should be responsible for correcting it instead of ignoring any possibility.

As I mentioned earlier, we cannot delete the exposed config in Apache Spark 3.5.x. We can delete it in Apache Spark 4.0.0+ only.
#49905 (review)

@HeartSaVioR (Contributor, Author) commented Feb 13, 2025

You don't seem to understand my explanation clearly. If we want to avoid breakage in case 1, we can never delete the config in 4.0. Probably a couple of minor releases later; maybe 4.2, or just 5.0?

There must be no case where a user upgrades Spark from 3.5.4 directly to the version in which we remove the config.

Please let me know if we are OK with this. I'm ready to file a blocker ticket.

@HeartSaVioR (Contributor, Author) commented Feb 13, 2025

https://issues.apache.org/jira/browse/SPARK-51187 Here it is. I can't guarantee that this can be done before the proposed date of the first RC. If we want to make a graceful fix, this ticket is definitely a blocker for Spark 4.0.0.
@cloud-fan
Please do not start the release process before we address the blockers. I can post to dev@ in case we want to notify the community broadly.

@cloud-fan (Contributor) commented

To avoid trouble, we can always call withAlternative to still support the old name. It's just a one-line change.

@HeartSaVioR (Contributor, Author) commented Feb 13, 2025

That's not the case, because the streaming engine persists the value of the config and sets it back into the session config when the query restarts. The incorrect config name is in the offset log whenever the query has ever run on Spark 3.5.4. It's not a single-line change if we really want to fix this without any trouble.
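
For illustration, the metadata line of an offset log file carries a conf map roughly like the following (values abridged and hypothetical), so any query that ever ran on 3.5.4 has the old key baked into its checkpoint:

v1
{"batchWatermarkMs":0,"batchTimestampMs":1739400000000,"conf":{"spark.sql.shuffle.partitions":"200","spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan":"true"}}

A withAlternative alias fixes session-level lookups of the old name, but it does not rewrite this persisted key, which is why the restart path needs its own handling.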

@HeartSaVioR (Contributor, Author) commented Feb 13, 2025

I'm thinking about the fix. I'm trying to prioritize, but I might not come up with it this weekend. Could you give me a bit more time to address this properly? Probably early next week.

@dongjoon-hyun (Member) left a review.

@HeartSaVioR (Contributor, Author) commented

I'll address the recent discussion in other PRs. Closing this.

@dongjoon-hyun (Member) commented

Thank you for closing this.
