
Conversation

pan3793
Member

pan3793 commented Oct 17, 2025

What changes were proposed in this pull request?

As title.

Why are the changes needed?

For example, when executing select version() on spark-4.1.0-preview2-bin-hadoop3, it returns 4.1.0, which is misleading.

0: jdbc:sc://localhost:15002> select version();
+-------------------------------------------------+
|                    version()                    |
+-------------------------------------------------+
| 4.1.0 c5ff48cc2b2c5a1526789ae414ff4c63b053d3ec  |
+-------------------------------------------------+

Another case is vendor builds, whose versions usually carry a suffix, e.g., Foo Spark 4.0.1-foo1.2.3; in this case, having version() return only the short version 4.0.1 is misleading.

As a reference, in CDH Hive, version() returns the full version.

hive> SELECT version()
2.1.1-cdh6.3.2 b3393cf499504df1d2a12d34b4285e5d0c02be11

MariaDB version()'s result contains vendor info

> SELECT version()
11.8.2-MariaDB-ubu2404

PostgreSQL version()'s result also contains vendor info

> SELECT version()
PostgreSQL 17.5 (Debian 17.5-1.pgdg120+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit

Does this PR introduce any user-facing change?

No for official Apache releases, because every official release version is the same as its short version.

But users would see the full Spark version for preview releases when executing SELECT version().

How was this patch tested?

Existing unit tests are adjusted.

Was this patch authored or co-authored using generative AI tooling?

No.

github-actions bot added the SQL label Oct 17, 2025
@pan3793
Member Author

pan3793 commented Oct 17, 2025

cc @yaooqinn @dongjoon-hyun

@beliefer
Contributor

I think we should keep the legacy behavior. We should add one parameter (e.g., isExtended, defaulting to false) to show the extra version info.

@yaooqinn
Member

Following @dongjoon-hyun's #26209 (comment), I changed VERSION to VERSION_SHORT.

I'd still leave the decision to @dongjoon-hyun.

@pan3793
Member Author

pan3793 commented Oct 17, 2025

@beliefer I don't see the necessity of introducing a legacy config, because the result does not differ for official Apache releases.

Member

dongjoon-hyun left a comment

  1. As mentioned above, we cannot make a breaking change like this, @pan3793.
  2. In addition, yes, this was intentionally designed that way.

@pan3793
Member Author

pan3793 commented Oct 17, 2025

@dongjoon-hyun, if we only count official Apache Spark releases, from the user's perspective the result does not change, so I suppose it will break nothing?

And I wonder why this was designed to return a short version? Many times in daily Spark support cases, I have seen users claim they are using Spark x.y.z when it is actually a vendor-modified version (which does carry a vendor suffix in its full version); this is definitely misleading info!

@dongjoon-hyun
Member

No. Please see the rationale of shortVersion. You are also breaking the dev environment by annoyingly exposing -SNAPSHOT, @pan3793.

/**
 * Given a Spark version string, return the short version string.
 * E.g., for 3.0.0-SNAPSHOT, return '3.0.0'.
 */
def shortVersion(sparkVersion: String): String = {
  shortVersionRegex.findFirstMatchIn(sparkVersion) match {
    case Some(m) => m.group(1)
    case None =>
      throw new IllegalArgumentException(s"Spark tried to parse '$sparkVersion' as a Spark" +
        s" version string, but it could not find the major/minor/maintenance version numbers.")
  }
}
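
For context, shortVersionRegex is not shown in the snippet above; assuming it captures the leading major.minor.patch digits as its first group, the helper drops any pre-release or vendor suffix:

// Illustrative behavior only (the regex definition is assumed, not quoted from Spark):
shortVersion("4.1.0-SNAPSHOT")  // returns "4.1.0"
shortVersion("4.0.1-foo1.2.3")  // returns "4.0.1"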

In addition, if we only consider official Spark releases, why do we need this PR?

@pan3793
Member Author

pan3793 commented Oct 18, 2025

@dongjoon-hyun I understand that having -SNAPSHOT sometimes introduces noise in the dev process, e.g., in golden files, but this is easy to tackle, for example by redacting, or by using variables like SparkBuildInfo.spark_version() instead of string literals in assertions. As you can see, all tests now pass with just 2 lines changed in the test code.
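
A minimal sketch of the kind of assertion change meant here (assuming a SparkSession named spark, and using org.apache.spark.SPARK_VERSION for concreteness):

import org.apache.spark.SPARK_VERSION

// version() returns "<version> <git revision>", so compare the prefix against
// the build-time constant instead of a hard-coded literal; this keeps the test
// agnostic to -SNAPSHOT or vendor suffixes.
val result = spark.sql("SELECT version()").head().getString(0)
assert(result.startsWith(SPARK_VERSION))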

In addition, if we only consider official Spark releases, why do we need this PR?

My point is, we should use official Spark releases to judge whether it's a breaking change or not; in the meantime, we should make the code friendly to vendor distributions.

I also checked CDH Hive; its version() returns the full version, which is much more intuitive.

hive> SELECT version()
2.1.1-cdh6.3.2 b3393cf499504df1d2a12d34b4285e5d0c02be11

@HyukjinKwon
Member

I wonder if it would be better to just have another expression like full_version.

@pan3793
Member Author

pan3793 commented Oct 20, 2025

@HyukjinKwon Having another function looks like overkill to me.

@beliefer @dongjoon-hyun @HyukjinKwon

I tested two additional popular RDBMSs, MariaDB and PostgreSQL; both support the version() function, and their results contain vendor info.

If there are concerns about breaking changes for vendor Spark distributions, how about introducing an internal config, e.g., spark.sql.legacy.useShortVersion.enabled? But given that this does not change behavior for Apache releases, I think we don't need to expose it in the docs or migration guide.
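
A rough sketch of how such an internal flag could gate the result inside the version() expression (the config key matches the suggestion above; the surrounding names are illustrative assumptions, not actual Spark API):

// Hypothetical: choose the reported version string based on an internal legacy flag.
val useShort = SQLConf.get
  .getConfString("spark.sql.legacy.useShortVersion.enabled", "false").toBoolean
val versionString =
  if (useShort) SPARK_VERSION_SHORT + " " + SPARK_REVISION // legacy: short version
  else SPARK_VERSION + " " + SPARK_REVISION // proposed: full version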

@yaooqinn
Member

Adding a legacy configuration appears contradictory under the hypothesis of non-breaking changes.

Or, if this change benefits vendor-specific Spark distributions, they can maintain it themselves by modifying version() or simply adding a new UDF. It seems inappropriate for the community to make any assumptions on their behalf.

Or, if it's for Spark contributors hunting bugs, the version, with or without a suffix, can still be vague. Instead, they should refer to the git hash.

@pan3793
Member Author

pan3793 commented Oct 20, 2025

Adding a legacy configuration appears contradictory under the hypothesis of non-breaking changes.

@yaooqinn I hold the same view; I prefer not to have a config, but if someone wants one, I'm also fine with an internal config.

It seems inappropriate for the community to make any assumptions on their behalf.

I don't fully agree with this. It's a fact that many vendors have their own Spark distributions; where possible, we should make the code vendor-friendly as long as it doesn't introduce additional complexity.

Consider some common vendor practices: a vendor usually has its own support channel, version suffix, and docs site. Historically,

  • SPARK-41134 changed the internal error message from ... fill a bug report in ... to ... report this bug to the corresponding communities or vendors, and provide the full stack trace.
  • SPARK-42249 put the spark_doc_root into spark-version-info.properties instead of hard-coding it, which also simplified customization for vendors.

Or, if it's for Spark contributors hunting bugs, the version, with or without a suffix, can still be vague. Instead, they should refer to the git hash.

I agree the git hash is more accurate, but the full version is still valuable in many cases. For example, if there is a well-known bug in Foo Spark 3.2.1-foo1.2.3, it's easy to confirm with the full version. And if a user runs Spark SQL in an access-limited environment that forbids sharing screenshots or copy-pasting, the git hash is unlikely to be included in the version info they provide when they hit issues and ask for help publicly.

@beliefer
Contributor

beliefer commented Oct 20, 2025

@beliefer I don't see the necessity of introducing a legacy config, because the result does not differ for official Apache releases.

We do not need a legacy config. It is just an optional parameter.

> SELECT version()
2.1.1
> SELECT version(true)
2.1.1-cdh6.3.2 b3393cf499504df1d2a12d34b4285e5d0c02be11

@pan3793
Member Author

pan3793 commented Oct 20, 2025

@beliefer, this actually introduces user-facing changes and complicates things, which is exactly what I want to avoid whenever possible.

To summarize, my current proposed candidates are:

  1. (preferred) - just change the version() result from short_version to spark_version; it changes nothing for Apache releases (see the sketch after this list).

  2. (acceptable) - option one, but with an internal config to allow reverting to the previous behavior, which covers Apache preview releases and vendor cases.

  3. (not entirely convinced) - keep things as-is.

  4. (disagree) - introduce another function or new args for version(); this complicates things and might confuse users further.
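
For reference, option 1 would amount to roughly this change in the SparkVersion expression (a sketch; the VERSION/VERSION_SHORT naming follows @yaooqinn's comment above, and the exact expression body is an assumption):

// Before (current behavior): report the short version plus the git revision.
UTF8String.fromString(SPARK_VERSION_SHORT + " " + SPARK_REVISION)
// After (proposed): report the full version plus the git revision.
UTF8String.fromString(SPARK_VERSION + " " + SPARK_REVISION)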

