
Conversation

pan3793
Member

pan3793 commented Oct 17, 2025

What changes were proposed in this pull request?

As title.

Why are the changes needed?

For example, when executing select version() on spark-4.1.0-preview2-bin-hadoop3, it returns 4.1.0, which is misleading.

0: jdbc:sc://localhost:15002> select version();
+-------------------------------------------------+
|                    version()                    |
+-------------------------------------------------+
| 4.1.0 c5ff48cc2b2c5a1526789ae414ff4c63b053d3ec  |
+-------------------------------------------------+

Another case is vendor builds, whose versions usually carry a suffix, e.g., Foo Spark 4.0.1-foo1.2.3; in this case, having version() return only the short version 4.0.1 is misleading.

As a reference, in CDH Hive, version() returns the full version.

hive> SELECT version()
2.1.1-cdh6.3.2 b3393cf499504df1d2a12d34b4285e5d0c02be11

MariaDB version()'s result contains vendor info

> SELECT version()
11.8.2-MariaDB-ubu2404

PostgreSQL version()'s result also contains vendor info

> SELECT version()
PostgreSQL 17.5 (Debian 17.5-1.pgdg120+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit

Does this PR introduce any user-facing change?

No for official Apache releases, because every official release version is the same as its short version.

But users would see the full Spark version for preview releases when executing SELECT version().

How was this patch tested?

Existing unit tests are adjusted.

Was this patch authored or co-authored using generative AI tooling?

No.

github-actions bot added the SQL label Oct 17, 2025
@pan3793
Member Author

pan3793 commented Oct 17, 2025

cc @yaooqinn @dongjoon-hyun

@beliefer
Contributor

I think we should keep the legacy behavior. We should add one parameter (e.g., isExtended, defaulting to false) to show the extra version info.

@yaooqinn
Member

Following @dongjoon-hyun's #26209 (comment), I changed VERSION to VERSION_SHORT.

I'd still leave the decision to @dongjoon-hyun.

@pan3793
Member Author

pan3793 commented Oct 17, 2025

@beliefer I don't see the necessity of introducing a legacy config, because the result does not differ for official Apache releases.

Member

dongjoon-hyun left a comment

  1. As mentioned above, we cannot make a breaking change like this, @pan3793.
  2. In addition, yes, this was intentionally designed that way.

@pan3793
Member Author

pan3793 commented Oct 17, 2025

@dongjoon-hyun, if we only count official Apache Spark releases, from the user's perspective the result does not change, so I suppose it will break nothing?

And I wonder why this was designed to return a short version? Many times in daily Spark support cases, I have seen users claim they are using Spark x.y.z when it is actually a vendor-modified version (which does carry a vendor suffix in its full version); this is definitely misleading info!

@dongjoon-hyun
Member

No. Please see the rationale of shortVersion. You are also breaking the dev environment by annoyingly exposing -SNAPSHOT, @pan3793.

/**
 * Given a Spark version string, return the short version string.
 * E.g., for 3.0.0-SNAPSHOT, return '3.0.0'.
 */
def shortVersion(sparkVersion: String): String = {
  shortVersionRegex.findFirstMatchIn(sparkVersion) match {
    case Some(m) => m.group(1)
    case None =>
      throw new IllegalArgumentException(s"Spark tried to parse '$sparkVersion' as a Spark" +
        s" version string, but it could not find the major/minor/maintenance version numbers.")
  }
}
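
For context, shortVersionRegex is not shown in the snippet above; assuming it captures the leading major.minor.patch digits as its first group, the helper drops any pre-release or vendor suffix:

// Illustrative behavior only (the regex definition is assumed, not quoted from Spark):
shortVersion("4.1.0-SNAPSHOT")  // returns "4.1.0"
shortVersion("4.0.1-foo1.2.3")  // returns "4.0.1"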

In addition, if we only consider official Spark releases, why do we need this PR?

@pan3793
Member Author

pan3793 commented Oct 18, 2025

@dongjoon-hyun I understand that having -SNAPSHOT sometimes introduces noise in the dev process, e.g., in golden files, but this is easy to tackle, for example by redacting, or by using variables like SparkBuildInfo.spark_version() instead of string literals in assertions. As you can see, all tests now pass with just 2 lines changed in the test code.
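
A minimal sketch of the kind of assertion change meant here (assuming a SparkSession named spark, and using org.apache.spark.SPARK_VERSION for concreteness):

import org.apache.spark.SPARK_VERSION

// version() returns "<version> <git revision>", so compare the prefix against
// the build-time constant instead of a hard-coded literal; this keeps the test
// agnostic to -SNAPSHOT or vendor suffixes.
val result = spark.sql("SELECT version()").head().getString(0)
assert(result.startsWith(SPARK_VERSION))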

In addition, if we only consider official Spark releases, why do we need this PR?

My point is, we should use official Spark releases to judge whether it's a breaking change or not; in the meantime, we should make the code friendly to vendor distributions.

I also checked CDH Hive; its version() returns the full version, which is much more intuitive.

hive> SELECT version()
2.1.1-cdh6.3.2 b3393cf499504df1d2a12d34b4285e5d0c02be11

@HyukjinKwon
Member

I wonder if it would be better to just have another expression like full_version.

@pan3793
Member Author

pan3793 commented Oct 20, 2025

@HyukjinKwon Having another function looks like overkill to me.

@beliefer @dongjoon-hyun @HyukjinKwon

I tested two additional popular RDBMSs, MariaDB and PostgreSQL; both support the version() function, and their results contain vendor info.

If there are concerns about breaking changes for vendor Spark distributions, how about introducing an internal config, e.g., spark.sql.legacy.useShortVersion.enabled? But given that this does not change behavior for Apache releases, I think we don't need to expose it in the docs or migration guide.
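
A rough sketch of how such an internal flag could gate the result inside the version() expression (the config key matches the suggestion above; the surrounding names are illustrative assumptions, not actual Spark API):

// Hypothetical: choose the reported version string based on an internal legacy flag.
val useShort = SQLConf.get
  .getConfString("spark.sql.legacy.useShortVersion.enabled", "false").toBoolean
val versionString =
  if (useShort) SPARK_VERSION_SHORT + " " + SPARK_REVISION // legacy: short version
  else SPARK_VERSION + " " + SPARK_REVISION // proposed: full version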

@yaooqinn
Member

Adding a legacy configuration appears contradictory under the hypothesis of non-breaking changes.

Or, if this change benefits vendor-specific Spark distributions, they can maintain it themselves by modifying version() or simply adding a new UDF. It seems inappropriate for the community to make any assumptions on their behalf.

Or, if it's for Spark contributors hunting bugs, the version, with or without a suffix, can still be vague. Instead, they should refer to the git hash.

@pan3793
Member Author

pan3793 commented Oct 20, 2025

Adding a legacy configuration appears contradictory under the hypothesis of non-breaking changes.

@yaooqinn I hold the same view; I prefer not to have a config, but if someone wants one, I'm also fine with an internal config.

It seems inappropriate for the community to make any assumptions on their behalf.

I don't fully agree with this. It's a fact that many vendors have their own Spark distributions; where possible, we should make the code vendor-friendly as long as it doesn't introduce additional complexity.

Consider some common vendor practices: a vendor usually has its own support channel, version suffix, and docs site. Historically,

  • SPARK-41134 changed the internal error message from ... fill a bug report in ... to ... report this bug to the corresponding communities or vendors, and provide the full stack trace.
  • SPARK-42249 put the spark_doc_root into spark-version-info.properties instead of hard-coding it, which also simplified customization for vendors.

Or, if it's for Spark contributors hunting bugs, the version, with or without a suffix, can still be vague. Instead, they should refer to the git hash.

I agree the git hash is more accurate, but the full version is still valuable in many cases. For example, if there is a well-known bug in Foo Spark 3.2.1-foo1.2.3, it's easy to confirm with the full version. And if a user runs Spark SQL in an access-limited environment that forbids sharing screenshots or copy-pasting, the git hash is unlikely to be included in the version info they provide when they hit issues and ask for help publicly.

@beliefer
Contributor

beliefer commented Oct 20, 2025

@beliefer I don't see the necessity of introducing a legacy config, because the result does not differ for official Apache releases.

We do not need a legacy config. It is just an optional parameter.

> SELECT version()
2.1.1
> SELECT version(true)
2.1.1-cdh6.3.2 b3393cf499504df1d2a12d34b4285e5d0c02be11

@pan3793
Member Author

pan3793 commented Oct 20, 2025

@beliefer, this actually introduces user-facing changes and complicates things, which is exactly what I want to avoid whenever possible.

To summarize, my current proposed candidates are:

  1. (preferred) - just change the version() result from short_version to spark_version; it changes nothing for Apache releases (see the sketch after this list).

  2. (acceptable) - option one, but with an internal config to allow reverting to the previous behavior, which covers Apache preview releases and vendor cases.

  3. (not entirely convinced) - keep things as-is.

  4. (disagree) - introduce another function or new args for version(); this complicates things and might confuse users further.
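
For reference, option 1 would amount to roughly this change in the SparkVersion expression (a sketch; the VERSION/VERSION_SHORT naming follows @yaooqinn's comment above, and the exact expression body is an assumption):

// Before (current behavior): report the short version plus the git revision.
UTF8String.fromString(SPARK_VERSION_SHORT + " " + SPARK_REVISION)
// After (proposed): report the full version plus the git revision.
UTF8String.fromString(SPARK_VERSION + " " + SPARK_REVISION)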

