[SPARK-33721][SQL] Support to use Hive build-in functions by configuration #30686
Conversation
Can one of the admins verify this patch?
There are too many differences, for example:

```sql
hive> select timestamp '2020-06-01 24:00:00';
OK
2020-06-02 00:00:00
Time taken: 0.034 seconds, Fetched: 1 row(s)
```

```sql
spark-sql> select to_timestamp('2020-06-01 24:00:00');
NULL
```
Could we add these differences to http://spark.apache.org/docs/latest/sql-migration-guide.html#compatibility-with-apache-hive?
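The divergence above can be modeled outside Spark with a toy parser. This is a sketch for illustration only, not Spark's or Hive's actual implementation: it assumes Spark-style parsing rejects hour 24 (yielding NULL), while Hive-style parsing rolls hour 24 over to midnight of the next day.

```python
from datetime import datetime, timedelta

def parse_strict(s):
    # Spark-style strict parsing: hour 24 is not a valid field value,
    # so the whole timestamp is rejected (Spark returns NULL).
    try:
        return datetime.strptime(s, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None

def parse_lenient(s):
    # Hive-style lenient parsing: treat the time fields as an offset
    # from midnight, so hour 24 rolls over to the next day.
    date_part, time_part = s.split(" ")
    h, m, sec = (int(x) for x in time_part.split(":"))
    base = datetime.strptime(date_part, "%Y-%m-%d")
    return base + timedelta(hours=h, minutes=m, seconds=sec)

print(parse_strict("2020-06-01 24:00:00"))   # None (like Spark's NULL)
print(parse_lenient("2020-06-01 24:00:00"))  # 2020-06-02 00:00:00 (like Hive)
```

Both behaviours are defensible, which is why documenting the difference matters more than declaring one of them correct.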
Which Hive version is this migration guide based on? I found some inconsistencies not listed in the doc, but they are relative to Hive 1.x, so I guess the doc is based on Hive 2.x? Since Spark 2.x is prebuilt with Hive 1.x, Hive 1.x is still widely used, so perhaps we should note these inconsistencies in the doc as well?
I believe we should just document comprehensively. If there are behaviours specific to a particular Hive version, we should document those as well.
Sorry, but I'm -1 for this approach for two reasons.
- The Hive/Spark SQL differences have existed for a long time. Users should be aware of them.
- This will block us from upgrading the internal Hive execution library, due to incompatibilities across Hive versions. For example, it may mean we cannot upgrade the Hive execution library from 2.4 to 3.1 in the future without a breaking change.
Instead of this approach, users should register the required Hive UDFs explicitly on their side.
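The trade-off in this thread can be illustrated with a toy model (plain Python, not Spark code; every name below is hypothetical): the PR's global configuration switch flips all built-ins to Hive semantics at once, while the reviewers' recommendation is explicit, per-function registration on the user side.

```python
def spark_trunc(s):
    # Stand-in for Spark's built-in behaviour of some function.
    return "spark:" + s

def hive_trunc(s):
    # Stand-in for the Hive-compatible behaviour of the same function.
    return "hive:" + s

# Approach 1 (this PR): a single config flag flips every built-in at once,
# coupling Spark's behaviour to one specific Hive version's semantics.
def resolve_by_config(name, use_hive_builtins):
    table = {"trunc": (spark_trunc, hive_trunc)}
    spark_fn, hive_fn = table[name]
    return hive_fn if use_hive_builtins else spark_fn

# Approach 2 (recommended here): users explicitly register only the Hive
# functions they need, overriding a built-in per function, by opt-in.
registry = {"trunc": spark_trunc}
registry["trunc"] = hive_trunc  # explicit, per-function override

print(resolve_by_config("trunc", True)("x"))  # hive:x
print(registry["trunc"]("x"))                 # hive:x
```

The second approach keeps Spark free to upgrade its internal Hive execution library, since compatibility is opted into function by function rather than promised globally.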
cc @cloud-fan and @gatorsmile
Yeah, a better approach would be to switch the catalog to a Hive-compatible implementation, but that requires us to add
Following the advice above, I'll close this PR as-is.
What changes were proposed in this pull request?
Hive and Spark SQL have many differences in their built-in functions. The differences between several functions are shown below.
Note: the Hive version is 1.2.1.
Why are the changes needed?
There is no consensus on which engine's behaviour is more accurate. Users would prefer to choose according to their real production environment, so we should improve this.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Manual tests.