[SPARK-33721][SQL] Support to use Hive build-in functions by configuration #30686

Closed
wants to merge 1 commit into from

southernriver
Contributor

@southernriver southernriver commented Dec 9, 2020

What changes were proposed in this pull request?

The Hive and Spark SQL engines differ in many built-in functions. Several of the differences are shown below:

| Built-in function | SQL | Result of Hive | Result of Spark SQL |
| --- | --- | --- | --- |
| unix_timestamp | select unix_timestamp(concat('2020-06-01', ' 24:00:00')); | 1591027200 | NULL |
| to_date | select to_date('0000-00-00'); | 0002-11-30 | NULL |
| datediff | select datediff(CURRENT_DATE, '0000-00-00'); | 737986 | NULL |
| collect_set | select c1, c2, concat_ws('##', collect_set(c3)) c3_set from bigdata_offline.test_collect_set group by c1, c2; (bigdata_offline.test_collect_set contains the rows (1, 1, '1'), (1, 1, '2'), (1, 1, '3'), (1, 1, '4'), (1, 1, '5')) | c1 = 1, c2 = 1, c3_set = 2##3##4##5##1 | c1 = 1, c2 = 1, c3_set = 3##1##2##5##4 |

Note: the Hive version is 1.2.1.
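
The collect_set row differs only in the order of the collected elements, and a set carries no guaranteed order in either engine. As a minimal sketch (the helper `normalize` is hypothetical and exists only for this illustration), the two strings from the table can be compared after sorting their elements:

```python
from typing import List

# Values copied from the collect_set row of the table above.
hive_result = "2##3##4##5##1"
spark_result = "3##1##2##5##4"


def normalize(joined: str, sep: str = "##") -> List[str]:
    """Split a concat_ws-style string and sort its elements."""
    return sorted(joined.split(sep))


# The raw strings differ, but they hold the same set of elements.
print(hive_result == spark_result)
print(normalize(hive_result) == normalize(spark_result))
```

In SQL, wrapping the aggregate as `sort_array(collect_set(c3))` is one way to get a stable order on either engine, assuming a sorted result is acceptable for the query.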

Why are the changes needed?

There is no conclusion on which engine is more accurate. Users would prefer to be able to choose according to their real production environment.

I think we should make some improvements here.

Does this PR introduce any user-facing change?

No

How was this patch tested?

manual

@AmplabJenkins

Can one of the admins verify this patch?

Member

@wangyum wangyum left a comment

There are too many differences, for example:

hive> select timestamp '2020-06-01 24:00:00';
OK
2020-06-02 00:00:00
Time taken: 0.034 seconds, Fetched: 1 row(s)

spark-sql> select to_timestamp('2020-06-01 24:00:00');
NULL

Could we add these differences to http://spark.apache.org/docs/latest/sql-migration-guide.html#compatibility-with-apache-hive?
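
The two outputs above look like lenient versus strict handling of an out-of-range hour. The following Python sketch only mimics the two observed behaviours; it is not how either engine is implemented, and the function names `parse_strict` and `parse_lenient` are hypothetical. The UTC+8 offset in the last line is an assumption, chosen because it reproduces the 1591027200 value reported for Hive in the PR description.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

PATTERN = re.compile(r"(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})$")


def parse_strict(text: str) -> Optional[datetime]:
    """Reject out-of-range fields, analogous to the NULL seen in spark-sql."""
    try:
        return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None


def parse_lenient(text: str) -> Optional[datetime]:
    """Roll overflowing time fields forward, analogous to the Hive output."""
    match = PATTERN.match(text)
    if match is None:
        return None
    year, month, day, hour, minute, second = map(int, match.groups())
    try:
        base = datetime(year, month, day)
    except ValueError:
        return None  # this sketch does not roll over invalid dates
    return base + timedelta(hours=hour, minutes=minute, seconds=second)


text = "2020-06-01 24:00:00"
print(parse_strict(text))   # None
print(parse_lenient(text))  # 2020-06-02 00:00:00

# Assumption: the Hive session ran in UTC+8.
utc_plus_8 = timezone(timedelta(hours=8))
print(int(parse_lenient(text).replace(tzinfo=utc_plus_8).timestamp()))  # 1591027200
```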

@github-actions github-actions bot added the SQL label Dec 9, 2020
@sqlwindspeaker

> There are too many differences, for example:
>
> hive> select timestamp '2020-06-01 24:00:00';
> OK
> 2020-06-02 00:00:00
> Time taken: 0.034 seconds, Fetched: 1 row(s)
>
> spark-sql> select to_timestamp('2020-06-01 24:00:00');
> NULL
>
> Could we add these differences to http://spark.apache.org/docs/latest/sql-migration-guide.html#compatibility-with-apache-hive?

Which version of Hive is this migration guide based on?

I found some inconsistencies that are not listed in the doc, but my findings are based on Hive 1.x, so I guess this doc is based on Hive 2.x?

Spark 2.x is prebuilt with Hive 1.x, which means Hive 1.x is still widely used, so I think we could note these inconsistencies in the doc?

@HyukjinKwon
Member

I believe we should just document the differences comprehensively. If some behaviours are specific to a particular Hive version, we had better document those as well.

Member

@dongjoon-hyun dongjoon-hyun left a comment

Sorry, but I'm -1 for this approach for two reasons.

  1. The Hive/Spark SQL differences have existed for a long time. Users had better be aware of them.
  2. This will block us from upgrading the internal Hive execution library because of incompatibilities across Hive versions. For example, this may imply that we cannot upgrade the Hive execution library from 2.4 to 3.1 in the future without a breaking change.

Instead of this approach, users had better register the required Hive UDFs explicitly on the user side.

cc @cloud-fan and @gatorsmile
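
Registering the Hive UDFs explicitly, as suggested above, could look roughly like the sketch below. It only builds the `CREATE TEMPORARY FUNCTION` statements. The aliases `hive_unix_timestamp` and `hive_datediff` and the helper `registration_statements` are hypothetical names, and the class names are the Hive generic UDF classes as I understand them, so they should be checked against the Hive version actually in use.

```python
from typing import Dict, List

# Hypothetical alias -> Hive UDF class name (assumed; verify against your Hive jars).
HIVE_UDFS: Dict[str, str] = {
    "hive_unix_timestamp": "org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp",
    "hive_datediff": "org.apache.hadoop.hive.ql.udf.generic.GenericUDFDateDiff",
}


def registration_statements(udfs: Dict[str, str]) -> List[str]:
    """Build one CREATE TEMPORARY FUNCTION statement per alias."""
    return [
        f"CREATE TEMPORARY FUNCTION {alias} AS '{class_name}'"
        for alias, class_name in udfs.items()
    ]


for statement in registration_statements(HIVE_UDFS):
    print(statement)
```

Each statement could then be passed to `spark.sql(...)` on a session created with `enableHiveSupport()`. Whether the registered functions reproduce the Hive 1.2.1 results from the PR description depends on the Hive library bundled with Spark, which this sketch does not verify.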

@HyukjinKwon
Member

Yeah, I don't support this change either, I guess @wangyum too. Maybe we should just document the diff as guided by @wangyum.

@cloud-fan
Contributor

Yea, a better approach would be to switch the catalog to a Hive-compatible implementation, but that requires that we add the FunctionCatalog API first.

@dongjoon-hyun
Member

dongjoon-hyun commented Dec 11, 2020

According to the above advice, I'll close this (AS-IS) PR.
@southernriver, you can reopen this if you change the approach, or you can open a new PR.
Thank you for making a PR and sorry for the decision on this PR at this time.
