Skip to content

[SPARK-42051][SQL] Codegen Support for HiveGenericUDF #39555

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 9 commits into from

Conversation

yaooqinn
Copy link
Member

@yaooqinn yaooqinn commented Jan 13, 2023

What changes were proposed in this pull request?

As a subtask of SPARK-42050, this PR adds Codegen Support for HiveGenericUDF

Why are the changes needed?

improve codegen coverage and performance

Does this PR introduce any user-facing change?

no

How was this patch tested?

new UT added

@github-actions github-actions bot added the SQL label Jan 13, 2023
@yaooqinn
Copy link
Member Author

yaooqinn commented Jan 13, 2023

cc @cloud-fan @HyukjinKwon @dongjoon-hyun PTAL, thanks

@yaooqinn yaooqinn self-assigned this Jan 13, 2023
@@ -128,11 +129,10 @@ private[hive] class DeferredObjectAdapter(oi: ObjectInspector, dataType: DataTyp
override def get(): AnyRef = wrapper(func()).asInstanceOf[AnyRef]
}

private[hive] case class HiveGenericUDF(
case class HiveGenericUDF(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to rewrite HiveGenericUDF with Invoke? Then we can simply use RuntimeReplaceable

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems quite complicated to handle the hive value mapping and function wrapping

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @sunchao , too

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. The current approach also looks reasonable to me.

Could you review this once more when you have some time, @cloud-fan and @sunchao ?

@dongjoon-hyun
Copy link
Member

Also cc @LuciferYang , too

@yaooqinn
Copy link
Member Author

Belated Happy new year!
Comments addressed, thank you all.

@yaooqinn
Copy link
Member Author

yaooqinn commented Feb 1, 2023

Since the last commit is for adding a test, I will merge this. thanks @dongjoon-hyun @LuciferYang.

Merged to master

@yaooqinn yaooqinn closed this in 34fb408 Feb 1, 2023
@dongjoon-hyun
Copy link
Member

Thank you, @yaooqinn and @LuciferYang !

@LuciferYang
Copy link
Contributor

late LGTM

@@ -120,19 +121,18 @@ private[hive] class DeferredObjectAdapter(oi: ObjectInspector, dataType: DataTyp
extends DeferredObject with HiveInspectors {

private val wrapper = wrapperFor(oi, dataType)
private var func: () => Any = _
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yaooqinn @dongjoon-hyun This change removes deferred evaluation and means it is no longer possible to implement short-circuiting in Hive generic UDFs. I filed https://issues.apache.org/jira/browse/SPARK-44616 for this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During the upgrade from Spark 3.3.1 to 3.5.1, we encountered syntax issues with this pr. The problem arose from DeferredObject currently passing a value instead of a function, which prevented users from catching exceptions in GenericUDF, resulting in semantic differences.

Here is an example case we encountered. Originally, the semantics were that str_to_map_udf would throw an exception due to issues with the input string, while merge_map_udf could catch the exception and return a null value. However, currently, any exception encountered by str_to_map_udf will cause the program to fail.

select merge_map_udf(str_to_map_udf(col1), parse_map_udf(col2), map("key", "value")) from table

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yaooqinn is it easy to fix? If not we should probably revert it as this is not a critical perf improvement.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for being late,my network glitches a lot recently. and thanks for reporting this issue. It’s easy to make a fix

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yaooqinn, is this already underway? I tried this on local #47193

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@panbingkun thank you

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to fix it in #47268 in another way, @yaooqinn would you please take a look?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants