
[SPARK-48236][BUILD] Add commons-lang:commons-lang:2.6 back to support legacy Hive UDF jars #46528


Closed
dongjoon-hyun wants to merge 2 commits

@dongjoon-hyun (Member) commented May 10, 2024

What changes were proposed in this pull request?

This PR adds commons-lang:commons-lang:2.6 back to support legacy Hive UDF jars. This is a partial revert of SPARK-47018.

Why are the changes needed?

Recently, we dropped commons-lang:commons-lang during the Hive upgrade.

However, only Apache Hive 2.3.10 and 4.0.0 dropped it. In other words, Hive 2.0.0 ~ 2.3.9 and Hive 3.0.0 ~ 3.1.3 still require it. As a result, existing UDF jars built against those versions may still require commons-lang:commons-lang.

For example, Apache Hive 3.1.3 GenericUDFTrim.java code:

import org.apache.commons.lang.StringUtils;
return StringUtils.strip(val, " ");
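One way to confirm which `StringUtils` packages are visible at runtime is to probe the classloader directly. The following is a minimal sketch, not part of this PR: the class name `CommonsLangProbe` and the method `isLoadable` are hypothetical, and it relies only on the standard `Class.forName` API.

```java
public class CommonsLangProbe {

    /**
     * Returns true if the given loader can locate the named class.
     * The class is not initialized, so no static initializers run.
     */
    public static boolean isLoadable(String className, ClassLoader loader) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Either the class itself or something it links against is missing.
            return false;
        }
    }

    public static void main(String[] args) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        // Package used by commons-lang 2.x, which legacy Hive UDF jars import.
        String legacy = "org.apache.commons.lang.StringUtils";
        // Package used by commons-lang3, a separate artifact with a different package name.
        String current = "org.apache.commons.lang3.StringUtils";
        System.out.println(legacy + " loadable: " + isLoadable(legacy, loader));
        System.out.println(current + " loadable: " + isLoadable(current, loader));
    }
}
```

On a classpath that lacks commons-lang 2.6, the first probe would be expected to report `false`, which matches the failure described here. Because the two libraries use different package names, having commons-lang3 on the classpath does not satisfy an import of `org.apache.commons.lang.StringUtils`.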

As a result, Maven CIs are broken.

The root cause is that the existing test UDF jar hive-test-udfs.jar was built against old Hive libraries (before 2.3.10), which require commons-lang:commons-lang:2.6.

HiveUDFDynamicLoadSuite:
- Spark should be able to run Hive UDF using jar regardless of current thread context classloader (UDF
20:21:25.129 WARN org.apache.spark.SparkContext: The JAR file:///home/runner/work/spark/spark/sql/hive/src/test/noclasspath/hive-test-udfs.jar at spark://localhost:33327/jars/hive-test-udfs.jar has been added already. Overwriting of added jar is not supported in the current version.

*** RUN ABORTED ***
A needed class was not found. This could be due to an error in your runpath. Missing class: org/apache/commons/lang/StringUtils
  java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils
  at org.apache.hadoop.hive.contrib.udf.example.GenericUDFTrim2.performOp(GenericUDFTrim2.java:43)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBaseTrim.evaluate(GenericUDFBaseTrim.java:75)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:170)
  at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118)
  at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132)
  at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeHiveFunctionExpression(HiveSessionStateBuilder.scala:184)
  at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.$anonfun$makeExpression$1(HiveSessionStateBuilder.scala:164)
  at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:185)
  ...
  Cause: java.lang.ClassNotFoundException: org.apache.commons.lang.StringUtils
  at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:593)
  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:526)
  at org.apache.hadoop.hive.contrib.udf.example.GenericUDFTrim2.performOp(GenericUDFTrim2.java:43)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBaseTrim.evaluate(GenericUDFBaseTrim.java:75)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:170)
  at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118)
  at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132)
  ...
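The trace above reports the missing class twice: `NoClassDefFoundError` uses the slash-separated internal name, and the underlying `ClassNotFoundException` uses the dotted name. A small helper can walk the cause chain and normalize the name when triaging such failures. This is a sketch under assumptions: `MissingClassDiagnostics` and `missingClassName` are hypothetical names, and the helper assumes the exception message is the class name, which holds for the trace shown but not for every `NoClassDefFoundError` message.

```java
import java.util.Optional;

public class MissingClassDiagnostics {

    /**
     * Walks the cause chain and returns the first missing class name found,
     * converted to dotted form. Assumes the message is the bare class name.
     */
    public static Optional<String> missingClassName(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            boolean classLoadingFailure =
                cur instanceof NoClassDefFoundError || cur instanceof ClassNotFoundException;
            if (classLoadingFailure && cur.getMessage() != null) {
                // NoClassDefFoundError reports "org/apache/..."; normalize to "org.apache...".
                return Optional.of(cur.getMessage().replace('/', '.'));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Rebuild the shape of the failure from the trace: an error wrapping its cause.
        NoClassDefFoundError error =
            new NoClassDefFoundError("org/apache/commons/lang/StringUtils");
        error.initCause(new ClassNotFoundException("org.apache.commons.lang.StringUtils"));
        System.out.println(missingClassName(error).orElse("no class-loading failure found"));
    }
}
```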

Does this PR introduce any user-facing change?

This is to support existing customer UDF jars.

How was this patch tested?

Manually.

$ build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.hive.HiveUDFDynamicLoadSuite test
...
HiveUDFDynamicLoadSuite:
14:21:56.034 WARN org.apache.hadoop.hive.metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0

14:21:56.035 WARN org.apache.hadoop.hive.metastore.ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore dongjoon@127.0.0.1

14:21:56.041 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException

- Spark should be able to run Hive UDF using jar regardless of current thread context classloader (UDF
14:21:57.576 WARN org.apache.spark.SparkContext: The JAR file:///Users/dongjoon/APACHE/spark-merge/sql/hive/src/test/noclasspath/hive-test-udfs.jar at spark://localhost:55526/jars/hive-test-udfs.jar has been added already. Overwriting of added jar is not supported in the current version.

- Spark should be able to run Hive UDF using jar regardless of current thread context classloader (GENERIC_UDF
14:21:58.314 WARN org.apache.spark.SparkContext: The JAR file:///Users/dongjoon/APACHE/spark-merge/sql/hive/src/test/noclasspath/hive-test-udfs.jar at spark://localhost:55526/jars/hive-test-udfs.jar has been added already. Overwriting of added jar is not supported in the current version.

- Spark should be able to run Hive UDF using jar regardless of current thread context classloader (GENERIC_UDAF
14:21:58.943 WARN org.apache.spark.SparkContext: The JAR file:///Users/dongjoon/APACHE/spark-merge/sql/hive/src/test/noclasspath/hive-test-udfs.jar at spark://localhost:55526/jars/hive-test-udfs.jar has been added already. Overwriting of added jar is not supported in the current version.

- Spark should be able to run Hive UDF using jar regardless of current thread context classloader (UDAF
14:21:59.333 WARN org.apache.hadoop.hive.ql.session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.

14:21:59.364 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist

14:21:59.370 WARN org.apache.hadoop.hive.metastore.HiveMetaStore: Location: file:/Users/dongjoon/APACHE/spark-merge/sql/hive/target/tmp/warehouse-49291492-9d48-4360-a354-ace73a2c76ce/src specified for non-external table:src

14:21:59.718 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database global_temp, returning NoSuchObjectException

14:21:59.770 WARN org.apache.spark.SparkContext: The JAR file:///Users/dongjoon/APACHE/spark-merge/sql/hive/src/test/noclasspath/hive-test-udfs.jar at spark://localhost:55526/jars/hive-test-udfs.jar has been added already. Overwriting of added jar is not supported in the current version.

- Spark should be able to run Hive UDF using jar regardless of current thread context classloader (GENERIC_UDTF
14:22:00.403 WARN org.apache.hadoop.hive.common.FileUtils: File file:/Users/dongjoon/APACHE/spark-merge/sql/hive/target/tmp/warehouse-49291492-9d48-4360-a354-ace73a2c76ce/src does not exist; Force to delete it.

14:22:00.404 ERROR org.apache.hadoop.hive.common.FileUtils: Failed to delete file:/Users/dongjoon/APACHE/spark-merge/sql/hive/target/tmp/warehouse-49291492-9d48-4360-a354-ace73a2c76ce/src

14:22:00.441 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist

14:22:00.453 WARN org.apache.hadoop.hive.ql.session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.

14:22:00.537 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist

Run completed in 8 seconds, 612 milliseconds.
Total number of tests run: 5
Suites: completed 2, aborted 0
Tests: succeeded 5, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
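The test names refer to the thread context classloader, and the earlier stack trace passes through `org.apache.spark.util.Utils$.withContextClassLoader`. The general pattern behind such a helper is to swap the context classloader for the duration of a call and restore it afterwards. The sketch below illustrates that pattern only; it is not Spark's implementation, and `ContextClassLoaderSketch` and its method signature are assumptions made for illustration.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.function.Supplier;

public class ContextClassLoaderSketch {

    /**
     * Runs the body with the given loader installed as the thread context
     * classloader, then restores the previous loader even if the body throws.
     */
    public static <T> T withContextClassLoader(ClassLoader loader, Supplier<T> body) {
        Thread thread = Thread.currentThread();
        ClassLoader original = thread.getContextClassLoader();
        thread.setContextClassLoader(loader);
        try {
            return body.get();
        } finally {
            thread.setContextClassLoader(original);
        }
    }

    public static void main(String[] args) {
        ClassLoader before = Thread.currentThread().getContextClassLoader();
        // An empty loader with no parent stands in for a loader that holds the UDF jar.
        ClassLoader isolated = new URLClassLoader(new URL[0], null);
        ClassLoader seen = withContextClassLoader(
            isolated, () -> Thread.currentThread().getContextClassLoader());
        System.out.println("loader swapped inside body: " + (seen == isolated));
        System.out.println("loader restored after body: "
            + (Thread.currentThread().getContextClassLoader() == before));
    }
}
```

Restoring the loader in `finally` matters because a UDF that fails to load, as in the aborted run, would otherwise leave the thread with the wrong context classloader.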

Was this patch authored or co-authored using generative AI tooling?

@dongjoon-hyun
Member Author

WDYT? cc @pan3793, @sunchao, @LuciferYang, @yaooqinn, @viirya

pom.xml Outdated
Comment on lines 195 to 196
<!-- To support Hive UDF jars built by Hive 2.3.9 and below -->
<commons-lang2.version>2.6</commons-lang2.version>
Member

👍

Member Author

Thank you. I updated the comment.

@dongjoon-hyun
Member Author

Thank you, @viirya!

@dongjoon-hyun
Member Author

Merged to master for Apache Spark 4.0.0.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-48236 branch May 10, 2024 22:48
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
…ort legacy Hive UDF jars

### What changes were proposed in this pull request?

This PR adds `commons-lang:commons-lang:2.6` back to support legacy Hive UDF jars. This is a partial revert of SPARK-47018.

### Why are the changes needed?

Recently, we dropped `commons-lang:commons-lang` during the Hive upgrade.
- apache#46468

However, only Apache Hive 2.3.10 and 4.0.0 dropped it. In other words, Hive 2.0.0 ~ 2.3.9 and Hive 3.0.0 ~ 3.1.3 still require it. As a result, existing UDF jars built against those versions may still require `commons-lang:commons-lang`.

- apache/hive#4892

For example, Apache Hive 3.1.3 code:
- https://github.com/apache/hive/blob/af7059e2bdc8b18af42e0b7f7163b923a0bfd424/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFTrim.java#L21
```
import org.apache.commons.lang.StringUtils;
```

- https://github.com/apache/hive/blob/af7059e2bdc8b18af42e0b7f7163b923a0bfd424/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFTrim.java#L42
```
return StringUtils.strip(val, " ");
```

As a result, Maven CIs are broken.
- https://github.com/apache/spark/actions/runs/9032639456/job/24825599546 (Maven / Java 17)
- https://github.com/apache/spark/actions/runs/9033374547/job/24835284769 (Maven / Java 21)

The root cause is that the existing test UDF jar `hive-test-udfs.jar` was built against old Hive libraries (before 2.3.10), which require `commons-lang:commons-lang:2.6`.

### Does this PR introduce _any_ user-facing change?

This is to support existing customer UDF jars.

### How was this patch tested?

Manually.


### Was this patch authored or co-authored using generative AI tooling?

Closes apache#46528 from dongjoon-hyun/SPARK-48236.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>