[SPARK-26422][R] Support to disable Hive support in SparkR even for Hadoop versions unsupported by Hive fork
## What changes were proposed in this pull request?
Currently, even if I explicitly disable Hive support in a SparkR session as below:
```r
sparkSession <- sparkR.session("local[4]", "SparkR", Sys.getenv("SPARK_HOME"),
                               enableHiveSupport = FALSE)
```
it produces the following error when the Hadoop version is not supported by our Hive fork:
```
java.lang.reflect.InvocationTargetException
...
Caused by: java.lang.IllegalArgumentException: Unrecognized Hadoop major version number: 3.1.1.3.1.0.0-78
at org.apache.hadoop.hive.shims.ShimLoader.getMajorVersion(ShimLoader.java:174)
at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:139)
at org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:100)
at org.apache.hadoop.hive.conf.HiveConf$ConfVars.<clinit>(HiveConf.java:368)
... 43 more
Error in handleErrors(returnStatus, conn) :
java.lang.ExceptionInInitializerError
at org.apache.hadoop.hive.conf.HiveConf.<clinit>(HiveConf.java:105)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:193)
at org.apache.spark.sql.SparkSession$.hiveClassesArePresent(SparkSession.scala:1116)
at org.apache.spark.sql.api.r.SQLUtils$.getOrCreateSparkSession(SQLUtils.scala:52)
at org.apache.spark.sql.api.r.SQLUtils.getOrCreateSparkSession(SQLUtils.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
```
The root cause is that `SparkSession.hiveClassesArePresent` checks whether a Hive class is loadable in order to detect whether Hive is on the classpath, but `org.apache.hadoop.hive.conf.HiveConf` performs a Hadoop version check in static initialization logic, which executes as soon as the class loads. That check throws an `IllegalArgumentException`, which is not caught:
https://github.com/apache/spark/blob/36edbac1c8337a4719f90e4abd58d38738b2e1fb/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L1113-L1121
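For illustration, here is a minimal Scala sketch (not Spark's actual code) of why such a loadability probe can fail this way: `Class.forName` with `initialize = true` runs static initializers, so a class whose static block throws surfaces an error that a handler written only for "class not found" does not catch. The `ClassProbe` object and `isLoadable` names are hypothetical.

```scala
// Minimal sketch, assuming a probe similar in spirit to hiveClassesArePresent.
// ClassProbe and isLoadable are hypothetical names for illustration only.
object ClassProbe {
  def isLoadable(className: String): Boolean = {
    try {
      // initialize = true means static initializers run here; anything they
      // throw (e.g. HiveConf's IllegalArgumentException wrapped in an
      // ExceptionInInitializerError) escapes this method uncaught.
      Class.forName(className, true, Thread.currentThread().getContextClassLoader)
      true
    } catch {
      case _: ClassNotFoundException | _: NoClassDefFoundError => false
      // ExceptionInInitializerError is deliberately NOT caught here,
      // mirroring the failure mode described above.
    }
  }
}
```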
So, currently, if users run a Hive-built Spark against a Hadoop version unsupported by our fork (namely Hadoop 3+), there is no way to use SparkR even though it could work.
This PR proposes to change the order of the boolean comparison so that we do not execute `SparkSession.hiveClassesArePresent` when:
1. `enableHiveSupport` is explicitly disabled
2. `spark.sql.catalogImplementation` is `in-memory`
so that, by short-circuiting, we **only** check `SparkSession.hiveClassesArePresent` when Hive support is explicitly enabled (see the sketch below).
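As a hedged illustration of the reordering (a sketch assuming the check lives in a condition like the one in `SQLUtils.getOrCreateSparkSession`; the variable names here are simplified stand-ins, not the exact source):

```scala
// Before (sketch): the classpath probe runs first, so HiveConf's static
// initializer can throw even when Hive support is disabled.
if (SparkSession.hiveClassesArePresent && enableHiveSupport &&
    catalogImplementation == "hive") {
  // build a Hive-enabled session
}

// After (sketch): the cheap, side-effect-free checks come first; && short-
// circuits, so hiveClassesArePresent is only evaluated when Hive support
// is explicitly requested and the catalog implementation is "hive".
if (enableHiveSupport && catalogImplementation == "hive" &&
    SparkSession.hiveClassesArePresent) {
  // build a Hive-enabled session
}
```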
## How was this patch tested?
It's difficult to write a test since we don't run tests against Hadoop 3 yet (see #21588). Manually tested.
Closes #23356 from HyukjinKwon/SPARK-26422.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 305e9b5)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>