
[SPARK-33376][SQL] Remove the option of "sharesHadoopClasses" in Hive IsolatedClientLoader #30284


Closed
wants to merge 2 commits

Conversation

sunchao (Member) commented Nov 6, 2020

What changes were proposed in this pull request?

This removes the sharesHadoopClasses flag from IsolatedClientLoader in Hive module.

Why are the changes needed?

Currently, when initializing IsolatedClientLoader, users can set the sharesHadoopClasses flag to decide whether the HiveClient created should share Hadoop classes with Spark itself or not. In the latter case, the client will only load Hadoop classes from the Hive dependencies.

There are two reasons to remove this:

  1. This feature is currently used in only two cases: 1) unit tests, and 2) when the Hadoop version defined in Maven cannot be found while spark.sql.hive.metastore.jars is set to "maven", which should be very rare.
  2. When sharesHadoopClasses is false, Spark doesn't really use only the Hadoop classes from the Hive jars: we also download the hadoop-client jar and put all of its sub-module jars (e.g., hadoop-common, hadoop-hdfs) together with the Hive jars, and the Hadoop version used by hadoop-client is the same version used by Spark itself. As a result, we're mixing two versions of Hadoop jars on the classpath, which could potentially cause issues, especially considering that the default Hadoop version is already 3.2.0 while most Hive versions supported by the IsolatedClientLoader are still using Hadoop 2.x or even lower.
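The isolation mechanism under discussion is, at its core, a child-first classloader that decides per class name whether to delegate to Spark's own loader. A minimal sketch of that idea (not Spark's actual IsolatedClientLoader code; the class and parameter names here are made up for illustration):

```scala
import java.net.{URL, URLClassLoader}

// Child-first classloader sketch: classes the predicate marks as "shared"
// come from the parent (Spark's) loader; everything else is resolved from
// the downloaded Hive/Hadoop jars first. The sharesHadoopClasses flag
// effectively controlled whether "org.apache.hadoop..." classes fell into
// the shared bucket or the isolated one.
class IsolatedLoaderSketch(
    jars: Array[URL],
    parent: ClassLoader,
    isShared: String => Boolean)
  extends URLClassLoader(jars, null) {

  override protected def loadClass(name: String, resolve: Boolean): Class[_] =
    if (isShared(name)) parent.loadClass(name)   // delegate shared classes
    else super.loadClass(name, resolve)          // child-first for the rest
}
```

With sharesHadoopClasses = false, Hadoop class names are not in the shared bucket, so they resolve from whatever Hadoop jars were downloaded alongside the Hive jars, which is exactly where the version mixing described above comes from.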

Does this PR introduce any user-facing change?

This affects Spark users in one scenario: when spark.sql.hive.metastore.jars is set to "maven" AND the Hadoop version specified in the pom file cannot be downloaded. Currently the behavior is to switch to not sharing Hadoop classes; with this PR the client will always share Hadoop classes with Spark.
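For reference, the affected scenario corresponds to a configuration along these lines (a hedged sketch; the two config keys are real Spark settings, but the chosen metastore version is just an example):

```scala
import org.apache.spark.sql.SparkSession

// Ask Spark to download the Hive metastore client jars from Maven at
// startup; it is this download path whose failure-fallback behavior
// is changed by the PR.
val spark = SparkSession.builder()
  .enableHiveSupport()
  .config("spark.sql.hive.metastore.version", "2.3.7")
  .config("spark.sql.hive.metastore.jars", "maven")
  .getOrCreate()
```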

How was this patch tested?

Existing UTs.

@github-actions github-actions bot added the SQL label Nov 6, 2020
@sunchao sunchao changed the title Remove shares Hadoop classes [SPARK-33376][SQL]Remove the option of "sharesHadoopClasses" in Hive IsolatedClientLoader Nov 6, 2020
@sunchao sunchao changed the title [SPARK-33376][SQL]Remove the option of "sharesHadoopClasses" in Hive IsolatedClientLoader [SPARK-33376][SQL] Remove the option of "sharesHadoopClasses" in Hive IsolatedClientLoader Nov 6, 2020
@sunchao sunchao changed the title [SPARK-33376][SQL] Remove the option of "sharesHadoopClasses" in Hive IsolatedClientLoader [SPARK-33376][HIVE] Remove the option of "sharesHadoopClasses" in Hive IsolatedClientLoader Nov 6, 2020
SparkQA commented Nov 6, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35349/

SparkQA commented Nov 7, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35349/

SparkQA commented Nov 7, 2020

Test build #130740 has finished for PR 30284 at commit b735947.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

sunchao (Member, Author) commented Nov 7, 2020

cc @cloud-fan @gatorsmile @srowen

cloud-fan (Contributor) left a comment:

looks fine to me

@HyukjinKwon HyukjinKwon changed the title [SPARK-33376][HIVE] Remove the option of "sharesHadoopClasses" in Hive IsolatedClientLoader [SPARK-33376][SQL] Remove the option of "sharesHadoopClasses" in Hive IsolatedClientLoader Nov 9, 2020
@@ -75,7 +72,6 @@ private[hive] object IsolatedClientLoader extends Logging {
"again. Hadoop classes will not be shared between Spark and Hive metastore client. " +
Member commented:

Shall we update this message since we now always share? Actually we should think about throwing an exception that says all related metastore jars have to be specified instead.

sunchao (Member, Author) replied:

Forgot about this place; will fix. Yes, I agree that an exception might be more appropriate. It seems a bit arbitrary to fall back to 2.7.4 here, since the original Hadoop version could vary based on the Hadoop profile (either 2.7.4 or 3.2.0 at the moment).

HyukjinKwon (Member) commented:

@sunchao, I think the PR description explains why the current sharesHadoopClasses=false is only used in a corner case, and why it might not currently work, but I don't think that means we should drop it. For example, we can place the downloaded jars ahead to have a higher precedence in the classpath.

sunchao (Member, Author) commented Nov 9, 2020

For example, we can place the downloaded jars ahead to have a higher precedence in the classpath.

@HyukjinKwon I'm not sure this can solve the problem. The fundamental issue is that we are trying to mix a single Hadoop version (i.e., the one used by Spark, via the hadoop-client jar) with various others from the different supported Hive versions, so we are bound to load some classes from the former and some from the latter.

To truly solve the issue and make sharesHadoopClasses stick to its meaning, IMO we'd have to pick the matching Hadoop version used by the specific Hive version. For instance, if the Hive version is 2.3.7, we should use Hadoop 2.7.2, and if the Hive version is 2.2.0, we should pick Hadoop 2.6.0 instead. However, I'm not sure this is something we should do, given that some of the Hadoop versions used by Hive are really old, and this is a rarely used feature, as I mentioned in the PR description.
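The matching-version idea could be sketched as a lookup table keyed by Hive version. This is purely hypothetical: the map below contains only the two pairs mentioned in this comment and is not a real compatibility table.

```scala
// Pick the Hadoop version a given Hive release was built against, falling
// back to Spark's own Hadoop version when the Hive version is unknown.
// Illustrative only: just the two pairs mentioned in this thread.
val hadoopVersionForHive: Map[String, String] = Map(
  "2.3.7" -> "2.7.2",
  "2.2.0" -> "2.6.0"
)

def resolveHadoopVersion(hiveVersion: String, sparkHadoopVersion: String): String =
  hadoopVersionForHive.getOrElse(hiveVersion, sparkHadoopVersion)
```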

HyukjinKwon (Member) left a comment:

Okay. I am okay with this except one comment https://github.com/apache/spark/pull/30284/files#r519663083

sunchao (Member, Author) commented Nov 10, 2020

Thanks, addressed your comment @HyukjinKwon. I agree that we should think about not using the fallback version and instead throwing an exception if the specified Hadoop version cannot be downloaded. @cloud-fan let us know what you think.

SparkQA commented Nov 10, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35428/

SparkQA commented Nov 10, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35428/

SparkQA commented Nov 10, 2020

Test build #130819 has finished for PR 30284 at commit 0b035b2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan (Contributor) commented:

retest this please

SparkQA commented Nov 10, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35446/

SparkQA commented Nov 10, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35446/

SparkQA commented Nov 10, 2020

Test build #130837 has finished for PR 30284 at commit 0b035b2.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan (Contributor) commented:

retest this please

SparkQA commented Nov 10, 2020

Test build #130851 has finished for PR 30284 at commit 0b035b2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 10, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35459/

SparkQA commented Nov 10, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35459/

cloud-fan (Contributor) commented:

retest this please

SparkQA commented Nov 10, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35474/

SparkQA commented Nov 10, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35474/

SparkQA commented Nov 10, 2020

Test build #130866 has finished for PR 30284 at commit 0b035b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan (Contributor) commented:

thanks, merging to master!

@cloud-fan cloud-fan closed this in 3165ca7 Nov 10, 2020
xkrogen (Contributor) commented Nov 10, 2020

I see this got merged, but I think the last discussion indicated that we would throw an exception if the requested Hadoop version could not be downloaded, instead of falling back to an arbitrarily chosen 2.7.4. @sunchao should we address this in a follow-on?

cloud-fan (Contributor) commented:

I'm concerned about the comment

// If the error message contains hadoop, it is probably because the hadoop
// version cannot be resolved.

If we can't know precisely whether the requested Hadoop version failed to download, throwing an exception seems risky.

xkrogen (Contributor) commented Nov 10, 2020

While I do agree that string-matching on "hadoop" is a bit sketchy, I don't think I agree with your conclusion about risk: currently we only catch the RuntimeException if its message contains "hadoop", and otherwise we let the exception bubble up. Throwing an exception at line 74 instead of calling downloadVersion again just aligns the behavior of "missing Hadoop JARs" with the behavior of a non-Hadoop exception. The current logic is riskier in my opinion, since it brings unpredictable behavior, and given that this feature is mainly used in unit tests, it adds a degree of non-determinism to those tests.
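The two behaviors being debated can be contrasted in a small sketch (simplified and hypothetical; `download` stands in for the real jar resolver and all names here are made up):

```scala
// Current behavior vs. proposed behavior when resolving metastore jars.
// If the resolver fails with a message mentioning "hadoop", the current
// code retries with a hard-coded fallback version; the proposal is to
// rethrow instead (fallback = None), so the failure surfaces immediately
// and behaves the same as any non-Hadoop resolution error.
def resolveJars(
    download: String => Seq[String],
    hadoopVersion: String,
    fallback: Option[String]): Seq[String] =
  try download(hadoopVersion) catch {
    case e: RuntimeException
        if e.getMessage != null && e.getMessage.contains("hadoop") =>
      fallback.map(download).getOrElse(throw e)
  }
```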

cloud-fan (Contributor) commented:

So we want to change from "fall back to Hadoop 2.7, which may or may not work" to "don't fall back, just fail". It's risky, as it may break existing workloads, but if it's mainly used in tests, then failing is better.

sunchao (Member, Author) commented Nov 10, 2020

I tend to agree with @xkrogen that failing fast is better in this case, and of course we'd have to document this in the migration guide so Spark users are informed properly.

This is not only used by UTs: Spark users can set spark.sql.hive.metastore.jars to "maven", and if they happen to rely on a custom Hadoop version that is not downloadable, this code path will be triggered. I think this is a pretty rare case, though.
