
[SPARK-33376][SQL] Remove the option of "sharesHadoopClasses" in Hive IsolatedClientLoader #30284


Closed
wants to merge 2 commits

Conversation

sunchao (Member) commented Nov 6, 2020

What changes were proposed in this pull request?

This removes the sharesHadoopClasses flag from IsolatedClientLoader in Hive module.

Why are the changes needed?

Currently, when initializing IsolatedClientLoader, users can set the sharesHadoopClasses flag to decide whether the HiveClient created should share Hadoop classes with Spark itself or not. In the latter case, the client will only load Hadoop classes from the Hive dependencies.

There are two reasons to remove this:

  1. This feature is currently used in only two cases: 1) unit tests, and 2) when the Hadoop version defined in Maven cannot be found while spark.sql.hive.metastore.jars is set to "maven", which should be very rare.
  2. When sharesHadoopClasses is false, Spark doesn't really use only the Hadoop classes from the Hive jars: we also download the hadoop-client jar and put all of its sub-module jars (e.g., hadoop-common, hadoop-hdfs) together with the Hive jars, and the Hadoop version used by hadoop-client is the same version used by Spark itself. As a result, we're mixing two versions of Hadoop jars on the classpath, which could potentially cause issues, especially considering that the default Hadoop version is already 3.2.0 while most Hive versions supported by the IsolatedClientLoader are still using Hadoop 2.x or even lower.
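The isolation mechanism under discussion is, at its core, a child-first classloader that decides per class name whether to delegate to Spark's own loader. A minimal sketch of that idea (not Spark's actual IsolatedClientLoader code; the class and parameter names here are made up for illustration):

```scala
import java.net.{URL, URLClassLoader}

// Child-first classloader sketch: classes the predicate marks as "shared"
// come from the parent (Spark's) loader; everything else is resolved from
// the downloaded Hive/Hadoop jars first. The sharesHadoopClasses flag
// effectively controlled whether "org.apache.hadoop..." classes fell into
// the shared bucket or the isolated one.
class IsolatedLoaderSketch(
    jars: Array[URL],
    parent: ClassLoader,
    isShared: String => Boolean)
  extends URLClassLoader(jars, null) {

  override protected def loadClass(name: String, resolve: Boolean): Class[_] =
    if (isShared(name)) parent.loadClass(name)   // delegate shared classes
    else super.loadClass(name, resolve)          // child-first for the rest
}
```

With sharesHadoopClasses = false, Hadoop class names are not in the shared bucket, so they resolve from whatever Hadoop jars were downloaded alongside the Hive jars, which is exactly where the version mixing described above comes from.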

Does this PR introduce any user-facing change?

This affects Spark users in one scenario: when spark.sql.hive.metastore.jars is set to "maven" AND the Hadoop version specified in the pom file cannot be downloaded. Currently the behavior is to switch to not sharing Hadoop classes; with this PR the client will always share Hadoop classes with Spark.
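For reference, the affected scenario corresponds to a configuration along these lines (a hedged sketch; the two config keys are real Spark settings, but the chosen metastore version is just an example):

```scala
import org.apache.spark.sql.SparkSession

// Ask Spark to download the Hive metastore client jars from Maven at
// startup; it is this download path whose failure-fallback behavior
// is changed by the PR.
val spark = SparkSession.builder()
  .enableHiveSupport()
  .config("spark.sql.hive.metastore.version", "2.3.7")
  .config("spark.sql.hive.metastore.jars", "maven")
  .getOrCreate()
```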

How was this patch tested?

Existing UTs.

@github-actions github-actions bot added the SQL label Nov 6, 2020
@sunchao sunchao changed the title Remove shares Hadoop classes [SPARK-33376][SQL]Remove the option of "sharesHadoopClasses" in Hive IsolatedClientLoader Nov 6, 2020
@sunchao sunchao changed the title [SPARK-33376][SQL]Remove the option of "sharesHadoopClasses" in Hive IsolatedClientLoader [SPARK-33376][SQL] Remove the option of "sharesHadoopClasses" in Hive IsolatedClientLoader Nov 6, 2020
@sunchao sunchao changed the title [SPARK-33376][SQL] Remove the option of "sharesHadoopClasses" in Hive IsolatedClientLoader [SPARK-33376][HIVE] Remove the option of "sharesHadoopClasses" in Hive IsolatedClientLoader Nov 6, 2020
SparkQA commented Nov 6, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35349/

SparkQA commented Nov 7, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35349/

SparkQA commented Nov 7, 2020

Test build #130740 has finished for PR 30284 at commit b735947.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

sunchao (Member, Author) commented Nov 7, 2020

cc @cloud-fan @gatorsmile @srowen

cloud-fan (Contributor) left a comment:

looks fine to me

@HyukjinKwon HyukjinKwon changed the title [SPARK-33376][HIVE] Remove the option of "sharesHadoopClasses" in Hive IsolatedClientLoader [SPARK-33376][SQL] Remove the option of "sharesHadoopClasses" in Hive IsolatedClientLoader Nov 9, 2020
@@ -75,7 +72,6 @@ private[hive] object IsolatedClientLoader extends Logging {
"again. Hadoop classes will not be shared between Spark and Hive metastore client. " +
Member commented:

Shall we update this message since we now always share? Actually we should think about throwing an exception that says all related metastore jars have to be specified instead.

sunchao (Member, Author) replied:

Forgot about this place; will fix. Yes, I agree that an exception might be more appropriate. It seems a bit arbitrary to fall back to 2.7.4 here, since the original Hadoop version could vary based on the Hadoop profile (either 2.7.4 or 3.2.0 at the moment).

HyukjinKwon (Member) commented:

@sunchao, I think the PR description explains why the current sharesHadoopClasses=false is only used in a corner case, and why it might not currently work, but I don't think that means we should drop it. For example, we can place the downloaded jars ahead to have a higher precedence in the classpath.

sunchao (Member, Author) commented Nov 9, 2020

For example, we can place the downloaded jars ahead to have a higher precedence in the classpath.

@HyukjinKwon I'm not sure this can solve the problem. The fundamental issue is that we are trying to mix a single Hadoop version (i.e., the one used by Spark, via the hadoop-client jar) with various others from the different supported Hive versions, so we are bound to load some classes from the former and some from the latter.

To truly solve the issue and make sharesHadoopClasses stick to its meaning, IMO we'd have to pick the matching Hadoop version used by the specific Hive version. For instance, if the Hive version is 2.3.7, we should use Hadoop 2.7.2, and if the Hive version is 2.2.0, we should pick Hadoop 2.6.0 instead. However, I'm not sure this is something we should do, given that some of the Hadoop versions used by Hive are really old, and this is a rarely used feature, as I mentioned in the PR description.
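The matching-version idea could be sketched as a lookup table keyed by Hive version. This is purely hypothetical: the map below contains only the two pairs mentioned in this comment and is not a real compatibility table.

```scala
// Pick the Hadoop version a given Hive release was built against, falling
// back to Spark's own Hadoop version when the Hive version is unknown.
// Illustrative only: just the two pairs mentioned in this thread.
val hadoopVersionForHive: Map[String, String] = Map(
  "2.3.7" -> "2.7.2",
  "2.2.0" -> "2.6.0"
)

def resolveHadoopVersion(hiveVersion: String, sparkHadoopVersion: String): String =
  hadoopVersionForHive.getOrElse(hiveVersion, sparkHadoopVersion)
```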

HyukjinKwon (Member) left a comment:

Okay. I am okay with this except one comment https://github.com/apache/spark/pull/30284/files#r519663083

sunchao (Member, Author) commented Nov 10, 2020

Thanks, addressed your comment @HyukjinKwon. I agree that we should think about not using the fallback version and instead throwing an exception if the specified Hadoop version cannot be downloaded. @cloud-fan let us know what you think.

SparkQA commented Nov 10, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35428/

SparkQA commented Nov 10, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35428/

SparkQA commented Nov 10, 2020

Test build #130819 has finished for PR 30284 at commit 0b035b2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan (Contributor) commented:

retest this please

SparkQA commented Nov 10, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35446/

SparkQA commented Nov 10, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35446/

SparkQA commented Nov 10, 2020

Test build #130837 has finished for PR 30284 at commit 0b035b2.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan (Contributor) commented:

retest this please

SparkQA commented Nov 10, 2020

Test build #130851 has finished for PR 30284 at commit 0b035b2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 10, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35459/

SparkQA commented Nov 10, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35459/

cloud-fan (Contributor) commented:

retest this please

SparkQA commented Nov 10, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35474/

SparkQA commented Nov 10, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35474/

SparkQA commented Nov 10, 2020

Test build #130866 has finished for PR 30284 at commit 0b035b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan (Contributor) commented:

thanks, merging to master!

@cloud-fan cloud-fan closed this in 3165ca7 Nov 10, 2020
xkrogen (Contributor) commented Nov 10, 2020

I see this got merged, but I think the last discussion indicated that we would throw an exception if the requested Hadoop version could not be downloaded, instead of falling back to an arbitrarily chosen 2.7.4. @sunchao should we address this in a follow-on?

cloud-fan (Contributor) commented:

I'm concerned about the comment

// If the error message contains hadoop, it is probably because the hadoop
// version cannot be resolved.

If we can't know precisely whether the requested Hadoop version failed to download, throwing an exception seems risky.

xkrogen (Contributor) commented Nov 10, 2020

While I do agree that string-matching on "hadoop" is a bit sketchy, I don't think I agree with your conclusion about risk: currently we only catch the RuntimeException if its message contains "hadoop", and otherwise we let the exception bubble up. Throwing an exception at line 74 instead of calling downloadVersion again just aligns the behavior of "missing Hadoop JARs" with the behavior of a non-Hadoop exception. The current logic is riskier in my opinion, since it brings unpredictable behavior, and given that this feature is mainly used in unit tests, it adds a degree of non-determinism to those tests.
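The two behaviors being debated can be contrasted in a small sketch (simplified and hypothetical; `download` stands in for the real jar resolver and all names here are made up):

```scala
// Current behavior vs. proposed behavior when resolving metastore jars.
// If the resolver fails with a message mentioning "hadoop", the current
// code retries with a hard-coded fallback version; the proposal is to
// rethrow instead (fallback = None), so the failure surfaces immediately
// and behaves the same as any non-Hadoop resolution error.
def resolveJars(
    download: String => Seq[String],
    hadoopVersion: String,
    fallback: Option[String]): Seq[String] =
  try download(hadoopVersion) catch {
    case e: RuntimeException
        if e.getMessage != null && e.getMessage.contains("hadoop") =>
      fallback.map(download).getOrElse(throw e)
  }
```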

cloud-fan (Contributor) commented:

So we want to change from "fall back to Hadoop 2.7, which may or may not work" to "don't fall back, just fail". It's risky, as it may break existing workloads, but if it's mainly used in tests, then failing is better.

sunchao (Member, Author) commented Nov 10, 2020

I tend to agree with @xkrogen that failing fast is better in this case, and of course we'd have to document this in the migration guide so Spark users are informed properly.

This is not only used by UTs: Spark users can set spark.sql.hive.metastore.jars to "maven", and if they happen to rely on a custom Hadoop version that is not downloadable, this code path will be triggered. I think this is a pretty rare case, though.
