[SPARK-33376][SQL] Remove the option of "sharesHadoopClasses" in Hive IsolatedClientLoader #30284
Kubernetes integration test starting
Kubernetes integration test status success
Test build #130740 has finished for PR 30284 at commit
looks fine to me
@@ -75,7 +72,6 @@ private[hive] object IsolatedClientLoader extends Logging {
    "again. Hadoop classes will not be shared between Spark and Hive metastore client. " +
Shall we update this message, since we now always share? Actually, we should think about throwing an exception that says all related metastore jars have to be specified instead.
Forgot about this place - will fix. Yes, I agree that an exception might be more appropriate. It seems a bit arbitrary to fall back to 2.7.4 here, since the original Hadoop version could vary based on the Hadoop profile (either 2.7.4 or 3.2.0 at the moment).
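To make the alternative concrete, here is a minimal Scala sketch of the fail-fast behavior being discussed; the helper names `downloadHadoopClient` and `resolveMetastoreJars`, the stubbed failure, and the message text are all illustrative assumptions, not the actual `IsolatedClientLoader` code:

```scala
import scala.util.{Failure, Success, Try}

object FailFastSketch {
  // Hypothetical stand-in for the Ivy/Maven download step; it always fails
  // here so that the error path below is exercised.
  def downloadHadoopClient(hadoopVersion: String): Try[Seq[String]] =
    Try(sys.error(s"hadoop-client:$hadoopVersion could not be resolved"))

  def resolveMetastoreJars(hadoopVersion: String): Seq[String] =
    downloadHadoopClient(hadoopVersion) match {
      case Success(jars) => jars
      case Failure(e) =>
        // Fail fast instead of silently falling back to a hardcoded 2.7.4:
        // surface the problem and ask users to provide the jars themselves.
        throw new IllegalArgumentException(
          s"Failed to download Hadoop $hadoopVersion for the Hive metastore client. " +
            "Please specify all related metastore jars explicitly instead.",
          e)
    }
}
```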
@sunchao, I think the PR description explains why the current
@HyukjinKwon I'm not sure if this can solve the problem. The fundamental issue is that we are trying to mix a single Hadoop version (i.e., the one used by Spark, for To truly solve the issue and make
Okay. I am okay with this except for one comment: https://github.com/apache/spark/pull/30284/files#r519663083
Thanks, addressed your comment @HyukjinKwon. I agree that we should think about not using the fallback version and instead throwing an exception if the specified Hadoop version cannot be downloaded. @cloud-fan let us know what you think.
Kubernetes integration test starting
Kubernetes integration test status success
Test build #130819 has finished for PR 30284 at commit
retest this please
Kubernetes integration test starting
Kubernetes integration test status success
Test build #130837 has finished for PR 30284 at commit
retest this please
Test build #130851 has finished for PR 30284 at commit
Kubernetes integration test starting
Kubernetes integration test status success
retest this please
Kubernetes integration test starting
Kubernetes integration test status success
Test build #130866 has finished for PR 30284 at commit
thanks, merging to master!
I see this got merged, but I think the last discussion indicated that we would throw an exception if the requested Hadoop version could not be downloaded, instead of falling back to an arbitrarily chosen 2.7.4 version. @sunchao should we address this in a follow-on?
I'm concerned about the comment
If we can't precisely know whether the requested Hadoop version failed to download, throwing an exception seems risky.
While I do agree that string-matching on
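For context, a minimal sketch of the kind of message-based detection being debated here; the matched substring and both helper names are illustrative assumptions rather than the actual Spark code:

```scala
import scala.util.control.NonFatal

object FallbackSketch {
  // Fragile by construction: the check depends on the exact wording of an
  // exception raised deep inside the dependency-download machinery.
  def looksLikeDownloadFailure(e: Throwable): Boolean =
    e.getMessage != null && e.getMessage.contains("download failed")

  def loadWithFallback(requested: String, fallback: String)(
      download: String => Seq[String]): Seq[String] =
    try download(requested)
    catch {
      case NonFatal(e) if looksLikeDownloadFailure(e) =>
        // Current behavior under discussion: quietly retry with the fallback
        // Hadoop version rather than failing.
        download(fallback)
    }
}
```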
So we want to change from "fall back to Hadoop 2.7, which may or may not work" to "don't fall back, just fail". It's risky, as it may break existing workloads, but if it's mainly used in tests, then failing is better.
I tend to agree with @xkrogen that fail-fast is better in this case, and of course we'd have to document this in the migration guide so Spark users will be informed properly. This is not only used by UTs: Spark users can set `spark.sql.hive.metastore.jars` to `maven` as well.
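For reference, this is the user scenario in question: downloading the metastore client jars from Maven at runtime. The two configuration keys are the real Spark settings discussed in this PR; the builder boilerplate and the version value are only an illustrative sketch:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("metastore-jars-from-maven")
  // Target Hive metastore client version (illustrative value)...
  .config("spark.sql.hive.metastore.version", "2.3.7")
  // ...and download its jars (plus hadoop-client) from Maven at runtime.
  .config("spark.sql.hive.metastore.jars", "maven")
  .enableHiveSupport()
  .getOrCreate()
```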
### What changes were proposed in this pull request?

This removes the `sharesHadoopClasses` flag from `IsolatedClientLoader` in the Hive module.

### Why are the changes needed?
Currently, when initializing `IsolatedClientLoader`, users can set the `sharesHadoopClasses` flag to decide whether the `HiveClient` created should share Hadoop classes with Spark itself or not. In the latter case, the client will only load Hadoop classes from the Hive dependencies.

There are two reasons to remove this:

1. This is only used when `spark.sql.hive.metastore.jars` is equal to "maven", which could be very rare.
2. Even when `sharesHadoopClasses` is false, Spark doesn't really use only Hadoop classes from the Hive jars: we also download the `hadoop-client` jar and put all the sub-module jars (e.g., `hadoop-common`, `hadoop-hdfs`) together with the Hive jars, and the Hadoop version used by `hadoop-client` is the same version used by Spark itself. As a result, we're mixing two versions of Hadoop jars in the classpath, which could potentially cause issues, especially considering that the default Hadoop version is already 3.2.0 while most Hive versions supported by `IsolatedClientLoader` still use Hadoop 2.x or even lower.

### Does this PR introduce any user-facing change?
This affects Spark users in one scenario: when `spark.sql.hive.metastore.jars` is set to `maven` AND the Hadoop version specified in the pom file cannot be downloaded. Currently the behavior is to fall back to not sharing Hadoop classes, but with this PR it will share Hadoop classes with Spark.

### How was this patch tested?
Existing UTs.