[SPARK-33212][BUILD] Move to shaded clients for Hadoop 3.x profile #29843
Conversation
Great! Thanks, @sunchao.
Test build #128992 has finished for PR 29843 at commit
The Python/YARN tests are failing with:
which is weird. I also was not able to reproduce all the Python-related test failures locally.
@sunchao, I think it's because Jenkins uses Python 2 for ... Can you explicitly change the test to use ...? See spark/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala, lines 177 to 180 in be2eca2.
This fails in this PR specifically because the tests that failed here have ...
Ah, thanks @HyukjinKwon! Yeah, on my local laptop I set those two environment variables and the tests pass. Will make the change in the test suite.
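For reference, a minimal hedged sketch of the idea (not the exact YarnClusterSuite change): PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are the standard Spark environment variables for pinning the Python interpreter, and the assumption below is that these are what the test would set explicitly.

object PinPythonForYarnTests {
  // Hedged sketch: pin the interpreter used by the PySpark-on-YARN tests via
  // the standard environment variables (exact call site in the suite assumed).
  val pythonExec = "python3" // assumption: python3 is on PATH on the CI hosts
  val extraEnv: Map[String, String] = Map(
    "PYSPARK_PYTHON" -> pythonExec,
    "PYSPARK_DRIVER_PYTHON" -> pythonExec)
}

This map would then be passed as the extra environment of the launched test application.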
- have there been changes in the Hadoop APIs which broke Spark? If so: report a regression.
- consider including "and move to shaded hadoop-client" in the title
I should warn that the hadoop-cloud codebase isn't shaded. Long story, but "nobody has sat down to do it across everything" is probably the most accurate summary.
@@ -308,7 +307,7 @@ private[spark] class ApplicationMaster(
      // The client-mode AM doesn't listen for incoming connections, so report an invalid port.
      registerAM(Utils.localHostName, -1, sparkConf,
        sparkConf.getOption("spark.driver.appUIAddress"), appAttemptId)
-      addAmIpFilter(Some(driverRef), ProxyUriUtils.getPath(appAttemptId.getApplicationId))
+      addAmIpFilter(Some(driverRef), s"/proxy/$appAttemptId")
Is this a regression in the Hadoop APIs?
No, this is to remove the dependency on the Hadoop server side (ProxyUriUtils, which also calls a private API and is not good).
Expect a 3.2.2 before the end of the year. If there are regressions, that would be the place to get them addressed.
Thanks @steveloughran.
No, I haven't seen any API breakage yet.
Good suggestion. Done. Regarding the hadoop-cloud codebase, I think Spark already has its own ...
@@ -308,7 +307,7 @@ private[spark] class ApplicationMaster(
      // The client-mode AM doesn't listen for incoming connections, so report an invalid port.
      registerAM(Utils.localHostName, -1, sparkConf,
        sparkConf.getOption("spark.driver.appUIAddress"), appAttemptId)
-      addAmIpFilter(Some(driverRef), ProxyUriUtils.getPath(appAttemptId.getApplicationId))
+      addAmIpFilter(Some(driverRef), s"/proxy/$appAttemptId")
Why are we not using getApplicationId as before?
Oops, my bad. It should be appAttemptId.getApplicationId.
BTW, do we need the URI encoding that ProxyUriUtils.getPath does internally?
Yeah, we need that too. Good catch. I had it before but something got messed up during the rebase. Will fix.
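For illustration, a minimal sketch of the point above (names are illustrative, not the final patch): build the proxy base path from the application ID and URI-encode it, roughly the way ProxyUriUtils.getPath does internally, without depending on Hadoop's server-side classes.

import java.net.URLEncoder
import java.nio.charset.StandardCharsets

object ProxyPathSketch {
  // Hedged sketch: derive the AM IP filter's proxy base from the application
  // ID string, URI-encoding it roughly as ProxyUriUtils.getPath would.
  def proxyBase(appId: String): String =
    "/proxy/" + URLEncoder.encode(appId, StandardCharsets.UTF_8.name())

  def main(args: Array[String]): Unit =
    // The application ID value here is made up for the example.
    println(proxyBase("application_1600000000000_0001")) // => /proxy/application_1600000000000_0001
}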
Test build #129064 has finished for PR 29843 at commit
Test build #129071 has finished for PR 29843 at commit
Hmm, interesting. There is still one Python test in YARN failing with:
after I changed ...
@@ -118,11 +118,15 @@ private[hive] object IsolatedClientLoader extends Logging {
      hadoopVersion: String,
      ivyPath: Option[String],
      remoteRepos: String): Seq[URL] = {
+    val hadoopJarName = if (hadoopVersion.startsWith("3")) {
Is this valid for all Hadoop 3.x versions?
Yes, I believe so. These modules should be available in any production Hadoop 3.x release, I think. See https://issues.apache.org/jira/browse/HADOOP-11804; it was fixed in 3.0.0-alpha2.
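For context, a hedged sketch of the selection logic being discussed (names and coordinates below are illustrative, not the exact IsolatedClientLoader code): for Hadoop 3.x, resolve the shaded hadoop-client-api and hadoop-client-runtime artifacts; otherwise fall back to the classic hadoop-client.

object HadoopClientArtifactsSketch {
  // Hedged sketch: pick shaded client artifacts for Hadoop 3.x and the
  // unshaded hadoop-client otherwise. Coordinates are for illustration only.
  def hadoopClientArtifacts(hadoopVersion: String): Seq[String] =
    if (hadoopVersion.startsWith("3")) {
      Seq(
        s"org.apache.hadoop:hadoop-client-api:$hadoopVersion",
        s"org.apache.hadoop:hadoop-client-runtime:$hadoopVersion")
    } else {
      Seq(s"org.apache.hadoop:hadoop-client:$hadoopVersion")
    }

  def main(args: Array[String]): Unit = {
    println(hadoopClientArtifacts("3.2.0")) // shaded client artifacts
    println(hadoopClientArtifacts("2.7.4")) // classic hadoop-client
  }
}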
BTW, the other SparkQA run also failed with:
Force-pushed from 7bca429 to f6f19b5.
Test build #129088 has started for PR 29843 at commit
Kubernetes integration test starting
Test build #129092 has finished for PR 29843 at commit
Kubernetes integration test status: failure
Merged into master. Thanks all for reviewing!
Great! Thank you guys!
Great! Thank you all!
Sorry, I said that backwards: by default, if we include a shaded version of Hadoop, users won't pick up all the dependencies Hadoop pulls in by default. This is kind of the point, e.g. for Guava in Spark. But it also means that if users were relying on this behavior, it is now broken. Ideally users wouldn't rely on it, but many times they don't know or think about it. I guess my point is that it's worth documenting this, as it's a change in behavior.
+1 @tgravescs -- it's definitely common for users to implicitly depend on the transitive tree of Spark/Hadoop dependencies. This is a move in the right direction to prevent such behavior in the future, but it probably shouldn't go in any point releases, and it should be mentioned in the release notes.
Thanks @tgravescs and @xkrogen. Those are valid points. I'll document them in the release notes. And yes, this is going into 3.1.0, which is not a point release.
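To make the behavior change concrete, here is a hedged build.sbt sketch (artifacts and versions are placeholders, not a recommendation): a downstream project that previously relied on libraries arriving transitively through Spark's unshaded Hadoop client now has to declare them explicitly, because the shaded hadoop-client-runtime relocates Hadoop's third-party dependencies.

// Hypothetical downstream build.sbt; versions are placeholders.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.1.0" % "provided",
  // Previously this might have arrived transitively via the unshaded
  // hadoop-client pulled in by Spark; with the shaded client jars it has to
  // be declared explicitly if the application uses it directly.
  "org.apache.commons" % "commons-lang3" % "3.9"
)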
@sunchao instead of upgrading to 3.2.1, which has a known backward-compatibility issue, maybe we can explore upgrading Hadoop to 3.3 directly. The backward-compatibility issue is fixed in Hadoop 3.3, and Hadoop 3.3 has a lot of good stuff for cloud support. cc @steveloughran
+1 for Hadoop 3.3.0+
BTW, @sunchao is working on that.
To be ruthless about even my own code
It would probably be safest to build/release with 3.2.x but allow 3.3.x to be built against if someone really wanted it.
Thanks @steveloughran for sharing the info! Totally agree. Yes, my current plan is to wait until 3.2.2 comes out and then upgrade Spark to use that, and I hope the community will agree with that. People who want to use 3.3.x can still use the existing ...
You mean 3.3 here, right? Does 3.3.3 exist yet?
If that's the Hadoop community's recommendation, +1 for 3.2.2.
Sorry, I meant 3.3.1. If you are happy with a 3.3.0 release, then go for it. It has lots of shiny new features for s3a and abfs in particular. For Spark standalone that would really help people. As for Spark-in-cluster, those people who offer clusters-as-a-service can build their own Spark releases with the dependencies they choose, can't they?
Hi, all. Sadly, I noticed that this breaks ...
Thanks @dongjoon-hyun for reporting. Do you know why this was not revealed in the previous CI runs? I'll take a look to see if there is any workaround from the Spark side. I also wonder if @steveloughran has any suggestions on this, since you were involved in the Hadoop issue.
Although the K8s IT test suite has one test covering ...
In short, the regression is that ...
Thanks @dongjoon-hyun. Do you know an easy way to reproduce this? I'll see what we can do from the Spark side. On the other hand, Hadoop 3.2.2 is coming soon; let's see if we can quickly come up with a fix in the Hadoop community.
Thanks. Please try to build a distribution with ...
In the above example, I used ...
Since we need a working ...
Currently, we are exploring all options.
If someone wants to sit down and produce a shaded hadoop-cloud module which hides things downstream, I'm happy to let them. It's been one of those things I'd like to have, though I'd prefer even more: Java 9 modules. But the s3a and abfs modules do make heavy use of common code in hadoop-common (specifically o.a.h.fs.impl, soon o.a.h.fs.statistics.impl) and can't just treat hadoop-common as a third-party API. I think Java 9 modules is how we should really be going.
Thanks @steveloughran, yeah, I agree that the Java 9 modules feature looks promising (it was discussed some years back in HADOOP-11656, but now the timing should be more right). I can try to spend some time looking at this. A related question is whether we'd want to go further and apply it to other Hadoop modules as well. This will take a while, though. In the meantime, I'm wondering what we can do to ship this in the soon-coming Spark 3.1 release. One possible solution, maybe, is to still use the non-shaded client when the hadoop-cloud profile is used, e.g.:
diff --git a/pom.xml b/pom.xml
index 3ae2e7420e..12c36af557 100644
--- a/pom.xml
+++ b/pom.xml
@@ -3238,6 +3238,11 @@
<profile>
<id>hadoop-cloud</id>
+ <properties>
+ <hadoop-client-api.artifact>hadoop-client</hadoop-client-api.artifact>
+ <hadoop-client-runtime.artifact>hadoop-client</hadoop-client-runtime.artifact>
+ <hadoop-client-minicluster.artifact>hadoop-client</hadoop-client-minicluster.artifact>
+ </properties>
<modules>
<module>hadoop-cloud</module>
</modules>
Or maybe we could try to shade the ...
I created a WIP PR to demonstrate how this works. Feel free to comment on it.
Thank you, @sunchao!
What changes were proposed in this pull request?
This switches Spark to use shaded Hadoop clients, namely hadoop-client-api and hadoop-client-runtime, for Hadoop 3.x. For Hadoop 2.7, we'll still use the same modules such as hadoop-client.
In order to still keep the default Hadoop profile as hadoop-3.2, this defines the following Maven properties: hadoop-client-api.artifact, hadoop-client-runtime.artifact and hadoop-client-minicluster.artifact, which default to hadoop-client-api, hadoop-client-runtime and hadoop-client-minicluster respectively, but all switch to hadoop-client when the Hadoop profile is hadoop-2.7. A side effect of this is that we'll import the same dependency multiple times, so I had to disable the Maven enforcer rule banDuplicatePomDependencyVersions.
Besides the above, there are the following changes:
- Remove the usage of ProxyUriUtils.getPath from ApplicationMaster, which is a server-side/private API.
- Change IsolatedClientLoader to exclude hadoop-auth jars when the Hadoop version is 3.x. This change should only matter when we're not sharing Hadoop classes with Spark (which is mostly used in tests).
Why are the changes needed?
This serves two purposes:
Does this PR introduce any user-facing change?
When people use Spark with the hadoop-provided option, they should make sure the class path contains the hadoop-client-api and hadoop-client-runtime jars. In addition, they may need to make sure these jars appear before the other Hadoop jars in class path order; otherwise, classes may be loaded from the other, non-shaded Hadoop jars and cause potential conflicts.
How was this patch tested?
Relying on existing tests.
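As a quick way to sanity-check the jar ordering mentioned in the user-facing note above, a snippet along these lines (the class chosen is just an example; any Hadoop class works) prints which jar a Hadoop class is actually loaded from:

object WhichHadoopJar {
  // Hedged sketch: report the jar that provides a given Hadoop class, to
  // confirm the shaded client jars win over any unshaded Hadoop jars on a
  // hadoop-provided class path.
  def main(args: Array[String]): Unit = {
    val clazz = Class.forName("org.apache.hadoop.conf.Configuration")
    val location = Option(clazz.getProtectionDomain.getCodeSource).map(_.getLocation)
    println(s"org.apache.hadoop.conf.Configuration loaded from: ${location.getOrElse("(unknown)")}")
  }
}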