[SPARK-33212][BUILD] Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile #30701
Conversation
Great! Is `hadoop-aws` tested manually, @sunchao?
Test build #132548 has finished for PR 30701 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
Yes, manually tested it and it looks good.
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #132598 has started for PR 30701 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #132581 has finished for PR 30701 at commit
Force-pushed from ace75c8 to 17513d3
Test build #133805 has started for PR 30701 at commit
Hadoop 3.2.2-rc5 became 3.2.2 on Jan 9; does this PR have a chance to be merged into 3.1?
Sorry, but it's too late for Apache Spark 3.1.1, @pan3793.
BTW, @sunchao. Since Apache Hadoop 3.2.2 is out, shall we convert this from
@dongjoon-hyun yes, will do that soon.
…t-api to make hadoop-aws work" This reverts commit 290aa02.
Force-pushed from 17513d3 to bd1a69c
@dongjoon-hyun updated the title and description, and updated the PR; let's see how CI goes.
Thank you for updating, @sunchao!
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (outdated review thread, resolved)
Looking forward to this! Thanks for keeping it going.
<execution>
  <id>enforce-no-duplicate-dependencies</id>
  <goals>
    <goal>enforce</goal>
  </goals>
  <configuration>
    <rules>
      <banDuplicatePomDependencyVersions/>
    </rules>
  </configuration>
</execution>
Is there any way to provide specific exclusions for the enforcement? It's a shame to have to completely turn off the rule for this.
I'm not sure whether there are exclusion rules for this. I just tried this on my laptop and compilation for Hadoop 2.7 works, probably because there have been lots of changes since I initially removed this. Let me try it once again with Spark CI.
@@ -112,11 +112,24 @@ private[hive] object IsolatedClientLoader extends Logging {
      hadoopVersion: String,
      ivyPath: Option[String],
      remoteRepos: String): Seq[URL] = {
    val hadoopJarNames = if (hadoopVersion.startsWith("3")) {
Will this break if `hadoopVersion` is 3.1.0, 3.2.1, etc. (due to the previous issues with the shaded client JARs)?
Do you mean HADOOP-16080? Yes, things could still break in the following cases:

- Users build Spark without `-Phadoop-cloud` AND use a version that doesn't have the fix in HADOOP-16080, such as:
  $ bin/spark-shell --packages org.apache.hadoop:hadoop-aws:3.2.0,org.apache.hadoop:hadoop-common:3.2.0
  However, I think we should recommend that users stick to the same version used by Spark, i.e., 3.2.2.
- Users build Spark with a custom Hadoop version, such as the 3.1.0/3.2.1 you mentioned, via the `hadoop.version` property, and use it to talk to cloud storage like S3.

To enable these use cases we may have to introduce another Maven property to switch back to the non-shaded client, and update this code as well.
Yah I think (2) is my primary concern. We support building against a custom version using `-Dhadoop.version`, but right now it will break if you use `-Phadoop-3 -Dhadoop.version=3.1.0`. This one I believe you can at least work around by changing the `-Dhadoop-client-{runtime,api,minicluster}.artifact` properties, but here there's no way to work around it.
Got it. Yes we can be more precise here, such as using a version map. I can update this.
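For illustration, here is a minimal, self-contained sketch of that "version map" idea, assuming a 3.2.2-and-later cutoff for the shaded clients; the object and helper names are hypothetical and this is not the code that ended up in the follow-up:

```scala
// Hypothetical sketch: pick shaded vs. unshaded Hadoop client jars based on an
// explicit major/minor/patch check instead of hadoopVersion.startsWith("3").
object HadoopJarSelector {

  // Extracts (major, minor, patch) from strings like "3.2.2" or "3.2.2-SNAPSHOT".
  def majorMinorPatch(version: String): Option[(Int, Int, Int)] = {
    val Pattern = """(\d+)\.(\d+)\.(\d+)(.*)""".r
    version match {
      case Pattern(major, minor, patch, _) => Some((major.toInt, minor.toInt, patch.toInt))
      case _ => None
    }
  }

  // Assumption for this sketch: shaded clients are only used for 3.2.2 and later;
  // anything else falls back to the old unshaded hadoop-client.
  def hadoopJarNames(hadoopVersion: String): Seq[String] = majorMinorPatch(hadoopVersion) match {
    case Some((3, minor, patch)) if minor > 2 || (minor == 2 && patch >= 2) =>
      Seq("hadoop-client-api", "hadoop-client-runtime")
    case _ =>
      Seq("hadoop-client")
  }
}

// HadoopJarSelector.hadoopJarNames("3.2.2")  // Seq(hadoop-client-api, hadoop-client-runtime)
// HadoopJarSelector.hadoopJarNames("3.1.0")  // Seq(hadoop-client)
```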
// this introduced from lower version of Hive could conflict with jars in Hadoop 3.2+, so
// exclude here in favor of the ones in Hadoop 3.2+
Seq("org.apache.hadoop:hadoop-auth")
Doesn't Hive pull in other non-shaded Hadoop dependencies that could cause issues?
Actually this code is no longer required after #30284. Currently Spark will always load Hadoop classes from the built-in Hadoop version.
Right, I forgot about that one. Thanks for the clarification.
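As a rough illustration of the behavior being described (hypothetical names and shared-class list; not the actual `IsolatedClientLoader` implementation), an isolated classloader that always delegates Hadoop classes to Spark's own classloader would look something like this:

```scala
import java.net.{URL, URLClassLoader}

// Sketch only: "shared" classes (including all org.apache.hadoop classes) are
// resolved through Spark's classloader, so the built-in Hadoop version always wins;
// everything else is loaded from the isolated Hive client jars.
class IsolatedLoaderSketch(isolatedJars: Array[URL], sparkLoader: ClassLoader)
  extends URLClassLoader(isolatedJars, null) {

  private def isSharedClass(name: String): Boolean =
    name.startsWith("org.apache.hadoop.") ||  // always use Spark's built-in Hadoop
    name.startsWith("org.apache.spark.") ||
    name.startsWith("scala.") ||
    name.startsWith("java.")

  override def loadClass(name: String, resolve: Boolean): Class[_] =
    if (isSharedClass(name)) Class.forName(name, resolve, sparkLoader)
    else super.loadClass(name, resolve)
}
```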
Kubernetes integration test starting
Kubernetes integration test status success
Test build #134125 has finished for PR 30701 at commit
+1, LGTM. Thank you, @sunchao.
Thanks @dongjoon-hyun for merging! @xkrogen I'll address your comments in a follow-up PR.
Thanks @sunchao!
… and add more strict Hadoop version check

### What changes were proposed in this pull request?
1. Add back Maven enforcer for duplicate dependencies check
2. More strict check on Hadoop versions which support shaded client in `IsolatedClientLoader`. To do proper version check, this adds a util function `majorMinorPatchVersion` to extract major/minor/patch version from a string.
3. Cleanup unnecessary code

### Why are the changes needed?
The Maven enforcer was removed as part of #30556. This proposes to add it back. Also, Hadoop shaded client doesn't work in certain cases (see [these comments](#30701 (comment)) for details). This strictly checks that the current Hadoop version (i.e., 3.2.2 at the moment) has good support of shaded client or otherwise fallback to old unshaded ones.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #31203 from sunchao/SPARK-33212-followup.

Lead-authored-by: Chao Sun <sunchao@apple.com>
Co-authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…lient dependencies in root pom

### What changes were proposed in this pull request?
This PR is a followup of #30701. It uses properties of `hadoop-client-api.artifact`, `hadoop-client-runtime.artifact` and `hadoop-client-minicluster.artifact` explicitly to set the dependencies and versions. Otherwise, it is logically incorrect. For example, if you build with Hadoop 2, this dependency becomes `hadoop-client-api:2.7.4` internally, which does not exist in Hadoop 2 (https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client-api).

### Why are the changes needed?
- To fix the logical incorrectness.
- It fixes a potential issue: this actually caused an issue when `generate-sources` plugin is used together with Hadoop 2 by default, which attempts to pull 2.7.4 of `hadoop-client-api`, `hadoop-client-runtime` and `hadoop-client-minicluster` for whatever reason.

### Does this PR introduce _any_ user-facing change?
No for users and dev. It's more a cleanup.

### How was this patch tested?
Manually checked the dependencies are correctly placed.

Closes #31467 from HyukjinKwon/SPARK-33212.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@@ -66,7 +66,13 @@
     </dependency>
     <dependency>
       <groupId>org.apache.hadoop</groupId>
-      <artifactId>hadoop-client</artifactId>
+      <artifactId>${hadoop-client-api.artifact}</artifactId>
@sunchao @dongjoon-hyun I found some bad cases when using variables here:

- if we run `mvn dependency:tree -pl resource-managers/yarn -Phadoop-2.7 -Pyarn -am`, the dependency tree is correct
- if we run `mvn dependency:tree -pl resource-managers/yarn -Phadoop-2.7 -Pyarn`, the dependency tree is wrong: the Hadoop 3.3.1 dependencies exist in the tree even though we have added the -Phadoop-2.7 profile

So if we execute

mvn clean install -DskipTests -pl resource-managers/yarn -am -Phadoop-2.7 -Pyarn
mvn test -pl resource-managers/yarn -Phadoop-2.7 -Pyarn -DwildcardSuites=org.apache.spark.deploy.yarn.YarnClusterSuite

the test case `YarnClusterSuite` will fail as follows:
Discovery starting.
Discovery completed in 277 milliseconds.
Run starting. Expected test count is: 27
YarnClusterSuite:
*** RUN ABORTED ***
java.lang.NoClassDefFoundError: org/apache/hadoop/shaded/com/google/inject/servlet/ServletModule
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
...
Cause: java.lang.ClassNotFoundException: org.apache.hadoop.shaded.com.google.inject.servlet.ServletModule
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
...
We have to add the `-am` flag to test the `yarn` module separately to ensure that the dependency is correct:

mvn test -pl resource-managers/yarn -Phadoop-2.7 -am -Pyarn -DwildcardSuites=org.apache.spark.deploy.yarn.YarnClusterSuite -Dtest=none
Run completed in 5 minutes, 7 seconds.
Total number of tests run: 27
Suites: completed 2, aborted 0
Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
Interesting. Let me take a look. Thanks for reporting it, @LuciferYang!
This behavior looks expected to me... If you specify `-pl resource-managers/yarn` without the `-am` flag, then you're explicitly telling Maven not to rebuild the `core` module. If the `core` module has previously been built with Hadoop 3.3.1, then you will just use it as-is with the old dependencies. IMO, this is just an example of how `-pl` without `-am` needs to be used with care.
Hi, @LuciferYang. I agree with @xkrogen. When you build without `-am`, the other modules are the published SNAPSHOT poms and jars, which are built with the default (Hadoop 3) configuration.
Sorry, I forgot about this. @LuciferYang, just to recap: do you still see a bug here, or is the behavior expected?
@sunchao Yes, this problem still exists; only the behavior of branch-3.1 is as expected at present.
I tested these commands in 3.2-rc4 (3.2-rc5 can't build with hadoop-2.7 right now), and the problem still exists.
@LuciferYang could you check with the fix in #34100? I just tested it with the commands you pasted above:
mvn clean install -DskipTests -pl resource-managers/yarn -am -Phadoop-2.7 -Pyarn
mvn test -pl resource-managers/yarn -Phadoop-2.7 -Pyarn -DwildcardSuites=org.apache.spark.deploy.yarn.YarnClusterSuite
and the tests all passed for me.
@sunchao Yes, the behavior is expected now. Thanks!
### What changes were proposed in this pull request?
This PR aims to remove unused `commons-beanutils` dependency from `pom.xml` and `LICENSE-binary`.

### Why are the changes needed?
#30701 removed `commons-beanutils` from `hadoop-3` profile at Apache Spark 3.2.0.
- #30701

#40788 removed `hadoop-2` profile from Apache Spark 3.5.0.
- #40788

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #45705 from dongjoon-hyun/SPARK-47548.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?

This:

1. switches Spark to use the shaded Hadoop clients, namely `hadoop-client-api` and `hadoop-client-runtime`, for Hadoop 3.x.
2. upgrades the built-in version for Hadoop 3.x to Hadoop 3.2.2.

Note that for Hadoop 2.7, we'll still use the same modules such as `hadoop-client`.

In order to still keep the default Hadoop profile as hadoop-3.2, this defines the following Maven properties:

- `hadoop-client-api.artifact`
- `hadoop-client-runtime.artifact`
- `hadoop-client-minicluster.artifact`

which default to:

- `hadoop-client-api`
- `hadoop-client-runtime`
- `hadoop-client-minicluster`

but all switch to `hadoop-client` when the Hadoop profile is hadoop-2.7. A side effect of this is that we'll import the same dependency multiple times; for this I have to disable the Maven enforcer rule `banDuplicatePomDependencyVersions`.

Besides the above, there are the following changes:

- explicitly add a few dependencies which are imported via transitive dependencies from Hadoop jars, but are removed from the shaded client jars.
- removed the use of `ProxyUriUtils.getPath` from `ApplicationMaster`, which is a server-side/private API.
- modified `IsolatedClientLoader` to exclude `hadoop-auth` jars when the Hadoop version is 3.x. This change should only matter when we're not sharing Hadoop classes with Spark (which is mostly used in tests).

Why are the changes needed?

Hadoop 3.2.2 is released with new features and bug fixes, so it's good for the Spark community to adopt it. However, the latest Hadoop versions starting from Hadoop 3.2.1 have upgraded to use Guava 27+. In order to resolve the Guava conflicts, this takes the approach of switching to the shaded client jars provided by Hadoop. This also has the benefit of avoiding pulling other 3rd-party dependencies from the Hadoop side, so as to avoid more potential future conflicts.

Does this PR introduce any user-facing change?

When people use Spark with the `hadoop-provided` option, they should make sure the classpath contains the `hadoop-client-api` and `hadoop-client-runtime` jars. In addition, they may need to make sure these jars appear before other Hadoop jars in the classpath order. Otherwise, classes may be loaded from the other non-shaded Hadoop jars and cause potential conflicts.

How was this patch tested?

Relying on existing tests.