
[SPARK-33212][BUILD] Move to shaded clients for Hadoop 3.x profile #29843


Closed
wants to merge 20 commits

Conversation

sunchao
Member

@sunchao sunchao commented Sep 22, 2020

What changes were proposed in this pull request?

This switches Spark to use the shaded Hadoop clients, namely hadoop-client-api and hadoop-client-runtime, for Hadoop 3.x. For Hadoop 2.7, we'll still use the existing modules such as hadoop-client.

In order to keep the default Hadoop profile as hadoop-3.2, this defines the following Maven properties:

hadoop-client-api.artifact
hadoop-client-runtime.artifact
hadoop-client-minicluster.artifact

which default to:

hadoop-client-api
hadoop-client-runtime
hadoop-client-minicluster

but all switch to hadoop-client when the Hadoop profile is hadoop-2.7. A side effect of this is that we'll import the same dependency multiple times, so I had to disable the Maven enforcer rule banDuplicatePomDependencyVersions.
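For illustration, here is a minimal pom sketch of that mechanism, using the property names and defaults described above (the exact placement and values in Spark's pom.xml may differ):

<!-- Defaults: point at the shaded Hadoop 3.x clients. -->
<properties>
  <hadoop-client-api.artifact>hadoop-client-api</hadoop-client-api.artifact>
  <hadoop-client-runtime.artifact>hadoop-client-runtime</hadoop-client-runtime.artifact>
  <hadoop-client-minicluster.artifact>hadoop-client-minicluster</hadoop-client-minicluster.artifact>
</properties>

<profile>
  <!-- The hadoop-2.7 profile switches every artifact property back to the
       monolithic hadoop-client, which is why the same dependency can end up
       declared more than once. -->
  <id>hadoop-2.7</id>
  <properties>
    <hadoop-client-api.artifact>hadoop-client</hadoop-client-api.artifact>
    <hadoop-client-runtime.artifact>hadoop-client</hadoop-client-runtime.artifact>
    <hadoop-client-minicluster.artifact>hadoop-client</hadoop-client-minicluster.artifact>
  </properties>
</profile>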

Besides the above, there are the following changes:

  • explicitly added a few dependencies that are pulled in via transitive dependencies from the Hadoop jars but are removed from the shaded client jars.
  • removed the use of ProxyUriUtils.getPath (a server-side/private API) from ApplicationMaster.
  • modified IsolatedClientLoader to exclude hadoop-auth jars when the Hadoop version is 3.x. This change should only matter when we're not sharing Hadoop classes with Spark (which is mostly used in tests).

Why are the changes needed?

This serves two purposes:

  • to unblock Spark from upgrading to Hadoop 3.2.2/3.3.0+. The latest Hadoop versions have upgraded to Guava 27+, and in order to adopt them in Spark we need to resolve the Guava conflicts. This PR takes the approach of switching to the shaded client jars provided by Hadoop.
  • to avoid pulling third-party dependencies from Hadoop and avoid potential future conflicts.

Does this PR introduce any user-facing change?

When people use Spark with the hadoop-provided option, they should make sure the classpath contains the hadoop-client-api and hadoop-client-runtime jars. In addition, they may need to make sure these jars appear before other Hadoop jars on the classpath; otherwise, classes may be loaded from the non-shaded Hadoop jars and cause conflicts.

How was this patch tested?

Relying on existing tests.

@dongjoon-hyun
Member

Great! Thanks, @sunchao .

@SparkQA

SparkQA commented Sep 22, 2020

Test build #128992 has finished for PR 29843 at commit 09ea1e3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sunchao
Member Author

sunchao commented Sep 23, 2020

The Python/YARN tests are failing with:

/usr/bin/python: No module named socketserver

which is weird. I was also not able to reproduce the Python-related test failures locally.

@HyukjinKwon
Member

HyukjinKwon commented Sep 23, 2020

@sunchao, I think it's because Jenkins uses Python 2 as the python executable, which we dropped (whereas GitHub Actions properly uses Python 3 as the python executable). That module was renamed from SocketServer to socketserver in Python 3.

Can you explicitly change the tests to use python3 instead of python, for example here:

"spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON"
-> sys.env.getOrElse("PYSPARK_DRIVER_PYTHON", "python"),
"spark.yarn.appMasterEnv.PYSPARK_PYTHON"
-> sys.env.getOrElse("PYSPARK_PYTHON", "python")),

This fails in this PR specifically because the failed tests have the ExtendedYarnTest tag, which is only run when changes are made to the YARN module.
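For example, a minimal sketch of the suggested edit (defaulting to python3; the exact patch in YarnClusterSuite may differ):

"spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON"
  -> sys.env.getOrElse("PYSPARK_DRIVER_PYTHON", "python3"),
"spark.yarn.appMasterEnv.PYSPARK_PYTHON"
  -> sys.env.getOrElse("PYSPARK_PYTHON", "python3")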

@sunchao
Member Author

sunchao commented Sep 23, 2020

Ah, thanks @HyukjinKwon! Yeah, on my local laptop I set those two environment variables and the tests pass. I'll make the change in the test suite.

@steveloughran steveloughran left a comment
Contributor

  1. have there been changes in the Hadoop APIs which broke Spark? If so: report a regression.
  2. consider including "and move to shaded hadoop-client" in the title

I should warn that the hadoop-cloud codebase isn't shaded. Long story, but "nobody has sat down to do it across everything" is probably the most accurate summary.

@@ -308,7 +307,7 @@ private[spark] class ApplicationMaster(
     // The client-mode AM doesn't listen for incoming connections, so report an invalid port.
     registerAM(Utils.localHostName, -1, sparkConf,
       sparkConf.getOption("spark.driver.appUIAddress"), appAttemptId)
-    addAmIpFilter(Some(driverRef), ProxyUriUtils.getPath(appAttemptId.getApplicationId))
+    addAmIpFilter(Some(driverRef), s"/proxy/$appAttemptId")
Contributor

Is this a regression in the Hadoop APIs?

Member Author

No - this is to remove the dependency on the Hadoop server side (ProxyUriUtils also calls private APIs, which is not good).

@steveloughran
Contributor

Expect a 3.2.2 before the end of the year. If there are regressions, that would be the place to get them addressed.

@sunchao sunchao changed the title from "[WIP][SPARK-29250] Upgrade to Hadoop 3.2.1" to "[WIP][SPARK-29250] Upgrade to Hadoop 3.2.1 and move to shaded client" on Sep 23, 2020
@sunchao
Member Author

sunchao commented Sep 23, 2020

Thanks @steveloughran .

have there been changes in the Hadoop APIs which broke Spark? If so: report a regression.

No, I haven't seen any API breakage yet.

consider including "and move to shaded hadoop-client" in the title

Good suggestion. Done.

Regarding the hadoop-cloud codebase: Spark already has its own hadoop-cloud module, which shades Guava from Hadoop modules like hadoop-cloud-storage and hadoop-azure, so I think we are good. I also didn't spot any Guava artifacts from there in the Maven dependency tree.

@@ -308,7 +307,7 @@ private[spark] class ApplicationMaster(
     // The client-mode AM doesn't listen for incoming connections, so report an invalid port.
     registerAM(Utils.localHostName, -1, sparkConf,
       sparkConf.getOption("spark.driver.appUIAddress"), appAttemptId)
-    addAmIpFilter(Some(driverRef), ProxyUriUtils.getPath(appAttemptId.getApplicationId))
+    addAmIpFilter(Some(driverRef), s"/proxy/$appAttemptId")
Member

Why are we not using getApplicationId as before?

Member Author

Oops, my bad. It should be appAttemptId.getApplicationId.

Member

BTW, do we need URI encoding, as ProxyUriUtils.getPath does internally?

Member Author

Yeah, we need that too. Good catch. I had it before but something got messed up during the rebase. Will fix.
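For illustration, a minimal sketch of restoring the encoding (assuming the surrounding ApplicationMaster code; URLEncoder here is a stand-in for the encoding ProxyUriUtils.getPath performs, not necessarily what the final patch uses):

import java.net.URLEncoder
import java.nio.charset.StandardCharsets

// Encode the application id so the proxy path stays a valid URI component.
val appId = appAttemptId.getApplicationId
val proxyPath = s"/proxy/${URLEncoder.encode(appId.toString, StandardCharsets.UTF_8.name())}"
addAmIpFilter(Some(driverRef), proxyPath)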

@SparkQA

SparkQA commented Sep 24, 2020

Test build #129064 has finished for PR 29843 at commit 6a8a8b6.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 24, 2020

Test build #129071 has finished for PR 29843 at commit 7bca429.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sunchao
Member Author

sunchao commented Sep 24, 2020

Hmm interesting. There is still one Python test in YARN failing with:

Exception: Python in worker has different version 3.6 than that in driver 3.8, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

after I changed YarnClusterSuite to use python3. It seems we've already specified 3.8 in the CI config, so I'm not sure why that happens. @HyukjinKwon any hint? Thanks.

@@ -118,11 +118,15 @@ private[hive] object IsolatedClientLoader extends Logging {
hadoopVersion: String,
ivyPath: Option[String],
remoteRepos: String): Seq[URL] = {
val hadoopJarName = if (hadoopVersion.startsWith("3")) {
Member

Is this valid for all Hadoop 3.x versions?

Member Author

@sunchao sunchao Sep 24, 2020

Yes, I believe so. These modules should be available in any production Hadoop 3.x release. See https://issues.apache.org/jira/browse/HADOOP-11804; it was fixed in 3.0.0-alpha2.
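For context, a hypothetical sketch of the kind of selection this code performs (simplified; the real IsolatedClientLoader resolves artifacts through Ivy and the actual code differs):

// Pick the Maven coordinates of the Hadoop client jars to download.
def hadoopClientArtifacts(hadoopVersion: String): Seq[String] =
  if (hadoopVersion.startsWith("3")) {
    // Shaded clients, available since Hadoop 3.0.0-alpha2 (HADOOP-11804).
    Seq(s"org.apache.hadoop:hadoop-client-api:$hadoopVersion",
        s"org.apache.hadoop:hadoop-client-runtime:$hadoopVersion")
  } else {
    // Hadoop 2.x: keep the classic monolithic hadoop-client.
    Seq(s"org.apache.hadoop:hadoop-client:$hadoopVersion")
  }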

@sunchao
Member Author

sunchao commented Sep 24, 2020

BTW the other SparkQA also failed with:

java.io.IOException: Cannot run program "python3": error=2, No such file or directory

@sunchao sunchao changed the title from "[WIP][SPARK-29250] Upgrade to Hadoop 3.2.1 and move to shaded client" to "[WIP][SPARK-29250][test-maven][test-hadoop2.7] Upgrade to Hadoop 3.2.1 and move to shaded client" on Sep 24, 2020
@SparkQA

SparkQA commented Sep 24, 2020

Test build #129088 has started for PR 29843 at commit f6f19b5.

@SparkQA

SparkQA commented Sep 24, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33709/

@SparkQA

SparkQA commented Sep 24, 2020

Test build #129092 has finished for PR 29843 at commit c9534c7.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 24, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33709/

@dbtsai
Member

dbtsai commented Oct 22, 2020

Merged into master. Thanks all for reviewing!

@dongjoon-hyun
Member

Great! Thank you guys!

@sunchao
Member Author

sunchao commented Oct 22, 2020

Great! Thank you all!

@tgravescs
Contributor

Sorry, I said that backwards: by default, if we include a shaded version of Hadoop, users won't pick up all the dependencies Hadoop pulls in by default. That is kind of the point for Guava in Spark, but it also means that if users were relying on this behavior, it is now broken. Ideally users wouldn't rely on it, but many times they don't know or think about it. My point is that it's worth documenting this, since it's a change in behavior.

@xkrogen
Contributor

xkrogen commented Oct 22, 2020

+1 @tgravescs -- it's definitely common for users to implicitly depend on the transitive tree of Spark/Hadoop dependencies. This is a move in the right direction to prevent such behavior in the future, but probably shouldn't go in any point releases, and should be mentioned in release notes.

@sunchao sunchao deleted the SPARK-29250 branch October 22, 2020 17:09
@sunchao
Member Author

sunchao commented Oct 22, 2020

Thanks @tgravescs and @xkrogen. Those are valid points. I'll document them in the release notes. And yes, this is going to 3.1.0, which is not a point release.

@dbtsai
Member

dbtsai commented Oct 22, 2020

@sunchao instead of upgrading to 3.2.1, which has a known backward-compatibility issue, maybe we can explore upgrading Hadoop to 3.3 directly. The backward-compatibility issue is fixed in Hadoop 3.3, and Hadoop 3.3 has a lot of good stuff for cloud support. cc @steveloughran

@dongjoon-hyun
Member

+1 for Hadoop 3.3.0+

@dongjoon-hyun
Member

BTW, @sunchao is working on that.

@steveloughran
Contributor

To be ruthless about even my own code:

  • 3.2.2 is lowest risk
  • 3.3.3 has a lot of s3 and abfs changes, but there's enough changes elsewhere to make it more traumatic
  • And a 3.3.1 is really needed/due for the stabilisation there.

It would probably be safest to build/release with a 3.2.x but allow building against 3.3.x if someone really wanted it.

@sunchao
Member Author

sunchao commented Nov 16, 2020

Thanks @steveloughran for sharing the info! Totally agree. Yes, my current plan is to wait until 3.2.2 comes out and then upgrade Spark to use that, and hopefully the community will agree. People who want to use 3.3.x can still use the existing -Dhadoop.version flag to enable it.

3.3.3 has a lot of s3 and abfs changes, but there's enough changes elsewhere to make it more traumatic

You mean 3.3 here, right? Does 3.3.3 exist yet?

@dongjoon-hyun
Member

If that's the Hadoop community's recommendation, +1 for 3.2.2.

@steveloughran
Contributor

Sorry, I meant 3.3.1.

If you are happy with a 3.3.0 release, then go for it. It has lots of shiny new features for s3a and abfs in particular. For spark standalone that would really help people. As for spark-in-cluster, those people who offer clusters-as-a-service can build their own spark releases with the dependencies they choose, can't they?

@dongjoon-hyun
Member

😄 @steveloughran

@dongjoon-hyun
Member

Hi, All.

Sadly, I noticed that this breaks the master branch as shown below. This seems to be a known issue, HADOOP-16080, present since Apache Hadoop 3.1.1. According to HADOOP-16080, there is no official release with a fix yet. Do we have a workaround on the Spark side?

Traceback (most recent call last):
  File "/tmp/spark-e8674f85-69c2-4dac-bf60-9a20c608e954/ctas.py", line 6, in <module>
    spark = SparkSession.builder.appName(sys.argv[1]).getOrCreate()
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 228, in getOrCreate
  File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 382, in getOrCreate
  File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 147, in __init__
  File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 209, in _do_init
  File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 319, in _initialize_context
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1569, in __call__
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NoSuchMethodError: 'void org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(com.google.common.util.concurrent.ListeningExecutorService, int, boolean)'
	at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:772)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1118)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1098)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:987)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:975)
	at org.apache.spark.deploy.SparkHadoopUtil$.createFile(SparkHadoopUtil.scala:510)

@sunchao
Member Author

sunchao commented Nov 25, 2020

Thanks @dongjoon-hyun for reporting. Do you know why this was not revealed in the previous CI runs? I'll take a look to see if there is any workaround on the Spark side. I also wonder if @steveloughran has any suggestions, since you were involved in the Hadoop issue.

@dongjoon-hyun
Member

Although the K8s IT test suite has one test covering hadoop-aws, it uses spark.jars.packages=org.apache.hadoop:hadoop-aws:3.2.0 explicitly. So it seems to download hadoop-aws and its dependencies on the fly.

Do you know why this was not revealed in the previous CI runs?

In short, the regression is that dev/make-distribution.sh -Phadoop-cloud ... doesn't make a complete distribution for cloud support. So, in clusters that don't allow Maven access, it will fail. This is a big regression.

@sunchao
Member Author

sunchao commented Nov 25, 2020

Thanks @dongjoon-hyun. Do you know an easy way to reproduce this? I'll see what we can do from the Spark side. On the other hand, Hadoop 3.2.2 is coming soon; let's see if we can quickly come up with a fix in the Hadoop community.

@dongjoon-hyun
Member

Thanks, please try to build a distribution with hadoop-cloud and access S3.

NO_MANUAL=1 ./dev/make-distribution.sh --pip --tgz -Pkubernetes,hadoop-3.2,hadoop-cloud

In the above example, I used S3 as a Spark event log directory.

@dongjoon-hyun
Member

dongjoon-hyun commented Nov 25, 2020

Since we need a working hadoop-aws feature for Spark QA, I went ahead and made a PR as one of the alternatives, to assess the feasibility of reverting. I'll keep it open until next Monday, before the branch cut, or until we find a workaround.

@dongjoon-hyun
Member

Currently, we are exploring all options:

  • A Spark-side workaround
  • A new Hadoop release
  • Reverting to the old dependency

@steveloughran
Contributor

If someone wants to sit down and produce a hadoop-cloud shaded module which hides things downstream, happy to let them. It's been one of those things I'd like to have, though I'd prefer even more: java 9 modules. But the s3a and abfs modules do make heavy use of common code in hadoop-common (specifically: o.a.h.fs.impl, soon o.a.h.fs.statistics.impl) and can't just treat hadoop-common as a third party api.

I think Java 9 modules are the way we should really be going.

@sunchao
Member Author

sunchao commented Nov 28, 2020

Thanks @steveloughran, yeah, agreed that the Java 9 modules feature looks promising (it was discussed some years back in HADOOP-11656, but the timing seems better now). I can try to spend some time looking at this. A related question is whether we'd want to go further and apply it to other Hadoop modules as well.

This will take a while though. In the meantime, I'm wondering what we can do to ship this in the upcoming Spark 3.1 release. One possible solution is to still use the non-shaded client when the hadoop-cloud profile is enabled:

diff --git a/pom.xml b/pom.xml
index 3ae2e7420e..12c36af557 100644
--- a/pom.xml
+++ b/pom.xml
@@ -3238,6 +3238,11 @@

     <profile>
       <id>hadoop-cloud</id>
+      <properties>
+        <hadoop-client-api.artifact>hadoop-client</hadoop-client-api.artifact>
+        <hadoop-client-runtime.artifact>hadoop-client</hadoop-client-runtime.artifact>
+        <hadoop-client-minicluster.artifact>hadoop-client</hadoop-client-minicluster.artifact>
+      </properties>
       <modules>
         <module>hadoop-cloud</module>
       </modules>

Or maybe we could try to shade the hadoop-aws jar in the spark-hadoop-cloud_2.12 module itself so that it invokes the shaded API from hadoop-common side. This won't work if Spark users decide to use their own Hadoop jars (via hadoop-provided) so we may have to make the hadoop-aws a compile scope dependency.
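For reference, a hypothetical maven-shade-plugin sketch of that second idea (not part of this PR; the relocation prefix is an assumption about what the shaded Hadoop client uses):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <artifactSet>
          <!-- Only bundle hadoop-aws; everything else stays provided. -->
          <includes>
            <include>org.apache.hadoop:hadoop-aws</include>
          </includes>
        </artifactSet>
        <relocations>
          <!-- Rewrite Guava references in hadoop-aws so they line up with the
               relocated Guava classes shipped in hadoop-client-runtime. -->
          <relocation>
            <pattern>com.google.common</pattern>
            <shadedPattern>org.apache.hadoop.shaded.com.google.common</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>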

@sunchao
Member Author

sunchao commented Nov 30, 2020

Or maybe we could try to shade the hadoop-aws jar in the spark-hadoop-cloud_2.12 module itself so that it invokes the shaded API from hadoop-common side. This won't work if Spark users decide to use their own Hadoop jars (via hadoop-provided) so we may have to make the hadoop-aws a compile scope dependency.

I created a WIP PR to demonstrate how this work. Feel free to comment on it.

@dongjoon-hyun
Member

Thank you, @sunchao !
