
[WIP][SPARK-33212][BUILD] Provide hadoop-aws-shaded jar in hadoop-cloud module #30556


Closed
wants to merge 2 commits

Conversation

sunchao
Member

@sunchao sunchao commented Nov 30, 2020

What changes were proposed in this pull request?

This creates a hadoop-aws-shaded jar within the hadoop-cloud module, which shades Guava and relocates it to the org.apache.hadoop.shaded namespace used on the Hadoop side.

A large portion of the change involves moving hadoop-aws, hadoop-azure and hadoop-openstack under the new hadoop-2.7 profile. This makes sure they are not included when the default hadoop-3.2 profile is active, since otherwise we'd end up with both hadoop-aws and hadoop-aws-shaded after the build.
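For illustration, a minimal sketch of the kind of maven-shade-plugin relocation this describes (illustrative only, assuming a single Guava relocation; the exact POM wiring in this PR may differ):

```xml
<!-- Minimal sketch, not the exact configuration from this PR. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <artifactSet>
          <includes>
            <!-- Bundle the hadoop-aws classes into the shaded jar. -->
            <include>org.apache.hadoop:hadoop-aws</include>
          </includes>
        </artifactSet>
        <relocations>
          <!-- Rewrite Guava references so they resolve against the Guava
               relocated inside hadoop-client-api/hadoop-client-runtime. -->
          <relocation>
            <pattern>com.google.common</pattern>
            <shadedPattern>org.apache.hadoop.shaded.com.google.common</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

With the unshaded cloud modules moved under hadoop-2.7, a default -Phadoop-3.2 build then ships only the shaded artifact.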

Why are the changes needed?

Due to HADOOP-15387, the hadoop-aws module currently doesn't work with the shaded Hadoop clients, namely hadoop-client-api and hadoop-client-runtime. hadoop-aws references a private API from hadoop-common whose signature uses Guava types, and those types are relocated inside hadoop-client-api, so the Guava-typed signature that hadoop-aws was compiled against no longer exists at runtime. Consequently, when talking to S3, Spark users may encounter the following error:

: java.lang.NoSuchMethodError: 'void org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(com.google.common.util.concurrent.ListeningExecutorService, int, boolean)'
	at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:772)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1118)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1098)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:987)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:975)
	at org.apache.spark.deploy.SparkHadoopUtil$.createFile(SparkHadoopUtil.scala:510)

This PR mitigates the issue by shading hadoop-aws within Spark itself and replacing the existing hadoop-aws jar with the hadoop-aws-shaded jar. This, however, should be treated as a temporary fix and eventually be replaced by a shaded jar from the Hadoop side, once HADOOP-15387 is resolved.

Also note that, with this PR, hadoop-aws-shaded jar becomes a hard dependency, and shall be included even if hadoop-provided is specified by Spark users.

Does this PR introduce any user-facing change?

Yes, the newly introduced hadoop-aws-shaded jar becomes a hard dependency, and shall be included even if hadoop-provided is specified by Spark users.

How was this patch tested?

I manually checked the bytecode of the generated hadoop-aws-shaded jar and verified that all the Guava references are now relocated.

@dongjoon-hyun
Member

Thank you, @sunchao.

@dongjoon-hyun
Member

cc @srowen and @steveloughran

@AngersZhuuuu
Contributor

For users of Hadoop S3A running a custom version with custom code (such as myself), they can change the Hadoop build info, deploy it to their own Maven repo, and point the Spark dependencies at that repo.
That would solve the problem for users who want to use their own jar, right? But it seems like a lot of work.

@sunchao
Member Author

sunchao commented Dec 1, 2020

@AngersZhuuuu yes, that should work. On the other hand, if you are building Spark with the hadoop-provided option on, you can also build your own shaded hadoop-aws jar and put it on the classpath.
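As a hedged sketch of what that could look like (hadoop-provided, hadoop-cloud and dev/make-distribution.sh are standard Spark build options; the shaded jar name below is a hypothetical artifact you would build yourself):

```bash
# Build a distribution that expects Hadoop to be provided at runtime.
./dev/make-distribution.sh --name custom -Phadoop-provided -Phadoop-cloud

# Put your own shaded hadoop-aws jar on the classpath at launch time.
# "hadoop-aws-shaded.jar" is hypothetical; you would build it yourself,
# e.g. with maven-shade-plugin relocations like the sketch in the PR description.
bin/spark-shell --jars /path/to/hadoop-aws-shaded.jar
```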

@AngersZhuuuu
Contributor

> @AngersZhuuuu yes, that should work. On the other hand, if you are building Spark with the hadoop-provided option on, you can also build your own shaded hadoop-aws jar and put it on the classpath.

Thanks for your suggestion. Are there any new developments in Hive 2.3 support for Hadoop 3.3.0? I saw apache/hive#1356 but I am not sure it will make Hive run well with Hadoop 3.3.0.

@sunchao
Member Author

sunchao commented Dec 1, 2020

@AngersZhuuuu you mean for Spark to work with Hive and Hadoop 3.3.0, right? The major issue is around resolving potential Guava conflicts between these components: Hadoop 3.2.1+/3.3.0+ has moved to Guava 27 while Hive/Spark are still on Guava 14. One of the motivations for moving to the shaded client in Spark is to isolate the Guava dependencies on the Hadoop side. Similarly, @viirya is working on the above PR to shade Guava on the Hive side.
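For anyone who wants to see the conflict in their own build, a standard Maven command lists every path through which Guava enters the dependency graph (hedged example; run it from whichever project aggregates the Hadoop and Hive/Spark artifacts):

```bash
# Lists all Guava occurrences, making 14.x-vs-27.x version conflicts visible.
mvn dependency:tree -Dincludes=com.google.guava:guava
```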

@AngersZhuuuu
Contributor

> @AngersZhuuuu you mean for Spark to work with Hive and Hadoop 3.3.0, right? The major issue is around resolving potential Guava conflicts between these components: Hadoop 3.2.1+/3.3.0+ has moved to Guava 27 while Hive/Spark are still on Guava 14. One of the motivations for moving to the shaded client in Spark is to isolate the Guava dependencies on the Hadoop side. Similarly, @viirya is working on the above PR to shade Guava on the Hive side.

Not only Spark; we need Hive to be able to run with Hadoop 3.3.0 too. That work doesn't look finished yet:
https://issues.apache.org/jira/browse/HIVE-21569
https://issues.apache.org/jira/browse/HIVE-22916

@sunchao
Member Author

sunchao commented Dec 1, 2020

Ah, OK. That part, as far as I know, is stuck because Hive has a dependency on an old version of Spark which blocks it from upgrading Guava. Hopefully that will be unblocked after HADOOP-17288 ships in the upcoming release (whether that be 3.3.1 or 3.4.0). I'm not sure whether the change will go to the Hive 2.3 branch though.

@AngersZhuuuu
Contributor

All right. There is still a lot of work to be done to integrate these engines with Hadoop.

@steveloughran
Contributor

I am perfectly happy to have the -shaded stuff in hadoop-common, as that stops the ASF getting upset about the Spark project publishing o.a.hadoop artifacts (see: the org.apache.hive artifacts). Other people would be happy too. Get that into Hadoop branches 3.2 and 3.3 and you can pick it up there.

We would also want a hadoop-cloud-storage-shaded module which pulls in the relevant artifacts from the other modules.

Now, one trouble spot will be the hadoop-common dependencies, especially the fs implementation stuff. In all recent work I've been trying to put these in .impl packages so that they can be isolated for Java 9 modules; there's inevitably stuff in org.apache.hadoop.fs, and the committers need hadoop-mapreduce-client. Suggestions?

Anyway, I'd rather see a move to Java 9 modules than shading, which is a workaround designed to hide CVEs in larger JARs. We have to make do with shading for now, but that doesn't mean we need to stay with it.

So, a serious question: what would it take to make hadoop-* modular? As well as the module-info files, we'd need to understand what is public and what isn't... and how to deal with:

  • mixed public/private packages
  • stuff we thought was private but turns out to be needed by apps
  • stuff needed by extension points

thoughts?
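To make the question concrete, a purely hypothetical module-info.java sketch of the kind of split this would require (the module name, packages and qualified export below are invented for illustration, not an actual Hadoop descriptor):

```java
// Hypothetical sketch only -- not an actual Hadoop module descriptor.
module org.apache.hadoop.common {
    // Public API surface: every exported package is a compatibility promise.
    exports org.apache.hadoop.fs;
    exports org.apache.hadoop.conf;

    // Extension points: a qualified export exposes an .impl package only
    // to a named module (e.g. the committers in hadoop-mapreduce-client).
    exports org.apache.hadoop.fs.impl to org.apache.hadoop.mapreduce.client;

    // Everything not exported (including "stuff we thought was private but
    // turns out to be needed by apps") becomes inaccessible -- which is
    // exactly where the hard compatibility questions above show up.
}
```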

@sunchao
Member Author

sunchao commented Dec 1, 2020

@steveloughran thanks for your comments. What is your opinion on this PR though? This allows us to move forward in parallel while waiting for a proper fix from the Hadoop side (which I can also help with). It seems that Spark doesn't officially publish any jars from the hadoop-cloud module, though.

Personally I feel the Java 9 modules feature is promising, but on the other hand it seems like a radical change, given how much emphasis has been placed on shading in Hadoop. Further, I haven't seen wide adoption of it within the ecosystem yet, and I'm not sure whether it will have chain effects on downstream apps that depend on Hadoop.

> So, a serious question: what would it take to make hadoop-* modular?

I think this, together with moving to modules, should be properly discussed in a Hadoop JIRA or on the dev list. I still have a lot of homework to do regarding the modules approach, but IMO we'd have to take a close look at downstream projects (at least those within close proximity of Hadoop) and figure out the current state. From my experience Spark is relatively loosely coupled from Hadoop, but projects such as Hive may prove to be quite different.

Within Hadoop, we should examine what is public, what is "project-private" and what is "module-private", and tighten the APIs so that even different modules within Hadoop can play more nicely with each other.

@dongjoon-hyun
Member

dongjoon-hyun commented Dec 1, 2020

Hi, all. It turns out that the master branch's normal distribution (without hadoop-cloud) is also affected.

$ bin/spark-shell --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY -c spark.eventLog.enabled=true -c spark.eventLog.dir=s3a://dongjoon/ --packages org.apache.hadoop:hadoop-aws:3.2.0,org.apache.hadoop:hadoop-common:3.2.0
...
java.lang.NoSuchMethodError: 'org.apache.hadoop.conf.Configuration org.apache.hadoop.security.ProviderUtils.excludeIncompatibleCredentialProviders(org.apache.hadoop.conf.Configuration, java.lang.Class)'
  at org.apache.hadoop.fs.s3a.S3AUtils.getAWSAccessKeys(S3AUtils.java:740)
  at org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider.<init>(SimpleAWSCredentialsProvider.java:58)
  at org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProviderSet(S3AUtils.java:600)
  at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:257)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1853)
  at org.apache.spark.deploy.history.EventLogFileWriter.<init>(EventLogFileWriters.scala:60)
  at org.apache.spark.deploy.history.SingleEventLogFileWriter.<init>(EventLogFileWriters.scala:213)
  at org.apache.spark.deploy.history.EventLogFileWriter$.apply(EventLogFileWriters.scala:181)
  at org.apache.spark.scheduler.EventLoggingListener.<init>(EventLoggingListener.scala:64)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:588)

Up to 3.0.1, the following was enough.

$ bin/spark-shell --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY -c spark.eventLog.enabled=true -c spark.eventLog.dir=s3a://dongjoon/spark-events/ --packages org.apache.hadoop:hadoop-aws:3.2.0

@sunchao
Member Author

sunchao commented Dec 2, 2020

Thanks @dongjoon-hyun. This is bad news, and it means we'll have to abandon the approach in this PR. The only solution seems to be on the Hadoop side. I've opened a Hadoop PR and tested it successfully with the code snippet you pasted above. @steveloughran could you take a look there? Thanks.

@dongjoon-hyun
Member

Ya. I'll proceed with #30508 first since the Apache Spark 3.1 branch cut is this Friday.
We can revisit this later during the QA period.

@sunchao
Member Author

sunchao commented Dec 2, 2020

Sure @dongjoon-hyun. This sounds good to me.

@sunchao sunchao closed this Dec 2, 2020
dongjoon-hyun pushed a commit that referenced this pull request Jan 26, 2021
… and add more strict Hadoop version check

### What changes were proposed in this pull request?

1. Add back Maven enforcer for duplicate dependencies check
2. More strict check on Hadoop versions which support shaded client in `IsolatedClientLoader`. To do proper version check, this adds a util function `majorMinorPatchVersion` to extract major/minor/patch version from a string.
3. Cleanup unnecessary code

### Why are the changes needed?

The Maven enforcer was removed as part of #30556. This proposes to add it back.

Also, Hadoop shaded client doesn't work in certain cases (see [these comments](#30701 (comment)) for details). This strictly checks that the current Hadoop version (i.e., 3.2.2 at the moment) has good support of shaded client or otherwise fallback to old unshaded ones.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #31203 from sunchao/SPARK-33212-followup.

Lead-authored-by: Chao Sun <sunchao@apple.com>
Co-authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
skestle pushed a commit to skestle/spark that referenced this pull request Feb 3, 2021
… and add more strict Hadoop version check

pan3793 pushed a commit to pan3793/spark that referenced this pull request Aug 30, 2021
… and add more strict Hadoop version check
