
[WIP][SPARK-33212][BUILD] Provide hadoop-aws-shaded jar in hadoop-cloud module #30556


Closed
wants to merge 2 commits

Conversation

sunchao
Member

@sunchao sunchao commented Nov 30, 2020

What changes were proposed in this pull request?

This creates a hadoop-aws-shaded jar within the hadoop-cloud module, which shades Guava and relocates it to the org.apache.hadoop.shaded namespace used on the Hadoop side.

A large portion of the change involves moving hadoop-aws, hadoop-azure and hadoop-openstack under the new hadoop-2.7 profile. This makes sure they are not included when the default hadoop-3.2 profile is active, since otherwise we'd end up with both hadoop-aws and hadoop-aws-shaded after the build.
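For illustration, a minimal sketch of the kind of maven-shade-plugin relocation this describes (illustrative only, assuming a single Guava relocation; the exact POM wiring in this PR may differ):

```xml
<!-- Minimal sketch, not the exact configuration from this PR. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <artifactSet>
          <includes>
            <!-- Bundle the hadoop-aws classes into the shaded jar. -->
            <include>org.apache.hadoop:hadoop-aws</include>
          </includes>
        </artifactSet>
        <relocations>
          <!-- Rewrite Guava references so they resolve against the Guava
               relocated inside hadoop-client-api/hadoop-client-runtime. -->
          <relocation>
            <pattern>com.google.common</pattern>
            <shadedPattern>org.apache.hadoop.shaded.com.google.common</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

With the unshaded cloud modules moved under hadoop-2.7, a default -Phadoop-3.2 build then ships only the shaded artifact.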

Why are the changes needed?

Due to HADOOP-15387, the hadoop-aws module currently doesn't work with the shaded Hadoop clients, namely hadoop-client-api and hadoop-client-runtime. hadoop-aws references a private API from hadoop-common whose signature uses Guava types, and those types are relocated inside hadoop-client-api, so the Guava-typed signature that hadoop-aws was compiled against no longer exists at runtime. Consequently, when talking to S3, Spark users may encounter the following error:

: java.lang.NoSuchMethodError: 'void org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(com.google.common.util.concurrent.ListeningExecutorService, int, boolean)'
	at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:772)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1118)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1098)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:987)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:975)
	at org.apache.spark.deploy.SparkHadoopUtil$.createFile(SparkHadoopUtil.scala:510)

This PR mitigates the issue by shading hadoop-aws within Spark itself and replacing the existing hadoop-aws jar with the hadoop-aws-shaded jar. This, however, should be treated as a temporary fix and eventually be replaced by a shaded jar from the Hadoop side, once HADOOP-15387 is resolved.

Also note that, with this PR, hadoop-aws-shaded jar becomes a hard dependency, and shall be included even if hadoop-provided is specified by Spark users.

Does this PR introduce any user-facing change?

Yes, the newly introduced hadoop-aws-shaded jar becomes a hard dependency, and shall be included even if hadoop-provided is specified by Spark users.

How was this patch tested?

I manually checked the bytecode of the generated hadoop-aws-shaded jar and verified that all the Guava references are now relocated.

@dongjoon-hyun
Member

Thank you, @sunchao.

@dongjoon-hyun
Member

cc @srowen and @steveloughran

@AngersZhuuuu
Contributor

For users of Hadoop S3A running a custom version with custom code (such as myself), they can change the Hadoop build info, deploy it to their own Maven repo, and point the Spark dependencies at that repo.
That would solve the problem for users who want to use their own jar, right? But it seems like a lot of work.

@sunchao
Member Author

sunchao commented Dec 1, 2020

@AngersZhuuuu yes, that should work. On the other hand, if you are building Spark with the hadoop-provided option on, you can also build your own shaded hadoop-aws jar and put it on the classpath.
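As a hedged sketch of what that could look like (hadoop-provided, hadoop-cloud and dev/make-distribution.sh are standard Spark build options; the shaded jar name below is a hypothetical artifact you would build yourself):

```bash
# Build a distribution that expects Hadoop to be provided at runtime.
./dev/make-distribution.sh --name custom -Phadoop-provided -Phadoop-cloud

# Put your own shaded hadoop-aws jar on the classpath at launch time.
# "hadoop-aws-shaded.jar" is hypothetical; you would build it yourself,
# e.g. with maven-shade-plugin relocations like the sketch in the PR description.
bin/spark-shell --jars /path/to/hadoop-aws-shaded.jar
```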

@AngersZhuuuu
Contributor

> @AngersZhuuuu yes, that should work. On the other hand, if you are building Spark with the hadoop-provided option on, you can also build your own shaded hadoop-aws jar and put it on the classpath.

Thanks for your suggestion. Are there any new developments in Hive 2.3 support for Hadoop 3.3.0? I saw apache/hive#1356 but I am not sure it will make Hive run well with Hadoop 3.3.0.

@sunchao
Member Author

sunchao commented Dec 1, 2020

@AngersZhuuuu you mean for Spark to work with Hive and Hadoop 3.3.0, right? The major issue is around resolving potential Guava conflicts between these components: Hadoop 3.2.1+/3.3.0+ has moved to Guava 27 while Hive/Spark are still on Guava 14. One of the motivations for moving to the shaded client in Spark is to isolate the Guava dependencies on the Hadoop side. Similarly, @viirya is working on the above PR to shade Guava on the Hive side.
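For anyone who wants to see the conflict in their own build, a standard Maven command lists every path through which Guava enters the dependency graph (hedged example; run it from whichever project aggregates the Hadoop and Hive/Spark artifacts):

```bash
# Lists all Guava occurrences, making 14.x-vs-27.x version conflicts visible.
mvn dependency:tree -Dincludes=com.google.guava:guava
```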

@AngersZhuuuu
Contributor

> @AngersZhuuuu you mean for Spark to work with Hive and Hadoop 3.3.0, right? The major issue is around resolving potential Guava conflicts between these components: Hadoop 3.2.1+/3.3.0+ has moved to Guava 27 while Hive/Spark are still on Guava 14. One of the motivations for moving to the shaded client in Spark is to isolate the Guava dependencies on the Hadoop side. Similarly, @viirya is working on the above PR to shade Guava on the Hive side.

Not only Spark; we need Hive to be able to run with Hadoop 3.3.0 too. That work doesn't look finished yet:
https://issues.apache.org/jira/browse/HIVE-21569
https://issues.apache.org/jira/browse/HIVE-22916

@sunchao
Member Author

sunchao commented Dec 1, 2020

Ah, OK. That part, as far as I know, is stuck because Hive has a dependency on an old version of Spark which blocks it from upgrading Guava. Hopefully that will be unblocked after HADOOP-17288 ships in the upcoming release (whether that be 3.3.1 or 3.4.0). I'm not sure whether the change will go to the Hive 2.3 branch though.

@AngersZhuuuu
Contributor

All right. There is still a lot of work to be done to integrate these engines with Hadoop.

@steveloughran
Contributor

I am perfectly happy to have the -shaded stuff in hadoop-common, as that stops the ASF getting upset about the Spark project publishing o.a.hadoop artifacts (see: the org.apache.hive artifacts). Other people would be happy too. Get that into Hadoop branches 3.2 and 3.3 and you can pick it up there.

We would also want a hadoop-cloud-storage-shaded module which pulls in the relevant artifacts from the other modules.

Now, one trouble spot will be the hadoop-common dependencies, especially the fs implementation stuff. In all recent work I've been trying to put these in .impl packages so that they can be isolated for Java 9 modules; there's inevitably stuff in org.apache.hadoop.fs, and the committers need hadoop-mapreduce-client. Suggestions?

Anyway, I'd rather see a move to Java 9 modules than shading, which is a workaround designed to hide CVEs in larger JARs. We have to make do with shading for now, but that doesn't mean we need to stay with it.

So, a serious question: what would it take to make hadoop-* modular? As well as the module-info files, we'd need to understand what is public and what isn't... and how to deal with:

  • mixed public/private packages
  • stuff we thought was private but turns out to be needed by apps
  • stuff needed by extension points

thoughts?
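To make the question concrete, a purely hypothetical module-info.java sketch of the kind of split this would require (the module name, packages and qualified export below are invented for illustration, not an actual Hadoop descriptor):

```java
// Hypothetical sketch only -- not an actual Hadoop module descriptor.
module org.apache.hadoop.common {
    // Public API surface: every exported package is a compatibility promise.
    exports org.apache.hadoop.fs;
    exports org.apache.hadoop.conf;

    // Extension points: a qualified export exposes an .impl package only
    // to a named module (e.g. the committers in hadoop-mapreduce-client).
    exports org.apache.hadoop.fs.impl to org.apache.hadoop.mapreduce.client;

    // Everything not exported (including "stuff we thought was private but
    // turns out to be needed by apps") becomes inaccessible -- which is
    // exactly where the hard compatibility questions above show up.
}
```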

@sunchao
Member Author

sunchao commented Dec 1, 2020

@steveloughran thanks for your comments. What is your opinion on this PR though? This allows us to move forward in parallel while waiting for a proper fix from the Hadoop side (which I can also help with). It seems that Spark doesn't officially publish any jars from the hadoop-cloud module, though.

Personally I feel the Java 9 modules feature is promising, but on the other hand it seems like a radical change, given how much emphasis has been placed on shading in Hadoop. Further, I haven't seen wide adoption of it within the ecosystem yet, and I'm not sure whether it will have chain effects on downstream apps that depend on Hadoop.

> So, a serious question: what would it take to make hadoop-* modular?

I think this, together with moving to modules, should be properly discussed in a Hadoop JIRA or on the dev list. I still have a lot of homework to do regarding the modules approach, but IMO we'd have to take a close look at downstream projects (at least those within close proximity of Hadoop) and figure out the current state. From my experience Spark is relatively loosely coupled from Hadoop, but projects such as Hive may prove to be quite different.

Within Hadoop, we should examine what is public, what is "project-private" and what is "module-private", and tighten the APIs so that even different modules within Hadoop can play more nicely with each other.

@dongjoon-hyun
Member

dongjoon-hyun commented Dec 1, 2020

Hi, all. It turns out that the master branch's normal distribution (without hadoop-cloud) is also affected.

$ bin/spark-shell --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY -c spark.eventLog.enabled=true -c spark.eventLog.dir=s3a://dongjoon/ --packages org.apache.hadoop:hadoop-aws:3.2.0,org.apache.hadoop:hadoop-common:3.2.0
...
java.lang.NoSuchMethodError: 'org.apache.hadoop.conf.Configuration org.apache.hadoop.security.ProviderUtils.excludeIncompatibleCredentialProviders(org.apache.hadoop.conf.Configuration, java.lang.Class)'
  at org.apache.hadoop.fs.s3a.S3AUtils.getAWSAccessKeys(S3AUtils.java:740)
  at org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider.<init>(SimpleAWSCredentialsProvider.java:58)
  at org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProviderSet(S3AUtils.java:600)
  at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:257)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1853)
  at org.apache.spark.deploy.history.EventLogFileWriter.<init>(EventLogFileWriters.scala:60)
  at org.apache.spark.deploy.history.SingleEventLogFileWriter.<init>(EventLogFileWriters.scala:213)
  at org.apache.spark.deploy.history.EventLogFileWriter$.apply(EventLogFileWriters.scala:181)
  at org.apache.spark.scheduler.EventLoggingListener.<init>(EventLoggingListener.scala:64)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:588)

Up to 3.0.1, the following was enough.

$ bin/spark-shell --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY -c spark.eventLog.enabled=true -c spark.eventLog.dir=s3a://dongjoon/spark-events/ --packages org.apache.hadoop:hadoop-aws:3.2.0

@sunchao
Member Author

sunchao commented Dec 2, 2020

Thanks @dongjoon-hyun. This is bad news, and it means we'll have to abandon the approach in this PR. The only solution seems to be on the Hadoop side. I've opened a Hadoop PR and tested it successfully with the code snippet you pasted above. @steveloughran could you take a look there? Thanks.

@dongjoon-hyun
Member

Ya. I'll proceed with #30508 first since the Apache Spark 3.1 branch cut is this Friday.
We can revisit this later during the QA period.

@sunchao
Member Author

sunchao commented Dec 2, 2020

Sure @dongjoon-hyun. This sounds good to me.

@sunchao sunchao closed this Dec 2, 2020
dongjoon-hyun pushed a commit that referenced this pull request Jan 26, 2021
… and add more strict Hadoop version check

### What changes were proposed in this pull request?

1. Add back Maven enforcer for duplicate dependencies check
2. More strict check on Hadoop versions which support shaded client in `IsolatedClientLoader`. To do proper version check, this adds a util function `majorMinorPatchVersion` to extract major/minor/patch version from a string.
3. Cleanup unnecessary code

### Why are the changes needed?

The Maven enforcer was removed as part of #30556. This proposes to add it back.

Also, Hadoop shaded client doesn't work in certain cases (see [these comments](#30701 (comment)) for details). This strictly checks that the current Hadoop version (i.e., 3.2.2 at the moment) has good support of shaded client or otherwise fallback to old unshaded ones.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #31203 from sunchao/SPARK-33212-followup.

Lead-authored-by: Chao Sun <sunchao@apple.com>
Co-authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
skestle pushed a commit to skestle/spark that referenced this pull request Feb 3, 2021
… and add more strict Hadoop version check

pan3793 pushed a commit to pan3793/spark that referenced this pull request Aug 30, 2021
… and add more strict Hadoop version check
