[WIP][SPARK-33212][BUILD] Provide hadoop-aws-shaded jar in hadoop-cloud module #30556
Conversation
Thank you, @sunchao.
cc @srowen and @steveloughran
For users of Hadoop S3A who run a custom version with custom code (such as myself), the workaround is to change the Hadoop build info, deploy the build to their own Maven repo, and point Spark's dependencies at that repo.
@AngersZhuuuu yes that should work. On the other hand, if you are building Spark with
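A minimal sketch of that custom-build workflow, assuming the custom Hadoop build is installed locally under a hypothetical version `3.3.0-custom` (the version string and directory layout are illustrative):

```sh
# Install the custom Hadoop build into the local (or an internal) Maven repo.
# "3.3.0-custom" is a hypothetical version set in the Hadoop poms.
cd hadoop
mvn install -DskipTests -Dmaven.javadoc.skip=true

# Build Spark against that Hadoop version; -Dhadoop.version overrides the
# Hadoop dependency version used by Spark's build.
cd ../spark
./build/mvn -DskipTests -Dhadoop.version=3.3.0-custom clean package
```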
Thanks for your suggestion. Are there any new developments in Hive 2.3 support for Hadoop 3.3.0? I saw apache/hive#1356, but I am not sure it will make Hive run well with Hadoop 3.3.0.
@AngersZhuuuu you mean for Spark to work with Hive and Hadoop 3.3.0, right? The major issue is around resolving potential Guava conflicts between these components: Hadoop 3.2.1+/3.3.0+ has moved to Guava 27 while Hive/Spark are still on Guava 14. One of the motivations for moving to the shaded client in Spark is to isolate the Guava dependencies on the Hadoop side. Similarly, @viirya is working on the above PR to shade Guava on the Hive side.
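To make the isolation concrete: in the Hadoop shaded client jars, third-party classes such as Guava are relocated under the `org.apache.hadoop.shaded` prefix, which can be verified by listing the jar contents (a sketch; the jar name and version are illustrative):

```sh
# Show a few relocated Guava classes inside the shaded client runtime jar.
# The jar file name/version is illustrative.
jar tf hadoop-client-runtime-3.3.0.jar \
  | grep 'org/apache/hadoop/shaded/com/google/common' \
  | head -n 5
```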
Not only Spark; we also need Hive to support running with Hadoop 3.3.0, and that work doesn't look finished yet.
Ah ok. That part, as far as I know, is stuck because Hive has a dependency on an old version of Spark which blocks it from upgrading Guava. Hopefully that should be unblocked after HADOOP-17288 is shipped in the upcoming release (whether that be 3.3.1 or 3.4.0). I'm not sure whether the change will go to the Hive 2.3 branch though.
All right, there is still a lot of work to be done to integrate these engines with Hadoop.
I am perfectly happy to have the -shaded stuff in hadoop-common, as that stops the ASF getting upset about the Spark project publishing o.a.hadoop artifacts (see: org.apache.hive artifacts). Other people would be happy too. Get that into Hadoop branches 3.2 and 3.3 and you can pick it up there. We would also want a hadoop-cloud-storage-shaded which pulled in the relevant artifacts from the other modules.

Now, one trouble spot will be hadoop-common dependencies, especially the fs implementation stuff. I've been trying with all recent work to put them in .impl packages so that they can be isolated for Java 9 modules; there's inevitably stuff in org.apache.hadoop.fs, and the committers need hadoop-mapreduce-client. Suggestions?

Anyway, I'd rather see a move to Java 9 modules over shading, which is a workaround designed to hide CVEs in larger JARs. We just have to make do with it now, but that doesn't mean we need to stay with it. So: serious question. What would it take to make hadoop-* modular? As well as the module-info files, we'd need to understand what is public and what isn't... and how to deal with

thoughts?
@steveloughran thanks for your comments. What is your opinion on this PR though? It allows us to move forward in parallel while waiting for a proper fix from the Hadoop side (which I can also help with). It seems that Spark doesn't officially publish any jars from

Personally I feel the Java 9 modules feature is promising, but on the other hand it seems like a radical change, given how much emphasis has been put on shading in Hadoop. Further, I haven't seen wide adoption of it yet within the eco-system, and I'm not sure whether it would impose chain effects on downstream apps depending on Hadoop.
I think this, together with moving to modules, should be properly discussed in a Hadoop JIRA or on the dev list. I still have lots of homework to do regarding the modules approach, but IMO we'd have to take a close look at downstream projects (at least those within close proximity of Hadoop) and figure out the current state. From my experience Spark is relatively loosely coupled to Hadoop, but projects such as Hive may prove to be quite different. Within Hadoop, we should examine what is public, what is "project-private" and what is "module-private", and tighten the APIs so that even different modules within Hadoop can play more nicely with each other.
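As a starting point for the modularity question, the JDK's `jdeps` tool can propose module descriptors and report package-level dependencies between jars; a sketch (jar names and versions are illustrative):

```sh
# Generate candidate module-info.java files for a Hadoop jar; this also
# surfaces problems such as split packages. Jar versions are illustrative.
jdeps --generate-module-info out hadoop-common-3.3.0.jar

# Summarize jar-to-jar dependencies to see what a module graph would look like.
jdeps -summary hadoop-common-3.3.0.jar hadoop-mapreduce-client-core-3.3.0.jar
```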
Hi, All. It turns out that

Up to 3.0.1, the following was enough.
Thanks @dongjoon-hyun. This is bad news and it means we'd have to abandon the approach in this PR; the only solution seems to be on the Hadoop side. I've opened a Hadoop PR and tested it successfully with the code snippet you pasted above. @steveloughran could you take a look there? Thanks.
Ya. I'll proceed with #30508 first, since the Apache Spark 3.1 branch cut is this Friday.
Sure, @dongjoon-hyun. This sounds good to me.
… and add more strict Hadoop version check

### What changes were proposed in this pull request?

1. Add back the Maven enforcer for the duplicate dependencies check.
2. More strict check on Hadoop versions which support the shaded client in `IsolatedClientLoader`. To do proper version checks, this adds a util function `majorMinorPatchVersion` to extract the major/minor/patch version from a string.
3. Clean up unnecessary code.

### Why are the changes needed?

The Maven enforcer was removed as part of #30556. This proposes to add it back. Also, the Hadoop shaded client doesn't work in certain cases (see [these comments](#30701 (comment)) for details). This strictly checks that the current Hadoop version (i.e., 3.2.2 at the moment) has good support for the shaded client, or otherwise falls back to the old unshaded ones.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #31203 from sunchao/SPARK-33212-followup.

Lead-authored-by: Chao Sun <sunchao@apple.com>
Co-authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?

This creates a `hadoop-aws-shaded` jar within the `hadoop-cloud` module, which shades Guava and relocates it to the `org.apache.hadoop.shaded` namespace used by the Hadoop side.

A large portion of the change involves moving `hadoop-aws`, `hadoop-azure` and `hadoop-openstack` under the new `hadoop-2.7` profile. This makes sure they are not included when the default `hadoop-3.2` profile is active, since otherwise we'd have both `hadoop-aws` and `hadoop-aws-shaded` after the build, as sketched below.
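For illustration, the two build variants would be selected roughly as follows (`hadoop-cloud`, `hadoop-2.7` and `hadoop-3.2` are the existing Spark build profiles; the exact flag combination is a sketch):

```sh
# Default build: the hadoop-3.2 profile is active, so the hadoop-cloud
# module would ship hadoop-aws-shaded instead of hadoop-aws.
./build/mvn -Phadoop-cloud -DskipTests clean package

# Legacy build: explicitly select the hadoop-2.7 profile to keep the
# unshaded hadoop-aws, hadoop-azure and hadoop-openstack dependencies.
./build/mvn -Phadoop-cloud -Phadoop-2.7 -DskipTests clean package
```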
### Why are the changes needed?

Due to HADOOP-15387, the `hadoop-aws` module currently doesn't work with the shaded Hadoop clients, namely `hadoop-client-api` and `hadoop-client-runtime`, because the former references a private API from `hadoop-common` which uses Guava and is shaded in `hadoop-client-api`. Consequently, when talking to S3, Spark users may encounter errors at runtime.

This PR mitigates the issue by shading `hadoop-aws` within Spark itself, replacing the existing `hadoop-aws` jar with the `hadoop-aws-shaded` jar. This, however, shall be treated as a temporary fix and should eventually be replaced by the shaded jar from the Hadoop side, once HADOOP-15387 is resolved.

Also note that, with this PR, the `hadoop-aws-shaded` jar becomes a hard dependency, and shall be included even if `hadoop-provided` is specified by Spark users.

### Does this PR introduce any user-facing change?
Yes, the newly introduced `hadoop-aws-shaded` jar becomes a hard dependency, and shall be included even if `hadoop-provided` is specified by Spark users.

### How was this patch tested?
I manually checked the bytecode of the generated `hadoop-aws-shaded` jar and verified that all the Guava references are now relocated (see the sketch below).
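One way to perform this kind of check, as a sketch (the class picked for inspection is illustrative), is to confirm that no un-relocated Guava entries or references remain in the shaded jar:

```sh
# No class entries should remain under the original Guava package.
jar tf hadoop-aws-shaded.jar | grep '^com/google/' \
  || echo "no unshaded Guava entries"

# Disassemble a class and look for references to the original Guava
# package; any hits not under org/apache/hadoop/shaded would indicate
# un-relocated references. S3AFileSystem is an illustrative pick.
javap -c -p -classpath hadoop-aws-shaded.jar \
    org.apache.hadoop.fs.s3a.S3AFileSystem \
  | grep 'com/google/common' \
  | grep -v 'org/apache/hadoop/shaded'
```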