
[FLINK-33603][FileSystems] shade guava in gs-fs filesystem #23489


Closed · wants to merge 9 commits

Conversation

@singhravidutt (Contributor)

What is the purpose of the change

Shade the guava version required specifically by google-cloud-storage.

Brief change log

Introduced a dependency on the guava version (31.1-jre) required by google-cloud-storage. The upgrade of google-cloud-storage led to a runtime failure because of new functionality added in 31.1-jre.

This change pins the guava version to the one required by the storage client, specifically in flink-gs-fs-hadoop, leaving all other filesystem implementations untouched.
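
For illustration, a pin of this kind in the flink-gs-fs-hadoop pom.xml would look roughly as follows; this is a minimal sketch, and the exact placement and surrounding configuration in the actual diff may differ:

<!-- Sketch only: pinning Guava to the version google-cloud-storage needs.
     The exact placement in the PR's pom.xml is assumed, not quoted. -->
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>31.1-jre</version>
</dependency>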

Verifying this change

This change is already covered by existing tests. These tests pass with the new version.

mvn -Pfast -pl flink-filesystems/flink-gs-fs-hadoop clean test

Created a cluster by replacing the binaries specific to flink-gs-fs-hadoop and ran the following job:

gcloud dataproc jobs submit flink --cluster <cluster-name> --region us-central1 --jar file:///usr/lib/flink/examples/streaming/WordCount.jar -- --input gs://vinaylondhe-public/word_count/word_count.txt --output gs://vinaylondhe-public/rdsingh_wc_output

Earlier, it failed with the following error:

Caused by: java.lang.NoSuchMethodError: 'com.google.common.collect.ImmutableMap com.google.common.collect.ImmutableMap$Builder.buildOrThrow()'

With these changes, the same job now executes successfully.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): yes
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@flinkbot (Collaborator)

flinkbot commented Oct 5, 2023

CI report:

Bot commands: the @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

@cnauroth (Contributor) left a comment:

Hello @singhravidutt. Thank you for the patch. This looks right to me, but I see there was a build failure in CI. Can you please investigate?

@singhravidutt (Contributor, Author)

@flinkbot run azure

@cnauroth (Contributor) left a comment:

+1 (non-binding)

Thanks for the fix, @singhravidutt . I'm not sure if flinkbot really gave it another run though?

@afedulov (Contributor) left a comment:

@singhravidutt I see that the ticket you reference in the title has already been closed.
https://issues.apache.org/jira/browse/FLINK-31631
Please create a new issue in Flink JIRA describing the reason for the current change.

@singhravidutt (Contributor, Author)

@flinkbot run azure

@sofam

sofam commented Nov 10, 2023

Any progress on this? We need it for upgrading our jobs to 1.18 :)

@jayadeep-jayaraman (Contributor)

Created FLINK-33603 for this issue. @singhravidutt, can you please update the JIRA issue number?

@singhravidutt singhravidutt changed the title [FLINK-31631][FileSystems] shade guava in gs-fs filesystem [ FLINK-33603][FileSystems] shade guava in gs-fs filesystem Dec 5, 2023
@singhravidutt (Contributor, Author)

@flinkbot run azure

@MartijnVisser (Contributor) left a comment:

The PR is failing due to licensing checks (missing entries in the correct NOTICE file). Let me see if I can help you out

Comment on lines 215 to 237
<filters>
<filter>
<artifact>org.apache.flink:flink-fs-hadoop-shaded</artifact>
<excludes>
<exclude>com/google/common/**</exclude>
<exclude>org/checkerframework/**</exclude>
<exclude>com/google/errorprone/**</exclude>
<exclude>com/google/j2objc/**</exclude>
<exclude>com/google/thirdparty/publicsuffix/**</exclude>
</excludes>
</filter>
<filter>
<artifact>*</artifact>
<excludes>
<exclude>properties.dtd</exclude>
<exclude>PropertyList-1.0.dtd</exclude>
<exclude>mozilla/**</exclude>
<exclude>META-INF/maven/**</exclude>
<exclude>META-INF/NOTICE.txt</exclude>
<exclude>META-INF/LICENSE.txt</exclude>
</excludes>
</filter>
</filters>
Contributor:

We shouldn't exclude these dependencies during shading, but rather exclude them where we add the dependency.
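
For clarity, a dependency-level exclusion of the kind suggested here would look roughly like this (a sketch of the suggestion, not the PR's actual diff):

<!-- Sketch: excluding Guava where the dependency is declared,
     instead of filtering its classes out during shading. -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-fs-hadoop-shaded</artifactId>
    <version>${project.version}</version>
    <exclusions>
        <exclusion>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
        </exclusion>
    </exclusions>
</dependency>

As the author's reply below explains, an exclusion like this only removes a transitive dependency; it cannot strip classes that flink-fs-hadoop-shaded has already bundled into its own jar.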

Contributor (Author):

flink-fs-hadoop-shaded is a shaded jar that doesn't relocate the classes mentioned in these exclusion rules. The only way I see to exclude them is by listing them explicitly here.

<!-- shade google guava version -->
<relocation>
<pattern>com.google.common</pattern>
<shadedPattern>org.apache.flink.fs.gs.com.google.common</shadedPattern>
</relocation>
Contributor:

This wouldn't work, right? Because I'm assuming GCS is looking for com.google.common and not org.apache.flink.fs?

Contributor (Author):

With shading, the relocation rewrites references to com.google.common in the classes of module flink-gs-fs-hadoop.

Contributor:

Yes, but the code from GCS still expects the import to be com.google.common, because it doesn't know about the shaded class.

@MartijnVisser (Contributor)

> Introduced a dependency on the guava version (31.1-jre) required by google-cloud-storage.

I'm still kind of puzzled. If this is required by google-cloud-storage, shouldn't it have been bundled as a dependency? Why should Flink have to do that?

@MartijnVisser (Contributor) left a comment:

@singhravidutt I think the key part lies here:

https://github.com/apache/flink/pull/23489/files#diff-e4e85c9890e4dacce954ed7e15169017bb15c3e697ae9c255a33843d12fb7bd2R85-R88

This is where Guava gets excluded; that doesn't work for GCS. I think that's the fix for this problem :)
If I remove this exclusion and run mvn dependency:tree, I see:

[INFO] +- com.google.cloud:google-cloud-storage:jar:2.29.1:compile
[INFO] |  +- com.google.guava:guava:jar:32.1.2-jre:compile

The only other reference to Guava for this module is:

[INFO] +- org.apache.flink:flink-core:jar:1.19-SNAPSHOT:provided
[INFO] |  \- org.apache.flink:flink-shaded-guava:jar:31.1-jre-17.0:provided

Which I don't think should be a problem (it's provided, and its classes are relocated).
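
Sketched as a pom change, the suggested fix would simply drop the Guava exclusion from the google-cloud-storage dependency so its own Guava flows in transitively (the version shown matches the dependency tree above; the rest is illustrative):

<!-- Sketch: let google-cloud-storage bring in the Guava it was built against,
     instead of excluding it and adding Guava by hand. -->
<dependency>
    <groupId>com.google.cloud</groupId>
    <artifactId>google-cloud-storage</artifactId>
    <version>2.29.1</version>
    <!-- no <exclusions> entry for com.google.guava:guava -->
</dependency>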

@MartijnVisser (Contributor)

@singhravidutt I've created 2ebe7f6 to test it out on private CI. Let me know what you think

@MartijnVisser (Contributor)

@singhravidutt Can I get your thoughts on #23920 as an alternative solution?

<filters>
<filter>
<artifact>org.apache.flink:flink-fs-hadoop-shaded</artifact>
<excludes>
Contributor:

IIUC, we don't need to exclude the shaded guava classes, since they won't interfere as long as the right guava version is on the classpath.

Contributor (Author):

[WARNING] flink-fs-hadoop-shaded-1.19-SNAPSHOT.jar, guava-32.1.2-jre.jar define 1837 overlapping classes: 
[WARNING]   - com.google.common.annotations.Beta
[WARNING]   - com.google.common.annotations.GwtCompatible
[WARNING]   - com.google.common.annotations.GwtIncompatible
[WARNING]   - com.google.common.annotations.VisibleForTesting
[WARNING]   - com.google.common.base.Absent
[WARNING]   - com.google.common.base.AbstractIterator
[WARNING]   - com.google.common.base.AbstractIterator$1
[WARNING]   - com.google.common.base.AbstractIterator$State
[WARNING]   - com.google.common.base.Ascii
[WARNING]   - com.google.common.base.CaseFormat
[WARNING]   - 1827 more...

I see this while building the package. My interpretation of it is that flink-fs-hadoop-shaded is a shaded jar AND it doesn't relocate the guava classes, so the shaded jar contains guava's classes directly. Hence just excluding guava as a transitive dependency of flink-fs-hadoop-shaded is not enough.

flink-gs-fs-hadoop will contain two implementations of the guava classes, i.e. com.google.common.*: one coming from flink-fs-hadoop-shaded (from guava v27.1) and the other from guava v32.1.2. As buildOrThrow is not available in v27, this causes the runtime failure.

Hence we have to either repackage every dependency of flink-fs-hadoop-shaded before adding it as a dependency, or exclude the classes manually.

What are your thoughts on that?

@JingGe (Contributor)

JingGe commented Dec 13, 2023

Hi folks, thanks for driving it.

@MartijnVisser I am not sure if your version will work, since there is still a guava 27.0-jre in the classpath.

@singhravidutt: #23920 from @MartijnVisser actually shows the right direction. We don't need to exclude guava in google-cloud-storage and then build an explicit guava dependency in this pom. Afaik, your version should also work, but it adds the extra effort of maintaining a guava dependency instead of leveraging the one google-cloud-storage already has; i.e., if the next version of google-cloud-storage needs a new guava version, we would have to change the guava dependency accordingly again, and only a few people would be aware of it.

I would suggest a solution combining both of your ideas, plus a small suggestion from me (a sketch follows the list):

  1. exclude guava in flink-fs-hadoop-shaded (thought from @singhravidutt)
  2. remove the new guava dependency created in this pom (thought from @MartijnVisser); 1+2 make sure there is no stray guava in the classpath
  3. remove the exclusion of guava in google-cloud-storage, i.e. enable the GCS built-in guava dependency, making sure google-cloud-storage has the right guava version in the classpath (thought from @MartijnVisser)
  4. remove the exclusion of guava in org.apache.flink:flink-fs-hadoop-shaded, as I commented above.

Please correct me if I am wrong. Looking forward to your thoughts.
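
One reading of steps 1-4, sketched as pom fragments (illustrative only; this is not the merged diff, and the element placement and coordinates are assumed):

<!-- Steps 1+2: exclude Guava from flink-fs-hadoop-shaded's transitive
     dependencies, and declare no explicit com.google.guava:guava
     dependency anywhere in this pom. -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-fs-hadoop-shaded</artifactId>
    <exclusions>
        <exclusion>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<!-- Step 3: google-cloud-storage keeps its built-in Guava, i.e. no
     com.google.guava:guava exclusion on this dependency. -->
<dependency>
    <groupId>com.google.cloud</groupId>
    <artifactId>google-cloud-storage</artifactId>
</dependency>
<!-- Step 4: the maven-shade-plugin filter that excludes com/google/common/**
     from org.apache.flink:flink-fs-hadoop-shaded (quoted earlier in this
     thread) is removed rather than kept. -->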

@singhravidutt (Contributor, Author)

@flinkbot run azure

@singhravidutt (Contributor, Author)

@flinkbot run azure

@MartijnVisser (Contributor)

@JingGe Thanks for the comments. I've been looking at different options to shade the artifacts, and also trying them with a local application that writes to GCS. Unfortunately, I haven't gotten that to work yet :(

@MartijnVisser (Contributor)

@JingGe @singhravidutt I think I've cracked it :)

  1. I've set up a GCS bucket and created a checkpointing test job in order to reproduce the error that's happening in Flink 1.18.0. That worked:
2023-12-15 11:42:18,119 ERROR org.apache.flink.util.FatalExitExceptionHandler              [] - FATAL: Thread 'jobmanager-io-thread-5' produced an uncaught exception. Stopping the process...
java.lang.NoSuchMethodError: 'com.google.common.collect.ImmutableMap com.google.common.collect.ImmutableMap$Builder.buildOrThrow()'
  2. I've tested my PR (which is based on 1.19-SNAPSHOT); with that, checkpointing works once more:
2023-12-15 11:32:01,039 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 1 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1702636321029 for job d9bd270aefde98f6b9aec07f39e6e08b.
2023-12-15 11:32:01,931 INFO  org.apache.flink.fs.gs.GSFileSystem                          [] - Creating GSRecoverableWriter with file-system options GSFileSystemOptions{writerTemporaryBucketName=Optional.empty, writerChunkSize=Optional.empty}
2023-12-15 11:32:03,557 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed checkpoint 1 for job d9bd270aefde98f6b9aec07f39e6e08b (4201 bytes, checkpointDuration=2526 ms, finalizationTime=2 ms).
2023-12-15 11:32:06,031 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 2 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1702636326028 for job d9bd270aefde98f6b9aec07f39e6e08b.
2023-12-15 11:32:06,899 INFO  org.apache.flink.fs.gs.GSFileSystem                          [] - Creating GSRecoverableWriter with file-system options GSFileSystemOptions{writerTemporaryBucketName=Optional.empty, writerChunkSize=Optional.empty}
2023-12-15 11:32:08,287 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed checkpoint 2 for job d9bd270aefde98f6b9aec07f39e6e08b (4201 bytes, checkpointDuration=2246 ms, finalizationTime=12 ms).

I'm currently testing if my patch would also work when backported to 1.18.

@MartijnVisser (Contributor)

@snuyanzin I'm also looping you in here

> I'm currently testing if my patch would also work when backported to 1.18.

Newsflash: it didn't work 😭 Which, in hindsight, is quite obvious: master has 492a886 merged, which influences the dependencies.

Let me do some more tests and I'll get back here

@MartijnVisser (Contributor)

MartijnVisser commented Dec 15, 2023

There are two possibilities to solve this:

  1. Backport https://issues.apache.org/jira/browse/FLINK-33704 to release-1.18 - This resolves the issue for 1.18. I've opened a PR for that at [FLINK-33704][BP 1.18][Filesytems] Update GCS filesystems to latest available versions #23935
  2. Merge the commit in my updated PR (7c60900) that excludes Guava from flink-fs-hadoop-shaded and includes it explicitly, together with the needed com.google.guava:failureaccess, in flink-gs-fs-hadoop.

Both of these approaches make checkpointing work again; I've validated that locally. In master (so the upcoming Flink 1.19.0 release) this problem does not appear.

@JingGe @singhravidutt @snuyanzin Any preference? I'm leaning towards 1, since that would be the same solution for both 1.18 and 1.19, with option 2 being a specific fix for 1.18 only.
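
For reference, option 2 sketched as dependency entries (the versions shown are assumptions based on the dependency tree discussed earlier; commit 7c60900 is the authoritative change):

<!-- Sketch of option 2: Guava excluded from flink-fs-hadoop-shaded and added
     explicitly for flink-gs-fs-hadoop, together with failureaccess, the small
     companion artifact Guava requires at runtime. Versions are assumed. -->
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>32.1.2-jre</version>
</dependency>
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>failureaccess</artifactId>
    <version>1.0.1</version>
</dependency>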

@singhravidutt (Contributor, Author)

The licensing issue is fixed in the latest commit, but I still feel we should exclude the guava classes coming from flink-fs-hadoop-shaded. @JingGe @MartijnVisser

#23489 (comment)

@MartijnVisser (Contributor)

Superseded by https://issues.apache.org/jira/browse/FLINK-33704 - Thanks for the help/thinking on this @singhravidutt

@singhravidutt @cnauroth Do you have any bandwidth to add a GCS connector test to Flink?

@cnauroth (Contributor)

cnauroth commented Feb 3, 2025

> Superseded by https://issues.apache.org/jira/browse/FLINK-33704 - Thanks for the help/thinking on this @singhravidutt
>
> @singhravidutt @cnauroth Do you have any bandwidth to add a GCS connector test to Flink?

Hello @MartijnVisser, it's been a while, but I just sent in some new integration tests at #26102.

CC: @singhravidutt , @jayadeep-jayaraman
