
[FLINK-31631][FileSystems] Upgrade GCS connector to 2.2.11. #22281


Merged: 1 commit into apache:master on Mar 31, 2023

Conversation

cnauroth
Contributor

@cnauroth cnauroth commented Mar 27, 2023

What is the purpose of the change

Upgrade GCS connector to 2.2.11.

Brief change log

Upgrade the GCS Connector bundled in the Flink distro from version 2.2.3 to 2.2.11. The new release contains multiple bug fixes and enhancements discussed in the Release Notes. Notable changes include:

  • Improved socket timeout handling.
  • Trace logging capabilities.
  • Fixed a bug that prevented using GCS as a Hadoop Credential Provider.
  • Dependency upgrades.
  • Support for OAuth2-based client authentication.

Verifying this change

This change is already covered by existing tests. These tests pass with the new version.

mvn -Pfast -pl flink-filesystems/flink-gs-fs-hadoop clean test

Additionally, I built a full distro and successfully ran a job that reads from and writes to a GCS bucket.

mvn -Pfast clean package -DskipTests
mkdir ./plugins/gs-fs-hadoop
cp ./opt/flink-gs-fs-hadoop-1.18-SNAPSHOT.jar ./plugins/gs-fs-hadoop

bin/start-cluster.sh

bin/flink run examples/batch/WordCount.jar \
    --input gs://dataproc-datasets-us-central1/shakespeare \
    --output gs://cnauroth-hive-metastore-proxy-dist/output/shakespeare-word-count.txt

The documentation changes were tested by starting up the local server and clicking through the updated links.

cd docs
./build_docs.sh -p

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): yes
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@cnauroth
Contributor Author

I'd appreciate getting some guidance on anything more required in the licenses directory. According to the Licensing guide, there is supposed to be a license file per bundled dependency in the shaded jar, which would be ~70 files. However, the existing directory contains only 8 files.

@flinkbot
Collaborator

flinkbot commented Mar 27, 2023

CI report:


@MartijnVisser
Contributor

According to the Licensing guide, there is supposed to be a license file per bundled dependency in the shaded jar

From a Flink POV, a bundle is considered the artifact that's being produced. In your case, you're only touching flink-gs-fs-hadoop.

@MartijnVisser
Contributor

@cnauroth
Contributor Author

Thank you, @MartijnVisser . I've updated the pull request with additional NOTICE entries, and I've removed the previous changes in the licenses sub-directory.

@MartijnVisser
Contributor

  • Support OAuth2 based client authentication.

@cnauroth Do you think that should be included in the documentation?


@MartijnVisser MartijnVisser left a comment


LGTM overall, just wondering if some of the new features/changes should also be documented. Leaving it open for final comments before merging.

@MartijnVisser MartijnVisser self-assigned this Mar 28, 2023
@cnauroth
Contributor Author

Thanks, @MartijnVisser !

It looks like the established documentation follows a pattern of mentioning how to configure the connector, but not listing all available configuration properties. Instead, it links to the connector's documentation for the full list of properties. I think that's a good strategy, because it avoids extra maintenance for Flink if anything changes about the connector configuration.

However, a potential improvement would be for the documentation to use version-specific links to the connector documentation. Right now, it links to the master branch, which could cause confusion, because master might contain new configuration properties that haven't made it into an official release yet. I'd propose updating these two links to point at the specific 2.2.11 release (and the same for the content.zh files):

https://github.com/apache/flink/blob/master/docs/content/docs/deployment/filesystems/gcs.md?plain=1#L58
https://github.com/apache/flink/blob/master/docs/content/docs/deployment/filesystems/gcs.md?plain=1#L71

LMK if you agree, and I can update the PR.

@MartijnVisser
Contributor

However, a potential improvement would be for the documentation to use version-specific links to the connector documentation. Right now, it links to the master branch.

That depends on the URL that you use. E.g. https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/filesystems/gcs/ points to master, but https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/deployment/filesystems/gcs/ uses the docs from release-1.17.

So I think you can safely include the improvements to the gcs.md docs in this PR. They would end up in master (which also shows a warning at the top: "This documentation is for an unreleased version of Apache Flink. We recommend you use the latest stable version."), and when this is released, they would automatically end up in the right branch too.

@cnauroth
Contributor Author

I pushed up the documentation changes. Thank you!

@MartijnVisser
Contributor

@cnauroth Thanks for the PR, LGTM

@MartijnVisser MartijnVisser merged commit e3f9df6 into apache:master Mar 31, 2023
@cnauroth
Contributor Author

@MartijnVisser , thanks for the review and commit!

@yigress
Contributor

yigress commented Sep 20, 2023

There seems to be a Guava conflict with the upgraded GCS SDK version.

2023-09-20 23:22:26,977 INFO  org.apache.flink.fs.gs.GSFileSystem                          [] - Creating GSRecoverableWriter with file-system options GSFileSystemOptions{writerTemporaryBucketName=Optional.empty, writerChunkSize=Optional.empty}
2023-09-20 23:22:26,994 ERROR org.apache.flink.util.FatalExitExceptionHandler              [] - FATAL: Thread 'jobmanager-io-thread-17' produced an uncaught exception. Stopping the process...
java.lang.NoSuchMethodError: 'com.google.common.collect.ImmutableMap com.google.common.collect.ImmutableMap$Builder.buildOrThrow()'
	at com.google.cloud.storage.UnifiedOpts$Opts.getRpcOptions(UnifiedOpts.java:2096) ~[?:?]
	at com.google.cloud.storage.StorageImpl.writer(StorageImpl.java:624) ~[?:?]
	at com.google.cloud.storage.StorageImpl.writer(StorageImpl.java:90) ~[?:?]
	at org.apache.flink.fs.gs.storage.GSBlobStorageImpl.writeBlob(GSBlobStorageImpl.java:64) ~[?:?]
	at org.apache.flink.fs.gs.writer.GSRecoverableFsDataOutputStream.createWriteChannel(GSRecoverableFsDataOutputStream.java:229) ~[?:?]
	at org.apache.flink.fs.gs.writer.GSRecoverableFsDataOutputStream.write(GSRecoverableFsDataOutputStream.java:152) ~[?:?]
	at org.apache.flink.fs.gs.writer.GSRecoverableFsDataOutputStream.write(GSRecoverableFsDataOutputStream.java:135) ~[?:?]
	at org.apache.flink.fs.gs.writer.GSRecoverableFsDataOutputStream.write(GSRecoverableFsDataOutputStream.java:128) ~[?:?]
	at org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.write(FsCheckpointMetadataOutputStream.java:73) ~[flink-dist-1.19-SNAPSHOT.jar:1.19-SNAPSHOT]
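
For context (not part of the original report): `ImmutableMap.Builder#buildOrThrow()` was only added in Guava 31.0, so code compiled against a newer Guava fails at runtime with exactly this `NoSuchMethodError` when an older Guava wins on the classpath. A stand-alone sketch of the method's duplicate-key contract, using only `java.util` and a hypothetical helper (no Guava):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BuildOrThrowSketch {
    // Mimics the contract of ImmutableMap.Builder#buildOrThrow:
    // reject duplicate keys with an exception instead of silently overwriting.
    static Map<Object, Object> buildOrThrow(Object... kvs) {
        LinkedHashMap<Object, Object> m = new LinkedHashMap<>();
        for (int i = 0; i < kvs.length; i += 2) {
            if (m.putIfAbsent(kvs[i], kvs[i + 1]) != null) {
                throw new IllegalArgumentException("Multiple entries with same key: " + kvs[i]);
            }
        }
        return m;
    }

    public static void main(String[] args) {
        System.out.println(buildOrThrow("a", 1, "b", 2)); // prints {a=1, b=2}
        try {
            buildOrThrow("a", 1, "a", 2);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```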

@MartijnVisser
Contributor

@cnauroth Any thoughts on this?

@cnauroth
Contributor Author

@yigress , thank you for reporting this.

The stack trace appears to be in the checkpointing code path, which uses the GCS client directly (not the Hadoop FileSystem). Looking back at my testing notes in the pull request description, I think I tested reading and writing through the FileSystem, but not this direct client code path. Sorry about that.

The patch upgraded from google-cloud-storage 2.2.3 (transitive dependency on Guava 31.0.1-jre) to google-cloud-storage 2.15.0 (transitive dependency on Guava 31.1-jre). Both the GCS client and the FileSystem want Guava 31.1-jre at this point. The FileSystem was insulated, though, because it shades Guava. The most likely answer now is for flink-gs-fs-hadoop to shade its Guava too. (I hope that's not going to bump into new problems with flink-fs-hadoop-shaded, though.) I'll aim to test a new patch next week.
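
For illustration only, shading Guava in a module usually means a relocation in its maven-shade-plugin configuration. The snippet below is a hypothetical sketch; the shadedPattern prefix is an assumption, not the actual Flink convention, and the real flink-gs-fs-hadoop pom may organize this differently:

```xml
<!-- Hypothetical relocation sketch for the module's pom.xml; the
     shadedPattern prefix is illustrative, not Flink's actual naming. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>com.google.common</pattern>
            <shadedPattern>org.apache.flink.fs.gs.shaded.com.google.common</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

Relocating the classes rewrites the bytecode references inside the shaded jar, so the bundled Guava no longer collides with whatever Guava version other code on the classpath brings in.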

Long-term, it would be great to find a way to re-route this checkpointing code path through the FileSystem, so that we can keep the dependency management consistent throughout. I haven't explored what that would take yet.

CC: @MartijnVisser , @jayadeep-jayaraman

@jayadeep-jayaraman
Contributor

jayadeep-jayaraman commented Sep 23, 2023

Hi @cnauroth - There are multiple code paths and APIs in Flink that use different Google libraries, and we should make them consistent.

For instance

  • Writes that use saveAsText, saveAsCSV, etc. go via the hadoop-gcs connector.
  • Checkpoints also go via the hadoop-gcs connector.
  • The FileSink API, which is used to write the final files to GCS and is most probably causing the issue mentioned above, uses the GCS Java storage library.

The ideal scenario would be for everything to go via the hadoop-gcs connector.

@cnauroth
Contributor Author

Hello @singhravidutt . I heard that you might have a patch already in progress to fix this Guava dependency conflict. If so, would you please open a new JIRA issue and send the pull request? Thank you.

@cnauroth
Contributor Author

cnauroth commented Oct 5, 2023

There is now a patch available here at #23469.

@yigress
Contributor

yigress commented Oct 5, 2023

There is now a patch available here at #23469.

will it fix this issue?

@czchen
Contributor

czchen commented Dec 11, 2023

@cnauroth

The java.lang.NoSuchMethodError issue remains in 1.18.0. Any chance it can be fixed in the 1.18 series?

@cnauroth
Contributor Author

There is now a patch available here at #23469.

will it fix this issue?

@yigress and @czchen , my apologies, I think I referenced the wrong pull request in my last comment. I think #23489 is the one that's relevant. That one is still open.
