Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core | Spark | Integrations] : Fix kryo serialization failure for FileIO #5437

Merged
merged 8 commits into from
Sep 1, 2022

Conversation

singhpk234
Copy link
Contributor

@singhpk234 singhpk234 commented Aug 4, 2022

About the change

presently spark queries fails when using S3fileIO & GlueCatalog when being used with KryoSerializer ref #5414 (comment). This happens because Immutable map part of S3FileIO properties is not serializable with Kryo and seralizers for it not available in Twitter Chill lib as well (which also spark uses). This PR attemps to fix this by using java collections instead which spark will be able to ser/de as it's serializer is present in Chill lib.

Solves #5414

Testing Done

Manual Test

UT to validate the change fix the issue, without the fix the UT fails with exception mentioned in ticket #5414 (comment)

@github-actions github-actions bot added the AWS label Aug 4, 2022
@singhpk234 singhpk234 marked this pull request as draft August 4, 2022 13:12
@singhpk234 singhpk234 force-pushed the fix/kryo-serialization branch 2 times, most recently from b7cc12c to 3532aa3 Compare August 4, 2022 13:22
@singhpk234 singhpk234 marked this pull request as ready for review August 4, 2022 14:44
@singhpk234 singhpk234 changed the title AWS: Fix kryo serialization for S3 FileIO AWS: Fix kryo serialization failure for S3 FileIO Aug 4, 2022
Copy link
Contributor

@kbendick kbendick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM.

You might want to put a small comment in the code above the creation of tableSpecificCatalogProperties and mention that ImmutableMap isn't used to support Kryo serde as that's the normal style.

Totally up to you though.

@github-actions github-actions bot added the build label Aug 7, 2022
@singhpk234 singhpk234 force-pushed the fix/kryo-serialization branch 2 times, most recently from e058629 to 95bc117 Compare August 7, 2022 11:57
Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the right way to fix the problem.

The approach in this PR is to avoid passing ImmutableMap into initializeFileIO, but that makes S3FileIO vulnerable to this problem any time the caller doesn't know not to pass that type of map. Instead, S3FileIO should be responsible for making a copy of the incoming map that can be serialized with Kryo.

@rdblue rdblue added this to the Iceberg 0.14.1 Release milestone Aug 7, 2022
@rdblue
Copy link
Contributor

rdblue commented Aug 7, 2022

This looks like the right fix to me. Thanks, @singhpk234! Could you also add tests for the other FileIO implementations?

@github-actions github-actions bot added the spark label Aug 8, 2022
@singhpk234 singhpk234 force-pushed the fix/kryo-serialization branch 3 times, most recently from 8f40d83 to 159e84a Compare August 8, 2022 10:49
Copy link
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I would define the version number for com.twitter:chill-java:0.10.0 in versions.propsand have build.gradle only use the actual artifact (without the version): testImplementation 'com.twitter:chill-java

@singhpk234 singhpk234 force-pushed the fix/kryo-serialization branch from 1f3d860 to 3aef768 Compare September 1, 2022 08:07
@github-actions github-actions bot added the API label Sep 1, 2022
@singhpk234 singhpk234 force-pushed the fix/kryo-serialization branch from 3aef768 to 6118aa8 Compare September 1, 2022 08:36
@singhpk234 singhpk234 force-pushed the fix/kryo-serialization branch from 6118aa8 to 58e2082 Compare September 1, 2022 08:39
@singhpk234 singhpk234 changed the title AWS: Fix kryo serialization failure for S3 FileIO [Core | Spark | Integrations] : Fix kryo serialization failure for FileIO Sep 1, 2022
@rdblue rdblue merged commit ec9643e into apache:master Sep 1, 2022
@rdblue
Copy link
Contributor

rdblue commented Sep 1, 2022

Thanks, @singhpk234!

@singhpk234
Copy link
Contributor Author

Thanks @rdblue @kbendick @nastra @amogh-jahagirdar for your awesome reviews :) !!!

@singhpk234 singhpk234 deleted the fix/kryo-serialization branch September 2, 2022 05:24
rdblue pushed a commit to rdblue/iceberg that referenced this pull request Sep 2, 2022
rdblue pushed a commit to rdblue/iceberg that referenced this pull request Sep 2, 2022
rdblue pushed a commit to rdblue/iceberg that referenced this pull request Sep 2, 2022
rdblue pushed a commit to rdblue/iceberg that referenced this pull request Sep 2, 2022
rdblue pushed a commit to rdblue/iceberg that referenced this pull request Sep 2, 2022
@akshay-kokate-06
Copy link

@singhpk234 I am still facing this issue using iceberg-spark-runtime-3.3_2.12:1.2.0 and software.amazon.awssdk:bundle:2.20.18

@nastra
Copy link
Contributor

nastra commented Apr 14, 2023

@akshay-kokate-06 what issue are you seeing exactly? Could you provide the stack trace please?

sunchao pushed a commit to sunchao/iceberg that referenced this pull request May 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants