-
Notifications
You must be signed in to change notification settings - Fork 4.5k
fix: use DoFnSignature as cache key to prevent generic type collisions #37352
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: use DoFnSignature as cache key to prevent generic type collisions #37352
Conversation
Summary of ChangesHello @GlobalStar117, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves a critical cache collision issue within the Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
bd95d52 to
0820f45
Compare
|
@GlobalStar117 Thanks for the detailed analysis! You are absolutely correct about Java type erasure: new MyDoFn() and new MyDoFn() indeed share the same runtime Class object. However, in Apache Beam, a DoFn's behavior isn't defined solely by its Class. We rely heavily on TypeDescriptor to handle serialization (Coders) and schema verification. Why this fix is necessary: Even if the raw class is the same, users can override getInputTypeDescriptor() (or use mechanisms that capture types) to provide different type information for the same raw DoFn class. The Evidence: My regression test (testCacheKeyCollisionProof) explicitly creates two instances of the same DoFn class but forces them to return different TypeDescriptors. Without this fix, the factory returns the same cached Invoker for both. The Consequence: If the first Invoker is generated/cached with logic specific to String, and then reused for an Integer context (because the cache key ignored the TypeDescriptor), it leads to runtime issues (like incorrect validation or potential ClassCastException in downstream logic that relies on the Invoker's signature). So, while type erasure applies to the user's class, the generated Invoker needs to be aware of the specific TypeDescriptor context to function correctly within the Beam pipeline. This PR ensures the cache respects that distinction. |
|
@Eliaaazzz Thank you for the detailed clarification! You're absolutely right. I initially implemented the fix using Your explanation makes it clear that the issue is real:
I'll re-implement the original fix using Let me push the corrected fix. |
The ByteBuddyDoFnInvokerFactory was using Class<?> as the cache key for generated invokers. While this seemed correct due to Java type erasure (MyDoFn<String> and MyDoFn<Integer> share the same Class), it fails to account for Apache Beam's TypeDescriptor mechanism. The problem: - Users can override getInputTypeDescriptor() to provide different type information for the same raw DoFn class - The generated invoker needs to respect this TypeDescriptor for correct serialization (Coders) and schema verification - Using Class<?> as the key caused DoFns with different TypeDescriptors but the same raw class to incorrectly share invokers The fix: - Changed the cache key from Class<?> to DoFnSignature - DoFnSignature properly captures the TypeDescriptor information - This ensures each unique type configuration gets its own correctly generated invoker Relates to: #37351 [Gittensor contribution - GlobalStar117]
0820f45 to
53e0245
Compare
|
@Eliaaazzz I've updated the PR with the proper fix: ✅ Changed cache key from This ensures that:
The previous implementation documented the Thank you for the clarification - it helped me understand the deeper issue! |
|
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment |
|
The failing tests (ReadWriteIT.testReadWrite and ReadWriteIT.testPubsubLiteWriteReadWithSchemaTransform) are PubSub Lite integration tests unrelated to this PR's changes. These appear to be flaky infrastructure tests. Could a maintainer please re-run or approve? |
Summary
This PR fixes a cache collision issue in
ByteBuddyDoFnInvokerFactorywhere DoFns with the same raw class but differentTypeDescriptors were incorrectly sharing invokers.Fixes: #37351
Problem
The factory was using
Class<?>as the cache key for generated invokers. While this seemed correct due to Java type erasure (e.g.,MyDoFn<String>andMyDoFn<Integer>share the sameClassobject), it fails to account for Apache Beam'sTypeDescriptormechanism:getInputTypeDescriptor()to provide different type information for the same raw DoFn classTypeDescriptorfor correct serialization (Coders) and schema verificationClass<?>as the key caused DoFns with differentTypeDescriptors but the same raw class to incorrectly share invokersSolution
Changed the cache key from
Class<?>toDoFnSignature:DoFnSignatureproperly captures theTypeDescriptorinformationChanges
Cache Key Change: Changed
byteBuddyInvokerConstructorCachefromMap<Class<?>, Constructor<?>>toMap<DoFnSignature, Constructor<?>>Updated
getByteBuddyInvokerConstructor: Now usessignatureas the cache key instead offnClassTest Case: Added
testGenericDoFnCacheKeyingWithAnonymousSubclassesto verify correct behavior[Gittensor contribution - GlobalStar117]