Description
What happened?
If an object of a dynamic class is serialized and then deserialized multiple times through cloudpickle's dump and load, each subsequent deserialization overwrites the state of the class shared by the already deserialized objects.
This behavior was confirmed with pickle_dump.py and pickle_load.py. The dump script serializes an object of a dynamic DoFn class defined within a function. The load script then deserializes these bytes twice, and prints the original function information of the method on_window_end_timer for the first object.
The result shows that the methods in the class were changed.
I believe this is because cloudpickle reuses the class in _lookup_class_or_track for the same class tracker id, but _class_setstate always updates the class state, even when the class is a reused one. Note that _class_setstate is returned as the 6th item of the tuple from reducer_override and is called at pickle load time instead of __setstate__. See reducer_override and __reduce__.
IIUC, this issue can cause unexpected KeyError or ValueError with TimerSpec in Dataflow Python jobs with Apache Beam 2.65.0.
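The suspected mechanism can be illustrated with a small standard-library sketch. This is not cloudpickle's code: `_registry`, `lookup_class_or_track`, `class_setstate`, `load_class`, and `make_state` are hypothetical stand-ins that only mimic the behavior described above (reuse the class by tracker id, then set state unconditionally).

```python
# Simplified stand-in for the mechanism described above; not cloudpickle's real code.
_registry = {}  # tracker id -> class, playing the role of the class tracker


def lookup_class_or_track(tracker_id, cls):
    """Return the class already registered under tracker_id, else register cls."""
    return _registry.setdefault(tracker_id, cls)


def class_setstate(cls, state):
    """Unconditionally copy state onto cls, even if cls is a reused class."""
    for name, value in state.items():
        setattr(cls, name, value)
    return cls


def make_state():
    """Rebuild the class attributes, as each load would; returns new function objects."""
    def on_window_end_timer(self):
        return "fired"
    return {"on_window_end_timer": on_window_end_timer}


def load_class(tracker_id):
    """Mimic one deserialization: build a skeleton, resolve it, then set state."""
    skeleton = type("DynamicDoFn", (), {})
    cls = lookup_class_or_track(tracker_id, skeleton)
    return class_setstate(cls, make_state())


first_cls = load_class("tracker-1")
method_before = first_cls.on_window_end_timer

second_cls = load_class("tracker-1")  # second load with the same tracker id
method_after = first_cls.on_window_end_timer

same_class = first_cls is second_cls
method_replaced = method_before is not method_after
print(same_class, method_replaced)
```

In this sketch both flags are true: the second load returns the same class object, yet the method seen through the first class is now a different function object. Anything that captured the earlier function object or values derived from it would no longer match the class, which is consistent with the behavior reported here.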
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components
- Component: Python SDK
- Component: Java SDK
- Component: Go SDK
- Component: Typescript SDK
- Component: IO connector
- Component: Beam YAML
- Component: Beam examples
- Component: Beam playground
- Component: Beam katas
- Component: Website
- Component: Infrastructure
- Component: Spark Runner
- Component: Flink Runner
- Component: Samza Runner
- Component: Twister2 Runner
- Component: Hazelcast Jet Runner
- Component: Google Cloud Dataflow Runner