[Bug]: cloudpickle overwrites class states every time loading a same object of dynamic class #35062

@baeminbo

What happened?

If an object of a dynamic class is serialized and then deserialized multiple times through cloudpickle's dump and load, each subsequent deserialization overwrites the class state of the class shared by the previously deserialized objects.

This behavior was confirmed with pickle_dump.py and pickle_load.py. The dump script serializes an object of a dynamic DoFn class defined within a function. The load script then deserializes these bytes twice, and prints the original function information of the method on_window_end_timer for the first object.

The result shows that the methods in the class were changed.

I believe this is because cloudpickle reuses the class in _lookup_class_or_track for the same class tracker id, but _class_setstate always updates the class state even when the class is a reused one. Note that _class_setstate is returned as the 6th tuple item of reducer_override, and it is called at pickle load time instead of __setstate__. See reducer_override and __reduce__.
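The suspected mechanism can be illustrated with a small stdlib-only model. This is a sketch under stated assumptions, not cloudpickle's actual code: `lookup_class_or_track`, `class_setstate`, `load`, and the `"tracker-1"` id are hypothetical stand-ins for the tracker lookup and the state setter described above. It only shows how reusing a class while always reapplying its state replaces the method seen by an already loaded object.

```python
# Hypothetical registry standing in for a class tracker keyed by tracker id.
_CLASS_BY_TRACKER_ID = {}


def lookup_class_or_track(tracker_id, new_class):
    """Return the class already registered under tracker_id, else register new_class."""
    return _CLASS_BY_TRACKER_ID.setdefault(tracker_id, new_class)


def class_setstate(cls, state):
    """Copy every attribute in state onto cls, even when cls is a reused class."""
    for name, value in state.items():
        setattr(cls, name, value)
    return cls


def make_method():
    """Build a fresh function object, as rebuilding a dynamic function would."""
    def on_window_end_timer(self):
        return "fired"
    return on_window_end_timer


def load(tracker_id):
    """Model one deserialization of an instance of the dynamic class."""
    skeleton = type("DynamicDoFn", (), {})
    cls = lookup_class_or_track(tracker_id, skeleton)
    class_setstate(cls, {"on_window_end_timer": make_method()})
    return cls()


first = load("tracker-1")
method_before = type(first).on_window_end_timer

second = load("tracker-1")
method_after = type(first).on_window_end_timer

same_class = type(first) is type(second)
method_replaced = method_before is not method_after
print(same_class, method_replaced)
```

In this model both loads share one class, and the second load swaps the function object behind `on_window_end_timer` for the first object as well. Anything that keyed on the original function object would no longer find it.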

IIUC, this issue can cause unexpected KeyError or ValueError with TimerSpec in Dataflow Python jobs with Apache Beam 2.65.0.

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
