
Pickles

Jim Fulton edited this page Mar 19, 2013 · 12 revisions

Pickle interoperability between Python 2 and Python 3

It's useful to be able to support accessing databases from both Python 2 and Python 3 because:

  • You may have multiple applications accessing a database, or multiple installations of the same application that are moved to Python 3 at different times. Supporting both Python versions will make the transition much easier.
  • Eventually, ZODB may support other languages, especially Javascript. It would be a shame if we could support Javascript but not Python 2.
  • Some ZODB users have massive databases which cannot easily (or even realistically) go through a migration process before moving to Python 3.

Issues

  1. Python 3 uses different pickle opcodes than Python 2. In particular, when Python 3 loads the Python 2 string opcodes (STRING, BINSTRING, and SHORT_BINSTRING), it DWIMily interprets them as text in some encoding. Python 3 saves bytes with Python 3-specific opcodes (BINBYTES and SHORT_BINBYTES).
  2. Names (attribute and global) are unicode in Python 3 but bytes in Python 2.
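The first issue can be seen with the standard library alone. The sketch below assumes Python 3 and uses a hand-written protocol 2 pickle of the Python 2 str 'bar' (the constant PY2_STR_PICKLE and the helper opcode_names are illustrative names, not part of any library). It lists the opcodes on each side and shows how the encoding argument of pickle.loads changes what a Python 2 string becomes.

```python
import pickle
import pickletools

# Hand-written protocol 2 pickle of the Python 2 str 'bar':
# PROTO 2, SHORT_BINSTRING 'bar', BINPUT 0, STOP.
PY2_STR_PICKLE = b'\x80\x02U\x03barq\x00.'


def opcode_names(data):
    """Return the names of the pickle opcodes found in data."""
    return [op.name for op, arg, pos in pickletools.genops(data)]


# Opcodes in the Python 2 string pickle.
py2_ops = opcode_names(PY2_STR_PICKLE)

# Opcodes Python 3 emits for bytes at protocol 3.
py3_ops = opcode_names(pickle.dumps(b'bar', protocol=3))

# With the default encoding the Python 2 string is decoded to text.
as_text = pickle.loads(PY2_STR_PICKLE)

# With encoding='bytes' the same pickle is loaded as bytes instead.
as_bytes = pickle.loads(PY2_STR_PICKLE, encoding='bytes')

print(py2_ops, py3_ops, repr(as_text), repr(as_bytes))
```

Neither setting of encoding is right for every string in a database: the default guesses that everything is text, and 'bytes' turns names into bytes as well.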

Proposals

Python 2 pickle with name conversion

Using a forked version of pickle, zodbpickle, read and store byte data with the Python 2 opcodes. Fix up names when necessary in Python 3.

  • When finding globals or setting instance state, convert byte names to unicode using an ascii encoding.

    We can only fix up attribute names when no custom __setstate__ is used, so this is only a partial solution. Applications with custom __setstate__ methods may not be interoperable across Python versions or may need to be modified.

  • Note that Python 2 attributes can be stored as unicode. (They can only be accessed with attribute notation if they're ASCII.)
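A minimal sketch of the name fix-up for the default (no custom __setstate__) case, assuming Python 3. The helpers decode_names and set_default_state and the class Account are hypothetical names used only for illustration; they are not zodbpickle APIs. Only the keys of the state dictionary are decoded; values, which may be true binary data, are left alone.

```python
def decode_names(state, encoding='ascii'):
    """Return a copy of an instance-state dict with bytes keys decoded.

    Keys that are already text (Python 2 can store unicode attribute
    names) are kept as they are. A non-ASCII bytes key raises
    UnicodeDecodeError rather than being guessed at.
    """
    fixed = {}
    for key, value in state.items():
        if isinstance(key, bytes):
            key = key.decode(encoding)
        fixed[key] = value
    return fixed


class Account(object):
    """Stand-in for a persistent class without a custom __setstate__."""


def set_default_state(obj, state):
    """Apply unpickled state the way the default __setstate__ would."""
    obj.__dict__.update(decode_names(state))


acct = Account()
# State as it might arrive from a Python 2 pickle loaded as bytes.
set_default_state(acct, {b'owner': b'\x00\x01', u'balance': 10})
print(acct.owner, acct.balance)
```

A class that defines its own __setstate__ receives the raw state, so a hook like this never gets the chance to fix the names, which is why the approach is only partial.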

Issues:

  • Cookie.Morsel is a dict subclass that has unicode keys on Python 3, but byte keys in Python 2. Reading a Python 2 Morsel pickle in Python 3 requires the byte->unicode DWIM.

    This is a case we can't handle.

  • People often use "names" (aka "native strings") for dictionary keys. If we want interoperability between Python 2 and Python 3, then it would be bad if, in Python 2, a user did:

    >>> foo['bar'] = 1
    

    and then in Python 3:

    >>> foo['bar']
    

    raised a KeyError. After discussing this a bit, we think this may be fatal.
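The dictionary-key failure can be reproduced on Python 3 with the standard pickle module. The sketch assumes that Python 2 strings are loaded as bytes, as this proposal implies, and uses a hand-written protocol 2 pickle of the Python 2 dictionary {'bar': 1} (PY2_DICT_PICKLE is an illustrative name).

```python
import pickle

# Hand-written protocol 2 pickle of the Python 2 dict {'bar': 1}:
# PROTO 2, EMPTY_DICT, BINPUT 0, SHORT_BINSTRING 'bar', BINPUT 1,
# BININT1 1, SETITEM, STOP.
PY2_DICT_PICKLE = b'\x80\x02}q\x00U\x03barq\x01K\x01s.'

# Keep Python 2 strings as bytes when loading on Python 3.
foo = pickle.loads(PY2_DICT_PICKLE, encoding='bytes')

try:
    foo['bar']          # native-string lookup, as in the example above
    found = True
except KeyError:
    found = False       # the key was stored as b'bar', not 'bar'

print(foo, found)
```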

Explicit binary for Python 2

In this option, we create an explicit binary type for Python 2, probably as a subclass of str. We'll define the type in Python 3 as an alias for bytes.

Application authors will need to analyze their applications and replace true binary strings with instances of this new type. (This includes object ids and tids.)

In Python 2, we fork pickle and cPickle and add support for protocol 3 such that the new pickle byte codes for bytes are used for the new Python 2 binary type.

In Python 3, we'll just use bytes.
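A rough sketch of what such a type could look like, assuming the name binary (chosen here for illustration) and a version check at import time. The pickling part only runs on Python 3, because the stock Python 2 pickle has no protocol 3; on Python 2 the forked pickle described above would be needed to emit the bytes opcodes for this type.

```python
import pickle
import pickletools
import sys

PY3 = sys.version_info[0] >= 3

if PY3:
    # On Python 3 the explicit binary type is just bytes.
    binary = bytes
else:
    class binary(str):
        """Marker type for true binary data on Python 2."""
        __slots__ = ()

# Application code marks true binary values, such as object ids.
oid = binary(b'\x00' * 7 + b'\x01')

if PY3:
    # Python 3 already writes bytes with the protocol 3 bytes opcodes.
    data = pickle.dumps(oid, protocol=3)
    ops = [op.name for op, arg, pos in pickletools.genops(data)]
    print(ops)
```

Plain str values in Python 2 would keep their existing opcodes, so names and dictionary keys need no special handling.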

Pros:

  • We don't need to fork the Python 3 pickle code. We'll still need a noload() implementation, but not because of string issues. We can probably do this via subclassing, which should be much less of a maintenance burden.
  • We don't need to worry about trying to spot names used for attribute names and dictionary keys.
  • We'll be explicit about what's binary and what's unicode.

Cons:

  • We fork Python 2's pickle and cPickle, but these are unlikely to change, so there's far less risk than forking the Python 3 versions.
  • Application developers need to use the new binary type. Some developer action will be required no matter what we do.