
Specializing adaptive interpreter code object hashes are less unique #94155

Open
@larryhastings

Description


In Python 3.11 and 3.12, the hash function for a PyCodeObject (code_hash()) no longer hashes the bytecode. I assume this is because the specializing adaptive interpreter can change the bytecode however it likes, which means the bytecode is no longer immutable, which means you can't rely on it not changing, which means you can't use it as part of calculating the hash. Fair enough.
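That mutability is easy to observe indirectly: whatever the adaptive interpreter does to the bytecode under the hood, the hash of the code object must stay stable. A quick self-check (the call count needed to trigger specialization is internal to the interpreter; 10,000 is just a generous guess):

```python
def add(a, b):
    return a + b

h_before = hash(add.__code__)

# Warm the function up so the specializing adaptive interpreter has a
# chance to rewrite its bytecode in place.
for _ in range(10_000):
    add(1, 2)

h_after = hash(add.__code__)

# Whatever the interpreter did internally, the hash is unchanged,
# which is exactly why code_hash() can no longer look at the bytecode.
assert h_before == h_after
```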

But this means that the hash of a code object no longer factors in the most unique part of a code object. Currently in 3.11 and 3.12 the hash of a code object is calculated using:

  • the unqualified name of the callable
  • the code object's const table (a tuple of immutable objects)
  • the code object's tuple of externally-referenced names (globals, nonlocals)
  • the code object's tuple of locally-referenced names (parameters, local variables, and closures)
  • the total number of arguments
  • the count of positional-only arguments
  • the count of keyword-only arguments
  • the "flags" for this code object

Which means it's not hard to construct code objects with identical hashes but different bytecode. For example:

class A:
    def method(self, a, b):
        return a + b

class B:
    def method(self, a, b):
        return a * b

class C:
    def method(self, a, b):
        return a / b


for cls in (A, B, C):
    o = cls()
    print(o.method(3, 5))
    print(hex(hash(cls.method.__code__)))

The hashes for A.method.__code__, B.method.__code__, and C.method.__code__ are different in Python 3.10, but identical in Python 3.11b3, and presumably in trunk as well.
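The collision can also be reproduced by combining the listed fields by hand. This is an illustration only, not the real algorithm (the actual code_hash() is in C and mixes these values differently), but it shows that every hashed input is identical for two such methods:

```python
def approx_code_hash(code):
    # Illustration only: combines the same inputs as the 3.11/3.12
    # code_hash(), but not with the real C mixing algorithm.
    return hash((
        code.co_name,             # unqualified name of the callable
        code.co_consts,           # const table
        code.co_names,            # externally referenced names
        code.co_varnames,         # locally referenced names
        code.co_argcount,         # total number of arguments
        code.co_posonlyargcount,  # positional-only argument count
        code.co_kwonlyargcount,   # keyword-only argument count
        code.co_flags,            # code object flags
    ))

class A:
    def method(self, a, b):
        return a + b

class B:
    def method(self, a, b):
        return a * b

# Different bytecode, but every hashed input is identical, so the
# approximation collides just like the real hashes do.
assert approx_code_hash(A.method.__code__) == approx_code_hash(B.method.__code__)
```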

Is this scenario realistic? I don't know. Certainly I've never seen it. But it's at least plausible; a base class could have a tiny function that only does a little work, and a subclass could override that function and tweak the work done slightly, without relying on additional external names / closures or employing a different list of locals. It's certainly not impossible.

Obviously this is low priority--there's very little code that hashes code objects. It might cause some collisions and probing when marshal dumps a module exhibiting this behavior, because marshal maintains a hash table of the objects it writes. That's the worst side-effect I can think up.
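For context, marshal is the serializer behind .pyc files, and while writing it keeps a hash table of objects it has already emitted; colliding code-object hashes therefore only cost extra probing, never incorrect output. A minimal round-trip through marshal:

```python
import marshal
import types

def f(a, b):
    return a + b

# Serialize the code object the same way .pyc files do, then rebuild
# a working function from the deserialized code object.
payload = marshal.dumps(f.__code__)
restored = types.FunctionType(marshal.loads(payload), {})

assert restored(3, 5) == 8
```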

We could mitigate this slightly by factoring more values into the code object's hash. Three come immediately to mind:

  • the filename (co_filename)
  • the first line number (co_firstlineno)
  • the first column number (which I'm guessing is encoded in co_positions somehow)
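Either of the first two already separates the class-based example above: the methods share a filename but each starts on its own line. A quick check:

```python
class A:
    def method(self, a, b):
        return a + b

class B:
    def method(self, a, b):
        return a * b

code_a = A.method.__code__
code_b = B.method.__code__

# Same file, different starting lines: folding co_firstlineno into
# the hash would already tell these two code objects apart.
assert code_a.co_filename == code_b.co_filename
assert code_a.co_firstlineno != code_b.co_firstlineno
```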

With only the first two, you'd still have hash collisions from code objects defined using lambdas all on the same line:

a, b, c = (lambda a, b: a + b), (lambda a, b: a * b), (lambda a, b: a / b)

for l in (a, b, c):
    print(l(3, 5))
    print(hex(hash(l.__code__)))
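For lambdas on one line, both of the first two proposed inputs coincide; the column position is the only location data left to tell them apart. On 3.11+ the per-instruction positions (including columns) are exposed via co_positions(), which yields (lineno, end_lineno, col, end_col) tuples:

```python
a, b = (lambda a, b: a + b), (lambda a, b: a * b)

# Same file, same starting line: the first two proposed hash inputs collide.
assert a.__code__.co_filename == b.__code__.co_filename
assert a.__code__.co_firstlineno == b.__code__.co_firstlineno

# co_positions() (3.11+) exposes column information per instruction;
# the first entry's column reflects where each lambda starts on the line.
if hasattr(a.__code__, "co_positions"):
    col_a = next(iter(a.__code__.co_positions()))[2]
    col_b = next(iter(b.__code__.co_positions()))[2]
    print(col_a, col_b)
```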

As unlikely as it is that someone would stumble over this scenario, it's even less likely that it would cause a problem. But decreasing the likelihood of hash value collisions seems so wholesome and good, I think it's worth pursuing.

    Labels

  • 3.11 (only security fixes)
  • 3.12 (bugs and security fixes)
  • interpreter-core (Objects, Python, Grammar, and Parser dirs)
  • type-bug (an unexpected behavior, bug, or error)
