Description
In Python 3.11 and 3.12, the hash function for a `PyCodeObject` (`code_hash()`) no longer hashes the bytecode. I assume this is because the specializing adaptive interpreter can change the bytecode however it likes, which means the bytecode is no longer immutable, which means you can't rely on it not changing, which means you can't use it as part of calculating the hash. Fair enough.
But this means that the hash of a code object no longer factors in its most distinctive component. Currently, in 3.11 and 3.12, the hash of a code object is calculated using:
- the unqualified name of the callable
- the code object's const table (a tuple of immutable objects)
- the code object's tuple of externally-referenced names (globals, nonlocals)
- the code object's tuple of locally-referenced names (parameters, local variables, and closures)
- the total number of arguments
- the count of positional-only arguments
- the count of keyword-only arguments
- the "flags" for this code object
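As a rough illustration, those inputs can be approximated from the code object's Python-level attributes. This is only a sketch: the real `code_hash()` is C code and may combine the values differently. I'm assuming the closure-related names map onto `co_freevars` and `co_cellvars`.

```python
# Sketch only: approximate the 3.11/3.12 hash inputs using the code
# object's Python-level attributes. This just shows which inputs are
# (and aren't) in play; it is not CPython's actual hash computation.
def approximate_code_hash(code):
    return hash((
        code.co_name,             # unqualified name of the callable
        code.co_consts,           # const table
        code.co_names,            # externally-referenced names
        code.co_varnames,         # parameters and local variables
        code.co_freevars,         # closure variables referenced here
        code.co_cellvars,         # locals referenced by inner closures
        code.co_argcount,         # total number of arguments
        code.co_posonlyargcount,  # count of positional-only arguments
        code.co_kwonlyargcount,   # count of keyword-only arguments
        code.co_flags,            # the code object's flags
    ))

# Two functions with different bytecode but identical values for every
# input above (both are named '<lambda>'):
f, g = (lambda a, b: a + b), (lambda a, b: a * b)
print(f.__code__.co_code == g.__code__.co_code)                                # False
print(approximate_code_hash(f.__code__) == approximate_code_hash(g.__code__))  # True
```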
This means it's not hard to construct code objects with identical hashes but different bytecode. For example:
```python
class A:
    def method(self, a, b):
        return a + b

class B:
    def method(self, a, b):
        return a * b

class C:
    def method(self, a, b):
        return a / b

for cls in (A, B, C):
    o = cls()
    print(o.method(3, 5))
    print(hex(hash(cls.method.__code__)))
```
The hashes for `A.method.__code__`, `B.method.__code__`, and `C.method.__code__` are different in Python 3.10, but identical in Python 3.11b3, and presumably in trunk as well.
Is this scenario realistic? I don't know. Certainly I've never seen it. But it's at least plausible; a base class could have a tiny function that only does a little work, and a subclass could override that function and tweak the work done slightly, without relying on additional external names / closures or employing a different list of locals. It's certainly not impossible.
Obviously this is low priority; there's very little code that hashes code objects. It might cause some collisions and probing when marshal dumps a module exhibiting this behavior, because marshal maintains a hash table of the objects it writes. That's the worst side effect I can think of.
We could mitigate this slightly by factoring more values into the code object's hash. Three come immediately to mind:
- the filename (`co_filename`)
- the first line number (`co_firstlineno`)
- the first column number (which I'm guessing is encoded in `co_positions` somehow)
With only the first two, you'd still have hash collisions from code objects defined using lambdas all on the same line:
```python
a, b, c = (lambda a, b: a + b), (lambda a, b: a * b), (lambda a, b: a / b)

for l in (a, b, c):
    print(l(3, 5))
    print(hex(hash(l.__code__)))
```
As unlikely as it is that someone would stumble over this scenario, it's even less likely that it would cause a problem. But decreasing the likelihood of hash value collisions seems so wholesome and good, I think it's worth pursuing.