-
-
Notifications
You must be signed in to change notification settings - Fork 31.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bpo-31650: PEP 552 (Deterministic pycs) implementation #4575
Conversation
780bf48
to
cd28db3
Compare
Doc/library/py_compile.rst
Outdated
against the source at runtime to determine if the pyc needs to be | ||
regenerated. | ||
|
||
.. attribute:: UNCHECKED_HASH |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The text below merely says "a hash is included but not validated" but makes no mention of what is validated in this situation. the pyc-invalidation docs cover this in more detail. It seems worth repeating that this just means a hash is generated when generating a pyc but is not checked by default here.
It'd also be good to include a statement of when this might be useful (packaging distribution installed pre-generated pyc files where something else is guaranteeing they cannot go out of sync for example?). Though perhaps that description belongs in import.rst's pyc-invalidation section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The text below merely says "a hash is included but not validated" but makes no mention of what is validated in this situation. the pyc-invalidation docs cover this in more detail. It seems worth repeating that this just means a hash is generated when generating a pyc but is not checked by default here.
Yeah, I was trying to funnel everything into import.rst
. I will beef up this part a big, though.
It'd also be good to include a statement of when this might be useful (packaging distribution installed pre-generated pyc files where something else is guaranteeing they cannot go out of sync for example?). Though perhaps that description belongs in import.rst's pyc-invalidation section.
I put some commentary about why you might want this in the what's new? entry. Maybe, it should go somewhere more "permanent", too, though.
@@ -266,23 +284,29 @@ def main(): | |||
if args.workers is not None: | |||
args.workers = args.workers or None | |||
|
|||
ivl_mode = args.invalidation_mode.replace('-', '_').upper() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you use the names in choices=
above to generate the mode name, perhaps use a similar reverse transformation to generate a (sorted/stable-ordered) list of names in the choices tuple above?
choices=sorted(mode.lower().replace('_', '-') for mode in args.invalidation_mode),
or similar?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sgtm
def _validate_bytecode_header(data, source_stats=None, name=None, path=None): | ||
"""Validate the header of the passed-in bytecode against source_stats (if | ||
given) and returning the bytecode that can be compiled by compile(). | ||
def _classify_pyc(data, name, exc_details): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you document what the arguments are?
name
is not obvious - could it be given a better variable name? is it the pyc file's basename or the full pathname?
exc_details
is unobvious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A similar comment applies to all other new function def's below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do
Lib/importlib/_bootstrap_external.py
Outdated
return flags | ||
|
||
|
||
def _validate_timestamp_pyc(data, source_stats, name, exc_details): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems suspicious that this raises no error when the source_stats
dict-like object fails to contain mtime
or size
members.
If this is intentional, document why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'll clean this up.
file.""" | ||
|
||
def _code_to_timestamp_pyc(code, mtime=0, source_size=0): | ||
"Produce the data for a timestamp-based pyc." | ||
data = bytearray(MAGIC_NUMBER) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is unfortunate that the constant is called MAGIC_NUMBER
when in fact it is a bytes object containing the binary magic data. Can it be renamed as this is "just" _bootstrap_external or would that break some API somewhere?
bytearray(MAGIC_NUMBER)
looks suspicious if a reader naively assumes that MAGIC_NUMBER is an integer...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it turns out the integer version is actually called _RAW_MAGIC_NUMBER
, which a priori sounds more likely to be a bytearray. oops
At any rate, I think it's correct that this constant is exposed as some bytes not a number, since the only valid operation is to write it into a file or compare it to some bytes from a file.
Lib/importlib/_bootstrap_external.py
Outdated
source_stats=st, name=fullname, | ||
path=bytecode_path) | ||
flags = _classify_pyc(data, fullname, exc_details) | ||
bytes_data = data[16:] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using memoryview(data)[16:]
to avoid a large copy assuming _compile_bytecode is happy with that (it appears to be, marshal.loads()
accepts memoryview
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sgtm
Lib/importlib/_bootstrap_external.py
Outdated
'path': path, | ||
} | ||
_classify_pyc(data, fullname, exc_details) | ||
return _compile_bytecode(data[16:], name=fullname, bytecode_path=path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
memoryview(data)[16:]
if you also do this in the other place i made this comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do
@@ -5,18 +5,25 @@ | |||
from ._bootstrap import spec_from_loader | |||
from ._bootstrap import _find_spec | |||
from ._bootstrap_external import MAGIC_NUMBER |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a public API exposing MAGIC_NUMBER
so it's probably best to ignore my previous comment about changing the name of the thing to avoid confusion over its type.
Lib/modulefinder.py
Outdated
except ImportError as exc: | ||
self.msgout(2, "raise ImportError: " + str(exc), pathname) | ||
raise | ||
co = marshal.loads(marshal_data) | ||
co = marshal.loads(data[16:]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
memoryview(data)[16:]
Lib/py_compile.py
Outdated
bytecode = importlib._bootstrap_external._code_to_hash_pyc( | ||
code, | ||
source_hash, | ||
invalidation_mode == PycInvalidationMode.CHECKED_HASH, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A ==
test to get a boolean value in a list of args is easy to misread as a typo for a keyword argument being passed. Disambiguate by adding ()s around it or passing this by name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comments! Update incoming.
cd28db3
to
737f925
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty much all of my comments are minor, so no need to have to double-check any changes made based on my feedback.
Doc/library/compileall.rst
Outdated
.. cmdoption:: --invalidation-mode [timestamp|checked-hash|unchecked-hash] | ||
|
||
Control how the generated pycs will be invalidated at runtime. The default | ||
setting, ``timestamp``, means that pyc files with the source timestamp and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just wanted to double-check that we are referring to the files as "pyc" and "pycs" throughout the rest of the docs instead of ".pyc
files" or something? Basically just checking we're being consistent.
Doc/library/importlib.rst
Outdated
@@ -1327,6 +1330,14 @@ an :term:`importer`. | |||
.. versionchanged:: 3.6 | |||
Accepts a :term:`path-like object`. | |||
|
|||
.. function:: source_hash(source_bytes) | |||
|
|||
Return the hash of *source_bytes* as byte string. A hash-based pyc embeds the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"as byte string" -> "as bytes".
Doc/reference/import.rst
Outdated
metadata. | ||
|
||
Python also supports "hash-based" cache files, which store a hash of a source | ||
file contents rather than its metadata. There are two variants of hash-based |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"source file contents" -> "source file's contents"
Doc/reference/import.rst
Outdated
hash in the cache file. If a checked hash-based cache file is found to be | ||
invalid, Python regenerates it and writes a new checked hash-based cache | ||
file. For unchecked hash-based pycs, Python simply assumes the cache file is | ||
valid if it exists. Hash-based pyc validation behavior may be override with the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"override" -> "overridden"
Lib/importlib/_bootstrap_external.py
Outdated
elif len(raw_timestamp) != 4: | ||
message = 'reached EOF while reading timestamp in {!r}'.format(name) | ||
if len(data) < 16: | ||
message = 'reached EOF while reading pyc header of {!r}'.format(name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might as well change this to an f-string.
Lib/importlib/_bootstrap_external.py
Outdated
flags = _r_long(data[4:8]) | ||
# Only the first two flags are defined. | ||
if flags & ~0b11: | ||
raise ImportError('invalid flags {!r} in {!r}'.format(flags, name), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
f-string
Lib/importlib/_bootstrap_external.py
Outdated
|
||
""" | ||
if _r_long(data[8:12]) != (source_mtime & 0xFFFFFFFF): | ||
message = 'bytecode is stale for {!r}'.format(name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
f-string
Lib/importlib/_bootstrap_external.py
Outdated
raise ImportError(message, **exc_details) | ||
if (source_size is not None and | ||
_r_long(data[12:16]) != (source_size & 0xFFFFFFFF)): | ||
raise ImportError('bytecode is stale for {!r}'.format(name), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
f-string
del_source=del_source) | ||
test('_temp', mapping, bc_path) | ||
|
||
def _test_no_marshal(self, *, del_source=False): | ||
with util.create_modules('_temp') as mapping: | ||
bc_path = self.manipulate_bytecode('_temp', mapping, | ||
lambda bc: bc[:12], | ||
lambda bc: bc[:16], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should have made these constants instead of using a literal when I originally wrote this code. Sorry about that. :(
Python/import.c
Outdated
@@ -2216,6 +2249,14 @@ PyInit_imp(void) | |||
d = PyModule_GetDict(m); | |||
if (d == NULL) | |||
goto failure; | |||
PyObject *pyc_mode = PyUnicode_FromString(_Py_CheckHashBasedPycsMode); | |||
if (pyc_mode == NULL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing curly braces.
737f925
to
b3a7070
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some minor comments to address but overall i believe this is good to go in when you are ready.
thanks!
Modules/main.c
Outdated
@@ -98,6 +104,8 @@ static const char usage_3[] = "\ | |||
also PYTHONWARNINGS=arg\n\ | |||
-x : skip first line of source, allowing use of non-Unix forms of #!cmd\n\ | |||
-X opt : set implementation-specific option\n\ | |||
--check-hash-based-pycs always|default|never:\n\ | |||
control how Python invalidates hash-based .pcy files\n\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.pyc files
Modules/perfmap.c
Outdated
}; | ||
|
||
PyDoc_STRVAR(module_doc, | ||
"perfmap module."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where'd this file come from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
eek, now you know what crud I have sitting about in my checkout.
if (flags != 0) { | ||
// Hash-based pyc. We currently refuse to handle checked hash-based | ||
// pycs. We could validate hash-based pycs against the source, but it | ||
// seems likely that most people putting hash-based pycs in a zipfile |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 agreed!
Python now supports checking bytecode cache up-to-dateness with a hash of the source contents rather than volatile source metadata. See the PEP for details. While a fairly straightforward idea, quite a lot of code had to be modified due to the pervasiveness of pyc implementation details in the codebase. Changes in this commit include: - The core changes to importlib to understand how to read, validate, and regenerate hash-based pycs. - Support for generating hash-based pycs in py_compile and compileall. - Modifications to our siphash implementation to support passing a custom key. We then expose it to importlib through _imp. - Updates to all places in the interpreter, standard library, and tests that manually generate or parse pyc files to grok the new format. - Support in the interpreter command line code for long options like --check-hash-based-pycs. - Tests and documentation for all of the above.
e5d54da
to
f498a91
Compare
Thanks again for the reviews, Brett and Greg! |
Python now supports checking bytecode cache up-to-dateness with a hash of the
source contents rather than volatile source metadata. See the PEP for details.
While a fairly straightforward idea, quite a lot of code had to be modified due
to the pervasiveness of pyc implementation details in the codebase. Changes in
this commit include:
The core changes to importlib to understand how to read, validate, and
regenerate hash-based pycs.
Support for generating hash-based pycs in py_compile and compileall.
Modifications to our siphash implementation to support passing a custom
key. We then expose it to importlib through _imp.
Updates to all places in the interpreter, standard library, and tests that
manually generate or parse pyc files to grok the new format.
Support in the interpreter command line code for long options like
--check-hash-based-pycs.
Tests and documentation for all of the above.
https://bugs.python.org/issue31650