Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-31650: PEP 552 (Deterministic pycs) implementation #4575

Merged
merged 14 commits into from
Dec 9, 2017

Conversation

benjaminp
Copy link
Contributor

@benjaminp benjaminp commented Nov 26, 2017

Python now supports checking bytecode cache up-to-dateness with a hash of the
source contents rather than volatile source metadata. See the PEP for details.

While a fairly straightforward idea, quite a lot of code had to be modified due
to the pervasiveness of pyc implementation details in the codebase. Changes in
this commit include:

  • The core changes to importlib to understand how to read, validate, and
    regenerate hash-based pycs.

  • Support for generating hash-based pycs in py_compile and compileall.

  • Modifications to our siphash implementation to support passing a custom
    key. We then expose it to importlib through _imp.

  • Updates to all places in the interpreter, standard library, and tests that
    manually generate or parse pyc files to grok the new format.

  • Support in the interpreter command line code for long options like
    --check-hash-based-pycs.

  • Tests and documentation for all of the above.

https://bugs.python.org/issue31650

@benjaminp benjaminp requested a review from a team November 26, 2017 22:39
@benjaminp benjaminp force-pushed the benjamin-pep-552 branch 4 times, most recently from 780bf48 to cd28db3 Compare November 27, 2017 07:31
against the source at runtime to determine if the pyc needs to be
regenerated.

.. attribute:: UNCHECKED_HASH
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The text below merely says "a hash is included but not validated" but makes no mention of what is validated in this situation. the pyc-invalidation docs cover this in more detail. It seems worth repeating that this just means a hash is generated when generating a pyc but is not checked by default here.

It'd also be good to include a statement of when this might be useful (packaging distribution installed pre-generated pyc files where something else is guaranteeing they cannot go out of sync for example?). Though perhaps that description belongs in import.rst's pyc-invalidation section.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The text below merely says "a hash is included but not validated" but makes no mention of what is validated in this situation. the pyc-invalidation docs cover this in more detail. It seems worth repeating that this just means a hash is generated when generating a pyc but is not checked by default here.

Yeah, I was trying to funnel everything into import.rst. I will beef up this part a big, though.

It'd also be good to include a statement of when this might be useful (packaging distribution installed pre-generated pyc files where something else is guaranteeing they cannot go out of sync for example?). Though perhaps that description belongs in import.rst's pyc-invalidation section.

I put some commentary about why you might want this in the what's new? entry. Maybe, it should go somewhere more "permanent", too, though.

@@ -266,23 +284,29 @@ def main():
if args.workers is not None:
args.workers = args.workers or None

ivl_mode = args.invalidation_mode.replace('-', '_').upper()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you use the names in choices= above to generate the mode name, perhaps use a similar reverse transformation to generate a (sorted/stable-ordered) list of names in the choices tuple above?

choices=sorted(mode.lower().replace('_', '-') for mode in args.invalidation_mode),

or similar?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sgtm

def _validate_bytecode_header(data, source_stats=None, name=None, path=None):
"""Validate the header of the passed-in bytecode against source_stats (if
given) and returning the bytecode that can be compiled by compile().
def _classify_pyc(data, name, exc_details):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you document what the arguments are?
name is not obvious - could it be given a better variable name? is it the pyc file's basename or the full pathname?
exc_details is unobvious.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A similar comment applies to all other new function def's below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do

return flags


def _validate_timestamp_pyc(data, source_stats, name, exc_details):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems suspicious that this raises no error when the source_stats dict-like object fails to contain mtime or size members.

If this is intentional, document why.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'll clean this up.

file."""

def _code_to_timestamp_pyc(code, mtime=0, source_size=0):
"Produce the data for a timestamp-based pyc."
data = bytearray(MAGIC_NUMBER)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is unfortunate that the constant is called MAGIC_NUMBER when in fact it is a bytes object containing the binary magic data. Can it be renamed as this is "just" _bootstrap_external or would that break some API somewhere?

bytearray(MAGIC_NUMBER) looks suspicious if a reader naively assumes that MAGIC_NUMBER is an integer...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it turns out the integer version is actually called _RAW_MAGIC_NUMBER, which a priori sounds more likely to be a bytearray. oops

At any rate, I think it's correct that this constant is exposed as some bytes not a number, since the only valid operation is to write it into a file or compare it to some bytes from a file.

source_stats=st, name=fullname,
path=bytecode_path)
flags = _classify_pyc(data, fullname, exc_details)
bytes_data = data[16:]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using memoryview(data)[16:] to avoid a large copy assuming _compile_bytecode is happy with that (it appears to be, marshal.loads() accepts memoryview)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sgtm

'path': path,
}
_classify_pyc(data, fullname, exc_details)
return _compile_bytecode(data[16:], name=fullname, bytecode_path=path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

memoryview(data)[16:] if you also do this in the other place i made this comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do

@@ -5,18 +5,25 @@
from ._bootstrap import spec_from_loader
from ._bootstrap import _find_spec
from ._bootstrap_external import MAGIC_NUMBER
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a public API exposing MAGIC_NUMBER so it's probably best to ignore my previous comment about changing the name of the thing to avoid confusion over its type.

except ImportError as exc:
self.msgout(2, "raise ImportError: " + str(exc), pathname)
raise
co = marshal.loads(marshal_data)
co = marshal.loads(data[16:])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

memoryview(data)[16:]

bytecode = importlib._bootstrap_external._code_to_hash_pyc(
code,
source_hash,
invalidation_mode == PycInvalidationMode.CHECKED_HASH,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A == test to get a boolean value in a list of args is easy to misread as a typo for a keyword argument being passed. Disambiguate by adding ()s around it or passing this by name?

Copy link
Contributor Author

@benjaminp benjaminp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments! Update incoming.

Copy link
Member

@brettcannon brettcannon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty much all of my comments are minor, so no need to have to double-check any changes made based on my feedback.

.. cmdoption:: --invalidation-mode [timestamp|checked-hash|unchecked-hash]

Control how the generated pycs will be invalidated at runtime. The default
setting, ``timestamp``, means that pyc files with the source timestamp and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just wanted to double-check that we are referring to the files as "pyc" and "pycs" throughout the rest of the docs instead of ".pyc files" or something? Basically just checking we're being consistent.

@@ -1327,6 +1330,14 @@ an :term:`importer`.
.. versionchanged:: 3.6
Accepts a :term:`path-like object`.

.. function:: source_hash(source_bytes)

Return the hash of *source_bytes* as byte string. A hash-based pyc embeds the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"as byte string" -> "as bytes".

metadata.

Python also supports "hash-based" cache files, which store a hash of a source
file contents rather than its metadata. There are two variants of hash-based
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"source file contents" -> "source file's contents"

hash in the cache file. If a checked hash-based cache file is found to be
invalid, Python regenerates it and writes a new checked hash-based cache
file. For unchecked hash-based pycs, Python simply assumes the cache file is
valid if it exists. Hash-based pyc validation behavior may be override with the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"override" -> "overridden"

elif len(raw_timestamp) != 4:
message = 'reached EOF while reading timestamp in {!r}'.format(name)
if len(data) < 16:
message = 'reached EOF while reading pyc header of {!r}'.format(name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might as well change this to an f-string.

flags = _r_long(data[4:8])
# Only the first two flags are defined.
if flags & ~0b11:
raise ImportError('invalid flags {!r} in {!r}'.format(flags, name),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

f-string


"""
if _r_long(data[8:12]) != (source_mtime & 0xFFFFFFFF):
message = 'bytecode is stale for {!r}'.format(name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

f-string

raise ImportError(message, **exc_details)
if (source_size is not None and
_r_long(data[12:16]) != (source_size & 0xFFFFFFFF)):
raise ImportError('bytecode is stale for {!r}'.format(name),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

f-string

del_source=del_source)
test('_temp', mapping, bc_path)

def _test_no_marshal(self, *, del_source=False):
with util.create_modules('_temp') as mapping:
bc_path = self.manipulate_bytecode('_temp', mapping,
lambda bc: bc[:12],
lambda bc: bc[:16],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should have made these constants instead of using a literal when I originally wrote this code. Sorry about that. :(

Python/import.c Outdated
@@ -2216,6 +2249,14 @@ PyInit_imp(void)
d = PyModule_GetDict(m);
if (d == NULL)
goto failure;
PyObject *pyc_mode = PyUnicode_FromString(_Py_CheckHashBasedPycsMode);
if (pyc_mode == NULL)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing curly braces.

Copy link
Member

@gpshead gpshead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some minor comments to address but overall i believe this is good to go in when you are ready.

thanks!

Modules/main.c Outdated
@@ -98,6 +104,8 @@ static const char usage_3[] = "\
also PYTHONWARNINGS=arg\n\
-x : skip first line of source, allowing use of non-Unix forms of #!cmd\n\
-X opt : set implementation-specific option\n\
--check-hash-based-pycs always|default|never:\n\
control how Python invalidates hash-based .pcy files\n\
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.pyc files

};

PyDoc_STRVAR(module_doc,
"perfmap module.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where'd this file come from?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eek, now you know what crud I have sitting about in my checkout.

if (flags != 0) {
// Hash-based pyc. We currently refuse to handle checked hash-based
// pycs. We could validate hash-based pycs against the source, but it
// seems likely that most people putting hash-based pycs in a zipfile
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 agreed!

Python now supports checking bytecode cache up-to-dateness with a hash of the
source contents rather than volatile source metadata. See the PEP for details.

While a fairly straightforward idea, quite a lot of code had to be modified due
to the pervasiveness of pyc implementation details in the codebase. Changes in
this commit include:

- The core changes to importlib to understand how to read, validate, and
  regenerate hash-based pycs.

- Support for generating hash-based pycs in py_compile and compileall.

- Modifications to our siphash implementation to support passing a custom
  key. We then expose it to importlib through _imp.

- Updates to all places in the interpreter, standard library, and tests that
  manually generate or parse pyc files to grok the new format.

- Support in the interpreter command line code for long options like
  --check-hash-based-pycs.

- Tests and documentation for all of the above.
@benjaminp benjaminp merged commit 42aa93b into master Dec 9, 2017
@benjaminp benjaminp deleted the benjamin-pep-552 branch December 9, 2017 18:26
@benjaminp
Copy link
Contributor Author

Thanks again for the reviews, Brett and Greg!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants