Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-33499: Add PYTHONPYCACHEPREFIX env var for alt bytecode cache location. #6834

Merged
merged 10 commits into from
Jun 16, 2018

Conversation

carljm
Copy link
Member

@carljm carljm commented May 14, 2018

In some development setups it is inconvenient or impossible to write bytecode
caches to the code tree, but the bytecode caches are still useful. The
PYTHONBYTECODEPATH environment variable allows specifying an alternate
location for bytecode files, within which a directory tree mirroring the code
tree will be created.

https://bugs.python.org/issue33499

@carljm carljm force-pushed the python-bytecode-path branch 2 times, most recently from afd306c to cbde1b0 Compare May 15, 2018 18:14
If this is set, Python will write ``.pyc`` files in a mirror directory tree
at this path, instead of in ``__pycache__`` directories within the source
tree.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You want to say .. versionadded:: 3.8 here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tested the rendering of the documentation and it's ok.

@carljm carljm force-pushed the python-bytecode-path branch from cbde1b0 to 4104b18 Compare May 15, 2018 19:38
@ambv ambv requested a review from brettcannon May 15, 2018 21:45
@carljm carljm force-pushed the python-bytecode-path branch from 4104b18 to 0730491 Compare May 15, 2018 22:06
@carljm carljm requested a review from a team as a code owner May 15, 2018 22:06
@carljm carljm force-pushed the python-bytecode-path branch 2 times, most recently from d92eb1c to 01b0bba Compare May 16, 2018 03:34
Copy link
Member

@ericsnowcurrently ericsnowcurrently left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM. My main concerns are with cross-platform handling and with where the environment variable gets handled. I've left comments in-line. Thanks for working on this, Carl.

@@ -310,7 +320,18 @@ def cache_from_source(path, debug_override=None, *, optimization=None):
if not optimization.isalnum():
raise ValueError('{!r} is not alphanumeric'.format(optimization))
almost_filename = '{}.{}{}'.format(almost_filename, _OPT, optimization)
return _path_join(head, _PYCACHE, almost_filename + BYTECODE_SUFFIXES[0])
filename = almost_filename + BYTECODE_SUFFIXES[0]
if bytecode_path is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For clarity in the code, it may make sense to make this a constant (i.e. BYTECODE_PATH).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I followed the example of similar existing module-level globals set in _setup, path_sep and path_separators, which are lowercase. I also thought about making BYTECODE_PATH uppercase, but chose consistency. Happy to change it though.

if bytecode_path is not None:
if not _path_isabs(head):
head = _path_join(_os.getcwd(), head)
if head[1] == ':':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for Windows support, right? Do we do anything like this elsewhere in importlib?

Also, this could get things wrong on posix (where the matched colon has no semantic meaning), no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's for Windows support. And yeah, I think currently it could get things wrong on posix; I'll also add a check that the first character is alpha; since this path is definitely absolute at this point, that means it can't be a posix path.

head = head[2:]
return _path_join(
bytecode_path,
head.strip(path_separators),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't os.path.join() take care of this for you? (maybe I'm remembering wrong...)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would I think, but we can't use os.path.join here in _bootstrap_external, we don't have access to it. That's why we're using the substitute _path_join defined here in the module, which isn't nearly as sophisticated.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would need the lstrip part of this even if we were using os.path.join, or else it would erase the initial bytecode_path. Part of what we're doing here is turning head from an absolute path back into a relative path we can join to bytecode_path.

else:
head, pycache = _path_split(head)
if pycache != _PYCACHE:
raise ValueError('{} not bottom-level directory in '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

f-string?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Eric meant "f-strings everywhere!" 😉

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, can do, I left it unchanged because this is pre-existing code, all I did was indent it one level. I'm all in favor of f-strings everywhere :-)

@@ -1552,6 +1576,10 @@ def _setup(_bootstrap_module):
SOURCE_SUFFIXES.append('.pyw')
if '_d.pyd' in EXTENSION_SUFFIXES:
WindowsRegistryFinder.DEBUG_BUILD = True
bytecode_path = _os.environ.get(b'PYTHONBYTECODEPATH') or None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think it be worth looking into doing this as part of the runtime init (i.e. PEP 432). It's the approach we follow for similar environment variables (and CLI flags). Addressing this would impact files like Include/internal/pystate.h, Modules/main.c, and Python/pystate.c.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can look at that, though I'm not clear what concrete benefit it would bring, unless we want to add a corresponding CLI arg.

# cache_from_source and source_from_cache come from the same module,
# thus they share the same globals dict, so we only have to patch it
# once
_globals = self.util.cache_from_source.__globals__
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this indirection feels slightly cheat-y, I still think it's the best option. :) Testing importlib is hard. The comment here helps, so thanks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your help identifying this workaround!

# a bytecode path inside that directory (in a subdirectory mirroring the
# .py file's path) rather than in a __pycache__ dir next to the py file.
bytecode_paths = [
os.path.join(os.path.sep, 'tmp', 'bytecode'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These have a definite posix vibe to them. Could you also mix in some Windows-specific paths too. Note that the semantics of such will be different depending on platform (see my note above on cache_from_source), so the tests will need to differ depending on platform.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree the path names "feel" posixy here, but on Windows I turn them into true Windows paths; that's what the "drive" stuff below is for. On posix the "drive" will be empty so the path remains unchanged, on Windows I'll prepend the drive to it. So I think that I'm already doing what you are asking for (using Windows paths on Windows and posix paths on posix).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think UNC paths would be the main thing this test case doesn't yet cover. I think we get that for free due to importlib being in Python rather than C, but a couple of options worth considering as test cases would be:

\\?\C:\...
\\localhost\c$\...

(The first of those is a Windows feature I didn't previously know about, but discovered while double checking how to spell the local loopback share access: https://stackoverflow.com/questions/2787203/unc-path-to-a-folder-on-my-local-computer/36196538#36196538 )

Testing of the local loopback share would need to be gated on actually having access to it (which local admin accounts do, but regular users probably won't - that's why I thought the \\?\ prefix might be interesting)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These functions (source_from_cache and cache_from_source) and their tests only do path manipulation, not path access. I don't intend to add any tests that do filesystem access, since that would be testing existing behavior not modified by this PR. So I don't think I will need to gate on access.

Ultimately the only thing that matters for this PR is that we correctly detect UNC paths as "absolute" and don't try to prepend CWD to them. It's clear from inspection that we do that (since they begin with a path separator), but I agree it would be good to have it tested; will add tests.

os.path.join(os.path.sep, 'tmp', '\u2603'), # non-ASCII in path!
os.path.join(os.path.sep, 'tmp', 'trailing-slash') + os.path.sep,
]
drive = os.path.splitdrive(os.getcwd())[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for clarification, os.getcwd() always returns an absolute path? or maybe that doesn't matter for splitdrive()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, it appeared to but I don't find that guarantee in the documentation.

On second thought I think it might be clearer if I just do an explicit platform check here, and hardcode a C: drive on Windows, instead of abusing getcwd/splitdrive this way as a hidden platform check.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python documentation isn't clear on the topic, but I checked the docs for GetCurrentDirectory on Windows (https://msdn.microsoft.com/en-us/library/windows/desktop/aa364934(v=vs.85).aspx) and libc getcwd (http://man7.org/linux/man-pages/man2/getcwd.2.html), which are how os.getcwd is implemented, and both are clearly defined to return absolute paths. And even more to the point, both the posix and Windows versions of os.path.abspath use "join with os.getcwd()" as their algorithm. Which is good, because even if I change the code here, in _bootstrap_external.py (where I don't have access to os.path) I'm also relying on "join with cwd" as a way to absolutize a path.

finally:
_globals['bytecode_path'] = _orig_path

@unittest.skipUnless(sys.implementation.cache_tag is not None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the following equivalent?

@unittest.skipIf(sys.implementation.cache_tag is None, ...)

If so, considering do that. skipUnless drives me nuts. :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I just copied this from existing test methods in this class. I can convert them all to use skipIf instead.

@bedevere-bot
Copy link

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

filename = almost_filename + BYTECODE_SUFFIXES[0]
if bytecode_path is not None:
if not _path_isabs(head):
head = _path_join(_os.getcwd(), head)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to suddenly start making this absolute? This is a change in semantics.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a change in semantics because this only occurs in the if bytecode_path is not None branch, which is new and only used if the new env var is set, so there are no existing semantics for this case.

We need to make the path absolute when using a PYTHONBYTECODEPATH to ensure no clashes in the cached bytecode tree. E.g if someone runs code with a local path of foo/bar.py in two different places on their system, with the same value for PYTHONBYTECODEPATH, we need them to store their bytecode in /tmp/bytecode/abs/path1/to/foo/ and /tmp/bytecode/abs/path2/to/foo/, not collide over /tmp/bytecode/foo/.

@contextlib.contextmanager
def bytecode_path(self, path):
# Patch directly in globals of the functions under test, because we have
# no direct access to the right _bootstrap_external module to patch
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing a period.


# cache_from_source and source_from_cache come from the same module,
# thus they share the same globals dict, so we only have to patch it
# once
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing a period.

else:
head, pycache = _path_split(head)
if pycache != _PYCACHE:
raise ValueError('{} not bottom-level directory in '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Eric meant "f-strings everywhere!" 😉

@carljm
Copy link
Member Author

carljm commented May 16, 2018

Thanks for the review! Will make the suggested updates.

@carljm
Copy link
Member Author

carljm commented May 16, 2018

I addressed all the relevant review feedback except for @ericsnowcurrently 's comment about runtime-init. Since I just saw that @warsaw weighed in on bpo in favor of a matching CLI arg, I'll go ahead and work on implementing a CLI arg and moving the initialization to runtime init. Any votes on the CLI flag to use? For now I'm tentatively using -C (for "cache", since what we're moving is otherwise called __pycache__).

@carljm
Copy link
Member Author

carljm commented May 17, 2018

On second thought, I'd like to ask for re-consideration of the approach of just having an env var (no CLI arg), and keeping initialization where it is now (within importlib). Reasons:

  1. I can't come up with any good use cases for a CLI arg for this; the whole point of a bytecode cache is that it's reused, so this is something you definitely want passed on to subprocesses, etc. It doesn't make sense to take up valuable space in the CLI flag namespace (and touch a bunch of C code) if we don't have strong rationale for why it's needed.
  2. In terms of precedent, our existing env variables that take a path value (PYTHONHOME, PYTHONPATH) do not have a CLI equivalent. In fact, more existing env vars don't have a CLI arg than do. So I don't think there's a general consistency argument for adding the CLI arg; we should add it only if it's truly useful.
  3. I don't understand what the concrete benefit of moving reading of the env var out of importlib and into PEP 432 runtime init would be. This is importlib internal state, it makes sense to keep it encapsulated inside importlib. The only external use case for knowing its value would be to find the pyc for a py (and vice versa), and this functionality is already exported in importlib.util in the functions source_from_cache and cache_from_source. If the goal is to allow setting different values for different sub-interpreters, I also don't see the attraction there; a bytecode cache is something where you'd want to maximize sharing, not isolation.

So I make the magic incantation: I have made the requested changes; please review again.

@bedevere-bot
Copy link

Thanks for making the requested changes!

@ericsnowcurrently, @brettcannon: please review the changes made to this pull request.

@carljm
Copy link
Member Author

carljm commented May 17, 2018

(BTW, the failing Windows VSTS check is the asyncio issue @zooba has been struggling with, it's not related to this PR.)

@warsaw
Copy link
Member

warsaw commented May 17, 2018

@carljm I would be perfectly happy with an environment variable only.

@ericsnowcurrently
Copy link
Member

@ncoghlan

Nick will have a better sense of how PYTHONBYTECODEPATH fits in with PEP 432, but I'll give it a stab here.

The main benefit of following the existing model is for CPython embedders. The convention (i.e. from PEP 432) is that startup-relevant environment variables are read and stored in the config used to initialize the main interpreter during startup. Embedders then have the opportunity to explicitly modify that config. If the only mechanism is through an environment variable (read whenever importlib is loaded) then embedders have to deal with that exception to the convention and may even miss it.

@ericsnowcurrently
Copy link
Member

FWIW, I don't think a CLI flag is necessary.

@carljm carljm force-pushed the python-bytecode-path branch from 79da140 to 5286dd3 Compare May 19, 2018 02:11
@carljm
Copy link
Member Author

carljm commented May 19, 2018

I moved the environment variable parsing into interpreter config as requested, added sys.bytecode_path as public API for it, and also added -X bytecode_path=... since @ncoghlan suggested it might be useful on Windows (and because it was easy to do once I was already parsing the env var in interpreter setup). I think I've addressed all comments and this is review-ready, @ericsnowcurrently / @brettcannon. Thanks in advance for review!

I also manually tested that the environment variable and -X option do set sys.bytecode_path as expected, and the pyc files are written to and read from the expected location.

(Windows VSTS failure is unrelated asyncio flakiness.)

@carljm carljm force-pushed the python-bytecode-path branch from 5286dd3 to 0d59ebd Compare June 7, 2018 06:40
@carljm
Copy link
Member Author

carljm commented Jun 7, 2018

Rebased to resolve conflict in frozen importlib.

I can report that we are now using this feature actively in all Python development at Instagram and it's working well.

Happy to make any other adjustments needed for merge.

@@ -209,6 +209,26 @@ always available.
yourself to control bytecode file generation.


.. data:: bytecode_prefix
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I agree there's value in following existing precedent (as with the choice of _prefix as the attribute suffix), I think in this case the precedent could be misleading, as it may give the impression that the interpreter will only ever read bytecode files from under that prefix, which isn't the case: the interpreter will still happily read precompiled bytecode files from any directory where Python source files may appear. What it won't do is read the local __pycache__ subdirectories in those source directories - it will look for the bytecode caches under the parallel pycache_prefix tree instead.

That ambiguity doesn't exist for PYTHONDONTWRITEBYTECODE, as that setting clearly doesn't affect the way the interpreter reads bytecode files.

(That said, if we do agree to go with sys.pycache_prefix, we may subsequently decide that it's worthwhile to start supporting a PYTHONUPDATEPYCACHE=0 setting, and only keep PYTHONDONTWRITEBYTECODE around as a backwards compatibility shim)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, works for me; I'll update to pycache_prefix.

@ncoghlan
Copy link
Contributor

While checking to see whether or not PYTHONDONTWRITEBYTECODE came from a PEP or an ordinary issue, I discovered that there was an early attempt at designing this feature around 15 years ago, before the more recent changes that added the cache_from_source APIs: https://www.python.org/dev/peps/pep-0304/

My isolated mode concerns are essentially the "Security" and "More security" bullet points in https://www.python.org/dev/peps/pep-0304/#issues: it will be a problem if the out-of-tree cache feature can be used to get an isolated Python to load cached bytecode for a root-owned source file from a director that was writable by users other than root (hence why I think "running in isolated mode prohibits the use of out-of-tree bytecode caches" could be a good default starting point, and then we can work out something less restrictive later based on concrete use cases rather than speculation).

The other issues raised there are addressed either by the importlib migration (which moved the whole import system over to using import hooks), or else by https://www.python.org/dev/peps/pep-3147/ (which introduced __pycache__ directories).

@carljm
Copy link
Member Author

carljm commented Jun 11, 2018

Re isolated mode, it seems to me that PYTHONPYCACHEPREFIX should clearly not be respected in isolated mode (just as with every other PYTHON* env var), but -X pycache_prefix=PATH should still be respected; it is provided via the same means (command-line flag) that supplies -I itself, so there's nothing to be gained security-wise by having the one CLI flag cancel the other. (And it is potentially useful to use isolated mode and a CLI-provided pycache_prefix together.)

If this analysis is reasonable, then I think the desired behavior is already in place (because it falls out naturally from the existing definition of isolated mode vis-a-vis environment variables; this feature uses the same env-get utility functions that are used for all other env vars, and these utilities already respect Py_IgnoreEnvironmentFlag which is set by Py_IsolatedFlag).

Copy link
Contributor

@ncoghlan ncoghlan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @carljm's isolated mode analysis is correct, and my other questions/concerns have all been addressed, so LGTM! Very nice enhancement :)

@ncoghlan ncoghlan changed the title bpo-33499: Add PYTHONBYTECODEPATH env var for alt bytecode location. bpo-33499: Add PYTHONPYCACHEPREFIX env var for alt bytecode cache location. Jun 12, 2018
Copy link
Member

@ericsnowcurrently ericsnowcurrently left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks for making all those changes, Carl! I'm happy with where the patch ended up. And thanks for working on this! It's a solid addition. :)

My only lingering concern lies with not resolving relative paths eagerly. However, considering this is the status quo, I'll address it separately from the PR. :)

carljm added 10 commits June 12, 2018 10:09
In some development setups it is inconvenient or impossible to write bytecode
caches to the code tree, but the bytecode caches are still useful. The
PYTHONBYTECODEPATH environment variable allows specifying an alternate
location for bytecode files, within which a directory tree mirroring the code
tree will be created.
@carljm carljm force-pushed the python-bytecode-path branch from 6041710 to edfd3b6 Compare June 12, 2018 16:10
@brettcannon
Copy link
Member

Naive question my part, but is there anything explicit we need to do for -S? Or are all environment variables automatically ignored by that flag being set?

And I changed my review since @carljm addressed my comments.

@ncoghlan
Copy link
Contributor

Automatically handled by config_get_env_var_dup (it would only have been a problem if the patch wasn't using the existing config helpers).

@ncoghlan ncoghlan merged commit b193fa9 into python:master Jun 16, 2018
@ncoghlan
Copy link
Contributor

3 green ticks from import system maintainers, and all the old thorny issues from PEP 304 resolved by other PEPs in the meantime, so I'm calling this as accepted! :)

Thanks @carljm!

@carljm carljm deleted the python-bytecode-path branch June 20, 2018 00:13
@carljm
Copy link
Member Author

carljm commented Jun 20, 2018

Thanks everyone for the careful, thoughtful, and constructive reviews! Really good contributor experience :)

@YoSTEALTH
Copy link
Contributor

I was having a debate on #python that pycache needs to have an alternative path the other day and having noticed this today is just wow, you people are way ahead... nice job implementing this feature :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants