-
-
Notifications
You must be signed in to change notification settings - Fork 31.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bpo-33499: Add PYTHONPYCACHEPREFIX env var for alt bytecode cache location. #6834
Conversation
afd306c
to
cbde1b0
Compare
Doc/using/cmdline.rst
Outdated
If this is set, Python will write ``.pyc`` files in a mirror directory tree | ||
at this path, instead of in ``__pycache__`` directories within the source | ||
tree. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You want to say .. versionadded:: 3.8
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have tested the rendering of the documentation and it's ok.
cbde1b0
to
4104b18
Compare
4104b18
to
0730491
Compare
d92eb1c
to
01b0bba
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly LGTM. My main concerns are with cross-platform handling and with where the environment variable gets handled. I've left comments in-line. Thanks for working on this, Carl.
Lib/importlib/_bootstrap_external.py
Outdated
@@ -310,7 +320,18 @@ def cache_from_source(path, debug_override=None, *, optimization=None): | |||
if not optimization.isalnum(): | |||
raise ValueError('{!r} is not alphanumeric'.format(optimization)) | |||
almost_filename = '{}.{}{}'.format(almost_filename, _OPT, optimization) | |||
return _path_join(head, _PYCACHE, almost_filename + BYTECODE_SUFFIXES[0]) | |||
filename = almost_filename + BYTECODE_SUFFIXES[0] | |||
if bytecode_path is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For clarity in the code, it may make sense to make this a constant (i.e. BYTECODE_PATH
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I followed the example of similar existing module-level globals set in _setup
, path_sep
and path_separators
, which are lowercase. I also thought about making BYTECODE_PATH
uppercase, but chose consistency. Happy to change it though.
Lib/importlib/_bootstrap_external.py
Outdated
if bytecode_path is not None: | ||
if not _path_isabs(head): | ||
head = _path_join(_os.getcwd(), head) | ||
if head[1] == ':': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for Windows support, right? Do we do anything like this elsewhere in importlib?
Also, this could get things wrong on posix (where the matched colon has no semantic meaning), no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's for Windows support. And yeah, I think currently it could get things wrong on posix; I'll also add a check that the first character is alpha; since this path is definitely absolute at this point, that means it can't be a posix path.
Lib/importlib/_bootstrap_external.py
Outdated
head = head[2:] | ||
return _path_join( | ||
bytecode_path, | ||
head.strip(path_separators), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't os.path.join() take care of this for you? (maybe I'm remembering wrong...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would I think, but we can't use os.path.join
here in _bootstrap_external
, we don't have access to it. That's why we're using the substitute _path_join
defined here in the module, which isn't nearly as sophisticated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would need the lstrip
part of this even if we were using os.path.join
, or else it would erase the initial bytecode_path
. Part of what we're doing here is turning head
from an absolute path back into a relative path we can join to bytecode_path
.
Lib/importlib/_bootstrap_external.py
Outdated
else: | ||
head, pycache = _path_split(head) | ||
if pycache != _PYCACHE: | ||
raise ValueError('{} not bottom-level directory in ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
f-string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Eric meant "f-strings everywhere!" 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, can do, I left it unchanged because this is pre-existing code, all I did was indent it one level. I'm all in favor of f-strings everywhere :-)
Lib/importlib/_bootstrap_external.py
Outdated
@@ -1552,6 +1576,10 @@ def _setup(_bootstrap_module): | |||
SOURCE_SUFFIXES.append('.pyw') | |||
if '_d.pyd' in EXTENSION_SUFFIXES: | |||
WindowsRegistryFinder.DEBUG_BUILD = True | |||
bytecode_path = _os.environ.get(b'PYTHONBYTECODEPATH') or None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think it be worth looking into doing this as part of the runtime init (i.e. PEP 432). It's the approach we follow for similar environment variables (and CLI flags). Addressing this would impact files like Include/internal/pystate.h, Modules/main.c, and Python/pystate.c.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can look at that, though I'm not clear what concrete benefit it would bring, unless we want to add a corresponding CLI arg.
Lib/test/test_importlib/test_util.py
Outdated
# cache_from_source and source_from_cache come from the same module, | ||
# thus they share the same globals dict, so we only have to patch it | ||
# once | ||
_globals = self.util.cache_from_source.__globals__ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this indirection feels slightly cheat-y, I still think it's the best option. :) Testing importlib is hard. The comment here helps, so thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your help identifying this workaround!
Lib/test/test_importlib/test_util.py
Outdated
# a bytecode path inside that directory (in a subdirectory mirroring the | ||
# .py file's path) rather than in a __pycache__ dir next to the py file. | ||
bytecode_paths = [ | ||
os.path.join(os.path.sep, 'tmp', 'bytecode'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These have a definite posix vibe to them. Could you also mix in some Windows-specific paths too. Note that the semantics of such will be different depending on platform (see my note above on cache_from_source), so the tests will need to differ depending on platform.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree the path names "feel" posixy here, but on Windows I turn them into true Windows paths; that's what the "drive" stuff below is for. On posix the "drive" will be empty so the path remains unchanged, on Windows I'll prepend the drive to it. So I think that I'm already doing what you are asking for (using Windows paths on Windows and posix paths on posix).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think UNC paths would be the main thing this test case doesn't yet cover. I think we get that for free due to importlib being in Python rather than C, but a couple of options worth considering as test cases would be:
\\?\C:\...
\\localhost\c$\...
(The first of those is a Windows feature I didn't previously know about, but discovered while double checking how to spell the local loopback share access: https://stackoverflow.com/questions/2787203/unc-path-to-a-folder-on-my-local-computer/36196538#36196538 )
Testing of the local loopback share would need to be gated on actually having access to it (which local admin accounts do, but regular users probably won't - that's why I thought the \\?\
prefix might be interesting)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These functions (source_from_cache
and cache_from_source
) and their tests only do path manipulation, not path access. I don't intend to add any tests that do filesystem access, since that would be testing existing behavior not modified by this PR. So I don't think I will need to gate on access.
Ultimately the only thing that matters for this PR is that we correctly detect UNC paths as "absolute" and don't try to prepend CWD to them. It's clear from inspection that we do that (since they begin with a path separator), but I agree it would be good to have it tested; will add tests.
Lib/test/test_importlib/test_util.py
Outdated
os.path.join(os.path.sep, 'tmp', '\u2603'), # non-ASCII in path! | ||
os.path.join(os.path.sep, 'tmp', 'trailing-slash') + os.path.sep, | ||
] | ||
drive = os.path.splitdrive(os.getcwd())[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for clarification, os.getcwd() always returns an absolute path? or maybe that doesn't matter for splitdrive()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, it appeared to but I don't find that guarantee in the documentation.
On second thought I think it might be clearer if I just do an explicit platform check here, and hardcode a C:
drive on Windows, instead of abusing getcwd/splitdrive this way as a hidden platform check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python documentation isn't clear on the topic, but I checked the docs for GetCurrentDirectory
on Windows (https://msdn.microsoft.com/en-us/library/windows/desktop/aa364934(v=vs.85).aspx) and libc getcwd
(http://man7.org/linux/man-pages/man2/getcwd.2.html), which are how os.getcwd
is implemented, and both are clearly defined to return absolute paths. And even more to the point, both the posix and Windows versions of os.path.abspath
use "join with os.getcwd()
" as their algorithm. Which is good, because even if I change the code here, in _bootstrap_external.py
(where I don't have access to os.path
) I'm also relying on "join with cwd" as a way to absolutize a path.
Lib/test/test_importlib/test_util.py
Outdated
finally: | ||
_globals['bytecode_path'] = _orig_path | ||
|
||
@unittest.skipUnless(sys.implementation.cache_tag is not None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the following equivalent?
@unittest.skipIf(sys.implementation.cache_tag is None, ...)
If so, considering do that. skipUnless drives me nuts. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I just copied this from existing test methods in this class. I can convert them all to use skipIf
instead.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
Lib/importlib/_bootstrap_external.py
Outdated
filename = almost_filename + BYTECODE_SUFFIXES[0] | ||
if bytecode_path is not None: | ||
if not _path_isabs(head): | ||
head = _path_join(_os.getcwd(), head) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to suddenly start making this absolute? This is a change in semantics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not a change in semantics because this only occurs in the if bytecode_path is not None
branch, which is new and only used if the new env var is set, so there are no existing semantics for this case.
We need to make the path absolute when using a PYTHONBYTECODEPATH
to ensure no clashes in the cached bytecode tree. E.g if someone runs code with a local path of foo/bar.py
in two different places on their system, with the same value for PYTHONBYTECODEPATH
, we need them to store their bytecode in /tmp/bytecode/abs/path1/to/foo/
and /tmp/bytecode/abs/path2/to/foo/
, not collide over /tmp/bytecode/foo/
.
Lib/test/test_importlib/test_util.py
Outdated
@contextlib.contextmanager | ||
def bytecode_path(self, path): | ||
# Patch directly in globals of the functions under test, because we have | ||
# no direct access to the right _bootstrap_external module to patch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing a period.
Lib/test/test_importlib/test_util.py
Outdated
|
||
# cache_from_source and source_from_cache come from the same module, | ||
# thus they share the same globals dict, so we only have to patch it | ||
# once |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing a period.
Lib/importlib/_bootstrap_external.py
Outdated
else: | ||
head, pycache = _path_split(head) | ||
if pycache != _PYCACHE: | ||
raise ValueError('{} not bottom-level directory in ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Eric meant "f-strings everywhere!" 😉
Thanks for the review! Will make the suggested updates. |
I addressed all the relevant review feedback except for @ericsnowcurrently 's comment about runtime-init. Since I just saw that @warsaw weighed in on bpo in favor of a matching CLI arg, I'll go ahead and work on implementing a CLI arg and moving the initialization to runtime init. Any votes on the CLI flag to use? For now I'm tentatively using |
On second thought, I'd like to ask for re-consideration of the approach of just having an env var (no CLI arg), and keeping initialization where it is now (within importlib). Reasons:
So I make the magic incantation: I have made the requested changes; please review again. |
Thanks for making the requested changes! @ericsnowcurrently, @brettcannon: please review the changes made to this pull request. |
(BTW, the failing Windows VSTS check is the asyncio issue @zooba has been struggling with, it's not related to this PR.) |
@carljm I would be perfectly happy with an environment variable only. |
Nick will have a better sense of how PYTHONBYTECODEPATH fits in with PEP 432, but I'll give it a stab here. The main benefit of following the existing model is for CPython embedders. The convention (i.e. from PEP 432) is that startup-relevant environment variables are read and stored in the config used to initialize the main interpreter during startup. Embedders then have the opportunity to explicitly modify that config. If the only mechanism is through an environment variable (read whenever importlib is loaded) then embedders have to deal with that exception to the convention and may even miss it. |
FWIW, I don't think a CLI flag is necessary. |
79da140
to
5286dd3
Compare
I moved the environment variable parsing into interpreter config as requested, added I also manually tested that the environment variable and -X option do set (Windows VSTS failure is unrelated asyncio flakiness.) |
5286dd3
to
0d59ebd
Compare
Rebased to resolve conflict in frozen importlib. I can report that we are now using this feature actively in all Python development at Instagram and it's working well. Happy to make any other adjustments needed for merge. |
Doc/library/sys.rst
Outdated
@@ -209,6 +209,26 @@ always available. | |||
yourself to control bytecode file generation. | |||
|
|||
|
|||
.. data:: bytecode_prefix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I agree there's value in following existing precedent (as with the choice of _prefix
as the attribute suffix), I think in this case the precedent could be misleading, as it may give the impression that the interpreter will only ever read bytecode files from under that prefix, which isn't the case: the interpreter will still happily read precompiled bytecode files from any directory where Python source files may appear. What it won't do is read the local __pycache__
subdirectories in those source directories - it will look for the bytecode caches under the parallel pycache_prefix
tree instead.
That ambiguity doesn't exist for PYTHONDONTWRITEBYTECODE
, as that setting clearly doesn't affect the way the interpreter reads bytecode files.
(That said, if we do agree to go with sys.pycache_prefix
, we may subsequently decide that it's worthwhile to start supporting a PYTHONUPDATEPYCACHE=0
setting, and only keep PYTHONDONTWRITEBYTECODE
around as a backwards compatibility shim)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, works for me; I'll update to pycache_prefix
.
While checking to see whether or not My isolated mode concerns are essentially the "Security" and "More security" bullet points in https://www.python.org/dev/peps/pep-0304/#issues: it will be a problem if the out-of-tree cache feature can be used to get an isolated Python to load cached bytecode for a root-owned source file from a director that was writable by users other than root (hence why I think "running in isolated mode prohibits the use of out-of-tree bytecode caches" could be a good default starting point, and then we can work out something less restrictive later based on concrete use cases rather than speculation). The other issues raised there are addressed either by the importlib migration (which moved the whole import system over to using import hooks), or else by https://www.python.org/dev/peps/pep-3147/ (which introduced |
Re isolated mode, it seems to me that If this analysis is reasonable, then I think the desired behavior is already in place (because it falls out naturally from the existing definition of isolated mode vis-a-vis environment variables; this feature uses the same env-get utility functions that are used for all other env vars, and these utilities already respect |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @carljm's isolated mode analysis is correct, and my other questions/concerns have all been addressed, so LGTM! Very nice enhancement :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks for making all those changes, Carl! I'm happy with where the patch ended up. And thanks for working on this! It's a solid addition. :)
My only lingering concern lies with not resolving relative paths eagerly. However, considering this is the status quo, I'll address it separately from the PR. :)
In some development setups it is inconvenient or impossible to write bytecode caches to the code tree, but the bytecode caches are still useful. The PYTHONBYTECODEPATH environment variable allows specifying an alternate location for bytecode files, within which a directory tree mirroring the code tree will be created.
6041710
to
edfd3b6
Compare
Naive question my part, but is there anything explicit we need to do for And I changed my review since @carljm addressed my comments. |
Automatically handled by |
3 green ticks from import system maintainers, and all the old thorny issues from PEP 304 resolved by other PEPs in the meantime, so I'm calling this as accepted! :) Thanks @carljm! |
Thanks everyone for the careful, thoughtful, and constructive reviews! Really good contributor experience :) |
I was having a debate on #python that pycache needs to have an alternative path the other day and having noticed this today is just wow, you people are way ahead... nice job implementing this feature :D |
In some development setups it is inconvenient or impossible to write bytecode
caches to the code tree, but the bytecode caches are still useful. The
PYTHONBYTECODEPATH environment variable allows specifying an alternate
location for bytecode files, within which a directory tree mirroring the code
tree will be created.
https://bugs.python.org/issue33499