Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Uses PYTHONPYCACHEPREFIX to control where __pycache__ goes #340

Merged
merged 1 commit into from
May 9, 2022

Conversation

ryanmoran
Copy link
Member

Summary

Python 3.x will precompile bytecode artifacts (*.pyc files) when these libraries are loaded. These files appear within __pycache__ directories alongside their source code. For libraries provided by the stdlib, these directories will be created or updated within the cpython layer since the source is located there. Ultimately, this means that loading Python code will result in the interpreter modifying the cpython layer after it has been created, thus creating non-reproducible builds.

There are a couple of options to prevent this layer modification from happening:

  1. Set PYTHONDONTWRITEBYTECODE which will ensure these artifacts are never generated.
  2. Set PYTHONPYCACHEPREFIX which will control where these artifacts appear on the filesystem.

I've chosen to set PYTHONPYCACHEPREFIX as it still allows any benefit these files may provide while ensuring we aren't inadvertently modifying the cpython layer.

Additionally, I've changed the level at which we apply the PYTHONPATH environment variable to Default to align with RFC0019.

Use Cases

Resolves #317

Checklist

  • I have viewed, signed, and submitted the Contributor License Agreement.
  • I have linked issue(s) that this PR should close using keywords or the Github UI (See docs)
  • I have added an integration test, if necessary.
  • I have reviewed the styleguide for guidance on my code quality.
  • I'm happy with the commit history on this PR (I have rebased/squashed as needed).

@ryanmoran ryanmoran requested a review from a team as a code owner May 5, 2022 19:52
@robdimsdale robdimsdale added the semver:patch A change requiring a patch version bump label May 9, 2022
@robdimsdale robdimsdale enabled auto-merge (rebase) May 9, 2022 16:01
@robdimsdale
Copy link
Member

robdimsdale commented May 9, 2022

@ryanmoran looks great - thanks.

Out of curiosity, how did you discover that this was needed? I see that this resolves the linked issue, but do you also have a framework for validating the (non-)reproducibility of this buildpack? Do you think our existing reproducible build tests are insufficient?

@robdimsdale robdimsdale merged commit 22d754f into main May 9, 2022
@robdimsdale robdimsdale deleted the pycache-prefix branch May 9, 2022 16:02
@ryanmoran
Copy link
Member Author

Out of curiosity, how did you discover that this was needed? I see that this resolves the linked issue, but do you also have a framework for validating the (non-)reproducibility of this buildpack?

This is a somewhat tricky case to resolve. I think the biggest hint for me was that we'd only seen this in cases where we were using poetry. That reminded me that I'd heard of cases in the past where it appeared that some Python tooling caused modifications to the underlying Python installation, usually by installing packages there.

With that in mind, I used this tool called container-diff to diff the filesystem between a freshly built container and one from a rebuild. The results showed that the specific files being modified were the *.pyc files in the cpython layer. From there, I started to investigate why these files might be getting modified. Specifically, I found this Stack Overflow post and then this GitHub Issue comment that lead me in the direction of trying to control where these *.pyc files were being written. After setting this environment variable, I could confirm with container-diff that I was no longer seeing modifications to the cpython layer on rebuilds.

Do you think our existing reproducible build tests are insufficient?

I'm not really sure what our tests should be doing in this case. I don't think the tests in the cpython buildpack could have been written in a way that obviously exposes this issue. Additionally, I don't think it makes sense for another implementation buildpack like poetry-install, which is ultimately causing the issue, to have test coverage for this. My hope is that the CNB team can figure out how to have a reasonable implementation for this RFC so that these types of violations of the spec result in a build failure instead of silently creating modified layers.

@robdimsdale
Copy link
Member

Thanks for the detailed explanation @ryanmoran! I agree that ideally we would be able to mark the layers as read-only and expose the underlying issue via a build failure. Thanks for linking to that RFC.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver:patch A change requiring a patch version bump
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Layer is being re-added despite being restored from the cache
2 participants