feat: prevent uploading PyPI tokens in common places#19994
Conversation
Minimal implementaiton detecting PyPI API tokens in the commonly used places using the Yara scanner.
19f8bf3 to
530edc6
Compare
miketheman
left a comment
There was a problem hiding this comment.
Excellent start, thanks for taking a pass at it! I've dropped some comments inline.
| return compiled | ||
|
|
||
|
|
||
| def _generate_token(domain="pypi.org", projects_scope=False): |
There was a problem hiding this comment.
I can imagine a future factory-style generator to create tokens for testing, so this paves that future nicely.
e386b2b to
e210ba3
Compare
|
From looking for test tokens to exclude, the closest seem to be https://github.com/mongodb/kingfisher/blob/b2811107a814c117f6a54fd8d334aef52409bf8f/src/validation.rs#L1924 or https://github.com/cider-security-research/cicd-goat/blob/0ed10925f3983857cf219b2ac1c327b861fcccca/solutions/duchess.md?plain=1#L35 but the first won't land in any scanned file, and the second is not a PyPI project. However, I've discovered how people present a placeholder for PyPI tokens, and for regression prevention, I've added them to the test base. This is handled well just with the length in the Yara rule, but I've left the guidelines to exclude other sample tokens specifically, in case one needs it in the future. |
| # rules (e.g. pyarmor), and .pye for SourceDefender-encrypted files. | ||
| _SCAN_EXTENSIONS = {".py", ".pye"} | ||
| # Extensions to scan inside archives. | ||
| _SCAN_TARGETS = { |
There was a problem hiding this comment.
Nit: there might be a better way to express these targets? Maybe as globs, to make it clearer which ones are filenames versus suffixes?
There was a problem hiding this comment.
I was thinking about globs (also to cover your second comment), but I decided it would be better to leave it for a separate change to better research what would be the best way for that. My personal preference would be scan every file under a size limit, but I think it requires more considerations
There was a problem hiding this comment.
Makes sense, thanks. I'll defer to whatever others think makes sense for an initial merge here.
| # Different textual files for common places where PyPI tokens are accidentally left. | ||
| ".md", | ||
| ".rst", | ||
| ".env", |
There was a problem hiding this comment.
Nit: it might be good to match anything with *env* in the filename (including suffixes), since .env.prod, .prod.env, env.py, etc. are all conceivable.
There was a problem hiding this comment.
Yes - see the previous comment. As a bit of context, the current list is based on where I saw tokens recently.
The minimal implementation of #19985 to prevent users from uploading tokens in distribution archives. As the current scanner requires selecting extensions to scan, the list was extended with common places where they are left behind.
This change does not perform automatic token revoking. Only tokens for pypi.org are detected.