# Improve MyPy performance #10864
Or by parenting instances of its daemon: see https://docs.google.com/document/d/1n_MVVGjrkTKTPKHqRPlyfFzQyx2QioclMG_Q3DMUgYk/edit#bookmark=id.rla7dsf86tib
It's possible this could be accomplished by extending the proposal in #10870 to allow non-append-only caches, by re-snapshotting the cache directory on each process execution. That would make it remote-friendly without any further work once #10870 is implemented; I have noted this use case as a possible extension in that issue. I'm less familiar with how we plan to implement parenting instances of the mypy daemon, but it seems clear that we will want to support persistent processes somehow, and this looks like exactly the right use case to try that on. It's possible that these two approaches (parenting the daemon, and snapshotting the cache directory) are fully complementary -- I can't tell which would be more immediately useful.
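To make the idea concrete, here is a minimal sketch of what "re-snapshotting the cache directory on each process execution" could look like, in plain Python. The `snapshot`/`materialize` helpers are hypothetical stand-ins for illustration, not the real Pants digest APIs:

```python
import shutil
import subprocess
from pathlib import Path


def snapshot(cache_dir: Path) -> dict[str, bytes]:
    """Capture the cache directory as {relative path: file contents}."""
    return {
        str(p.relative_to(cache_dir)): p.read_bytes()
        for p in sorted(cache_dir.rglob("*"))
        if p.is_file()
    }


def materialize(snap: dict[str, bytes], cache_dir: Path) -> None:
    """Recreate the cache directory from a previously captured snapshot."""
    shutil.rmtree(cache_dir, ignore_errors=True)
    for rel, data in snap.items():
        dest = cache_dir / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_bytes(data)


def run_with_mutable_cache(
    argv: list[str], cache_dir: Path, snap: dict[str, bytes]
) -> dict[str, bytes]:
    """Materialize the prior snapshot, run the tool, then re-snapshot the
    (possibly mutated) cache so the next run can start from it."""
    materialize(snap, cache_dir)
    subprocess.run(argv, check=True)
    return snapshot(cache_dir)
```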
The non-append-only-ness has more to do with the cache not being concurrency-safe/shareable, for a few reasons: …

A few pieces that don't really fit together into a complete story: …
I had totally missed the absolute-paths part of mypy. Either of the bullet points at the bottom is quite exciting to consider. I really like the idea of invoking recursively, as I suspect we can make the path rewriting extremely fast, as well as generic enough to apply to other tools that write absolute paths. I would also really like to dive into using FUSE, like, so so much. I created this project, which makes it (ideally) super easy to virtualize all of the I/O of a JVM process: https://github.com/cosmicexplorer/upc/blob/7dc4d0ae3219e014b00de62adfb415432f6c0850/local/virtual-cli/client/MainWrapper.scala#L108-L122 -- but I think possibly the entire thing could be deleted if we could instead have pants create very-high-performance FUSE instances. Especially given @eed3si9n's success with virtualizing zinc's I/O (http://eed3si9n.com/cached-compilation-for-sbt), I think there is reason to believe FUSE would be a winning strategy, generic enough to avoid tons of separate incidental complexity. I'll probably look into reviving …
Hm. So after investigating a real `.mypy_cache` (from the pex repo):

```sh
# The paths are all located at a top-level key 'path'.
> <.mypy_cache/3.5/uuid.meta.json jq '.' | grep '/Users'
(standard input):62: "path": "/Users/dmcclanahan/tools/pex/.tox/typecheck/lib/python3.8/site-packages/mypy/typeshed/stdlib/2and3/uuid.pyi"

# The only files in the cache are '.json' files, with a lone '.gitignore'.
> find .mypy_cache -type f | sed -re 's#.*(\.[^\.]+)#\1#g' | sort -u
.gitignore
.json

# And it appears all of the paths are relative to the working directory, except ones from the stdlib!
> find .mypy_cache -type f | parallel "echo -n '{}:' && jq -r '.path' <{}" | head -n3
.mypy_cache/3.5/test_resolver.meta.json:tests/test_resolver.py
.mypy_cache/3.5/atexit.meta.json:/Users/dmcclanahan/tools/pex/.tox/typecheck/lib/python3.8/site-packages/mypy/typeshed/stdlib/3/atexit.pyi
.mypy_cache/3.5/pex/testing.data.json:pex/testing.py
```

It feels like we might be able to make use of a mutable cache here for the `.mypy_cache` directory. I think that materializing the typeshed stubs would definitely be a (smallish) slowdown -- but we could consider something like #8905 in that case, extending the symlinks we add to include specific large directories that we don't want to keep materializing. The API could be similar to the existing mutable cache API. Let me know if something's missing from that.
Ok, after some more investigation, there are some other unstable fields. For posterity, I'm going to post a gist containing the cache contents for the ….

At first glance, the only bad entries in ….

The bad entries in ….

The files aren't that large, so while we could attempt to rewrite them with a regex, it's likely to be safer and not much slower to rewrite them by parsing some json (this could be way wrong; see the sketch below). I'm not quite sure yet where we would rewrite them -- if someone could give some pointers as to what we need to be making deterministic here, I would love a tip. This should be enough info to go ahead and implement this, though.
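For illustration, a hedged sketch of what "rewrite them by parsing some json" might look like. The choice of `mtime`/`data_mtime` as the unstable fields is an assumption based on the discussion above, not a verified list:

```python
import json
from pathlib import Path

# Fields assumed (not verified) to be nondeterministic in mypy's *.meta.json files.
UNSTABLE_FIELDS = ("mtime", "data_mtime")


def normalize_cache_file(cache_file: Path, abs_prefix: str) -> None:
    """Zero out unstable fields and relativize absolute paths in one cache entry."""
    entry = json.loads(cache_file.read_text())
    for field in UNSTABLE_FIELDS:
        if field in entry:
            entry[field] = 0
    # Rewrite absolute paths (e.g. into the venv's typeshed) to a stable prefix.
    path = entry.get("path")
    if isinstance(path, str) and path.startswith(abs_prefix):
        entry["path"] = path[len(abs_prefix):].lstrip("/")
    # Sort keys so the rewritten file is byte-for-byte deterministic.
    cache_file.write_text(json.dumps(entry, sort_keys=True))


def normalize_cache(cache_dir: Path, abs_prefix: str) -> None:
    for f in cache_dir.rglob("*.meta.json"):
        normalize_cache_file(f, abs_prefix)
```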
Also, the mypy cache partitions itself by python version. I believe that should satisfy this requirement from the OP:

> …how we partition MyPy runs.
I think that the fully recursive method described in #10864 (comment) should possibly be doable now? And I think I'm getting the idea -- if we can reset the ….

There is a concern there about having to scan all the json files in the cache after each mypy invocation. If we could just look for the module owned by the current target, that would involve only scanning two files (the `.meta.json` and `.data.json` for that module).

If we strictly merge cache directories from leaf to root, we should be able to avoid redoing a ton of work. However, since the cache directory is (I believe) supposed to be consistent across the whole repo, it's possible we could be a little more intense and, at each depth of a breadth-first search of the dependency closure, merge the cache digests and use them for all mypy invocations at the next depth level of the BFS (sketched below). I think the latter might be necessary, but would be a little bit (I don't think too much) more complex to implement.
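A rough sketch of the level-by-level merging idea, assuming hypothetical `run_mypy`/`merge_digests` helpers (not real Pants APIs) and a neatly layered dependency graph:

```python
def depth_levels(roots, dependencies):
    """Group targets by first-seen depth in a BFS of the dependency graph.

    NOTE: this is a simplification -- a real implementation would use a
    proper topological ordering so that a dependency never shares a level
    with (or sits shallower than) one of its dependents.
    """
    seen, level, levels = set(roots), list(roots), []
    while level:
        levels.append(level)
        nxt = []
        for tgt in level:
            for dep in dependencies(tgt):
                if dep not in seen:
                    seen.add(dep)
                    nxt.append(dep)
        level = nxt
    return levels


def check_bottom_up(roots, dependencies, run_mypy, merge_digests, empty_digest):
    """Run mypy one BFS level at a time, from the leaves toward the roots,
    feeding every invocation at a level the merged .mypy_cache digests
    produced by all deeper levels."""
    merged = empty_digest
    for level in reversed(depth_levels(roots, dependencies)):
        # Every target at this depth starts from the same merged cache,
        # so these invocations can run in parallel.
        produced = [run_mypy(tgt, input_cache=merged) for tgt in level]
        merged = merge_digests([merged, *produced])
    return merged
```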
To be clear, I'm thinking of: …

To make it remotable, we need one more thing (I think): …
Yep, that's what I was thinking. As mentioned in Slack, the main issue with it is just that it's hard to know how fast or slow it would be (relative to daemonization approaches) without prototyping it.
This is mostly false if we use the sqlite cache. Actually... ditto for the FS-based cache, which uses an atomic rename for inserts, depending on how they use their own store API. Currently entries are keyed on (path, mtime), so we'd need to get a patch in to make this hash-based. With that, though, AFAICT we could store a sqlite database in a …
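For reference, mypy does expose a `--sqlite-cache` flag that keeps the whole cache in a single database file under the cache directory. A minimal sketch of an invocation (whether that file can safely live in a shared cache location is exactly the open question above):

```python
import subprocess
import sys


def run_mypy_sqlite(files: list[str], cache_dir: str) -> int:
    """Run mypy with its sqlite-backed cache, so the cache is a single
    database file under cache_dir instead of a tree of json files."""
    return subprocess.run(
        [
            sys.executable, "-m", "mypy",
            "--sqlite-cache",          # use the sqlite cache backend
            "--cache-dir", cache_dir,  # e.g. a shared named-cache location
            *files,
        ]
    ).returncode
```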
We were looking at improving our mypy performance, and we saw there's a mypy Bazel integration.
Can pants exploit this to provide better performance in both CI and local builds?
Sorry for the long delay in responding!

Yes, this would likely be beneficial. I had read through that thread, but totally missed that that flag actually existed on ….

Based on your summary (which is great, thanks), I think the outcome is that you could treat ….

If so, there are some great incrementality benefits there: although cold performance for an entire repository might be a little bit lower, check would actually be able to hit caches, and would have much, much better incremental performance by running only for transitively changed files (again, similar to compilation).

So yea: definitely worth looking into.
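For concreteness: as far as I can tell, the Bazel integration relies on mypy's `--bazel` and `--cache-map` options. A hedged sketch of such an invocation follows; the argument layout is my reading of that integration and is worth verifying against mypy's own docs:

```python
import subprocess
import sys


def check_with_explicit_cache(
    sources: list[str], cache_files: dict[str, tuple[str, str]]
) -> int:
    """Invoke mypy roughly the way the Bazel integration does: --bazel makes
    cache entries hermetic (relative paths, fake mtimes), and --cache-map
    says exactly where each source's .meta.json/.data.json live."""
    cache_map: list[str] = []
    for src in sources:
        meta, data = cache_files[src]  # caller-chosen cache file locations
        cache_map += [src, meta, data]
    cmd = [sys.executable, "-m", "mypy", "--bazel", "--cache-map", *cache_map, "--", *sources]
    return subprocess.run(cmd).returncode
```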
@stuhood do we have any other compiler examples that I can look at and try to hack on? Do you feel it's a weekend project?
This is likely to be a little bit of an undertaking. Python allows import cycles and would be checked at the file level, so it's probably most like the JVM. If you have familiarity with how JVM compilation works, then I could try to work up an explanation for this one. To deal with cycles, it would involve launching ….

All of that is relatively straightforward, and very similar to what we do for the JVM. But virtualenvs are built up differently from how a classpath can be composed (classpaths are "just" lists of files, whereas Python artifacts have an install step). We've experimented with recursive composition of Python artifacts, but it's likely that for the purposes of ….

So: all together then (but in relatively low detail, because this is probably a 1-2k line change): …
The existing …

Also, if we were going to be invoking …

OR, it's possible that …
I looked into this today. mypy certainly doesn't make this easy.

2 is the hot ticket here, really. 3 is actually really annoying, because we've littered the CWD (and additional subdirs) with cache files. We can work around this by playing tricks with CWD (sketched below); however, we then run into a second problem: this cache would ideally be shared (a la …).

I think if we reeeeeeeeeally wanted to we could: …
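As a sketch of the "tricks with CWD" workaround: run mypy from inside the sandbox but point its cache at a shared location outside of it (mypy honors the `MYPY_CACHE_DIR` environment variable, equivalent to `--cache-dir`). Whether sharing that location is actually safe is the second problem noted above:

```python
import os
import subprocess
import sys


def run_mypy_in_sandbox(sandbox: str, files: list[str], shared_cache: str) -> int:
    """Run mypy from a sandbox directory while keeping its cache in a shared
    location outside the sandbox, so repeated runs stay warm."""
    env = dict(os.environ, MYPY_CACHE_DIR=shared_cache)  # same as --cache-dir
    return subprocess.run(
        [sys.executable, "-m", "mypy", *files],
        cwd=sandbox,  # keeps cache files out of the repo's own CWD
        env=env,
    ).returncode
```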
Something like that might be worth it... disregarding Pants and running MyPy directly, the difference between cold and warm runs is pretty dramatic. We are wasting a lot of developer-productivity time by not leveraging MyPy's cache.
Actually …
Leverages mypy's cache to allow for "incremental" runs. The cache is an `append_only_cache`, which is a bit disingenuous here, but AFAICT mypy is process-safe w.r.t. the cache (I'll do more digging to be very sure).

Fixes #10864

[ci skip-rust]
[ci skip-build-wheels]
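I don't know mypy's cache internals well, but the usual pattern that makes a file-based cache process-safe is write-to-temp-then-atomic-rename, which was mentioned earlier in this thread. A minimal sketch of that pattern:

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(dest: Path, obj: object) -> None:
    """Write a cache entry so concurrent readers never observe a partial
    file: write to a temp file in the same directory, then rename."""
    fd, tmp = tempfile.mkstemp(dir=dest.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(obj, f, sort_keys=True)
        os.replace(tmp, dest)  # atomic on POSIX within one filesystem
    except BaseException:
        os.unlink(tmp)  # clean up if the write or rename failed
        raise
```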
@thejcannon, this is awesome! For posterity, can you explain why it is OK after all to use an `append_only_cache`?
I'll comment on the PR.
(Also linking #16290 to this ticket)
This is non-trivial because MyPy's cache is not append-only and mutates itself. We need a way to safely use its cache in a way that still works with things like remote execution and how we partition MyPy runs.