
Improve MyPy performance #10864

Closed · Eric-Arellano opened this issue Sep 25, 2020 · 22 comments · Fixed by #16276

@Eric-Arellano
Contributor

This is non-trivial because MyPy's cache is not append-only and mutates itself. We need to safely use its cache in a way that still works with things like remote execution and how we partition MyPy runs.

@stuhood
Member

stuhood commented Sep 25, 2020

Improve MyPy performance by leveraging its cache

Or by parenting instances of its daemon: see https://docs.google.com/document/d/1n_MVVGjrkTKTPKHqRPlyfFzQyx2QioclMG_Q3DMUgYk/edit#bookmark=id.rla7dsf86tib

@cosmicexplorer
Contributor

cosmicexplorer commented Sep 27, 2020

It's possible this could be accomplished with an extension of the proposal in #10870 to allow non-append-only caches, by re-snapshotting the cache directory on each process execution. This would make it remote-friendly without any further work after #10870 is implemented. I have noted this use case as a possible extension in that issue.

I'm less familiar with how we plan to implement parenting instances of the mypy daemon, but it seems clear we would want to support persistent processes somehow, and this seems like exactly the right use case to try that on. It's possible that the two approaches (parenting the daemon, and snapshotting the cache directory) are fully complementary; I can't tell which would be more immediately useful.

@stuhood
Member

stuhood commented Sep 27, 2020

It's possible this could be accomplished with an extension of the proposal in #10870 to allow non-append-only caches, by re-snapshotting the cache directory on each process execution. This would make it remote-friendly without any further work after #10870 is implemented. I have noted this use case as a possible extension in that issue.

The non-append-only-ness has more to do with not being concurrency-safe/shareable, for a few reasons:

  1. It expects to be located inside the buildroot, and uses absolute paths. Locating it outside the buildroot at a non-buildroot-specific absolute path might require rewriting files (similar to what we did for Zinc, which was ... painful/error-prone).
  2. It doesn't expect to have multiple writers (which is probably the key differentiator for an "append-only" or well-behaved cache), but we do expect to have multiple concurrent mypy processes (although perhaps we could constrain to one instance per interpreter-constraint at a time).

A few pieces that don't really fit together into a complete story:

  • It may be possible to invoke mypy recursively (per target) and do lots of rewriting (of both paths and timestamps) of process-cache CAS entries (nb: not mutable caches) in wrapper code during per-process setup/teardown, allowing them to be used at relative paths within the sandbox directory but stored elsewhere.
  • Keep instance(s) of the mypy daemon (or a Bazel-worker-API-like wrapper around the basic CLI API) up in dedicated/owned working directories, and then "sync" relevant files into the workspace (if they use file watching... or maybe FUSE otherwise?)

@cosmicexplorer
Contributor

I had totally missed the absolute-paths part of mypy. Either of the bullet points at the bottom is quite exciting to consider. I really like the idea of invoking recursively, as I suspect we can make the path rewriting extremely fast, as well as generic enough to apply to other tools that write absolute paths.

I would also really like to dive into using FUSE, like so, so much. I created this project which makes it super easy (ideally) to virtualize all of the I/O of a JVM process: https://github.com/cosmicexplorer/upc/blob/7dc4d0ae3219e014b00de62adfb415432f6c0850/local/virtual-cli/client/MainWrapper.scala#L108-L122, but I think possibly the entire thing could be deleted if we could instead have pants create very-high-performance FUSE instances. Especially given @eed3si9n's success with virtualizing zinc's I/O (http://eed3si9n.com/cached-compilation-for-sbt), I think that there is reason to believe FUSE would be a winning strategy that could be generic enough to avoid tons of separate incidental complexity.

I'll probably look into reviving brfs and incorporating some of the ideas from https://github.com/cosmicexplorer/upc.

@cosmicexplorer
Contributor

cosmicexplorer commented Sep 27, 2020

Hm. So after investigating a .mypy_cache/ directory from the pex repo a bit, I found something somewhat surprising: it appears that the only absolute paths it contains are for types from the stdlib (which I believe should not change at all). The rest actually seem to be...relative to the working directory? I'm not 100% sure how to fit this together yet, but here are the results of my quick investigation:

```sh
# The paths are all located at a top-level key 'path'.
> <.mypy_cache/3.5/uuid.meta.json jq '.' | grep '/Users'
(standard input):62:  "path": "/Users/dmcclanahan/tools/pex/.tox/typecheck/lib/python3.8/site-packages/mypy/typeshed/stdlib/2and3/uuid.pyi"
# The only files in the cache are '.json' files, with a lone '.gitignore'.
> find .mypy_cache -type f | sed -re 's#.*(\.[^\.]+)#\1#g' | sort -u
.gitignore
.json
# And it appears all of the paths are relative to the working directory, except ones from the stdlib!
> find .mypy_cache -type f | parallel "echo -n '{}:' && jq -r '.path' <{}" | head -n3
.mypy_cache/3.5/test_resolver.meta.json:tests/test_resolver.py
.mypy_cache/3.5/atexit.meta.json:/Users/dmcclanahan/tools/pex/.tox/typecheck/lib/python3.8/site-packages/mypy/typeshed/stdlib/3/atexit.pyi
.mypy_cache/3.5/pex/testing.data.json:pex/testing.py
```

It feels like we might be able to make use of a mutable cache here for the .mypy_cache directory, without having to rewrite any files. It would end up reaching out to somewhere on the disk to get the stdlib typeshed stubs, but I believe that can also be configured with a command-line option (so we could resolve mypy into a pex, extract the .deps/, then point it into there).

I think that materializing the typeshed stubs would definitely be a (smallish) slowdown -- but we could consider something like #8905 in that case, extending the symlinks we add to include specific large directories that we don't want to keep materializing. The API could be similar to the existing mutable cache API.
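As a concrete sketch of that last point, the invocation could look roughly like this (assuming mypy's --custom-typeshed-dir flag is the command-line option alluded to above; the mypy_deps/ layout is hypothetical):

```python
# Sketch: run mypy against typeshed stubs extracted from the mypy PEX's
# .deps/ directory, so the cache records no machine-specific absolute paths.
import subprocess

result = subprocess.run(
    [
        "mypy",
        "--custom-typeshed-dir", "mypy_deps/mypy/typeshed",  # materialized stubs (hypothetical path)
        "--cache-dir", ".mypy_cache",  # keep the cache inside the sandbox
        "src/project/app.py",
    ],
    capture_output=True,
)
print(result.stdout.decode())  # mypy exits 0 on success, 1 on type errors
```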

Let me know if something's missing from that.

@cosmicexplorer
Contributor

cosmicexplorer commented Sep 27, 2020

Ok, after some more investigation, there are some other unstable fields. For posterity, I'm going to post a gist containing the cache contents for the uuid module: https://gist.github.com/cosmicexplorer/258a9947589fb8d3500da2bd6ec5cea5.

At first glance, the only bad entry in uuid.data.json is path, which is only absolute for typeshed types and could be mitigated by materializing them into the execution directory as per #10864 (comment).

The bad entries in uuid.meta.json are more numerous, but the file is also much smaller. path is still there with the same caveats, but in addition we have data_mtime and mtime, which presumably contain the modification times for uuid.meta.json and uuid.data.json. I believe that everything else should remain constant if the mypy version and command-line args are kept constant. There's a platform key, but that should be fine since we are discussing a local cache.

The files aren't that large, so while we could attempt to rewrite them with a regex, it's likely to be safer and not much slower to rewrite them by parsing some json (this could be way wrong). I'm not quite sure yet where we would rewrite them -- if someone could give some pointers as to what we need to be making deterministic here, I would love a tip. This should be enough info to go ahead and implement this, though.
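For illustration, the json-parsing rewrite could be as small as this sketch (the mtime/data_mtime field names come from the investigation above; where exactly this hook should live is the open question):

```python
# Zero out the timestamp fields in every *.meta.json under the cache dir,
# leaving all other entries (including the relative 'path' values) untouched.
import json
from pathlib import Path

def normalize_mypy_cache(cache_dir: str = ".mypy_cache") -> None:
    for meta in Path(cache_dir).rglob("*.meta.json"):
        entry = json.loads(meta.read_text())
        entry["mtime"] = 0
        entry["data_mtime"] = 0
        meta.write_text(json.dumps(entry, sort_keys=True))

normalize_mypy_cache()
```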

@cosmicexplorer
Contributor

cosmicexplorer commented Sep 27, 2020

Also, the mypy cache partitions itself by python version. I believe that should satisfy (from the OP):

how we partition MyPy runs.

I think the fully recursive method described in #10864 (comment) may now be doable. I think I'm getting the idea -- if we can reset data_mtime and mtime to 0 for all json files in the cache after each invocation, and materialize typeshed into the dir instead of letting mypy pick an absolute path, then we should be able to just digest the "cache directory" in each invocation for each target and merge the results. That seems 100% remotable too.

There is a concern there about having to scan all the json files in the cache after each mypy invocation. If we could just look for the module owned by the current target, that would involve only scanning two files (the *.data.json and the *.meta.json) after each invocation, resetting their modified timestamps, then digesting them. I think that's what we would do if we did a fully recursive invocation as Stu described above, but I would want someone to check my work on that. If we really do have to check all the files in the cache on each invocation, we should consider using a mutable cache, or using regex from Rust instead of parsing and rewriting json from Python.

If we strictly merge cache directories from leaf to root, we should be able to avoid redoing a ton of work. However, since the cache directory is (I believe) supposed to be consistent across the whole repo, it's possible we could be a little more intense, and at each depth of a breadth-first search of the dependency closure, merge the cache digests and use them for all mypy invocations at the next depth level of the BFS. I think the latter might be necessary, but would be a little bit (I don't think too much) more complex to implement.

@cosmicexplorer
Contributor

cosmicexplorer commented Sep 27, 2020

To be clear, I'm thinking of:

  • Put typeshed types into the input_digest for each mypy execution, and use the CLI option to make it look for types there.
  • Run mypy recursively, once per python target (which I believe is now once per python source file, plus any stub).
  • After every invocation, read all the files (?) in the "cache dir" (which is just a normal process exec output dir), and rewrite their json contents to ensure data_mtime and mtime are set to 0 in all *.meta.json files.
  • In the mypy recursive invocation, merge digests for the mypy "cache dir" at each depth level of a BFS of all targets and their transitive dependencies (see details above, and the sketch at the end of this comment).

To make it remotable, we need one more thing (I think):

  • Rewrite the platform key in all *.meta.json files to point to the correct platform.
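To make the merging step concrete, here is a runnable toy model of the level-by-level merge (every name here is an illustrative stand-in, not a real Pants API):

```python
# Digests are modeled as plain {path: content} dicts, and run_mypy is a stub
# for a sandboxed process execution that consumes a merged cache snapshot and
# emits an updated one.
def merge_digests(digests):
    merged = {}
    for digest in digests:
        merged.update(digest)  # later snapshots win on conflicting paths
    return merged

def run_mypy(target, cache_input):
    # Stand-in: really this would seed a sandbox with cache_input, invoke
    # mypy, and capture .mypy_cache/ as the output digest.
    return {f".mypy_cache/3.8/{target}.meta.json": f"checked {target}"}

def check_all(levels):
    """levels: targets grouped by BFS depth, dependencies before dependents."""
    merged_cache = {}
    for level in levels:
        snapshots = [run_mypy(target, merged_cache) for target in level]
        # Merge every snapshot from this depth before moving up a level.
        merged_cache = merge_digests([merged_cache, *snapshots])
    return merged_cache

print(check_all([["pex/util"], ["pex/testing", "pex/resolver"]]))
```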

@stuhood
Member

stuhood commented Sep 28, 2020

I think the fully recursive method described in #10864 (comment) may now be doable. If we can reset data_mtime and mtime to 0, and materialize typeshed into the dir instead of letting mypy pick an absolute path, then we should be able to just digest the "cache directory" in each invocation for each target and merge the results. That seems 100% remotable too.

Yep, that's what I was thinking. As mentioned in slack, the main issue with it is just that it's hard to know how fast or slow it would be (relative to daemonization approaches) without prototyping it.

@stuhood changed the title from "Improve MyPy performance by leveraging its cache" to "Improve MyPy performance" on Nov 5, 2020
@jsirois
Contributor

jsirois commented Feb 11, 2021

This is non-trivial because MyPy's cache is not append-only and mutates itself.

This is mostly false if we use the sqlite cache:
https://github.com/python/mypy/blob/538d36481526135c44b90383663eaa177cfc32e3/mypy/metastore.py#L140-L223

Actually ... ditto for the FS-based cache, which uses an atomic rename for inserts, depending on how they use their own store API.

Currently entries are keyed on (path, mtime) so we'd need to get in a patch to make this hash-based. With that though, afaict, we could store a sqlite database in a Process.append_only_caches entry.
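In Process terms that might look roughly like this sketch (the cache name and sandbox path are arbitrary, --sqlite-cache is mypy's flag for the sqlite store linked above, and this still assumes the hash-based keying patch lands):

```python
# Run mypy with its sqlite cache stored in a Pants named cache.
from pants.engine.process import Process

process = Process(
    argv=[
        "mypy",
        "--sqlite-cache",                    # use the sqlite metastore
        "--cache-dir", ".cache/mypy_cache",  # resolved inside the sandbox
        "src/app.py",
    ],
    description="Typecheck src/app.py with a shared mypy cache",
    # Pants mounts the persistent named cache at this relative sandbox path.
    append_only_caches={"mypy_cache": ".cache/mypy_cache"},
)
```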

@yoav-orca
Contributor

We were looking at improving our mypy performance and we saw there's mypy bazel integration.
There's an undocumented flag --bazel that changes the way mypy works in a few ways:

  1. Normalizes paths to be relative
  2. Returns 0 mtime
  3. Changes the cache dir to be mypy's cwd
  4. Generally delegates the cache consistency checks to Bazel
  5. Changes the module search path

python/mypy#3912

Can pants exploit this to provide better performance in both CI and local builds?

@stuhood
Member

stuhood commented Dec 20, 2021

Sorry for the long delay responding!

We were looking at improving our mypy performance and we saw there's mypy bazel integration.
There's an undocumented flag --bazel that changes the way mypy works in a few ways:

Yes, this would likely be beneficial. I had read through that thread, but totally missed that that flag actually existed on main as undocumented.

Based on your summary (which is great, thanks), I think that the outcome is that you could treat mypy more like a compiler which consumes the outputs of "compiling"/checking its dependencies. The cache would become a relocatable artifact passed between invokes of mypy through the dependency graph.

If so, there are some great incrementality benefits there: although cold performance for an entire repository might be a little bit lower, check would actually be able to hit caches, and would have much, much better incremental performance by running only for transitively changed files (again, similar to compilation).

So yea: definitely worth looking into.

@yoav-orca
Contributor

@stuhood do we have any other compiler examples that I can look at and try to hack? Do you feel it's a weekend project?

@stuhood
Member

stuhood commented Dec 22, 2021

This is likely to be a little bit of an undertaking.

Python allows import cycles and would be checked at the file level, so it's probably most like the JVM. If you have familiarity with how JVM compilation works, then I could try to work up an explanation for this one. To deal with cycles, it would involve launching mypy recursively on CoarsenedTarget nodes: each "coarsened" (i.e. cycle broken: see the link) target would then recursively depend on the mypy output for each of its dependencies.

All that is relatively straightforward, and very similar to what we do for the JVM. But virtualenvs are built up differently from how a classpath can be composed (classpaths are "just" lists of files, whereas Python artifacts have an install step). We've experimented with recursive composition of Python artifacts, but it's likely that for the purposes of mypy we'd need to provide all of the thirdparty dependencies as a single virtualenv instead (rather than building a precise virtualenv per recursive mypy invoke), similar to what @benjyw did with #13732.

So: all together then (but in relatively low detail, because this is probably a 1-2k line change):

  1. Compute CoarsenedTarget roots for the input addresses (example).
  2. For each root, determine which thirdparty resolve to use (cc @Eric-Arellano), and build a "repository"-"all-requirements" PEX for it (à la An option to test/run/repl against the entire lockfile. #13732).
  3. Create a dataclass "request" type that contains both the CT and the resolve, and kick off recursive mypy checking with the request.
  4. The recursive @rule to check mypy would look a lot like a JVM compile @rule: the best example of that is probably scalac, which:
    a. Requests the outputs of mypy for its dependencies,
    b. Mashes together the sources of all of the Targets inside the CoarsenedTargets (basically: all of the files within an import cycle),
    c. Sets up the thirdparty inputs and dependencies for the tool invoke,
    d. Invokes the tool, and captures its outputs.

The existing mypy rules would also be really helpful to look at.
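Sketched as code, the recursive @rule might look like the skeleton below (the request/result types are invented for illustration, and steps (b)-(d) are elided):

```python
from dataclasses import dataclass

from pants.engine.rules import Get, MultiGet, rule

@dataclass(frozen=True)
class MyPyPartialCheckRequest:
    coarsened_target: "CoarsenedTarget"  # one cycle-broken component (step 1)
    resolve: str                         # which thirdparty resolve to use (step 2)

@dataclass(frozen=True)
class MyPyPartialCheckResult:
    exit_code: int
    cache_digest: "Digest"  # mypy cache output, consumed by dependents

@rule
async def mypy_check_partial(request: MyPyPartialCheckRequest) -> MyPyPartialCheckResult:
    # (a) Recurse: request the mypy outputs of each dependency component.
    dep_results = await MultiGet(
        Get(MyPyPartialCheckResult, MyPyPartialCheckRequest(dep, request.resolve))
        for dep in request.coarsened_target.dependencies
    )
    # (b) Mash together the sources of all targets in this cycle,
    # (c) set up the thirdparty inputs (the per-resolve requirements PEX)
    #     plus the dep_results cache digests,
    # (d) invoke mypy and capture its cache directory as this node's output.
    ...
```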

@stuhood
Member

stuhood commented Jan 5, 2022

Also, if we were going to be invoking mypy on much smaller sets of files, #14070 is likely a blocker (in order to reduce per-invoke overheads). There is also some discussion on pex-tool/pex#1423 of whether the overheads would even be low enough to support invoking at this granularity.

@stuhood
Member

stuhood commented Jan 20, 2022

Also, if we were going to be invoking mypy on much smaller sets of files, #14070 is likely a blocker (in order to reduce per-invoke overheads). There is also some discussion on pantsbuild/pex#1423 of whether the overheads would even be low enough to support invoking at this granularity.

OR, it's possible that CoarsenedTarget could be expanded to batch files (à la #14186) while coarsening. That would have the effect of invoking mypy on a "chunkier" graph of files, which could still get cache hits for chunks, without paying the sandbox setup cost for individual files. This would also benefit the JVM.

@thejcannon
Member

We were looking at improving our mypy performance and we saw there's mypy bazel integration. There's an undocumented flag --bazel that changes the way mypy works in a few ways:
  1. Normalizes paths to be relative
  2. Returns 0 mtime
  3. Changes the cache dir to be mypy's cwd
  4. Generally delegates the cache consistency checks to Bazel
  5. Changes the module search path

python/mypy#3912
Can pants exploit this to provide better performance in both CI and local builds?

I looked into this today. mypy certainly doesn't make this easy.

Number 2 is really the hot ticket here.

Number 3 is actually really annoying because it litters the CWD (and additional subdirs) with cache files. We can work around this by playing tricks with the CWD; however, we then run into a second problem: this cache would ideally be shared (à la append_only_caches), but then we need to write the files atomically and figure out the "how to add and not edit" issue, which is almost prohibitively hard. (The atomicity is hard too because mypy uses two files for metadata, so there's almost guaranteed to be a race condition.)

I think if we reeeeeeeeeally wanted to we could:

  1. Initialize the sandbox with a copy of the cache (using append_only_caches)
  2. Run mypy using the copied cache as the CWD. Also, for every source file, pass --cache-map and use cache file names that are unique based on the file's hash (this way we only ever append)
  3. After mypy runs, atomically copy files from the CWD to the real cache (where "atomically" covers not just a single file but each meta/data file pair; or copy them "in the right order", e.g. the data file then the meta file; see the sketch below)
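A sketch of that write-back step (paths and layout are illustrative; the key point is copying each data file before its meta file, so a reader never sees a meta entry without its data):

```python
# Copy cache pairs back to the shared cache: data first, then meta, each via
# a temp file plus an atomic rename.
import os
import shutil
from pathlib import Path

def write_back(sandbox_cache: Path, shared_cache: Path) -> None:
    for meta in sandbox_cache.rglob("*.meta.json"):
        data = meta.parent / (meta.name[: -len(".meta.json")] + ".data.json")
        for src in (data, meta):  # order matters: data before meta
            dest = shared_cache / src.relative_to(sandbox_cache)
            dest.parent.mkdir(parents=True, exist_ok=True)
            tmp = dest.parent / (dest.name + ".tmp")
            shutil.copy2(src, tmp)
            os.replace(tmp, dest)  # per-file atomic rename

write_back(Path(".mypy_cache"), Path("/path/to/shared/mypy_cache"))
```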

@Eric-Arellano
Contributor Author

Something like that might be worth it... disregarding Pants and running MyPy directly, the performance difference between cold and warm runs is pretty dramatic. We are wasting a lot of developer-productivity time by not leveraging MyPy's cache.

@thejcannon
Member

Actually --skip-cache-mtime-check "works". So the only challenge here might be a repo-specific subdir in the named cache.

thejcannon added a commit that referenced this issue Jul 23, 2022
Leverages mypy's cache to allow for "incremental" runs. The cache is an `append_only_cache` which is a bit disingenuous here, but AFAICT mypy is process-safe w.r.t. the cache (I'll do more digging to be very sure).

Fixes #10864

[ci skip-rust]
[ci skip-build-wheels]
@benjyw
Contributor

benjyw commented Jul 24, 2022

@thejcannon, this is awesome! For posterity, can you explain why it is OK after all to use an append_only_cache?

@thejcannon
Member

I'll comment on the PR.

jyggen pushed a commit to jyggen/pants that referenced this issue Jul 27, 2022
…16276)

Leverages mypy's cache to allow for "incremental" runs. The cache is an `append_only_cache` which is a bit disingenuous here, but AFAICT mypy is process-safe w.r.t. the cache (I'll do more digging to be very sure).

Fixes pantsbuild#10864

[ci skip-rust]
[ci skip-build-wheels]
@wimax-grapl
Contributor

(Also linking #16290 to this ticket)
