Feature: Reduce Mypy's Cache Size #15731
We could also pickle it while we're at it, or use some other serializer more compact than JSON. We could also compress it. All of these sound better than introducing file-size optimization tricks into individual (de)serializers.
Good idea, I will look into this more. Some thoughts/comments:
I finally got around to running the numbers, and we can reduce the cache size by roughly 5.5x by gzipping the JSON files:
The times are an average of 3 runs. This was done without compiling with mypyc. These compression results look promising, especially considering that they barely affect runtime performance while significantly reducing the cache size. This was a simple 4-line change to compress/decompress the JSON files using gzip instead of writing them directly.

If we choose to go forward with this, we would need to support detection of older, non-compressed JSON files. I have yet to experiment with storing all the cache data in one JSON file instead of lots of small ones; in theory we could read and decompress the entire cache once at startup, which might be faster than reading each file individually.

Also, here's a note regarding pickling the data that I found during my investigation: #932 (comment). Essentially, pickling is far more volatile compared to JSON. I could look into other binary serializers, but I feel the easiest and most impactful option is to compress the JSON.
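For reference, the change described above amounts to something like the following sketch. The helper names are hypothetical, not mypy's actual cache I/O functions, which live elsewhere and differ in detail:

```python
import gzip
import json


def write_cache_json(path: str, data: dict) -> None:
    # compresslevel=1 favors speed over ratio, which suits a build cache
    # that is rewritten frequently.
    with gzip.open(path, "wt", encoding="utf-8", compresslevel=1) as f:
        json.dump(data, f, separators=(",", ":"))


def read_cache_json(path: str) -> dict:
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)
```

Since `gzip.open` supports text mode directly, the JSON layer doesn't need to know the file is compressed at all.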
Thanks for running these experiments; I would be supportive of using gzip level 1. I don't think we'd need to support detection of older, non-compressed files, since we already don't read caches across mypy versions.
Reducing cache size would be useful in general. However, there are a few things that may cause issues for some users. Switching to a compressed cache would break some use cases we have at work where we manipulate mypy cache files. Our mypy runner script supports downloading cache snapshots from a remote server, and these snapshots are compressed using LZMA, which compresses better than gzip but is much slower when compressing, so it only makes sense when cache files are downloaded over the network. Compressing previously compressed files isn't effective, so we'd probably have to implement an extra step to decompress the cache files and compress them again using LZMA before uploading, and similarly after downloading. The second use case involves a tool that reads and parses mypy cache files. We'd need to add a decompression step to that tool, which should be simple.
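The recompression point is easy to demonstrate: LZMA applied to already-gzipped bytes gains little compared to LZMA applied to the raw JSON. The payload below is made-up JSON-like data standing in for a cache file's contents:

```python
import gzip
import lzma

# Repetitive, structured text, loosely resembling serialized cache entries.
raw = "".join(
    f'{{"id": {i}, "name": "node_{i}", ".class": "Var"}}' for i in range(2000)
).encode("utf-8")

gzipped = gzip.compress(raw, compresslevel=1)
lzma_raw = lzma.compress(raw)      # LZMA over the raw bytes compresses well
lzma_gz = lzma.compress(gzipped)   # LZMA over gzip output barely shrinks it

print(len(raw), len(gzipped), len(lzma_raw), len(lzma_gz))
```

So a remote-cache pipeline would want to decompress the gzip layer before recompressing with LZMA, exactly as described above.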
We primarily care about performance when compiled with mypyc, so it would be important to have performance measurements with a compiled mypy. Also, performance measurements are hard to do accurately, due to many potential sources of noise. Generally an average of at least 10 runs is needed (and reporting % standard deviation would be nice), and runs of different variants should be interleaved (e.g. run "variant 1, variant 2, variant 3, variant 1, variant 2, ..." instead of "variant 1, variant 1, variant 1, variant 2, variant 2, ..."). The number of other running processes should be reduced to a minimum (e.g. close browser windows and background services that use CPU). Finally, some laptops have aggressive CPU throttling, so using a desktop computer is usually better.
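A harness following that advice might look like the sketch below. `bench` is a hypothetical helper, not part of mypy; each variant is a zero-argument callable so it could wrap a `subprocess.run([...])` invocation of each mypy build:

```python
import statistics
import time
from typing import Callable


def bench(
    variants: dict[str, Callable[[], None]], runs: int = 10
) -> dict[str, tuple[float, float]]:
    """Time each variant `runs` times, interleaved; return (mean, % stddev)."""
    times: dict[str, list[float]] = {name: [] for name in variants}
    for _ in range(runs):
        # Interleave variant 1, variant 2, ..., then repeat, so slow drift
        # in machine state (thermal throttling, background load) affects
        # all variants roughly equally.
        for name, fn in variants.items():
            start = time.perf_counter()
            fn()
            times[name].append(time.perf_counter() - start)
    results: dict[str, tuple[float, float]] = {}
    for name, ts in times.items():
        mean = statistics.mean(ts)
        pct_sd = 100 * statistics.stdev(ts) / mean
        results[name] = (mean, pct_sd)
    return results
```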
Thanks, this is really helpful info to keep in mind. I found a script for this. Regardless, here's the data it's currently giving me for 10 trials, compiled with O3 and no debug info:

Baseline (010da0b)
This PR (936ba7d)
I assume that this script was made when the self-check invocation looked like this:

```python
execute([
    "python3", "-m", "mypy",
    "--config-file", "mypy_self_check.ini",
    "-p", "mypy",
    "-p", "mypyc",
])
```
Reducing Mypy's Cache Size
This is a meta-issue discussing different ways to reduce Mypy's cache size. I've been working on a branch in my free time, though it is probably too big for a single PR, so I thought it would be best to get everyone's opinion on which optimizations (if any) would be worthwhile.
In short, I've reduced Mypy's filesystem cache by about 27% using a few different techniques.
Here is a breakdown of each of the commits, what I did to reduce the cache size, and by how much. We probably don't need to include all of these techniques since a lot of them only marginally reduce the cache size. The cache I've been using as a comparison is Mypy's own cache when checking itself.
The Numbers
| Change | Commit |
| --- | --- |
| master (baseline) | 763a94d5e |
|  | 623266f47 |
|  | 6cefbfb27 |
| `builtins.` prefix for common types | 27e9e0d56 |
| `builtins.object` in MRO because everything derives from it | d2d0aa005 |
|  | cce01a60f |
|  | 88b5b6a3d |
| `arg_names` and `arg_kinds` for func defs with type info | 4275d51b5 |
|  | 0a4dfeab2 |
| `.class` key with empty string | abf66951c |
| `NoneType` node as `"None"` string literal | 22b91d94d |
| `def_extras` usage | 98c4ce817 |
| `Instance` args if they're empty |  |

The best techniques based on total savings are:

- `.class` key with empty string (4.4%)
- `builtins.` prefix for common types (3.0%)
- `arg_names` and `arg_kinds` for func defs with type info (3.0%)

After that point everything starts to drop off, though they still might be worth including.
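To make one of these techniques concrete, the `builtins.` prefix idea could be sketched roughly as below. The function names and the set of common types are illustrative, not mypy's actual serializer code:

```python
# Fully qualified names like "builtins.str" appear constantly in the cache;
# writing just "str" and restoring the prefix on load saves bytes on every
# occurrence.
COMMON_BUILTINS = {"object", "str", "int", "float", "bool",
                   "list", "dict", "set", "tuple", "function"}


def pack_fullname(fullname: str) -> str:
    mod, _, name = fullname.rpartition(".")
    if mod == "builtins" and name in COMMON_BUILTINS:
        return name
    return fullname


def unpack_fullname(packed: str) -> str:
    # A bare name can only have come from pack_fullname, since real
    # fullnames in the cache are always module-qualified.
    if "." not in packed:
        return "builtins." + packed
    return packed
```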
Backwards Compatibility
I made sure to check that my changes would not break backwards compatibility. These new techniques allow loading of both old and new caches, though of course the new cache format will be used whenever cache files need to be rebuilt.
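The compatibility pattern these commits rely on can be sketched as follows (the field names here are hypothetical): the serializer omits a field when it holds its default value, and the deserializer restores the default, so old caches that always wrote the field still load alongside new ones:

```python
from typing import Any


def serialize_instance(node: dict[str, Any]) -> dict[str, Any]:
    data = dict(node)
    # New format: drop `args` when empty (the common case).
    if data.get("args") == []:
        del data["args"]
    return data


def deserialize_instance(data: dict[str, Any]) -> dict[str, Any]:
    node = dict(data)
    # Old caches always include `args`; new ones may omit it.
    node.setdefault("args", [])
    return node
```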
Why?
I was poking around the cache folder to see why the cache was the size it was, and I noticed that there were a lot of optimizations that could be made. In theory, smaller caches are quicker to save and load, while taking up less space on the user's computer. In CI systems where storage space is metered, a smaller cache will speed up CI workflows and use less cloud storage.
Let me know if this is something you would be interested in! If so, I'll start splitting this into separate PRs. Thanks!