Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bazel build integration #3912

Open
sixolet opened this issue Sep 3, 2017 · 38 comments
Open

Bazel build integration #3912

sixolet opened this issue Sep 3, 2017 · 38 comments

Comments

@sixolet
Copy link
Collaborator

sixolet commented Sep 3, 2017

I've been playing with getting Mypy to integrate with bazel for typechecking. I've got a basic version of it working. I intend to describe it here for discussion of the general approach, and then post a series of PRs for individual things it needs as I clean them up, if the general approach seems right.

The general approach: Make a mypy_library and mypy_binary macro that puts a typechecking aspect on the python files it contains. The aspect calculates the mypy cache files; give mypy a mode where it stops considering itself responsible for calculating freshness of dependencies, and instead let bazel handle that.

Specific PRs I expect to make:

  • A --cache-hermetically option to stop writing mtimes
  • A --cache-mapping option to specify a json file containing .data.json and .meta.json file locations by module id, both for cache files to read and new cache files to write (bazel likes being super specific about things like this; it seems cleaner to specify to mypy than to rely on mypy's cache directory layout). This option would also short-circuit the process of looking for source and checking it for freshness for the covered modules; just believe the cache files bazel is handing you.
  • Possibly a --module-without-init option (or some name, I don't know what it should be called) to allow mypy to believe things are modules without an __init__.py file. Bazel for some reason doesn't care whether you have them in your python source when you build. I can possibly hack around this by touching that file myself in the build directory in the bazel rule, instead.
  • A --ignore-module-prefix option to understand that certain parts of the module path are already accounted for, and all modules within the prefix to ignore should be treated as if they have empty __init__.py.
  • An actual Skylark implementation of the python typechecking macros, and a BUILD file for mypy.

That would be enough to get it working (and I have a prototype of that, which I need to split into separate PRs and clean up a lot if it seems an ok approach). To make it awesomer:

  • Speed up mypy's startup time. There might be low-hanging fruit?
  • Precompute the cache files for builtins and typeshed, so we never need to go looking for source there.
  • Implement some dependency pruning, so you don't even need to load all the cache files all the time.

I'm submitting this description before I'm ready to submit PRs deliberately, so I am not super attached to any one approach. Suggestions?

@ethanhs
Copy link
Collaborator

ethanhs commented Sep 3, 2017

--module-without-init

I don't think this is a good idea. The issue sounds more to be with Bazel than Mypy, and I think it makes more sense to modify the build script until Bazel fixes this.

Precompute the cache files for builtins and typeshed, so we never need to go looking for source there.

I really, really like this idea. The typeshed information should be immutable therefore this makes a lot of sense. If someone is using a custom typeshed directory, it might get a bit more complicated, but in the general case and for releases, this will be really nice. Mypy currently spends 1.5 seconds on an empty file, so I do hope there are ways to improve startup time.

@JukkaL
Copy link
Collaborator

JukkaL commented Sep 3, 2017

Some thoughts below. Note that this our internal team priority is to speed up mypy though a daemon mode instead of a build tool like bazel. Our review bandwidth may be lower than normal until the end of year, so there's a risk that bigger PRs in particular will be slow to review. It's better if we can first reach agreement on the design of new features at an abstract level before looking at PRs, as design reviews are much less work (and can save also implementation work in case the design gets iterated on a lot).

Speed up mypy's startup time. There might be low-hanging fruit?

I have a few ideas that may be helpful:

  • Make sure you don't compile the .pyc/.pyo files during each mypy run.
  • Remove type annotations in the released mypy version. Initializing annotations takes some time. After the recent optimizations the potential savings might be marginal, though.
  • Profile the time needed to evaluate the top level of each module in mypy, and see if there are things that are slower than expected. Named tuples are known to slow down program startup, for example.
  • See if a lot of code gets imported that isn't actually needed for most runs, and see if they can be lazily imported. Running code coverage might be an easy way to do this.

The non-bazel answer is to run mypy as a daemon and to use a small client that starts up very quickly. However, this probably is not a good match for bazel (though I've heard that it's possible). A related idea is to somehow keep a fully initialized mypy process running and just fork it whenever a fresh run is needed. This would provide isolation between mypy runs.

Precompute the cache files for builtins and typeshed, so we never need to go looking for source there.

This can't happen in mypy itself, since the cache files depend on mypy options such as platform and target Python version. However, it may make sense to have a build artifact in your bazel setup for typeshed with bundled cache files, and to use it instead of the standard typeshed though --custom-typeshed-dir.

A --cache-hermetically option to stop writing mtimes

Before adding a large number of command-line options, I'd like to see the output of mypy --help cleaned up so that it only shows the most commonly used options. We could point to the docs for information about other options, or perhaps we could have a separate help option for displaying verbose help with all options.

Implement some dependency pruning, so you don't even need to load all the cache files all the time.

Something like this could be pretty useful. Can you create a separate issue for this? This would be useful outside bazel as well.

@gvanrossum
Copy link
Member

Sorry for the slow response!

Regarding --module-without-init: this is separately being proposed as --namespace-packages in a different issue: #2773. I'm thinking you'd like it to apply to Python 2 too? If so, please add a note there (the other org asking for this, Instagram, are on Python 3).

I'm not so worried about the proliferating number of command line flags -- at least, I don't think you need to pay for it before you can add new flags.

I think a --cache-hermetically flag would be pretty simple and self-contained, so go ahead and send us a PR for that. Do understand that it would only suppress the "mtime", not the option values contained in the meta files. (Actually it might also suppress the "path" field? Because the hash is the only things that matters.)

I'm not sure what the benefits of your propose --cache-mapping would be. IIUC this would be a dict giving e.g.

{"mypy.build": [".mypy_cache/3.6/mypy/build.meta.json", ".mypy_cache/3.6/mypy/build.data.json"],
 ...}

The contents of such a file (IIUC) should be entirely predictable given the values of --cache-dir and --python-version. But perhaps you care more about the "just believe the cache" part of the flag? Or perhaps that could be folded into --cache-hermetically? Hm, now I'm beginning to doubt my understanding of your proposed workflow.

I'm not sure what --ignore-module-prefix is supposed to do. What exactly do you mean by "module path"? The search path used to find module files, or the pathname of the module file?

@JukkaL
Copy link
Collaborator

JukkaL commented Sep 20, 2017

I'm not so worried about the proliferating number of command line flags -- at least, I don't think you need to pay for it before you can add new flags.

Yeah, this should be up to the core team, since we are the ones who've added (or accepted) all the existing command-line options :-)

I have some vague ideas about improving mypy's startup time and cache deserialization speed, but I'm not yet sure if these are practical and when these will happen. Lazily loading only necessary dependent cache files of indirectly imported modules might be some relatively low-hanging fruit?

@JukkaL
Copy link
Collaborator

JukkaL commented Sep 20, 2017

I said earlier:

Our review bandwidth may be lower than normal until the end of year

This may not actually need to be the case. I'm cautiously optimistic.

@dgoldstein0
Copy link

any plan to actually do any of this work?

btw we've also seen the bazel-python problem around init.py and discussed it a bunch internally. My quick skim of the --namespace-packages proposal makes it sound like it'll solve our problem if we get py2 support for it.

Anyhow I hadn't looked into the internals of mypy at all and this proposal seems to cover that pretty well. From a bazel POV though, what we need is a way to invoke mypy in two (new?) modes:

  • give it a set of python and types files (perhaps .pyi) and generate a new type file output - so we can implement a build step that summarizes the types of a library, for it's dependencies to use
  • give it a set of python and types files of dependencies and typecheck the python - so we can typecheck a library, based on it's dependency libraries types

in addition to all the above stuff about making mypy hermetic and care less about existence of init files.

@dgoldstein0
Copy link

and to clarify - if mypy has those two modes, I'd be willing to take a stab at writing the skylark to invoke mypy within bazel

@gvanrossum
Copy link
Member

gvanrossum commented Nov 2, 2017 via email

@dgoldstein0
Copy link

dgoldstein0 commented Nov 2, 2017 via email

@JukkaL
Copy link
Collaborator

JukkaL commented Nov 2, 2017

give it a set of python and types files (perhaps .pyi) and generate a new type file output - so we can implement a build step that summarizes the types of a library, for it's dependencies to use

Using the existing JSON serialized cache files for this would be both faster and less work to implement, at the cost of not having an easily human-readable representation.

The mypy startup performance was again brought up by @sixolet last week (offline). Here pruning dependencies (either by not loading some dependencies at all or by loading them lazily) would likely be the low-hanging fruit. Other possible optimizations are harder:

  • Speed up importing the mypy implementation and dependencies. Pruning internal mypy dependencies might help a bit, and Python 3.7 might also improve this a little. Larger improvements will likely require somewhat drastic measures -- but these are still within the realm of possibility.
  • Make deserialization of JSON cache files faster. Again, this is non-trivial but conceptually it should be possible to make this much faster.

@sixolet
Copy link
Collaborator Author

sixolet commented Nov 2, 2017 via email

@gvanrossum
Copy link
Member

I am trying to wrap my head around this. I'm still reading up on bazel but I think this is going to be fun!

@kamahen
Copy link
Contributor

kamahen commented Feb 28, 2018

A few thoughts from someone who has done something similar ...

The mypy cache could be specified as one of the outputs in the mypy aspect's action (and input(s) to other rules). Then Bazel would take care of it for you and re-run only what's necessary.

Bazel allows only one rule for creating an output. But your BUILD files probably have some rules that specify the same source files as other rules. There are a few solutions to this, but probably the easiest is to add the rule name to the output file (to make it unique) and have the Skylark create a mapping file that is input to mypy to map between the Bazel label and the actual file. You can see an example of this in pytype/imports_map_loader.py.
Another way would be similar to how Java rules work with .class files.

@gvanrossum - trust me, this quickly becomes not-fun. However, the documentation is much better now than when I did this, and Skylark has far fewer bugs now (but be warned -- it is not Python).

@gvanrossum
Copy link
Member

gvanrossum commented Mar 17, 2018

Here's a brief progress report. Things are not working yet.

I have made a mypy branch that supports a --bazel flag that does various things for Bazel. For now I've simply vendored this branch in the Dropbox repo where I'm experimenting with Bazel. I'm still hacking on the code because I haven't solved all problems yet.

I've also created a dropbox-specific Bazel aspect that runs mypy. I don't want to publish this yet since it's gross and doesn't work yet, but I've learned a lot.

Perhaps the key learning (besides confirming Peter's observation :-) is that I can't seem to create the mypy cache files alongside the source files, because Bazel doesn't just require that outputs live under the current package. There's a cross-compilation feature that puts all outputs in a different directory, and this takes the form <root>/<prefix>/<package>/<target>/<file> where <root> is some horrendously long Bazel directory (fortunately it's also the current directory :-), <prefix> is something indicative of the target platform (in our case it's bazel-out/k8-fastbuild/bin -- possibly this is a Dropbox-specific thing), <package> is the package name (the directory containing the BUILD file relative to the repo/workspace root), <target> is the name attribute of the target in the BUILD file, and <file> is whatever you want. Source files on the other hand live in <root>/<package>/<file>. I presume this architecture maximizes parallelism.

So suppose we have two deps <target1> and <target2> whose mypy cache output files are inputs to <target3>, the cache files are distributed across three directories, and there's no --cache-dir flag we can pass to mypy that changes this. I guess Naomi's original idea of having a cache file mapping is probably the simplest way to handle this. I haven't started writing code for this solution yet (the branch above was a hopeful attempt that just helped me debug the situation).

@dgoldstein0
Copy link

dgoldstein0 commented Mar 17, 2018 via email

@kamahen
Copy link
Contributor

kamahen commented Mar 18, 2018

Bazel has no concept of "folder".
You should always use File.path to translate between the Bazel label and the actual file that your program consumes.

@gvanrossum
Copy link
Member

Well, I solved the cache problem by implementing a --cache-map feature. It seemed simplest to just pass the mapping on the command line rather than writing it to a file and reading it from a file -- Bazel has a "spill file" feature that can do that as needed (and mypy can read it).

Doing this I managed to actually check some targets and their transitive dependencies. (I even found some questionable code.)

I made less headway with the __init__.py file problem -- hacking mypy so that (with the --bazel flag) it doesn't insist on an __init__.py file in every package is not so easy, since this is relied on in multiple places. To make progress in the meantime, I wrote a terrible hack where a wrapper script just creates __init__.py files as needed before it runs mypy.

@gvanrossum
Copy link
Member

@sixolet please have a look at my prototype for the mypy changes at #4759.

I've now solved the __init__.py problem in Skylark by using a macro that calls native.glob(['**/__init__.py']), plus some clever use of depset() in the aspect implementation. (I would show my code but it's too messy.)

@gvanrossum
Copy link
Member

A similar solution for .pyi files also seems to work, even if it's not clear that .pyi files should be implicitly substituted in the srcs attribute of relevant rules, which is what I currently do -- similar as for __init__.py where it appears to be the Bazel convention.

@kamahen
Copy link
Contributor

kamahen commented Mar 21, 2018

@gvanrossum If you want a review of the Bazel code, I can take a look. (Messiness seems to be a hallmark of Skylark code, so don't feel ashamed of it ... even copious documentation and design notes fail to help in my experience (I had my one of my integration designs reviewed by the Bazel and Skylark team and yet everyone missed a horrible flaw in it.)

What do you mean by "similar as for __init__.py where it appears to be a Bazel convention"? Are you referring to Bazel auto-creating a __init__.py file if it doesn't exist in a source directory? If so, that apparently was a very old thing at Google, and far from universally liked. I could probably point you to someone who knows more about the history and pluses/minuses of that "feature".

@gvanrossum
Copy link
Member

Nah, I'm good, I have internal reviewers.

@gvanrossum
Copy link
Member

Alas, it seems my pure Skylark solution for implicit __init__.py files does not work. Consider a package p with a subpackage q (i.e. the latter is p/q in the filesystem, or p.q in the module namespace). Inside p/q Bazel's glob() doesn't find p/__init__.py even if it exists. Hence with my hack the file p/q/foo.py is seen by mypy as representing q.foo rather than p.q.foo and then everything else goes wrong. Thus I think I will have to make the --bazel flag (or another flag) treat p/q/foo.py as representing p.q.foo even if p/__init__.py does not exist (and even if q/__init__.py does exist). I had previously tried to make this work with a quick hack and failed. I will now have to try again. (And risk a merge conflict when PEP 561 lands.)

@kamahen
Copy link
Contributor

kamahen commented Mar 26, 2018

I don't see why the glob() doesn't find p/__init__.py. If you use the pattern **/__init__.py, won't you get a list of all the __init__.py files? (which you can then filter as needed)

@gvanrossum
Copy link
Member

That was exactly the pattern I was using. But glob uses the current package (i.e. p/q) as its starting point. I think it's specified in item 7 here: https://docs.bazel.build/versions/master/be/functions.html#glob

@kamahen
Copy link
Contributor

kamahen commented Mar 26, 2018

I think that ../**/__init__.py isn't allowed in a glob.

I recall having a lot of problems with the details of Python's import model, even when assuming that missing __init__.pys could be hallucinated into existence. (And assuming that there aren't any special import hooks.)
Good luck. ;)

@kamahen
Copy link
Contributor

kamahen commented Mar 26, 2018

You might try shifting the test to the action ... you can get a list of all the files that Bazel thinks are needed, and then perhaps apply some heuristics (because some of the files might be from the source tree and others could be generated in one of the bin or genfiles directories).
You might be able to find the output directories from predefined make-variables

gvanrossum added a commit that referenced this issue Jun 5, 2018
…ap) in support of Bazel (#4759)

Addresses the mypy part for #3912. (The Skylark code I wrote for that is not yet open source.)

The `--bazel` flag make cache files hermetic by using relative paths and setting mtime to zero; effectively the presence of a cache file prevents reading of the source file (though the source file must still exist).

The `--cache-map` flag specifies a mapping from source files to cache files that overrides the usual way of finding the locations for cache files; e.g.
```
mypy --cache-map foo.py foo.meta.json foo.data.json bar.py bar.meta.json bar.data.json -- foo.py bar.py
```

The `--package-root` flag specifies a directory below which all subdirectories are considered packages (this accounts for Bazel's habit to imply empty `__init__.py` files everywhere).
@gvanrossum
Copy link
Member

Whew! #4759 has finally landed. I am hoping to open-source the Skylark code I wrote too, though I cannot support it (it's got some Dropboxisms).

@matthen
Copy link

matthen commented Oct 24, 2018

Any updates on this project? Do you have any example working build rules?

@gvanrossum
Copy link
Member

There are actually plans to release a bunch of Bazel work done at Dropbox (e.g. Go and Python rules and a tool for auto-generating dependencies), and this would be part of that. We don't have a specific timeline for the release, but we're aiming to do it before the end of 2018. Hope that helps! Rest assured that it is in production use at Dropbox.

@thundergolfer
Copy link

but we're aiming to do it before the end of 2018.

I'm looking to get Python + Bazel + MyPy going in my team, so jumping back in to check if there's been updates and generally what the state of this integration is.

@gvanrossum
Copy link
Member

Sorry, we haven't been able to prioritize the public release of the Skylark files that define our Python targets, so alas, there are no public updates. Internally we still use the --bazel flag and related things, so it's not dead, but it's not easy to separate it from internals.

@thundergolfer
Copy link

Understandable. Thanks for the update.

@thundergolfer
Copy link

thundergolfer commented Dec 1, 2019

Jumping back in here to give a heads up that I've started an open-source implementation of this here → https://github.com/thundergolfer/bazel-mypy-integration

It's still very much in the pre-Alpha stage, but I've got the very basics happening. I have a lot to do in terms of fixing bugs, smoothing sharp edges, and giving it a proper exercising in a real-world Bazel monorepo, but it's a start!

Feedback and advice appreciated 🙏

@JukkaL
Copy link
Collaborator

JukkaL commented Jan 29, 2020

Because of shifts in priorities, it seems unlikely that we'll be able to release our Bazel integration any time soon. Keeping this open as a low-priority issue since there is a lot of potentially useful discussion in case somebody wants to work on this.

@kamahen
Copy link
Contributor

kamahen commented Jan 29, 2020

FWIW, I've come up with a way of running source analysis in parallel and caching the results, that gets a good speed-up and doesn't require a tool like Bazel. If there's any interest in this, I can try to write it up -- the code is straightforward but not well documented (I have some documentation if anyone wants to read it; but it needs editing by a good tech writer).

@JukkaL
Copy link
Collaborator

JukkaL commented Jan 29, 2020

If the main motivation is performance, remote caching together with mypy-daemon will often give much faster mypy runs than Bazel integration (https://mypy.readthedocs.io/en/stable/additional_features.html#using-a-remote-cache-to-speed-up-mypy-runs).

@JukkaL
Copy link
Collaborator

JukkaL commented Jan 29, 2020

(The level of performance obviously depends on various factors, and in some cases Bazel can be a performance win.)

@thundergolfer
Copy link

Keeping this open as a low-priority issue since there is a lot of potentially useful discussion in case somebody wants to work on this.

I have been working on it, and we've been using it at work for a few weeks now with no issues. https://github.com/thundergolfer/bazel-mypy-integration

If you see any problems with my implementation please do let me know 🙂.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants