Ideas for improving typeshed testing and tooling #8774

Open
AlexWaygood opened this issue Sep 20, 2022 · 13 comments
Labels
  • help wanted -- An actionable problem of low to medium complexity where a PR would be very welcome
  • project: infrastructure -- typeshed build, test, documentation, or distribution related


@AlexWaygood
Member

AlexWaygood commented Sep 20, 2022

A few days ago, I had a great conversation with @kkirsche offline about some ways we can improve our communications with contributors. One of the things that came up is that it would be good to have a clear outline of specific areas related to tooling and testing where PRs would be welcomed. With that in mind, here are some areas where I think our testing and tooling could be improved, and where I would welcome contributions.

For any of these tools/ideas: if a PR would be enough work that it would feel wasteful if it were rejected, please make sure to open an issue discussing the specific change first. If an issue gets a few thumbs-ups, it's a pretty good sign we like the idea, and you should feel free to proceed :)

(Other maintainers: please feel free to edit this!)

Existing tools

Improving existing tools is the "low-hanging fruit": it's always going to be easier to build on existing infrastructure than to build something entirely new from scratch.

For each of these tools, I've tried to list a "primary maintainer": this is the person who's most likely to make decisions on a certain tool. That doesn't mean that this person is the only person who's ever going to chime in on a certain tool. For example, @hauntsaninja is the person with the most expertise on stubtest, and people are probably always going to defer to him with regards to that script. But that doesn't mean that other people won't also be interested in reviewing -- @sobolevn and I also have experience with writing stubtest improvements, and may well chime in on a stubtest PR.

Stubtest

Primary maintainer: @hauntsaninja

Stubtest compares what exists at runtime to what we've got in the stub, and checks for inconsistencies between the two. It's one of the most useful tools in our CI, and also really easy to contribute to. If you make a PR for a stubtest improvement, it'll probably be reviewed by some combination of @hauntsaninja, @JelleZijlstra, @sobolevn and/or me within a few days :)
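
As a purely hypothetical illustration (none of this comes from a real typeshed stub), here is the kind of drift stubtest is designed to catch: a runtime signature and a stub signature that no longer agree.

# Hypothetical runtime code (example.py):
def connect(host, port=8080, *, timeout=None):
    ...

# Hypothetical stub (example.pyi). Running "python -m mypy.stubtest example"
# would flag that "port" has a different default at runtime, and that
# "retries" does not exist in the runtime signature at all:
def connect(host: str, port: int = 80, *, retries: int = 3) -> None: ...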

Stubtest improvements generally fall into two categories:

  • New checks for catching where the stubs are inconsistent with the runtime.
  • PRs to reduce false-positive errors. Reducing the number of false positives is very useful, as it makes it easier for us to set ignore_missing_stub to false for more third-party packages.

What to improve:

  • There's a bunch of TODO comments in the stubtest codebase. Any of those would be good candidates for PRs.
  • With regards to adding new checks: think creatively! Stubtest doesn't have a generalised way of comparing the runtime to the stub; it's just a series of "special cases" where we've "hardcoded" in checks for inconsistencies between the stub and the runtime. If you notice something stubtest is missing in CI, that's probably a good candidate for a stubtest PR. If you're unsure whether the idea is a good one, file a mypy issue first -- if it gets a few thumbs-up, that's a pretty good sign we'd welcome it.

Stubgen

Primary maintainer: ?

Stubgen is a tool for autogenerating stub files from source code. It's not a code base I'm particularly familiar with, and I'm not sure it's as actively maintained as some of the other tools on this list -- but here are a few ways I think it could be improved:

  • Make generated stubs comply with flake8-pyi "out of the box". Our "typeshed style" has evolved in recent months and years, but stubgen has been a bit slow to catch up. It would be great if the stubs stubgen produced were automatically compatible with flake8-pyi, so that we didn't have to do so many manual fixes.
  • Fewer re-exports (maybe we can already do this with the --export-less option? Maybe we just need to update scripts/create_baseline_stubs.py, typeshed's wrapper script for running stubgen?) By default, stubgen almost always assumes that an import from somewhere else in the package should be treated as an explicit re-export. Unless the file is an __init__.py file, however, I think stubgen should actually generally assume the opposite. This comes up a lot in typeshed code reviews; fixing this would save a lot of time. (Fixed by Use stubgen's --export-less option in create_baseline_stubs.py #8843.)
  • Add "stubgen-but-with-inference" (@hauntsaninja's idea in Evaluate ways to determine accuracy and reduce incomplete type stubs #8481 (comment)). Mypy has a wealth of information that it could use to make sophisticated inference about good signatures for generated stubs. But even adding super-dumb rules like "__bool__ methods are generally going to return bool" are going to be correct 99% of the time, and will save time for typeshed reviewers.

flake8-pyi

Primary maintainer: @AlexWaygood

flake8-pyi is a flake8 plugin for checking stub files that works by analysing the Abstract Syntax Tree (AST) of the stub. It's a tool that's a lot more "dumb" than stubtest, since it doesn't have the full power of a type checker behind it, and doesn't know anything about the runtime either. There's still a lot you can do with simple AST analysis, however, and it's another code base that's easy to contribute to if you're familiar with the basics of ASTs and the visitor pattern.
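
To make the "simple AST analysis" point concrete, here is a toy check in the same spirit (a sketch only, not flake8-pyi's actual implementation): a visitor that flags function bodies in a stub containing anything other than a bare "...".

import ast
import sys

class StubBodyVisitor(ast.NodeVisitor):
    """Toy flake8-pyi-style check: in a .pyi file, function bodies should be just '...'."""

    def __init__(self) -> None:
        self.errors: list[tuple[int, str]] = []

    def visit_FunctionDef(self, node: ast.FunctionDef) -> None:
        body_is_just_ellipsis = (
            len(node.body) == 1
            and isinstance(node.body[0], ast.Expr)
            and isinstance(node.body[0].value, ast.Constant)
            and node.body[0].value.value is ...
        )
        if not body_is_just_ellipsis:
            self.errors.append((node.lineno, f"body of {node.name!r} should be just '...'"))
        self.generic_visit(node)

if __name__ == "__main__":
    # Usage: python toy_check.py some_stub.pyi
    path = sys.argv[1]
    visitor = StubBodyVisitor()
    visitor.visit(ast.parse(open(path, encoding="utf-8").read()))
    for lineno, message in visitor.errors:
        print(f"{path}:{lineno}: {message}")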

Lots of recent checks we've introduced have been fairly style-focused, but we've also added some bugbear-esque checks in the past that are aimed at catching common bugs.

We should continue to think creatively about whether there are other common bugs that we can catch using a simple AST analysis.

flake8-pyi is maintained by the typeshed team, so if you have an idea for a new check, just file an issue! If it gets a few thumbs-up, it's probably a good idea, and you should feel free to make a PR. If we're sceptical, we'll probably say so :)

There's also a bunch of open issues with ideas for new checks. If one of them has a few thumbs-up (or is even just older than a week or two and hasn't been closed yet), it's probably an idea that we're receptive to!

mypy_primer

Primary maintainer: @hauntsaninja

mypy_primer checks out copies of typeshed from before and after a PR. It then uses the "old typeshed" and "new typeshed" to run mypy against millions of lines of open-source code, and reports the difference between the two runs.

Here are some ways I think we could make mypy_primer even better:

  • Add more projects to mypy_primer (hauntsaninja/mypy_primer#42). This one's a no-brainer.
  • Work on the ideas listed at https://github.com/hauntsaninja/mypy_primer#contributing. In particular, having a way to bisect between typeshed commits would be pretty nice.
  • Figure out ways to analyse "mypy_primer coverage":
    • For a given stubs package in typeshed, how many packages included in mypy_primer depend on that stubs package? (A rough sketch of the counting step follows this list.)
    • If we figure out a way of answering the above question: how best to communicate that information? Should we have a website somewhere that auto-updates information about mypy_primer coverage? A markdown file in typeshed that a bot auto-files PRs to update, listing mypy_primer coverage information for each stubs package? Etc. etc.
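
For that first question, the counting step could be as small as the sketch below. This is heavily hedged: the only typeshed assumption is that each directory under stubs/ is named after the upstream distribution, and actually obtaining each mypy_primer project's runtime requirements is the hard, unsolved part -- the project_requirements mapping here is a hypothetical stand-in.

from collections import Counter
from pathlib import Path

def stub_distributions(typeshed_root: Path) -> set[str]:
    # Assumption: each directory under stubs/ is named after the upstream distribution.
    return {p.name for p in (typeshed_root / "stubs").iterdir() if p.is_dir()}

def stub_coverage(typeshed_root: Path, project_requirements: dict[str, set[str]]) -> Counter:
    """Count, for each stub distribution, how many mypy_primer projects depend on it."""
    stubs = stub_distributions(typeshed_root)
    counts: Counter = Counter()
    for requirements in project_requirements.values():
        counts.update(requirements & stubs)
    return counts

if __name__ == "__main__":
    # Hypothetical stand-in: project name -> its runtime dependencies.
    fake_requirements = {
        "some-web-app": {"requests", "redis"},
        "some-cli-tool": {"requests", "click"},
    }
    # Assumes the script is run from a typeshed checkout.
    for dist, n in stub_coverage(Path("."), fake_requirements).most_common():
        print(f"{dist}: used by {n} mypy_primer project(s)")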

Test cases

Primary maintainer: @AlexWaygood

Our test_cases directory (aimed at providing regression tests for places where stubs have caused problems in the past) is new, and still somewhat experimental. The biggest outstanding issue I see was pointed out by @Akuli in #8066 (comment), which is that the test cases are liable to slip out of date with the upstream source code in much the same way that stubs are. How can we ensure that that doesn't happen? Can we add some kind of verification that ensures that the places where we're asserting type checkers shouldn't emit errors do actually pass at runtime?
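
One hedged sketch of what such a verification step could look like, assuming the test cases are plain scripts whose assertions are also meaningful at runtime, that any third-party packages they exercise are installed, and that the directory really is called test_cases:

import subprocess
import sys
from pathlib import Path

def run_test_cases_at_runtime(test_case_dir: Path = Path("test_cases")) -> int:
    """Execute each test-case file with the current interpreter and report failures."""
    failures = 0
    for path in sorted(test_case_dir.rglob("*.py")):
        result = subprocess.run([sys.executable, str(path)], capture_output=True, text=True)
        if result.returncode != 0:
            failures += 1
            print(f"{path} failed at runtime:\n{result.stderr}")
    return failures

if __name__ == "__main__":
    sys.exit(1 if run_test_cases_at_runtime() else 0)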

Stubsabot

Primary maintainer: @hauntsaninja

Stubsabot -- a bot which automatically suggests bumping pinned versions for third-party stubs -- is another new addition to our tooling. Here are some ideas for how we could make it even better:

The "stubtest failed" bot

Primary maintainer: @hauntsaninja

Note: future status of this tool is TBD. typeshed-internal/stub_uploader#60 means that we can now pin stubs to more specific versions, which should hopefully reduce the number of daily stubtest failures we get. Maybe we'll be able to do away with the check altogether...

Stubtest is run on the full suite of third-party stubs every night, and a bot automatically opens an issue if stubtest failed on any of them (e.g. #8745). Things that would be great to see improved about this workflow are:

  • Could the bot automatically tell us which stubs stubtest failed on?
  • Could the bot automatically link to the precise stubtest run that failed, instead of us having to click through and navigate to the stubtest run that's red?
  • Could the bot automatically paste the results of the failing stubtest run into the issue description?
  • Can we stop the bot from creating a new issue if there's already a "stubtest failed" issue open?
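
For the last question in that list, a rough sketch of the kind of pre-check the nightly workflow could run before filing a new issue is below. The repository name and the exact issue-title string are assumptions about how the existing bot names its issues.

import json
import urllib.parse
import urllib.request

def stubtest_issue_already_open(repo: str = "python/typeshed") -> bool:
    """Return True if an open issue with "stubtest failed" in its title already exists."""
    query = urllib.parse.quote(f'repo:{repo} is:issue is:open in:title "stubtest failed"')
    url = f"https://api.github.com/search/issues?q={query}"
    with urllib.request.urlopen(url) as response:
        data = json.load(response)
    return data["total_count"] > 0

if __name__ == "__main__":
    if stubtest_issue_already_open():
        print("An open 'stubtest failed' issue already exists; skip creating a new one.")
    else:
        print("No open issue found; the bot should go ahead and file one.")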

Miscellany

These are both pretty minor nits, but it would be great to:

  • Add a test to make sure that the exclude-from-stricter-checks entries in pyrightconfig.stricter.json stay alphabetised, so I don't have to keep annoying contributors who break the alphabetisation ;) tests/check_consistent.py would be a good place for such a check; a rough sketch follows this list. (Note that most JSON-parsing tools will refuse to parse pyrightconfig.stricter.json, because it's strictly speaking invalid JSON.)
  • Add a test that ensures that nobody writes if sys.version_info >= (3, 7) guards anywhere. (3.7 is the minimum supported typeshed version, so such checks are redundant.) tests/check_new_syntax.py might be a good place for such a check.
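
Here is a rough sketch of that first check. The "exclude" key name and the regex-based parsing are assumptions: because the file isn't strictly valid JSON, the sketch pulls the entries out textually rather than using json.loads, and case-insensitive ordering is an arbitrary choice.

import re
from pathlib import Path

def check_exclude_list_sorted(config_path: Path = Path("pyrightconfig.stricter.json")) -> None:
    """Assert that the entries of the "exclude" array are in alphabetical order."""
    text = config_path.read_text(encoding="utf-8")
    match = re.search(r'"exclude"\s*:\s*\[(.*?)\]', text, flags=re.DOTALL)
    assert match is not None, "could not find an 'exclude' array"
    entries = re.findall(r'"([^"]+)"', match.group(1))
    assert entries == sorted(entries, key=str.lower), (
        "exclude entries in pyrightconfig.stricter.json are not alphabetised"
    )

if __name__ == "__main__":
    check_exclude_list_sorted()
    print("exclude list is alphabetised")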

Ideas for new tools

These ideas are necessarily going to be more work than the ideas above, as most of them involve writing new scripts from scratch, which would create its own maintenance burden. But here's some pie-in-the-sky ideas for brand new tests/tools we could add.

Note that new tools don't need to be part of typeshed or to run in CI to make a difference. For instance, for a long time, we didn't run stubtest in CI, but during that time we used stubtest to fix many, many errors in typeshed. Keeping a tool separate is a good way to avoid disagreement over things like the balance of false positives vs false negatives, to iterate quickly on the tool, and to prove the tool's value.

  • More automated fixes:
    • Are there any other fast pre-commit hooks already out there that would be useful to add to our CI?
    • I love flake8-pyi, but I suspect it's more annoying for contributors than something they actively welcome. It's not something I'm planning on working on myself, because of the amount of work it would take to get to a properly tested tool -- but a bunch of the checks in flake8-pyi feel like they could probably be turned into automated fixes in a pre-commit hook.
    • Something I see very often is contributors trying to import TypedDict from typing (for example) -- which doesn't work, since it needs to be imported from typing_extensions in order to be compatible with Python 3.7. Is there an automated way of fixing that? (Can isort already do that, if we have the right settings?) A toy sketch of a dedicated fixer follows this list.
  • Use the CPython test suite. We could set up a mypy_primer-esque bot that runs mypy (using typeshed's stubs) against the CPython test suite before and after a PR, and reports on any differences in mypy's results. I have no idea what the signal-to-noise ratio here would be, and I think it might be a fair amount of work to get this to a script we could use in CI -- but maybe worth exploring?
  • Pyright has a sophisticated type inference engine that can be used to infer types in unannotated code. Can we use that somehow to compare pyright's inferences against our manual stubs? Could we do something similar with mypy using the --check-untyped-defs option?
  • Maybe some things in the space of making it easier to upstream stubs or validate type hints against the original code? Rough thoughts at Evaluate ways to determine accuracy and reduce incomplete type stubs #8481 (comment)
  • Any other ideas?
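
Regarding the TypedDict point above, here is a toy sketch of a pre-commit-style fixer. It only handles the single simplest form of the import; a real tool would need to cope with multi-name imports, aliases, and the other symbols that have to come from typing_extensions.

import re
import sys
from pathlib import Path

PATTERN = re.compile(r"^from typing import TypedDict$", flags=re.MULTILINE)

def fix_file(path: Path) -> bool:
    """Rewrite the import in place; return True if the file was changed."""
    original = path.read_text(encoding="utf-8")
    fixed = PATTERN.sub("from typing_extensions import TypedDict", original)
    if fixed != original:
        path.write_text(fixed, encoding="utf-8")
        return True
    return False

if __name__ == "__main__":
    changed = [path for path in map(Path, sys.argv[1:]) if fix_file(path)]
    for path in changed:
        print(f"fixed {path}")
    # Pre-commit hooks conventionally exit non-zero when they modify files.
    sys.exit(1 if changed else 0)
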
@AlexWaygood added the "project: infrastructure" and "help wanted" labels Sep 20, 2022
@AlexWaygood pinned this issue Sep 20, 2022
@sobolevn
Member

Mypy improvements are also welcome! Anyone can ping me to get help: suggestions / reviews / etc.

@srittau
Collaborator

srittau commented Nov 1, 2022

Unpinned temporarily to make space for #8956.

@srittau unpinned this issue Nov 1, 2022
@jenstroeger
Contributor

Thank you @AlexWaygood for the list of tools; I’ll have to take a closer look at some of them. I was actually about to open a new issue, but after reading through your writeup I decided to comment on this one instead.

Idea for a new tool

My suggestion/question is related to the “Allow package references as version specifiers” thread over at Python’s discussion board.

I noticed that a types package here doesn’t necessarily have any version-enforced relationship with its code package. For example, it is perfectly fine to have this installed:

protobuf                      3.20.3
types-protobuf                3.19.6

This looks problematic, and a checker that warns about such version mismatches would be useful.
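
Such a checker could start as small as the sketch below. Two hedges: the assumption that stripping the types- prefix gives the runtime distribution's name holds for most but not all stub packages, and the "major.minor should match" rule is only a guess at a sensible policy.

from importlib.metadata import PackageNotFoundError, distributions, version

def check_stub_versions() -> None:
    """Warn about installed types-* packages whose version drifts from the runtime package."""
    for dist in distributions():
        name = dist.metadata["Name"]
        if not name or not name.startswith("types-"):
            continue
        runtime_name = name[len("types-"):]  # assumption: types-foo covers foo
        try:
            runtime_version = version(runtime_name)
        except PackageNotFoundError:
            continue  # stubs installed without the runtime package; nothing to compare
        stub_prefix = ".".join(dist.version.split(".")[:2])
        runtime_prefix = ".".join(runtime_version.split(".")[:2])
        if stub_prefix != runtime_prefix:
            print(f"warning: {name} {dist.version} may not match {runtime_name} {runtime_version}")

if __name__ == "__main__":
    check_stub_versions()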

A proper solution, methinks, would be to create a dependency between the types-foo package and its related foo package, which is discussed here but might be better suited as an issue on this Typeshed repo?

@hauntsaninja
Collaborator

Thanks for the suggestion! Agreed that this has been a bit murky (although the murkiness has often worked out favourably for users). To that end, we've recently done a couple of things to make versioning between types-* and upstream work better: 1) the new versioning scheme in stub_uploader, which should make it a little clearer which upstream version a given stubs version corresponds to; 2) stubsabot, which helps ensure we keep our stubs up to date.

I would be okay adding upstreams as an extra to types-* packages, but would be strongly opposed to adding a default dependency, especially one that had constraints.

@jenstroeger
Contributor

Thanks @hauntsaninja, glad to know that this issue is being at least thought about. Perhaps I should link this comment to the Python discussion or would you like to leave a note there? Just to make sure we cross-pollinate on this issue.

I would be okay adding upstreams as an extra to types-* packages,

Using what name? Then users would have to specify, e.g.

dependencies = [
    "protobuf ==3.20.3",
    "types-protobuf[protobuf] ==3.20.3",
]

or some such?

but would be strongly opposed to adding a default dependency, especially one that had constraints.

Mind to elaborate on the reason?

@AlexWaygood pinned this issue Nov 25, 2022
@AlexWaygood
Member Author

AlexWaygood commented Nov 25, 2022

Mind to elaborate on the reason?

Wherever possible, stubs packages should be extremely minimal packages that are easy to install, have minimal dependencies, contain no executable code, contain minimal security risks, and are easy to bootstrap.

We've recently started to relax this policy, with moves towards allowing non-types dependencies for stubs in some situations (this work is not yet complete). But it's still a principle we should stick to wherever possible.

@Avasam
Collaborator

Avasam commented Dec 3, 2022

Here's an idea, because it happened to me again:
If stubtest doesn't know the name of a parameter (kwarg, C module, ...), and the name in the stub starts with a single _ and wouldn't shadow a keyword/import if the underscore were stripped, then it's likely a mistake and the dev meant to make the parameter positional-only.

Or I can wait for mypy to support the / syntax for positional-only arguments. Which I assume may happen with Python 3.7 EOL next year.

@AlexWaygood
Member Author

Or I can wait for mypy to support the / syntax for positional-only arguments. Which I assume may happen with Python 3.7 EOL next year.

Yeah, I'm tempted to say it may not be worth it to write a check for this, given that 3.7 EOL is on the horizon :) But if the implementation isn't too difficult, it might be interesting -- writing the check might even reveal a few existing bugs in typeshed!

(Note: I'm not sure exactly when we'll be dropping support for Python 3.7. We haven't really discussed it yet. We waited a few months after Python 3.6 had reached EOL status before dropping support for 3.6, due to the fact that it was still quite widely used when it reached EOL status, and we wanted to give people more time to move over.)

@Akuli
Collaborator

Akuli commented Dec 6, 2022

There aren't that many parameter annotations that start with a single underscore. We could check the existing ones manually.

$ git grep '\(, \|(\)_[^_][A-Za-z0-9_]*:' | wc -l
212

@danieleades
Contributor

Any appetite for looking at using Ruff for linting?
I'm not actually sure what Ruff's support for .pyi files is like.

@sobolevn
Member

I like ruff (you can find my name among contributors), but I think that flake8 serves us quite well:

  • Speed is good enough
  • The plugins we use are flake8-specific
  • Stability

As far as I know, there are no unique checks in ruff that would help us write better type annotations.

@AlexWaygood
Member Author

AlexWaygood commented Apr 29, 2023

Ruff seems cool in many ways, but we definitely can't switch to ruff at typeshed.

Flake8-pyi is by far the most important flake8 plugin for typeshed, and it's an important part of our CI. Ruff has only implemented a few of the rules from flake8-pyi, and its versions of those rules have several not-so-great behaviour differences from flake8-pyi's implementations.

Also, flake8-pyi is currently maintained by the same team that maintains typeshed, which makes it very easy for us to make tweaks to flake8-pyi as and when typeshed requires it. Ruff seems very open to PRs, which is great, but I don't know Rust, so I wouldn't be able to make PRs fixing bugs in their logic for the flake8-pyi rules they've implemented.

@srittau unpinned this issue Jul 13, 2023
@srittau
Collaborator

srittau commented Jul 13, 2023

Since the discussion has mostly died down, I've unpinned this.
