Static and run-time tests for stubs #917
Conversation
Currently we check only for syntax errors in stub files. The idea is to add test data for static analyzers similar to how it's done in DefinitelyTyped for TypeScript.
@matthiaskramm Pytype is not happy with defining a …. Other tests pass, although I had to put several suppressing comments on the code where pip installs a third-party library. There should be a cleaner way to do it.
CONTRIBUTING.md
Outdated
@@ -14,6 +14,9 @@ are important to the project's success.
* [Contact us](#discussion) before starting significant work.
* IMPORTANT: For new libraries, [get permission from the library owner first](#adding-a-new-library).
* Create your stubs [conforming to the coding style](#stub-file-coding-style).
* Add tests for your stubs to `test_data`. These tests should are
Remove "should"
As discussed earlier, this should be optional. For example: "Optionally, add some tests for your stubs ...".
Also, I'd like to see more documentation about this, as it may be unclear how these tests work. I'd suggest creating either a section in CONTRIBUTING.md or a separate document about how to write tests. Here are some ideas about what might be useful to discuss:
- Give an example test and explain the kinds of things it can catch.
- We don't expect to have tests for everything defined in a stub, but certain things like overloaded functions, generic functions and non-trivial union types may be useful to test.
- If you fix a bug in a stub, a test will help make sure that the bug will remain fixed. It also makes it easier to review the correctness of your fix.
Thanks! I've created a separate section in CONTRIBUTING.md taking your comments into account.
@@ -0,0 +1,17 @@
import pip  # type: ignore

pip.main('install six==1.10.0'.split())
It would be nice if the runtime_test.py driver could somehow take care of dependencies, maybe through some sort of magical comment that it scans for.
I wonder if it would be better to have the requirements in a text file, similar to how other dependencies are dealt with (e.g. requirements-tests-py3.txt)?
Also, maybe we should require that we explicitly define the versions of all packages being installed, including those that can be automatically installed by pip through dependencies. Otherwise, an updated package on PyPI may break our tests.
I actually wonder if having the dependencies explicitly installed here is a bit better, because it means that you can run a single test file without having to install all the dependencies for all other test files. Also it is less likely that dependencies will linger when they are no longer needed.
OTOH if a dependency is needed by two different tests, and the second one forgets to install it, this may still go unnoticed if the tests are always run in the same order (and uninstalling everything when the test is done is tricky). This will then cause unexpected test flakes.
I've extracted the code for installing requirements into a pytest fixture that automatically installs *_requirements.txt before running tests from *_test.py. You can run a single test separately and pytest will ensure requirements are installed.
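Below is a minimal sketch of how such a fixture could look, assuming the *_requirements.txt / *_test.py naming convention described above; it is illustrative rather than the actual code from this PR.

```python
import os
import subprocess
import sys

import pytest


@pytest.fixture(autouse=True)
def install_requirements(request):
    # Derive foo_requirements.txt from the path of the running foo_test.py
    # (naming convention taken from the comment above).
    test_path = str(request.fspath)
    requirements = test_path.replace('_test.py', '_requirements.txt')
    if requirements != test_path and os.path.exists(requirements):
        # Install the pinned requirements before the tests in this file run.
        subprocess.check_call(
            [sys.executable, '-m', 'pip', 'install', '-r', requirements])
    yield
```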
version_dirs = [
    '2and3',
    '3',
    '%d.%d' % (sys.version_info[0], sys.version_info[1]),
This should actually be all version subdirectories up to and including sys.version_info[1]: when you're running 3.4, for example, type checkers will look at the subdirectories for 3.3 and 3.4 but not 3.5.
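As an illustration of the rule described in this comment (not the driver's actual code), the applicable directories could be computed roughly like this, assuming a typeshed-style layout with '2and3', major-version, and minor-version directories:

```python
import os
import sys


def applicable_version_dirs(root):
    # Include '2and3', the major-version directory, and every minor-version
    # directory at or below the running interpreter's version.
    current = sys.version_info[:2]
    dirs = []
    for name in os.listdir(root):
        if name in ('2and3', str(current[0])):
            dirs.append(name)
        elif name.count('.') == 1 and name.replace('.', '', 1).isdigit():
            major, minor = (int(part) for part in name.split('.'))
            if major == current[0] and (major, minor) <= current:
                dirs.append(name)
    return dirs
```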
@JelleZijlstra I believe it doesn't apply to run-time checks since we run them on a particular version of the interpreter and it may not have some module from older Python versions.
But in typeshed, directories like 3.4 also apply to newer versions. For example, asyncio is in stdlib/3.4 because it was introduced in 3.4, but the stubs are also used for 3.5 and 3.6. I think tests should behave the same.
If there was some backwards incompatible change in the API of a module in older Python 3, the tests for the current version may fail. If it is absolutely necessary to run tests for some older Python 3.N module, we will set up another Travis environment with Python 3.N.
This may be true, but it's certainly confusing, since you are using the same directory naming scheme as for the stubs, and for stubs the rule that Jelle explains is essential (otherwise e.g. asyncio would not be found for Python 3.5 or higher).
It also points, again, to yet another reason why I don't believe these runtime tests are going to help much (they're not checking that the sys.version_info checks in the stubs are correct).
Point = namedtuple('Point', 'x y')
p = Point(1, 2)

assert p == Point(1, 2)
Note that mypy currently can't check that the return type is correct here. For example, if we declared `def f() -> int: ...` and had a test like `assert f() == 'x'`, mypy wouldn't flag it as an error (see python/typing#378 and python/mypy#1271). Would pytype flag this as an error?
Even if we can't check these yet, the assertions are useful for code reviewers, as they make it easier to manually check that a return type annotation looks reasonable.
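A self-contained sketch of the limitation being discussed (the function name and the deliberately wrong annotation are made up for illustration):

```python
def f() -> int:  # pretend this annotation comes from a stub and is wrong
    return 'x'   # type: ignore[return-value]


# The assertion passes at run time because the real value is 'x', and mypy
# does not report the int == str comparison either (python/typing#378,
# python/mypy#1271), so a test like this cannot catch the wrong return type.
assert f() == 'x'
```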
I'm still very skeptical. This runs the risk of slowing down test runs dramatically (once we scale up) and I think most of the issues that have been fixed wouldn't have been found by these types of tests. Also the tests look like they would be much larger in volume than the stubs to which they belong.
Among other things, I don't quite see why …. Take ….
Alas, actual stdlib code (tests or otherwise) is usually too dynamic to pass with mypy. I am okay with this proposal as long as it's not mandatory and doesn't slow down the test runs much. It won't be comprehensive, but it might be useful to catch regressions in particularly tricky signatures (e.g. …).
Yes, we could keep the run-time tests minimal and use them for checking for false errors and, optionally, verifying some contributions. They are good for making sure that a stub covers at least some basic API, and they may be helpful for reviewers, serving as proof that the stub author understood the API correctly. Run-time tests are even better for preventing regressions after some false errors were found and fixed. And there will be regressions, since all our code analysis engines differ, we are not 100% PEP 484 compliant, and fixing one bit in stubs like …
@matthiaskramm I keep facing various type checking issues with pytype. I could disable it for the run-time + static test_data files, or I could file new issues in your tracker about the problems I discover. What do you prefer?
It's true, tests are heavy on introspection. I did a few experiments using pytype. Options:
@matthiaskramm I would prefer to start with a few hand-written tests specifically designed to pass both run-time and static tests. But even in the two tests I've written for demonstration purposes in this PR, pytype reports a false error about a non-top-level import. I'd like to proceed without running those tests under pytype in order not to hold back this PR.
I'm happy to help you fix the pytype errors. However, I feel we need an overall strategy about how we want to do testing in the future. It's not clear to me that this PR is a step in the right direction.
The way I would like to proceed is to accept pull requests with run-time + static tests even though they fail on pytype. I assume you can fix them for pytype later on. If we were blocked by any of Mypy, pytype, or PyCharm failing to understand any particular test case, we would wait too long for a PR to get accepted. Maybe Mypy is a bit special since it's a de facto reference implementation of PEP 484, but as for pytype, PyCharm, and other checkers that support PEP 484, I see no reason to wait until a test passes for all of them (assuming tests are PEP 484 compliant and pass at run time). The approach for testing stubs proposed in this PR might not be ideal, but it's already working by finding bugs in either stubs (see the examples in this PR) or the checkers (see the failing tests for pytype). I would like @JukkaL and @gvanrossum to have another look at the updated PR. It includes documentation on how to create optional run-time + static tests, as well as test examples that handle requirements for third-party libraries.
I don't want to accept PRs that have failing tests, even if we know why they fail. But perhaps you can update the pytype blacklist to skip checking certain files.
@gvanrossum I've re-run the tests after the changes in pytype, and they are now passing.
tests/pytype_test.py
Outdated
@@ -70,7 +70,7 @@ def pytype_test(args):
        print("Cannot run pytd. Did you install pytype?")
        return 0, 0

-    wanted = re.compile(r"stdlib/(2|2\.7|2and3)/.*\.pyi$")
+    wanted = re.compile(r"stdlib/(2|2\.7|2and3)/.*\.pyi?$")
What's this change for?
I changed it in order to run pytype on test files, which are regular Python files. Since I've disabled running pytype for them, I could revert this change as well.
The existing `pytype_test.py` checks `pyi` files using the `pytd` tool. If you want to run against `py` files, you'll need to run the `pytype` executable.
Sorry. I should have caught that earlier.
If you do run pytype, you may have to run it as `TYPESHED_HOME=/some/path/ pytype` in order to point it at the typeshed version you're testing.
So if we go forward with this PR, we're moving in the direction of …
Is that what we want? (We're going to get quite a bit of duplication, but we're also getting a brand new test suite for the standard library that passes type checking.)
I propose to start small and have run-time + static tests only for some fixes of false errors sent to typeshed as PRs. Converting the whole CPython test suite to pass type checking in Mypy looks like a big undertaking.
I know that what you're doing right now is just a lot of small tests, for specific files, to verify fixes. But these tests are going to grow, in number and size. What will this look like in three years? I suspect we'll have something that looks very similar to the Python test suite. Maybe that's what we want. But the time to make that decision is right now.
I see no problem in having some tests for fixed false errors. If the test suite grows over time and its size becomes close to the CPython test suite, we will be able to decide then whether to continue using it or to rewrite the CPython tests to become type-check compatible. Also, there are Typeshed stubs for third-party libraries whose authors don't want type hints in their code, so I guess they won't be happy about adjusting their test suites to pass type checks. Having Typeshed-specific tests for them will be helpful.
That's the situation I'd like to prevent: we kick the can down the road today, and then at some point in the future, we'll be sitting on two huge test suites. We then decide we want to merge them again. What if we can't? What if merging them would require going through every single file, every single test, in both versions of the file, manually trying to get both the typechecking and runtime tests to work, in tests that were only constructed for one or the other? We're talking about a massive undertaking. I'd like to see a future direction that sounds more realistic.
After reading @matthiaskramm's comments, I am once again changing my opinion to extreme skepticism (in spite of @JukkaL's endorsement). Many contributors will assume that adding tests is better than not adding tests, so despite the language about tests being optional, we will end up with many tests. It's not realistic to expect those to ever be merged with the stdlib test suite, so it will grow into a second large redundant test suite. And yet these tests cannot ever be comprehensive, except in the case of very simple APIs.

I still haven't understood why @vlasovskikh wants runtime tests. What problem are you addressing here? I would be a little less opposed to having runtime tests just for third-party modules, as those might act as the "canary in the coal mine" when a third-party API changes incompatibly (they don't all have the same commitment to backwards compatibility as the stdlib). But for third-party packages I think the problem of versioning those packages is more important than the problem of testing compliance.

I think we may have to accept "stalemate" here (no rough consensus reachable), which means we will have to stick with the status quo.
I'm not insanely familiar with this project, but writing runtime test cases where we actually care about & test the values seems wrong for this project, because of all of the reasoning above -- it seems to duplicate a lot of work. For this project, I'd be in favor of some kind of testing where we only check the types of things, not their values; but we could also write test cases which assert that a given expression or block does not type-check (since the point of stub files is to prevent bugs, and if we can only check positive examples then you can always make a test pass by changing a type to Any). Not sure what this would look like yet, though.
@gvanrossum The reasons for having run-time + static tests in the Typeshed repo:
I see no problem in having more tests in addition to the CPython tests. Tests rarely need maintenance, yet they help ensure compatibility, automate the checking of pull requests, and fight regressions. If I still haven't convinced you, I'll close this PR and start using those tests only for PyCharm for a while.
@vlasovskikh It would be interesting to see what a realistic test suite would look like. If you start using stub tests in PyCharm development, perhaps you could document all the stub issues they are finding, and use those to argue for having tests in typeshed (assuming they help catch issues, of course). I've done a little work on an alternative approach. The idea is to do automated testing of stubs by comparing the contents of stubs to what's defined at runtime in a module. For example, if a stub defines a function …
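A rough sketch of what such a comparison could look like (not Jukka's actual implementation; the paths and module names are only examples):

```python
import ast
import importlib


def names_missing_at_runtime(stub_path, module_name):
    # Collect the top-level names the stub defines.
    with open(stub_path) as f:
        stub_tree = ast.parse(f.read())
    stub_names = set()
    for node in stub_tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            stub_names.add(node.name)
        elif isinstance(node, ast.AnnAssign) and isinstance(node.target, ast.Name):
            stub_names.add(node.target.id)

    # Report stub names that don't exist in the module at runtime.
    runtime_names = set(dir(importlib.import_module(module_name)))
    return sorted(stub_names - runtime_names)


if __name__ == '__main__':
    print(names_missing_at_runtime('stdlib/2and3/bisect.pyi', 'bisect'))
```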
I'm still not convinced. In my experience tests are costly to write (often the cost of developing tests is as much as developing the code they are testing) and definitely not maintenance-free. Having merged a large number of PRs into typeshed, I see two typical forms of contributions:
Very rarely have we seen regressions in stubs -- when these have happened, I believe they were either caused by a large refactoring where some detail was overlooked (e.g. merging the Python 2 and Python 3 versions in a single stub living in "2and3") or when dealing with a really complicated overload (the recent history of …).

What I would like to see, if possible, is a test that verifies that new/changed stubs are acceptable to PyCharm. This is similar to the tests we currently run for mypy and pytype -- since the syntax of stubs is not precisely defined by PEP 484, mypy and pytype implement different dialects, and these tests constrain us to their intersection. I'd be happy to also add such a test for PyCharm, if it is possible to create one. But these tests have the property that they don't require extra work (apart from fixing issues they find, and -- rarely -- a blacklist update) for each new/changed stub.
The main advantage is having to explicitly state the version the stubs are targeting in requirements.txt; I like that part. The runtime test in this sense is more of a safeguard against future API changes in a module. Well, to be thorough, we'd really need to test against all releases that fit in the specified range. That can be a future improvement.

The main disadvantage is the expected low quality of such tests. First, they are not maintenance-free, as Guido says, so I expect them to be simplistic. Second, since they will be simplistic, we need to come up with a concept of "stub coverage" for those tests that would demonstrate how much of a stub our mypy/pytest/PyCharm run against the test is really exercising.

Then the remaining disadvantage is additional test time. Many companies solve this by discovering which tests need to be run on a change based on the files touched and building the dependency graph from that. This is non-trivial but possible for us to achieve in the long term. In the short term we could simply make those tests opt-in if speed becomes a problem.

The main risk I'm seeing with those tests is that it's far from trivial to write anything close to exhaustive even for simple libraries. And once we start mocking things, we're back to testing make-believe code that might not represent reality. In this sense, I'm very tempted to try out reusing a package's existing tests instead of writing your own simplistic ones. That would be more thorough, but would bring additional complexity (incompatible test runners, testing environment setups, convenience magic in tests).

Summing up: @gvanrossum, if that were my decision to make, I would let Andrey go forward with the tests as is, with the expected future improvements towards:
We always have the option to disable the tests in the future if we prove they are more hassle than they are worth. @vlasovskikh, how does this sound? By the way, I have also played with introducing testing to typeshed, but have taken a different approach. I'm writing a tool that modifies the AST of the stubbed module to insert the annotations back into the sources and then runs mypy on the result. This is not generally feasible for code that uses metaclasses and other runtime effects that force stubs to look different. However, compared to testing stubs on user code, I found that this forces stubs to stay closer to reality and helps find typing issues in the code being stubbed. I am writing this with a specific project in mind for now and will definitely share once it's less rough.
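For context, a very rough sketch of that idea (not the actual tool): copy argument and return annotations for top-level functions from a stub onto the implementation source, producing a file that mypy could then check. Assumes Python 3.9+ for ast.unparse; the file paths are illustrative.

```python
import ast
import sys


def merge_annotations(source_path, stub_path):
    with open(source_path) as f:
        source_tree = ast.parse(f.read())
    with open(stub_path) as f:
        stub_tree = ast.parse(f.read())

    # Signatures of top-level functions defined in the stub.
    stub_functions = {
        node.name: node
        for node in stub_tree.body
        if isinstance(node, ast.FunctionDef)
    }

    # Copy argument and return annotations onto the matching source functions.
    for node in source_tree.body:
        if isinstance(node, ast.FunctionDef) and node.name in stub_functions:
            stub = stub_functions[node.name]
            for source_arg, stub_arg in zip(node.args.args, stub.args.args):
                source_arg.annotation = stub_arg.annotation
            node.returns = stub.returns

    return ast.unparse(ast.fix_missing_locations(source_tree))


if __name__ == '__main__':
    # Usage sketch: python merge_annotations.py mymodule.py mymodule.pyi > annotated.py
    print(merge_annotations(sys.argv[1], sys.argv[2]))
```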
I need to take a break from this argument (and many others :-) and am going on vacation next week. Will be back on or after 2/27.
Thanks everybody for your extensive review! I'm closing this PR. I'll keep experimenting with these tests in PyCharm for a while. @gvanrossum I'll see if I'm able to come up with a PyCharm-based tool for checking stubs, similar to mypy_test.py and pytype_test.py, that can be run on the Travis CI server. There are some limitations here since PyCharm isn't really a command-line tool and our CI mode requires some setup. Have a nice vacation! @ambv These are good directions for developing these tests further, thanks!
Based on the discussion in #862 I propose a better way of testing stubs. The tests in this PR are subject to both static checking by a type checker and run-time checking via pytest.
The convention for testing your change to stub files: