
Static and run-time tests for stubs #917

Closed
wants to merge 10 commits

Conversation

vlasovskikh
Member

Based on the discussion in #862, I propose a better way of testing stubs. The tests in this PR are subject to both static checking by a type checker and run-time checking via pytest.

The convention for testing your change to stub files (a minimal example is sketched after this list):

  • Add pytest-compatible tests that use the API of the stub you're about to change
  • A test should pass the run-time pytest check
  • A test should statically fail with type errors before the change and pass after it
  • A third-party library test should pip install its precise dependencies as the first step
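For illustration, a minimal hand-written test following this convention might look like the sketch below. The file path and contents are illustrative only and are not taken from this PR.

```python
# test_data/stdlib/2and3/collections_test.py (hypothetical path) -- a minimal
# run-time + static test: pytest executes the assertions, while a type checker
# verifies the namedtuple() call and the attribute accesses statically.
from collections import namedtuple


def test_namedtuple_basic_usage() -> None:
    Point = namedtuple('Point', 'x y')
    p = Point(1, 2)
    assert p.x == 1
    assert p.y == 2
    assert p == Point(1, 2)
```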

vlasovskikh and others added 2 commits January 26, 2017 18:51
Currently we check only for syntax errors in stub files. The idea is to
add test data for static analyzers similar to how it's done in
DefinitelyTyped for TypeScript.
@vlasovskikh
Member Author

@matthiaskramm Pytype is not happy with defining a namedtuple class. Is it due to the assignment of the namedtuple() call to a variable, or due to the fact that I do it inside a function?

Other tests pass, although I had to add several suppression comments for the code that pip-installs a third-party library. There should be a cleaner way to do it.

CONTRIBUTING.md Outdated
@@ -14,6 +14,9 @@ are important to the project's success.
* [Contact us](#discussion) before starting significant work.
* IMPORTANT: For new libraries, [get permission from the library owner first](#adding-a-new-library).
* Create your stubs [conforming to the coding style](#stub-file-coding-style).
* Add tests for your stubs to `test_data`. These tests should are
Member

Remove "should"

Contributor

As discussed earlier, this should be optional. For example: "Optionally, add some tests for your stubs ...".

Also, I'd like to see more documentation about this, as it may be unclear how these tests work. I'd suggest creating either a section in CONTRIBUTING.md or a separate document about how to write tests. Here are some ideas about what might be useful to discuss:

  • Give an example test and explain the kinds of things it can catch.
  • We don't expect to have tests for everything defined in a stub, but certain things like overloaded functions, generic functions and non-trivial union types may be useful to test.
  • If you fix a bug in a stub, a test will help make sure that the bug will remain fixed. It also makes it easier to review the correctness of your fix.

Member Author

Thanks! I've created a separate section in CONTRIBUTING.md taking your comments into account.

@@ -0,0 +1,17 @@
import pip # type: ignore

pip.main('install six==1.10.0'.split())
Member

It would be nice if the runtime_test.py driver could somehow take care of dependencies, maybe through some sort of magical comment that it scans for.
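For what it's worth, one possible shape of that idea as a rough sketch: the driver scans each test file for a marker comment and installs whatever it declares before running the test. The `# requires:` comment format and the function name below are invented for illustration, not something this PR defines.

```python
# Sketch of the "magical comment" idea: scan a test file for
# "# requires: pkg==version" comments (a hypothetical convention) and
# pip-install each pinned requirement before the test runs.
import re
import subprocess
import sys


def install_declared_requirements(test_path: str) -> None:
    pattern = re.compile(r"#\s*requires:\s*(\S+)")
    with open(test_path) as f:
        requirements = pattern.findall(f.read())
    for requirement in requirements:
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", requirement])
```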

Contributor

I wonder if it would be better to have the requirements in a text file, similar to how other dependencies are dealt with (e.g. requirements-tests-py3.txt)?

Also, maybe we should require that we explicitly define the versions of all packages being installed, including those that can be automatically installed by pip through dependencies. Otherwise, an updated package on PyPI may break our tests.

Member

I actually wonder if having the dependencies explicitly installed here is a bit better, because it means that you can run a single test file without having to install all the dependencies for all other test files. Also it is less likely that dependencies will linger when they are no longer needed.

OTOH if a dependency is needed by two different tests, and the second one forgets to install it, this may still go unnoticed if the tests are always run in the same order (and uninstalling everything when the test is done is tricky). This will then cause unexpected test flakes.

Member Author

I've extracted the code for installing requirements into a pytest fixture that automatically installs *_requirements.txt before running tests from *_test.py. You can run a single test separately and pytest will ensure requirements are installed.
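A rough sketch of what such a fixture could look like in a conftest.py (a hedged reconstruction, not the exact code from this PR):

```python
# conftest.py -- autouse fixture that pip-installs the pinned requirements
# sitting next to a test file (foo_test.py -> foo_requirements.txt) before the
# test runs. Re-running pip per test is accepted here for simplicity; it is a
# no-op when the pinned versions are already installed.
import os
import subprocess
import sys

import pytest


@pytest.fixture(autouse=True)
def install_test_requirements(request):
    test_path = str(request.fspath)
    requirements_path = test_path.replace("_test.py", "_requirements.txt")
    if requirements_path != test_path and os.path.exists(requirements_path):
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "-r", requirements_path])
    yield
```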

version_dirs = [
'2and3',
'3',
'%d.%d' % (sys.version_info[0], sys.version_info[1]),
Member

This should actually include all minor-version subdirectories up to and including the version you're running: when you're running 3.4, for example, type checkers will look at the subdirectories for 3.3 and 3.4 but not 3.5.
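A sketch of that selection logic, assuming Python 3 and typeshed's existing 3.x subdirectory layout (the lower bound of 3.2 is an assumption):

```python
# Pick '2and3', '3', and every 3.N directory up to the running interpreter's
# minor version, e.g. '3.2'..'3.4' when running on Python 3.4.
import sys

major, minor = sys.version_info[:2]
version_dirs = ['2and3', str(major)] + [
    '%d.%d' % (major, n) for n in range(2, minor + 1)
]
```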

Member Author

@JelleZijlstra I believe it doesn't apply to run-time checks since we run them on a particular version of the interpreter and it may not have some module from older Python versions.

Member

But in typeshed, directories like 3.4 also apply to newer versions. For example, asyncio is in stdlib/3.4 because it was introduced in 3.4, but the stubs are also used for 3.5 and 3.6. I think the tests should behave the same.

Member Author

If there was a backwards-incompatible change in the API of a module in older Python 3, the tests for the current version might fail. If it is absolutely necessary to run tests for some older Python 3.N module, we will set up another Travis environment with Python 3.N.

Member

This may be true, but it's certainly confusing, since you are using the same directory naming scheme as for the stubs, and for stubs the rule that Jelle explains is essential (otherwise e.g. asyncio would not be found for Python 3.5 or higher).

It also points, again, to yet another reason why I don't believe these runtime tests are going to help much (they don't check that the sys.version_info checks in the stubs are correct).


Point = namedtuple('Point', 'x y')
p = Point(1, 2)

assert p == Point(1, 2)
Contributor

Note that mypy currently can't check that the return type is correct here. For example, if we declared `def f() -> int: ...` and had a test like `assert f() == 'x'`, mypy wouldn't flag it as an error (see python/typing#378 and python/mypy#1271). Would pytype flag this as an error?

Even if we can't check these yet, the assertions are useful for code reviewers, as it makes it easier to manually check that a return type annotation looks reasonable.
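One workaround for this limitation (a sketch, not something this PR prescribes) is to assign the result to an explicitly annotated variable (PEP 526), so that a wrong declared return type in the stub surfaces as a static assignment error even though the `==` comparison would not:

```python
def f() -> int:
    # Stand-in for a function whose stub declares "def f() -> int: ...".
    return 42


result: int = f()    # a wrong return type in the stub becomes a static error here
assert result == 42  # the run-time assertion still executes under pytest
```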


@gvanrossum
Member

I'm still very skeptical. This runs the risk of slowing down test runs dramatically (once we scale up) and I think most of the issues that have been fixed wouldn't have been found by these types of tests. Also the tests look like they would be much larger in volume than the stubs to which they belong.

@matthiaskramm
Contributor

Among other things, I don't quite see why pyi files should require new tests, given that the library files they model already have existing tests.

Take collections_test.py. The things it tests are just a subset of test_collections.py. Could we just get the latter to run against the pyi, preferably without having to add any new files to typeshed?

@gvanrossum
Member

Alas, actual stdlib code (tests or otherwise) is usually too dynamic to pass with mypy.

I am okay with this proposal as long as it's not mandatory and doesn't slow down the test runs much. It won't be comprehensive, but it might be useful to catch regressions in particularly tricky signatures (e.g. Dict.get). It seems that mypy's test-data/samples and test-data/stdlib-samples/3.2 serve a similar purpose, possibly we could move those into typeshed.

@vlasovskikh
Member Author

Yes, we could keep the run-time tests minimal and use them for checking for false errors and, optionally, verifying some contributions.

They are good for making sure that a stub covers at least some basic API, and they may be helpful for reviewers, serving as proof that the stub author understood the API correctly.

Run-time tests are even better for preventing regressions after some false errors have been found and fixed. And there will be regressions, since our code analysis engines all differ, we are not 100% PEP 484 compliant, and fixing one bit in the stubs (like object.__slots__) can affect how type checkers understand some other bit.

@vlasovskikh
Member Author

@matthiaskramm I keep facing various type checking issues with pytype. I could disable it for the run-time + static test_data files, or I could file new issues to your tracker about the problems I discover. Which do you prefer?

@matthiaskramm
Contributor

matthiaskramm commented Feb 7, 2017

It's true, tests are heavy on introspection. I did a few experiments using pytype.
I found a bug in time.pyi by running `pytype Python-2.7.13/Lib/test/test_time.py` and a bug in csv.pyi by running `pytype Python-2.7.13/Lib/test/test_csv.py`.
However, the output is noisy. For example, `pytype Python-2.7.13/Lib/test/test_bisect.py` lists over 100 errors, because pytype fails to understand the intention behind TestBisect.module.

Options:

  1. (Proposed by this PR) Ignore the existing test suite, hand-craft new tests.
  2. Have mypy or pytype execute the original tests without any changes, whitelist existing/incorrect errors.
  3. Copy&paste the original tests into typeshed, adjust so they pass type-checking.
  4. Adjust the original tests, in the Python sources, so they pass type-checking.

@vlasovskikh
Member Author

@matthiaskramm I would prefer to start with a few hand-written tests specifically designed to pass both run-time and static tests.

But even in the two tests I've written for demonstration purposes in this PR, pytype reports a false error about a non-top-level import. I'd like to proceed without running those tests under pytype, so as not to hold back this PR.

@matthiaskramm
Contributor

I'm happy to help you fix the pytype errors.

However, I feel we need an overall strategy about how we want to do testing in the future. It's not clear to me that this PR is a step in the right direction.

@vlasovskikh
Member Author

The way I would like to proceed is to accept pull requests with run-time + static tests even if they fail under pytype. I assume you can fix them for pytype later on. If we were blocked whenever any of mypy, pytype, or PyCharm failed to understand a particular test case, PRs would take far too long to get accepted. Maybe mypy is a bit special since it's the de facto reference implementation of PEP 484, but as for pytype, PyCharm, and other checkers that support PEP 484, I see no reason to wait until a test passes for all of them (assuming the tests are PEP 484 compliant and pass at run time).

The approach to testing stubs proposed in this PR might not be ideal, but it's already working: it has found bugs both in stubs (see the examples in this PR) and in the checkers (see the failing tests for pytype).

I would like @JukkaL and @gvanrossum to have another look at the updated PR. It includes documentation on how to create optional run-time + static tests, as well as test examples that handle requirements for third-party libraries.

@gvanrossum
Member

gvanrossum commented Feb 10, 2017 via email

@vlasovskikh
Member Author

@gvanrossum I've re-run the tests after the changes in pytype; they are now passing.

@@ -70,7 +70,7 @@ def pytype_test(args):
print("Cannot run pytd. Did you install pytype?")
return 0, 0

wanted = re.compile(r"stdlib/(2|2\.7|2and3)/.*\.pyi$")
wanted = re.compile(r"stdlib/(2|2\.7|2and3)/.*\.pyi?$")
Contributor

What's this change for?

Member Author

I changed it in order to run pytype on test files, which are regular Python files. Since I've disabled running pytype for them, I could revert this change as well.

Contributor

The existing pytype_test.py checks pyi files using the pytd tool. If you want to run against py files, you'll need to run the pytype executable.

Sorry. I should have caught that earlier.

Contributor

If you do run pytype, you may have to run it as TYPESHED_HOME=/some/path/ pytype in order to point it at the typeshed version you're testing.
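For what it's worth, a sketch of driving that from Python rather than the shell; the paths are placeholders, and it assumes the pytype executable is on PATH and honors the TYPESHED_HOME variable mentioned above:

```python
# Run the pytype executable on a test file, pointing it at a local typeshed
# checkout via the TYPESHED_HOME environment variable.
import os
import subprocess

env = dict(os.environ, TYPESHED_HOME="/path/to/typeshed")
subprocess.check_call(["pytype", "test_data/collections_test.py"], env=env)
```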

@matthiaskramm
Contributor

So if we go forward with this PR, we're moving in the direction of

  1. Copy&paste (subsets of) tests from cpython/Lib/test into typeshed, adjust so they pass type-checking.

Is that what we want? (We're going to get quite a bit of duplication, but we're also getting a brand new test suite for the standard library that passes typechecking.)

@vlasovskikh
Member Author

I propose to start small and have run-time + static tests only for some fixes of false errors sent to typeshed as PRs. Converting the whole CPython test suite to pass type checking in Mypy looks like a big undertaking.

@matthiaskramm
Contributor

I know that what you're doing right now is just a lot of small tests, for specific files, to verify fixes.

But these tests are going to grow, in number and size. What will this look like in three years? I suspect we'll have something that looks very similar to the Python test suite. Maybe that's what we want. But the time to make that decision is right now.

@vlasovskikh
Member Author

I see no problem in having some tests for fixed false errors. If the test suite grows over time and its size becomes close to the CPython test suite, we will be able to decide then whether we will continue using it or rewrite the CPython tests to become type-check compatible. Also, there are Typeshed stubs for third-party libraries whose authors don't want type hints in their code, so I guess they won't be happy about adjusting their test suites to pass type checks. Having Typeshed-specific tests for them will be helpful.

@matthiaskramm
Contributor

If the test suite [...] becomes close to the CPython test suite [...], we will be able to decide whether we will continue using it or rewrite CPython tests to become type check compatible.

That's the situation I'd like to prevent: we kick the can down the road today, and then at some point in the future we'll be sitting on two huge test suites. We then decide we want to merge them again. What if we can't? What if merging them required going through every single file, every single test, in both versions of the file, manually trying to get both the type checking and the runtime tests to work, in tests that were only constructed for one or the other? We're talking about a massive undertaking. I'd like to see a future direction that sounds more realistic.

@gvanrossum
Member

After reading @matthiaskramm's comments I am once again changing my opinion to extreme skepticism (in spite of @JukkaL's endorsement).

Many contributors will assume that adding tests is better than not adding tests, so despite the language about tests being optional we will end up with many tests. It's not realistic to expect those to ever be merged with the stdlib test suite, so it will grow to be a second large redundant test suite. And yet these tests cannot ever be comprehensive, except in the case of very simple APIs.

I still haven't understood why @vlasovskikh wants runtime tests. What problem are you addressing here?

I would be a little less opposed to having runtime tests just for third party modules, as those might act as the "canary in the coalmine" when a third party API changes incompatibly (they don't all have the same commitment to backwards compatibility as the stdlib).

But for third party packages I think the problem of versioning of those packages is more important than the problem of testing compliance.

I think we may have to accept "stalemate" here (no rough consensus reachable), which means we will have to stick with the status quo.

@lincolnq
Contributor

I'm not insanely familiar with this project, but writing runtime test cases where we actually care about and test the values seems wrong here, for all of the reasons above -- it seems to duplicate a lot of work.

For this project, I'd be in favor of some kind of testing where we only check the types of things, not their values; but we could also write test cases which assert that a given expression or block does not type-check (since the point of stub files is to prevent bugs, and if we can only check positive examples then you can always make a test pass by changing a type to Any). Not sure what this would look like yet, though.
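Purely as a hypothetical illustration of that idea (nothing in this PR implements it): a negative test could mark the line that must be rejected with an error comment, in the spirit of the `# E:` convention mypy uses in its own test-data files, and an imagined driver would fail if the checker reported no error there.

```python
# Hypothetical "should not type-check" case, meant only for the static
# checker (running it would raise TypeError): an imagined driver would verify
# that the checker reports an error on the marked line, so weakening the stub
# to Any would make this test fail.
from collections import namedtuple

Point = namedtuple('Point', 'x y')
Point(1, 2, 3)  # E: too many arguments for "Point"
```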

@vlasovskikh
Member Author

@gvanrossum The reasons for having run-time + static tests in the Typeshed repo:

  1. Compatibility. Having a common test suite will help ensure compatibility among type checkers and increase PEP 484 compliance.

  2. Automated checks for the correctness of pull requests to Typeshed. Right now we have to read the docs about the API covered by a PR and make the decision based on code review alone, with almost no automated checks (besides the shallow syntax checks in tests/).

I see no problem in having more tests in addition to the CPython tests. Tests rarely need maintenance, yet they help ensure compatibility, automate the checking of pull requests, and fight regressions.

If I still haven't convinced you, I'll close this PR and start using those tests only for PyCharm for a while.

@JukkaL
Contributor

JukkaL commented Feb 17, 2017

@vlasovskikh It would be interesting to see what a realistic test suite would look like. If you start using stub tests in PyCharm development, perhaps you could document all the stub issues they are finding, and use those to argue for having tests in typeshed (assuming they help catch issues, of course).

I've done a little work on an alternative approach. The idea is to do automated testing of stubs by comparing the contents of stubs to what's defined at runtime in a module. For example, if a stub defines a function f but the module doesn't define it at runtime, we'd report that as a problem with the stub. My initial results indicate this would result in a large number of false positives that we'd probably need to manually check and whitelist, but it would likely also find dozens of real problems in stubs. These two approaches aren't mutually exclusive and both might be useful to have.
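A very rough sketch of that comparison (a reconstruction of the idea, not Jukka's implementation): collect the top-level names a stub defines and report those that the module does not define at run time. Real stubs need handling for conditional definitions, re-exports and aliases, which is where the expected false positives come from.

```python
# Compare a stub's top-level classes/functions against what the module
# actually defines at run time; anything missing is a candidate stub problem.
import ast
import importlib
from typing import List


def missing_at_runtime(stub_path: str, module_name: str) -> List[str]:
    with open(stub_path) as f:
        stub_tree = ast.parse(f.read())
    stub_names = {
        node.name
        for node in stub_tree.body
        if isinstance(node, (ast.FunctionDef, ast.ClassDef))
    }
    module = importlib.import_module(module_name)
    return sorted(name for name in stub_names if not hasattr(module, name))
```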

@vlasovskikh
Member Author

@JukkaL Agreed. Typeshed tests for PyCharm are on GitHub. I'll be back with an update when we have more real-world tests.

@gvanrossum
Member

I'm still not convinced. In my experience tests are costly to write (often the cost of developing tests is as much as developing the code they are testing) and definitely not maintenance-free. Having merged a large number of PRs into typeshed, I see two typical forms of contributions:

  1. Brand new stubs for some module (or package). These are typically contributed by someone who needs them because their own app uses that module, and their app serves as an initial, "manual" test that the stubs are valid. Sure, their app won't be covering every nook and cranny of the module's API, but neither would explicit tests, if we were to make them write those. So these stubs will be roughly correct, and they will be incrementally improved via the following workflow. I think requiring tests here would just add friction (more than doubling the work for the contributor and hence for the reviewer) and would not improve the correctness of the typical contribution (because most contributions are already mostly correct due to the "manual" testing).

  2. Incremental improvements to some stub. This could be tweaking the type (or optionality) of arguments or return values, or adding missing methods, classes, or submodules. These increments are typically very small (touching only a few methods) and it's easy to review them manually. A huge majority of these make the stubs more flexible, e.g. adding a new method, replacing a required argument with an optional one, or adding the ability to accept bytes in addition to text, and that property is easy to review.

Very rarely have we seen regressions in stubs -- when these have happened, I believe they were either caused by a large refactoring where some detail was overlooked (e.g. merging the Python 2 and Python 3 versions in a single stub living in "2and3") or when dealing with a really complicated overload (the recent history of Mapping.get() comes to mind -- and this was mostly due to mypy implementation issues).

What I would like to see, if possible, is a test that verifies that new/changed stubs are acceptable to PyCharm. This is similar to the tests we currently run for mypy and pytype -- since the syntax of stubs is not precisely defined by PEP 484, mypy and pytype implement different dialects, and these tests constrain us to their intersection. I'd be happy to also add such a test for PyCharm, if it is possible to create one. But these tests have the property that they don't require extra work (apart from fixing issues they find, and -- rarely -- a blacklist update) for each new/changed stub.

@ambv
Contributor

ambv commented Feb 17, 2017

The main advantage is having to explicitly state the versions the stubs are targeting in requirements.txt; I like that part. The runtime test in this sense is more of a safeguard against future API changes in a module. Well, to be thorough, we'd really need to test against all releases that fit in the specified range. That can be a future improvement.

The main disadvantage is the expected low quality of such tests. First, they are not maintenance-free, as Guido says. So I expect them to be simplistic. Second, since they will be simplistic, we need to come up with a concept of "stub coverage" for those tests that would demonstrate how much of a stub the mypy/pytest/PyCharm run against the test is really exercising.
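As a purely illustrative starting point for such a metric (a naive sketch, not something proposed in this PR): the fraction of a stub's top-level names that a test module references at all. A real notion of stub coverage would have to account for methods, overloads and signatures, not just names.

```python
# Naive "stub coverage": share of top-level stub names referenced in the test.
import ast
from typing import Set


def top_level_names(path: str) -> Set[str]:
    with open(path) as f:
        tree = ast.parse(f.read())
    return {n.name for n in tree.body
            if isinstance(n, (ast.FunctionDef, ast.ClassDef))}


def referenced_names(path: str) -> Set[str]:
    with open(path) as f:
        tree = ast.parse(f.read())
    names = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    attrs = {n.attr for n in ast.walk(tree) if isinstance(n, ast.Attribute)}
    return names | attrs


def stub_coverage(stub_path: str, test_path: str) -> float:
    defined = top_level_names(stub_path)
    if not defined:
        return 1.0
    return len(defined & referenced_names(test_path)) / float(len(defined))
```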

Then the remaining disadvantage is additional test time. Many companies solve this by discovering which tests need to be run on a change based on files touched and building the dependency graph from that. This is non-trivial but possible to achieve for us in the long term. In the short term we could simply make those tests opt-in if speed becomes a problem.

The main risk I see with those tests is that it's far from trivial to write anything close to exhaustive even for simple libraries. And once we start mocking things, we're back to testing make-believe code that might not represent reality. In this sense, I'm very tempted to try out reusing a package's existing tests instead of writing our own simplistic ones. That would be more thorough, but would bring additional complexity (incompatible test runners, testing environment setups, convenience magic in tests).

Summing up: @gvanrossum, if that were my decision to make, I would let Andrey go forward with the tests as is, with the expected future improvements towards:

  • only running tests on relevant changes; (synchronizing typeshed in mypy would be one of the few cases when we'd run all tests regardless)
  • coming up with "stub coverage";
  • trying out running mypy on existing test suites.

We always have the option to disable the tests in the future if they prove to be more hassle than they are worth.

@vlasovskikh, how does this sound?

By the way, I have also played with introducing testing to typeshed, but taken a different approach. I'm writing a tool that modifies the AST of the stubbed module to insert the annotations back into the sources and then runs mypy on it. This is not generally feasible for code that uses metaclasses and other runtime effects that force stubs to look different. However, compared to testing stubs on user code, I found that this forces stubs to stay closer to reality and helps find typing issues in the code being stubbed. I am writing this with a specific project in mind for now and will definitely share once it's less rough.
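A heavily condensed sketch of that idea (a reconstruction, not the actual tool, which has to deal with methods, overloads, decorators and much more): read the function signatures from a stub, copy their annotations onto the matching functions in the module's AST, and emit the annotated source for mypy to check.

```python
# Merge annotations from a stub onto a module's top-level functions.
import ast


def stub_functions(stub_source: str):
    tree = ast.parse(stub_source)
    return {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}


def apply_stub_annotations(module_source: str, stub_source: str) -> str:
    stubs = stub_functions(stub_source)
    tree = ast.parse(module_source)
    for node in tree.body:
        if isinstance(node, ast.FunctionDef) and node.name in stubs:
            stub = stubs[node.name]
            node.returns = stub.returns
            for arg, stub_arg in zip(node.args.args, stub.args.args):
                arg.annotation = stub_arg.annotation
    # ast.unparse() needs Python 3.9+; an older setup would use a library
    # such as astunparse to turn the tree back into source for mypy.
    return ast.unparse(tree)
```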

@gvanrossum
Member

gvanrossum commented Feb 17, 2017 via email

@vlasovskikh
Member Author

Thanks everybody for your extensive review! I'm closing this PR. I'll keep experimenting with these tests in PyCharm for a while.

@gvanrossum I'll see if I'm able to come up with a PyCharm-based tool for checking stubs, similar to mypy_test.py and pytype_test.py, that can be run on the Travis CI server. There are some limitations here, since PyCharm isn't really a command-line tool and our CI mode requires some setup. Have a nice vacation!

@ambv These are good directions for developing these tests further, thanks!
