Skip to content

Conversation

@potiuk
Copy link
Member

@potiuk potiuk commented Jan 6, 2024

MyPy as used in our static checks has slightly different heuristics
when running on on individual files and whole packages. This sometimes
causes semi-random failures when different set of files is produced
when pre-commits split the files between parallel processes.

The regular mypy-* pre-commits work by passing filenames to mypy
checks, and when --all-files flag is passed to mypy, this means
that 2700 files are passed. In this case pre-commit will split such
long list of files to several sequential muypy executions. This
is not very good because depending on the list of files passed,
mypy can split the list diferently and results will be different
when just list of files changes - so mypy might start detecting
problems that were not present before.

This PR introduces new mypy check that runs mypy for packages
rather than individual files. We cannot run them for local
pre-commit runs, because in many cases, such package based
mypy check will run for minutes when a single file changes,
due to cache invalidation rules - and we do not want to penalise
commits that are changing common airflow code (because such PRs
would invalidate a lot of mypy cache every time such common file
changes). So we still want to run file-based mypy for local
commits. But we do not want to pass 2700 files in CI, rather than
that on CI we want to run mypy checks "per package".

This PR introduces a new "manual" stage mypy pre-commit check that
will run "package" based mypy checks and adds selective check rules
that will decide properly when to run such tests and separate,
matrix-based CI job that will run such mypy checks - separately
for each of the packages: "airflow", "providers", "docs", "dev".

Also this job will skip providers checks in non-main branch and
will run all tests when "full tests needed" are requested.

This PR ignores some errors resulted from 3rd-party libraries used
that are randomply appearing when some files are modified (and fixes
the current main failures)


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@potiuk
Copy link
Member Author

potiuk commented Jan 6, 2024

@potiuk
Copy link
Member Author

potiuk commented Jan 6, 2024

I am about to add a permanent check to run such complete mypy check in canary build soon :) - but merging it now should fix the problem in main.

@potiuk potiuk changed the title Fix attr-defined/override mypy errors Fix mypy errors resulting from all code mypy check Jan 6, 2024
@potiuk
Copy link
Member Author

potiuk commented Jan 6, 2024

Added a separate mypy check that runs a complete mypy check - this way we can prevent similar problems in the future.

@jscheffl
Copy link
Contributor

jscheffl commented Jan 6, 2024

I do not really get the understanding for "mypy all" vs. "mypy everything". What is going to be checked pre-commit and what on PR and what on main.

If the partly/differential mypy is "flaky"/not fully deterministic, should we not take the safe route and always run mypy over all files? As long as static checks are not the longest running actions and most of the local development can use a mypy cache, what would be the drawback of testing "more" to be on the safe side?

@potiuk
Copy link
Member Author

potiuk commented Jan 6, 2024

I do not really get the understanding for "mypy all" vs. "mypy everything". What is going to be checked pre-commit and what on PR and what on main.

The difference is how pre-commit works and automatically splits list of files to check between multiple processes. When you run mypy checks on your commit, pre-commit works in the way that it will pass the list of files changed in your pre-commit to underlying pre-commit when you have pass-filenames: true. This is in order to speed up the check if you run local commit and only one or two files change (then locally pre-commit will only pass those two files as parameters - so it is good for your local commitS).

In CI - in every PR we run pre-commit with --all-files flag - to make sure that we are not only running differential check -but full chec. Which means that it will pass all files that match the criteria to the pre-commit.

And due to POSIX specifications and limitations, it cannot run as a single command - because we have too many files in Airflow. The list contains currently 2700 files. There are certain limitations of POSIX command line and if you have 2700 arguments passed to a command, each of them being a path, it's not possible to run it as a single command. So what happens - pre-commit will automatically split it into several commands.

Effectively what happens is that the regular mypy-pre-commits when --all-files is run will get 2700 files as an input and what happens is that these 2700 files are split into likely 2 or 3 separate mypy commands run sequentially - each of them having much less number of files - simply because there is a limited number of arguments that can be passed via command line.

And they way how mypy works is that depending on the list of files passed to it, it might behave a little differently - the main reason is that it uses heurisitics to guess if a field/variable is of a certain type - based on the list of files it gets as input and the "certain" imports it can follow throuhg. The end result is that when you only pass a subset of files to mypy it will sometimes not guess correctly what the type is (because none of the files passed to it will import that file.

The new check does something different. It runs ONE check with docs, airflow an docs PACKAGES as parameters (not files)- which is different because then mypy will discover and find ALL the files. Which is slower than running mypy check for just one or two files that were changed in a commit, but generally it will always produce the same results as it will always check all the files.

@potiuk
Copy link
Member Author

potiuk commented Jan 6, 2024

If the partly/differential mypy is "flaky"/not fully deterministic, should we not take the safe route and always run mypy over all files? As long as static checks are not the longest running actions and most of the local development can use a mypy cache, what would be the drawback of testing "more" to be on the safe side?

That would make every single commit to run 5-10 minutes to check all the files after you install pre-commit in your local filesystem/git - even with mypy_cache just importing and building the tree / object classes for Airflow will take a long time - when you change one or two files.

@potiuk
Copy link
Member Author

potiuk commented Jan 7, 2024

The "mypy (airflow)" check took now very long (9min) - is there some --exclude missing such that it does not check provider subfolder redundant to the parallel provider test?

Yep. This is what I just recalled to add :)

@potiuk
Copy link
Member Author

potiuk commented Jan 7, 2024

BTW. We also might want to add some extra instructions on how to run the tests if Mypy test fails - because it's not obvious that you should run

MYPY_PACKAGES="airflow" pre-commit run --hooks-stage manual mypy

to locally reproduce it.

@potiuk
Copy link
Member Author

potiuk commented Jan 7, 2024

Added the instructions. And also added one more thing that was wrong - when we update constraints we need to also make a dependency on the mypy job - otherwise we will update constraints, when mypy checks for new dependencies will fail which is precisely what canary builds are supposed to prevent. The canary builds will run with "latest" dependencies and if a new package released by someone will start failing our mypy checks we want to stop updating constraints until we fix it in main.

@potiuk
Copy link
Member Author

potiuk commented Jan 7, 2024

And added STATIC_CHECKS documentaiton explaining all that.

@potiuk
Copy link
Member Author

potiuk commented Jan 7, 2024

Final touches: Added --color always (in CI this is a pseudo-terminal and pre-commit does not detect color terminal support automatically) and --verbose - we will see a little more output even if mypy checks will be successful (how many files were checked).

MyPy as used in our static checks has slightly different heuristics
when running on on individual files and whole packages. This sometimes
causes semi-random failures when different set of files is produced
when pre-commits split the files between parallel processes.

The regular `mypy-*` pre-commits work by passing filenames to mypy
checks, and when `--all-files` flag is passed to mypy, this means
that 2700 files are passed. In this case pre-commit will split such
long list of files to several sequential muypy executions. This
is not very good because depending on the list of files passed,
mypy can split the list diferently and results will be different
when just list of files changes - so mypy might start detecting
problems that were not present before.

This PR introduces new `mypy` check that runs mypy for packages
rather than individual files. We cannot run them for local
pre-commit runs, because in many cases, such package based
mypy check will run for minutes when a single file changes,
due to cache invalidation rules - and we do not want to penalise
commits that are changing common airflow code (because such PRs
would invalidate a lot of mypy cache every time such common file
changes). So we still want to run file-based mypy for local
commits. But we do not want to pass 2700 files in CI, rather than
that on CI we want to run mypy checks "per package".

This PR introduces a new "manual" stage mypy pre-commit check that
will run "package" based mypy checks and adds selective check rules
that will decide properly when to run such tests and separate,
matrix-based CI job that will run such mypy checks - separately
for each of the packages: "airflow", "providers", "docs", "dev".

Also this job will skip providers checks in non-main branch and
will run all tests when "full tests needed" are requested.

This PR ignores some errors resulted from 3rd-party libraries used
that are randomply appearing when some files are modified (and fixes
the current main failures)
@potiuk
Copy link
Member Author

potiuk commented Jan 7, 2024

A little too much verbosity :)

@potiuk
Copy link
Member Author

potiuk commented Jan 7, 2024

Right...

That's good enough:

Screenshot 2024-01-07 at 12 57 34

@potiuk potiuk merged commit f7b663d into apache:main Jan 7, 2024
@potiuk potiuk deleted the fix-mypy-errors branch January 7, 2024 12:53
@ephraimbuddy ephraimbuddy added the changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) label Jan 10, 2024
@ephraimbuddy ephraimbuddy added this to the Airflow 2.8.1 milestone Jan 10, 2024
potiuk added a commit that referenced this pull request Jan 13, 2024
MyPy as used in our static checks has slightly different heuristics
when running on on individual files and whole packages. This sometimes
causes semi-random failures when different set of files is produced
when pre-commits split the files between parallel processes.

The regular `mypy-*` pre-commits work by passing filenames to mypy
checks, and when `--all-files` flag is passed to mypy, this means
that 2700 files are passed. In this case pre-commit will split such
long list of files to several sequential muypy executions. This
is not very good because depending on the list of files passed,
mypy can split the list diferently and results will be different
when just list of files changes - so mypy might start detecting
problems that were not present before.

This PR introduces new `mypy` check that runs mypy for packages
rather than individual files. We cannot run them for local
pre-commit runs, because in many cases, such package based
mypy check will run for minutes when a single file changes,
due to cache invalidation rules - and we do not want to penalise
commits that are changing common airflow code (because such PRs
would invalidate a lot of mypy cache every time such common file
changes). So we still want to run file-based mypy for local
commits. But we do not want to pass 2700 files in CI, rather than
that on CI we want to run mypy checks "per package".

This PR introduces a new "manual" stage mypy pre-commit check that
will run "package" based mypy checks and adds selective check rules
that will decide properly when to run such tests and separate,
matrix-based CI job that will run such mypy checks - separately
for each of the packages: "airflow", "providers", "docs", "dev".

Also this job will skip providers checks in non-main branch and
will run all tests when "full tests needed" are requested.

This PR ignores some errors resulted from 3rd-party libraries used
that are randomply appearing when some files are modified (and fixes
the current main failures)

(cherry picked from commit f7b663d)
potiuk added a commit to potiuk/airflow that referenced this pull request Jan 13, 2024
The apache#36638 change introduced "full package" checks - where in
case of CI we run mypy checks separately from regular static checks,
for the whole folders.

However it's been a little convoluted on how the checks were run,
with a separate env variable. Instead we can actually have multiple
mypy-* checks (same as we have for local pre-commit runs) as mypy
allows to have multiple checks with the same name in various stages.

This change simplifies the setup a bit:

* we name the checks "folder" checks because this is what they are
* we name the check names consistent ("airflow", "providers", "docs",
  "dev") with mypy-folders output
* we have separate small script to run the folder checks
* we map "providers" into "airflow/providers" in the pre-commit
potiuk added a commit that referenced this pull request Jan 13, 2024
The #36638 change introduced "full package" checks - where in
case of CI we run mypy checks separately from regular static checks,
for the whole folders.

However it's been a little convoluted on how the checks were run,
with a separate env variable. Instead we can actually have multiple
mypy-* checks (same as we have for local pre-commit runs) as mypy
allows to have multiple checks with the same name in various stages.

This change simplifies the setup a bit:

* we name the checks "folder" checks because this is what they are
* we name the check names consistent ("airflow", "providers", "docs",
  "dev") with mypy-folders output
* we have separate small script to run the folder checks
* we map "providers" into "airflow/providers" in the pre-commit
potiuk added a commit that referenced this pull request Jan 13, 2024
The #36638 change introduced "full package" checks - where in
case of CI we run mypy checks separately from regular static checks,
for the whole folders.

However it's been a little convoluted on how the checks were run,
with a separate env variable. Instead we can actually have multiple
mypy-* checks (same as we have for local pre-commit runs) as mypy
allows to have multiple checks with the same name in various stages.

This change simplifies the setup a bit:

* we name the checks "folder" checks because this is what they are
* we name the check names consistent ("airflow", "providers", "docs",
  "dev") with mypy-folders output
* we have separate small script to run the folder checks
* we map "providers" into "airflow/providers" in the pre-commit

(cherry picked from commit a912948)
ephraimbuddy pushed a commit that referenced this pull request Jan 15, 2024
MyPy as used in our static checks has slightly different heuristics
when running on on individual files and whole packages. This sometimes
causes semi-random failures when different set of files is produced
when pre-commits split the files between parallel processes.

The regular `mypy-*` pre-commits work by passing filenames to mypy
checks, and when `--all-files` flag is passed to mypy, this means
that 2700 files are passed. In this case pre-commit will split such
long list of files to several sequential muypy executions. This
is not very good because depending on the list of files passed,
mypy can split the list diferently and results will be different
when just list of files changes - so mypy might start detecting
problems that were not present before.

This PR introduces new `mypy` check that runs mypy for packages
rather than individual files. We cannot run them for local
pre-commit runs, because in many cases, such package based
mypy check will run for minutes when a single file changes,
due to cache invalidation rules - and we do not want to penalise
commits that are changing common airflow code (because such PRs
would invalidate a lot of mypy cache every time such common file
changes). So we still want to run file-based mypy for local
commits. But we do not want to pass 2700 files in CI, rather than
that on CI we want to run mypy checks "per package".

This PR introduces a new "manual" stage mypy pre-commit check that
will run "package" based mypy checks and adds selective check rules
that will decide properly when to run such tests and separate,
matrix-based CI job that will run such mypy checks - separately
for each of the packages: "airflow", "providers", "docs", "dev".

Also this job will skip providers checks in non-main branch and
will run all tests when "full tests needed" are requested.

This PR ignores some errors resulted from 3rd-party libraries used
that are randomply appearing when some files are modified (and fixes
the current main failures)

(cherry picked from commit f7b663d)
ephraimbuddy pushed a commit that referenced this pull request Jan 15, 2024
The #36638 change introduced "full package" checks - where in
case of CI we run mypy checks separately from regular static checks,
for the whole folders.

However it's been a little convoluted on how the checks were run,
with a separate env variable. Instead we can actually have multiple
mypy-* checks (same as we have for local pre-commit runs) as mypy
allows to have multiple checks with the same name in various stages.

This change simplifies the setup a bit:

* we name the checks "folder" checks because this is what they are
* we name the check names consistent ("airflow", "providers", "docs",
  "dev") with mypy-folders output
* we have separate small script to run the folder checks
* we map "providers" into "airflow/providers" in the pre-commit

(cherry picked from commit a912948)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:providers changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) provider:apache-druid provider:databricks provider:exasol provider:google Google (including GCP) related issues provider:grpc provider:hashicorp Hashicorp provider related issues provider:postgres provider:snowflake Issues related to Snowflake provider provider:trino provider:vertica

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants