-
Notifications
You must be signed in to change notification settings - Fork 15.9k
Make pandas an optional core dependency
#17575
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
potiuk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huge yes.
|
The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just realized -we also need to add it to set of extras in production Dockerfile (to keep it behave the way it did).
Line 37 in 9875757
| ARG AIRFLOW_EXTRAS="async,amazon,celery,cncf.kubernetes,docker,dask,elasticsearch,ftp,grpc,hashicorp,http,ldap,google,google_auth,microsoft.azure,mysql,postgres,redis,sendgrid,sftp,slack,ssh,statsd,virtualenv" |
|
And the Dockerfile will continue to work for old airflow versions as well. AFAIK when you specifiy extra that is missing, it will be silently ignored, so it will work for pre 2.2 Airflow versions as well. |
|
I think this closes #12500 - are there any other uses of numpy in the code base? |
Not yet. It is used at least in AirflowJsonEncoder 😱 |
setup.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| # Pandas stopped releasing 3.6 binaries for 1.2.* series. | |
| 'pandas>=0.17.1, <1.2;python_version<"3.7"', | |
| 'pandas>=0.17.1, <2.0;python_version>="3.7"', | |
| 'pandas>=0.17.1, <2.0', |
Pandas 1.2+ provides the Requires-Python metadata, so 1.2+ versions will be automatically excluded when pip install on Python 3.6:
$ python3.6 -m venv x
$ . x/bin/activate
(x) $ pip install pandas
Collecting pandas
Using cached https://.../pandas-1.1.5-cp36-cp36m-manylinux1_x86_64.whl
Collecting python-dateutil>=2.7.3 (from pandas)
...
So the python_version<"3.7" line is not needed. (Although it should not have a negative impact either.)
Note: Python 3.6.0 ships with pip 9.1 by default, which supports Requires-Python, so it’s highly unlikely someone on 3.6 running on pip version not supporting this metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, updated
We only use `pandas` in `DbApiHook.get_pandas_df`. Not all users use it, plus while `pandas` now supports many pre-compiled packages it still can take forever where it needs to be compiled. So for first-time users this can be a turn off. If pandas is already installed this will work fine, but if not users have an option to run `pip install apache-airflow[pandas]`
|
398ebfe should make numpy optional as well |
Missed removing ``numpy`` from `setup.cfg` in apache#17575. It was only added in setup.cfg in https://github.com/apache/airflow/pull/15209/files#diff-380c6a8ebbbce17d55d50ef17d3cf906 numpy already has `python_requires` metadata: https://github.com/numpy/numpy/blob/v1.20.3/setup.py#L473 so we don't need to set `numpy<1.20;python_version<"3.7"`
Missed removing ``numpy`` from `setup.cfg` in #17575. It was only added in setup.cfg in https://github.com/apache/airflow/pull/15209/files#diff-380c6a8ebbbce17d55d50ef17d3cf906 numpy already has `python_requires` metadata: https://github.com/numpy/numpy/blob/v1.20.3/setup.py#L473 so we don't need to set `numpy<1.20;python_version<"3.7"`
|
@kaxil Should this import also be put inside a try block, if airflow/tests/utils/test_json.py Line 24 in 11e3453
|
We install that in Dockerfile so we have it while running tests |
Agree. The "test" environment has all the deps in - including pandas and numpy. The "test" dependency ([devel_all] extra) defines all things needed to run tests. I think while numpy is not needed as "airflow" dependency, it's perfectly fine to have it as "test airflow" dependency. |
@kaxil @potiuk Thanks for the feedback. The reason I ask is because of the discussion here: #18343 (comment). For the test environment there is no issue as you've mentioned, so I guess the concern there is about the local development environment? Of course, if you run a test locally and numpy is missing, you can just install it. |
We have PLENTY of tests that will fail when some extra is not installed. That's why we have https://github.com/apache/airflow/blob/main/BREEZE.rst with all deps installed in the docker image (which is 1-1 replica of CI test environment), and https://github.com/apache/airflow/blob/main/LOCAL_VIRTUALENV.rst#creating-a-local-virtualenv where we explain that you need to add "devel" extra if you want to install local virtualenv for development. So I think all is good here |
@potiuk Thanks. One clarification - from #18343 (comment), I got the impression that |
Depends what you mean by "core". There is no mentioning of "core" in the extras doc: From https://airflow.apache.org/docs/apache-airflow/stable/extra-packages-ref.html#bundle-extras
It's clear that |
Ah okay, that is a good point.
Sounds good. If I come up with something concrete, I'll submit a PR. Thanks for clarifying my questions! |
Missed removing ``numpy`` from `setup.cfg` in apache/airflow#17575. It was only added in setup.cfg in https://github.com/apache/airflow/pull/15209/files#diff-380c6a8ebbbce17d55d50ef17d3cf906 numpy already has `python_requires` metadata: https://github.com/numpy/numpy/blob/v1.20.3/setup.py#L473 so we don't need to set `numpy<1.20;python_version<"3.7"` GitOrigin-RevId: 6b327c14f773e265d9c54551d9ca273ac7350fdb
Missed removing ``numpy`` from `setup.cfg` in apache/airflow#17575. It was only added in setup.cfg in https://github.com/apache/airflow/pull/15209/files#diff-380c6a8ebbbce17d55d50ef17d3cf906 numpy already has `python_requires` metadata: https://github.com/numpy/numpy/blob/v1.20.3/setup.py#L473 so we don't need to set `numpy<1.20;python_version<"3.7"` GitOrigin-RevId: 6b327c14f773e265d9c54551d9ca273ac7350fdb
Missed removing ``numpy`` from `setup.cfg` in apache/airflow#17575. It was only added in setup.cfg in https://github.com/apache/airflow/pull/15209/files#diff-380c6a8ebbbce17d55d50ef17d3cf906 numpy already has `python_requires` metadata: https://github.com/numpy/numpy/blob/v1.20.3/setup.py#L473 so we don't need to set `numpy<1.20;python_version<"3.7"` GitOrigin-RevId: 6b327c14f773e265d9c54551d9ca273ac7350fdb
Missed removing ``numpy`` from `setup.cfg` in apache/airflow#17575. It was only added in setup.cfg in https://github.com/apache/airflow/pull/15209/files#diff-380c6a8ebbbce17d55d50ef17d3cf906 numpy already has `python_requires` metadata: https://github.com/numpy/numpy/blob/v1.20.3/setup.py#L473 so we don't need to set `numpy<1.20;python_version<"3.7"` GitOrigin-RevId: 6b327c14f773e265d9c54551d9ca273ac7350fdb
Missed removing ``numpy`` from `setup.cfg` in apache/airflow#17575. It was only added in setup.cfg in https://github.com/apache/airflow/pull/15209/files#diff-380c6a8ebbbce17d55d50ef17d3cf906 numpy already has `python_requires` metadata: https://github.com/numpy/numpy/blob/v1.20.3/setup.py#L473 so we don't need to set `numpy<1.20;python_version<"3.7"` GitOrigin-RevId: 6b327c14f773e265d9c54551d9ca273ac7350fdb
Missed removing ``numpy`` from `setup.cfg` in apache/airflow#17575. It was only added in setup.cfg in https://github.com/apache/airflow/pull/15209/files#diff-380c6a8ebbbce17d55d50ef17d3cf906 numpy already has `python_requires` metadata: https://github.com/numpy/numpy/blob/v1.20.3/setup.py#L473 so we don't need to set `numpy<1.20;python_version<"3.7"` GitOrigin-RevId: 6b327c14f773e265d9c54551d9ca273ac7350fdb
Missed removing ``numpy`` from `setup.cfg` in apache/airflow#17575. It was only added in setup.cfg in https://github.com/apache/airflow/pull/15209/files#diff-380c6a8ebbbce17d55d50ef17d3cf906 numpy already has `python_requires` metadata: https://github.com/numpy/numpy/blob/v1.20.3/setup.py#L473 so we don't need to set `numpy<1.20;python_version<"3.7"` GitOrigin-RevId: 6b327c14f773e265d9c54551d9ca273ac7350fdb
Missed removing ``numpy`` from `setup.cfg` in apache/airflow#17575. It was only added in setup.cfg in https://github.com/apache/airflow/pull/15209/files#diff-380c6a8ebbbce17d55d50ef17d3cf906 numpy already has `python_requires` metadata: https://github.com/numpy/numpy/blob/v1.20.3/setup.py#L473 so we don't need to set `numpy<1.20;python_version<"3.7"` GitOrigin-RevId: 6b327c14f773e265d9c54551d9ca273ac7350fdb
Missed removing ``numpy`` from `setup.cfg` in apache/airflow#17575. It was only added in setup.cfg in https://github.com/apache/airflow/pull/15209/files#diff-380c6a8ebbbce17d55d50ef17d3cf906 numpy already has `python_requires` metadata: https://github.com/numpy/numpy/blob/v1.20.3/setup.py#L473 so we don't need to set `numpy<1.20;python_version<"3.7"` GitOrigin-RevId: 6b327c14f773e265d9c54551d9ca273ac7350fdb
Missed removing ``numpy`` from `setup.cfg` in apache/airflow#17575. It was only added in setup.cfg in https://github.com/apache/airflow/pull/15209/files#diff-380c6a8ebbbce17d55d50ef17d3cf906 numpy already has `python_requires` metadata: https://github.com/numpy/numpy/blob/v1.20.3/setup.py#L473 so we don't need to set `numpy<1.20;python_version<"3.7"` GitOrigin-RevId: 6b327c14f773e265d9c54551d9ca273ac7350fdb
Missed removing ``numpy`` from `setup.cfg` in apache/airflow#17575. It was only added in setup.cfg in https://github.com/apache/airflow/pull/15209/files#diff-380c6a8ebbbce17d55d50ef17d3cf906 numpy already has `python_requires` metadata: https://github.com/numpy/numpy/blob/v1.20.3/setup.py#L473 so we don't need to set `numpy<1.20;python_version<"3.7"` GitOrigin-RevId: 6b327c14f773e265d9c54551d9ca273ac7350fdb
Missed removing ``numpy`` from `setup.cfg` in apache/airflow#17575. It was only added in setup.cfg in https://github.com/apache/airflow/pull/15209/files#diff-380c6a8ebbbce17d55d50ef17d3cf906 numpy already has `python_requires` metadata: https://github.com/numpy/numpy/blob/v1.20.3/setup.py#L473 so we don't need to set `numpy<1.20;python_version<"3.7"` GitOrigin-RevId: 6b327c14f773e265d9c54551d9ca273ac7350fdb
Missed removing ``numpy`` from `setup.cfg` in apache/airflow#17575. It was only added in setup.cfg in https://github.com/apache/airflow/pull/15209/files#diff-380c6a8ebbbce17d55d50ef17d3cf906 numpy already has `python_requires` metadata: https://github.com/numpy/numpy/blob/v1.20.3/setup.py#L473 so we don't need to set `numpy<1.20;python_version<"3.7"` GitOrigin-RevId: 6b327c14f773e265d9c54551d9ca273ac7350fdb
Missed removing ``numpy`` from `setup.cfg` in apache/airflow#17575. It was only added in setup.cfg in https://github.com/apache/airflow/pull/15209/files#diff-380c6a8ebbbce17d55d50ef17d3cf906 numpy already has `python_requires` metadata: https://github.com/numpy/numpy/blob/v1.20.3/setup.py#L473 so we don't need to set `numpy<1.20;python_version<"3.7"` GitOrigin-RevId: 6b327c14f773e265d9c54551d9ca273ac7350fdb
Missed removing ``numpy`` from `setup.cfg` in apache/airflow#17575. It was only added in setup.cfg in https://github.com/apache/airflow/pull/15209/files#diff-380c6a8ebbbce17d55d50ef17d3cf906 numpy already has `python_requires` metadata: https://github.com/numpy/numpy/blob/v1.20.3/setup.py#L473 so we don't need to set `numpy<1.20;python_version<"3.7"` GitOrigin-RevId: 6b327c14f773e265d9c54551d9ca273ac7350fdb
We only use
pandasinDbApiHook.get_pandas_df. Not all users use it, pluswhile
pandasnow supports many pre-compiled packages it still can take forever whereit needs to be compiled.
So for first-time users this can be a turn off. If pandas is already installed this
will work fine, but if not users have an option to run
pip install apache-airflow[pandas]Related #12500
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.