-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-39555: [Packaging][Python] Enable building pyarrow against numpy 2.0 #39557
GH-39555: [Packaging][Python] Enable building pyarrow against numpy 2.0 #39557
Conversation
@@ -97,4 +97,4 @@ SHELL ["/bin/bash", "-i", "-c"] | |||
ENTRYPOINT ["/bin/bash", "-i", "-c"] | |||
|
|||
COPY python/requirements-wheel-build.txt /arrow/python/ | |||
RUN pip install -r /arrow/python/requirements-wheel-build.txt | |||
RUN pip install -r /arrow/python/requirements-wheel-build.txt --pre --extra-index-url "https://pypi.anaconda.org/scientific-python-nightly-wheels/simple" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just added this for testing now, but so eventually this should only be done when this is being run as a nightly job (for creating the nightly wheels), and not on a release.
I am not sure how to detect that inside this dockerfile (passing some env variable?). Now, in practice that is maybe also not needed, because since we don't plan to backport this to the 15.x branch, this commit should only live on main and by the time of the next 16.x release cycle, we should be able to remove this again (by that time, numpy 2.0 should be released)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just use the same logic as in ci/scripts/install_pandas.sh
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we open an issue labeled as blocker for 16.0 to remove this again (to build with (by then) released numpy), we won't forget to update this, and for now I think it is fine to just always build with numpy nightly on the main
branch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Building with Numpy nightly doesn't seem like a tremendous idea, is it? It is not required functionally and may introduce instability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least should add a comment, and open an issue to remove this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment and opened a blocker issue -> #39848
|
@github-actions crossbow submit -g wheel -g python |
Revision: 185ffba Submitted crossbow builds: ursacomputing/crossbow @ actions-ad937328cf |
# Starting with NumPy 1.25, NumPy is (by default) as far back compatible | ||
# as oldest-support-numpy was (customizable with a NPY_TARGET_VERSION |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this always be the case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean exactly with "always"? For all Python versions, or all numpy versions? Or the full API?
It's a guarantee that is given by NumPy starting with version 1.25: https://numpy.org/devdocs/release/1.25.0-notes.html#compiling-against-the-numpy-c-api-is-now-backwards-compatible-by-default
(see also some details I posted on the parent issue: #39532 (comment))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean that oldest-supported-numpy also accounts for bugs that cropped up in the past, e.g. broken ARM compatibility.
https://github.com/scipy/oldest-supported-numpy/blob/main/setup.cfg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I understand, oldest-supported-numpy pins to a newer numpy version for some platforms / architectures (like arm) because the older version that is theoretically supported has bugs. But so that means that if we always use the latest available version for a platform, that should be fine?
Of course, there might be a new bug occurring in the latest version, and at that point, defaulting to "the latest available version" might temporarily give issues, compared to pinning to a slightly older but "known to be OK" version (for example, if we know 1.25 is fine, we could pin to that in case the newest 1.26.x would introduce a bug).
But that's a problem for all packages building against numpy, and so it would be good if numpy would have guidelines around this. And my understanding is that it is now recommended to built against the latest numpy version (although https://numpy.org/devdocs/dev/depending_on_numpy.html#build-time-dependency is not explicit about that, it just mentions that starting from 1.25, you can use that version to built packages that are also compatible with older numpy)
cc @seberg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds all exactly right.
I suppose the docs linked could recommend to compile against the latest available version (unless there be bugs). This is required for major releases, i.e. NumPy 2 (you cannot build releases against it yet since there is no rc
, though. I still need to opaquify the descriptor struct a bit, which you may notice).
To continue supporting building against pre 2.0 releases, you may have to vendor (numpy/_core/include/)numpy/npy_2_compat.h
.
Can the Numpy 2.0 build be a separate CI build? |
We already have a build where we test with numpy 2.0 (the pandas-nightly installs the nightly wheels for both pandas and numpy). Here we want wheels builds that build against numpy 2.0. While in theory I could duplicate all our wheels builds to have one version building against numpy 1.x and one against numpy 2.0.dev, that are a lot of builds that would be added, and it's sufficient to upload nightly wheels built with the latest numpy for downstream testing. (some other projects like pandas, scikit-learn and matplotlib also have switched to build their nightly wheels with nightly numpy, despite numpy 2.0.dev not yet being stable. That's only expected with the first RC targeted for February). |
It is functionally required, if we want that our nightly wheels can be used alongside other nightly wheels (of numpy, pandas, etc). And for dowstream packages testing against pyarrow nightly, they typically also test against the nightly version of other packages (for example, dask is currently not correctly testing with pyarrow nightly because the pyarrow import just fails because of the presence of numpy nightly) |
I'm not sure I understand: are build-time requirements tied to runtime requirements? Why would a PyArrow built with NumPy 1.25 not be usable with a Pandas built with NumPy 2.0? |
Because numpy 2.0 breaks the ABI, and if you want to be runtime compatible with numpy 2.0, you need to build with numpy 2.0 (see also my notes and links in the parent issue #39532)
Sorry, to be clear: it's only for the numpy dependency of course. PyArrow built with NumPy 1.25 is not usable with a NumPy 2.0. Whether pandas is built with numpy 1.x or 2.x indeed doesn't matter, it's just that in the example I was giving, the specific CI build is testing both pandas and pyarrow nightly, with numpy nightly in that environment, and pandas is being tested fine because its nightly wheels are built with numpy nightly, while pyarrow cannot be imported. |
Ok, I had no idea this was the case. Could you add this to the PR description, along with a link to https://numpy.org/devdocs/dev/depending_on_numpy.html#numpy-2-0-specific-advice ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the comments! Can you rebase and run Crossbow CI before merging?
@github-actions crossbow submit -g wheel -g python |
Revision: 88daada Submitted crossbow builds: ursacomputing/crossbow @ actions-aedf531b7e |
Hmm, looks like the manylinux Docker image build is broken for some unrelated reason? @kou
|
@github-actions crossbow submit wheel-manylinux* |
Revision: 66748cd Submitted crossbow builds: ursacomputing/crossbow @ actions-a7207ad732 |
The wheel builds are ok now, I think this can be merged. |
RHEL's |
It seems that our test uses numpy 1.26.3: https://github.com/ursacomputing/crossbow/actions/runs/7710469177/job/21013844540#step:9:122
Should we add a test with numpy 2? |
It's good that it tests with 1.26, that ensures that the wheel build with 2.0 actually runs on an older numpy. We already have a separate nightly integration build that runs with numpy nightly (the pandas-nightly build) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
I see.
@pitrou thanks for fixing the manylinux docker image! |
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit a1c1773. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…umpy 2.0 (apache#39557) ### Rationale for this change Ensure we can build pyarrow against numpy 2.0 nightly (update pyproject.toml to allow this), and test this by building our nightly wheels with numpy nightly. This also ensures that other projects that use our nightly wheels to test together with numpy nightly can do that (numpy 2.0 changes the ABI, so to run with numpy 2.0, your package needs to be built with numpy 2.x; currently pyarrow installed with our nightly wheel will fail to import when also numpy nightly is installed). See the parent issue apache#39532 for details, and https://numpy.org/devdocs/dev/depending_on_numpy.html#numpy-2-0-specific-advice for a direct link to the NumPy guidelines on updating build dependencies for NumPy 2.0. * Closes: apache#39555 Lead-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…umpy 2.0 (apache#39557) ### Rationale for this change Ensure we can build pyarrow against numpy 2.0 nightly (update pyproject.toml to allow this), and test this by building our nightly wheels with numpy nightly. This also ensures that other projects that use our nightly wheels to test together with numpy nightly can do that (numpy 2.0 changes the ABI, so to run with numpy 2.0, your package needs to be built with numpy 2.x; currently pyarrow installed with our nightly wheel will fail to import when also numpy nightly is installed). See the parent issue apache#39532 for details, and https://numpy.org/devdocs/dev/depending_on_numpy.html#numpy-2-0-specific-advice for a direct link to the NumPy guidelines on updating build dependencies for NumPy 2.0. * Closes: apache#39555 Lead-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…umpy 2.0 (apache#39557) ### Rationale for this change Ensure we can build pyarrow against numpy 2.0 nightly (update pyproject.toml to allow this), and test this by building our nightly wheels with numpy nightly. This also ensures that other projects that use our nightly wheels to test together with numpy nightly can do that (numpy 2.0 changes the ABI, so to run with numpy 2.0, your package needs to be built with numpy 2.x; currently pyarrow installed with our nightly wheel will fail to import when also numpy nightly is installed). See the parent issue apache#39532 for details, and https://numpy.org/devdocs/dev/depending_on_numpy.html#numpy-2-0-specific-advice for a direct link to the NumPy guidelines on updating build dependencies for NumPy 2.0. * Closes: apache#39555 Lead-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Rationale for this change
Ensure we can build pyarrow against numpy 2.0 nightly (update pyproject.toml to allow this), and test this by building our nightly wheels with numpy nightly. This also ensures that other projects that use our nightly wheels to test together with numpy nightly can do that (numpy 2.0 changes the ABI, so to run with numpy 2.0, your package needs to be built with numpy 2.x; currently pyarrow installed with our nightly wheel will fail to import when also numpy nightly is installed).
See the parent issue #39532 for details, and https://numpy.org/devdocs/dev/depending_on_numpy.html#numpy-2-0-specific-advice for a direct link to the NumPy guidelines on updating build dependencies for NumPy 2.0.