Enable the use of libcufile #293

Merged: 21 commits merged into conda-forge:main on Nov 28, 2024

Conversation

@mgorny (Contributor) commented Nov 22, 2024

Checklist

  • Used a personal fork of the feedstock to propose changes
  • Bumped the build number (if the version is unchanged)
  • Reset the build number to 0 (if the version changed)
  • Re-rendered with the latest conda-smithy (Use the phrase @conda-forge-admin, please rerender in a comment in this PR for automated rerendering)
  • Ensured the license file is being packaged.

Fixes #257

@conda-forge-admin (Contributor) commented Nov 22, 2024

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found it was in an excellent condition.

I do have some suggestions for making it better though...

For recipe/meta.yaml:

  • ℹ️ The recipe is not parsable by parser conda-souschef (grayskull). Your recipe may not receive automatic updates and/or may not be compatible with conda-forge's infrastructure. Please check the logs for more information and ensure your recipe can be parsed.
  • ℹ️ The recipe is not parsable by parser conda-recipe-manager. Your recipe may not receive automatic updates and/or may not be compatible with conda-forge's infrastructure. Please check the logs for more information and ensure your recipe can be parsed.

This message was generated by GitHub Actions workflow run https://github.com/conda-forge/conda-forge-webservices/actions/runs/12050123841. Examine the logs at this URL for more detail.

@jakirkham (Member)

Thanks Michał! 🙏

Had a suggestion above. Also happy to help with reviewing here

@mgorny (Contributor, Author) commented Nov 23, 2024

Updated. That said, I've no clue why smithy suddenly doubled the number of configurations (but it happens without my changes too).

@h-vetinari (Member)

That said, I've no clue why smithy suddenly doubled the number of configurations

We made a pretty major change to how the docker images and the stdlib interact. This means that the work-arounds in the CBC (conda_build_config.yaml) are now likely causing extra configurations to appear.

@h-vetinari (Member) left a comment

I cleaned up the now-obsolete config, picked your commit d912eab (cf. the discussion in #177), and leveraged the switch to CUDA 12.6, so this should now be substantially simplified. :)

@h-vetinari (Member) commented Nov 23, 2024

@hmaarrfk, I'm not sure that 65ffa43 was supposed to end up being merged? As in: why are we not using the GHA server anymore?

@h-vetinari (Member)

@conda-forge/pytorch-cpu, after a bunch of restarts this now finally got through conda-forge/conda-smithy#2163 - the rest works without issue (debugging that continues in #294). The last commit is not strictly necessary (that was a false lead), but I didn't want to restart the CI yet again.

Note that it drops CUDA 11.8 as discussed in #177. Finally, I'd also be happy to join as a maintainer here - I've been in the "shadows" here long enough I think. ;-)

@jslee02 commented Nov 25, 2024

@hmaarrfk, I'm not sure that 65ffa43 was supposed to end up being merged? As in: why are we not using the GHA server anymore?

The goal might have been to build Linux packages locally to conserve CI resources temporarily. Since this doesn't seem intended as a long-term solution, re-enabling CI makes sense to me.

I tested this PR locally, and it appears that aarch64 with CUDA 12.6 fails:

+ pip check
No broken requirements found.
+ python -c 'import torch; torch.tensor(1).to('\''cpu'\'').numpy(); print('\''numpy support enabled!!!'\'')'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/conda/feedstock_root/build_artifacts/libtorch_1732480330557/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho/lib/python3.12/site-packages/torch/__init__.py", line 367, in <module>
    from torch._C import *  # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^
ImportError: /lib64/libm.so.6: version `GLIBC_2.27' not found (required by /home/conda/feedstock_root/build_artifacts/libtorch_1732480330557/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho/lib/python3.12/site-packages/torch/../../../././libcufile.so.0)
WARNING: Tests failed for pytorch-2.5.1-cuda126_py312h8a24fa9_204.conda - moving package to /home/conda/feedstock_root/build_artifacts/broken
Traceback (most recent call last):
  File "/opt/conda/lib/python3.12/site-packages/conda_build/build.py", line 3483, in test
    utils.check_call_env(
  File "/opt/conda/lib/python3.12/site-packages/conda_build/utils.py", line 404, in check_call_env
    return _func_defaulting_env_to_os_environ("call", *popenargs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/conda_build/utils.py", line 380, in _func_defaulting_env_to_os_environ
    raise subprocess.CalledProcessError(proc.returncode, _args)
subprocess.CalledProcessError: Command '['/bin/bash', '-o', 'errexit', '/home/conda/feedstock_root/build_artifacts/libtorch_1732480330557/test_tmp/conda_test_runner.sh']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/bin/conda-build", line 11, in <module>
    sys.exit(execute())
             ^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/conda_build/cli/main_build.py", line 589, in execute
    api.build(
  File "/opt/conda/lib/python3.12/site-packages/conda_build/api.py", line 209, in build
    return build_tree(
           ^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/conda_build/build.py", line 3670, in build_tree
    test(pkg, config=metadata.config.copy(), stats=stats)
  File "/opt/conda/lib/python3.12/site-packages/conda_build/build.py", line 3497, in test
    tests_failed(
  File "/opt/conda/lib/python3.12/site-packages/conda_build/build.py", line 3544, in tests_failed
    raise CondaBuildUserError("TESTS FAILED: " + os.path.basename(pkg))
conda_build.exceptions.CondaBuildUserError: TESTS FAILED: pytorch-2.5.1-cuda126_py312h8a24fa9_204.conda
valid configs are {'osx_arm64_numpy2.0python3.9.____cpython', 'osx_64_blas_implgenericnumpy2.0python3.9.____cpython', 'osx_arm64_numpy2.0python3.12.____cpython', 'osx_arm64_numpy2.0python3.10.____cpython', 'osx_arm64_numpy2python3.13.____cp313', 'osx_64_blas_implmklnumpy2.0python3.9.____cpython', 'osx_64_blas_implmklnumpy2.0python3.10.____cpython', 'osx_64_blas_implmklnumpy2python3.13.____cp313', 'linux_aarch64_c_compiler_version13cuda_compilerNonecuda_compiler_versionNonecxx_compiler_version13', 'linux_64_blas_implmklc_compiler_version13cuda_compilerNonecuda_compiler_versionNonecxx_compiler_version13', 'osx_64_blas_implmklnumpy2.0python3.11.____cpython', 'linux_64_blas_implgenericc_compiler_version12cuda_compilercuda-nvcccuda_compiler_version12.6cxx_compiler_version12', 'osx_64_blas_implgenericnumpy2.0python3.10.____cpython', 'osx_64_blas_implgenericnumpy2python3.13.____cp313', 'osx_64_blas_implmklnumpy2.0python3.12.____cpython', 'osx_64_blas_implgenericnumpy2.0python3.11.____cpython', 'osx_64_blas_implgenericnumpy2.0python3.12.____cpython', 'linux_64_blas_implmklc_compiler_version12cuda_compilercuda-nvcccuda_compiler_version12.6cxx_compiler_version12', 'linux_aarch64_c_compiler_version12cuda_compilercuda-nvcccuda_compiler_version12.6cxx_compiler_version12', 'osx_arm64_numpy2.0python3.11.____cpython', 'linux_64_blas_implgenericc_compiler_version13cuda_compilerNonecuda_compiler_versionNonecxx_compiler_version13'}
Using linux_aarch64_c_compiler_version12cuda_compilercuda-nvcccuda_compiler_version12.6cxx_compiler_version12 configuration

@h-vetinari (Member)

ImportError: /lib64/libm.so.6: version `GLIBC_2.27' not found

This is not a failure per se; it just means you're on a too-old system (you need at least glibc 2.28). My understanding had been that only the CUDA binaries themselves (and not necessarily pytorch) required the newer glibc, but it'd be very easy to add

c_stdlib_version:  # [aarch64]
  - "2.28"         # [aarch64]

to enforce that constraint.

@h-vetinari (Member)

I purposefully didn't start the CIRUN but here is a demo of me failing to cancel jobs and

Job cancellation on azure and on github are two completely different things...? If you want to be able to cancel jobs on azure, you'll need to ask Matt to add you to the admins of the conda-forge org (on the AzurePipelines-side).

In contrast, cancelling and restarting jobs in the github actions for cirun should work as soon as you have the rights to cirun.

[two screenshots attached]

@hmaarrfk (Contributor)

thanks!

@h-vetinari (Member)

Under normal circumstances this shouldn't be possible - the job is running inside an alma8 image with glibc 2.28. It might be that libcufile has been compiled without picking up the right sysroot.

OK, I think I found the solution. The test environment contains an old sysroot (presumably as a transitive dependency), which is probably triggering this.

So the immediate solution is to add

run_constrained:
  - sysroot_{{ target_platform }} >={{ c_stdlib_version }}

to libcufile. The large-scale solution would be conda-forge/linux-sysroot-feedstock#63.
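To make that concrete, here is a minimal sketch (an editor's illustration, not the actual libcufile recipe) of how such a constraint could sit in the libcufile feedstock's meta.yaml, assuming the usual conda-forge Jinja variables are in scope there:

requirements:
  run_constrained:
    # keep environments that install libcufile from resolving a sysroot older
    # than the glibc baseline the library was built against
    - sysroot_{{ target_platform }} >={{ c_stdlib_version }}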

@hmaarrfk (Contributor) commented Nov 27, 2024

OK, let's try an even shorter-term solution! See the latest commit!

@mgorny (Contributor, Author) commented Nov 27, 2024

If I have more changes, should I add them here or start another PR to let this one finish building (presumably, after you restart the builds)?

@hmaarrfk (Contributor)

So the CI failures are quite annoying, as h-vetinari alluded to, and they make it hard to merge into main.

  • If you are experimenting, probably best to leave this working.
  • If you know they will work, you can add them here.

Babysitting the builds once they get merged (or triggering CFEP-03) is tiring, so "fewer merges" is good while the CIs are flaky.

# The medium-term solution is to add such a constraint to libcufile.
# The long-term solution is to add such a constraint to all packages
# that depend on a specific sysroot at build time.
- sysroot_{{ target_platform }} >={{ c_stdlib_version }}
Review comment from a Member:

If you're doing this here rather than in libcufile, you either need to write >=2.28 explicitly, or set

c_stdlib_version:  # [aarch64]
  - "2.28"         # [aarch64]

and rerender
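For comparison, a sketch of the first option (hard-coding the glibc baseline instead of relying on a rerendered c_stdlib_version). The placement under requirements/run_constrained and the aarch64 selector are assumptions that mirror the snippets above:

requirements:
  run_constrained:
    # explicit glibc-2.28 baseline; aarch64 is where the too-old sysroot was observed
    - sysroot_{{ target_platform }} >=2.28  # [aarch64]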

Reply from a Contributor:

thanks!

@hmaarrfk (Contributor)

@conda-forge-admin please rerender

@conda-forge-admin (Contributor) commented Nov 27, 2024

Hi! This is the friendly automated conda-forge-linting service.

I wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found some lint.

Here's what I've got...

For recipe/meta.yaml:

  • ❌ You are setting c_stdlib_version below the current global baseline in conda-forge (10.13). If this is your intention, you also need to override MACOSX_DEPLOYMENT_TARGET (with the same value) locally.

For recipe/meta.yaml:

  • ℹ️ The recipe is not parsable by parser conda-souschef (grayskull). Your recipe may not receive automatic updates and/or may not be compatible with conda-forge's infrastructure. Please check the logs for more information and ensure your recipe can be parsed.
  • ℹ️ The recipe is not parsable by parser conda-recipe-manager. Your recipe may not receive automatic updates and/or may not be compatible with conda-forge's infrastructure. Please check the logs for more information and ensure your recipe can be parsed.

This message was generated by GitHub Actions workflow run https://github.com/conda-forge/conda-forge-webservices/actions/runs/12055186149. Examine the logs at this URL for more detail.

@mgorny (Contributor, Author) commented Nov 27, 2024

If you are experimenting, probably best to leave this working.

They're touching cross-compilation, so it's perhaps best to leave them out for now and make another PR later. Maybe I'll have more changes by then.

@hmaarrfk (Contributor)

Sounds good. Well then, we can see if restarting these jobs enough will get them to be green.

@h-vetinari (Member) commented Nov 27, 2024

Sounds good. Well then, we can see if restarting these jobs enough will get them to be green.

Apparently the issues should be fixed now (thanks @aktech! 🙏) -- or at least mitigated. I'll push a fix for the linter error that will also test the hypothesis that turning off the affected GPU restores stability.

@conda-forge-admin (Contributor)

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found it was in an excellent condition.

I do have some suggestions for making it better though...

For recipe/meta.yaml:

  • ℹ️ The recipe is not parsable by parser conda-souschef (grayskull). Your recipe may not receive automatic updates and/or may not be compatible with conda-forge's infrastructure. Please check the logs for more information and ensure your recipe can be parsed.
  • ℹ️ The recipe is not parsable by parser conda-recipe-manager. Your recipe may not receive automatic updates and/or may not be compatible with conda-forge's infrastructure. Please check the logs for more information and ensure your recipe can be parsed.

This message was generated by GitHub Actions workflow run https://github.com/conda-forge/conda-forge-webservices/actions/runs/12059425335. Examine the logs at this URL for more detail.

@h-vetinari (Member) commented Nov 28, 2024

All 6 CI runs on the opengpu server now started successfully, so I'm going to claim that the fix was sufficient. Thanks again @aktech! :)

@mgorny (Contributor, Author) commented Nov 28, 2024

And yay, it's all green!

@h-vetinari (Member) left a comment

👏

@hmaarrfk merged commit 78ea0ce into conda-forge:main on Nov 28, 2024 (25 checks passed)
@hmaarrfk (Contributor)

Nice to see some parallel builds in these final stages!

@mgorny deleted the cufile branch on November 28, 2024 at 19:39
facebook-github-bot pushed a commit to facebookresearch/momentum that referenced this pull request Dec 3, 2024
Summary:
This PR aims to reduce maintenance effort now that the latest `conda-forge::pytorch` package [has dropped the CUDA 11.8 support](conda-forge/pytorch-cpu-feedstock#293 (comment)), and we no longer have any use cases for CUDA 11.8.

Additionally, this removes the unnecessary use of the 'pytorch' and 'nvidia' channels to reduce the number of conda channels used.

## Checklist:

- [x] Adheres to the [style guidelines](https://facebookincubator.github.io/momentum/docs/developer_guide/style_guide)
- [x] Codebase formatted by running `pixi run lint`

Pull Request resolved: #148

Test Plan: CI

Reviewed By: nickyhe-gemini

Differential Revision: D66679272

Pulled By: jeongseok-meta

fbshipit-source-id: a3fb7ef53e2b62a501601d871c76b99315b3ae96
Successfully merging this pull request may close these issues: (Future release) building with GDS

7 participants