Invalid version identifier in filenames of nightly builds #7697

Closed · fellhorn opened this issue Jul 16, 2024 · 8 comments · Fixed by #7722 or #7891
@fellhorn

🐛 Bug

pip 24.1 dropped support for legacy (non-PEP 440) version identifiers and no longer allows installing the current nightly wheels directly. Other Python package managers, e.g. uv, never supported these identifiers and have always required renaming the wheel.

Additionally, the version identifier inside the wheel's metadata differs from the one in the filename.
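As an illustration of the mismatch, you can compare the filename against the Version field inside the wheel itself. A minimal sketch (the Version value shown is taken from the successful install logs below; treat it as illustrative):

> wget -q https://storage.googleapis.com/pytorch-xla-releases/wheels/cuda/12.1/torch_xla-nightly-cp311-cp311-linux_x86_64.whl
> unzip -p torch_xla-nightly-cp311-cp311-linux_x86_64.whl '*/METADATA' | grep '^Version:'
Version: 2.5.0+git41d998d

The filename claims the version is nightly, while the metadata carries 2.5.0+git41d998d.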

To Reproduce

Steps to reproduce the behavior:

uv

> uv pip install torch_xla@https://storage.googleapis.com/pytorch-xla-releases/wheels/cuda/12.1/torch_xla-nightly-cp311-cp311-linux_x86_64.whl
error: The wheel filename "torch_xla-nightly-cp311-cp311-linux_x86_64.whl" has an invalid version part: expected version to start with a number, but no leading ASCII digits were found

pip

Broken:

> pip install pip==24.1.2
...
> pip install https://storage.googleapis.com/pytorch-xla-releases/wheels/cuda/12.1/torch_xla-nightly-cp311-cp311-linux_x86_64.whl
ERROR: Invalid requirement: 'torch-xla==nightly': Expected end or semicolon (after name and no valid version specifier)
    torch-xla==nightly
             ^

For others who find this issue and need a workaround:

🟢 Works with torch_xla@ format

> pip install pip==24.1.2
...
> pip install torch_xla@https://storage.googleapis.com/pytorch-xla-releases/wheels/cuda/12.1/torch_xla-nightly-cp311-cp311-linux_x86_64.whl
...
Installing collected packages: torch_xla
Successfully installed torch_xla-2.5.0+git41d998d

🟢 Works with older pip versions

> pip install "pip<24"
...
> pip install https://storage.googleapis.com/pytorch-xla-releases/wheels/cuda/12.1/torch_xla-nightly-cp311-cp311-linux_x86_64.whl
...
Installing collected packages: torch_xla
Successfully installed torch_xla-2.5.0+git41d998d

Expected behavior

I would expect the version identifier in the filename to match the one in the wheel metadata and to be a valid PEP 440 identifier. This would allow installation with uv and modern pip versions.

Potential solutions

Ideas (a quick validity check follows the list):

  • Use the last release plus a date identifier, like torch_xla-2.5.0+nightly20240716-cp311-cp311-linux_x86_64.whl
  • Keep the current git-hash versions, e.g. torch_xla-2.5.0+git41d998d-cp311-cp311-linux_x86_64.whl, with a static URL redirecting to them
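Both ideas use a PEP 440 local version segment; it is the bare nightly tag that fails to parse. A quick validity check (a sketch that only assumes the packaging library, which pip itself vendors):

> python - <<'EOF'
from packaging.version import Version, InvalidVersion

# Bare "nightly" is not PEP 440; the two proposed forms are.
for candidate in ("nightly", "2.5.0+nightly20240716", "2.5.0+git41d998d"):
    try:
        Version(candidate)
        print(candidate, "-> valid")
    except InvalidVersion:
        print(candidate, "-> invalid")
EOF
nightly -> invalid
2.5.0+nightly20240716 -> valid
2.5.0+git41d998d -> valid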
@JackCaoG
Collaborator

@will-cromar FYI, @wonjoolee95 too, since you are fixing a similar issue for our GPU wheels.

@wonjoolee95
Collaborator

This is helpful, thanks for the info! I'm able to reproduce:

# Fails
wonjoo@t1v-n-b72eb559-w-0:~$ pip install https://storage.googleapis.com/pytorch-xla-releases/wheels/cuda/12.1/torch_xla-nightly-cp311-cp311-linux_x86_64.whl
Defaulting to user installation because normal site-packages is not writeable
ERROR: Invalid requirement: 'torch-xla==nightly': Expected end or semicolon (after name and no valid version specifier)
    torch-xla==nightly
             ^
# Works
wonjoo@t1v-n-b72eb559-w-0:~$ pip install "pip<24"
Defaulting to user installation because normal site-packages is not writeable
Collecting pip<24
  Downloading pip-23.3.2-py3-none-any.whl.metadata (3.5 kB)
Downloading pip-23.3.2-py3-none-any.whl (2.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 34.0 MB/s eta 0:00:00
WARNING: Error parsing dependencies of distro-info: Invalid version: '1.1build1'
WARNING: Error parsing dependencies of python-debian: Invalid version: '0.1.43ubuntu1'
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
  WARNING: The scripts pip, pip3 and pip3.10 are installed in '/home/wonjoo/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed pip-23.3.2

I think it's better if we do pip install "pip<24" to fix our GPU wheels ASAP, and then come up with a longer-term solution. @will-cromar, do you know where the correct place would be to put this pip install "pip<24" command in our /infra files?

@will-cromar
Collaborator

Is this issue actually what's causing our build breakage? Why are the TPU builds passing but not the GPU builds? The most recent failure I see there is this:

Step #2 - "build_xla_docker_image":     ERROR: An error occurred during the fetch of repository 'go_sdk':
Step #2 - "build_xla_docker_image":        Traceback (most recent call last):
Step #2 - "build_xla_docker_image":             File "/root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/io_bazel_rules_go/go/private/sdk.bzl", line 101, column 16, in _go_download_sdk_impl
Step #2 - "build_xla_docker_image":                     _remote_sdk(ctx, [url.format(filename) for url in ctx.attr.urls], ctx.attr.strip_prefix, sha256)
Step #2 - "build_xla_docker_image":             File "/root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/io_bazel_rules_go/go/private/sdk.bzl", line 209, column 21, in _remote_sdk
Step #2 - "build_xla_docker_image":                     ctx.download(
Step #2 - "build_xla_docker_image":     Error in download: java.io.IOException: Error downloading [https://dl.google.com/go/go1.18.4.linux-amd64.tar.gz] to /root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/go_sdk/go_sdk.tar.gz: Bytes read 127925296 but wanted 141812725
Step #2 - "build_xla_docker_image":     ERROR: /src/pytorch/xla/WORKSPACE:136:15: fetching _go_download_sdk rule //external:go_sdk: Traceback (most recent call last):
Step #2 - "build_xla_docker_image":             File "/root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/io_bazel_rules_go/go/private/sdk.bzl", line 101, column 16, in _go_download_sdk_impl
Step #2 - "build_xla_docker_image":                     _remote_sdk(ctx, [url.format(filename) for url in ctx.attr.urls], ctx.attr.strip_prefix, sha256)
Step #2 - "build_xla_docker_image":             File "/root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/io_bazel_rules_go/go/private/sdk.bzl", line 209, column 21, in _remote_sdk
Step #2 - "build_xla_docker_image":                     ctx.download(
Step #2 - "build_xla_docker_image":     Error in download: java.io.IOException: Error downloading [https://dl.google.com/go/go1.18.4.linux-amd64.tar.gz] to /root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/go_sdk/go_sdk.tar.gz: Bytes read 127925296 but wanted 141812725
Step #2 - "build_xla_docker_image":     ERROR: Analysis of target '//:_XLAC.so' failed; build aborted: java.io.IOException: Error downloading [https://dl.google.com/go/go1.18.4.linux-amd64.tar.gz] to /root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/go_sdk/go_sdk.tar.gz: Bytes read 127925296 but wanted 141812725

Even if we can hack our build, this is a client issue. Nobody who updated their pip recently would be able to install our wheels, because the rename we're doing is no longer actually valid.

The build version we set is defined by some combination of these environment variables: https://github.com/pytorch/xla/blob/master/infra/ansible/config/env.yaml

I think TORCH_XLA_VERSION and GIT_VERSIONED_XLA_BUILD are the important ones, but you'll have to review setup.py to see exactly how we set the version. That version name, e.g. torch_xla-2.5.0+git41d998d, is probably still valid. The problem is that we rename the wheels with the nightly date here:

- name: Rename and append +YYYYMMDD suffix to nightly wheels
  ansible.builtin.shell: |
    pushd /tmp/staging-wheels
    cp {{ item.dir }}/*.whl .
    rename -v "s/^{{ item.prefix }}-(.*?)-cp/{{ item.prefix }}-nightly-cp/" *.whl
    mv /tmp/staging-wheels/* /dist/
    popd

    rename -v "s/^{{ item.prefix }}-(.*?)-cp/{{ item.prefix }}-nightly+$(date -u +%Y%m%d)-cp/" *.whl
  args:
    executable: /bin/bash
    chdir: "{{ item.dir }}"
  loop:
    - { dir: "{{ (src_root, 'pytorch/dist') | path_join }}", prefix: "torch" }
    - { dir: "{{ (src_root, 'pytorch/xla/dist') | path_join }}", prefix: "torch_xla" }
  when: nightly_release

We need to at least change that rename to one of the valid patterns @fellhorn suggested, or copy the pattern used by torch (e.g. torch-X.Y.Z.devYYYYMMDD).
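If we copy the torch pattern, the second rename could strip the +git... local part and append a dated .dev suffix instead. A minimal sketch (the exact regex and the \$1 escaping inside the Ansible shell template are assumptions, not the final fix):

rename -v "s/^{{ item.prefix }}-(.*?)\+.*?-cp/{{ item.prefix }}-\$1.dev$(date -u +%Y%m%d)-cp/" *.whl

This would turn torch_xla-2.5.0+git41d998d-cp311-cp311-linux_x86_64.whl into torch_xla-2.5.0.dev20240716-cp311-cp311-linux_x86_64.whl, matching torch's nightly naming.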

You can dry run the ansible workflow with a command like this one:

ansible-playbook playbook.yaml -vvv -e "stage=build arch=amd64 accelerator=tpu src_root=${GITHUB_WORKSPACE} bundle_libtpu=0 build_cpp_tests=1 git_versioned_xla_build=1 cache_suffix=-ci" --skip-tags=fetch_srcs,install_deps

Anything that gets written to /dist is what we will upload to GCS.

@JackCaoG
Collaborator

@zpcore, can you make the rename change that @will-cromar mentioned above, since you are on call this week? It should just be a one-line change, but then we need to update the README to reflect the new format.

@mfatih7
Copy link

mfatih7 commented Jul 25, 2024

Hello all

As a general comment:

When users find errors in pytorch-xla, the developers fix them in nightly releases and ask the users to test them. But setting up an environment with compatible torch_xla, torch, and torchvision is not straightforward, as described here.

This issue is one example of that. I hope you can provide an easier way for users to test nightly updates.

@zpcore
Collaborator

zpcore commented Jul 26, 2024

> This issue is one example of that. I hope you can provide an easier way for users to test nightly updates.

Thanks for the feedback! I think we are missing example commands for installing compatible torch, torchvision/torchaudio, and torch_xla builds for CUDA. We will update the documentation. For now, you can use, e.g.:

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121
pip3 install https://storage.googleapis.com/pytorch-xla-releases/wheels/cuda/12.1/torch_xla-nightly-cp310-cp310-linux_x86_64.whl

In general, this should be compatible.
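As a quick sanity check that the pair resolved consistently, you can print both versions (a minimal sketch; it assumes both packages import cleanly in your environment and expose __version__, which recent releases do):

> python -c "import torch, torch_xla; print(torch.__version__, torch_xla.__version__)"

The two printed versions should share the same base release (e.g. both 2.5.0-based nightlies).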

@mfatih7

mfatih7 commented Aug 9, 2024

@zpcore

Thank you for the answer

I was only able to look into your answer now. The commands you provided are for CUDA, whereas I was trying to set up an environment with nightly releases of torch, torchvision/torchaudio, and torch_xla on a TPU VM.

@mfatih7

mfatih7 commented Aug 9, 2024

I think the updated commands on the main page

pip3 install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cpu
pip install 'torch_xla[tpu] @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-nightly-cp310-cp310-linux_x86_64.whl' -f https://storage.googleapis.com/libtpu-releases/index.html

are OK. I now get the output below from pip list:

...
tomli                        2.0.1
torch                        2.5.0.dev20240809+cpu
torch-xla                    2.5.0+git9fbd64a
torchmetrics                 1.4.1
torchsummary                 1.5.1
torchvision                  0.20.0.dev20240809+cpu
traitlets                    5.14.3
...
