This repository was archived by the owner on Nov 1, 2024. It is now read-only.

Unable to build torchdistx for PT 2.0 #73

Open
@Vatshank

Hi!

Describe the bug:
I am trying to build torchdistx from source, following the instructions in the README. I am running:

pip install --upgrade -r requirements.txt -r use-cpu.txt

cmake -DTORCHDIST_INSTALL_STANDALONE=ON -B build
cmake --build build # <- This errors out

When running cmake --build build, I see the following error:

[ 12%] Building CXX object src/cc/torchdistx/CMakeFiles/torchdistx.dir/deferred_init.cc.o
[ 25%] Building CXX object src/cc/torchdistx/CMakeFiles/torchdistx.dir/fake.cc.o
[ 37%] Building CXX object src/cc/torchdistx/CMakeFiles/torchdistx.dir/stack_utils.cc.o
[ 50%] Linking CXX shared library libtorchdistx.so
[ 50%] Built target torchdistx
[ 62%] Building CXX object src/python/torchdistx/_C/CMakeFiles/torchdistx-py.dir/deferred_init.cc.o
/home/ubuntu/repos/torchdistx/src/python/torchdistx/_C/deferred_init.cc:24:14: error: ‘torch::TypeError’ has not been declared
 using torch::TypeError;
              ^~~~~~~~~
/home/ubuntu/repos/torchdistx/src/python/torchdistx/_C/deferred_init.cc: In function ‘pybind11::object torchdistx::python::{anonymous}::materializeVariable(const pybind11::object&)’:
/home/ubuntu/repos/torchdistx/src/python/torchdistx/_C/deferred_init.cc:64:11: error: ‘TypeError’ was not declared in this scope
     throw TypeError{"`var` has to be a `Variable`, but got `%s`.", Py_TYPE(naked_var)->tp_name};
           ^~~~~~~~~
/home/ubuntu/repos/torchdistx/src/python/torchdistx/_C/deferred_init.cc:64:11: note: suggested alternatives:
In file included from /opt/conda/envs/alpa/lib/python3.9/site-packages/torch/include/c10/core/Device.h:5:0,
                 from /opt/conda/envs/alpa/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:11,
                 from /opt/conda/envs/alpa/lib/python3.9/site-packages/torch/include/ATen/core/Tensor.h:3,
                 from /opt/conda/envs/alpa/lib/python3.9/site-packages/torch/include/ATen/Tensor.h:3,
                 from /home/ubuntu/repos/torchdistx/src/python/torchdistx/_C/deferred_init.cc:9:
/opt/conda/envs/alpa/lib/python3.9/site-packages/torch/include/c10/util/Exception.h:246:15: note:   ‘c10::TypeError’
 class C10_API TypeError : public Error {
               ^~~~~~~~~
/opt/conda/envs/alpa/lib/python3.9/site-packages/torch/include/c10/util/Exception.h:246:15: note:   ‘c10::TypeError’
/opt/conda/envs/alpa/lib/python3.9/site-packages/torch/include/c10/util/Exception.h:246:15: note:   ‘c10::TypeError’
make[2]: *** [src/python/torchdistx/_C/CMakeFiles/torchdistx-py.dir/build.make:76: src/python/torchdistx/_C/CMakeFiles/torchdistx-py.dir/deferred_init.cc.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:914: src/python/torchdistx/_C/CMakeFiles/torchdistx-py.dir/all] Error 2
make: *** [Makefile:146: all] Error 2

Please let me know if I am doing something wrong here, or if torchdistx is not meant to support newer versions of PyTorch. If it is not, is there another way to use the deferred_init or fake_tensor APIs in PyTorch?

Describe how to reproduce:

pip install --upgrade -r requirements.txt -r use-cpu.txt

cmake -DTORCHDIST_INSTALL_STANDALONE=ON -B build
cmake --build build # <- This errors out

Environment:

  • OS: Ubuntu 20.04
  • main branch of torchdistx

Additional context:
The build works with PT 1.12 and PT 1.13 but not with PT 2.0. I am trying to get Alpa, which uses torchdistx, to work with PT 2.0. Right now, Alpa works with PT 1.12 and PT 1.13 (with a minor change) but not with PT 2.0.
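
Until the build is fixed, a pre-build guard could fail fast instead of stopping partway through compilation. The following is only a minimal sketch. It assumes that the two versions reported above as working (1.12 and 1.13) are the only known-good ones, and the names KNOWN_GOOD, major_minor and builds_per_report are hypothetical helpers, not part of torchdistx or PyTorch.

```python
import re
from typing import Tuple

# Assumption: only the PyTorch versions this report says build successfully.
KNOWN_GOOD = {(1, 12), (1, 13)}


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '1.13.1+cu117'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def builds_per_report(version: str) -> bool:
    """Return True if the version is one this report describes as building."""
    return major_minor(version) in KNOWN_GOOD


if __name__ == "__main__":
    # Example version strings; local-version suffixes like '+cpu' are ignored.
    for candidate in ("1.12.1", "1.13.1+cu117", "2.0.0+cpu"):
        print(candidate, builds_per_report(candidate))
```

In a real environment the string passed in would typically be torch.__version__, checked before invoking cmake.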

Labels: bug (Something isn't working)
