
Error when training #44

Open
Cybermb opened this issue Feb 11, 2023 · 2 comments
Cybermb commented Feb 11, 2023

I cannot manage to make this work on Windows.

Running the following command: `python -m vall_e.train yaml=config/test/ar.yml`

First, I was getting the error `RuntimeError: Distributed package doesn't have NCCL built in`.

It seems the NCCL backend of the PyTorch distributed package does not work on Windows.

I found a workaround using the gloo backend and added the following code to data.py:

```python
import os
import socket

import torch.distributed

def get_free_port():
    # Ask the OS for any free TCP port to use as the rendezvous port.
    sock = socket.socket()
    sock.bind(("", 0))
    return sock.getsockname()[1]

# Fake a single-process "distributed" run so the env:// rendezvous works.
os.environ["RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = str(get_free_port())
os.environ["LOCAL_RANK"] = "0"

# gloo is available on Windows, unlike NCCL.
torch.distributed.init_process_group(backend="gloo", rank=0, world_size=1)
```
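As an aside, a slightly more general variant of this workaround (just my own sketch, not code from this repo) would pick the backend at runtime, so the same change keeps working on Linux with NCCL:

```python
import os
import platform

import torch.distributed as dist

def init_single_process_group():
    # NCCL only ships with Linux builds of PyTorch; fall back to gloo elsewhere.
    backend = "nccl" if platform.system() == "Linux" and dist.is_nccl_available() else "gloo"
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")  # any free port works
    dist.init_process_group(backend=backend, rank=0, world_size=1)
```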

With that workaround in place, training then fails with:
`RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.`

This is where I hit a brick wall.
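For context on that second error: PyTorch raises it whenever a leaf tensor with `requires_grad=True` is modified in place outside of `torch.no_grad()`. The snippet below is only a minimal reproduction of the same message (not code from this project), with the usual fix shown:

```python
import torch

w = torch.zeros(3, requires_grad=True)  # leaf tensor that requires grad

# w.normal_()  # would raise: "a leaf Variable that requires grad
#              # is being used in an in-place operation."

with torch.no_grad():  # typical fix: do the in-place init under no_grad
    w.normal_()
```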

Platform: Windows 11
Python: 3.10.9
torch: 1.11.0+cu113


alimaslax commented Feb 19, 2023

The easiest route would be to use the WSL Ubuntu distro on Windows. You can install it from PowerShell:
`wsl --install`
Then install the NVIDIA drivers inside WSL. Another route would be to install Docker and run a Dockerfile containing all the drivers. There is an open PR that contains a Jupyter image that does all of that. Here is another one that uses NVIDIA's CUDA Ubuntu image, which supports WSL:

```dockerfile
FROM nvidia/cuda:11.8.0-base-ubuntu22.04

USER root
ENV TORCH_HOME=/data/models

RUN apt update && apt install -y --no-install-recommends \
    build-essential \
    ffmpeg \
    git \
    python3 \
    python3-dev \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*

RUN python3 -m pip install git+https://github.com/enhuiz/vall-e.git

VOLUME /data/models
VOLUME /data/shared

ENTRYPOINT ["/bin/bash", "--login", "-c"]
```
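Assuming the NVIDIA Container Toolkit is set up on the host, building and running it would look roughly like this (the image tag and host paths are placeholders, not anything defined by the repo):

```bash
# Build the image from the Dockerfile above (tag name is arbitrary).
docker build -t vall-e .

# Run with GPU access; the ENTRYPOINT passes the quoted command to bash -c.
# Assumes the yaml config is available inside the container, e.g. via a mounted volume.
docker run --rm --gpus all \
  -v /path/on/host/models:/data/models \
  -v /path/on/host/shared:/data/shared \
  vall-e "python3 -m vall_e.train yaml=config/test/ar.yml"
```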

@anuraagdjain

> The easiest route would be to use the WSL Ubuntu distro on Windows. […]

Has anyone tried this on an M1 Mac? I had a similar error to the one described by the OP. I will try to run this Docker container later this week or next and update the results here.
