
Error when training #44

Open
Cybermb opened this issue Feb 11, 2023 · 2 comments
Cybermb commented Feb 11, 2023

I cannot manage to make this work on Windows.

Running the following command: `python -m vall_e.train yaml=config/test/ar.yml`

First, I was getting the error `RuntimeError: Distributed package doesn't have NCCL built in`.

It seems the NCCL backend of the PyTorch distributed package does not work on Windows.

I found a workaround using the gloo backend and added the following code to data.py:

```python
import os
import socket

import torch.distributed

def get_free_port():
    # Ask the OS for any free TCP port to use as the rendezvous port.
    sock = socket.socket()
    sock.bind(("", 0))
    return sock.getsockname()[1]

# Fake a single-process "distributed" run so the env:// rendezvous works.
os.environ["RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = str(get_free_port())
os.environ["LOCAL_RANK"] = "0"

# gloo is available on Windows, unlike NCCL.
torch.distributed.init_process_group(backend="gloo", rank=0, world_size=1)
```
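As an aside, a slightly more general variant of this workaround (just my own sketch, not code from this repo) would pick the backend at runtime, so the same change keeps working on Linux with NCCL:

```python
import os
import platform

import torch.distributed as dist

def init_single_process_group():
    # NCCL only ships with Linux builds of PyTorch; fall back to gloo elsewhere.
    backend = "nccl" if platform.system() == "Linux" and dist.is_nccl_available() else "gloo"
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")  # any free port works
    dist.init_process_group(backend=backend, rank=0, world_size=1)
```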

With that workaround in place, training then fails with:
`RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.`

This is where I hit a brick wall.
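For context on that second error: PyTorch raises it whenever a leaf tensor with `requires_grad=True` is modified in place outside of `torch.no_grad()`. The snippet below is only a minimal reproduction of the same message (not code from this project), with the usual fix shown:

```python
import torch

w = torch.zeros(3, requires_grad=True)  # leaf tensor that requires grad

# w.normal_()  # would raise: "a leaf Variable that requires grad
#              # is being used in an in-place operation."

with torch.no_grad():  # typical fix: do the in-place init under no_grad
    w.normal_()
```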

Platform: Windows 11
Python: 3.10.9
torch: 1.11.0+cu113


alimaslax commented Feb 19, 2023

The easiest route would be to use the WSL Ubuntu distro on Windows. You can install it from PowerShell:
`wsl --install`
Then install the NVIDIA drivers inside WSL. Another route would be to install Docker and run a Dockerfile containing all the drivers. There is an open PR that contains a Jupyter image that does all of that. Here is another one that uses NVIDIA's CUDA Ubuntu image, which supports WSL:

```dockerfile
FROM nvidia/cuda:11.8.0-base-ubuntu22.04

USER root
ENV TORCH_HOME=/data/models

RUN apt update && apt install -y --no-install-recommends \
    build-essential \
    ffmpeg \
    git \
    python3 \
    python3-dev \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*

RUN python3 -m pip install git+https://github.com/enhuiz/vall-e.git

VOLUME /data/models
VOLUME /data/shared

ENTRYPOINT ["/bin/bash", "--login", "-c"]
```
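Assuming the NVIDIA Container Toolkit is set up on the host, building and running it would look roughly like this (the image tag and host paths are placeholders, not anything defined by the repo):

```bash
# Build the image from the Dockerfile above (tag name is arbitrary).
docker build -t vall-e .

# Run with GPU access; the ENTRYPOINT passes the quoted command to bash -c.
# Assumes the yaml config is available inside the container, e.g. via a mounted volume.
docker run --rm --gpus all \
  -v /path/on/host/models:/data/models \
  -v /path/on/host/shared:/data/shared \
  vall-e "python3 -m vall_e.train yaml=config/test/ar.yml"
```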

@anuraagdjain

> The easiest route would be to use the WSL Ubuntu distro on Windows. […]

Has anyone tried this on an M1 Mac? I had a similar error to the one described by the OP. I will try to run this Docker container later this week or next and update the results here.
