Issue: Multi-node and Multi-GPU Inference Problems with DeepSpeed MII #545
Problem Description
I am using DeepSpeed MII to perform sharding and multi-node inference with generative models. The objective is to distribute a model across two nodes (2 GPUs per node, 4 GPUs total, each with ~24 GB of VRAM), read prompts from JSON files in an input folder, generate responses, and save them to an output folder.
However, depending on the model used, I encounter various issues:
1. With the Qwen-32B model:
- Initial responses are correct.
- After a random number of iterations (even with the same prompt), the code hangs indefinitely during the response generation step, with no errors.
2. With Llama 3.1 8B:
- In single-node mode, everything works perfectly.
- In multi-node mode, the code does not hang as it does with Qwen, but the responses are garbled or incorrect. For example:
Prompt: "What is the sun?"
Response: "The sun is a str comTi asTur forBas al aaall wehnd us" (randomly scrambled words).
3. With Mistral 7B Instruct v0.3:
- The code hangs after only a few iterations.
- Responses are partially scrambled, similar to the Llama case.
Troubleshooting Attempts:
- I have tried several things to address these issues, but the following attempts were particularly confusing and raised more doubts than answers:
- Adding/Removing torch.distributed.barrier(): I attempted to synchronize processes by calling torch.distributed.barrier() both before and after the inference step. This did not resolve the hanging or the garbled responses.
- Modifying the all_rank_output Parameter: I experimented with enabling and disabling all_rank_output during pipeline initialization. This also did not resolve the issues. (A minimal sketch of both attempts is shown right after this list.)
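For reference, here is a minimal sketch of how I wired in the two attempts above. The torch.distributed.is_initialized() guard is only there to keep the sketch self-contained, and the comment on all_rank_output reflects my understanding of the option; everything else matches the full script further down.

import os
import torch
import mii

# Attempt 2: toggling all_rank_output at pipeline initialization
# (as I understand it, True returns the generated text on every rank, False only on rank 0)
pipe = mii.pipeline("/path/to/model/", all_rank_output=True)  # also tried False

global_rank = int(os.getenv("RANK", "-1"))
prompts = ["What is the sun?"]

# Attempt 1: explicit synchronization around the inference step
if torch.distributed.is_initialized():
    torch.distributed.barrier()  # sync before generation

responses = pipe(prompts, max_new_tokens=128)

if torch.distributed.is_initialized():
    torch.distributed.barrier()  # sync after generation

if global_rank == 0:
    print(responses[0].generated_text)

Neither variant changed the behavior described above.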
System Configuration:
- hostfile:
xxxx.xxx.xxx.xxx slots=2
yyyy.yyy.yyy.yyy slots=2
- Execution Commands:
Node0: deepspeed --hostfile=hostfile --no_ssh --node_rank=0 --master_addr=xxxx.xxx.xxx.xxx --master_port=xxxx multinode_dynamic_inference.py
Node1: deepspeed --hostfile=hostfile --no_ssh --node_rank=1 --master_addr=xxxx.xxx.xxx.xxx --master_port=xxxx multinode_dynamic_inference.py
- Code Used:

import json
import os
from pathlib import Path
from time import sleep
import time
import torch
import mii
import gc

# Paths for input and output files
IN_REQUEST_PATH = Path("/path/to/input/")
OUT_REQUEST_PATH = Path("/path/to/output/")

# Local and global rank
local_rank = int(os.getenv("LOCAL_RANK", "-1"))
global_rank = int(os.getenv("RANK", "-1"))

# Initialize the model pipeline
pipe = mii.pipeline("/path/to/model/", all_rank_output=True)

iteration = 0
while True:
    print(iteration)
    iteration += 1
    print(f"GPU memory allocated: {torch.cuda.memory_allocated()}")
    print(f"GPU memory reserved: {torch.cuda.memory_reserved()}")

    # Process input files
    request_paths = list(IN_REQUEST_PATH.iterdir())
    print(f"LOCAL RANK {local_rank}, GLOBAL RANK {global_rank}")

    if len(request_paths) > 0:
        requests = [json.loads(path.read_text(encoding="utf-8")) for path in request_paths]
        prompts = [r["prompt"] for r in requests]

        # Perform inference
        start_time = time.time()
        responses = pipe(prompts, max_new_tokens=128)
        end_time = time.time()
        print(f"Inference time: {end_time - start_time:.2f} seconds")

        # Write results
        if global_rank == 0:
            print("Printing output")
            Path("./responses.json").write_text("\n\n\n".join([r.generated_text for r in responses]))
            for request, response in zip(requests, responses):
                request["response"] = response.generated_text
                Path(OUT_REQUEST_PATH / f"{request['id']}.json").write_text(
                    json.dumps(request, ensure_ascii=False), encoding="utf-8"
                )

    # Clear GPU cache
    torch.cuda.empty_cache()
    gc.collect()
    sleep(10)