
Issue: Multi-node and Multi-GPU Inference Problems with DeepSpeed MII #545



Problem Description
I am using DeepSpeed MII to perform sharding and multi-node inference with generative models. The objective is to distribute a model across two nodes (2 GPUs per node, total of 4 GPUs, each with ~24GB VRAM) and read prompts from JSON files in an input folder to generate responses, which are then saved in an output folder.

However, depending on the model used, I encounter various issues:

1. With the Qwen-32B model:

  • Initial responses are correct.
  • After a random number of iterations (even with the same prompt), the code hangs indefinitely during the response generation step, with no errors.

2. With Llama 3.1 8B:

  • In single-node mode, everything works perfectly.
  • In multi-node mode, the code does not hang as with Qwen, but the responses are garbled or incorrect. For example:

Prompt: "What is the sun?"
Response: "The sun is a str comTi asTur forBas al aaall wehnd us" (randomly scrambled words).

3. With Mistral 7B Instruct v0.3:

  • The code hangs after only a few iterations.
  • Responses are partially scrambled, similar to the Llama case.

Troubleshooting Attempts:

I have tried several things to address these issues, but the following attempts are particularly confusing and raise more doubts than solutions:
  • Adding/Removing torch.distributed.barrier(): I attempted to synchronize processes using torch.distributed.barrier() both before and after the inference step. However, this did not resolve the hanging or the garbled responses.
  • Modifying the all_rank_output Parameter: I experimented with enabling and disabling all_rank_output during the pipeline initialization. This also did not resolve the issues.

System Configuration:

- hostfile: slots=2
yyyy.yyy.yyy.yyy slots=2

- Execution Commands:
Node0: deepspeed --hostfile=hostfile --no_ssh --node_rank=0 --master_port=xxxx

Node1: deepspeed --hostfile=hostfile --no_ssh --node_rank=1 --master_port=xxxx

- Code Used

import json
import os
from pathlib import Path
from time import sleep
import time
import torch
import mii
import gc

# Paths for input and output files
IN_REQUEST_PATH = Path("/path/to/input/")
OUT_REQUEST_PATH = Path("/path/to/output/")

# Local and global rank
local_rank = int(os.getenv("LOCAL_RANK", "-1"))
global_rank = int(os.getenv("RANK", "-1"))

# Initialize the model pipeline
pipe = mii.pipeline("/path/to/model/", all_rank_output=True)

iteration = 0

while True:
   iteration += 1

   print(f"GPU memory allocated: {torch.cuda.memory_allocated()}")
   print(f"GPU memory reserved: {torch.cuda.memory_reserved()}")

   # Process input files
   request_paths = list(IN_REQUEST_PATH.iterdir())
   print(f"LOCAL RANK {local_rank}, GLOBAL RANK {global_rank}")
   if len(request_paths) > 0:
       requests = [json.loads(path.read_text(encoding="utf-8")) for path in request_paths]
       prompts = [r["prompt"] for r in requests]

       # Perform inference
       start_time = time.time()
       responses = pipe(prompts, max_new_tokens=128)  
       end_time = time.time()
       print(f"Inference time: {end_time - start_time:.2f} seconds")

       # Write results
       if global_rank == 0:
           print("Printing output")
           Path("./responses.json").write_text("\n\n\n".join([r.generated_text for r in responses]))
           for request, response in zip(requests, responses):
               request["response"] = response.generated_text
               Path(OUT_REQUEST_PATH / f"{request['id']}.json").write_text(
                   json.dumps(request, ensure_ascii=False), encoding="utf-8"
               )

   # Clear GPU cache
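
A possible factor, not confirmed as the cause of either the hangs or the garbled output: in the script above every rank lists the input folder on its own, and `Path.iterdir()` yields entries in arbitrary order, so different ranks could build their prompt batches in different orders. The sketch below reads the requests sorted by file name. The helper name `load_requests` and the `.json` suffix filter are assumptions made for illustration and are not part of the original script.

```python
import json
from pathlib import Path


def load_requests(in_dir: Path) -> list[dict]:
    """Read every .json request in in_dir, ordered by file name.

    Hypothetical helper: sorting removes the arbitrary ordering of
    Path.iterdir(), so each rank that sees the same set of files
    builds the same list in the same order.
    """
    paths = sorted(
        p for p in in_dir.iterdir() if p.is_file() and p.suffix == ".json"
    )
    return [json.loads(p.read_text(encoding="utf-8")) for p in paths]
```

In the loop, `requests = load_requests(IN_REQUEST_PATH)` would replace the two lines that build `request_paths` and `requests`. Sorting only helps when all ranks see the same set of files. If files can appear in the folder while the ranks are listing it, the ranks could still end up with different batches.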

