
Issue: Multi-node and Multi-GPU Inference Problems with DeepSpeed MII #545

Open
@lcnmzz00

Problem Description
I am using DeepSpeed MII to perform sharding and multi-node inference with generative models. The objective is to distribute a model across two nodes (2 GPUs per node, total of 4 GPUs, each with ~24GB VRAM) and read prompts from JSON files in an input folder to generate responses, which are then saved in an output folder.
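For a rough sense of how tight the memory budget is, the per-GPU weight footprint can be estimated as below. This is only a back-of-the-envelope sketch: the parameter count (about 32e9), the 2-byte precision (fp16/bf16), and perfectly even sharding are assumptions, since the issue does not state them, and the KV cache and activations are not counted. The helper name `weight_gib_per_gpu` is hypothetical.

```python
def weight_gib_per_gpu(num_params: float, bytes_per_param: int, num_gpus: int) -> float:
    """Approximate weight memory per GPU in GiB, assuming even sharding."""
    return num_params * bytes_per_param / num_gpus / 2**30


# Assumed values: ~32e9 parameters, 2 bytes per parameter, 4 GPUs.
per_gpu = weight_gib_per_gpu(32e9, 2, 4)
print(f"{per_gpu:.1f} GiB of weights per GPU")
```

Under these assumptions the weights alone take a large share of each ~24GB card, which leaves limited room for the KV cache.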

However, depending on the model used, I encounter various issues:

1. With the Qwen-32B model:

  • Initial responses are correct.
  • After a random number of iterations (even with the same prompt), the code hangs indefinitely during the response generation step, with no errors.

2. With Llama 3.1 8B:

  • In single-node mode, everything works perfectly.
  • In multi-node mode, the code does not hang as with Qwen, but the responses are garbled or incorrect. For example:

Prompt: "What is the sun?"
Response: "The sun is a str comTi asTur forBas al aaall wehnd us" (randomly scrambled words).

3. With Mistral 7B Instruct v0.3:

  • The code hangs after only a few iterations.
  • Responses are partially scrambled, similar to the Llama case.
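Because the hangs produce no error, one way to see where each process is stuck is to arm a watchdog around the generation call that dumps every thread's stack if the call takes too long. The sketch below uses only the standard-library `faulthandler` module; the helper name `dump_stack_if_stuck` and the timeout value are hypothetical, and it is a diagnostic aid, not a fix.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def dump_stack_if_stuck(timeout_s: float, file=None):
    """Dump all thread stacks to `file` every `timeout_s` seconds while the body runs."""
    target = sys.stderr if file is None else file
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)
    try:
        yield
    finally:
        # Disarm the watchdog once the body finishes (or raises).
        faulthandler.cancel_dump_traceback_later()


# Intended use around the inference step (timeout chosen arbitrarily):
#     with dump_stack_if_stuck(300):
#         responses = pipe(prompts, max_new_tokens=128)
```

Running this on every rank and comparing the dumps should show whether all processes are waiting at the same point or whether one rank has diverged.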

Troubleshooting Attempts:

I have tried several things to address these issues. The two attempts below are particularly confusing and raised more doubts than solutions:

  • Adding/removing torch.distributed.barrier(): I attempted to synchronize the processes by calling torch.distributed.barrier() both before and after the inference step. This resolved neither the hanging nor the garbled responses.
  • Modifying the all_rank_output parameter: I experimented with enabling and disabling all_rank_output during pipeline initialization. This did not resolve the issues either.
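A further check that may be worth adding is whether every rank actually passes the same prompts, in the same order, to the pipeline. `Path.iterdir()` yields entries in arbitrary order, and each node lists its own view of the input folder. Whether this is related to the symptoms above is not established; the sketch below only makes a mismatch visible. The helper names `load_prompts` and `prompts_fingerprint` are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def load_prompts(input_dir: Path) -> list[str]:
    """Read prompts in sorted filename order so the list is deterministic."""
    paths = sorted(p for p in input_dir.iterdir() if p.suffix == ".json")
    return [json.loads(p.read_text(encoding="utf-8"))["prompt"] for p in paths]


def prompts_fingerprint(prompts: list[str]) -> str:
    """Short digest of the prompt list; different digests mean different inputs."""
    payload = json.dumps(prompts, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


# Intended use on every rank, just before calling the pipeline:
#     prompts = load_prompts(IN_REQUEST_PATH)
#     print(f"RANK {global_rank}: {len(prompts)} prompts, {prompts_fingerprint(prompts)}")
```

If the printed fingerprints differ between ranks in the same iteration, the ranks are not running the same batch.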

System Configuration:

- hostfile:
xxxx.xxx.xxx.xxx slots=2
yyyy.yyy.yyy.yyy slots=2

- Execution Commands:
Node0: deepspeed --hostfile=hostfile --no_ssh --node_rank=0 --master_addr=xxxx.xxx.xxx.xxx --master_port=xxxx multinode_dynamic_inference.py

Node1: deepspeed --hostfile=hostfile --no_ssh --node_rank=1 --master_addr=xxxx.xxx.xxx.xxx --master_port=xxxx multinode_dynamic_inference.py
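As a sanity check on the hostfile and launch commands above, the expected world size can be derived from the `slots=` entries and compared with what each process sees at runtime. This is a minimal sketch that assumes the simple `host slots=N` line format shown above; `parse_hostfile` is a hypothetical helper, and `node0`/`node1` stand in for the redacted addresses.

```python
def parse_hostfile(text: str) -> dict[str, int]:
    """Map each host to its slot count, for lines of the form 'host slots=N'."""
    slots: dict[str, int] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # tolerate comments and blank lines
        if not line:
            continue
        host, slot_field = line.split()
        key, _, value = slot_field.partition("=")
        if key != "slots":
            raise ValueError(f"unexpected hostfile entry: {raw!r}")
        slots[host] = int(value)
    return slots


# Placeholder host names; the real addresses are redacted in the issue.
hosts = parse_hostfile("node0 slots=2\nnode1 slots=2\n")
expected_world_size = sum(hosts.values())
print(expected_world_size)
```

Each process could then compare this number with its `WORLD_SIZE` environment variable to confirm that both nodes joined the same 4-process job.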

- Code Used

import gc
import json
import os
import time
from pathlib import Path

import mii
import torch

# Paths for input and output files
IN_REQUEST_PATH = Path("/path/to/input/")
OUT_REQUEST_PATH = Path("/path/to/output/")

# Local and global rank, read from the environment set by the launcher
local_rank = int(os.getenv("LOCAL_RANK", "-1"))
global_rank = int(os.getenv("RANK", "-1"))

# Initialize the model pipeline
pipe = mii.pipeline("/path/to/model/", all_rank_output=True)

iteration = 0

while True:
    print(iteration)
    iteration += 1

    print(f"GPU memory allocated: {torch.cuda.memory_allocated()}")
    print(f"GPU memory reserved: {torch.cuda.memory_reserved()}")

    # Process input files (every rank lists the folder independently)
    request_paths = list(IN_REQUEST_PATH.iterdir())
    print(f"LOCAL RANK {local_rank}, GLOBAL RANK {global_rank}")

    if len(request_paths) > 0:
        requests = [json.loads(path.read_text(encoding="utf-8")) for path in request_paths]
        prompts = [r["prompt"] for r in requests]

        # Perform inference
        start_time = time.time()
        responses = pipe(prompts, max_new_tokens=128)
        end_time = time.time()
        print(f"Inference time: {end_time - start_time:.2f} seconds")

        # Write results from global rank 0 only
        if global_rank == 0:
            print("Printing output")
            Path("./responses.json").write_text(
                "\n\n\n".join(r.generated_text for r in responses), encoding="utf-8"
            )

            for request, response in zip(requests, responses):
                request["response"] = response.generated_text
                (OUT_REQUEST_PATH / f"{request['id']}.json").write_text(
                    json.dumps(request, ensure_ascii=False), encoding="utf-8"
                )

    # Collect garbage first so freed tensors can be released from the CUDA cache
    gc.collect()
    torch.cuda.empty_cache()
    time.sleep(10)
