Add multi node training guide for XPU device #9464

Open
@zhouyu5

Description

📚 Describe the documentation issue

Currently, training_benchmark_xpu.py only supports training with multiple XPU devices on a single node. If a user wants to run it on multiple nodes, each with multiple XPU devices, the script needs some modification. However, making it work is non-trivial, so I would like to submit a PR to improve the user experience when launching multi-node, multi-XPU training.

Suggest a potential alternative/fix

To my knowledge, the following files need modification:

  • training_benchmark_xpu.py: the get_dist_params() function, which initializes the DDP process group, needs to be modified.
  • README.md: a detailed guide is needed on how to set up the environment and launch multi-node training.
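For illustration, here is a minimal sketch of what a multi-node-aware get_dist_params() might look like. It assumes a torchrun-style launcher that exports MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, and LOCAL_RANK on every node; the actual variable names, defaults, and process-group backend used in the eventual PR may differ:

```python
import os

def get_dist_params():
    """Read rendezvous info from environment variables set by a multi-node
    launcher (e.g. torchrun). The variable names below follow the torchrun
    convention and are an assumption, not necessarily what
    training_benchmark_xpu.py will use."""
    master_addr = os.environ.get('MASTER_ADDR', '127.0.0.1')
    master_port = os.environ.get('MASTER_PORT', '29500')
    rank = int(os.environ.get('RANK', '0'))            # global rank across all nodes
    world_size = int(os.environ.get('WORLD_SIZE', '1'))  # total processes across all nodes
    local_rank = int(os.environ.get('LOCAL_RANK', '0'))  # rank within this node
    init_method = f'tcp://{master_addr}:{master_port}'
    return rank, world_size, local_rank, init_method

# The caller would then initialize the process group, e.g. with the oneCCL
# backend commonly used for XPU (hypothetical usage, backend name may vary):
#   rank, world_size, local_rank, init_method = get_dist_params()
#   torch.distributed.init_process_group('ccl', init_method=init_method,
#                                        rank=rank, world_size=world_size)
```

The key change for multi-node support is deriving the global rank and world size from launcher-provided environment variables rather than assuming all processes live on one host.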
