@zigzagcai zigzagcai commented Apr 23, 2024

📌 Checklist before creating the PR

  • I have created an issue for this PR for traceability
  • The title follows the standard format: [doc/gemini/tensor/...]: A concise description
  • I have added relevant tags if possible for us to better distinguish different PRs
  • I have installed pre-commit: pip install pre-commit && pre-commit install

🚨 Issue number

None

📝 What does this PR do?

During very large-scale training (especially with 1024 or more GPUs), there is a high probability that launching the training processes gets stuck. We found that the root cause was a deadlock within the singleton implementation in multithreading scenarios. This PR implements a thread-safe singleton to avoid that deadlock.
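The description does not show the code itself, so the following is only a minimal sketch of one common way to make a singleton thread-safe in Python, not the implementation merged in this PR. The names `ThreadSafeSingletonMeta`, `Config`, and `collect_instance_ids` are hypothetical. It assumes a metaclass-based singleton guarded by a re-entrant lock with a double check, so that concurrent first calls create exactly one instance and a nested singleton construction on the same thread cannot block on itself.

```python
import threading


class ThreadSafeSingletonMeta(type):
    """Hypothetical metaclass: each class using it gets at most one instance."""

    _instances = {}
    # Re-entrant lock (an assumption of this sketch): a singleton whose
    # __init__ constructs another singleton on the same thread can re-acquire
    # the lock instead of waiting on itself.
    _lock = threading.RLock()

    def __call__(cls, *args, **kwargs):
        # Fast path: skip locking once the instance exists.
        if cls not in cls._instances:
            with cls._lock:
                # Re-check under the lock; another thread may have created
                # the instance while this one was waiting.
                if cls not in cls._instances:
                    cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]


class Config(metaclass=ThreadSafeSingletonMeta):
    """Hypothetical singleton class used only for illustration."""

    def __init__(self):
        self.values = {}


def collect_instance_ids(num_threads=8):
    """Construct Config from several threads and return the distinct ids seen."""
    seen = []

    def worker():
        seen.append(id(Config()))

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return set(seen)
```

Calling `collect_instance_ids()` constructs `Config` from several threads at once; with the lock in place, every thread should observe the same object. Whether a plain `Lock` or an `RLock` is appropriate depends on whether singleton constructors can nest, which the PR description does not state.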

💥 Checklist before requesting a review

  • I have linked my PR to an issue (instruction)
  • My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
  • I have performed a self-review of my code
  • I have added thorough tests.
  • I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

  • 🌝 Yes, I do.
  • 🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

@zigzagcai zigzagcai requested a review from a team as a code owner April 23, 2024 03:29
@zigzagcai zigzagcai changed the title feat(singleton): implement thread-safety singleton to avoid deadlock for very large-scale training scenarios [fix]: implement thread-safety singleton to avoid deadlock for very large-scale training scenarios Apr 23, 2024
@zigzagcai zigzagcai changed the title [fix]: implement thread-safety singleton to avoid deadlock for very large-scale training scenarios [Fix]: implement thread-safety singleton to avoid deadlock for very large-scale training scenarios Apr 23, 2024
@ver217 ver217 left a comment

Thanks!

@ver217 ver217 merged commit 7ef9160 into hpcaitech:main Apr 25, 2024
wangbluo pushed a commit to wangbluo/ColossalAI that referenced this pull request May 7, 2024
…arge-scale training scenarios (hpcaitech#5625)

* implement thread-safety singleton

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refactor singleton implementation

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>


2 participants