Contributor

@e7217 e7217 commented Nov 29, 2024

Fixes #ISSUE_NUMBER
When attempting to reconfigure the environment without properly handling the PyTorch-related settings, you may encounter the following message.

│ /root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:1215 in get_backend
│
│   1212 │   if _rank_not_in_group(pg):
│   1213 │   │   raise ValueError("Invalid process group specified")
│   1214 │   pg_store = _world.pg_map[pg] if pg in _world.pg_map else None
│ ❱ 1215 │   return Backend(not_none(pg_store)[0])
│   1216
│   1217
│   1218 def _get_process_group_uid(pg: ProcessGroup) -> int:
│
│ /root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/site-packages/torch/utils/_typing_utils.py:13 in not_none
│
│   10
│   11 def not_none(obj: Optional[T]) -> T:
│   12 │   if obj is None:
│ ❱ 13 │   │   raise TypeError("Invariant encountered: value was None when it should not be")
│   14 │   return obj
│   15
╰────────────────────────────────────────────────────────────
TypeError: Invariant encountered: value was None when it should not be
Exception ignored in: <function Vllm.__del__ at 0x7f35f96b6dd0>

Because this message has confused several developers, this PR adds more detail to the error message to clarify the cause.
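To make the failure mode concrete, here is a minimal sketch that reproduces the opaque error without importing torch. The `not_none` helper follows the source shown in the traceback; `pg_map`, `get_backend_before`, and `describe_failure` are hypothetical stand-ins for illustration, not PyTorch APIs.

```python
from typing import Dict, Optional, Tuple, TypeVar

T = TypeVar("T")


def not_none(obj: Optional[T]) -> T:
    # Same shape as the helper shown in the traceback.
    if obj is None:
        raise TypeError("Invariant encountered: value was None when it should not be")
    return obj


# Hypothetical stand-in for _world.pg_map: group handle -> (backend name, store).
pg_map: Dict[str, Tuple[str, object]] = {"default_pg": ("gloo", object())}


def get_backend_before(pg: str) -> str:
    # Mirrors the lookup at line 1214 of the traceback: an unknown group yields None.
    pg_store = pg_map[pg] if pg in pg_map else None
    # not_none() then raises a TypeError that says nothing about process groups.
    return not_none(pg_store)[0]


def describe_failure(pg: str) -> str:
    # Return the backend name, or a one-line description of the TypeError raised.
    try:
        return get_backend_before(pg)
    except TypeError as exc:
        return f"{type(exc).__name__}: {exc}"
```

Calling `describe_failure` with a handle that is not in the map returns only the generic invariant text, which is why the cause is hard to diagnose from the message alone.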

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

pytorch-bot bot commented Nov 29, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141796

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 4bb1b0d with merge base 59f14d1:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Nov 29, 2024
@e7217 e7217 changed the title from "update error message in get_backend() more detail" to "update error message in get_backend() more detail_" Nov 29, 2024
Skylion007 previously approved these changes Nov 29, 2024
- pg_store = _world.pg_map[pg] if pg in _world.pg_map else None

+ pg_store = _world.pg_map.get(pg, None)
+ if pg_store is None:
Collaborator

Nice, how did this not get flagged by mypy I wonder.

Contributor Author

@Skylion007 Thank you for checking. I have resolved it.

Collaborator

Ah, okay, this just raised the error in not_none before, and now it raises it here.

Contributor Author

@Skylion007 Yes, that's correct. There were developers who were confused due to not knowing the cause, so I made it a bit more detailed.
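The behavior discussed in this thread can be sketched as follows. This is an illustration under stated assumptions, not the merged code: `pg_map` and `get_backend_after` are hypothetical stand-ins for `_world.pg_map` and `get_backend()`, and both the exception type and the wording of the message are assumptions, since the thread quotes only the first two lines of the change.

```python
from typing import Dict, Tuple

# Hypothetical stand-in for _world.pg_map: group handle -> (backend name, store).
pg_map: Dict[str, Tuple[str, object]] = {"default_pg": ("gloo", object())}


def get_backend_after(pg: str) -> str:
    # The two lines quoted in the review: look the group up, then check for None.
    pg_store = pg_map.get(pg, None)
    if pg_store is None:
        # Assumed exception type and wording; the merged message may differ.
        raise ValueError(
            f"Process group {pg!r} is not registered; "
            "it may not have been initialized or may already be destroyed."
        )
    # After the explicit check, type checkers narrow pg_store to a non-None
    # tuple, so no not_none() wrapper is needed.
    return pg_store[0]
```

The lookup fails at the same point as before, but the error now names the process group instead of reporting a generic invariant violation.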

Contributor Author

e7217 commented Nov 30, 2024

@Skylion007
Hello,

The PR is currently in a waiting state, so may I kindly ask you to check it?
If any changes are required in the code, please let me know.
Thank you!

@cpuhrsch cpuhrsch added the triaged label Dec 3, 2024
@cpuhrsch cpuhrsch requested a review from wconstab December 3, 2024 23:46
@e7217 e7217 requested a review from Skylion007 December 11, 2024 17:30
Contributor

@kwen2501 kwen2501 left a comment

LGTM

Contributor Author

e7217 commented Dec 16, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Dec 16, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / win-vs2019-cpu-py3 / test (default, 3, 3, lf.windows.4xlarge.nonephemeral)

pytorch-bot bot commented Dec 16, 2024

You are not authorized to force merges to this repository. Please use the regular @pytorchmergebot merge command instead

Contributor Author

e7217 commented Dec 16, 2024

@pytorchmergebot merge

@pytorchmergebot
Collaborator

Merge started

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / win-vs2019-cpu-py3 / test (default, 3, 3, lf.windows.4xlarge.nonephemeral)

Contributor Author

e7217 commented Dec 16, 2024

@pytorchmergebot merge

@pytorchmergebot
Collaborator

Merge started

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / win-vs2019-cpu-py3 / test (default, 3, 3, lf.windows.4xlarge.nonephemeral)

Contributor Author

e7217 commented Dec 16, 2024

@kwen2501 Thank you for checking.

Could you please merge this PR? The CI was green in the previous iteration.

Or, how can I retry it?

Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Feb 18, 2025
@kwen2501
Contributor

@pytorchbot merge -r

@kwen2501 kwen2501 added no-stale and removed Stale labels Mar 10, 2025
@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict.

@pytorchmergebot
Collaborator

Successfully rebased fix/add-error-message-more-detail onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fix/add-error-message-more-detail && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the fix/add-error-message-more-detail branch from e336db5 to 4bb1b0d March 10, 2025 18:15
@pytorch-bot pytorch-bot bot removed the ciflow/trunk label Mar 10, 2025
@pytorchmergebot
Collaborator

Merge started

@pytorchmergebot
Collaborator

Merge failed

Reason: 3 mandatory check(s) failed. The first few are:

Failing merge rule: Core Maintainers

Contributor Author

e7217 commented Mar 10, 2025

@kwen2501
Thank you for checking, even though it has been quite a while.
The PR is in an approved state, but for some reason, the merge has not been processed. HAHA...😂
Do you know why that might be?
Thank you once again for your follow-up, and I hope you have a great day!

@kwen2501
Contributor

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Mar 10, 2025
@pytorchmergebot
Collaborator

Merge started

@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

Contributor Author

e7217 commented Mar 11, 2025

@kwen2501
First of all, I appreciate your understanding regarding my continuous requests.

Currently, even though all tests have been completed, the merge is being blocked. Is this something that needs to be addressed by the contributor or maintainer? 😮😮

I appreciate any advice you can provide.

@Skylion007
Collaborator

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

@e7217 e7217 deleted the fix/add-error-message-more-detail branch March 15, 2025 00:56
Labels
ciflow/trunk, Merged, no-stale, oncall: distributed, open source, release notes: distributed (c10d), triaged