@specture724 specture724 commented Oct 21, 2025

resolve #38

Send an exception to the PS when `run` fails while executing `update_weights_from_ipc`. `_update_per_bucket` must confirm that no process has raised an exception before continuing to the next update iteration.
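As a rough illustration of the worker side of this change, the sketch below wraps a weight-update step in `try`/`except` and always sends a reply: `None` on success, the exception object on failure. The names `run`, `handle_update`, and the `payload` dict are hypothetical, and a plain `queue.Queue` stands in for whatever channel the real worker uses to talk to the PS; this is not the code from `checkpoint_engine/worker.py`.

```python
import queue


def run(payload):
    """Hypothetical stand-in for the worker's weight-update step."""
    if payload.get("corrupt"):
        raise RuntimeError("failed to load bucket")


def handle_update(payload, reply_channel):
    """Run one update and always reply: None on success, the exception on failure."""
    try:
        run(payload)
    except Exception as exc:
        # Report the failure to the PS instead of failing silently.
        reply_channel.put(exc)
        return False  # tell the caller to stop serving further updates
    reply_channel.put(None)
    return True


# Assumed channel: an in-process queue standing in for the real transport.
replies = queue.Queue()

ok = handle_update({"corrupt": False}, replies)
first_reply = replies.get()

failed = handle_update({"corrupt": True}, replies)
second_reply = replies.get()
```

Replying on both paths means the receiving side never has to distinguish "still working" from "died quietly"; every request produces exactly one response it can inspect.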

@specture724 specture724 requested review from Copilot and weixiao-huang and removed request for Copilot October 21, 2025 12:35

Copilot AI left a comment

Pull Request Overview

This PR implements error handling in the checkpoint worker process to ensure graceful shutdown when errors occur during weight updates. The changes propagate exceptions from worker processes back to the parameter server, allowing coordinated error handling across all distributed ranks.

Key Changes:

  • Worker processes now send exception objects to the parameter server when run() fails, instead of silently failing
  • Parameter server synchronizes error states across all ranks before continuing, ensuring no rank proceeds if any rank encounters an error
  • Process group cleanup moved to a finally block to ensure proper resource cleanup even when errors occur
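The parameter-server side of these changes could look roughly like the sketch below: gather one reply per rank, raise if any reply is an exception, and run cleanup in a `finally` block so it happens on both paths. Everything here is an assumption made for illustration: `collect_and_check`, `update`, and `cleanup_log` are hypothetical names, and a dict mapping rank to reply stands in for the real cross-rank synchronization in `checkpoint_engine/ps.py`.

```python
def collect_and_check(replies):
    """Raise if any rank reported an exception.

    `replies` maps rank -> None (success) or an Exception (failure).
    """
    failures = {rank: r for rank, r in replies.items() if isinstance(r, Exception)}
    if failures:
        detail = "; ".join(
            f"rank {rank}: {err!r}" for rank, err in sorted(failures.items())
        )
        raise RuntimeError(
            f"weight update failed on {len(failures)} rank(s): {detail}"
        )


def update(replies, cleanup_log):
    """Proceed only if every rank succeeded; always record the cleanup step."""
    try:
        collect_and_check(replies)
        return "continue"
    finally:
        # Stand-in for destroying the process group; runs on success and on error.
        cleanup_log.append("process group cleaned up")


cleanup_log = []
status = update({0: None, 1: None}, cleanup_log)

try:
    update({0: None, 1: ValueError("bad tensor")}, cleanup_log)
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)
```

Checking every rank's reply before returning is what keeps a healthy rank from moving on to the next iteration while another rank has already failed.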

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| checkpoint_engine/worker.py | Added try-except block to catch exceptions during weight updates and send them back to the parameter server |
| checkpoint_engine/ps.py | Modified error handling to collect responses from all ranks and raise error if any rank failed; moved cleanup to finally block |
| tests/test_error_quit.py | Added integration test that verifies proper error propagation and process termination when worker process encounters runtime errors |

@specture724 specture724 changed the title from "feat: quit checkpoint worker process when error occurs" to "feat: quit checkpoint engine when error occurs" Oct 21, 2025
@specture724 specture724 self-assigned this Oct 21, 2025

Development

Successfully merging this pull request may close these issues.

Quit when error occur during updating
