Feat/gpu-ci #37
base: main
Conversation
Pull Request Overview
This PR refactors test_update.py to use the pytest framework and adds a CI workflow. The main change replaces direct test execution with a pytest-based structure that launches distributed training via torchrun in a subprocess.
- Converted test_update.py from standalone script to pytest-based test structure
- Added pytest mark for GPU-specific tests with multi-GPU requirements
- Enhanced logging control in ParameterServer to support flexible rank-based logging
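The refactor described in the bullets above can be sketched roughly as follows. This is a hypothetical illustration, not the PR's exact code: the helper name and the torchrun invocation are assumptions.

```python
import sys


def build_torchrun_cmd(nproc_per_node: int, script: str) -> list[str]:
    # `torchrun` is the console entry point for `python -m torch.distributed.run`,
    # so invoking the module keeps the launch tied to the current interpreter.
    return [
        sys.executable,
        "-m",
        "torch.distributed.run",
        f"--nproc-per-node={nproc_per_node}",
        script,
    ]


# A pytest test would mark this GPU-only, skip when fewer than 2 GPUs are
# visible, and assert that the subprocess exits cleanly, e.g.:
#
#   @pytest.mark.gpu
#   def test_update():
#       result = subprocess.run(build_torchrun_cmd(2, "tests/test_update.py"))
#       assert result.returncode == 0
```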
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/test_update.py | Refactored from standalone script to pytest framework with subprocess-based distributed test execution |
| checkpoint_engine/ps.py | Enhanced logging system to support configurable logger rank for different update scenarios |
| .pre-commit-config.yaml | Added ruff ignore rule for subprocess security warning (S603) |
```python
assert world_size >= 2, "This test requires at least 2 GPUs."

master_addr = "localhost"
master_port = random.randint(20000, 30000)
```
Copilot AI commented on Oct 20, 2025:
Using random port selection for distributed training can lead to port conflicts in CI environments. Consider using a more robust port selection mechanism or environment variable override.
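One way to address this comment is to let the OS hand out a free ephemeral port and allow an explicit override in CI. This is a minimal sketch; the `MASTER_PORT` environment-variable name is an assumption, not something this PR defines.

```python
import os
import socket


def pick_master_port() -> int:
    # Allow an explicit override in CI via an (assumed) MASTER_PORT variable.
    if "MASTER_PORT" in os.environ:
        return int(os.environ["MASTER_PORT"])
    # Bind to port 0 so the OS assigns a currently free ephemeral port,
    # avoiding the collisions random.randint can cause on shared runners.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("localhost", 0))
        return s.getsockname()[1]
```

Note there is still a small race between releasing the probe socket and torchrun binding the port, but it is far less likely to collide than a blind random pick from a fixed range.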
```python
self._logger_rank = 0
```
Copilot AI commented on Oct 20, 2025:
The _logger_rank initialization should be documented to explain when and how this value changes during execution.
Suggested change:
```python
# _logger_rank determines which rank is responsible for logging output.
# By default, it is set to 0, meaning only rank 0 will perform logging.
# If logging from other ranks is required, this value can be changed accordingly.
self._logger_rank = 0
```
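The suggestion above could be folded into a small rank-gated logging helper. This is a hypothetical sketch, not the actual ParameterServer implementation in checkpoint_engine/ps.py; the class and method names are assumptions.

```python
import logging

logging.basicConfig(level=logging.INFO)


class ParameterServerSketch:
    def __init__(self, rank: int, logger_rank: int = 0):
        self._rank = rank
        # _logger_rank determines which rank is responsible for logging output.
        # By default only rank 0 logs; callers can point it at another rank
        # for update scenarios where a different rank owns the output.
        self._logger_rank = logger_rank
        self._logger = logging.getLogger(__name__)

    def log(self, msg: str) -> bool:
        # Returns True when this rank actually emitted the message,
        # which also makes the gating easy to unit-test.
        if self._rank != self._logger_rank:
            return False
        self._logger.info(msg)
        return True
```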
Refactor test_update.py to pytest style and add a CI workflow for the main branch.