Enhancement - Fix preprocessing crash when num_workers=0 in Megatron GPT3 dataset generation#800
Enhancement - Fix preprocessing crash when num_workers=0 in Megatron GPT3 dataset generation#800polarG wants to merge 9 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes a crash in the Megatron GPT-3 dataset preprocessing path when --num_workers=0 by ensuring the worker count passed to Megatron’s tools/preprocess_data.py is at least 1, while keeping num_workers unchanged for other consumers (e.g., PyTorch DataLoader).
Changes:
- Clamp
--workersforpreprocess_data.pytomax(1, self._args.num_workers)to avoidmultiprocessing.Pool(0)failures. - Add inline clarification comment documenting why preprocessing clamps workers while other components may allow
num_workers=0.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Megatron's preprocess_data.py appends '_text_document' to the
--output-prefix when producing the .bin/.idx files. Derive the
output-prefix from --data_prefix (stripping a trailing
'_text_document' suffix when present) so that the generated files
match the existence checks for any custom data_prefix value, instead
of being hardcoded to '<data_home>/dataset'.
- Add unit test test_megatron_gpt_dataset_generate_command covering:
num_workers=0 clamps to '--workers 1' with default data_prefix
('dataset_text_document' -> '<data_home>/dataset'), num_workers=4
with custom 'custom_text_document' -> '<data_home>/custom', and a
data_prefix without the suffix used as-is.
Avoid '--workers 1' matching '--workers 10' etc.
a475b85 to
ee2839e
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #800 +/- ##
==========================================
+ Coverage 85.88% 86.01% +0.13%
==========================================
Files 103 103
Lines 7929 7939 +10
==========================================
+ Hits 6810 6829 +19
+ Misses 1119 1110 -9
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…et generate test
- Replace brittle whitespace substring assertions ('--workers 1 ', '--output-prefix ... ') with normalize_command()-based parsed CLI unit checks, so the test validates semantics rather than formatting.
- Use --code_base {self._tmp_dir} together with createMockFiles(['pretrain_gpt.py']) to avoid the unrealistic /root/Megatron-DeepSpeed path. The mocked run_command now creates the expected .bin/.idx files via side_effect so _preprocess() succeeds end-to-end and is asserted to be True.
- Warn when num_workers is silently clamped from 0 to 1 for the preprocess subprocess so the user sees the override in the log. The DataLoader still receives the original num_workers value. - Guard the '_text_document' suffix-strip against the edge case where data_prefix == '_text_document' (which would otherwise produce a malformed --output-prefix '<data_home>/' with an empty basename). Fall back to the original data_prefix value in that case. - Extend test_megatron_gpt_dataset_generate_command with a 4th case asserting the empty-basename fallback.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Fix test pollution: clean up pretrain_gpt.py created by test_megatron_gpt_dataset_generate_command so it does not break the alphabetically-later test_megatron_gpt_preprocess (root cause of CI cpu-unit-test failures on python-3.7/3.10/3.12). - Fix Python 3.7 incompatibility: use call_args_list[0][0][0] instead of .args[0] (mock.call.args is Python 3.8+). - Enforce data_prefix contract in _generate_dataset(): validate that data_prefix ends with '_text_document' and has a non-empty stem before invoking preprocess_data.py. Fail fast with a clear error otherwise (preprocess_data.py always appends '_text_document' to --output-prefix, so other prefixes would silently produce files whose names mismatch the existence check). - Replace test Cases 3 and 4 (which masked the contract bug via a side_effect that lied about preprocess_data.py output) with negative cases that assert _preprocess() fails fast and never invokes run_command for invalid data_prefix values.
- Collapse multi-line .format() call in _generate_dataset error message. - Add blank line before 'return _side_effect' in nested function in test.
| if self._args.dataset_url: | ||
| self._raw_data_path = str(Path(self._args.data_home) / 'data.json') | ||
| download_file(self._args.dataset_url, self._raw_data_path) | ||
|
|
||
| # Megatron's preprocess_data.py appends '_text_document' to --output-prefix | ||
| # when producing the .bin/.idx files. For the existence check below | ||
| # (which looks for {data_prefix}.bin/.idx) to pass, data_prefix must end | ||
| # with '_text_document' and have a non-empty stem when generation is needed. | ||
| suffix = '_text_document' | ||
| if not self._args.data_prefix.endswith(suffix) or self._args.data_prefix == suffix: | ||
| logger.error( | ||
| 'data_prefix must end with "{}" and have a non-empty stem when ' | ||
| 'dataset generation is required (got "{}"). preprocess_data.py ' | ||
| 'always appends "{}" to --output-prefix.'.format(suffix, self._args.data_prefix, suffix) | ||
| ) |
| # num_workers=0 is valid for DataLoader (main process loads data), | ||
| # but preprocess_data.py requires workers>=1 for multiprocessing.Pool. | ||
| preprocess_workers = max(1, self._args.num_workers) | ||
| if preprocess_workers != self._args.num_workers: | ||
| logger.warning( | ||
| 'preprocess_data.py requires --workers >= 1; ' | ||
| 'overriding num_workers={} to {} for dataset preprocessing only ' | ||
| '(DataLoader still uses num_workers={}).'.format( | ||
| self._args.num_workers, preprocess_workers, self._args.num_workers | ||
| ) | ||
| ) |
| self.addCleanup(lambda: [p.unlink() for p in created_files if p.is_file()]) | ||
|
|
Description
preprocess_data.py uses multiprocessing.Pool(workers), which requires workers >= 1. When num_workers is set to 0 (valid for DataLoader, where it means "load in main process"), the preprocessing step crashes.
This change clamps the worker count to max(1, self._args.num_workers) before passing it to preprocess_data.py, while leaving the original num_workers value unchanged for other uses like DataLoader.