Enhancement - Fix preprocessing crash when num_workers=0 in Megatron GPT3 dataset generation by polarG · Pull Request #800 · microsoft/superbenchmark

polarG · 2026-04-01T06:07:20Z

Description
preprocess_data.py uses multiprocessing.Pool(workers), which requires workers >= 1. When num_workers is set to 0 (valid for DataLoader, where it means "load in main process"), the preprocessing step crashes.

This change clamps the worker count to max(1, self._args.num_workers) before passing it to preprocess_data.py, while leaving the original num_workers value unchanged for other uses like DataLoader.

Copilot

Pull request overview

This PR fixes a crash in the Megatron GPT-3 dataset preprocessing path when --num_workers=0 by ensuring the worker count passed to Megatron’s tools/preprocess_data.py is at least 1, while keeping num_workers unchanged for other consumers (e.g., PyTorch DataLoader).

Changes:

Clamp --workers for preprocess_data.py to max(1, self._args.num_workers) to avoid multiprocessing.Pool(0) failures.
Add inline clarification comment documenting why preprocessing clamps workers while other components may allow num_workers=0.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- Megatron's preprocess_data.py appends '_text_document' to the --output-prefix when producing the .bin/.idx files. Derive the output-prefix from --data_prefix (stripping a trailing '_text_document' suffix when present) so that the generated files match the existence checks for any custom data_prefix value, instead of being hardcoded to '<data_home>/dataset'. - Add unit test test_megatron_gpt_dataset_generate_command covering: num_workers=0 clamps to '--workers 1' with default data_prefix ('dataset_text_document' -> '<data_home>/dataset'), num_workers=4 with custom 'custom_text_document' -> '<data_home>/custom', and a data_prefix without the suffix used as-is.

Avoid '--workers 1' matching '--workers 10' etc.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

codecov · 2026-04-27T23:20:03Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.01%. Comparing base (b41c391) to head (f786c64).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #800      +/-   ##
==========================================
+ Coverage   85.88%   86.01%   +0.13%     
==========================================
  Files         103      103              
  Lines        7929     7939      +10     
==========================================
+ Hits         6810     6829      +19     
+ Misses       1119     1110       -9

Flag	Coverage Δ
cpu-python3.10-unit-test	`70.85% <100.00%> (+0.15%)`	⬆️
cpu-python3.12-unit-test	`70.85% <100.00%> (+0.15%)`	⬆️
cpu-python3.7-unit-test	`70.28% <100.00%> (+0.15%)`	⬆️
cuda-unit-test	`83.94% <100.00%> (+0.13%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…et generate test - Replace brittle whitespace substring assertions ('--workers 1 ', '--output-prefix ... ') with normalize_command()-based parsed CLI unit checks, so the test validates semantics rather than formatting. - Use --code_base {self._tmp_dir} together with createMockFiles(['pretrain_gpt.py']) to avoid the unrealistic /root/Megatron-DeepSpeed path. The mocked run_command now creates the expected .bin/.idx files via side_effect so _preprocess() succeeds end-to-end and is asserted to be True.

- Warn when num_workers is silently clamped from 0 to 1 for the preprocess subprocess so the user sees the override in the log. The DataLoader still receives the original num_workers value. - Guard the '_text_document' suffix-strip against the edge case where data_prefix == '_text_document' (which would otherwise produce a malformed --output-prefix '<data_home>/' with an empty basename). Fall back to the original data_prefix value in that case. - Extend test_megatron_gpt_dataset_generate_command with a 4th case asserting the empty-basename fallback.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- Fix test pollution: clean up pretrain_gpt.py created by test_megatron_gpt_dataset_generate_command so it does not break the alphabetically-later test_megatron_gpt_preprocess (root cause of CI cpu-unit-test failures on python-3.7/3.10/3.12). - Fix Python 3.7 incompatibility: use call_args_list[0][0][0] instead of .args[0] (mock.call.args is Python 3.8+). - Enforce data_prefix contract in _generate_dataset(): validate that data_prefix ends with '_text_document' and has a non-empty stem before invoking preprocess_data.py. Fail fast with a clear error otherwise (preprocess_data.py always appends '_text_document' to --output-prefix, so other prefixes would silently produce files whose names mismatch the existence check). - Replace test Cases 3 and 4 (which masked the contract bug via a side_effect that lied about preprocess_data.py output) with negative cases that assert _preprocess() fails fast and never invokes run_command for invalid data_prefix values.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

- Collapse multi-line .format() call in _generate_dataset error message. - Add blank line before 'return _side_effect' in nested function in test.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

                if self._args.dataset_url:
                    self._raw_data_path = str(Path(self._args.data_home) / 'data.json')
                    download_file(self._args.dataset_url, self._raw_data_path)
+
+                    # Megatron's preprocess_data.py appends '_text_document' to --output-prefix
+                    # when producing the .bin/.idx files. For the existence check below
+                    # (which looks for {data_prefix}.bin/.idx) to pass, data_prefix must end
+                    # with '_text_document' and have a non-empty stem when generation is needed.
+                    suffix = '_text_document'
+                    if not self._args.data_prefix.endswith(suffix) or self._args.data_prefix == suffix:
+                        logger.error(
+                            'data_prefix must end with "{}" and have a non-empty stem when '
+                            'dataset generation is required (got "{}"). preprocess_data.py '
+                            'always appends "{}" to --output-prefix.'.format(suffix, self._args.data_prefix, suffix)
+                        )


+                    # num_workers=0 is valid for DataLoader (main process loads data),
+                    # but preprocess_data.py requires workers>=1 for multiprocessing.Pool.
+                    preprocess_workers = max(1, self._args.num_workers)
+                    if preprocess_workers != self._args.num_workers:
+                        logger.warning(
+                            'preprocess_data.py requires --workers >= 1; '
+                            'overriding num_workers={} to {} for dataset preprocessing only '
+                            '(DataLoader still uses num_workers={}).'.format(
+                                self._args.num_workers, preprocess_workers, self._args.num_workers
+                            )
+                        )


+        self.addCleanup(lambda: [p.unlink() for p in created_files if p.is_file()])
+


Fix the num_workers usage.

203a883

polarG requested a review from a team as a code owner April 1, 2026 06:07

Copilot AI review requested due to automatic review settings April 1, 2026 06:07

Copilot started reviewing on behalf of polarG April 1, 2026 06:08 View session

Copilot AI reviewed Apr 1, 2026

View reviewed changes

Comment thread superbench/benchmarks/model_benchmarks/megatron_gpt3.py

Comment thread superbench/benchmarks/model_benchmarks/megatron_gpt3.py Outdated

polarG self-assigned this Apr 17, 2026

polarG added enhancement New feature or request ROCm labels Apr 27, 2026

Copilot AI review requested due to automatic review settings April 27, 2026 22:02

Tighten test substring matches with trailing space

ee2839e

Avoid '--workers 1' matching '--workers 10' etc.

polarG force-pushed the dev/hongtaozhang/fix-num_workers-usage branch from a475b85 to ee2839e Compare April 27, 2026 22:03

Copilot AI reviewed Apr 27, 2026

View reviewed changes

Comment thread tests/benchmarks/model_benchmarks/test_megatron_gpt.py Outdated

Comment thread tests/benchmarks/model_benchmarks/test_megatron_gpt.py Outdated

Hongtao Zhang added 2 commits May 2, 2026 00:19

Copilot AI review requested due to automatic review settings May 3, 2026 05:05

Copilot started reviewing on behalf of polarG May 3, 2026 05:05 View session

Copilot AI reviewed May 3, 2026

View reviewed changes

Comment thread superbench/benchmarks/model_benchmarks/megatron_gpt3.py Outdated

Comment thread tests/benchmarks/model_benchmarks/test_megatron_gpt.py Outdated

polarG and others added 2 commits May 15, 2026 13:18

Merge branch 'main' into dev/hongtaozhang/fix-num_workers-usage

77854b1

Copilot AI review requested due to automatic review settings May 15, 2026 20:50

Copilot started reviewing on behalf of polarG May 15, 2026 20:50 View session

Copilot AI reviewed May 15, 2026

View reviewed changes

style: apply yapf 0.31.0 formatting (CI lint fix)

d580e6e

- Collapse multi-line .format() call in _generate_dataset error message. - Add blank line before 'return _side_effect' in nested function in test.

guoshzhao approved these changes May 26, 2026

View reviewed changes

WenqingLan1 approved these changes May 26, 2026

View reviewed changes

Merge branch 'main' into dev/hongtaozhang/fix-num_workers-usage

f786c64

Copilot AI review requested due to automatic review settings May 27, 2026 04:56

polarG enabled auto-merge (squash) May 27, 2026 04:56

Copilot started reviewing on behalf of polarG May 27, 2026 04:56 View session

Copilot AI reviewed May 27, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enhancement - Fix preprocessing crash when num_workers=0 in Megatron GPT3 dataset generation#800

Enhancement - Fix preprocessing crash when num_workers=0 in Megatron GPT3 dataset generation#800
polarG wants to merge 9 commits into
mainfrom
dev/hongtaozhang/fix-num_workers-usage

polarG commented Apr 1, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

codecov Bot commented Apr 27, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

		self.addCleanup(lambda: [p.unlink() for p in created_files if p.is_file()])

Conversation

polarG commented Apr 1, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

codecov Bot commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov Bot commented Apr 27, 2026 •

edited

Loading