Conversation

soffer-anyscale (Contributor)

Why are these changes needed?

This PR adds support for scalable XGBoost training on external memory datasets in Ray Train. It enables training on datasets larger than available RAM by combining XGBoost's ExtMemQuantileDMatrix with Ray Data streaming.

Key changes:

  • Created _external_memory_utils.py with shared utility functions for external memory DMatrix creation
  • Updated both V1 (ray.train.xgboost.XGBoostTrainer) and V2 (ray.train.v2.xgboost.XGBoostTrainer) trainers with external memory configuration parameters
  • Implemented a RayDatasetIterator (subclassing xgboost.DataIter) to stream batches from Ray Data; a sketch of this pattern follows the list
  • Included comprehensive input validation, error handling, and logging
  • Added extensive test coverage for external memory functionality
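
A minimal sketch of the streaming-iterator pattern referenced above. The class and parameter names here (RayDataBatchIter, batch_size, cache_dir) are illustrative assumptions, not the PR's actual implementation:

import xgboost
import ray.data


class RayDataBatchIter(xgboost.DataIter):
    """Stream Ray Data batches into XGBoost via the DataIter protocol."""

    def __init__(self, dataset: ray.data.Dataset, label_column: str, cache_dir: str):
        self._dataset = dataset
        self._label_column = label_column
        self._epoch_iter = None
        # cache_prefix tells XGBoost where to spill quantized pages on disk.
        super().__init__(cache_prefix=cache_dir)

    def next(self, input_data) -> bool:
        """Advance one batch; return False when the epoch is exhausted."""
        if self._epoch_iter is None:
            # iter_batches() streams blocks without materializing the dataset.
            self._epoch_iter = iter(
                self._dataset.iter_batches(batch_format="pandas", batch_size=4096)
            )
        try:
            batch = next(self._epoch_iter)
        except StopIteration:
            return False
        input_data(
            data=batch.drop(columns=[self._label_column]),
            label=batch[self._label_column],
        )
        return True

    def reset(self):
        """Called by XGBoost between passes over the data."""
        self._epoch_iter = None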

Benefits:

  • Enables training on datasets larger than RAM
  • Seamless integration with Ray Data for distributed data processing
  • Automatic configuration with sensible defaults
  • Support for both CPU and GPU training (see the usage sketch after this list)
  • Production-ready error handling and validation
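
A rough usage sketch of the CPU/GPU switch. Parameter conventions are XGBoost 2.x, RayDataBatchIter is the hypothetical iterator sketched above, and ExtMemQuantileDMatrix requires a recent XGBoost release:

# Build the external-memory DMatrix from the streaming iterator.
it = RayDataBatchIter(train_dataset, label_column="target", cache_dir="/tmp/xgb_cache")
dtrain = xgboost.ExtMemQuantileDMatrix(it, max_bin=256)

params = {
    "objective": "binary:logistic",
    "tree_method": "hist",  # external memory requires the hist tree method
    "device": "cpu",        # or "cuda" for GPU training
}
booster = xgboost.train(params, dtrain, num_boost_round=10)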

Related issue number

Restarts #55550, which was accidentally closed

Otherwise N/A: this is a new feature implementation.

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run pre-commit jobs to lint the changes in this PR. (pre-commit setup)
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>
- Add utility functions for external memory DMatrix creation
- Update V1 and V2 trainers with external memory configuration
- Add comprehensive tests for external memory functionality
- Include GPU support with RMM integration (sketched below)
- Add input validation and error handling

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>
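
A hedged sketch of what the RMM integration mentioned above typically looks like; rmm.reinitialize and XGBoost's use_rmm config are standard idioms, but whether this PR wires them up exactly this way is an assumption:

import rmm
import xgboost

# A pool allocator amortizes device-memory allocations during GPU training.
rmm.reinitialize(pool_allocator=True)

# Route XGBoost's device allocations through RMM within this block.
# dtrain: an ExtMemQuantileDMatrix built as in the earlier sketch.
with xgboost.config_context(use_rmm=True):
    booster = xgboost.train(
        {"tree_method": "hist", "device": "cuda"}, dtrain, num_boost_round=10
    )
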
@soffer-anyscale soffer-anyscale requested a review from a team as a code owner October 9, 2025 18:47
@gemini-code-assist (bot) left a comment

Code Review

This pull request adds comprehensive support for external memory training in both V1 and V2 XGBoost trainers, which is a great feature for handling large datasets. The implementation includes a shared utility module for creating external memory DMatrix objects, which is well-designed with robust error handling and logging. The changes also include extensive test coverage for the new functionality.

My review focuses on a few areas for improvement:

  • There's a critical bug in the V1 trainer's xgb.train call where parameters are being passed incorrectly.
  • One of the new tests for the V2 trainer seems to have an incorrect expectation for the type of error raised.
  • The new tests use hardcoded temporary paths, which could be improved by using pytest fixtures for better portability.

Overall, this is a solid contribution that significantly enhances XGBoost training capabilities in Ray Train.

Comment on lines 276 to 284
bst = xgb.train(
    config,
    dtrain=dtrain,
    evals=evals,
    num_boost_round=remaining_iters,
    xgb_model=starting_model,
    callbacks=[RayTrainReportCallback()],
    **xgboost_train_kwargs,
)

critical

The call to xgb.train here seems incorrect. The xgboost_train_kwargs argument contains the dictionary of XGBoost parameters, but it's being unpacked (**xgboost_train_kwargs) into keyword arguments. The xgb.train function expects the parameters dictionary as its first positional argument.

Additionally, config is passed as the first argument, but it will be an empty dictionary because the V1 XGBoostTrainer doesn't pass a train_loop_config to its parent.

The call should be changed to pass xgboost_train_kwargs as the first positional argument and not unpack it. Assuming xgboost_train_kwargs only contains XGBoost parameters, the fix would be:

        bst = xgb.train(
            xgboost_train_kwargs,
            dtrain=dtrain,
            evals=evals,
            num_boost_round=remaining_iters,
            xgb_model=starting_model,
            callbacks=[RayTrainReportCallback()],
        )

Comment on lines 94 to 820
with pytest.raises(DeprecationWarning):
    XGBoostTrainer.get_model(result.checkpoint)
trainer = XGBoostTrainer(
    train_fn_per_worker,
    label_column="target",
    params={"objective": "binary:logistic"},
    num_boost_round=5,
    scaling_config=ScalingConfig(num_workers=2),
    datasets={TRAIN_DATASET_KEY: train_dataset},
)

high

This test seems to be incorrect. The XGBoostTrainer __init__ signature has been updated to remove legacy parameters such as label_column and params when train_loop_per_worker is passed positionally. Calling XGBoostTrainer with these arguments will therefore raise a TypeError (unexpected keyword arguments), not a DeprecationWarning. The test should be updated to expect a TypeError, as sketched below.
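
A sketch of the corrected expectation (the test shape is hypothetical; names reuse the snippet above):

with pytest.raises(TypeError):
    # Legacy keywords are no longer accepted alongside a positional
    # train_loop_per_worker, so construction fails immediately.
    XGBoostTrainer(
        train_fn_per_worker,
        label_column="target",
        params={"objective": "binary:logistic"},
        num_boost_round=5,
        scaling_config=ScalingConfig(num_workers=2),
        datasets={TRAIN_DATASET_KEY: train_dataset},
    )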

    num_boost_round=10,
    datasets={TRAIN_DATASET_KEY: train_dataset, "valid": valid_dataset},
    use_external_memory=True,
    external_memory_cache_dir="/tmp/xgboost_v1_test_cache",

medium

It's better to use pytest's tmpdir or tmp_path fixtures for creating temporary directories in tests rather than hardcoding paths like /tmp/xgboost_v1_test_cache. This makes tests more robust and portable. You can add tmpdir to the test function signature and use it to create a temporary cache directory.

For example:

def test_external_memory_basic(ray_start_4_cpus, tmpdir):
    ...
    cache_dir = tmpdir.mkdir("xgboost_cache")
    trainer = XGBoostTrainer(
        ...
        external_memory_cache_dir=str(cache_dir),
        ...
    )
    ...

This applies to other new tests in this file as well (test_external_memory_with_large_dataset).

    scaling_config=ScalingConfig(num_workers=2),
    datasets={TRAIN_DATASET_KEY: train_dataset, "valid": valid_dataset},
    use_external_memory=True,
    external_memory_cache_dir="/tmp/xgboost_test_cache",

medium

It's better to use pytest's tmp_path fixture for creating temporary directories in tests rather than hardcoding paths like /tmp/xgboost_test_cache. This makes tests more robust and portable. You can add tmp_path to the test function signature and use it.

For example:

def test_xgboost_trainer_external_memory_basic(ray_start_4_cpus, small_dataset, tmp_path):
    ...
    cache_dir = tmp_path / "xgboost_cache"
    cache_dir.mkdir()
    trainer = XGBoostTrainer(
        ...
        external_memory_cache_dir=str(cache_dir),
        ...
    )
    ...

This applies to other new tests in this file as well (test_xgboost_trainer_external_memory_fallback_behavior, test_xgboost_trainer_run_config).

cursor[bot] left a comment (marked as outdated).

@ray-gardener ray-gardener bot added the docs, train, and data labels Oct 9, 2025
@soffer-anyscale soffer-anyscale changed the title from "Train xgboost scale" to "[Train] Add external memory datasets to xgboost trainers for better scalability" Oct 9, 2025
- Fix critical bug in V1 trainer xgb.train call
- Update V2 test to expect TypeError instead of DeprecationWarning
- Replace hardcoded temp paths with pytest fixtures

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>
cursor[bot] left a comment (marked as outdated).

- Replace recursive calls with while loop to prevent stack overflow
- Add MAX_EMPTY_BATCHES and MAX_ERROR_RETRIES constants
- Track empty batch count separately from error count (sketched below)
- Fix code-block documentation to use testcode directive
- Apply Black formatting fixes

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>
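
A sketch of the pattern this commit describes (the constants and method shape are assumptions, and a later commit removes this retry logic in favor of failing fast):

MAX_EMPTY_BATCHES = 100  # assumed bound on consecutive empty batches
MAX_ERROR_RETRIES = 3    # assumed bound on consecutive batch errors


class _Iter(xgboost.DataIter):  # __init__/reset as in the earlier sketch
    def next(self, input_data) -> bool:
        empty_batches = 0
        errors = 0
        # A while loop, rather than self.next(...) recursion, so long runs
        # of empty or failing batches cannot overflow the call stack.
        while True:
            try:
                batch = next(self._epoch_iter)
            except StopIteration:
                return False
            except Exception:
                errors += 1
                if errors > MAX_ERROR_RETRIES:
                    raise
                continue
            if len(batch) == 0:
                empty_batches += 1  # tracked separately from the error count
                if empty_batches > MAX_EMPTY_BATCHES:
                    raise RuntimeError("too many consecutive empty batches")
                continue
            input_data(
                data=batch.drop(columns=[self._label_column]),
                label=batch[self._label_column],
            )
            return True
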
cursor[bot] left a comment (marked as outdated).

- Remove unnecessary info/debug logging statements
- Keep only error and warning logs for serious issues
- Add URL references to XGBoost documentation for all config values
- Fix undefined variables in V2 XGBoostTrainer docstring example
- Add references to external memory, GPU, and parameter documentation

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>
Critical Fix:
- Replace QuantileDMatrix with ExtMemQuantileDMatrix
  QuantileDMatrix concatenates data in memory (wrong for external memory)
  ExtMemQuantileDMatrix fetches data on-demand (correct for external memory)
  Reference: https://xgboost.readthedocs.io/en/stable/tutorials/external_memory.html

Simplifications (following XGBoost patterns):
- Remove retry logic in iterator - fail fast on errors
- Remove empty batch handling loop - empty batches indicate data issues
- Remove error recovery with continue - makes debugging harder
- Remove unnecessary error/empty batch counters
- Remove verbose data type validation warnings

Ray Data Best Practices:
- Use iter_batches() for streaming execution (already correct)
- No materialization of entire dataset (already correct)
- Let Ray Data handle streaming logic (already correct)

Result: simpler, more maintainable code that follows the XGBoost docs (the DMatrix distinction is sketched below)
Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>
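
The distinction this commit draws, as a minimal sketch (it refers to the streaming DataIter from the earlier sketch; both classes accept an iterator):

# Concatenates the quantized batches in host memory: fine for in-memory
# training, but defeats the purpose of external memory.
dtrain = xgboost.QuantileDMatrix(it, max_bin=256)

# Writes quantized pages to the cache directory and fetches them on demand
# during training, so the full dataset never has to fit in RAM.
dtrain = xgboost.ExtMemQuantileDMatrix(it, max_bin=256)
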
Pre-commit auto-fixed whitespace issues in all test files

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>

Labels

data (Ray Data-related issues), docs (an issue or change related to documentation), train (Ray Train related issues)
