Conversation

soffer-anyscale (Contributor)

Why are these changes needed?

This PR adds support for scalable XGBoost training on external memory datasets in Ray Train. It enables training on datasets larger than available RAM by combining XGBoost's ExtMemQuantileDMatrix with Ray Data streaming.

Key changes:

  • Created _external_memory_utils.py with shared utility functions for external memory DMatrix creation
  • Updated both V1 (ray.train.xgboost.XGBoostTrainer) and V2 (ray.train.v2.xgboost.XGBoostTrainer) trainers with external memory configuration parameters
  • Implemented a RayDatasetIterator (subclassing xgboost.DataIter) to stream batches from Ray Data; a sketch of this pattern follows the list
  • Included comprehensive input validation, error handling, and logging
  • Added extensive test coverage for external memory functionality
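
A minimal sketch of the streaming-iterator pattern referenced above. The class and parameter names here (RayDataBatchIter, batch_size, cache_dir) are illustrative assumptions, not the PR's actual implementation:

import xgboost
import ray.data


class RayDataBatchIter(xgboost.DataIter):
    """Stream Ray Data batches into XGBoost via the DataIter protocol."""

    def __init__(self, dataset: ray.data.Dataset, label_column: str, cache_dir: str):
        self._dataset = dataset
        self._label_column = label_column
        self._epoch_iter = None
        # cache_prefix tells XGBoost where to spill quantized pages on disk.
        super().__init__(cache_prefix=cache_dir)

    def next(self, input_data) -> bool:
        """Advance one batch; return False when the epoch is exhausted."""
        if self._epoch_iter is None:
            # iter_batches() streams blocks without materializing the dataset.
            self._epoch_iter = iter(
                self._dataset.iter_batches(batch_format="pandas", batch_size=4096)
            )
        try:
            batch = next(self._epoch_iter)
        except StopIteration:
            return False
        input_data(
            data=batch.drop(columns=[self._label_column]),
            label=batch[self._label_column],
        )
        return True

    def reset(self):
        """Called by XGBoost between passes over the data."""
        self._epoch_iter = None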

Benefits:

  • Enables training on datasets larger than RAM
  • Seamless integration with Ray Data for distributed data processing
  • Automatic configuration with sensible defaults
  • Support for both CPU and GPU training (see the usage sketch after this list)
  • Production-ready error handling and validation
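
A rough usage sketch of the CPU/GPU switch. Parameter conventions are XGBoost 2.x, RayDataBatchIter is the hypothetical iterator sketched above, and ExtMemQuantileDMatrix requires a recent XGBoost release:

# Build the external-memory DMatrix from the streaming iterator.
it = RayDataBatchIter(train_dataset, label_column="target", cache_dir="/tmp/xgb_cache")
dtrain = xgboost.ExtMemQuantileDMatrix(it, max_bin=256)

params = {
    "objective": "binary:logistic",
    "tree_method": "hist",  # external memory requires the hist tree method
    "device": "cpu",        # or "cuda" for GPU training
}
booster = xgboost.train(params, dtrain, num_boost_round=10)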

Related issue number

Restarts #55550, which was accidentally closed

Otherwise N/A: this is a new feature implementation.

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run pre-commit jobs to lint the changes in this PR. (pre-commit setup)
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>
- Add utility functions for external memory DMatrix creation
- Update V1 and V2 trainers with external memory configuration
- Add comprehensive tests for external memory functionality
- Include GPU support with RMM integration (sketched below)
- Add input validation and error handling

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>
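
A hedged sketch of what the RMM integration mentioned above typically looks like; rmm.reinitialize and XGBoost's use_rmm config are standard idioms, but whether this PR wires them up exactly this way is an assumption:

import rmm
import xgboost

# A pool allocator amortizes device-memory allocations during GPU training.
rmm.reinitialize(pool_allocator=True)

# Route XGBoost's device allocations through RMM within this block.
# dtrain: an ExtMemQuantileDMatrix built as in the earlier sketch.
with xgboost.config_context(use_rmm=True):
    booster = xgboost.train(
        {"tree_method": "hist", "device": "cuda"}, dtrain, num_boost_round=10
    )
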
@soffer-anyscale soffer-anyscale requested a review from a team as a code owner October 9, 2025 18:47
@gemini-code-assist (bot) left a comment

Code Review

This pull request adds comprehensive support for external memory training in both V1 and V2 XGBoost trainers, which is a great feature for handling large datasets. The implementation includes a shared utility module for creating external memory DMatrix objects, which is well-designed with robust error handling and logging. The changes also include extensive test coverage for the new functionality.

My review focuses on a few areas for improvement:

  • There's a critical bug in the V1 trainer's xgb.train call where parameters are being passed incorrectly.
  • One of the new tests for the V2 trainer seems to have an incorrect expectation for the type of error raised.
  • The new tests use hardcoded temporary paths, which could be improved by using pytest fixtures for better portability.

Overall, this is a solid contribution that significantly enhances XGBoost training capabilities in Ray Train.

Comment on lines 276 to 284
bst = xgb.train(
    config,
    dtrain=dtrain,
    evals=evals,
    num_boost_round=remaining_iters,
    xgb_model=starting_model,
    callbacks=[RayTrainReportCallback()],
    **xgboost_train_kwargs,
)

critical

The call to xgb.train here seems incorrect. The xgboost_train_kwargs argument contains the dictionary of XGBoost parameters, but it's being unpacked (**xgboost_train_kwargs) into keyword arguments. The xgb.train function expects the parameters dictionary as its first positional argument.

Additionally, config is passed as the first argument, but it will be an empty dictionary because the V1 XGBoostTrainer doesn't pass a train_loop_config to its parent.

The call should be changed to pass xgboost_train_kwargs as the first positional argument and not unpack it. Assuming xgboost_train_kwargs only contains XGBoost parameters, the fix would be:

        bst = xgb.train(
            xgboost_train_kwargs,
            dtrain=dtrain,
            evals=evals,
            num_boost_round=remaining_iters,
            xgb_model=starting_model,
            callbacks=[RayTrainReportCallback()],
        )

Comment on lines 94 to 820
with pytest.raises(DeprecationWarning):
    XGBoostTrainer.get_model(result.checkpoint)
trainer = XGBoostTrainer(
    train_fn_per_worker,
    label_column="target",
    params={"objective": "binary:logistic"},
    num_boost_round=5,
    scaling_config=ScalingConfig(num_workers=2),
    datasets={TRAIN_DATASET_KEY: train_dataset},
)

high

This test seems to be incorrect. The XGBoostTrainer __init__ signature has been updated to remove legacy parameters such as label_column and params when train_loop_per_worker is passed positionally. Calling XGBoostTrainer with these arguments will therefore raise a TypeError (unexpected keyword arguments), not a DeprecationWarning. The test should be updated to expect a TypeError, as sketched below.
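
A sketch of the corrected expectation (the test shape is hypothetical; names reuse the snippet above):

with pytest.raises(TypeError):
    # Legacy keywords are no longer accepted alongside a positional
    # train_loop_per_worker, so construction fails immediately.
    XGBoostTrainer(
        train_fn_per_worker,
        label_column="target",
        params={"objective": "binary:logistic"},
        num_boost_round=5,
        scaling_config=ScalingConfig(num_workers=2),
        datasets={TRAIN_DATASET_KEY: train_dataset},
    )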

    num_boost_round=10,
    datasets={TRAIN_DATASET_KEY: train_dataset, "valid": valid_dataset},
    use_external_memory=True,
    external_memory_cache_dir="/tmp/xgboost_v1_test_cache",

medium

It's better to use pytest's tmpdir or tmp_path fixtures for creating temporary directories in tests rather than hardcoding paths like /tmp/xgboost_v1_test_cache. This makes tests more robust and portable. You can add tmpdir to the test function signature and use it to create a temporary cache directory.

For example:

def test_external_memory_basic(ray_start_4_cpus, tmpdir):
    ...
    cache_dir = tmpdir.mkdir("xgboost_cache")
    trainer = XGBoostTrainer(
        ...
        external_memory_cache_dir=str(cache_dir),
        ...
    )
    ...

This applies to other new tests in this file as well (test_external_memory_with_large_dataset).

    scaling_config=ScalingConfig(num_workers=2),
    datasets={TRAIN_DATASET_KEY: train_dataset, "valid": valid_dataset},
    use_external_memory=True,
    external_memory_cache_dir="/tmp/xgboost_test_cache",

medium

It's better to use pytest's tmp_path fixture for creating temporary directories in tests rather than hardcoding paths like /tmp/xgboost_test_cache. This makes tests more robust and portable. You can add tmp_path to the test function signature and use it.

For example:

def test_xgboost_trainer_external_memory_basic(ray_start_4_cpus, small_dataset, tmp_path):
    ...
    cache_dir = tmp_path / "xgboost_cache"
    cache_dir.mkdir()
    trainer = XGBoostTrainer(
        ...
        external_memory_cache_dir=str(cache_dir),
        ...
    )
    ...

This applies to other new tests in this file as well (test_xgboost_trainer_external_memory_fallback_behavior, test_xgboost_trainer_run_config).

cursor[bot] left a comment (marked as outdated).

@ray-gardener ray-gardener bot added the docs, train, and data labels Oct 9, 2025
@soffer-anyscale soffer-anyscale changed the title from "Train xgboost scale" to "[Train] Add external memory datasets to xgboost trainers for better scalability" Oct 9, 2025
- Fix critical bug in V1 trainer xgb.train call
- Update V2 test to expect TypeError instead of DeprecationWarning
- Replace hardcoded temp paths with pytest fixtures

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>
cursor[bot] left a comment (marked as outdated).

- Replace recursive calls with while loop to prevent stack overflow
- Add MAX_EMPTY_BATCHES and MAX_ERROR_RETRIES constants
- Track empty batch count separately from error count (sketched below)
- Fix code-block documentation to use testcode directive
- Apply Black formatting fixes

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>
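
A sketch of the pattern this commit describes (the constants and method shape are assumptions, and a later commit removes this retry logic in favor of failing fast):

MAX_EMPTY_BATCHES = 100  # assumed bound on consecutive empty batches
MAX_ERROR_RETRIES = 3    # assumed bound on consecutive batch errors


class _Iter(xgboost.DataIter):  # __init__/reset as in the earlier sketch
    def next(self, input_data) -> bool:
        empty_batches = 0
        errors = 0
        # A while loop, rather than self.next(...) recursion, so long runs
        # of empty or failing batches cannot overflow the call stack.
        while True:
            try:
                batch = next(self._epoch_iter)
            except StopIteration:
                return False
            except Exception:
                errors += 1
                if errors > MAX_ERROR_RETRIES:
                    raise
                continue
            if len(batch) == 0:
                empty_batches += 1  # tracked separately from the error count
                if empty_batches > MAX_EMPTY_BATCHES:
                    raise RuntimeError("too many consecutive empty batches")
                continue
            input_data(
                data=batch.drop(columns=[self._label_column]),
                label=batch[self._label_column],
            )
            return True
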
cursor[bot] left a comment (marked as outdated).

- Remove unnecessary info/debug logging statements
- Keep only error and warning logs for serious issues
- Add URL references to XGBoost documentation for all config values
- Fix undefined variables in V2 XGBoostTrainer docstring example
- Add references to external memory, GPU, and parameter documentation

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>
Critical Fix:
- Replace QuantileDMatrix with ExtMemQuantileDMatrix
  QuantileDMatrix concatenates data in memory (wrong for external memory)
  ExtMemQuantileDMatrix fetches data on-demand (correct for external memory)
  Reference: https://xgboost.readthedocs.io/en/stable/tutorials/external_memory.html

Simplifications (following XGBoost patterns):
- Remove retry logic in iterator - fail fast on errors
- Remove empty batch handling loop - empty batches indicate data issues
- Remove error recovery with continue - makes debugging harder
- Remove unnecessary error/empty batch counters
- Remove verbose data type validation warnings

Ray Data Best Practices:
- Use iter_batches() for streaming execution (already correct)
- No materialization of entire dataset (already correct)
- Let Ray Data handle streaming logic (already correct)

Result: simpler, more maintainable code that follows the XGBoost docs (the DMatrix distinction is sketched below)
Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>
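
The distinction this commit draws, as a minimal sketch (it refers to the streaming DataIter from the earlier sketch; both classes accept an iterator):

# Concatenates the quantized batches in host memory: fine for in-memory
# training, but defeats the purpose of external memory.
dtrain = xgboost.QuantileDMatrix(it, max_bin=256)

# Writes quantized pages to the cache directory and fetches them on demand
# during training, so the full dataset never has to fit in RAM.
dtrain = xgboost.ExtMemQuantileDMatrix(it, max_bin=256)
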
Pre-commit auto-fixed whitespace issues in all test files

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>

Labels

data (Ray Data-related issues), docs (an issue or change related to documentation), train (Ray Train related issues)
