[Train] Add external memory datasets to xgboost trainers for better scalability #57609
base: master
Conversation
- Add utility functions for external memory DMatrix creation
- Update V1 and V2 trainers with external memory configuration
- Add comprehensive tests for external memory functionality
- Include GPU support with RMM integration
- Add input validation and error handling

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>
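As a rough illustration of the configuration this commit describes, the sketch below shows how a caller might enable external memory on the V1 trainer. The use_external_memory and external_memory_cache_dir argument names are taken from the new tests later in this thread; the dataset path and other values are placeholders, and the final signature may differ.

import ray
from ray.train import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

# Placeholder dataset; datasets larger than worker memory are the target use case.
train_dataset = ray.data.read_parquet("s3://example-bucket/large-training-data/")

trainer = XGBoostTrainer(
    label_column="target",
    params={"objective": "binary:logistic", "tree_method": "hist"},
    num_boost_round=10,
    scaling_config=ScalingConfig(num_workers=4),
    datasets={"train": train_dataset},
    # Argument names assumed from the tests in this PR; defaults may differ.
    use_external_memory=True,
    external_memory_cache_dir="/mnt/cache/xgboost",
)
result = trainer.fit()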
Code Review
This pull request adds comprehensive support for external memory training in both V1 and V2 XGBoost trainers, which is a great feature for handling large datasets. The implementation includes a shared utility module for creating external memory DMatrix objects, which is well-designed with robust error handling and logging. The changes also include extensive test coverage for the new functionality.
My review focuses on a few areas for improvement:
- There's a critical bug in the V1 trainer's xgb.train call where parameters are being passed incorrectly.
- One of the new tests for the V2 trainer seems to have an incorrect expectation for the type of error raised.
- The new tests use hardcoded temporary paths, which could be improved by using pytest fixtures for better portability.
Overall, this is a solid contribution that significantly enhances XGBoost training capabilities in Ray Train.
bst = xgb.train(
    config,
    dtrain=dtrain,
    evals=evals,
    num_boost_round=remaining_iters,
    xgb_model=starting_model,
    callbacks=[RayTrainReportCallback()],
    **xgboost_train_kwargs,
)
The call to xgb.train here seems incorrect. The xgboost_train_kwargs argument contains the dictionary of XGBoost parameters, but it's being unpacked (**xgboost_train_kwargs) into keyword arguments. The xgb.train function expects the parameters dictionary as its first positional argument.

Additionally, config is passed as the first argument, but it will be an empty dictionary because the V1 XGBoostTrainer doesn't pass a train_loop_config to its parent.

The call should be changed to pass xgboost_train_kwargs as the first argument and not unpack it. Assuming params only contains XGBoost parameters, the fix would be:
bst = xgb.train(
xgboost_train_kwargs,
dtrain=dtrain,
evals=evals,
num_boost_round=remaining_iters,
xgb_model=starting_model,
callbacks=[RayTrainReportCallback()],
)
with pytest.raises(DeprecationWarning):
    XGBoostTrainer.get_model(result.checkpoint)
trainer = XGBoostTrainer(
    train_fn_per_worker,
    label_column="target",
    params={"objective": "binary:logistic"},
    num_boost_round=5,
    scaling_config=ScalingConfig(num_workers=2),
    datasets={TRAIN_DATASET_KEY: train_dataset},
)
This test seems to be incorrect. The XGBoostTrainer __init__ signature has been updated to remove legacy parameters such as label_column and params as keyword arguments when train_loop_per_worker is passed positionally. Therefore, calling XGBoostTrainer with these arguments will raise a TypeError because they are unexpected keyword arguments, not a DeprecationWarning. The test should be updated to expect a TypeError.
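For concreteness, a minimal sketch of the suggested assertion change follows; the constructor call mirrors the snippet above, and names like train_fn_per_worker, train_dataset, and TRAIN_DATASET_KEY are assumed to come from the surrounding test module rather than being defined here.

import pytest
from ray.train import ScalingConfig
from ray.train.v2.xgboost import XGBoostTrainer

def test_legacy_kwargs_raise_type_error(ray_start_4_cpus):
    # The removed legacy keyword arguments are now unexpected, so the
    # constructor should fail with TypeError rather than emit DeprecationWarning.
    with pytest.raises(TypeError):
        XGBoostTrainer(
            train_fn_per_worker,
            label_column="target",
            params={"objective": "binary:logistic"},
            num_boost_round=5,
            scaling_config=ScalingConfig(num_workers=2),
            datasets={TRAIN_DATASET_KEY: train_dataset},
        )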
    num_boost_round=10,
    datasets={TRAIN_DATASET_KEY: train_dataset, "valid": valid_dataset},
    use_external_memory=True,
    external_memory_cache_dir="/tmp/xgboost_v1_test_cache",
It's better to use pytest's tmpdir or tmp_path fixtures for creating temporary directories in tests rather than hardcoding paths like /tmp/xgboost_v1_test_cache. This makes tests more robust and portable. You can add tmpdir to the test function signature and use it to create a temporary cache directory.

For example:
def test_external_memory_basic(ray_start_4_cpus, tmpdir):
    ...
    cache_dir = tmpdir.mkdir("xgboost_cache")
    trainer = XGBoostTrainer(
        ...
        external_memory_cache_dir=str(cache_dir),
        ...
    )
    ...
This applies to other new tests in this file as well (test_external_memory_with_large_dataset).
    scaling_config=ScalingConfig(num_workers=2),
    datasets={TRAIN_DATASET_KEY: train_dataset, "valid": valid_dataset},
    use_external_memory=True,
    external_memory_cache_dir="/tmp/xgboost_test_cache",
It's better to use pytest's tmp_path fixture for creating temporary directories in tests rather than hardcoding paths like /tmp/xgboost_test_cache. This makes tests more robust and portable. You can add tmp_path to the test function signature and use it.

For example:
def test_xgboost_trainer_external_memory_basic(ray_start_4_cpus, small_dataset, tmp_path):
    ...
    cache_dir = tmp_path / "xgboost_cache"
    cache_dir.mkdir()
    trainer = XGBoostTrainer(
        ...
        external_memory_cache_dir=str(cache_dir),
        ...
    )
    ...
This applies to other new tests in this file as well (test_xgboost_trainer_external_memory_fallback_behavior, test_xgboost_trainer_run_config).
- Fix critical bug in V1 trainer xgb.train call
- Update V2 test to expect TypeError instead of DeprecationWarning
- Replace hardcoded temp paths with pytest fixtures

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>
- Replace recursive calls with while loop to prevent stack overflow
- Add MAX_EMPTY_BATCHES and MAX_ERROR_RETRIES constants
- Track empty batch count separately from error count
- Fix code-block documentation to use testcode directive
- Apply Black formatting fixes

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>
- Remove unnecessary info/debug logging statements
- Keep only error and warning logs for serious issues
- Add URL references to XGBoost documentation for all config values
- Fix undefined variables in V2 XGBoostTrainer docstring example
- Add references to external memory, GPU, and parameter documentation

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>
Critical Fix:
- Replace QuantileDMatrix with ExtMemQuantileDMatrix. QuantileDMatrix concatenates data in memory (wrong for external memory); ExtMemQuantileDMatrix fetches data on-demand (correct for external memory). Reference: https://xgboost.readthedocs.io/en/stable/tutorials/external_memory.html

Simplifications (following XGBoost patterns):
- Remove retry logic in iterator - fail fast on errors
- Remove empty batch handling loop - empty batches indicate data issues
- Remove error recovery with continue - makes debugging harder
- Remove unnecessary error/empty batch counters
- Remove verbose data type validation warnings

Ray Data Best Practices:
- Use iter_batches() for streaming execution (already correct)
- No materialization of entire dataset (already correct)
- Let Ray Data handle streaming logic (already correct)

Result: Simpler, more maintainable code that follows XGBoost docs

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>
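A rough sketch of the difference this commit describes, following the XGBoost external memory tutorial linked above. The iterator class and file paths here are illustrative and are not the PR's RayDatasetIterator; ExtMemQuantileDMatrix requires a recent XGBoost release and the hist tree method.

import os
import numpy as np
import xgboost as xgb

class NumpyBatchIter(xgb.DataIter):
    """Illustrative DataIter that yields pre-saved numpy batches from disk."""

    def __init__(self, batch_prefixes, cache_dir):
        self._batch_prefixes = batch_prefixes
        self._it = 0
        # cache_prefix tells XGBoost where to write its external-memory pages.
        super().__init__(cache_prefix=os.path.join(cache_dir, "cache"))

    def next(self, input_data):
        if self._it == len(self._batch_prefixes):
            return 0  # signal end of iteration
        prefix = self._batch_prefixes[self._it]
        input_data(data=np.load(prefix + "_X.npy"), label=np.load(prefix + "_y.npy"))
        self._it += 1
        return 1  # more batches remain

    def reset(self):
        self._it = 0

it = NumpyBatchIter(["/tmp/part0", "/tmp/part1"], cache_dir="/tmp/xgb_cache")

# QuantileDMatrix(it) would pull every batch through the iterator and build the
# quantized matrix fully in memory, defeating the purpose of external memory.
# ExtMemQuantileDMatrix keeps quantized pages on disk and fetches them on demand.
dtrain = xgb.ExtMemQuantileDMatrix(it)
bst = xgb.train({"tree_method": "hist", "objective": "reg:squarederror"}, dtrain, num_boost_round=10)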
Pre-commit auto-fixed whitespace issues in all test files

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>
Why are these changes needed?
This PR adds support for scalable XGBoost training using external memory datasets in Ray Train. This enables training on datasets that are larger than available RAM by leveraging XGBoost's QuantileDMatrix with Ray Data streaming.

Key changes:
- Add _external_memory_utils.py with shared utility functions for external memory DMatrix creation
- Update the V1 (ray.train.xgboost.XGBoostTrainer) and V2 (ray.train.v2.xgboost.XGBoostTrainer) trainers with external memory configuration parameters
- Add RayDatasetIterator, which implements xgboost.DataIter to stream batches from Ray Data
Benefits:
Related issue number
Restarts #55550, which was accidentally closed
N/A - New feature implementation
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.