[ci][deps] Generating dependency sets for Data CI images #61532
elliot-barn wants to merge 17 commits into master from
Conversation
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Code Review
This pull request refactors the data CI dependency management to use depsets and uv, which is a significant improvement for build reproducibility and speed. The changes are largely consistent and well-structured. I've identified a few minor areas for improvement: an unused build argument, a configuration that could be clarified with comments for better maintainability, and a duplicated dependency in the new requirements files.
requirements:
- python/requirements/ml/data-requirements-py313.txt
- python/requirements/ml/data-test-requirements-py313.txt
constraints:
- /tmp/ray-deps/requirements_compiled_py3.13.txt
output: python/deplocks/ci/data-base-ci_depset_py${PYTHON_VERSION}.lock
append_flags:
- --index https://download.pytorch.org/whl/cpu
- --python-version=${PYTHON_VERSION}
- --python-platform=linux
- --unsafe-package ray
build_arg_sets:
- py310
- py312
pre_hooks:
- ci/raydepsets/pre_hooks/remove-compiled-headers.sh 3.13
The configuration for data_base_ci_depset uses hardcoded paths and arguments with py313 and 3.13 (e.g., data-requirements-py313.txt, requirements_compiled_py3.13.txt, remove-compiled-headers.sh 3.13) for Python 3.10 and 3.12 builds. This is confusing and could be a maintenance burden. Please add a comment explaining why this is necessary and how it works. For example, explaining that these py313 files are a generic base that is then specialized for each target Python version by uv.
It's intentional to use the py313 dependencies.
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
@elliot-barn, this is a pretty big change, could you manually run a round of postmerge on this branch with
Ran the postmerge tests: https://buildkite.com/ray-project/postmerge/builds/16394
Same failures as master: https://buildkite.com/ray-project/postmerge/builds/16398
bveeramani left a comment
Blocking questions are about the Data diff and whether we're making any user-facing changes.
Trust you on the CI changes. Comments are mostly for my own understanding.
  # Test that unsupported pyarrow versions cause an error to be raised upon the
  # initial pyarrow use.
- ray.init(runtime_env={"pip": ["pyarrow==8.0.0"]})
+ ray.init(runtime_env={"pip": ["numpy==1.26.4", "pyarrow==8.0.0"]})
Can you add a comment describing why we need to add numpy here? It won't be obvious when people read this test in the future.
python/ray/data/_internal/util.py (outdated)
mem_size = datasource_or_legacy_reader.estimate_inmemory_data_size()
if (
    mem_size is not None
    # Guard against non-numeric types (e.g. numpy objects) that would cause
In what cases is mem_size a NumPy object? If so, will this change Ray Data's parallelism detection for some datasources?
From Claude:
mem_size can be a NumPy scalar in at least two concrete cases from built-in datasources:
- Parquet datasource (most common case)
In parquet_datasource.py:1078, the encoding ratio is computed as:
estimated_ratio = np.mean(estimated_encoding_ratios)
np.mean() returns a numpy.float64. This ratio is then used in _estimate_in_mem_size (line 735):
in_mem_size = sum([f.file_size for f in fragments]) * self._encoding_ratio
The result is a numpy.float64, which gets passed through round() — still returning a numpy scalar (numpy.int64 or numpy.float64 depending on numpy version). So estimate_inmemory_data_size() returns a numpy type.
- Image datasource
image_datasource.py:107 does total_size * self._encoding_ratio, where _encoding_ratio can also come from numpy arithmetic, yielding a numpy scalar.
Does the isinstance(mem_size, (int, float)) guard change parallelism detection?
For the cases above, no: numpy.float64 is a subclass of Python's float, so float-valued numpy scalars pass the check:
isinstance(np.float64(5.0), float)  # True
(Note that numpy.int64 is not a subclass of int on Python 3, so an integer numpy scalar would be rejected by the guard; the datasources above produce float scalars.)
The guard is there to protect against truly non-numeric types (e.g., if a datasource returned a string, a None-like object, or a numpy array rather than a scalar). A numpy array (e.g., np.array([100])) would fail isinstance(..., (int, float)) and would also cause np.isnan() to behave differently.
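A minimal, self-contained sketch of the guard being discussed (the helper name is hypothetical; the real check lives inline in python/ray/data/_internal/util.py):

```python
def is_numeric_mem_size(mem_size):
    # numpy.float64 subclasses Python float, so float-valued numpy scalars
    # pass this check; arrays, strings, and None fall through to the
    # default-parallelism path instead.
    return mem_size is not None and isinstance(mem_size, (int, float))

# Plain Python numbers pass the guard.
assert is_numeric_mem_size(1024)
assert is_numeric_mem_size(1024.5)

# Non-numeric values are rejected.
assert not is_numeric_mem_size(None)
assert not is_numeric_mem_size("1024")
assert not is_numeric_mem_size([1024])
```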
@pytest.mark.skipif(
    get_pyarrow_version() < parse_version("10.0.1"),
    reason="ArrowDtype requires pyarrow>=10.0.1",
)
I think we test PyArrow 9. How were these tests passing before?
Another case of forcing packages to work together. We currently use pylance 0.22.0 (confirmed here):
Because pylance==0.22.0 depends on pyarrow>=14 and you require pyarrow==9.0.0, we can conclude that your requirements and pylance==0.22.0 are incompatible.
And because you require pylance==0.22.0, we can conclude that your requirements are unsatisfiable.
When using the correct pyarrow v9-compatible version of pylance, I get this error: https://buildkite.com/ray-project/premerge/builds/61627#019cd46c-dd80-4499-9100-4938ddf68628/L1066-L1082
File "/rayci/python/ray/data/_internal/datasource/lance_datasink.py", line 38, in _write_fragment
from lance.fragment import DEFAULT_MAX_BYTES_PER_FILE, write_fragments
ModuleNotFoundError: No module named 'lance.fragment'
With our current implementation we can't support pyarrow v9 and any version of pylance. Options are to force-install pylance 0.22.0 on the image, update Ray Data to support both new and old versions of pylance, or just skip the tests on pyarrow v9.
@@ -961,6 +961,8 @@ py_test(
    name = "test_tf",
    size = "medium",
    srcs = ["tests/datasource/test_tf.py"],
Do the TF-related APIs fail with Keras 3? If so, will that be a problem for our users?
Yes, TF-related APIs need to be updated to work with Keras 3. We are already using Keras 2 here. Users will have to export TF_USE_LEGACY_KERAS=1 unless we do this on our end.
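A sketch of the user-facing workaround (assuming tf-keras is installed; the variable must be set before tensorflow is first imported):

```python
import os

# Must happen before `import tensorflow`; with tf-keras installed this makes
# recent TensorFlow fall back to the Keras 2 API instead of Keras 3.
os.environ["TF_USE_LEGACY_KERAS"] = "1"

assert os.environ["TF_USE_LEGACY_KERAS"] == "1"
```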
hudi
pylance
modin
Could you help me understand why these three packages in particular are included in the PyArrow requirements file? I don't think this will be obvious to other Data maintainers either.
Since we relax the base depset and remove the listed packages (https://github.com/ray-project/ray/pull/61532/changes#diff-2e5d83ae3273bce0fde971e891370a5c45dcc204528cf7d4a9d74cc3d077687aR37), we need to specify the packages again in the expanded depsets for each version of pyarrow.
# Convert paths to Python strings in case they're numpy strings
# (e.g., from np.array_split in file_based_datasource.py)
input_files = [str(p) for p in paths]
Nit: In the interest of handling errors near their source and validating inputs early, I think it might be better to perform this conversion when we do np.array_split.
Also, it feels a bit weird that DefaultFileMetadataProvider contains logic that's specific to FileBasedDatasource's implementation:
split_paths = [p.tolist() for p in np.array_split(paths, parallelism)]
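One way to apply the suggestion above is to convert back to native str at the split site rather than in the metadata provider (illustrative values; assumes numpy is available):

```python
import numpy as np

paths = ["a.csv", "b.csv", "c.csv", "d.csv", "e.csv"]
parallelism = 2

# np.array_split turns the list into an array of numpy str_ scalars;
# .tolist() converts each chunk's elements back to native Python str
# right where the numpy types are introduced.
split_paths = [chunk.tolist() for chunk in np.array_split(paths, parallelism)]

assert len(split_paths) == parallelism
assert all(type(p) is str for chunk in split_paths for p in chunk)
```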
import warnings

with warnings.catch_warnings(record=True) as record:
    warnings.simplefilter("always")
OOC, what happens if we continue to use pytest here?
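For context, a self-contained sketch of the warnings-capture pattern used in the diff (the warning-emitting function here is made up, not the actual test body):

```python
import warnings

def deprecated_api():
    warnings.warn("use the new API", DeprecationWarning)

with warnings.catch_warnings(record=True) as record:
    # "always" prevents the default once-per-location deduplication,
    # so repeated warnings are all captured in `record`.
    warnings.simplefilter("always")
    deprecated_api()
    deprecated_api()

assert len(record) == 2
assert issubclass(record[0].category, DeprecationWarning)
assert "new API" in str(record[0].message)
```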
- ci/raydepsets/pre_hooks/remove-compiled-headers.sh 3.13

- name: relaxed_data_ci_depset_${PYTHON_SHORT}
  operation: relax
I feel a bit confused about this relaxed thing. For my own understanding, what happens if we use the regular data_ci_depset for the PyArrow v9 depset?
A few things:
- The regular data_ci_depset doesn't have pyarrow v9; it has pyarrow v19.
- I can't downgrade packages in a lock file with uv pip compile (there are workarounds, but they're messy).
- I also can't compile pyarrow v9 with numpy > 2 and more recent versions of pandas.
- I need to relax the base data CI depset so it can be expanded with the appropriate pyarrow, pandas, datasets, numpy, pylance, and modin versions for each job.
Currently we have multiple pip install commands to satisfy requirements (example here & here). This isn't ideal, as we just override the existing packages and their dependencies without any guarantee that the packages we end up with on the image are compatible. This way we compile a complete set of all dependencies needed for the image before we install.
  # Older versions of Arrow Mongo require an older version of NumPy.
  pip install numpy==1.23.5
fi
uv pip install -r /home/ray/python_depset.lock --no-deps --system --index-strategy unsafe-best-match
pyarrow==23.0.1
hudi
pylance
modin
Missing datasets package in pyarrow-latest and pyarrow-nightly requirements
Medium Severity
The relaxed depset explicitly removes seven packages (pyarrow, numpy, datasets, hudi, pylance, pandas, modin) from the base lock file. The pyarrow-v9.txt correctly adds datasets back, but pyarrow-latest.txt and pyarrow-nightly.txt do not. The compiled data-pyarrow-latest lock file confirms datasets is absent. Since the datalbuild image (pyarrow-latest) now replaces the old databuild image—which did include datasets—any HuggingFace datasource tests running against datalbuild will be missing the datasets dependency.
Additional Locations (1)
Cursor Bugbot has reviewed your changes and found 2 potential issues.
There are 3 total unresolved issues (including 1 from previous review).
- --python-version=${PYTHON_VERSION}
- --python-platform=linux
- --unsafe-package ray
- --upgrade-package pyarrow
Pyarrow nightly depset missing PyTorch CPU index flag
Medium Severity
The data_pyarrow_nightly_depset append_flags are missing --index https://download.pytorch.org/whl/cpu, which all other expand depsets (data_pyarrow_latest_depset, data_pyarrow_v9_depset, data_mongo_depset) include. Since the nightly depset expands the same relaxed_data_ci_depset (which originates from the base depset including dl-cpu-requirements.txt), the PyTorch CPU index is likely needed for correct dependency resolution. Without it, uv pip compile may resolve GPU PyTorch variants from PyPI or fail to resolve torch packages entirely.
Additional Locations (1)
- --unsafe-package ray
build_arg_sets:
- py310
- py312
Relaxed depset has unused constraints and append_flags fields
Low Severity
The relaxed_data_ci_depset defines constraints and append_flags fields, but the relax method in cli.py only accepts source_depset, packages, name, and output. These fields are silently ignored during execution, which could be confusing to future maintainers who might expect them to have an effect on dependency resolution.


Adding depsets for data CI tests:
Data CI images:
Updating tests:
- Adding env = {"TF_USE_LEGACY_KERAS": "1"} to tensorflow-related tests. The tensorflow upgrade to 2.20.0 pulls in Keras 3.0 (which Ray is not currently compatible with). Setting TF_USE_LEGACY_KERAS=1 forces tensorflow to use Keras 2.0 via tf-keras (ref here).

Postmerge build: https://buildkite.com/ray-project/postmerge/builds/16394