
[ci][deps] Generating dependency sets for Data CI images#61532

Open
elliot-barn wants to merge 17 commits into master from elliot-barn-data-ci-depsets

Conversation

@elliot-barn
Contributor

@elliot-barn elliot-barn commented Mar 5, 2026

Adding depsets for data ci tests:

  • Data base ci depset (to expand ray requirements with data & data test requirements)
  • Relaxed data ci depset (to remove pyarrow and other packages to be expanded for different versions of pyarrow)
  • Data pyarrow latest depset (test against pyarrow v23)
  • Data pyarrow v9 depset (test against pyarrow v9)
  • Data mongo depset (test against pyarrow mongo)
  • Data pyarrow nightly (test against pyarrow nightly releases)

Data ci images:

  • Depsets are installed on the image based on the new Docker build arg IMAGE_TYPE, which maps to the depset lock file name

Updating tests

  • adding env = {"TF_USE_LEGACY_KERAS": "1"} to tensorflow-related tests. The tensorflow upgrade to 2.20.0 pulls in keras 3.0, which ray is not currently compatible with. Setting TF_USE_LEGACY_KERAS=1 forces tensorflow to use keras 2 via tf-keras (ref here)
  • Skip lance tests for older pylance versions (<=0.3.19) due to incompatible lance API changes with pyarrow v9.0.0

Postmerge build: https://buildkite.com/ray-project/postmerge/builds/16394

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
@elliot-barn elliot-barn requested a review from a team as a code owner March 5, 2026 20:40
@elliot-barn elliot-barn added the go add ONLY when ready to merge, run all tests label Mar 5, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the data CI dependency management to use depsets and uv, which is a significant improvement for build reproducibility and speed. The changes are largely consistent and well-structured. I've identified a few minor areas for improvement: an unused build argument, a configuration that could be clarified with comments for better maintainability, and a duplicated dependency in the new requirements files.

Comment on lines +14 to +29
```yaml
requirements:
  - python/requirements/ml/data-requirements-py313.txt
  - python/requirements/ml/data-test-requirements-py313.txt
constraints:
  - /tmp/ray-deps/requirements_compiled_py3.13.txt
output: python/deplocks/ci/data-base-ci_depset_py${PYTHON_VERSION}.lock
append_flags:
  - --index https://download.pytorch.org/whl/cpu
  - --python-version=${PYTHON_VERSION}
  - --python-platform=linux
  - --unsafe-package ray
build_arg_sets:
  - py310
  - py312
pre_hooks:
  - ci/raydepsets/pre_hooks/remove-compiled-headers.sh 3.13
```

Severity: medium

The configuration for data_base_ci_depset uses hardcoded paths and arguments with py313 and 3.13 (e.g., data-requirements-py313.txt, requirements_compiled_py3.13.txt, remove-compiled-headers.sh 3.13) for Python 3.10 and 3.12 builds. This is confusing and could be a maintenance burden. Please add a comment explaining why this is necessary and how it works. For example, explaining that these py313 files are a generic base that is then specialized for each target Python version by uv.
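A sketch of the kind of comment being suggested (the wording is an assumption, not from the PR):

```yaml
# NOTE: the py313 requirement and constraint files are intentionally used as a
# generic base for all builds; uv re-resolves them for each target interpreter
# via --python-version=${PYTHON_VERSION}.
requirements:
  - python/requirements/ml/data-requirements-py313.txt
  - python/requirements/ml/data-test-requirements-py313.txt
```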

Contributor Author


Intentional to use the py313 dependencies.

@elliot-barn elliot-barn requested review from a team as code owners March 5, 2026 22:38
@ray-gardener ray-gardener bot added the devprod label Mar 6, 2026
Collaborator

@aslonnie aslonnie left a comment


dnr?

@elliot-barn elliot-barn changed the title [DNR][DNM][ci] Data ci depsets [ci] Data ci depsets Mar 9, 2026
@elliot-barn elliot-barn changed the title [ci] Data ci depsets [ci][deps] Generating dependency sets for Data CI images Mar 9, 2026
@aslonnie
Collaborator

@elliot-barn, this is a pretty big change. Could you manually run a round of postmerge on this branch with RAYCI_RUN_ALL_TESTS=1 set as an env var and see if anything major breaks?

@elliot-barn
Contributor Author

elliot-barn commented Mar 10, 2026


ran the postmerge tests: https://buildkite.com/ray-project/postmerge/builds/16394

same failures as master: https://buildkite.com/ray-project/postmerge/builds/16398

Member

@bveeramani bveeramani left a comment


Blocking questions are about the Data diff and whether we're making any user-facing changes.

Trust you on the CI changes. Comments are mostly for my own understanding

```diff
 # Test that unsupported pyarrow versions cause an error to be raised upon the
 # initial pyarrow use.
-ray.init(runtime_env={"pip": ["pyarrow==8.0.0"]})
+ray.init(runtime_env={"pip": ["numpy==1.26.4", "pyarrow==8.0.0"]})
```
Member


Can you add a comment describing why we need to add numpy here? It won't be obvious when people read this test in future

```python
mem_size = datasource_or_legacy_reader.estimate_inmemory_data_size()
if (
    mem_size is not None
    # Guard against non-numeric types (e.g. numpy objects) that would cause
```
Member


In what cases is mem_size a NumPy object? If so, will this change Ray Data's parallelism detection for some datasources?

Contributor Author


From claude:

mem_size can be a NumPy scalar in at least two concrete cases from built-in datasources:

1. Parquet datasource (most common case)

In parquet_datasource.py:1078, the encoding ratio is computed as:

```python
estimated_ratio = np.mean(estimated_encoding_ratios)
```

np.mean() returns a numpy.float64. This ratio is then used in _estimate_in_mem_size (line 735):

```python
in_mem_size = sum([f.file_size for f in fragments]) * self._encoding_ratio
```

The result is a numpy.float64, which gets passed through round(), still yielding a numpy scalar (numpy.int64 or numpy.float64 depending on the numpy version). So estimate_inmemory_data_size() returns a numpy type.

2. Image datasource

image_datasource.py:107 does total_size * self._encoding_ratio, where _encoding_ratio can also come from numpy arithmetic, yielding a numpy scalar.

Does the isinstance(mem_size, (int, float)) guard change parallelism detection?

numpy.float64 is a subclass of Python's float, so the guard passes for the float scalars these paths produce. (Note that numpy integer scalars are not subclasses of int on Python 3, so a numpy.int64 result would be rejected by the guard.) The guard is there to protect against truly non-numeric types (e.g., if a datasource returned a string, a None-like object, or a numpy array rather than a scalar). A numpy array (e.g., np.array([100])) would fail isinstance(..., (int, float)) and would also cause np.isnan() to behave differently.
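A standalone check of these isinstance semantics (illustrative only; the key nuance is that numpy float scalars subclass float, while numpy integer scalars do not subclass int on Python 3):

```python
import numpy as np

# numpy float scalars subclass Python's float, so the guard accepts them.
mem_size = np.mean([100.0, 200.0])  # numpy.float64
assert isinstance(mem_size, float)

# numpy integer scalars are NOT int subclasses on Python 3.
assert not isinstance(np.int64(5), int)

# A numpy array (rather than a scalar) fails the guard entirely.
assert not isinstance(np.array([100]), (int, float))
```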

Comment on lines +446 to +449
```python
@pytest.mark.skipif(
    get_pyarrow_version() < parse_version("10.0.1"),
    reason="ArrowDtype requires pyarrow>=10.0.1",
)
```
Member


I think we test PyArrow 9. How were these tests passing before?

Contributor Author


Another case of forcing packages to work together. We use pylance 0.22.0 currently (confirmed here):

```
Because pylance==0.22.0 depends on pyarrow>=14 and you require pyarrow==9.0.0, we can conclude that your requirements and pylance==0.22.0 are incompatible.
And because you require pylance==0.22.0, we can conclude that your requirements are unsatisfiable.
```

When using the correct pyarrow v9 compatible version of pylance, I get this error: https://buildkite.com/ray-project/premerge/builds/61627#019cd46c-dd80-4499-9100-4938ddf68628/L1066-L1082

```
File "/rayci/python/ray/data/_internal/datasource/lance_datasink.py", line 38, in _write_fragment
  from lance.fragment import DEFAULT_MAX_BYTES_PER_FILE, write_fragments
ModuleNotFoundError: No module named 'lance.fragment'
```

With our current implementation we can't support pyarrow v9 together with any version of pylance. The options are to force-install pylance 0.22.0 on the image, update Ray Data to support both new and old versions of pylance, or skip the tests on pyarrow v9.

```diff
@@ -961,6 +961,8 @@ py_test(
     name = "test_tf",
     size = "medium",
     srcs = ["tests/datasource/test_tf.py"],
```
Member


Do the TF-related APIs fail with Keras 3? If so, will that be a problem for our users?

Contributor Author


Yes, TF-related APIs need to be updated to work with Keras 3. We are already using Keras 2 here. Users will have to export TF_USE_LEGACY_KERAS=1 unless we do this on our end.
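For reference, the same effect in a plain Python process looks like this (a minimal sketch; it assumes the tf-keras package is installed, and the variable must be set before tensorflow is imported):

```python
import os

# Must be set before `import tensorflow`; with tf-keras installed, this makes
# tf.keras resolve to Keras 2 instead of Keras 3.
os.environ["TF_USE_LEGACY_KERAS"] = "1"

assert os.environ["TF_USE_LEGACY_KERAS"] == "1"
```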

Comment on lines +2 to +4
```
hudi
pylance
modin
```
Member


Could you help me understand why these three packages in particular are included in the PyArrow requirements file? Don't think this will be obvious to other Data maintainers either

Contributor Author


Since we relax the base depset and remove the listed packages (https://github.com/ray-project/ray/pull/61532/changes#diff-2e5d83ae3273bce0fde971e891370a5c45dcc204528cf7d4a9d74cc3d077687aR37), we need to specify the packages again in the expanded depsets for each version of pyarrow.

Comment on lines +156 to +158
```python
# Convert paths to Python strings in case they're numpy strings
# (e.g., from np.array_split in file_based_datasource.py)
input_files = [str(p) for p in paths]
```
Member


Nit: In the interest of handling errors near their source/validating inputs early, I think it might be better to perform this conversion early when we do np.array_split.

Also, feels a bit weird that DefaultFileMetadataProvider contains logic that's specific to FileBasedDatasource's implementation

```python
split_paths = [p.tolist() for p in np.array_split(paths, parallelism)]
```
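A standalone sketch of the behavior behind this suggestion (illustrative only): iterating a numpy string array yields numpy.str_ scalars, while .tolist() converts back to plain Python str:

```python
import numpy as np

paths = ["a.csv", "b.csv", "c.csv"]
splits = np.array_split(np.array(paths), 2)

# Iterating the numpy chunks yields numpy.str_ scalars (a str subclass,
# but not the exact built-in type).
assert all(type(p) is np.str_ for chunk in splits for p in chunk)

# .tolist() (like str(p)) converts back to plain Python str.
clean = [p for chunk in splits for p in chunk.tolist()]
assert all(type(p) is str for p in clean)
assert clean == paths
```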

Comment on lines +213 to +216
```python
import warnings

with warnings.catch_warnings(record=True) as record:
    warnings.simplefilter("always")
```
Member


OOC, what happens if we continue to use pytest here?

```yaml
  - ci/raydepsets/pre_hooks/remove-compiled-headers.sh 3.13

- name: relaxed_data_ci_depset_${PYTHON_SHORT}
  operation: relax
```
Member


I feel a bit confused about this relaxed thing. For my own understanding, what happens if we use the regular data_ci_depset for the PyArrow v9 depset?

Contributor Author


A few things:

  • The regular data_ci_depset doesn't have pyarrow v9; it has pyarrow v19.
  • I actually can't downgrade packages in a lock file with uv pip compile (there are workarounds, but they're messy).
  • I also can't compile pyarrow v9 with numpy > 2 and more recent versions of pandas.
  • I need to relax the base data ci depset so it can be expanded with the appropriate pyarrow, pandas, datasets, numpy, pylance, and modin versions for each job.

Currently we have multiple pip install commands to satisfy requirements (example here & here). This isn't ideal, as we just override the existing package and its dependencies without any guarantee that the packages we end up with on the image are compatible. This way we compile a complete set of all dependencies needed for the image before we install.

```shell
  # Older versions of Arrow Mongo require an older version of NumPy.
  pip install numpy==1.23.5
fi
uv pip install -r /home/ray/python_depset.lock --no-deps --system --index-strategy unsafe-best-match
```
Member


Nice

```
pyarrow==23.0.1
hudi
pylance
modin
```


Missing datasets package in pyarrow-latest and pyarrow-nightly requirements

Medium Severity

The relaxed depset explicitly removes seven packages (pyarrow, numpy, datasets, hudi, pylance, pandas, modin) from the base lock file. The pyarrow-v9.txt correctly adds datasets back, but pyarrow-latest.txt and pyarrow-nightly.txt do not. The compiled data-pyarrow-latest lock file confirms datasets is absent. Since the datalbuild image (pyarrow-latest) now replaces the old databuild image—which did include datasets—any HuggingFace datasource tests running against datalbuild will be missing the datasets dependency.


@elliot-barn elliot-barn requested a review from bveeramani March 16, 2026 20:05

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 3 total unresolved issues (including 1 from previous review).


```yaml
  - --python-version=${PYTHON_VERSION}
  - --python-platform=linux
  - --unsafe-package ray
  - --upgrade-package pyarrow
```


Pyarrow nightly depset missing PyTorch CPU index flag

Medium Severity

The data_pyarrow_nightly_depset append_flags are missing --index https://download.pytorch.org/whl/cpu, which all other expand depsets (data_pyarrow_latest_depset, data_pyarrow_v9_depset, data_mongo_depset) include. Since the nightly depset expands the same relaxed_data_ci_depset (which originates from the base depset including dl-cpu-requirements.txt), the PyTorch CPU index is likely needed for correct dependency resolution. Without it, uv pip compile may resolve GPU PyTorch variants from PyPI or fail to resolve torch packages entirely.
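A sketch of the suggested fix (the flag list is assumed from the sibling expand depsets described above, not taken from the PR):

```yaml
# data_pyarrow_nightly_depset
append_flags:
  - --index https://download.pytorch.org/whl/cpu  # match the other expand depsets
  - --python-version=${PYTHON_VERSION}
  - --python-platform=linux
  - --unsafe-package ray
  - --upgrade-package pyarrow
```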


```yaml
  - --unsafe-package ray
build_arg_sets:
  - py310
  - py312
```


Relaxed depset has unused constraints and append_flags fields

Low Severity

The relaxed_data_ci_depset defines constraints and append_flags fields, but the relax method in cli.py only accepts source_depset, packages, name, and output. These fields are silently ignored during execution, which could be confusing to future maintainers who might expect them to have an effect on dependency resolution.

