[ci][deps] Generating dependency sets for Data CI images #61532
elliot-barn wants to merge 17 commits into master from
Conversation
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Code Review
This pull request refactors the data CI dependency management to use depsets and uv, which is a significant improvement for build reproducibility and speed. The changes are largely consistent and well-structured. I've identified a few minor areas for improvement: an unused build argument, a configuration that could be clarified with comments for better maintainability, and a duplicated dependency in the new requirements files.
requirements:
- python/requirements/ml/data-requirements-py313.txt
- python/requirements/ml/data-test-requirements-py313.txt
constraints:
- /tmp/ray-deps/requirements_compiled_py3.13.txt
output: python/deplocks/ci/data-base-ci_depset_py${PYTHON_VERSION}.lock
append_flags:
- --index https://download.pytorch.org/whl/cpu
- --python-version=${PYTHON_VERSION}
- --python-platform=linux
- --unsafe-package ray
build_arg_sets:
- py310
- py312
pre_hooks:
- ci/raydepsets/pre_hooks/remove-compiled-headers.sh 3.13
The configuration for data_base_ci_depset uses hardcoded paths and arguments with py313 and 3.13 (e.g., data-requirements-py313.txt, requirements_compiled_py3.13.txt, remove-compiled-headers.sh 3.13) for Python 3.10 and 3.12 builds. This is confusing and could be a maintenance burden. Please add a comment explaining why this is necessary and how it works. For example, explaining that these py313 files are a generic base that is then specialized for each target Python version by uv.
It's intentional to use the py313 dependencies.
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
@elliot-barn, this is a pretty big change, could you manually run a round of postmerge on this branch with
Ran the postmerge tests: https://buildkite.com/ray-project/postmerge/builds/16394
Same failures as master: https://buildkite.com/ray-project/postmerge/builds/16398
bveeramani left a comment
Blocking questions are about the Data diff and whether we're making any user-facing changes.
Trust you on the CI changes. Comments are mostly for my own understanding.
  # Test that unsupported pyarrow versions cause an error to be raised upon the
  # initial pyarrow use.
- ray.init(runtime_env={"pip": ["pyarrow==8.0.0"]})
+ ray.init(runtime_env={"pip": ["numpy==1.26.4", "pyarrow==8.0.0"]})
Can you add a comment describing why we need to add numpy here? It won't be obvious when people read this test in the future.
python/ray/data/_internal/util.py (outdated)
mem_size = datasource_or_legacy_reader.estimate_inmemory_data_size()
if (
    mem_size is not None
    # Guard against non-numeric types (e.g. numpy objects) that would cause
In what cases is mem_size a NumPy object? If so, will this change Ray Data's parallelism detection for some datasources?
From Claude:
mem_size can be a NumPy scalar in at least two concrete cases from built-in datasources:
- Parquet datasource (most common case)
In parquet_datasource.py:1078, the encoding ratio is computed as:
estimated_ratio = np.mean(estimated_encoding_ratios)
np.mean() returns a numpy.float64. This ratio is then used in _estimate_in_mem_size (line 735):
in_mem_size = sum([f.file_size for f in fragments]) * self._encoding_ratio
The result is a numpy.float64, which gets passed through round() — still returning a numpy scalar (numpy.int64 or numpy.float64 depending on numpy version). So estimate_inmemory_data_size() returns a numpy type.
- Image datasource
image_datasource.py:107 does total_size * self._encoding_ratio, where _encoding_ratio can also come from numpy arithmetic, yielding a numpy scalar.
Does the isinstance(mem_size, (int, float)) guard change parallelism detection?
For the cases above, no: numpy.float64 is a subclass of Python's float, so float-valued numpy scalars pass the check:
isinstance(np.float64(5.0), float)  # True
(Note that numpy.int64 is not a subclass of int on Python 3, so an integer numpy scalar would be rejected by the guard; the datasources above produce float scalars.)
The guard is there to protect against truly non-numeric types (e.g., if a datasource returned a string, a None-like object, or a numpy array rather than a scalar). A numpy array (e.g., np.array([100])) would fail isinstance(..., (int, float)) and would also cause np.isnan() to behave differently.
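A minimal, self-contained sketch of the guard being discussed (the helper name is hypothetical; the real check lives inline in python/ray/data/_internal/util.py):

```python
def is_numeric_mem_size(mem_size):
    # numpy.float64 subclasses Python float, so float-valued numpy scalars
    # pass this check; arrays, strings, and None fall through to the
    # default-parallelism path instead.
    return mem_size is not None and isinstance(mem_size, (int, float))

# Plain Python numbers pass the guard.
assert is_numeric_mem_size(1024)
assert is_numeric_mem_size(1024.5)

# Non-numeric values are rejected.
assert not is_numeric_mem_size(None)
assert not is_numeric_mem_size("1024")
assert not is_numeric_mem_size([1024])
```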
@pytest.mark.skipif(
    get_pyarrow_version() < parse_version("10.0.1"),
    reason="ArrowDtype requires pyarrow>=10.0.1",
)
I think we test PyArrow 9. How were these tests passing before?
Another case of forcing packages to work together. We currently use pylance 0.22.0 (confirmed here):
Because pylance==0.22.0 depends on pyarrow>=14 and you require pyarrow==9.0.0, we can conclude that your requirements and pylance==0.22.0 are incompatible.
And because you require pylance==0.22.0, we can conclude that your requirements are unsatisfiable.
When using the correct pyarrow v9-compatible version of pylance, I get this error: https://buildkite.com/ray-project/premerge/builds/61627#019cd46c-dd80-4499-9100-4938ddf68628/L1066-L1082
File "/rayci/python/ray/data/_internal/datasource/lance_datasink.py", line 38, in _write_fragment
from lance.fragment import DEFAULT_MAX_BYTES_PER_FILE, write_fragments
ModuleNotFoundError: No module named 'lance.fragment'
With our current implementation we can't support pyarrow v9 and any version of pylance. Options are to force-install pylance 0.22.0 on the image, update Ray Data to support both new and old versions of pylance, or just skip the tests on pyarrow v9.
@@ -961,6 +961,8 @@ py_test(
    name = "test_tf",
    size = "medium",
    srcs = ["tests/datasource/test_tf.py"],
Do the TF-related APIs fail with Keras 3? If so, will that be a problem for our users?
Yes, TF-related APIs need to be updated to work with Keras 3. We are already using Keras 2 here. Users will have to export TF_USE_LEGACY_KERAS=1 unless we do this on our end.
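A sketch of the user-facing workaround (assuming tf-keras is installed; the variable must be set before tensorflow is first imported):

```python
import os

# Must happen before `import tensorflow`; with tf-keras installed this makes
# recent TensorFlow fall back to the Keras 2 API instead of Keras 3.
os.environ["TF_USE_LEGACY_KERAS"] = "1"

assert os.environ["TF_USE_LEGACY_KERAS"] == "1"
```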
hudi
pylance
modin
Could you help me understand why these three packages in particular are included in the PyArrow requirements file? I don't think this will be obvious to other Data maintainers either.
Since we relax the base depset and remove the listed packages (https://github.com/ray-project/ray/pull/61532/changes#diff-2e5d83ae3273bce0fde971e891370a5c45dcc204528cf7d4a9d74cc3d077687aR37), we need to specify the packages again in the expanded depsets for each version of pyarrow.
# Convert paths to Python strings in case they're numpy strings
# (e.g., from np.array_split in file_based_datasource.py)
input_files = [str(p) for p in paths]
Nit: In the interest of handling errors near their source and validating inputs early, I think it might be better to perform this conversion when we do np.array_split.
Also, it feels a bit weird that DefaultFileMetadataProvider contains logic that's specific to FileBasedDatasource's implementation:
split_paths = [p.tolist() for p in np.array_split(paths, parallelism)]
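One way to apply the suggestion above is to convert back to native str at the split site rather than in the metadata provider (illustrative values; assumes numpy is available):

```python
import numpy as np

paths = ["a.csv", "b.csv", "c.csv", "d.csv", "e.csv"]
parallelism = 2

# np.array_split turns the list into an array of numpy str_ scalars;
# .tolist() converts each chunk's elements back to native Python str
# right where the numpy types are introduced.
split_paths = [chunk.tolist() for chunk in np.array_split(paths, parallelism)]

assert len(split_paths) == parallelism
assert all(type(p) is str for chunk in split_paths for p in chunk)
```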
import warnings

with warnings.catch_warnings(record=True) as record:
    warnings.simplefilter("always")
OOC, what happens if we continue to use pytest here?
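For context, a self-contained sketch of the warnings-capture pattern used in the diff (the warning-emitting function here is made up, not the actual test body):

```python
import warnings

def deprecated_api():
    warnings.warn("use the new API", DeprecationWarning)

with warnings.catch_warnings(record=True) as record:
    # "always" prevents the default once-per-location deduplication,
    # so repeated warnings are all captured in `record`.
    warnings.simplefilter("always")
    deprecated_api()
    deprecated_api()

assert len(record) == 2
assert issubclass(record[0].category, DeprecationWarning)
assert "new API" in str(record[0].message)
```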
- ci/raydepsets/pre_hooks/remove-compiled-headers.sh 3.13

- name: relaxed_data_ci_depset_${PYTHON_SHORT}
  operation: relax
I feel a bit confused about this relaxed thing. For my own understanding, what happens if we use the regular data_ci_depset for the PyArrow v9 depset?
A few things:
- The regular data_ci_depset doesn't have pyarrow v9; it has pyarrow v19.
- I can't downgrade packages in a lock file with uv pip compile (there are workarounds, but they're messy).
- I also can't compile pyarrow v9 with numpy > 2 and more recent versions of pandas.
- I need to relax the base data CI depset so it can be expanded with the appropriate pyarrow, pandas, datasets, numpy, pylance, and modin versions for each job.
Currently we have multiple pip install commands to satisfy requirements (example here & here). This isn't ideal, as we just override the existing packages and their dependencies without any guarantee that the packages we end up with on the image are compatible. This way we compile a complete set of all dependencies needed for the image before we install.
  # Older versions of Arrow Mongo require an older version of NumPy.
  pip install numpy==1.23.5
fi
uv pip install -r /home/ray/python_depset.lock --no-deps --system --index-strategy unsafe-best-match
pyarrow==23.0.1
hudi
pylance
modin
Missing datasets package in pyarrow-latest and pyarrow-nightly requirements
Medium Severity
The relaxed depset explicitly removes seven packages (pyarrow, numpy, datasets, hudi, pylance, pandas, modin) from the base lock file. The pyarrow-v9.txt correctly adds datasets back, but pyarrow-latest.txt and pyarrow-nightly.txt do not. The compiled data-pyarrow-latest lock file confirms datasets is absent. Since the datalbuild image (pyarrow-latest) now replaces the old databuild image—which did include datasets—any HuggingFace datasource tests running against datalbuild will be missing the datasets dependency.
Additional Locations (1)
Cursor Bugbot has reviewed your changes and found 2 potential issues.
There are 3 total unresolved issues (including 1 from previous review).
- --python-version=${PYTHON_VERSION}
- --python-platform=linux
- --unsafe-package ray
- --upgrade-package pyarrow
Pyarrow nightly depset missing PyTorch CPU index flag
Medium Severity
The data_pyarrow_nightly_depset append_flags are missing --index https://download.pytorch.org/whl/cpu, which all other expand depsets (data_pyarrow_latest_depset, data_pyarrow_v9_depset, data_mongo_depset) include. Since the nightly depset expands the same relaxed_data_ci_depset (which originates from the base depset including dl-cpu-requirements.txt), the PyTorch CPU index is likely needed for correct dependency resolution. Without it, uv pip compile may resolve GPU PyTorch variants from PyPI or fail to resolve torch packages entirely.
Additional Locations (1)
- --unsafe-package ray
build_arg_sets:
- py310
- py312
Relaxed depset has unused constraints and append_flags fields
Low Severity
The relaxed_data_ci_depset defines constraints and append_flags fields, but the relax method in cli.py only accepts source_depset, packages, name, and output. These fields are silently ignored during execution, which could be confusing to future maintainers who might expect them to have an effect on dependency resolution.


Adding depsets for data CI tests:
Data CI images:
Updating tests:
- Adding env = {"TF_USE_LEGACY_KERAS": "1"} to tensorflow-related tests. The tensorflow upgrade to 2.20.0 pulls in Keras 3.0 (which Ray is not currently compatible with). Setting TF_USE_LEGACY_KERAS=1 forces tensorflow to use Keras 2.0 via tf-keras (ref here).

Postmerge build: https://buildkite.com/ray-project/postmerge/builds/16394