Skip to content

Conversation

@Partth101
Copy link

This PR addresses intermittent test failures in the execution optimizer integration suite. The test test_from_pandas_refs_e2e previously assumed a deterministic row ordering when reading from multiple pandas references, which is not guaranteed by the Ray Data interface in a distributed execution environment.

Changes
Replaced manual tuple-list comparisons with the rows_same utility from ray.data._internal.util.

Refactored assertions to use ds.to_pandas() for robust, order-agnostic data validation.

Applied the fix to all three affected assertions within the test function.

Fixes 60553

@Partth101 Partth101 requested a review from a team as a code owner January 30, 2026 00:44
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively addresses the flakiness in test_from_pandas_refs_e2e by removing the assumption of deterministic row ordering. The switch to using the rows_same utility for order-agnostic comparison is a solid approach.

I've added one suggestion to cache the concatenated DataFrame to improve readability and avoid redundant computation, which mirrors the pattern in the original test code.

Overall, the changes are correct and improve test stability.

Comment on lines 66 to 76
df2 = pd.DataFrame({"one": [4, 5, 6], "two": ["e", "f", "g"]})

ds = ray.data.from_pandas_refs([ray.put(df1), ray.put(df2)])
values = [(r["one"], r["two"]) for r in ds.take(6)]
rows = [(r.one, r.two) for _, r in pd.concat([df1, df2]).iterrows()]
assert values == rows
assert rows_same(ds.to_pandas(), pd.concat([df1, df2]))
# Check that metadata fetch is included in stats.
assert "FromPandas" in ds.stats()
assert ds._plan._logical_plan.dag.name == "FromPandas"

# Test chaining multiple operations
ds2 = ds.map_batches(lambda x: x)
values = [(r["one"], r["two"]) for r in ds2.take(6)]
assert values == rows
assert rows_same(ds2.to_pandas(), pd.concat([df1, df2]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To improve readability and avoid re-computing the concatenated DataFrame, you can store pd.concat([df1, df2]) in a variable and reuse it for both assertions. This also aligns with the pattern in the original test code where the expected rows were computed once.

Suggested change
df2 = pd.DataFrame({"one": [4, 5, 6], "two": ["e", "f", "g"]})
ds = ray.data.from_pandas_refs([ray.put(df1), ray.put(df2)])
values = [(r["one"], r["two"]) for r in ds.take(6)]
rows = [(r.one, r.two) for _, r in pd.concat([df1, df2]).iterrows()]
assert values == rows
assert rows_same(ds.to_pandas(), pd.concat([df1, df2]))
# Check that metadata fetch is included in stats.
assert "FromPandas" in ds.stats()
assert ds._plan._logical_plan.dag.name == "FromPandas"
# Test chaining multiple operations
ds2 = ds.map_batches(lambda x: x)
values = [(r["one"], r["two"]) for r in ds2.take(6)]
assert values == rows
assert rows_same(ds2.to_pandas(), pd.concat([df1, df2]))
df2 = pd.DataFrame({"one": [4, 5, 6], "two": ["e", "f", "g"]})
expected_df = pd.concat([df1, df2])
ds = ray.data.from_pandas_refs([ray.put(df1), ray.put(df2)])
assert rows_same(ds.to_pandas(), expected_df)
# Check that metadata fetch is included in stats.
assert "FromPandas" in ds.stats()
assert ds._plan._logical_plan.dag.name == "FromPandas"
# Test chaining multiple operations
ds2 = ds.map_batches(lambda x: x)
assert rows_same(ds2.to_pandas(), expected_df)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have implemented this suggestion

@Partth101 Partth101 force-pushed the fix/deflake-pandas-refs-test branch 2 times, most recently from 81593de to d79cd36 Compare January 30, 2026 00:57
@ray-gardener ray-gardener bot added data Ray Data-related issues community-contribution Contributed by the community labels Jan 30, 2026
…sumptions

Replaced list-based equality assertions with the rows_same utility in test_from_pandas_refs_e2e to handle non-deterministic row ordering in distributed execution.

Refined the implementation based on reviewer feedback to cache the expected DataFrame in an expected_df variable, improving readability and avoiding redundant computation. This aligns the test with the project's standard testing patterns.

Signed-off-by: Parth Ghayal <parthmghayal@gmail.com>
@Partth101 Partth101 force-pushed the fix/deflake-pandas-refs-test branch from d79cd36 to 3858d57 Compare January 30, 2026 01:39
@Partth101
Copy link
Author

@bveeramani Can I get a review on this please?

Related Issue: 60553

Copy link
Member

@bveeramani bveeramani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

@bveeramani bveeramani enabled auto-merge (squash) January 30, 2026 19:01
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Jan 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community data Ray Data-related issues go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Data] Deflake test_from_pandas_refs_e2e by removing ordering assumptions

2 participants