[Data] Deflake test_from_pandas_refs_e2e by removing ordering assumptions #60595
Conversation
Code Review
This pull request effectively addresses the flakiness in test_from_pandas_refs_e2e by removing the assumption of deterministic row ordering. The switch to using the rows_same utility for order-agnostic comparison is a solid approach.
I've added one suggestion to cache the concatenated DataFrame to improve readability and avoid redundant computation, which mirrors the pattern in the original test code.
Overall, the changes are correct and improve test stability.
```python
df2 = pd.DataFrame({"one": [4, 5, 6], "two": ["e", "f", "g"]})

ds = ray.data.from_pandas_refs([ray.put(df1), ray.put(df2)])
values = [(r["one"], r["two"]) for r in ds.take(6)]
rows = [(r.one, r.two) for _, r in pd.concat([df1, df2]).iterrows()]
assert values == rows
assert rows_same(ds.to_pandas(), pd.concat([df1, df2]))
# Check that metadata fetch is included in stats.
assert "FromPandas" in ds.stats()
assert ds._plan._logical_plan.dag.name == "FromPandas"

# Test chaining multiple operations
ds2 = ds.map_batches(lambda x: x)
values = [(r["one"], r["two"]) for r in ds2.take(6)]
assert values == rows
assert rows_same(ds2.to_pandas(), pd.concat([df1, df2]))
```
To improve readability and avoid re-computing the concatenated DataFrame, you can store pd.concat([df1, df2]) in a variable and reuse it for both assertions. This also aligns with the pattern in the original test code where the expected rows were computed once.
Suggested change:

```diff
-df2 = pd.DataFrame({"one": [4, 5, 6], "two": ["e", "f", "g"]})
-ds = ray.data.from_pandas_refs([ray.put(df1), ray.put(df2)])
-values = [(r["one"], r["two"]) for r in ds.take(6)]
-rows = [(r.one, r.two) for _, r in pd.concat([df1, df2]).iterrows()]
-assert values == rows
-assert rows_same(ds.to_pandas(), pd.concat([df1, df2]))
-# Check that metadata fetch is included in stats.
-assert "FromPandas" in ds.stats()
-assert ds._plan._logical_plan.dag.name == "FromPandas"
-# Test chaining multiple operations
-ds2 = ds.map_batches(lambda x: x)
-values = [(r["one"], r["two"]) for r in ds2.take(6)]
-assert values == rows
-assert rows_same(ds2.to_pandas(), pd.concat([df1, df2]))
+df2 = pd.DataFrame({"one": [4, 5, 6], "two": ["e", "f", "g"]})
+expected_df = pd.concat([df1, df2])
+ds = ray.data.from_pandas_refs([ray.put(df1), ray.put(df2)])
+assert rows_same(ds.to_pandas(), expected_df)
+# Check that metadata fetch is included in stats.
+assert "FromPandas" in ds.stats()
+assert ds._plan._logical_plan.dag.name == "FromPandas"
+# Test chaining multiple operations
+ds2 = ds.map_batches(lambda x: x)
+assert rows_same(ds2.to_pandas(), expected_df)
```
I have implemented this suggestion
Force-pushed from 81593de to d79cd36
…sumptions

Replaced list-based equality assertions with the rows_same utility in test_from_pandas_refs_e2e to handle non-deterministic row ordering in distributed execution. Refined the implementation based on reviewer feedback to cache the expected DataFrame in an expected_df variable, improving readability and avoiding redundant computation. This aligns the test with the project's standard testing patterns.

Signed-off-by: Parth Ghayal <parthmghayal@gmail.com>
Force-pushed from d79cd36 to 3858d57
@bveeramani Can I get a review on this please? Related Issue: 60553
bveeramani
left a comment
Nice
This PR addresses intermittent test failures in the execution optimizer integration suite. The test test_from_pandas_refs_e2e previously assumed a deterministic row ordering when reading from multiple pandas references, which is not guaranteed by the Ray Data interface in a distributed execution environment.
Changes
Replaced manual tuple-list comparisons with the rows_same utility from ray.data._internal.util.
Refactored assertions to use ds.to_pandas() for robust, order-agnostic data validation (see the sketch below).
Applied the fix to all three affected assertions within the test function.
Fixes 60553
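For readers unfamiliar with order-agnostic comparison, here is a minimal sketch of what such a check can look like. The helper name rows_same_sketch, the contents of df1, and the sorting-based comparison are assumptions made for illustration only; the actual rows_same utility referenced in this PR may be implemented differently.

```python
import pandas as pd

def rows_same_sketch(actual: pd.DataFrame, expected: pd.DataFrame) -> bool:
    # Hypothetical helper: compare two DataFrames while ignoring row order.
    # The real rows_same utility used by the test may differ.
    columns = sorted(expected.columns)
    actual_sorted = actual[columns].sort_values(by=columns).reset_index(drop=True)
    expected_sorted = expected[columns].sort_values(by=columns).reset_index(drop=True)
    return actual_sorted.equals(expected_sorted)

# df1's contents are assumed for illustration; df2 matches the test snippet above.
df1 = pd.DataFrame({"one": [1, 2, 3], "two": ["a", "b", "c"]})
df2 = pd.DataFrame({"one": [4, 5, 6], "two": ["e", "f", "g"]})
expected_df = pd.concat([df1, df2])

# A shuffled result still passes, because only row content (not order) is compared.
shuffled = expected_df.sample(frac=1, random_state=0)
assert rows_same_sketch(shuffled, expected_df)
```

This is why the test can compare ds.to_pandas() against expected_df even when the underlying blocks are produced in a different order across runs.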