Commit 739cca1

goutamvenkat-anyscale authored and elliot-barn committed
[data] - if column is empty skip the sampling step in pandas_block (#57740)
## Why are these changes needed?

If a pandas column is empty, skip the sampling step when estimating its size.

## Related issue number

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
1 parent 480d0bc commit 739cca1

2 files changed: +32 -0 lines changed

python/ray/data/_internal/pandas_block.py

Lines changed: 3 additions & 0 deletions
@@ -508,6 +508,9 @@ def get_deep_size(obj):
 
 # Determine the sample size based on max_sample_count
 sample_size = min(total_size, max_sample_count)
+# Skip size calculation for empty columns
+if sample_size == 0:
+    continue
 # Following codes can also handel case that sample_size == total_size
 sampled_data = self._table[column].sample(n=sample_size).values

python/ray/data/tests/test_pandas_block.py

Lines changed: 29 additions & 0 deletions
@@ -466,5 +466,34 @@ def test_iter_rows_with_na(ray_start_regular_shared):
     assert list(rows) == [{"col": None}]
 
 
+def test_empty_dataframe_with_object_columns(ray_start_regular_shared):
+    """Test that size_bytes handles empty DataFrames with object/string columns.
+
+    The warning log:
+    "Error calculating size for column 'parent': cannot call `vectorize`
+    on size 0 inputs unless `otypes` is set"
+    should not be logged in the presence of empty columns.
+    """
+    from unittest.mock import patch
+
+    # Create an empty DataFrame but with defined columns and dtypes
+    block = pd.DataFrame(
+        {
+            "parent": pd.Series([], dtype=object),
+            "child": pd.Series([], dtype="string"),
+            "data": pd.Series([], dtype=object),
+        }
+    )
+
+    block_accessor = PandasBlockAccessor.for_block(block)
+
+    # Check that NO warning is logged after calling size_bytes
+    with patch("ray.data._internal.pandas_block.logger.warning") as mock_warning:
+        bytes_size = block_accessor.size_bytes()
+        mock_warning.assert_not_called()
+
+    assert bytes_size >= 0
+
+
 if __name__ == "__main__":
     sys.exit(pytest.main(["-v", __file__]))
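
The test's "no warning logged" check can be tried without a Ray installation. The following is a minimal sketch of the same pattern using only the standard library; the logger name, `total_length`, and `run_check` are hypothetical and stand in for the module logger and `size_bytes`.

```python
import logging
from unittest.mock import patch

# Hypothetical logger; the real test patches the pandas_block module logger.
logger = logging.getLogger("size_estimator_demo")


def total_length(columns: dict) -> int:
    # Sum the lengths of all values, logging a warning instead of raising
    # when a column cannot be measured.
    total = 0
    for name, values in columns.items():
        if len(values) == 0:
            # Empty column: skip it so no measurement (and no warning) occurs.
            continue
        try:
            total += sum(len(v) for v in values)
        except Exception as exc:
            logger.warning("Error calculating size for column '%s': %s", name, exc)
    return total


def run_check() -> int:
    # Replace logger.warning with a mock for the duration of the call,
    # then assert that the code under test never invoked it.
    with patch.object(logger, "warning") as mock_warning:
        result = total_length({"parent": [], "child": ["a", "bc"]})
        mock_warning.assert_not_called()
    return result
```

Patching the `warning` method directly, as the test above does, keeps the assertion independent of logging configuration and handlers.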
