[data] Data doesn't account for object store memory from pandas batch formats #48506
Closed
Description
While it's true that this is no longer an issue when the blocks are Arrow tables, you'll still run into it when the blocks are pandas DataFrames. This can happen if you use the "pandas" batch format explicitly, or if you use APIs like drop_columns that use the "pandas" batch format under the hood.
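For example, a pipeline like the following ends up with pandas blocks even though it never requests the "pandas" batch format itself. This is a minimal sketch: the input rows and the "extra" column name are made up for illustration, and only drop_columns is taken from the description above.

import ray

# Hypothetical input: eight rows, each with a small binary payload and a
# throwaway "extra" column.
ds = ray.data.from_items(
    [{"data": b"\x00" * 1024, "extra": i} for i in range(8)]
)

# drop_columns is one of the APIs that uses the "pandas" batch format
# under the hood, so the resulting blocks are pandas blocks.
ds = ds.drop_columns(["extra"])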
Here's a simple repro:
import ray

def generate_data(batch):
    # Each yielded batch holds a single ~128 MiB binary payload.
    for _ in range(8):
        yield {"data": [[b"\x00" * 128 * 1024 * 1024]]}

ds = (
    ray.data.range(1, override_num_blocks=1)
    .map_batches(generate_data, batch_size=1)
    .map_batches(lambda batch: batch, batch_format=...)  # "pandas" or "pyarrow"
)

for bundle in ds.iter_internal_ref_bundles():
    print(f"num_rows={bundle.num_rows()} size_bytes={bundle.size_bytes()}")
Output with pandas:
num_rows=8 size_bytes=192
Output with PyArrow:
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
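The 192-byte figure is consistent with a shallow pandas memory estimate: the index plus eight 8-byte object pointers, with none of the buffers those pointers reference. The sketch below is not Ray Data's actual size estimator; it's a standalone pandas comparison, assuming a DataFrame shaped like the repro's batches, that shows how the nested payloads disappear from the accounting.

import sys
import pandas as pd

# Eight rows, each holding its own ~128 MiB payload, mirroring the batches
# yielded by generate_data in the repro.
df = pd.DataFrame({"data": [[b"\x00" * 128 * 1024 * 1024] for _ in range(8)]})

# Shallow accounting counts the index plus eight 8-byte object pointers,
# roughly the 192 bytes reported for the pandas blocks above.
print(df.memory_usage(index=True, deep=False).sum())

# deep=True only adds sys.getsizeof() of each cell (a small list wrapper);
# it still never reaches the 128 MiB bytes objects nested inside.
print(df.memory_usage(index=True, deep=True).sum())

# What the rows actually reference: roughly 8 * (128 MiB + bytes overhead).
print(sum(sys.getsizeof(cell[0]) for cell in df["data"]))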
Originally posted by @bveeramani in #44577 (comment)