[data] Data doesn't account for object store memory from pandas batch formats #48506
Closed
Description
While it's true that this is no longer an issue when the blocks are Arrow tables, you'll still run into it when the blocks are pandas DataFrames. This can happen if you use the "pandas" batch format explicitly, or if you use APIs like drop_columns that use the "pandas" batch format under the hood.
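For example, a pipeline like the following ends up with pandas blocks even though it never requests the "pandas" batch format itself. This is a minimal sketch: the input rows and the "extra" column name are made up for illustration, and only drop_columns is taken from the description above.

import ray

# Hypothetical input: eight rows, each with a small binary payload and a
# throwaway "extra" column.
ds = ray.data.from_items(
    [{"data": b"\x00" * 1024, "extra": i} for i in range(8)]
)

# drop_columns is one of the APIs that uses the "pandas" batch format
# under the hood, so the resulting blocks are pandas blocks.
ds = ds.drop_columns(["extra"])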
Here's a simple repro:
import ray

def generate_data(batch):
    # Each yielded batch holds a single ~128 MiB binary payload.
    for _ in range(8):
        yield {"data": [[b"\x00" * 128 * 1024 * 1024]]}

ds = (
    ray.data.range(1, override_num_blocks=1)
    .map_batches(generate_data, batch_size=1)
    .map_batches(lambda batch: batch, batch_format=...)  # "pandas" or "pyarrow"
)

for bundle in ds.iter_internal_ref_bundles():
    print(f"num_rows={bundle.num_rows()} size_bytes={bundle.size_bytes()}")
Output with pandas:
num_rows=8 size_bytes=192
Output with PyArrow:
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
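The 192-byte figure is consistent with a shallow pandas memory estimate: the index plus eight 8-byte object pointers, with none of the buffers those pointers reference. The sketch below is not Ray Data's actual size estimator; it's a standalone pandas comparison, assuming a DataFrame shaped like the repro's batches, that shows how the nested payloads disappear from the accounting.

import sys
import pandas as pd

# Eight rows, each holding its own ~128 MiB payload, mirroring the batches
# yielded by generate_data in the repro.
df = pd.DataFrame({"data": [[b"\x00" * 128 * 1024 * 1024] for _ in range(8)]})

# Shallow accounting counts the index plus eight 8-byte object pointers,
# roughly the 192 bytes reported for the pandas blocks above.
print(df.memory_usage(index=True, deep=False).sum())

# deep=True only adds sys.getsizeof() of each cell (a small list wrapper);
# it still never reaches the 128 MiB bytes objects nested inside.
print(df.memory_usage(index=True, deep=True).sum())

# What the rows actually reference: roughly 8 * (128 MiB + bytes overhead).
print(sum(sys.getsizeof(cell[0]) for cell in df["data"]))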
Originally posted by @bveeramani in #44577 (comment)