[Data] RandomAccessDataset.multiget returns unexpected values for missing keys. #44768

Open
@sunyakun

Description

What happened + What you expected to happen

ray.data.RandomAccessDataset.multiget is expected to return None for missing records, but it actually returns an unexpected value for a missing key.

I found that PR #24825 updated _RandomAccessWorker.multiget to use np.searchsorted to speed up lookups. However, np.searchsorted returns the insertion point for a missing record, and multiget uses that result directly to fetch the row from the block without checking col[i] == key, as the following code does:

import bisect

def index(column, x):
    # Locate x in the sorted column; return None when x is absent.
    i = bisect.bisect_left(column, x)
    if i != len(column) and column[i] == x:
        return i
    return None
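The same equality check can be applied to a batched np.searchsorted lookup. The following is only a minimal sketch of the idea, not the actual Ray implementation: the helper name multiget_indices and its list-of-indices return value are assumptions made for illustration.

```python
import numpy as np

def multiget_indices(column, keys):
    """Hypothetical helper: index of each key in the sorted column, or None if absent."""
    column = np.asarray(column)
    keys = np.asarray(keys)
    if len(column) == 0:
        return [None] * len(keys)
    # searchsorted returns insertion points, so a missing key still gets an index.
    idx = np.searchsorted(column, keys, side="left")
    # Clamp out-of-range insertion points so the lookup below is safe.
    safe = np.minimum(idx, len(column) - 1)
    # A key is present only if it is in range and the element at its index equals it.
    found = (idx < len(column)) & (column[safe] == keys)
    return [int(i) if ok else None for i, ok in zip(idx, found)]

column = np.arange(0, 1000, 2)
# Insertion points alone point at the next larger key (2 and 902 here).
print(np.searchsorted(column, [1, 901]))
# With the equality check, missing keys map to None.
print(multiget_indices(column, [1, 901, 902, 2000]))
```

Without the `found` mask, the indices for keys 1 and 901 would select the rows for 2 and 902, which matches the behavior reported in this issue.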

Versions / Dependencies

Ray: latest master
Python: 3.9.2
OS: linux

Reproduction script

import ray
import ray.data

kv_store = ray.data.from_items(
    [i for i in range(0, 1000, 2)]
).repartition(5).to_random_access_dataset(key="item", num_workers=1)

print(ray.get(kv_store.get_async(1)), ray.get(kv_store.get_async(901)))
# output: None None

print(kv_store.multiget([1, 901]))
# output: [{'item': 2}, {'item': 902}]

Issue Severity

None

Metadata

Assignees

No one assigned

Labels

P3 (issue moderate in impact or severity), bug (something that is supposed to be working, but isn't), data (Ray Data-related issues)
