[Ray Data] Add filtering and column pruning when reading from BigQuery table #48821
Description
It would be great to be able to pass row filters and a list of columns when reading from a BigQuery (BQ) table.
Use case
Existing implementation
As of now, I can run code like this to read a table with only selected columns and with filter conditions applied:
import ray

ds = ray.data.read_bigquery(
    project_id="my_project",
    query="""
        SELECT station_number, mean_temp
        FROM `bigquery-public-data.samples.gsod`
        WHERE year = 1940 AND month = 1 AND day = 1
    """,
)
However, this runs the query and creates a temporary table, which adds extra cost and a delay before reading can start.
Proposed option
On the other hand, the BQ Storage Read API supports passing filters and fields directly in the read request against the existing table, via TableReadOptions (parameters selected_fields[] and row_restriction).
So what I would like is an interface like this:
import ray

# selected_fields and row_restriction are the proposed new arguments.
ds = ray.data.read_bigquery(
    project_id="my_project",
    dataset="bigquery-public-data.samples.gsod",
    selected_fields=["station_number", "mean_temp"],
    row_restriction="year = 1940 and month = 1 and day = 1",
)
These new arguments would be propagated down to the BQ Read API read request. In that case, data would be streamed directly from the existing table, without the extra cost and time spent on creating an intermediate table.
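To make the proposed mapping concrete, here is a rough sketch of how the two new arguments could be translated into a Storage Read API session request. This is not Ray's implementation: build_read_session_spec and create_read_session are hypothetical helpers invented for illustration, and the handling of a three-part "project.dataset.table" string is an assumption based on the example above. The second function uses the google-cloud-bigquery-storage client (BigQueryReadClient, ReadSession.TableReadOptions) and needs that package plus credentials to run.

```python
from typing import Any, Dict, List, Optional


def build_read_session_spec(
    project_id: str,
    dataset: str,
    selected_fields: Optional[List[str]] = None,
    row_restriction: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: map the proposed arguments onto Read API fields."""
    parts = dataset.split(".")
    if len(parts) == 3:
        # Assumption: "project.dataset.table" points at a table in another project.
        table_project, dataset_id, table_id = parts
    elif len(parts) == 2:
        # "dataset.table" is resolved against the billing project.
        table_project = project_id
        dataset_id, table_id = parts
    else:
        raise ValueError(f"Expected 'dataset.table' or 'project.dataset.table', got {dataset!r}")

    # Only include the options the caller actually set.
    read_options: Dict[str, Any] = {}
    if selected_fields:
        read_options["selected_fields"] = list(selected_fields)
    if row_restriction:
        read_options["row_restriction"] = row_restriction

    return {
        "parent": f"projects/{project_id}",
        "table": f"projects/{table_project}/datasets/{dataset_id}/tables/{table_id}",
        "read_options": read_options,
    }


def create_read_session(spec: Dict[str, Any], max_stream_count: int = 1):
    """Hypothetical helper: open a read session with column pruning and filtering."""
    # Imported lazily so the spec builder works without the client library installed.
    from google.cloud import bigquery_storage_v1
    from google.cloud.bigquery_storage_v1 import types

    client = bigquery_storage_v1.BigQueryReadClient()
    requested_session = types.ReadSession(
        table=spec["table"],
        data_format=types.DataFormat.ARROW,
        read_options=types.ReadSession.TableReadOptions(**spec["read_options"]),
    )
    return client.create_read_session(
        parent=spec["parent"],
        read_session=requested_session,
        max_stream_count=max_stream_count,
    )


spec = build_read_session_spec(
    project_id="my_project",
    dataset="bigquery-public-data.samples.gsod",
    selected_fields=["station_number", "mean_temp"],
    row_restriction="year = 1940 AND month = 1 AND day = 1",
)
```

With a spec like this, the read session is opened against the existing table, so no query job or temporary table is involved; the streams in the returned session could then be distributed across read tasks.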