[Ray Data] Add filtering and column pruning when reading from BigQuery table  #48821

Open
@PetrZhitnikov

Description

It would be great to be able to specify row filters and the columns to read when reading from a BigQuery (BQ) table.

Use case

Existing implementation

Currently, I can run code like this to read only selected columns, and only the rows matching a filter, from a table:

import ray
ds = ray.data.read_bigquery(
    project_id="my_project",
    query="""
        SELECT station_number, mean_temp
        FROM `bigquery-public-data.samples.gsod`
        WHERE year = 1940 AND month = 1 AND day = 1
    """,
)

However, this runs the query and creates a temporary table, which introduces extra cost and a delay before any data can be read.

Proposed option

On the other hand, the BQ Read API supports passing filters and fields directly in the read request against the existing table, via TableReadOptions (parameters selected_fields[] and row_restriction).

What I would like is an interface like this:

import ray
ds = ray.data.read_bigquery(
    project_id="my_project",
    dataset="bigquery-public-data.samples.gsod",
    selected_fields=["station_number", "mean_temp"],
    row_restriction="year = 1940 and month = 1 and day = 1",
)

These new arguments would then be propagated down to the BQ Read API read request. In that case, data would be streamed directly from the existing table, without the extra cost and time spent on creating an intermediate table.
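
For reference, here is a rough sketch of what the underlying read request could look like when built directly with the `google-cloud-bigquery-storage` client. This is not Ray's implementation: the `ReadFilter` class and the `table_path` and `create_filtered_session` helpers are hypothetical names used only for illustration, and the sketch assumes the client library is installed and credentials are configured.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReadFilter:
    """Hypothetical container for the two proposed arguments."""

    selected_fields: List[str] = field(default_factory=list)
    row_restriction: str = ""


def table_path(project_id: str, dataset_id: str, table_id: str) -> str:
    # Resource name format used by the BigQuery Storage Read API.
    return f"projects/{project_id}/datasets/{dataset_id}/tables/{table_id}"


def create_filtered_session(billing_project, table, read_filter, max_stream_count=1):
    # Imported lazily so the helpers above work without the client library.
    from google.cloud import bigquery_storage
    from google.cloud.bigquery_storage import types

    session = types.ReadSession()
    session.table = table
    session.data_format = types.DataFormat.ARROW
    # TableReadOptions: column pruning and server-side row filtering.
    session.read_options.selected_fields = read_filter.selected_fields
    session.read_options.row_restriction = read_filter.row_restriction

    client = bigquery_storage.BigQueryReadClient()
    return client.create_read_session(
        parent=f"projects/{billing_project}",
        read_session=session,
        max_stream_count=max_stream_count,
    )


# Example call (requires credentials, so it is left commented out):
# session = create_filtered_session(
#     "my_project",
#     table_path("bigquery-public-data", "samples", "gsod"),
#     ReadFilter(["station_number", "mean_temp"],
#                "year = 1940 and month = 1 and day = 1"),
# )
```

Because the session reads the existing table directly, no query job runs and no intermediate table is created. The returned session's streams can then be consumed with `client.read_rows(...)`.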

Labels: P2 (Important issue, but not time-critical), data (Ray Data-related issues), enhancement (Request for new feature and/or capability)