Add columns argument to read_gbq #15

Merged: 4 commits, Oct 13, 2021
dask_bigquery/core.py: 7 changes (5 additions, 2 deletions)
@@ -88,7 +88,8 @@ def read_gbq(
     project_id: str,
     dataset_id: str,
     table_id: str,
-    row_filter="",
+    row_filter: str = "",
+    columns: list[str] = None,
     read_kwargs: dict = None,
 ):
     """Read table as dask dataframe using BigQuery Storage API via Arrow format.
@@ -104,6 +105,8 @@
         BigQuery table within dataset
     row_filter: str
         SQL text filtering statement to pass to `row_restriction`
+    columns: list[str]
+        list of columns to load from the table
     read_kwargs: dict
         kwargs to pass to read_rows()

@@ -124,7 +127,7 @@ def make_create_read_session_request(row_filter=""):
             read_session=bigquery_storage.types.ReadSession(
                 data_format=bigquery_storage.types.DataFormat.ARROW,
                 read_options=bigquery_storage.types.ReadSession.TableReadOptions(
-                    row_restriction=row_filter,
+                    row_restriction=row_filter, selected_fields=columns
                 ),
                 table=table_ref.to_bqstorage(),
             ),
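For context, a minimal usage sketch of the new argument (assuming read_gbq is importable from the dask_bigquery package, as the test below does; the project, dataset, table, and column names are placeholders for illustration, not values from this PR):

    from dask_bigquery import read_gbq

    # Placeholder identifiers for illustration only; substitute your own
    # BigQuery project, dataset, and table.
    ddf = read_gbq(
        project_id="my-project",
        dataset_id="my_dataset",
        table_id="my_table",
        row_filter="value > 0",     # forwarded to row_restriction
        columns=["name", "value"],  # forwarded to selected_fields, so only these fields are read
    )

    # The resulting Dask DataFrame exposes only the requested columns.
    assert list(ddf.columns) == ["name", "value"]
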
dask_bigquery/tests/test_core.py: 14 changes (14 additions, 0 deletions)
@@ -82,3 +82,17 @@ def test_read_kwargs(dataset, client):

     with pytest.raises(Exception, match="504 Deadline Exceeded"):
         ddf.compute()
+
+
+def test_read_columns(df, dataset, client):

Review comment (Member):
Overall this test looks great. One small suggestion: could we add

    assert df.shape[1] > 1

to ensure that the original DataFrame has more than one column? Otherwise, in the future the example DataFrame could be updated to only have a single "name" column and this test would still pass (this is unlikely, but possible).

+    project_id, dataset_id, table_id = dataset
+    assert df.shape[1] > 1, "Test data should have multiple columns"
+
+    columns = ["name"]
+    ddf = read_gbq(
+        project_id=project_id,
+        dataset_id=dataset_id,
+        table_id=table_id,
+        columns=columns,
+    )
+    assert list(ddf.columns) == columns