feat: Add filters and columns arguments to read_gbq for enhanced data querying #198

Merged: 23 commits, merged on Dec 13, 2023

Changes shown are from 3 commits.

Commits (23):

7a006b0
feat: Add filters argument to read_gbq for enhanced data querying
Genesis929 Nov 13, 2023
37794a3
feat: Add filters argument to read_gbq for enhanced data querying
Genesis929 Nov 13, 2023
499bdcd
🦉 Updates from OwlBot post-processor
gcf-owl-bot[bot] Nov 13, 2023
300263e
feat: Add filters and columns arguments to read_gbq for enhanced data…
gcf-owl-bot[bot] Nov 13, 2023
fdc539d
🦉 Updates from OwlBot post-processor
gcf-owl-bot[bot] Nov 14, 2023
6ed4194
feat: Add filters and columns arguments to read_gbq for enhanced data…
gcf-owl-bot[bot] Nov 13, 2023
ad6d37f
feat: Add filters and columns arguments to read_gbq for enhanced data…
gcf-owl-bot[bot] Nov 13, 2023
3473780
feat: Add filters and columns arguments to read_gbq for enhanced data…
gcf-owl-bot[bot] Nov 13, 2023
276bfd0
feat: Add filters and columns arguments to read_gbq for enhanced data…
gcf-owl-bot[bot] Nov 13, 2023
8a4e940
feat: Add filters and columns arguments to read_gbq for enhanced data…
gcf-owl-bot[bot] Nov 13, 2023
b29e9b7
Merge branch 'main' into b299514019-read-gbq-filter
Genesis929 Nov 14, 2023
c00a05e
feat: Add filters and columns arguments to read_gbq for enhanced data…
gcf-owl-bot[bot] Nov 13, 2023
dd94369
feat: Add filters and columns arguments to read_gbq for enhanced data…
gcf-owl-bot[bot] Nov 13, 2023
54ca688
feat: Add filters and columns arguments to read_gbq for enhanced data…
gcf-owl-bot[bot] Nov 13, 2023
95e318b
🦉 Updates from OwlBot post-processor
gcf-owl-bot[bot] Nov 20, 2023
ced491f
feat: Add filters and columns arguments to read_gbq for enhanced data…
gcf-owl-bot[bot] Nov 13, 2023
0f2840d
🦉 Updates from OwlBot post-processor
gcf-owl-bot[bot] Nov 20, 2023
771d093
Merge branch 'main' into b299514019-read-gbq-filter
Genesis929 Dec 12, 2023
82f74fd
update docstring
Genesis929 Dec 12, 2023
1c038b5
Merge branch 'main' into b299514019-read-gbq-filter
tswast Dec 13, 2023
354fd8e
remove columns input
Genesis929 Dec 13, 2023
434c559
make filter_to_query run only when there are filters
Genesis929 Dec 13, 2023
c17b815
remove named input
Genesis929 Dec 13, 2023
2 changes: 2 additions & 0 deletions bigframes/pandas/__init__.py
@@ -486,6 +486,7 @@ def read_gbq(
index_col: Iterable[str] | str = (),
col_order: Iterable[str] = (),
max_results: Optional[int] = None,
filters: Optional[List[Tuple]] = None,
Collaborator:
Let's make this an Iterable[Tuple] instead. I presume an empty list of filters is semantically the same as None, right?

Also, we can document the supported operators with a Literal type.

Suggested change
filters: Optional[List[Tuple]] = None,
filters: Iterable[Tuple[str, Literal["in", "not in", "<", "<=", "==", "!=", ">=", ">"], Any]] = (),

Collaborator Author:
Updated.
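
For reference, the suggested annotation relies on typing.Literal (Python 3.8+). A minimal sketch of what the parameter type expands to, with the imports it needs (the FilterType alias is illustrative and not part of this PR):

from typing import Any, Iterable, Literal, Tuple

# Illustrative alias only; the PR inlines this type directly in the signature.
FilterType = Tuple[
    str,
    Literal["in", "not in", "<", "<=", "==", "!=", ">=", ">"],
    Any,
]
# Used in the signature as: filters: Iterable[FilterType] = ()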

) -> bigframes.dataframe.DataFrame:
_set_default_session_location_if_possible(query_or_table)
return global_session.with_default_session(
@@ -494,6 +495,7 @@ def read_gbq(
index_col=index_col,
col_order=col_order,
max_results=max_results,
filters=filters,
)


68 changes: 68 additions & 0 deletions bigframes/session/__init__.py
@@ -284,9 +284,11 @@ def read_gbq(
index_col: Iterable[str] | str = (),
col_order: Iterable[str] = (),
max_results: Optional[int] = None,
filters: Optional[List[Tuple]] = None
Collaborator:
Likewise, let's update these types.

Suggested change
filters: Optional[List[Tuple]] = None
filters: Iterable[Tuple[str, Literal["in", "not in", "<", "<=", "==", "!=", ">=", ">"], Any]] = (),

Collaborator Author:
Updated.

# Add a verify index argument that fails if the index is not unique.
) -> dataframe.DataFrame:
# TODO(b/281571214): Generate prompt to show the progress of read_gbq.
query_or_table = self._filters_to_query(query_or_table, filters)
if _is_query(query_or_table):
return self._read_gbq_query(
query_or_table,
@@ -307,6 +309,72 @@ def read_gbq(
api_name="read_gbq",
)

def _filters_to_query(self, query_or_table, filters):
Contributor:
Seems like a great candidate to write unit tests (string-in-string-out). In fact, at a later point it can go in a sql helper module like bigframes/ml/sql.py.

Collaborator Author:
Moved to unit test.
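
For reference, a minimal string-in-string-out unit test along the lines suggested above could look like the sketch below. The helper's final location after the move isn't shown in this diff, so calling it as Session._filters_to_query here is an assumption.

import bigframes.session

def test_filters_to_query_builds_where_clause():
    # `self` is not used by the helper, so a bare instance is enough for this sketch.
    session = object.__new__(bigframes.session.Session)
    sql = session._filters_to_query(
        "my-project.my_dataset.my_table",  # placeholder table name
        [("rowindex", "<", 4)],
    )
    assert sql == (
        "SELECT * FROM my-project.my_dataset.my_table AS sub"
        " WHERE rowindex < 4"
    )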

"""Convert filters to query"""

if (filters is None) or (len(filters) == 0):
return query_or_table

valid_operators = ["IN", "NOT IN", "=", ">", "<", ">=", "<=", "!="]
Collaborator:
Please use pandas / python syntax. We can transform to SQL with a dictionary.

Suggested change
valid_operators = ["IN", "NOT IN", "=", ">", "<", ">=", "<=", "!="]
valid_operators = {
"in": "IN",
"not in": "NOT IN",
"==": "=",
">": ">",
"<": "<",
">=": ">=",
"<=": "<=",
"!=": "!=",
}

Collaborator Author:
Done.


sub_query = (
f"({query_or_table})" if _is_query(query_or_table) else query_or_table
)

where_clause = ""
if filters:
if not isinstance(filters, list):
raise ValueError("Filters should be a list.")
Collaborator:
Don't need to check this. Any Iterable should be fine.

Collaborator Author:
Done.


if not (
all(isinstance(item, list) for item in filters)
or all(isinstance(item, tuple) for item in filters)
):
raise ValueError(
"All items in filters should be either all lists or all tuples."
)
Collaborator:
Better to just catch this when we encounter it so we can let them know which item was incorrect.

Collaborator Author:
Done.


if all(isinstance(sub_filter, tuple) for sub_filter in filters):
filters = [filters]

grouped_expressions = []
for group in filters:
if not isinstance(group, list):
raise ValueError("Each filter group should be a list.")

Collaborator:
Why the double nesting?

Collaborator:
Oh, I see: it's for the OR. You can ignore this feedback. :-)

Collaborator Author:
Changed the list names to or_expressions and and_expressions, to make the logic more clear.

group_expressions = []
for filter_item in group:
if not isinstance(filter_item, tuple):
raise ValueError("Each filter condition should be a tuple.")

column, operator, value = filter_item
operator = operator.upper()

if operator not in valid_operators:
raise ValueError(f"Operator {operator} is not valid.")

if operator in ["IN", "NOT IN"]:
if not isinstance(value, list):
raise ValueError(
f"Value for operator {operator} should be a list."
)
Collaborator:
We don't need this check. Any iterable should be OK.

Collaborator Author:
Done.

value_list = ", ".join(
[f'"{v}"' if isinstance(v, str) else str(v) for v in value]
Collaborator:
Suggested change
[f'"{v}"' if isinstance(v, str) else str(v) for v in value]
[repr(v) for v in value]

Collaborator Author:
Done

)
expression = f"{column} {operator} ({value_list})"
Collaborator:
Column names could contain spaces. We need to enclose them in backticks.

Suggested change
expression = f"{column} {operator} ({value_list})"
expression = f"`{column}` {operator} ({value_list})"

Collaborator Author:
Done

else:
value = f'"{value}"' if isinstance(value, str) else value
expression = f"{column} {operator} {value}"
Collaborator:
Column names could contain spaces. We need to enclose them in backticks.

Suggested change
value = f'"{value}"' if isinstance(value, str) else value
expression = f"{column} {operator} {value}"
expression = f"`{column}` {operator} {repr(value)}"

Collaborator Author:
Done

group_expressions.append(expression)

grouped_expressions.append(" AND ".join(group_expressions))

where_clause = " WHERE " + " OR ".join(grouped_expressions)

full_query = f"SELECT * FROM {sub_query} AS sub{where_clause}"
Collaborator:
I'd like to include the column filter here too, please.

Collaborator Author:
Done, added.

Collaborator:
Sorry, I meant using col_order, which is already used as a column filter.


return full_query
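# Editor's illustration (not part of the diff): with the code as of this commit,
# a nested `filters` value maps to an OR of AND groups. For example,
#
#     filters = [
#         [("rowindex", "not in", [0, 6])],
#         [("string_col", "=", "Hello, World!")],
#     ]
#
# against a plain table name (placeholder) builds, wrapped here for readability:
#
#     SELECT * FROM my_dataset.my_table AS sub
#     WHERE rowindex NOT IN (0, 6) OR string_col = "Hello, World!"
#
# Later commits in this PR switch the value quoting to repr() and wrap column
# names in backticks, per the review comments above.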

def _query_to_destination(
self,
query: str,
52 changes: 52 additions & 0 deletions tests/system/small/test_session.py
@@ -309,6 +309,58 @@ def test_read_gbq_w_script_no_select(session, dataset_id: str):
assert df["statement_type"][0] == "SCRIPT"


@pytest.mark.parametrize(
("query_or_table", "filters", "validator"),
[
pytest.param(
"""SELECT
rowindex,
string_col,
FROM `{scalars_table_id}` AS t
""",
[("rowindex", "<", 4), ("string_col", "=", "Hello, World!")],
lambda row: row["rowindex"] < 4 and row["string_col"] == "Hello, World!",
id="query_input",
),
pytest.param(
"{scalars_table_id}",
[("date_col", ">", "2022-10-20")],
lambda row: pd.to_datetime(row["date_col"]) > pd.to_datetime("2022-10-20"),
id="table_input",
),
pytest.param(
"{scalars_table_id}",
[
("rowindex", "not in", [0, 6]),
("string_col", "in", ["Hello, World!", "こんにちは"]),
],
lambda row: row["rowindex"] not in [0, 6]
and row["string_col"] in ["Hello, World!", "こんにちは"],
id="or_operation",
),
pytest.param(
"{scalars_table_id}",
["date_col", ">", "2022-10-20"],
None,
marks=pytest.mark.xfail(
raises=ValueError,
),
id="raise_error",
),
],
)
def test_read_gbq_with_filters(
session, scalars_table_id: str, query_or_table, filters, validator
):
df = session.read_gbq(
query_or_table.format(scalars_table_id=scalars_table_id),
filters=filters,
)

for _, row in df.iterrows():
assert validator(row)


def test_read_gbq_model(session, penguins_linear_model_name):
model = session.read_gbq_model(penguins_linear_model_name)
assert isinstance(model, bigframes.ml.linear_model.LinearRegression)
10 changes: 9 additions & 1 deletion third_party/bigframes_vendored/pandas/io/gbq.py
@@ -3,7 +3,7 @@

from __future__ import annotations

from typing import Iterable, Optional
from typing import Iterable, List, Optional, Tuple

from bigframes import constants

@@ -16,6 +16,7 @@ def read_gbq(
index_col: Iterable[str] | str = (),
col_order: Iterable[str] = (),
max_results: Optional[int] = None,
filters: Optional[List[Tuple]] = None,
):
"""Loads a DataFrame from BigQuery.

@@ -83,6 +84,13 @@ def read_gbq(
max_results (Optional[int], default None):
If set, limit the maximum number of rows to fetch from the
query results.
filters (List[Tuple], default []): To filter out data. Filter syntax:
[[(column, op, val), …],…] where op is [=, >, >=, <, <=, !=, in,
Contributor:
We are doing a type annotation of List[Tuple], but [[(column, op, val), …],…] looks like a List[List[Tuple]]?

Collaborator Author:
Sorry, I thought I had made the change here.

not in] The innermost tuples are transposed into a set of filters
applied through an AND operation. The outer list combines these sets
of filters through an OR operation. A single list of tuples can also
be used, meaning that no OR operation between set of filters is to be
conducted.

Returns:
bigframes.dataframe.DataFrame: A DataFrame representing results of the query or table.
Expand Down