-
Notifications
You must be signed in to change notification settings - Fork 53
feat: Add filters and columns arguments to read_gbq for enhanced data querying #198
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 3 commits
7a006b0
37794a3
499bdcd
300263e
fdc539d
6ed4194
ad6d37f
3473780
276bfd0
8a4e940
b29e9b7
c00a05e
dd94369
54ca688
95e318b
ced491f
0f2840d
771d093
82f74fd
1c038b5
354fd8e
434c559
c17b815
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
@@ -284,9 +284,11 @@ def read_gbq( | |||||||||||||||||||||||
index_col: Iterable[str] | str = (), | ||||||||||||||||||||||||
col_order: Iterable[str] = (), | ||||||||||||||||||||||||
max_results: Optional[int] = None, | ||||||||||||||||||||||||
filters: Optional[List[Tuple]] = None | ||||||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Likewise, let's update these types.
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Updated. |
||||||||||||||||||||||||
# Add a verify index argument that fails if the index is not unique. | ||||||||||||||||||||||||
) -> dataframe.DataFrame: | ||||||||||||||||||||||||
# TODO(b/281571214): Generate prompt to show the progress of read_gbq. | ||||||||||||||||||||||||
query_or_table = self._filters_to_query(query_or_table, filters) | ||||||||||||||||||||||||
if _is_query(query_or_table): | ||||||||||||||||||||||||
return self._read_gbq_query( | ||||||||||||||||||||||||
query_or_table, | ||||||||||||||||||||||||
|
@@ -307,6 +309,72 @@ def read_gbq( | |||||||||||||||||||||||
api_name="read_gbq", | ||||||||||||||||||||||||
) | ||||||||||||||||||||||||
|
||||||||||||||||||||||||
def _filters_to_query(self, query_or_table, filters): | ||||||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Seems like a great candidate to write unit tests (string-in-string-out). In fact, at a later point it can go in a sql helper module like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Moved to unit test. |
||||||||||||||||||||||||
"""Convert filters to query""" | ||||||||||||||||||||||||
|
||||||||||||||||||||||||
if (filters is None) or (len(filters) == 0): | ||||||||||||||||||||||||
return query_or_table | ||||||||||||||||||||||||
|
||||||||||||||||||||||||
valid_operators = ["IN", "NOT IN", "=", ">", "<", ">=", "<=", "!="] | ||||||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Please use pandas / python syntax. We can transform to SQL with a dictionary.
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done. |
||||||||||||||||||||||||
|
||||||||||||||||||||||||
sub_query = ( | ||||||||||||||||||||||||
f"({query_or_table})" if _is_query(query_or_table) else query_or_table | ||||||||||||||||||||||||
) | ||||||||||||||||||||||||
|
||||||||||||||||||||||||
where_clause = "" | ||||||||||||||||||||||||
if filters: | ||||||||||||||||||||||||
if not isinstance(filters, list): | ||||||||||||||||||||||||
raise ValueError("Filters should be a list.") | ||||||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Don't need to check this. Any Iterable should be fine. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done. |
||||||||||||||||||||||||
|
||||||||||||||||||||||||
if not ( | ||||||||||||||||||||||||
all(isinstance(item, list) for item in filters) | ||||||||||||||||||||||||
or all(isinstance(item, tuple) for item in filters) | ||||||||||||||||||||||||
): | ||||||||||||||||||||||||
raise ValueError( | ||||||||||||||||||||||||
"All items in filters should be either all lists or all tuples." | ||||||||||||||||||||||||
) | ||||||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Better to just catch this when we encounter it so we can let them know which item was incorrect. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done. |
||||||||||||||||||||||||
|
||||||||||||||||||||||||
if all(isinstance(sub_filter, tuple) for sub_filter in filters): | ||||||||||||||||||||||||
filters = [filters] | ||||||||||||||||||||||||
|
||||||||||||||||||||||||
grouped_expressions = [] | ||||||||||||||||||||||||
for group in filters: | ||||||||||||||||||||||||
if not isinstance(group, list): | ||||||||||||||||||||||||
raise ValueError("Each filter group should be a list.") | ||||||||||||||||||||||||
|
||||||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why the double nesting? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh, I see: it's for the OR. You can ignore this feedback. :-) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Changed the list names to or_expressions and and_expressions, to make the logic more clear. |
||||||||||||||||||||||||
group_expressions = [] | ||||||||||||||||||||||||
for filter_item in group: | ||||||||||||||||||||||||
if not isinstance(filter_item, tuple): | ||||||||||||||||||||||||
raise ValueError("Each filter condition should be a tuple.") | ||||||||||||||||||||||||
|
||||||||||||||||||||||||
column, operator, value = filter_item | ||||||||||||||||||||||||
operator = operator.upper() | ||||||||||||||||||||||||
|
||||||||||||||||||||||||
if operator not in valid_operators: | ||||||||||||||||||||||||
raise ValueError(f"Operator {operator} is not valid.") | ||||||||||||||||||||||||
|
||||||||||||||||||||||||
if operator in ["IN", "NOT IN"]: | ||||||||||||||||||||||||
if not isinstance(value, list): | ||||||||||||||||||||||||
raise ValueError( | ||||||||||||||||||||||||
f"Value for operator {operator} should be a list." | ||||||||||||||||||||||||
) | ||||||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We don't need this check. Any iterable should be OK. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done. |
||||||||||||||||||||||||
value_list = ", ".join( | ||||||||||||||||||||||||
[f'"{v}"' if isinstance(v, str) else str(v) for v in value] | ||||||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done |
||||||||||||||||||||||||
) | ||||||||||||||||||||||||
expression = f"{column} {operator} ({value_list})" | ||||||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Column names could contain spaces. We need to enclose them in backticks.
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done |
||||||||||||||||||||||||
else: | ||||||||||||||||||||||||
value = f'"{value}"' if isinstance(value, str) else value | ||||||||||||||||||||||||
expression = f"{column} {operator} {value}" | ||||||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Column names could contain spaces. We need to enclose them in backticks.
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done |
||||||||||||||||||||||||
group_expressions.append(expression) | ||||||||||||||||||||||||
|
||||||||||||||||||||||||
grouped_expressions.append(" AND ".join(group_expressions)) | ||||||||||||||||||||||||
|
||||||||||||||||||||||||
where_clause = " WHERE " + " OR ".join(grouped_expressions) | ||||||||||||||||||||||||
|
||||||||||||||||||||||||
full_query = f"SELECT * FROM {sub_query} AS sub{where_clause}" | ||||||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd like to include the column filter here too, please. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done, added. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry, I meant using |
||||||||||||||||||||||||
|
||||||||||||||||||||||||
return full_query | ||||||||||||||||||||||||
|
||||||||||||||||||||||||
def _query_to_destination( | ||||||||||||||||||||||||
self, | ||||||||||||||||||||||||
query: str, | ||||||||||||||||||||||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3,7 +3,7 @@ | |
|
||
from __future__ import annotations | ||
|
||
from typing import Iterable, Optional | ||
from typing import Iterable, List, Optional, Tuple | ||
|
||
from bigframes import constants | ||
|
||
|
@@ -16,6 +16,7 @@ def read_gbq( | |
index_col: Iterable[str] | str = (), | ||
col_order: Iterable[str] = (), | ||
max_results: Optional[int] = None, | ||
filters: Optional[List[Tuple]] = None, | ||
): | ||
"""Loads a DataFrame from BigQuery. | ||
|
||
|
@@ -83,6 +84,13 @@ def read_gbq( | |
max_results (Optional[int], default None): | ||
If set, limit the maximum number of rows to fetch from the | ||
query results. | ||
filters (List[Tuple], default []): To filter out data. Filter syntax: | ||
[[(column, op, val), …],…] where op is [=, >, >=, <, <=, !=, in, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We are doing a type annotation of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry, thought I have made change here. |
||
not in] The innermost tuples are transposed into a set of filters | ||
applied through an AND operation. The outer list combines these sets | ||
of filters through an OR operation. A single list of tuples can also | ||
be used, meaning that no OR operation between set of filters is to be | ||
conducted. | ||
|
||
Returns: | ||
bigframes.dataframe.DataFrame: A DataFrame representing results of the query or table. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make this an
Iterable[Tuple]
instead. I presume empty list of filters is semantically the same asNone
, right?Also, we can document the supported operators with a Literal type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.