feat: read_gbq suggests using BigQuery DataFrames with large results (#769)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits: 601a197, 0911d88, fbca5bb, a7e556c, 8752fb7, 6ba62d1, 16d381e, 46da4ac
First file: a new constants module (imported below as pandas_gbq.constants).

@@ -0,0 +1,12 @@
+# Copyright (c) 2024 pandas-gbq Authors All rights reserved.
+# Use of this source code is governed by a BSD-style
+# license that can be found in the LICENSE file.
Comment on lines +1 to +3:

Just double-checking that this is normal, given that we use a different license header for other Google stuff.

Not normal, but expected. pandas-gbq was originally split off from pandas itself, so we can't use the Google header. It's not 100% Google's copyright since not all contributions were under the CLA.
+
+# BigQuery uses powers of 2 in calculating data sizes. See:
+# https://cloud.google.com/bigquery/pricing#data The documentation uses
+# GiB rather than GB to disambiguate from the alternative base 10 units.
+# https://en.wikipedia.org/wiki/Byte#Multiple-byte_units
+BYTES_IN_KIB = 1024
+BYTES_IN_MIB = 1024 * BYTES_IN_KIB
+BYTES_IN_GIB = 1024 * BYTES_IN_MIB
+BYTES_TO_RECOMMEND_BIGFRAMES = BYTES_IN_GIB
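To make the new threshold concrete, the arithmetic below shows how a byte count from BigQuery maps onto the GiB figure the warning reports. This is an illustration, not part of the diff; the byte value is made up:

# Illustration: how the base-2 constants convert a byte count into
# the GiB figure used by the warning added in this PR.
BYTES_IN_KIB = 1024
BYTES_IN_MIB = 1024 * BYTES_IN_KIB
BYTES_IN_GIB = 1024 * BYTES_IN_MIB  # 2**30 == 1,073,741,824 bytes

num_bytes = 5_368_709_120  # hypothetical table.num_bytes value
print(f"{num_bytes / BYTES_IN_GIB:.1f} GiB")  # prints "5.0 GiB"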
Second file: the main GBQ module, which defines _download_results, read_gbq, and to_gbq.
@@ -19,6 +19,8 @@
 if typing.TYPE_CHECKING:  # pragma: NO COVER
     import pandas
 
+import pandas_gbq.constants
+import pandas_gbq.exceptions
 from pandas_gbq.exceptions import GenericGBQException, QueryTimeout
 from pandas_gbq.features import FEATURES
 import pandas_gbq.query
@@ -478,6 +480,35 @@ def _download_results(
         if max_results is not None:
             create_bqstorage_client = False
 
+        # If we're downloading a large table, BigQuery DataFrames might be a
+        # better fit. Not all code paths will populate rows_iter._table, but
+        # if it's not populated that means we are working with a small result
+        # set.
+        if (table_ref := getattr(rows_iter, "_table", None)) is not None:
+            table = self.client.get_table(table_ref)
+            if (
+                isinstance((num_bytes := table.num_bytes), int)
+                and num_bytes > pandas_gbq.constants.BYTES_TO_RECOMMEND_BIGFRAMES
+            ):
+                num_gib = num_bytes / pandas_gbq.constants.BYTES_IN_GIB
+                warnings.warn(
+                    f"Recommendation: Your results are {num_gib:.1f} GiB. "
+                    "Consider using BigQuery DataFrames "
+                    "(https://cloud.google.com/bigquery/docs/bigquery-dataframes-introduction) "
+                    "to process large results with pandas compatible APIs with transparent SQL "
+                    "pushdown to BigQuery engine. This provides an opportunity to save on costs "
+                    "and improve performance. "
+                    "Please reach out to bigframes-feedback@google.com with any "
+                    "questions or concerns. To disable this message, run "
+                    "warnings.simplefilter('ignore', category=pandas_gbq.exceptions.LargeResultsWarning)",
+                    category=pandas_gbq.exceptions.LargeResultsWarning,
+                    # user's code
+                    # -> read_gbq
+                    # -> run_query
+                    # -> download_results
+                    stacklevel=4,
+                )
+
         try:
             schema_fields = [field.to_api_repr() for field in rows_iter.schema]
             conversion_dtypes = _bqschema_to_nullsafe_dtypes(schema_fields)
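As the message itself notes, users who want to keep fetching large results through pandas-gbq can silence the recommendation with a standard warnings filter. A minimal sketch, with a placeholder query:

import warnings

import pandas_gbq
import pandas_gbq.exceptions

# Silence the large-results recommendation introduced by this change.
warnings.simplefilter("ignore", category=pandas_gbq.exceptions.LargeResultsWarning)

df = pandas_gbq.read_gbq("SELECT * FROM `my-project.my_dataset.large_table`")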
@@ -663,18 +694,25 @@ def read_gbq(
     *,
     col_order=None,
 ):
-    r"""Load data from Google BigQuery using google-cloud-python
-
-    The main method a user calls to execute a Query in Google BigQuery
-    and read results into a pandas DataFrame.
+    r"""Read data from Google BigQuery to a pandas DataFrame.
 
-    This method uses the Google Cloud client library to make requests to
-    Google BigQuery, documented `here
-    <https://googleapis.dev/python/bigquery/latest/index.html>`__.
+    Run a SQL query in BigQuery or read directly from a table using
+    the `Python client library for BigQuery
+    <https://cloud.google.com/python/docs/reference/bigquery/latest/index.html>`__
+    and for `BigQuery Storage
+    <https://cloud.google.com/python/docs/reference/bigquerystorage/latest>`__
+    to make API requests.
 
     See the :ref:`How to authenticate with Google BigQuery <authentication>`
     guide for authentication instructions.
 
+    .. note::
+        Consider using `BigQuery DataFrames
+        <https://cloud.google.com/bigquery/docs/dataframes-quickstart>`__ to
+        process large results with pandas compatible APIs that run in the
+        BigQuery SQL query engine. This provides an opportunity to save on
+        costs and improve performance.
+
     Parameters
     ----------
     query_or_table : str
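For reference, the two entry points the rewritten docstring describes, running a query or reading a table directly, look like this in practice. Project, dataset, and table names below are placeholders:

import pandas_gbq

# Run a SQL query and read the results into a DataFrame.
df = pandas_gbq.read_gbq(
    "SELECT name FROM `my-project.my_dataset.my_table` LIMIT 10",
    project_id="my-project",  # placeholder project
)

# Or read a table directly by referencing its ID.
df = pandas_gbq.read_gbq("my_dataset.my_table", project_id="my-project")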
@@ -1050,12 +1088,7 @@ def to_gbq(
     )
 
     if api_method == "default":
-        # Avoid using parquet if pandas doesn't support lossless conversions to
-        # parquet timestamp. See: https://stackoverflow.com/a/69758676/101923
-        if FEATURES.pandas_has_parquet_with_lossless_timestamp:
-            api_method = "load_parquet"
-        else:
-            api_method = "load_csv"
+        api_method = "load_parquet"
 
     if chunksize is not None:
         if api_method == "load_parquet":

Comment on the removed branch:

Per the coverage report, this was dead code. Our minimum pandas version is beyond the one where this feature wasn't available.

Thank you, this is a helpful comment!
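Because "default" now always resolves to the parquet path, a caller that still needs the CSV behavior must opt in explicitly through the existing api_method parameter. A sketch with placeholder names:

import pandas as pd
import pandas_gbq

df = pd.DataFrame({"name": ["alice", "bob"], "score": [1.5, 2.5]})

# Request the CSV load path instead of the parquet default.
pandas_gbq.to_gbq(
    df,
    "my_dataset.my_table",  # placeholder destination table
    project_id="my-project",  # placeholder project
    api_method="load_csv",
)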