
perf: DB-API uses more efficient query_and_wait when no job ID is provided #1747

Merged: 58 commits into main from b1745-dbapi-query_and_wait on Dec 19, 2023

Conversation

@tswast (Contributor) commented Dec 8, 2023

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #1745 🦕

…or` of results

Set the `QUERY_PREVIEW_ENABLED=TRUE` environment variable to use this with the
new JOB_CREATION_OPTIONAL mode (currently in preview).
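As an illustration (mine, not from the PR), here is a minimal sketch of the dispatch this change introduces. The helper name run_query is hypothetical and the real logic lives in the DB-API Cursor, but Client.query and Client.query_and_wait are the actual client methods involved:

# Hypothetical sketch -- the actual change lives inside the DB-API Cursor.
from typing import Optional

from google.cloud import bigquery

def run_query(client: bigquery.Client, sql: str, job_id: Optional[str] = None):
    if job_id is None:
        # No explicit job ID: query_and_wait() can use the backend's
        # JOB_CREATION_OPTIONAL mode and returns a RowIterator directly,
        # avoiding the separate insert/poll/fetch round trips.
        return client.query_and_wait(sql)
    # An explicit job ID requires the classic job-creation path.
    return client.query(sql, job_id=job_id).result()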
@tswast tswast requested review from chalmerlowe and Linchin and removed request for obada-ab December 11, 2023 19:49
@tswast (Contributor Author) commented Dec 12, 2023

I tested this manually in IPython as well, since it seems we don't have many (if any) system tests or samples in this repo that exercise this functionality.

# In [1]:
import google.cloud.bigquery.dbapi as bqdb
conn = bqdb.connect()
cur = conn.cursor()
cur.execute("SELECT 1")
cur.fetchall()
# Out[1]: [Row((1,), {'f0_': 0})]

# In [2]:
cur._query_rows._should_use_bqstorage(None, create_bqstorage_client=True)
# Out[2]: False

# In [4]:
cur.execute("SELECT name, SUM(`number`) AS total FROM `bigquery-public-data.usa_names.usa_1910_2013` GROUP BY name")
cur._query_rows._should_use_bqstorage(None, create_bqstorage_client=True)
# Out[4]: False

# In [5]:
r = cur.fetchall()
len(r)
# Out[5]: 29828

# In [7]:
cur.execute("SELECT name, number, year FROM `bigquery-public-data.usa_names.usa_1910_2013`")
cur._query_rows._should_use_bqstorage(None, create_bqstorage_client=True)
# Out[7]: True

# In [8]:
r = cur.fetchall()
len(r)
# Out[8]: 5552452
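For context (my simplification, not the library's actual code): _should_use_bqstorage decides whether rows should be downloaded via the BigQuery Storage API. Roughly, results that fit entirely in the cached first page, like the two smaller queries above, do not benefit from it:

# Illustrative simplification of the decision; the real method also checks
# other conditions (e.g. whether a destination table reference is available).
def should_use_bqstorage_sketch(row_iterator) -> bool:
    first_page = getattr(row_iterator, "_first_page_response", None)
    if first_page is not None and first_page.get("pageToken") is None:
        # Every row already arrived in the first (cached) page, so the
        # Storage API would only add connection overhead.
        return False
    # Large, multi-page results are faster to read via the Storage API.
    return True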

I have also tested these same queries with SQLAlchemy to make sure this doesn't somehow break the connector.

# In [1]:
from sqlalchemy import *
from sqlalchemy.engine import create_engine
from sqlalchemy.schema import *
engine = create_engine('bigquery://swast-scratch')

# In [2]:
table = Table(
    'usa_1910_2013',
    MetaData(bind=engine),
    autoload=True,
    schema='bigquery-public-data.usa_names',
)
select([func.count('*')], from_obj=table).scalar()
# Out[2]: 5552452

# In[3]:
len(select(
    [table.c.name, func.sum(table.c.number).label('total')],
    from_obj=table
).group_by(table.c.name).execute().all())
# Out[3]: 29828

# In[4]:
len(select(
    [table.c.name, table.c.number, table.c.year],
    from_obj=table
).execute().all())
# Out[4]: 5552452

As a sanity check, I verified that this change actually produces a speedup:

After this change:

# In [1]:
import google.cloud.bigquery.dbapi as bqdb
conn = bqdb.connect()
cur = conn.cursor()

# In [2]:
%%timeit -n10 -r10
cur.execute("SELECT 1")
r = cur.fetchall()
# Out [2]:
# 319 ms ± 19.5 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

# In [3]:
%%timeit -n5 -r4
cur.execute("SELECT name, SUM(`number`) AS total FROM `bigquery-public-data.usa_names.usa_1910_2013` GROUP BY name")
cur.fetchall()
# Out [3]:
# 1.63 s ± 80.3 ms per loop (mean ± std. dev. of 4 runs, 5 loops each)

Before this change:

# In [1]:
import google.cloud.bigquery.dbapi as bqdb
conn = bqdb.connect()
cur = conn.cursor()

# In [2]:
%%timeit -n10 -r10
cur.execute("SELECT 1")
r = cur.fetchall()
# Out [2]:
# 1.18 s ± 73.2 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

# In [3]:
%%timeit -n5 -r4
cur.execute("SELECT name, SUM(`number`) AS total FROM `bigquery-public-data.usa_names.usa_1910_2013` GROUP BY name")
cur.fetchall()
# Out [3]:
# 1.67 s ± 62.2 ms per loop (mean ± std. dev. of 4 runs, 5 loops each)

This means that small query results (SELECT 1) see a (1.18 / 0.319) ≈ 3.7x speedup! For medium-sized results/queries the difference is far less dramatic at (1.67 / 1.63) ≈ 1.02x and not statistically significant. That pattern is expected: query_and_wait can return the first page of results in the same jobs.query response, so the savings come from eliminating fixed per-query round trips rather than from faster row downloads.

@@ -1635,7 +1647,10 @@ def _is_almost_completely_cached(self):
         This is useful to know, because we can avoid alternative download
         mechanisms.
         """
-        if self._first_page_response is None:
+        if (
+            not hasattr(self, "_first_page_response")
+            or self._first_page_response is None
+        ):
Contributor:
Because we set self._first_page_response = first_page_response at line 1591, won't this attribute always exist? Maybe we can just check whether the value is None or not.

Contributor Author:
This was also needed for some tests where we have a mock row iterator but want to exercise the real implementation of this method.
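To illustrate the point (example mine, using the private names from the diff above): a spec'd mock of RowIterator only exposes attributes visible on the class, so instance attributes assigned in __init__, such as _first_page_response, are missing unless the test sets them explicitly.

from unittest import mock

from google.cloud.bigquery.table import RowIterator

fake = mock.create_autospec(RowIterator, instance=True)

# False: autospec cannot see attributes that are only assigned in __init__.
print(hasattr(fake, "_first_page_response"))

# Calling the real method against the mock: with the hasattr() guard it
# reports "not cached" instead of raising AttributeError.
print(RowIterator._is_almost_completely_cached(fake))  # False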

-        if self.next_page_token is not None:
+        if hasattr(self, "next_page_token") and self.next_page_token is not None:
             # The developer has already started paging through results if
             # next_page_token is set.
Contributor:
Just for my education: it looks like the next_page_token attribute is inherited from the "grandparent" class Iterator in the core library, which creates this attribute at init. Is it necessary to check whether this attribute exists?

Contributor Author:

This was purely for some failing unit tests where this superclass was mocked out.

@Linchin (Contributor) commented Dec 17, 2023

Thank you, Tim, for the timely fix! LGTM, except for some nits.

@Linchin Linchin self-requested a review December 17, 2023 05:58
@chalmerlowe (Collaborator) left a comment:

LGTM

@tswast tswast added the automerge Merge the pull request once unit tests and other checks pass. label Dec 19, 2023
@gcf-merge-on-green gcf-merge-on-green bot merged commit d225a94 into main Dec 19, 2023
21 of 22 checks passed
@gcf-merge-on-green gcf-merge-on-green bot deleted the b1745-dbapi-query_and_wait branch December 19, 2023 22:00
@gcf-merge-on-green gcf-merge-on-green bot removed the automerge Merge the pull request once unit tests and other checks pass. label Dec 19, 2023
Labels: api: bigquery (Issues related to the googleapis/python-bigquery API), size: l (Pull request size is large)

Successfully merging this pull request may close these issues: dbapi: Skip storage client fetch when results cached (#1745)