We're starting to use BigQuery heavily but are becoming increasingly bottlenecked by the performance of moving moderate amounts of data from BigQuery to Python.
Here are a few stats (rough timing code follows the list):
- 29.1s: Pulling 500k rows with 3 columns of data (with cached data) using `pandas-gbq`
- 36.5s: Pulling the same query with `google-cloud-bigquery`, i.e. `client.query(query).to_dataframe()`
- 2.4s: Pulling very similar data (same types, same size) from our existing MSSQL box hosted in AWS using `pd.read_sql`. That's on standard drivers, nothing like `turbodbc` involved.
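
For reference, this is roughly how we're timing the three paths. The query, project ID, table names, and MSSQL connection string below are placeholders rather than our real setup:

```python
import time

import pandas as pd
import pandas_gbq
import sqlalchemy
from google.cloud import bigquery

# Placeholder query and project -- swap in your own.
QUERY = "SELECT col_a, col_b, col_c FROM `my_project.my_dataset.my_table`"
PROJECT = "my_project"

# pandas-gbq path
start = time.perf_counter()
df_gbq = pandas_gbq.read_gbq(QUERY, project_id=PROJECT, dialect="standard")
print("pandas-gbq:", time.perf_counter() - start)

# google-cloud-bigquery path
client = bigquery.Client(project=PROJECT)
start = time.perf_counter()
df_bq = client.query(QUERY).to_dataframe()
print("google-cloud-bigquery:", time.perf_counter() - start)

# MSSQL comparison (connection string is a placeholder)
engine = sqlalchemy.create_engine(
    "mssql+pyodbc://user:password@host/db?driver=ODBC+Driver+17+for+SQL+Server"
)
start = time.perf_counter()
df_sql = pd.read_sql("SELECT col_a, col_b, col_c FROM my_table", engine)
print("pd.read_sql:", time.perf_counter() - start)
```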
...so using BigQuery with Python is at least an order of magnitude slower than traditional DBs.
We've tried exporting tables to CSV on GCS and reading those in, which works fairly well for data processes, though not for exploration.
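
For anyone trying the same workaround, this is roughly what we do. The project, dataset, table, and bucket names are placeholders, and reading `gs://` paths directly from pandas assumes `gcsfs` is installed:

```python
import gcsfs
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client(project="my_project")  # placeholder project

# Export the table to sharded CSVs on GCS; the wildcard lets BigQuery
# split large tables across multiple files.
table_ref = bigquery.TableReference.from_string("my_project.my_dataset.my_table")
extract_job = client.extract_table(table_ref, "gs://my-bucket/exports/my_table-*.csv")
extract_job.result()  # block until the export job finishes

# Read the shards back into pandas (gcsfs lets pandas open gs:// paths).
fs = gcsfs.GCSFileSystem(project="my_project")
shard_paths = fs.glob("my-bucket/exports/my_table-*.csv")
df = pd.concat(pd.read_csv("gs://" + path) for path in shard_paths)
```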
A few questions - feel free to jump in with partial replies:
- Are these results expected, or are we doing something very wrong?
- My prior is that a lot of this slowdown comes from pulling in HTTP pages, converting them to Python objects, and then writing those into arrays (see the sketch after this list). Is this approach really scalable? Should `pandas-gbq` invest resources into getting a format that's query-able in exploratory workflows and can deal with more reasonably sized datasets (or at least encourage Google to)?
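
To make that concrete, here's roughly the path I think the data takes today, per my understanding of the REST API: results come back in pages, and every cell has already been parsed into a Python object before pandas sees it. The project, query, and page size below are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my_project")  # placeholder project
row_iterator = client.query(
    "SELECT col_a, col_b, col_c FROM `my_project.my_dataset.my_table`"
).result(page_size=10000)

# Each page is one HTTP response of JSON rows; by the time we see it here,
# the client has already turned every cell into a Python object, and
# to_dataframe() then builds the DataFrame from those objects row by row.
for page in row_iterator.pages:
    python_rows = [row.values() for row in page]
```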