We're starting to use BigQuery heavily but are becoming increasingly bottlenecked by the performance of moving moderate amounts of data from BigQuery to Python.
Here are a few stats (rough timing code follows the list):
- 29.1s: Pulling 500k rows with 3 columns of data (with cached data) using `pandas-gbq`
- 36.5s: Pulling the same query with `google-cloud-bigquery`, i.e. `client.query(query).to_dataframe()`
- 2.4s: Pulling very similar data (same types, same size) from our existing MSSQL box hosted in AWS using `pd.read_sql`. That's on standard drivers, nothing like `turbodbc` involved.
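
For reference, this is roughly how we're timing the three paths. The query, project ID, table names, and MSSQL connection string below are placeholders rather than our real setup:

```python
import time

import pandas as pd
import pandas_gbq
import sqlalchemy
from google.cloud import bigquery

# Placeholder query and project -- swap in your own.
QUERY = "SELECT col_a, col_b, col_c FROM `my_project.my_dataset.my_table`"
PROJECT = "my_project"

# pandas-gbq path
start = time.perf_counter()
df_gbq = pandas_gbq.read_gbq(QUERY, project_id=PROJECT, dialect="standard")
print("pandas-gbq:", time.perf_counter() - start)

# google-cloud-bigquery path
client = bigquery.Client(project=PROJECT)
start = time.perf_counter()
df_bq = client.query(QUERY).to_dataframe()
print("google-cloud-bigquery:", time.perf_counter() - start)

# MSSQL comparison (connection string is a placeholder)
engine = sqlalchemy.create_engine(
    "mssql+pyodbc://user:password@host/db?driver=ODBC+Driver+17+for+SQL+Server"
)
start = time.perf_counter()
df_sql = pd.read_sql("SELECT col_a, col_b, col_c FROM my_table", engine)
print("pd.read_sql:", time.perf_counter() - start)
```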
...so using BigQuery with Python is at least an order of magnitude slower than traditional DBs.
We've tried exporting tables to CSV on GCS and reading those in, which works fairly well for data processes, though not for exploration.
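
For anyone trying the same workaround, this is roughly what we do. The project, dataset, table, and bucket names are placeholders, and reading `gs://` paths directly from pandas assumes `gcsfs` is installed:

```python
import gcsfs
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client(project="my_project")  # placeholder project

# Export the table to sharded CSVs on GCS; the wildcard lets BigQuery
# split large tables across multiple files.
table_ref = bigquery.TableReference.from_string("my_project.my_dataset.my_table")
extract_job = client.extract_table(table_ref, "gs://my-bucket/exports/my_table-*.csv")
extract_job.result()  # block until the export job finishes

# Read the shards back into pandas (gcsfs lets pandas open gs:// paths).
fs = gcsfs.GCSFileSystem(project="my_project")
shard_paths = fs.glob("my-bucket/exports/my_table-*.csv")
df = pd.concat(pd.read_csv("gs://" + path) for path in shard_paths)
```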
A few questions - feel free to jump in with partial replies:
- Are these results expected, or are we doing something very wrong?
- My prior is that a lot of this slowdown comes from pulling in HTTP pages, converting them to Python objects, and then writing those into arrays (see the sketch after this list). Is this approach really scalable? Should `pandas-gbq` invest resources into getting a format that's query-able in exploratory workflows and can deal with more reasonably sized datasets (or at least encourage Google to)?
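
To make that concrete, here's roughly the path I think the data takes today, per my understanding of the REST API: results come back in pages, and every cell has already been parsed into a Python object before pandas sees it. The project, query, and page size below are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my_project")  # placeholder project
row_iterator = client.query(
    "SELECT col_a, col_b, col_c FROM `my_project.my_dataset.my_table`"
).result(page_size=10000)

# Each page is one HTTP response of JSON rows; by the time we see it here,
# the client has already turned every cell into a Python object, and
# to_dataframe() then builds the DataFrame from those objects row by row.
for page in row_iterator.pages:
    python_rows = [row.values() for row in page]
```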