
Performance #133

Closed
@max-sixty

Description

We're starting to use BigQuery heavily but are becoming increasingly bottlenecked by the performance of moving moderate amounts of data from BigQuery to Python.

Here are a few stats:

  • 29.1s: Pulling 500k rows with 3 columns of data (with cached data) using pandas-gbq
  • 36.5s: Pulling the same query with google-cloud-bigquery, i.e. client.query(query).to_dataframe()
  • 2.4s: Pulling very similar data (same types, same size) from our existing MSSQL box hosted in AWS using pd.read_sql. That's with standard drivers, nothing like turbodbc involved

...so using BigQuery with Python is at least an order of magnitude slower than traditional DBs.
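
For reference, here's roughly what the three code paths look like. This is a minimal sketch, not our actual benchmark harness; the project ID, query, and MSSQL connection string are placeholders:

```python
import time

import pandas as pd
import pandas_gbq
from google.cloud import bigquery
from sqlalchemy import create_engine

QUERY = "SELECT a, b, c FROM my_dataset.my_table"  # ~500k rows, 3 columns (placeholder)
PROJECT_ID = "my-project"                           # placeholder

# 1) pandas-gbq (~29s in the numbers above)
start = time.time()
df = pandas_gbq.read_gbq(QUERY, project_id=PROJECT_ID)
print("pandas-gbq:", time.time() - start)

# 2) google-cloud-bigquery (~36s above)
client = bigquery.Client(project=PROJECT_ID)
start = time.time()
df = client.query(QUERY).to_dataframe()
print("google-cloud-bigquery:", time.time() - start)

# 3) MSSQL over standard drivers (~2.4s above)
engine = create_engine(
    "mssql+pyodbc://user:pass@host/db?driver=ODBC+Driver+17+for+SQL+Server"  # placeholder
)
start = time.time()
df = pd.read_sql(QUERY, engine)
print("pd.read_sql (MSSQL):", time.time() - start)
```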

We've tried exporting tables to CSV on GCS and reading those in, which works fairly well for data processes, though not for exploration.
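
For concreteness, the export-and-read workaround looks roughly like the sketch below. The bucket, dataset, and table names are placeholders, and reading the shards back assumes gcsfs is installed:

```python
import pandas as pd
import gcsfs
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

# Export the table to sharded CSV files on GCS via a BigQuery extract job
source_table = "my-project.my_dataset.my_table"          # placeholder table
destination_uri = "gs://my-bucket/export/my_table-*.csv"  # placeholder bucket
extract_job = client.extract_table(source_table, destination_uri)
extract_job.result()  # wait for the extract job to finish

# Read the exported shards back into a single DataFrame
fs = gcsfs.GCSFileSystem(project="my-project")
paths = fs.glob("my-bucket/export/my_table-*.csv")
df = pd.concat(pd.read_csv(fs.open(path)) for path in paths)
```

This works fine as a batch step, but the round trip through GCS is what makes it awkward for interactive exploration.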

A few questions - feel free to jump in with partial replies:

  • Are these results expected, or are we doing something very wrong?
  • My prior is that much of the slowdown comes from pulling paginated results over HTTP, converting them to Python objects, and then writing those into arrays. Is that approach really scalable? Should pandas-gbq invest resources into getting a format that's queryable in exploratory workflows and can handle more reasonably sized datasets (or at least encourage Google to)?
