Massive downloads (1B+ rows) cause read errors #1252

Closed

Description

@jlynchMicron

After troubleshooting for a long while trying to figure out why my pandas read_gbq() query (with use_bqstorage_api enabled) would see its read-request throughput drop like a rock early in the download, I think I eventually found my answer.
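
For context, the call that hits this is essentially the following (the query, project, and table names are placeholders, not my actual job, which returns 1B+ rows):

```python
import pandas as pd

# Placeholder query/project; the real query returns 1B+ rows.
df = pd.read_gbq(
    "SELECT * FROM `my-project.my_dataset.huge_table`",
    project_id="my-project",
    use_bqstorage_api=True,  # download the result via the BigQuery Storage API
)
```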

Looking at the GCP API monitor, I saw that my requests would eventually error out with a 499 response (client error).

After all my debugging, I found that this function was returning 1000 read streams/threads to download:
https://github.com/googleapis/python-bigquery/blob/main/google/cloud/bigquery/_pandas_helpers.py#L838
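
To see the stream count the backend hands back when the client doesn't cap it, a request like the one below reproduces the number I saw (the project and table paths are placeholders standing in for my actual result table):

```python
from google.cloud import bigquery_storage
from google.cloud.bigquery_storage import types

# Placeholders standing in for my project and the (huge) destination table.
project_id = "my-project"
table = "projects/my-project/datasets/my_dataset/tables/huge_results"

client = bigquery_storage.BigQueryReadClient()
session = client.create_read_session(
    parent=f"projects/{project_id}",
    read_session=types.ReadSession(table=table, data_format=types.DataFormat.ARROW),
    max_stream_count=0,  # 0 = let the server decide, which is what the helper requests
)
print(len(session.streams))  # came back as 1000 for my table
```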

I believe that for massive query results and a requested stream count of 0 (max_stream_count=requested_streams), the BQ server responds with its maximum of 1000 streams to use. This most likely overwhelms the client and causes some of the threads to die, probably from connection timeouts or something similar. When I forced the stream count to something much more reasonable, like 48, my download worked fine.
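
My workaround was effectively to cap the stream count myself. One way to do that is to skip read_gbq and drive the Storage API client directly with a sane max_stream_count; a minimal sketch (the table path, project, and the cap of 48 are just the values that happened to work for me):

```python
import pandas as pd
from google.cloud import bigquery_storage
from google.cloud.bigquery_storage import types

project_id = "my-project"  # placeholder
table = "projects/my-project/datasets/my_dataset/tables/huge_results"  # placeholder

client = bigquery_storage.BigQueryReadClient()
session = client.create_read_session(
    parent=f"projects/{project_id}",
    read_session=types.ReadSession(table=table, data_format=types.DataFormat.ARROW),
    max_stream_count=48,  # cap the fan-out instead of letting the server pick 1000
)

# Read each stream and stitch the results together. In practice these reads
# would run in a bounded thread pool rather than sequentially.
frames = [
    client.read_rows(stream.name).to_dataframe(session)
    for stream in session.streams
]
df = pd.concat(frames, ignore_index=True)
```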

Environment details

  • OS type and version: Linux 64-bit
  • Python version: 3.8
  • google-cloud-bigquery version: 2.31.0

Labels

  • api: bigquery (Issues related to the googleapis/python-bigquery API)
  • priority: p3 (Desirable enhancement or fix. May not be included in next release.)
  • type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)
