Massive downloads (1B+ rows) cause read errors #1252
Labels
- api: bigquery — Issues related to the googleapis/python-bigquery API.
- priority: p3 — Desirable enhancement or fix. May not be included in next release.
- type: bug — Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
After troubleshooting for a long while to figure out why my pandas read_gbq() (with use_bqstorage_api enabled) query's read throughput would drop like a rock early in the download, I think I eventually found my answer.
Looking at the GCP API monitor, I saw that my requests would eventually error out with a 499 response (client error).
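For reference, a minimal sketch of the kind of call that hits this (the project and table names here are placeholders, not from my actual workload):

```python
import pandas

# Hypothetical query over a very large (1B+ row) table; with the
# BigQuery Storage API enabled, the server can hand back up to
# 1000 read streams for a result this size.
df = pandas.read_gbq(
    "SELECT * FROM `my-project.my_dataset.huge_table`",
    project_id="my-project",
    use_bqstorage_api=True,
)
```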
After all my debugging, I found that this function was returning 1000 read streams/threads to download:
https://github.com/googleapis/python-bigquery/blob/main/google/cloud/bigquery/_pandas_helpers.py#L838
I believe that for massive query results and a max_stream_count (requested_streams) value of 0, the BQ server returns its maximum of 1000 streams to use. This most likely overwhelms the system and causes some of the threads to die from connection timeouts or something similar. I found that when I forced the stream count to something much more reasonable, like 48, my download worked fine.
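As a workaround, it's possible to bypass read_gbq and drive the BigQuery Storage API directly, capping max_stream_count explicitly. A minimal sketch under that assumption (names are placeholders, and 48 is just the value that happened to work on my machine):

```python
from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types

client = bigquery_storage_v1.BigQueryReadClient()

requested_session = types.ReadSession(
    table="projects/my-project/datasets/my_dataset/tables/huge_table",
    data_format=types.DataFormat.ARROW,
)

# Cap the stream count ourselves instead of passing 0 and letting
# the server return its maximum (1000 for very large results).
session = client.create_read_session(
    parent="projects/my-project",
    read_session=requested_session,
    max_stream_count=48,
)

# Read each stream; in practice you would fan these out across a
# bounded worker pool rather than reading them serially.
for stream in session.streams:
    reader = client.read_rows(stream.name)
    for row in reader.rows(session):
        ...  # process each row
```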
Environment details
google-cloud-bigquery
version: 2.31.0