Massive downloads (1B+ rows) cause read errors #1252

Open
jlynchMicron opened this issue May 13, 2022 · 3 comments
Labels
  • api: bigquery (Issues related to the googleapis/python-bigquery API.)
  • priority: p3 (Desirable enhancement or fix. May not be included in next release.)
  • type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)

Comments

@jlynchMicron

jlynchMicron commented May 13, 2022

After troubleshooting for a long while to figure out why my pandas read_gbq() download throughput (with use_bqstorage_api enabled) would drop like a rock early in the download, I think I eventually found my answer.

Looking at the GCP API monitoring dashboard, I saw that my requests would eventually error out with a 499 response (a client-side error).

After all my debugging, I found that this function was returning 1000 read streams/threads to download from:
https://github.com/googleapis/python-bigquery/blob/main/google/cloud/bigquery/_pandas_helpers.py#L838

I believe that for massive query results, when max_stream_count (i.e. requested_streams) is 0, the BigQuery server returns its maximum of 1000 streams to use. This most likely overwhelms the system and causes some of the threads to die from connection timeouts or something similar. I found that when I forced the stream count down to something much more reasonable, like 48, my download worked fine.
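
For illustration, here is a rough sketch of what "forcing the stream count" can look like if the read session is created directly with the BigQuery Storage API instead of letting the helper request 0 streams; the project/table names and the cap of 48 are placeholders, not recommended values:

from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types

client = bigquery_storage_v1.BigQueryReadClient()

session = client.create_read_session(
    parent="projects/my-project",
    read_session=types.ReadSession(
        table="projects/my-project/datasets/my_dataset/tables/my_table",
        data_format=types.DataFormat.ARROW,
    ),
    # Ask the server for at most 48 streams instead of 0 ("server decides"),
    # which for very large results can come back as 1000.
    max_stream_count=48,
)

print(f"Server granted {len(session.streams)} read streams")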

Environment details

  • OS type and version: Linux 64-bit
  • Python version: 3.8
  • google-cloud-bigquery version: 2.31.0
@product-auto-label bot added the api: bigquery label on May 13, 2022
@yoshi-automation added the triage me and 🚨 This issue needs some love. labels on May 14, 2022
@parthea added the type: bug and priority: p2 labels and removed the 🚨 This issue needs some love. and triage me labels on May 19, 2022
@tswast
Contributor

tswast commented Jun 22, 2022

I wonder if removing the default pool size would be a sufficient fix?

with concurrent.futures.ThreadPoolExecutor(max_workers=total_streams) as pool:

Alternatively (preferably?) we could set requested_streams to max(requested_streams, some multiple of the number of available cores).

requested_streams = 1 if preserve_order else 0
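
A rough sketch of that second option, assuming a hypothetical helper _choose_requested_streams and a multiplier of 2 purely as an example (not a tested value):

import os

def _choose_requested_streams(preserve_order: bool) -> int:
    # Ordering requires reading from a single stream.
    if preserve_order:
        return 1
    # Request a bounded stream count tied to the local core count instead of
    # 0 ("server decides"), which can come back as 1000 streams.
    return 2 * (os.cpu_count() or 1)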

@shollyman
Contributor

The other option here is something like a bag of tasks, where we bound concurrent work via a semaphore but still allow for a large number of streams. It requires more concurrency coordination, which admittedly isn't Python's strong suit, but it would prevent overwhelming the client.
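
A minimal sketch of that shape, with read_stream() standing in for the real per-stream download work; here the bound comes from a small fixed worker pool draining the bag of stream tasks (a semaphore around independently created readers would give the same bound):

import concurrent.futures

MAX_ACTIVE_STREAMS = 48  # assumed bound; tune to the client machine

def download_all(streams, read_stream):
    # All streams the server returned (possibly 1000) go into the bag, but
    # only MAX_ACTIVE_STREAMS of them are downloaded concurrently.
    with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_ACTIVE_STREAMS) as pool:
        # map() yields results in the order of the input streams.
        return list(pool.map(read_stream, streams))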

@chalmerlowe added the priority: p3 label and removed the priority: p2 label on Aug 17, 2023
@kien-truong

Maybe instead of trying to guess the maximum number of concurrent streams a system can support, it's simpler to just let the user override max_stream_count when the default of 0 doesn't work for them.

#2030
