-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BigQuery: Add timeout parameter to to_dataframe() #7612
Comments
Just a minor note; this is a problem also when pandas doesn't actually stall but just continues running. In a test run yesterday the function completed, but it took over an hour to process. When this happens some capacity to check in advance or to tell the user why the program suddenly stopped responding or a timeout of some sort will be very helpful. I'd also wonder if the timeout error should include links to where and how one downloads a csv from bigquery, as doing this and reading the csv into pandas seems orders of magnitude faster for large results than converting the json query results locally. Thanks for adding this to your issue list! |
You're in luck. There's an unreleased feature that was literally just added (#7552) to display a progress bar while a DataFrame downloads.
Have you tried the BigQuery Storage API, yet? It should be much faster to download large result sets. See this guide for how to use the BigQuery Storage API with |
Will this actually help in this instance? I'm a bit unclear on when the downloading occurs vs. the processing. but as I understood it the download is actually completing fine, it's the processing of the json data into a dataframe that's taking so long. As such, will the download bars show up when the to_dataframe function is running? I just upgraded and tested again and didn't get any new behavior, but I'm also limited to conda upgrades behind the corporate firewall (no pip) so not sure it's actually the new library. Either way, it sounds to me like the real solution is indeed the link you sent. It seems the approach I've been using is only fitting for queries with much smaller returns and there's not a clear way in advance to know which will be which. I'll work to switch my usages to the other approach; thanks! |
I've done some profiling, and while JSON parsing does take a while, even downloading the query results without parsing them is quite slow compared to using the BigQuery Storage API.
Yes, that's a problem that I'm addressing in #7633, which will prevent the client from erroring on small result sets with a
I haven't released the pip package yet, and the conda package update always follows that release. I'm hoping to have a new version by the end of the week. |
@tswast to clarify - I heard some details from @tomshaffner on an internal bug last week, and it sounds like the query is completing in a reasonable time, but the results are sometimes just too big to be held in a DataFrame in memory, so we need a way of gracefully handling this case. |
I assume #7339 would handle that case, so that one page can be downloaded at a time? |
This might be true, but the issue I was hitting was actually the processing one. Early on I thought it might be a memory limitation (and once since then it has been) but it's more about the time.
I've now switched to an implementation of this approach. Interestingly I now have a similar problem there. I can create a new table with the query using the storage approach but when I then execute it I basically stop getting a response until it's done and downloaded. Is there a way to monitor progress at all with that? I'm using the "Download query results" approach from that page (as opposed to the Table download). It does indeed seem to be faster, but I'm still left with blocks of time where basically nothing happens and I have to hope it's still processing and will finish in a reasonable amount of time. |
Not yet, unfortunately. The BigQuery Storage API is faster, but also harder to implement a progress bar. I have some ideas on how this can be done: #7654. I agree that it's frustrating to have output completely hang while the data downloads. |
I actually ended up implementing a third approach that creates a temporary table (as in the storage example), exports it to a storage bucket as a compressed csv, combines the shards, and downloads the result. In this approach I'm able to track query progess on both the original query and the export query, the only parts I can't track are the download (though that goes much faster now with compression so not a huge issue) and the pandas import. In the short term this is more complicated but actually seems like the most robust solution across all sizes of query results. Also of note, based on my testing with this new approach I think some fair percentage of the slowness I was experiencing originally was actually just pandas based; the import of the results now seem like the slowest part by far. As such, my apologies for some misdirected frustration there. It strikes me that a compressed csv download option from the storage library might thus be a nice additional feature to have as it would effectively solve much of the above problems using a solution similar to the one I'm now using without having to mess with buckets and the sharding/combining again. It would also allow you to separate out instances where the BigQuery/download is going slowly from those where Pandas is the issue. Think my use cases is now covered for the moment, but thought I'd leave a note that a feature like this might be really useful to help deal with this problem in the future! Thanks again for all the support! |
This project is more complex than I originally thought. Timeouts for both
I propose we close this FR as "Won't Implement", as it will add more to our library maintenance burden than I originally estimated. The workaround is to call |
Opening this feature request for discussion. There is currently no way to provide a timeout for cases when
to_dataframe()
continues to run and eventually stalls due to a query returning results that are too large for a pandas DataFrame.A timeout given to
QueryJob.to_dataframe
should probably pass this timeout to theresult()
function, and use the remaining time to construct the DataFrame.There is some ambiguity in how to handle
RowIterator.to_dataframe
because it does not callresult()
, so there are two separate timeouts that can be given:client.query(sql).result(timeout=10).to_dataframe(timeout=10)
It would likely be confusing to users that the timeout given to
to_dataframe()
will apply to both the query job and the DataFrame construction when given to aQueryJob
, but will only apply to the DataFrame construction when given to aRowIterator
.The text was updated successfully, but these errors were encountered: