BigQuery: Add timeout parameter to to_dataframe() #7612

Closed
alixhami opened this issue Mar 29, 2019 · 10 comments
Assignees
Labels
api: bigquery Issues related to the BigQuery API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@alixhami
Contributor

Opening this feature request for discussion. There is currently no way to provide a timeout for cases when to_dataframe() continues to run and eventually stalls due to a query returning results that are too large for a pandas DataFrame.

A timeout given to QueryJob.to_dataframe should probably pass this timeout to the result() function, and use the remaining time to construct the DataFrame.

There is some ambiguity in how to handle RowIterator.to_dataframe because it does not call result(), so there are two separate timeouts that can be given:
client.query(sql).result(timeout=10).to_dataframe(timeout=10)

It would likely be confusing to users that the timeout given to to_dataframe() will apply to both the query job and the DataFrame construction when given to a QueryJob, but will only apply to the DataFrame construction when given to a RowIterator.
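As a rough illustration only (the helper name and the to_dataframe timeout are hypothetical; only result(timeout=...) exists today), the QueryJob path could split the budget something like this:

```python
import time

# Hypothetical sketch: how QueryJob.to_dataframe(timeout=...) could split its
# budget between waiting for the query and building the DataFrame.
def to_dataframe_with_timeout(query_job, timeout):
    start = time.monotonic()
    rows = query_job.result(timeout=timeout)          # wait for the query itself
    remaining = timeout - (time.monotonic() - start)  # leftover budget
    if remaining <= 0:
        raise TimeoutError("query consumed the entire timeout budget")
    # RowIterator.to_dataframe(timeout=remaining) would then need to enforce
    # the rest of the budget across page downloads and DataFrame construction.
    return rows.to_dataframe()
```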

@tswast tswast added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. api: bigquery Issues related to the BigQuery API. labels Mar 29, 2019
@tomshaffner

Just a minor note: this is also a problem when pandas doesn't actually stall but simply keeps running. In a test run yesterday the function completed, but it took over an hour to process. When this happens, some way to check in advance, to tell the user why the program suddenly stopped responding, or a timeout of some sort would be very helpful.

I'd also wonder whether the timeout error should include links explaining where and how to download a CSV from BigQuery, since doing that and reading the CSV into pandas seems orders of magnitude faster for large results than converting the JSON query results locally.

Thanks for adding this to your issue list!

@tswast
Contributor

tswast commented Mar 29, 2019

some capacity to check in advance or to tell the user why the program suddenly stopped responding

You're in luck. There's an unreleased feature that was literally just added (#7552) to display a progress bar while a DataFrame downloads.
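If memory serves, once that release lands the usage will look roughly like this (the progress bar requires the tqdm package to be installed; the query is just a placeholder):

```python
from google.cloud import bigquery

client = bigquery.Client()
query_job = client.query(
    "SELECT name, number FROM `bigquery-public-data.usa_names.usa_1910_current`"
)

# progress_bar_type is the option added in #7552; it needs tqdm installed.
df = query_job.to_dataframe(progress_bar_type="tqdm")
```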

reading the csv into pandas seems orders of magnitude faster

Have you tried the BigQuery Storage API yet? It should be much faster to download large result sets. See this guide for instructions on how to use the BigQuery Storage API with to_dataframe.
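Roughly, following that guide (the query is just a placeholder; this assumes the google-cloud-bigquery-storage package is installed):

```python
from google.cloud import bigquery
from google.cloud import bigquery_storage_v1beta1

client = bigquery.Client()
bqstorage_client = bigquery_storage_v1beta1.BigQueryStorageClient()

# Passing a BigQuery Storage API client makes to_dataframe() download the
# result set over the faster streaming API instead of tabledata.list.
df = (
    client.query("SELECT * FROM `bigquery-public-data.usa_names.usa_1910_current`")
    .result()
    .to_dataframe(bqstorage_client=bqstorage_client)
)
```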

@tomshaffner

You're in luck. There's an unreleased feature that was literally just added (#7552) to display a progress bar while a DataFrame downloads.

Will this actually help in this instance? I'm a bit unclear on when the downloading happens vs. the processing, but as I understood it the download is actually completing fine; it's the processing of the JSON data into a DataFrame that's taking so long. As such, will the progress bars show up while the to_dataframe function is running? I just upgraded and tested again and didn't get any new behavior, but I'm also limited to conda upgrades behind the corporate firewall (no pip), so I'm not sure it's actually the new library.

Either way, it sounds to me like the real solution is indeed the link you sent. It seems the approach I've been using is only suited to queries with much smaller results, and there's no clear way to know in advance which will be which. I'll work on switching my usages to the other approach; thanks!

@tswast
Contributor

tswast commented Apr 2, 2019

As I understood it the download is actually completing fine; it's the processing of the JSON data into a DataFrame that's taking so long.

I've done some profiling, and while JSON parsing does take a while, even downloading the query results without parsing them is quite slow compared to using the BigQuery Storage API.

It seems the approach I've been using is only suited to queries with much smaller results, and there's no clear way to know in advance which will be which.

Yes, that's a problem that I'm addressing in #7633, which will prevent the client from erroring on small result sets when a bqstorage_client is provided.

I'm also limited to conda upgrades behind the corporate firewall (no pip), so I'm not sure it's actually the new library.

I haven't released the pip package yet, and the conda package update always follows that release. I'm hoping to have a new version by the end of the week.

@alixhami
Contributor Author

alixhami commented Apr 2, 2019

@tswast to clarify - I heard some details from @tomshaffner on an internal bug last week, and it sounds like the query is completing in a reasonable time, but the results are sometimes just too big to be held in a DataFrame in memory, so we need a way of gracefully handling this case.

@tswast
Contributor

tswast commented Apr 2, 2019

The results are sometimes just too big to be held in a DataFrame in memory, so we need a way of gracefully handling this case.

I assume #7339 would handle that case, so that one page can be downloaded at a time?

@tomshaffner

tomshaffner commented Apr 3, 2019

@tswast to clarify - I heard some details from @tomshaffner on an internal bug last week, and it sounds like the query is completing in a reasonable time, but the results are sometimes just too big to be held in a DataFrame in memory, so we need a way of gracefully handling this case.

This might be true, but the issue I was hitting was actually the processing one. Early on I thought it might be a memory limitation (and once since then it has been), but it's mostly about the time.

Have you tried the BigQuery Storage API yet? It should be much faster to download large result sets. See this guide for instructions on how to use the BigQuery Storage API with to_dataframe.

I've now switched to an implementation of this approach. Interestingly, I now have a similar problem there. I can create a new table with the query using the storage approach, but when I then execute it I basically stop getting a response until it's done and downloaded. Is there any way to monitor progress with that? I'm using the "Download query results" approach from that page (as opposed to the table download). It does indeed seem to be faster, but I'm still left with blocks of time where basically nothing happens and I have to hope it's still processing and will finish in a reasonable amount of time.

@tswast
Contributor

tswast commented Apr 3, 2019

I basically stop getting a response until it's done and downloaded. Is there a way to monitor progress at all with that?

Not yet, unfortunately. The BigQuery Storage API is faster, but it's also harder to implement a progress bar for. I have some ideas on how this can be done: #7654. I agree that it's frustrating to have output completely hang while the data downloads.

@tomshaffner

I actually ended up implementing a third approach that creates a temporary table (as in the storage example), exports it to a storage bucket as a compressed CSV, combines the shards, and downloads the result. With this approach I can track progress on both the original query and the export job; the only parts I can't track are the download (though that now goes much faster with compression, so it's not a huge issue) and the pandas import. In the short term this is more complicated, but it actually seems like the most robust solution across all sizes of query results.

Also of note: based on my testing with this new approach, I think a fair percentage of the slowness I experienced originally was actually just pandas; the import of the results now seems like the slowest part by far. As such, my apologies for some misdirected frustration there.
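For reference, a trimmed-down sketch of that flow (the bucket, dataset, and table names are placeholders on my side; it assumes the query results were already written to a destination table):

```python
import pandas as pd
from google.cloud import bigquery, storage

bq = bigquery.Client()
gcs = storage.Client()

# Assumes the query was run with a destination table, e.g. my_dataset.tmp_results.
table_ref = bq.dataset("my_dataset").table("tmp_results")

# Export to GCS as sharded, gzip-compressed CSV files; the extract job can be
# polled for progress like any other job.
extract_config = bigquery.ExtractJobConfig(destination_format="CSV", compression="GZIP")
bq.extract_table(
    table_ref, "gs://my-bucket/tmp_results-*.csv.gz", job_config=extract_config
).result()

# Download each shard and combine everything into a single DataFrame.
frames = []
for blob in gcs.bucket("my-bucket").list_blobs(prefix="tmp_results-"):
    blob.download_to_filename(blob.name)
    frames.append(pd.read_csv(blob.name, compression="gzip"))
df = pd.concat(frames, ignore_index=True)
```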

It strikes me that a compressed CSV download option from the storage library might thus be a nice additional feature, as it would effectively solve many of the problems above with a solution similar to the one I'm now using, without having to mess with buckets and the sharding/combining again. It would also let you separate out instances where BigQuery or the download is slow from those where pandas is the issue. I think my use case is covered for the moment, but I thought I'd leave a note that a feature like this might be really useful for dealing with this problem in the future!

Thanks again for all the support!

@alixhami alixhami removed their assignment Jun 12, 2019
@tswast tswast assigned HemangChothani and unassigned tswast Oct 11, 2019
@tswast
Contributor

tswast commented Oct 16, 2019

This project is more complex than I originally thought. Timeouts need to be accounted for in both the tabledata.list download and the BQ Storage API download.

tabledata.list requires keeping track of time across multiple next-page requests. We could do something similar with the BQ Storage API, but since it's a streaming gRPC API, I foresee problems with closing the connections and cleaning up memory usage.

I propose we close this FR as "Won't Implement", as it will add more to our library maintenance burden than I originally estimated.

The workaround is to call result(timeout=my_timeout) and then fetch a single page at a time rather than calling to_dataframe().
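Roughly (the query and the 60-second budget are just placeholders):

```python
import time
from google.cloud import bigquery

client = bigquery.Client()
budget = 60  # overall time budget in seconds

start = time.monotonic()
rows = client.query(
    "SELECT name FROM `bigquery-public-data.usa_names.usa_1910_current`"
).result(timeout=budget)

# Fetch one page at a time instead of calling to_dataframe(), enforcing the
# remaining budget across page requests ourselves.
total = 0
for page in rows.pages:
    if time.monotonic() - start > budget:
        raise TimeoutError("gave up fetching result pages")
    total += page.num_items  # replace with per-page processing as needed

print(total, "rows fetched")
```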

@tswast tswast closed this as completed Oct 16, 2019