BigQuery: Add timeout parameter to to_dataframe() #7612

Closed
alixhami opened this issue Mar 29, 2019 · 10 comments
Assignees
Labels
api: bigquery Issues related to the BigQuery API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@alixhami
Contributor

Opening this feature request for discussion. There is currently no way to provide a timeout for cases when to_dataframe() continues to run and eventually stalls due to a query returning results that are too large for a pandas DataFrame.

A timeout given to QueryJob.to_dataframe should probably pass this timeout to the result() function, and use the remaining time to construct the DataFrame.

There is some ambiguity in how to handle RowIterator.to_dataframe because it does not call result(), so there are two separate timeouts that can be given:
client.query(sql).result(timeout=10).to_dataframe(timeout=10)

It would likely be confusing to users that the timeout given to to_dataframe() will apply to both the query job and the DataFrame construction when given to a QueryJob, but will only apply to the DataFrame construction when given to a RowIterator.
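As a rough illustration only (the helper name and the to_dataframe timeout are hypothetical; only result(timeout=...) exists today), the QueryJob path could split the budget something like this:

```python
import time

# Hypothetical sketch: how QueryJob.to_dataframe(timeout=...) could split its
# budget between waiting for the query and building the DataFrame.
def to_dataframe_with_timeout(query_job, timeout):
    start = time.monotonic()
    rows = query_job.result(timeout=timeout)          # wait for the query itself
    remaining = timeout - (time.monotonic() - start)  # leftover budget
    if remaining <= 0:
        raise TimeoutError("query consumed the entire timeout budget")
    # RowIterator.to_dataframe(timeout=remaining) would then need to enforce
    # the rest of the budget across page downloads and DataFrame construction.
    return rows.to_dataframe()
```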

@tswast tswast added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. api: bigquery Issues related to the BigQuery API. labels Mar 29, 2019
@tomshaffner

Just a minor note: this is also a problem when pandas doesn't actually stall but simply keeps running. In a test run yesterday the function completed, but it took over an hour to process. When this happens, some way to check in advance, to tell the user why the program suddenly stopped responding, or a timeout of some sort would be very helpful.

I'd also wonder whether the timeout error should include links explaining where and how to download a CSV from BigQuery, since doing that and reading the CSV into pandas seems orders of magnitude faster for large results than converting the JSON query results locally.

Thanks for adding this to your issue list!

@tswast
Contributor

tswast commented Mar 29, 2019

some capacity to check in advance or to tell the user why the program suddenly stopped responding

You're in luck. There's an unreleased feature that was literally just added (#7552) to display a progress bar while a DataFrame downloads.
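If memory serves, once that release lands the usage will look roughly like this (the progress bar requires the tqdm package to be installed; the query is just a placeholder):

```python
from google.cloud import bigquery

client = bigquery.Client()
query_job = client.query(
    "SELECT name, number FROM `bigquery-public-data.usa_names.usa_1910_current`"
)

# progress_bar_type is the option added in #7552; it needs tqdm installed.
df = query_job.to_dataframe(progress_bar_type="tqdm")
```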

reading the csv into pandas seems orders of magnitude faster

Have you tried the BigQuery Storage API yet? It should be much faster to download large result sets. See this guide for instructions on how to use the BigQuery Storage API with to_dataframe.
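Roughly, following that guide (the query is just a placeholder; this assumes the google-cloud-bigquery-storage package is installed):

```python
from google.cloud import bigquery
from google.cloud import bigquery_storage_v1beta1

client = bigquery.Client()
bqstorage_client = bigquery_storage_v1beta1.BigQueryStorageClient()

# Passing a BigQuery Storage API client makes to_dataframe() download the
# result set over the faster streaming API instead of tabledata.list.
df = (
    client.query("SELECT * FROM `bigquery-public-data.usa_names.usa_1910_current`")
    .result()
    .to_dataframe(bqstorage_client=bqstorage_client)
)
```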

@tomshaffner

You're in luck. There's an unreleased feature that was literally just added (#7552) to display a progress bar while a DataFrame downloads.

Will this actually help in this instance? I'm a bit unclear on when the downloading happens vs. the processing, but as I understood it the download is actually completing fine; it's the processing of the JSON data into a DataFrame that's taking so long. As such, will the progress bars show up while the to_dataframe function is running? I just upgraded and tested again and didn't get any new behavior, but I'm also limited to conda upgrades behind the corporate firewall (no pip), so I'm not sure it's actually the new library.

Either way, it sounds to me like the real solution is indeed the link you sent. It seems the approach I've been using is only suited to queries with much smaller results, and there's no clear way to know in advance which will be which. I'll work on switching my usages to the other approach; thanks!

@tswast
Contributor

tswast commented Apr 2, 2019

As I understood it the download is actually completing fine; it's the processing of the JSON data into a DataFrame that's taking so long.

I've done some profiling, and while JSON parsing does take a while, even downloading the query results without parsing them is quite slow compared to using the BigQuery Storage API.

It seems the approach I've been using is only suited to queries with much smaller results, and there's no clear way to know in advance which will be which.

Yes, that's a problem that I'm addressing in #7633, which will prevent the client from erroring on small result sets when a bqstorage_client is provided.

I'm also limited to conda upgrades behind the corporate firewall (no pip), so I'm not sure it's actually the new library.

I haven't released the pip package yet, and the conda package update always follows that release. I'm hoping to have a new version by the end of the week.

@alixhami
Contributor Author

alixhami commented Apr 2, 2019

@tswast to clarify - I heard some details from @tomshaffner on an internal bug last week, and it sounds like the query is completing in a reasonable time, but the results are sometimes just too big to be held in a DataFrame in memory, so we need a way of gracefully handling this case.

@tswast
Contributor

tswast commented Apr 2, 2019

The results are sometimes just too big to be held in a DataFrame in memory, so we need a way of gracefully handling this case.

I assume #7339 would handle that case, so that one page can be downloaded at a time?

@tomshaffner

tomshaffner commented Apr 3, 2019

@tswast to clarify - I heard some details from @tomshaffner on an internal bug last week, and it sounds like the query is completing in a reasonable time, but the results are sometimes just too big to be held in a DataFrame in memory, so we need a way of gracefully handling this case.

This might be true, but the issue I was hitting was actually the processing one. Early on I thought it might be a memory limitation (and once since then it has been), but it's mostly about the time.

Have you tried the BigQuery Storage API yet? It should be much faster to download large result sets. See this guide for instructions on how to use the BigQuery Storage API with to_dataframe.

I've now switched to an implementation of this approach. Interestingly, I now have a similar problem there. I can create a new table with the query using the storage approach, but when I then execute it I basically stop getting a response until it's done and downloaded. Is there any way to monitor progress with that? I'm using the "Download query results" approach from that page (as opposed to the table download). It does indeed seem to be faster, but I'm still left with blocks of time where basically nothing happens and I have to hope it's still processing and will finish in a reasonable amount of time.

@tswast
Contributor

tswast commented Apr 3, 2019

I basically stop getting a response until it's done and downloaded. Is there a way to monitor progress at all with that?

Not yet, unfortunately. The BigQuery Storage API is faster, but it's also harder to implement a progress bar for. I have some ideas on how this can be done: #7654. I agree that it's frustrating to have output completely hang while the data downloads.

@tomshaffner

I actually ended up implementing a third approach that creates a temporary table (as in the storage example), exports it to a storage bucket as a compressed CSV, combines the shards, and downloads the result. With this approach I can track progress on both the original query and the export job; the only parts I can't track are the download (though that now goes much faster with compression, so it's not a huge issue) and the pandas import. In the short term this is more complicated, but it actually seems like the most robust solution across all sizes of query results.

Also of note: based on my testing with this new approach, I think a fair percentage of the slowness I experienced originally was actually just pandas; the import of the results now seems like the slowest part by far. As such, my apologies for some misdirected frustration there.
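For reference, a trimmed-down sketch of that flow (the bucket, dataset, and table names are placeholders on my side; it assumes the query results were already written to a destination table):

```python
import pandas as pd
from google.cloud import bigquery, storage

bq = bigquery.Client()
gcs = storage.Client()

# Assumes the query was run with a destination table, e.g. my_dataset.tmp_results.
table_ref = bq.dataset("my_dataset").table("tmp_results")

# Export to GCS as sharded, gzip-compressed CSV files; the extract job can be
# polled for progress like any other job.
extract_config = bigquery.ExtractJobConfig(destination_format="CSV", compression="GZIP")
bq.extract_table(
    table_ref, "gs://my-bucket/tmp_results-*.csv.gz", job_config=extract_config
).result()

# Download each shard and combine everything into a single DataFrame.
frames = []
for blob in gcs.bucket("my-bucket").list_blobs(prefix="tmp_results-"):
    blob.download_to_filename(blob.name)
    frames.append(pd.read_csv(blob.name, compression="gzip"))
df = pd.concat(frames, ignore_index=True)
```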

It strikes me that a compressed CSV download option from the storage library might thus be a nice additional feature, as it would effectively solve many of the problems above with a solution similar to the one I'm now using, without having to mess with buckets and the sharding/combining again. It would also let you separate out instances where BigQuery or the download is slow from those where pandas is the issue. I think my use case is covered for the moment, but I thought I'd leave a note that a feature like this might be really useful for dealing with this problem in the future!

Thanks again for all the support!

@alixhami alixhami removed their assignment Jun 12, 2019
@tswast tswast assigned HemangChothani and unassigned tswast Oct 11, 2019
@tswast
Contributor

tswast commented Oct 16, 2019

This project is more complex than I originally thought. Timeouts need to be accounted for in both the tabledata.list download and the BQ Storage API download.

tabledata.list requires keeping track of time across multiple next-page requests. We could do something similar with the BQ Storage API, but since it's a streaming gRPC API, I foresee problems with closing the connections and cleaning up memory usage.

I propose we close this FR as "Won't Implement", as it will add more to our library maintenance burden than I originally estimated.

The workaround is to call result(timeout=my_timeout) and then fetch a single page at a time rather than calling to_dataframe().
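Roughly (the query and the 60-second budget are just placeholders):

```python
import time
from google.cloud import bigquery

client = bigquery.Client()
budget = 60  # overall time budget in seconds

start = time.monotonic()
rows = client.query(
    "SELECT name FROM `bigquery-public-data.usa_names.usa_1910_current`"
).result(timeout=budget)

# Fetch one page at a time instead of calling to_dataframe(), enforcing the
# remaining budget across page requests ourselves.
total = 0
for page in rows.pages:
    if time.monotonic() - start > budget:
        raise TimeoutError("gave up fetching result pages")
    total += page.num_items  # replace with per-page processing as needed

print(total, "rows fetched")
```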

@tswast tswast closed this as completed Oct 16, 2019