-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor to_dataframe
deterministicly update progress bar.
#8303
Conversation
Previously, a background thread was used to collect progress bar updates from worker threads. So as not to block downloads for progress bar updates, `put_nowait` was used to make progress bar updates. Missed writes to the progress bar were ignored. This caused non-deterministic progress bar updates and test flakiness. Now, worker threads push dataframes to the queue, and the return values for `download_dataframe_bqstorage` and `download_dataframe_tabledata_list` have been updated to return an iterable of pandas DataFrame objects instead of a single DataFrame. This allows progress bar updates to be done independently of which underlying API is used to download the DataFrames. Also, the logic for working with pandas has been moved to the `_pandas_helpers` module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes generally seem fine to me, but I did leave a few comments that might be relevant - please check, just in case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, and thanks for the additional explanation of the design decisions!
Will wait if the other reviewers have something else to add.
# prevents the queue from filling up, because the main thread | ||
# has smaller gaps in time between calls to the queue's get | ||
# method. For a detailed explaination, see: | ||
# https://friendliness.dev/2019/06/18/python-nowait/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great. 👍
Previously, a background thread was used to collect progress bar updates
from worker threads. So as not to block downloads for progress bar
updates,
put_nowait
was used to make progress bar updates. Missedwrites to the progress bar were ignored. This caused non-deterministic
progress bar updates and test flakiness.
Now, worker threads push dataframes to the queue, and the return values
for
download_dataframe_bqstorage
anddownload_dataframe_tabledata_list
have been updated to return aniterable of pandas DataFrame objects instead of a single DataFrame. This
allows progress bar updates to be done independently of which underlying
API is used to download the DataFrames.
Also, the logic for working with pandas has been moved to the
_pandas_helpers
module.Closes #8175