refactor(bigquerystorage): to_dataframe on an arrow stream uses faster to_arrow + to_pandas, internally #9997
Conversation
…to_arrow + to_pandas, internally
Towards https://issuetracker.google.com/140579733
Commits 2009c86 to ef291eb
Looks good to me code-wise, but I still need to verify the performance gain on a "big" table (ran out of time in this review pass 😃 ).
Unfortunately, I cannot confirm the speedup; the timings with and without this fix are about the same on my machine (i7, 16 GB RAM, ~25 Mbps network). This is probably related to the fact that I also wasn't able to reproduce the speed difference in the first place using the sample from the issue description (tested on a 2x 10e6 table with random floats). @tswast What data did you use in testing? Maybe I can try with that ...
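A minimal sketch of that synthetic table, assuming "2x 10e6" means two columns of 10e6 random floats; the column names and the local numpy construction are my assumptions, since the actual test read the table through the BigQuery Storage API:

import numpy as np
import pandas as pd

# Hypothetical reconstruction of the test data described above: two
# columns of uniformly random floats, 10e6 rows each.
num_rows = int(10e6)
frame = pd.DataFrame({
    "a": np.random.rand(num_rows),
    "b": np.random.rand(num_rows),
})
print(frame.shape)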
I'm getting a more modest speed-up than I was getting this past summer, too. It's at least some speedup, though. The taxi cab data has a greater variety of data types. I wonder if the latest pandas does a better job of combining floating point columns with concatenate now? From a VM in us-central:
benchmark_bqstorage.py

import sys

from google.cloud import bigquery_storage_v1beta1

# TODO(developer): Set the project_id variable.
project_id = "swast-scratch"
#
# The read session is created in this project. This project can be
# different from that which contains the table.

client = bigquery_storage_v1beta1.BigQueryStorageClient()

# Read the NYC green taxi benchmark table; the command-line argument
# selects which sampled table to fetch (e.g. 6 -> tlc_green_6pct).
table_ref = bigquery_storage_v1beta1.types.TableReference()
table_ref.project_id = "swast-scratch"
table_ref.dataset_id = "to_dataframe_benchmark"
table_ref.table_id = "tlc_green_{}pct".format(sys.argv[1])

parent = "projects/{}".format(project_id)
session = client.create_read_session(
    table_ref,
    parent,
    format_=bigquery_storage_v1beta1.enums.DataFormat.ARROW,
    sharding_strategy=bigquery_storage_v1beta1.enums.ShardingStrategy.LIQUID,
    requested_streams=1,
)  # API request.

reader = client.read_rows(
    bigquery_storage_v1beta1.types.StreamPosition(stream=session.streams[0])
)
dataframe = reader.rows(session).to_dataframe()
print("Got {} rows.".format(len(dataframe.index)))

After this change:
Before this change:
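For context on where the speedup comes from: the refactoring swaps per-batch DataFrame concatenation for a single Arrow-to-pandas conversion. A rough sketch of the two strategies, with hypothetical helper names rather than the library's actual internals:

import pandas
import pyarrow

def to_dataframe_via_concat(record_batches):
    # Old approach (sketch): one DataFrame per Arrow record batch,
    # then a concat that copies every column again.
    frames = [batch.to_pandas() for batch in record_batches]
    return pandas.concat(frames, ignore_index=True)

def to_dataframe_via_arrow(record_batches):
    # New approach (sketch): stitch the batches into one
    # pyarrow.Table, then convert to pandas in a single pass.
    table = pyarrow.Table.from_batches(record_batches)
    return table.to_pandas()

Building one pyarrow.Table first means the pandas conversion sees each column once, rather than re-copying every column during pandas.concat.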
I've updated the PR title to remove the "2x", as it's a more modest 1.126x speedup.
@tswast Thanks for the data, and the benchmarking results! I was finally able to reproduce a significant difference pretty consistently (~33 seconds vs. ~35 seconds for the 6pct test, for example). As this was much slower than your results, it became apparent that network I/O dominated the timings on my end (I'm on a 50 Mbps fiber optic connection). By shutting down everything that might have affected the average download speed, I was able to bring the variance down enough not to lose the signal in the noise (I didn't initially expect such measures would be necessary, based on the reported 2x gains 🙂).
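A minimal way to take such end-to-end timings (an assumption on my part; not necessarily how the numbers above were collected), reusing reader and session from benchmark_bqstorage.py:

import time

# Wall-clock the final fetch-and-convert step of the benchmark.
start = time.perf_counter()
dataframe = reader.rows(session).to_dataframe()
elapsed = time.perf_counter() - start
print("Got {} rows in {:.1f} seconds.".format(len(dataframe.index), elapsed))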
The speedup is indeed noticeable 👍
@tswast using …
Yes, that's true. We should see similar performance for …