Description
Environment details
- GCP VM instance: c4a-highcpu-72
- OS type and version: Linux, 6.1.0-31-cloud-arm64
- Python version: Python 3.11.2
- pip version: pip 23.0.1
- `google-cloud-storage` version: 2.19.0 & 3.1.0
Steps to reproduce
When comparing download performance between the Google Cloud Storage Python client and boto3 (AWS SDK), we observed that the GCS client is significantly slower (about 50% slower) than boto3 when downloading the same objects stored in a GCS bucket.
GCS Client Implementation (Two methods tested)
- Using `blob.download_as_bytes()`:

  ```python
  blob = bucket.blob(key)
  data = blob.download_as_bytes()
  ```
- Using `blob.open()` (10%-30% faster than method 1, but still ~50% slower than boto3):

  ```python
  with blob.open("rb") as f:
      data = f.read()
  ```
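For completeness, here is a minimal sketch of the client setup that the two snippets above assume; the bucket name and object key are placeholders, not the ones used in the benchmark:

```python
# Minimal setup assumed by the snippets above.
# BUCKET_NAME and key are placeholder values for illustration.
from google.cloud import storage

BUCKET_NAME = "my-benchmark-bucket"
key = "objects/0000.bin"

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

# Method 1: single call that buffers the whole object in memory.
blob = bucket.blob(key)
data = blob.download_as_bytes()

# Method 2: file-like streaming reader over the same object.
with bucket.blob(key).open("rb") as f:
    data = f.read()
```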
boto3 Implementation
```python
response = s3_client.get_object(Bucket=bucket_name, Key=key)
data = response['Body'].read()
```
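For reference, a sketch of how the `s3_client` can be constructed against the GCS XML (S3-compatible) endpoint using HMAC credentials; the environment variable names below are placeholders, and the exact setup is in the linked benchmark scripts:

```python
# Sketch: boto3 client pointed at the GCS S3-compatible (XML API) endpoint.
# The HMAC credential environment variable names are placeholders.
import os

import boto3

s3_client = boto3.client(
    "s3",
    endpoint_url="https://storage.googleapis.com",
    aws_access_key_id=os.environ["GCS_HMAC_ACCESS_KEY"],
    aws_secret_access_key=os.environ["GCS_HMAC_SECRET"],
)
```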
Performance Results
With `ThreadPoolExecutor(max_workers=16)`, I got the following average throughput downloading 64 MB x 1000 objects from a GCS bucket into memory (a simplified sketch of the harness follows the results):
- boto3 `get_object`: 12 Gbps
- GCS `download_as_bytes()`: 3.2 Gbps in 2.19.0 & 4.2 Gbps in 3.1.0
- GCS `blob.open()`: 4.5 Gbps in both 2.19.0 & 3.1.0
Questions
- Is this performance gap expected?
- Are there any recommended optimizations or best practices for improving download performance with the GCS Python client?
- Are there any internal differences in how downloads are handled by the GCS S3-compatible API versus the native client libraries that might explain the performance gap?
Additional Context
- We've tried various optimizations (sketched after this list), including:
  - Using `raw_download=True`
  - Configuring connection pools (Ref: https://stackoverflow.com/questions/52653409/increase-connection-pool-size)
  - Using different chunk sizes in `blob.open("rb", chunk_size=xxx)`
- The performance gap remains consistent across multiple test runs
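For concreteness, the optimizations above look roughly like this (a sketch of what we tried; the pool sizes and chunk size here are example values, not the exact ones from the benchmark runs):

```python
import requests

import google.auth
from google.auth.transport.requests import AuthorizedSession
from google.cloud import storage

# Larger urllib3 connection pool, per the Stack Overflow reference above.
# Pool sizes are example values, not tuned recommendations.
credentials, project = google.auth.default()
session = AuthorizedSession(credentials)
adapter = requests.adapters.HTTPAdapter(pool_connections=64, pool_maxsize=64)
session.mount("https://", adapter)
client = storage.Client(project=project, _http=session)

# Placeholder bucket/object names for illustration.
blob = client.bucket("my-benchmark-bucket").blob("objects/0000.bin")

# raw_download=True returns the stored bytes without decompressive transcoding.
data = blob.download_as_bytes(raw_download=True)

# Explicit chunk size when streaming via blob.open(); 16 MiB is just an example.
with blob.open("rb", chunk_size=16 * 1024 * 1024) as f:
    data = f.read()
```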
Benchmarking scripts are available at https://github.com/dreamtalen/boto3-benchmark/tree/main/google-cloud-storage
Thanks!