Description
Environment
Delta-rs version: 0.10.2
Environment:
- Cloud provider: Google Cloud Storage
- Other: Python 3.10
Bug
What happened:
I'm not sure whether this is a bug, but it was recommended on Stack Overflow that I post the issue here: https://stackoverflow.com/questions/77639348/delta-rs-package-incurs-high-costs-on-gcs/77681169#77681169.
I'm using the package to store data in a Google Cloud Storage dual-region bucket, with the following code:
```python
from typing import Any, Generator

import pyarrow as pa
from deltalake import write_deltalake


def save_data(self, df: Generator[pa.RecordBatch, Any, None]):
    # df_schema and self.max_rows_per_file are defined elsewhere in the class.
    write_deltalake(
        "gs://<my-bucket-name>",
        df,
        schema=df_schema,
        partition_by="my_id",
        mode="append",
        max_rows_per_file=self.max_rows_per_file,
        max_rows_per_group=self.max_rows_per_file,
        min_rows_per_group=int(self.max_rows_per_file / 2),
    )
```
The input is a generator because I read the data from a Postgres database in batches. I save similar data into two different tables, and I also write a SUCCESS file for each uploaded partition.
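For context, here is roughly how the batches and the SUCCESS markers are produced (a simplified sketch: the query, schema, and object paths are placeholders, and it assumes psycopg2 and google-cloud-storage):

```python
from typing import Any, Generator

import psycopg2
import psycopg2.extras
import pyarrow as pa
from google.cloud import storage

df_schema = pa.schema([
    ("my_id", pa.int64()),
    ("value", pa.float64()),
])


def fetch_batches(conn, batch_size: int) -> Generator[pa.RecordBatch, Any, None]:
    # Named (server-side) cursor so rows are streamed from Postgres
    # instead of being loaded into memory all at once.
    with conn.cursor(name="batch_cursor",
                     cursor_factory=psycopg2.extras.RealDictCursor) as cur:
        cur.itersize = batch_size
        cur.execute("SELECT my_id, value FROM my_table")
        while True:
            rows = cur.fetchmany(batch_size)
            if not rows:
                break
            yield pa.RecordBatch.from_pylist(rows, schema=df_schema)


def write_success_marker(bucket_name: str, partition_id: int) -> None:
    # Zero-byte SUCCESS marker object per uploaded partition.
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(f"my_id={partition_id}/SUCCESS")
    blob.upload_from_string(b"")
```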
I have around 25,000 partitions, most of which contain only a single parquet file. In total I've inserted around 700,000,000 rows. This incurred the following costs:
- Class A operations: 127,000
- Class B operations: 109,856,507
- Download Worldwide Destinations: 300 GiB
The number of Class A operations makes sense to me: each partition gets 2 parquet writes plus an additional SUCCESS file, i.e. roughly 25,000 partitions x 3 = 75,000 inserts, and since some partitions have more than one file, the actual number comes out somewhat higher.
What I can't figure out is where so many Class B operations and so much Download Worldwide Destinations traffic come from. Is this to be expected, or could it be a bug?
Can you provide any insights into why the costs are so high and how I would need to change the code to decrease them?
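For reference, this is roughly how I checked the partition and file counts quoted above (a sketch using the deltalake Python API; the bucket name is a placeholder):

```python
from collections import Counter

from deltalake import DeltaTable

dt = DeltaTable("gs://<my-bucket-name>")
files = dt.files()  # active parquet files tracked by the Delta log
# File paths look like "my_id=<value>/part-....parquet", so the first
# path segment identifies the partition.
per_partition = Counter(path.split("/")[0] for path in files)
print("partitions:", len(per_partition))
print("partitions with a single file:",
      sum(1 for n in per_partition.values() if n == 1))
print("commits so far:", dt.version() + 1)  # version is zero-based
```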
What you expected to happen:
Much lower costs for Class B operations on GCS.