Unexpected high costs on Google Cloud Storage #2085

Closed
@gregorp90

Description

Environment

Delta-rs version: 0.10.2

Environment:

  • Cloud provider: Google Cloud Storage
  • Other: Python 3.10

Bug

What happened:
I'm not sure whether this is a bug, but an answer on Stack Overflow recommended that I post the issue here: https://stackoverflow.com/questions/77639348/delta-rs-package-incurs-high-costs-on-gcs/77681169#77681169.

I'm using the package to store files in a Google Cloud Storage dual-region bucket. I use the following code to store the data:

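from typing import Any, Generator

import pyarrow as pa
from deltalake import write_deltalake

def save_data(self, df: Generator[pa.RecordBatch, Any, None]):
    # Append the streamed record batches to the Delta table, partitioned by my_id.
    write_deltalake(
        "gs://<my-bucket-name>",
        df,
        schema=df_schema,  # pyarrow schema, defined elsewhere
        partition_by="my_id",
        mode="append",
        max_rows_per_file=self.max_rows_per_file,
        max_rows_per_group=self.max_rows_per_file,
        min_rows_per_group=self.max_rows_per_file // 2,
    )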

The input data is a generator since I'm taking the data from a Postgres database in batches. I am saving similar data into two different tables and I'm also saving a SUCCESS file for each uploaded partition.

I have around 25,000 partitions and most of them only have a single parquet file in them. The total number of rows that I've inserted is around 700,000,000. This incurred the following costs:

  • Class A operations: 127,000
  • Class B operations: 109,856,507
  • Download Worldwide Destinations: 300 GiB

The number of Class A operations makes sense to me: 2 writes per partition (one per table) plus an additional SUCCESS file, all of which are inserts. Some partitions probably have more than 1 file, so the number is a bit higher than 25,000 (number of partitions) x 3.
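As a rough sanity check on these figures, here is a minimal sketch of the arithmetic. The variable names are illustrative, and the model of "two table writes plus one SUCCESS file per partition" is only an assumption based on the description above, not a statement about how delta-rs issues requests.

```python
# Figures reported in this issue.
partitions = 25_000
class_a_observed = 127_000
class_b_observed = 109_856_507

# Assumed write model: one write per table (2 tables) plus one SUCCESS file.
writes_per_partition = 2 + 1

# Expected lower bound on Class A (write) operations under that model.
expected_class_a = partitions * writes_per_partition

# How many Class B (read/list-style) operations occur per partition
# and per Class A operation, on average.
class_b_per_partition = class_b_observed / partitions
class_b_per_class_a = class_b_observed / class_a_observed

print(f"Expected Class A operations: {expected_class_a:,}")
print(f"Class B operations per partition: {class_b_per_partition:,.0f}")
print(f"Class B operations per Class A operation: {class_b_per_class_a:,.0f}")
```

Under this model the Class A count is within a factor of two of the estimate, while Class B works out to several thousand operations per partition, which is the part I cannot explain.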

I can't figure out where so many Class B operations and so much Download Worldwide Destinations traffic come from. Is this to be expected, or could it be a bug?

Can you provide any insights into why the costs are so high and how I would need to change the code to decrease them?

What you expected to happen:
Much lower costs for Class B operations on GCS.

Labels: bug (Something isn't working)
