Z-Order with larger dataset resulting in memory error #2284

Closed as not planned
@pyjads

Description

Environment

Windows (8 GB RAM)

Delta-rs version: 0.16.0


Bug

What happened:

from datetime import timedelta

from deltalake import DeltaTable

# Placeholder path; the original snippet assumes `dt` is already constructed.
dt = DeltaTable("path/to/table")

# Commit at most once per minute while optimizing.
delta = timedelta(seconds=60)

dt.optimize.z_order(
    ["user_id", "product"],
    max_spill_size=4194304000,   # roughly 4 GB spill allowance, in bytes
    min_commit_interval=delta,
    max_concurrent_tasks=1,
)

I am trying to run Z-order on a partitioned table. There are 65 partitions, each containing roughly 900 MB of data spread across about 16 Parquet files of roughly 55 MB each. It fails with the following error:

DeltaError: Failed to parse parquet: Parquet error: Z-order failed while scanning data: ResourcesExhausted("Failed to allocate additional 403718240 bytes for ExternalSorter[2] with 0 bytes already allocated - maximum available is 381425355").

I am new to deltalake and don't know much about how z_order works. Is this caused by the amount of data? I am running it on my local laptop with limited resources.
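For reference, the error comes from DataFusion (delta-rs's query engine) exhausting its memory pool while sorting, so one way to keep the working set small is to z-order one partition at a time. A minimal sketch, assuming a hypothetical partition column named "region" and that this delta-rs version accepts the partition_filters argument on z_order:

from deltalake import DeltaTable

dt = DeltaTable("path/to/table")  # placeholder path

# Optimize each partition separately so only ~900 MB needs to be sorted per call.
for region in ["EU", "US", "APAC"]:  # hypothetical partition values
    dt.optimize.z_order(
        ["user_id", "product"],
        partition_filters=[("region", "=", region)],
        max_concurrent_tasks=1,
    )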


Metadata


Labels

bug (Something isn't working)
