[Databricks destination] Support for table optimization techniques in Databricks: partitioning, liquid clustering, data-skipping #2863

@nicor88

Feature description

Databricks supports several techniques for optimizing data retrieval, reducing query latency, and making queries more efficient in general.
Some of the most common optimization techniques are:

  1. Partitioning: this means using PARTITIONED BY (col_1, col_2) in the table DDL statement. A concept similar to what is done for partitions in Athena can be used.
  2. Liquid Clustering: the modern way to handle partitioning and Z-ordering in Databricks. It means passing CLUSTER BY (col_1, col_2) in the table DDL, which can be achieved in a similar way to partitioning. Automatic clustering is also possible via CLUSTER BY AUTO, which is really relevant when we want to let Databricks do the heavy lifting for us.
  3. Data skipping: more can be found here. It can be defined as a table property; therefore, to implement data skipping, the destination must be extended to support setting arbitrary table properties.
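As a rough illustration of the three techniques above, the sketch below renders the corresponding DDL clauses from plain Python values. It is not dlt's implementation: `render_table_clauses`, `quote_ident`, and their parameters are hypothetical names introduced only for this example. It assumes that partitioning and liquid clustering are not combined on the same table, and it uses `delta.dataSkippingNumIndexedCols` as one example of a data-skipping table property.

```python
from typing import Mapping, Optional, Sequence, Union


def quote_ident(name: str) -> str:
    """Quote an identifier with backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def render_table_clauses(
    partition_by: Sequence[str] = (),
    cluster_by: Union[Sequence[str], str] = (),
    table_properties: Optional[Mapping[str, str]] = None,
) -> str:
    """Render the clauses that follow the column list of a CREATE TABLE statement.

    Hypothetical helper for illustration only. `cluster_by` is either a list of
    column names or the string "AUTO" for automatic liquid clustering.
    """
    # Assumption: a table uses either partitioning or liquid clustering, not both.
    if partition_by and cluster_by:
        raise ValueError("partition_by and cluster_by cannot be combined")

    clauses = []
    if partition_by:
        cols = ", ".join(quote_ident(c) for c in partition_by)
        clauses.append(f"PARTITIONED BY ({cols})")

    if isinstance(cluster_by, str):
        # A bare string is only accepted as the AUTO keyword, never as a column.
        if cluster_by.upper() != "AUTO":
            raise ValueError("cluster_by must be a list of columns or 'AUTO'")
        clauses.append("CLUSTER BY AUTO")
    elif cluster_by:
        cols = ", ".join(quote_ident(c) for c in cluster_by)
        clauses.append(f"CLUSTER BY ({cols})")

    if table_properties:
        # Keys and values are assumed not to contain single quotes.
        props = ", ".join(f"'{k}' = '{v}'" for k, v in table_properties.items())
        clauses.append(f"TBLPROPERTIES ({props})")

    return "\n".join(clauses)


if __name__ == "__main__":
    print(
        render_table_clauses(
            cluster_by="AUTO",
            table_properties={"delta.dataSkippingNumIndexedCols": "8"},
        )
    )
```

Keeping table properties as a generic mapping means data skipping needs no dedicated option: any property can be passed through unchanged.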

In general, this feature request would give dltHub more control over how Delta tables are written.

Are you a dlt user?

Yes, I'm already a dlt user. I ran multiple PoCs in the past, and I'm currently pushing for dltHub adoption at my new workplace.

Use case

Optimize access pattern for tables in the raw layer, making queries faster and reducing cost.

Proposed solution

For each optimization technique, I have already provided some hints above on how the feature can be implemented.
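For tables that already exist, one possible direction is to emit ALTER TABLE statements instead of changing the CREATE TABLE DDL. The sketch below is only an illustration under that assumption: `render_alter_statements` and its parameters are hypothetical and not part of dlt, the table name is assumed to be already qualified and quoted, and partitioning is left out because this sketch treats it as something set at table creation time.

```python
from typing import List, Mapping, Optional, Sequence, Union


def render_alter_statements(
    table: str,
    cluster_by: Union[Sequence[str], str, None] = None,
    table_properties: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Build ALTER TABLE statements for clustering and table properties.

    Hypothetical helper for illustration only. `table` is assumed to be an
    already qualified and quoted table name; `cluster_by` is a list of
    column names or the string "AUTO".
    """
    statements = []

    if isinstance(cluster_by, str):
        # A bare string is only accepted as the AUTO keyword.
        if cluster_by.upper() != "AUTO":
            raise ValueError("cluster_by must be a list of columns or 'AUTO'")
        statements.append(f"ALTER TABLE {table} CLUSTER BY AUTO")
    elif cluster_by:
        cols = ", ".join(f"`{c}`" for c in cluster_by)
        statements.append(f"ALTER TABLE {table} CLUSTER BY ({cols})")

    if table_properties:
        # Keys and values are assumed not to contain single quotes.
        props = ", ".join(f"'{k}' = '{v}'" for k, v in table_properties.items())
        statements.append(f"ALTER TABLE {table} SET TBLPROPERTIES ({props})")

    return statements


if __name__ == "__main__":
    for stmt in render_alter_statements(
        "`raw`.`events`",
        cluster_by=["col_1", "col_2"],
        table_properties={"delta.dataSkippingNumIndexedCols": "8"},
    ):
        print(stmt)
```

Returning a list of statements rather than one string lets the caller execute each statement separately and decide how to handle a failure in any one of them.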

Related issues

Issue #2674 is related.

Metadata

Labels

help wanted

Status

Done