Description
Feature description
Databricks supports different techniques to optimize data retrieval, reduce query latency, and make queries more efficient in general.
Some of the most common optimization techniques are:
- Partitioning: this means using `PARTITIONED BY (col_1, col_2)` in the table DDL statement. An approach similar to what is already done in the Athena destination with its partition concept can be used.
- Liquid Clustering: the modern way to handle partitioning and Z-ordering in Databricks. It means passing `CLUSTER BY (col_1, col_2)` in the DDL and can be implemented in a similar way to partitioning. Automatic clustering is also possible via `CLUSTER BY AUTO`, which is really relevant when we want to let Databricks do the heavy lifting for us.
- Data skipping: more can be found here. It can be defined as a table property; therefore, to implement data skipping, the destination must be extended to support setting arbitrary table properties.
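To make the three techniques concrete, here is a minimal sketch of the DDL the destination would need to emit. It is not dlt's API: the helper `build_create_table_sql`, its parameters, and the table and column names are hypothetical and exist only for illustration. The `delta.dataSkippingNumIndexedCols` property is used as an example of a data-skipping table property. The sketch assumes that partitioning and liquid clustering are mutually exclusive on the same table.

```python
from typing import Mapping, Optional, Sequence


def build_create_table_sql(
    table: str,
    columns: Mapping[str, str],
    partition_by: Sequence[str] = (),
    cluster_by: Sequence[str] = (),
    cluster_by_auto: bool = False,
    table_properties: Optional[Mapping[str, str]] = None,
) -> str:
    """Render a Databricks CREATE TABLE statement with optional layout hints.

    Hypothetical helper for illustration only; it does no identifier quoting.
    """
    # Assumption: a table uses either partitioning or liquid clustering, not both.
    if partition_by and (cluster_by or cluster_by_auto):
        raise ValueError("partitioning and liquid clustering cannot be combined")
    if cluster_by and cluster_by_auto:
        raise ValueError("use explicit clustering columns or CLUSTER BY AUTO, not both")

    column_defs = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    parts = [f"CREATE TABLE {table} ({column_defs}) USING DELTA"]
    if partition_by:
        parts.append(f"PARTITIONED BY ({', '.join(partition_by)})")
    if cluster_by:
        parts.append(f"CLUSTER BY ({', '.join(cluster_by)})")
    if cluster_by_auto:
        parts.append("CLUSTER BY AUTO")
    if table_properties:
        # Table properties are rendered as quoted key/value pairs.
        props = ", ".join(f"'{k}' = '{v}'" for k, v in table_properties.items())
        parts.append(f"TBLPROPERTIES ({props})")
    return "\n".join(parts)


if __name__ == "__main__":
    cols = {"col_1": "STRING", "col_2": "DATE", "value": "DOUBLE"}
    # Classic partitioning
    print(build_create_table_sql("raw.events", cols, partition_by=["col_1", "col_2"]))
    # Liquid clustering on explicit columns, plus a data-skipping table property
    print(
        build_create_table_sql(
            "raw.events",
            cols,
            cluster_by=["col_1", "col_2"],
            table_properties={"delta.dataSkippingNumIndexedCols": "2"},
        )
    )
    # Let Databricks choose the clustering keys
    print(build_create_table_sql("raw.events", cols, cluster_by_auto=True))
```

In dlt, the same hints would more naturally be attached to a resource and translated by the destination. The point here is only that all three options reduce to optional clauses appended to the `CREATE TABLE` statement.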
In general, this feature request would give dltHub more control over how Delta tables are written.
Are you a dlt user?
Yes, I'm already a dlt user. I have run multiple PoCs in the past, and I'm currently pushing for dltHub adoption at my new workplace.
Use case
Optimize access patterns for tables in the raw layer, making queries faster and reducing cost.
Proposed solution
For each optimization technique, I have already provided some hints in the feature description on how it could be implemented.
Related issues
Issue #2674 is related.