[spike] Clarify status of various Delta Table datasets #542
Comments
Hi! I took the opportunity to also add the following:

```python
from typing import Any, List, Union

import pandas as pd
import pyarrow as pa
from deltalake import DeltaTable
from deltalake.writer import write_deltalake
from kedro.io.core import DatasetError
from kedro_datasets.pandas import DeltaTableDataset


# from https://github.com/inigohidalgo/prefect-polygon-etl/blob/main/delta-rs-etl/src/delta_rs_etl/upsert.py
def upsert(new_data: pa.Table, target_table: DeltaTable, primary_key: Union[str, List[str]]) -> dict:
    predicate = (
        f"target.{primary_key} = source.{primary_key}"
        if isinstance(primary_key, str)
        else " AND ".join([f"target.{col} = source.{col}" for col in primary_key])
    )
    return (
        target_table
        .merge(
            source=new_data,
            predicate=predicate,
            source_alias="source",
            target_alias="target",
        )
        .when_matched_update_all()
        .when_not_matched_insert_all()
        .execute()
    )
class CustomDeltaTableDataset(DeltaTableDataset):
    """Variation of pandas.DeltaTableDataset with support for an upsert write mode."""

    def __init__(self, primary_key: Union[str, List[str], None] = None, **kwargs) -> None:
        self.primary_key = primary_key
        if kwargs.get('save_args', {}).get('mode', '') == 'upsert':
            self.upsert_mode = True
            kwargs['save_args']['mode'] = 'overwrite'
            if not self.primary_key:
                raise DatasetError(
                    "To use the upsert write mode, you need to set the primary_key argument!"
                )
        else:
            self.upsert_mode = False
        super().__init__(**kwargs)

    def _save(self, data: pd.DataFrame) -> None:
        data = pa.Table.from_pandas(data, preserve_index=False)
        if self.is_empty_dir:
            # first time creation of delta table
            write_deltalake(
                self._filepath,
                data,
                storage_options=self.fs_args,
                **self._save_args,
            )
            self.is_empty_dir = False
            self._delta_table = DeltaTable(
                table_uri=self._filepath,
                storage_options=self.fs_args,
                version=self._version,
            )
        elif self.upsert_mode:
            upsert(
                new_data=data,
                target_table=self._delta_table,
                primary_key=self.primary_key,
            )
        else:
            write_deltalake(
                self._delta_table,
                data,
                storage_options=self.fs_args,
                **self._save_args,
            )

    def _describe(self) -> dict[str, Any]:
        desc = super()._describe()
        desc['primary_key'] = self.primary_key
        if self.upsert_mode:
            desc['save_args']['mode'] = 'upsert'
        return desc
```
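A minimal usage sketch of the class above; the filepath, primary key column, and data are made up for illustration:

```python
# Hypothetical usage; path and column names are made up for illustration.
ds = CustomDeltaTableDataset(
    filepath="data/02_intermediate/orders_delta",
    primary_key="order_id",
    save_args={"mode": "upsert"},
)

# The first save creates the Delta table; subsequent saves merge on the primary key.
ds.save(pd.DataFrame({"order_id": [1, 2], "amount": [10.0, 20.0]}))
ds.save(pd.DataFrame({"order_id": [2, 3], "amount": [25.0, 30.0]}))
print(ds.load())  # order_id 2 is updated, order_id 3 is inserted
```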
To add on top of that, just realized that months ago I had created my own Polars dataset. Mostly uninteresting, except for:

```python
# HACK: If the table is empty, return an empty DataFrame
# (needs `import polars as pl`; TableNotFoundError comes from the deltalake package)
try:
    return pl.read_delta(
        load_path, storage_options=self._storage_options, **self._load_args
    )
except TableNotFoundError:
    return pl.DataFrame()
```

(related: kedro-org/kedro#3578)
Description
We have several Delta Table datasets, and it's hard to understand the differences between them and to choose among them. A spike should be done to clarify what each of these Delta Table datasets does and how we could potentially merge them or make them more consistent.
Context
- `spark.DeltaTableDataset` was introduced in [KED-2891] Implement `spark.DeltaTable` dataset kedro#964 (2021). Understanding that PR is not for the faint of heart: halfway through the implementation it was determined to not add a `_save` method, and instead users are supposed to perform write operations inside the node themselves (related: kedro-org/kedro#3578), introducing an undocumented exception to the Kedro philosophy. A sketch of that in-node write pattern is shown after this list.
- `databricks.ManagedTableDataset` was introduced in feat: Add ManagedTableDataset for managed Delta Lake tables in Databricks #127 (2023). That one does have a `_save` method: "When saving data, you can specify one of three modes: overwrite (default), append, or upsert." It supports Databricks Unity Catalog ("the name of the catalog in Unity"). In the words of @dannyrfar, "ManagedTable was made for Databricks while deltatable was made for general Delta implementations (not specific to databricks), not sure why save wasn’t implemented on delta table but would probably have to do with save location and additional permissions for different storage locations (s3, gcs, adls, etc)" https://linen-slack.kedro.org/t/16366189/tldr-is-there-a-suggested-pattern-for-converting-vanilla-par#d1d2bc1e-42f2-41f2-af15-d1b4ef50268b
- `pandas.DeltaTableDataset` was introduced in feat(datasets): Added `DeltaTableDataSet` #243 (2023). Interestingly, it does have `_save` capabilities and supports Databricks Unity Catalog as well as the AWS Glue catalog, but writing capabilities are more restricted: "When saving data, you can specify one of two modes: overwrite (default), append." So it begs the question of why `pandas.DeltaTableDataset` can `_save`, but `spark.DeltaTableDataset` can't.

This has been an ongoing point of confusion, and the final straw was this Slack thread: https://linen-slack.kedro.org/t/16366189/tldr-is-there-a-suggested-pattern-for-converting-vanilla-par#4c007d9d-68c7-46e3-bf86-b4b86755f7ca
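For illustration, a minimal sketch of that in-node write pattern, assuming the delta-spark package and a made-up table path, node, and join column (none of this is prescribed by `spark.DeltaTableDataset` itself):

```python
from delta.tables import DeltaTable
from pyspark.sql import DataFrame, SparkSession


def upsert_orders(new_orders: DataFrame) -> None:
    # Hypothetical Kedro node: since spark.DeltaTableDataset has no _save,
    # the write (here a merge/upsert) happens inside the node body.
    spark = SparkSession.builder.getOrCreate()
    target = DeltaTable.forPath(spark, "s3://my-bucket/orders_delta")  # made-up path
    (
        target.alias("target")
        .merge(new_orders.alias("source"), "target.order_id = source.order_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )
```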
The expected outcomes here are unclear, but for starters I hope at least folks agree that the current situation is a bit of a mess. Possible actions are:

- Add `_save` capabilities to `spark.DeltaTableDataset`. Should that happen before or after [spike] Investigate suitability of Kedro for EL pipelines and incremental data loading kedro#3578? Good question.
- Clarify why `pandas.DeltaTableDataset` supports AWS Glue and Databricks Unity Catalog but `spark.DeltaTableDataset` doesn't, if both are generic, and why `databricks.ManagedTableDataset` has 3 write modes but `pandas.DeltaTableDataset` has 2.