Access dataset filepath via public API for file-backed datasets #3929

ElenaKhaustova · 2024-06-05T15:36:54Z

Description

Users encounter challenges related to accessing and managing dataset filepaths. The absence of a mandatory filepath attribute in AbstractDataset and the lack of a standard API for accessing metadata hinder users' ability to reliably access dataset filepaths and understand which dataset version was loaded. Additionally, inconsistencies between APIs across different dataset types further complicate the process, requiring users to implement custom logic to handle dataset access and metadata retrieval.

We propose:

Explore the feasibility of implementing a file-backed AbstractDataset and making the filepath attribute mandatory, so that users have a consistent and reliable way to access dataset filepaths.
Develop a standard API for accessing metadata across different dataset types, and decide what the standard metadata should include for each dataset type.
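
To make the two proposals concrete, here is a minimal sketch of what a file-backed base class with a mandatory public filepath and a common metadata accessor could look like. Everything in it is hypothetical: `FileBackedDataset`, `metadata()` and `CSVStandIn` are illustrative names, not existing Kedro classes or methods, and the metadata fields are placeholders because the standard set is still to be decided.

```python
from abc import ABC, abstractmethod
from pathlib import PurePosixPath
from typing import Any, Dict


class FileBackedDataset(ABC):
    """Hypothetical base class: every subclass must expose a public filepath."""

    @property
    @abstractmethod
    def filepath(self) -> PurePosixPath:
        """Location the dataset reads from and writes to."""

    def metadata(self) -> Dict[str, Any]:
        """Hypothetical standard metadata; the real field set is undecided."""
        return {"type": type(self).__name__, "filepath": str(self.filepath)}


class CSVStandIn(FileBackedDataset):
    """Illustrative subclass standing in for a concrete file-backed dataset."""

    def __init__(self, filepath: str) -> None:
        self._filepath = PurePosixPath(filepath)

    @property
    def filepath(self) -> PurePosixPath:
        return self._filepath


dataset = CSVStandIn("data/01_raw/cars.csv")
print(dataset.filepath)
print(dataset.metadata())
```

Declaring `filepath` as an abstract property means a subclass that forgets to implement it cannot be instantiated, which is one way the attribute could be made mandatory rather than a convention.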

Relates to #1936

Context

The APIs of AbstractVersionedDataset and AbstractDataset are inconsistent: only the former has a filepath attribute. "It's kind of weird that when I switch from AbstractDataset to the AbstractVersionedDataset, suddenly the file path appears at that point. Like that feels quite weird to me that doesn't feel right."
Users have to take into account the dataset type to be able to get the filepath:

https://github.com/Galileo-Galilei/kedro-mlflow/blob/64b8e94e1dafa02d979e7753dab9b9dfd4d7341c/kedro_mlflow/io/artifacts/mlflow_artifact_dataset.py#L48
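
The kind of type-dependent lookup this forces on users can be sketched roughly as follows. This is an illustration, not the linked project's code: the private names `_version`, `_get_save_path`, `_filepath` and `_path` are assumed conventions rather than a public contract, and the three stand-in classes are hypothetical so that the example runs without Kedro installed.

```python
from pathlib import PurePosixPath
from typing import Any, Optional


def resolve_local_path(dataset: Any) -> Optional[PurePosixPath]:
    """Best-effort path lookup that has to probe private attributes."""
    if hasattr(dataset, "_version"):
        # Versioned dataset: the effective path depends on the resolved version.
        return PurePosixPath(dataset._get_save_path())
    if hasattr(dataset, "_filepath"):
        # Unversioned dataset that follows the private-attribute convention.
        return PurePosixPath(dataset._filepath)
    if hasattr(dataset, "_path"):
        # Folder-based dataset that points at a directory, not a single file.
        return PurePosixPath(dataset._path)
    # No known convention applies, so the caller cannot find the file.
    return None


class VersionedStandIn:
    """Hypothetical versioned dataset (assumed layout: path/version/filename)."""

    _version = "2024-06-05T15.36.54.000Z"
    _filepath = PurePosixPath("data/01_raw/cars.csv")

    def _get_save_path(self) -> PurePosixPath:
        return self._filepath / self._version / self._filepath.name


class PlainStandIn:
    """Hypothetical unversioned dataset."""

    _filepath = "data/01_raw/cars.csv"


class FolderStandIn:
    """Hypothetical folder-based dataset."""

    _path = "data/02_partitions"


for ds in (VersionedStandIn(), PlainStandIn(), FolderStandIn(), object()):
    print(type(ds).__name__, resolve_local_path(ds))
```

Each branch relies on an attribute that a dataset author is free to rename or omit, which is why a lookup like this silently returns nothing for datasets that do not follow the convention.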

It is hard to get the filepath and to understand which dataset version was loaded: "It's crazy confusing to actually get the correct file path"
Users want a standardised API for accessing each category of dataset, for example file-backed ones. They want to rely on an API contract when using DataCatalog and datasets, but no such contract is currently enforced: "We have MlflowArtifactDataset which is a wrapper for any AbstractDataset which logs the dataset automatically in mlflow as an artifact when its save method is called. The lack of a formal AbstractDataset API for file paths leads to inconsistencies, relying heavily on the convention that file paths are included as a hidden property _file_path in the dataset’s implementation. Formalizing this attribute as a public property would enhance reliability and convenience across Kedro’s framework. Otherwise, the potential difficulties in maintaining this might arise with community-maintained or experimental datasets, as it would be super hard to enforce that"

https://kedro-mlflow.readthedocs.io/en/stable/source/07_python_objects/01_DataSets.html

Users find it challenging to access critical metadata such as file paths directly through the public API, which often requires delving into less transparent, potentially private API elements. This adds complexity to what could otherwise be straightforward data management tasks.

ElenaKhaustova added the Issue: Feature Request New feature or improvement to existing feature label Jun 5, 2024

ElenaKhaustova mentioned this issue Jun 6, 2024

Research summary of insights for redesigning Kedro's data catalog API #3934

Open

merelcht mentioned this issue Jun 10, 2024

[DataCatalog]: Interface to request specific properties from catalog and datasets #3939

Open

github-actions bot mentioned this issue Jul 1, 2024

Monthly issue metrics report #3975

Open

datajoely mentioned this issue Jul 5, 2024

Typically expect catalog entries to have unique filepaths, protecting against overwrite #3993

Open

ElenaKhaustova mentioned this issue Aug 13, 2024

Design DataCatalog2.0 #3995

Open

3 tasks
