Users encounter challenges related to accessing and managing dataset filepaths. The absence of a mandatory filepath attribute in AbstractDataset and the lack of a standard API for accessing metadata hinder users' ability to reliably access dataset filepaths and understand which dataset version was loaded. Additionally, inconsistencies between APIs across different dataset types further complicate the process, requiring users to implement custom logic to handle dataset access and metadata retrieval.
We propose:
1. Explore the feasibility of implementing a file-backed AbstractDataset and making the filepath attribute mandatory, so that users have a consistent and reliable way to access dataset filepaths.
2. Develop a standard API for accessing metadata across different dataset types, and decide what the standard metadata should include for each dataset type.
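To make the two proposals concrete, here is a minimal sketch of what a file-backed base class with a mandatory filepath and a common metadata accessor could look like. Everything in it is hypothetical: `FileBackedDataset`, `ExampleCSVDataset`, the `metadata()` method and its keys are illustrative names, not part of Kedro, and the classes are plain stand-ins so the snippet runs without Kedro installed.

```python
from abc import ABC, abstractmethod
from pathlib import PurePosixPath
from typing import Any, Dict, Optional


class FileBackedDataset(ABC):
    """Hypothetical base class for file-backed datasets (not a Kedro class)."""

    @property
    @abstractmethod
    def filepath(self) -> PurePosixPath:
        """Public, mandatory path that the dataset reads from and writes to."""

    def metadata(self) -> Dict[str, Any]:
        """Common metadata; which keys belong here is still an open question."""
        return {"type": type(self).__name__, "filepath": str(self.filepath)}


class ExampleCSVDataset(FileBackedDataset):
    """Illustrative subclass; a real dataset would also implement load/save."""

    def __init__(self, filepath: str, version: Optional[str] = None) -> None:
        self._filepath = PurePosixPath(filepath)
        self._version = version

    @property
    def filepath(self) -> PurePosixPath:
        return self._filepath

    def metadata(self) -> Dict[str, Any]:
        # Extend the common metadata with the version that was requested.
        return {**super().metadata(), "version": self._version}
```

With a contract like this, a subclass that does not define `filepath` cannot be instantiated, so wrappers and plugins could rely on `dataset.filepath` and `dataset.metadata()` without checking the concrete dataset type.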
The APIs of AbstractVersionedDataset and AbstractDataset are inconsistent, since only AbstractVersionedDataset has a filepath attribute: "It's kind of weird that when I switch from AbstractDataset to the AbstractVersionedDataset, suddenly the file path appears at that point. Like that feels quite weird to me that doesn't feel right."
Users have to take the dataset type into account to get the filepath, as the MlflowArtifactDataset wrapper in kedro-mlflow currently does.
It is hard to get the filepath and to understand which dataset version was loaded: "It's crazy confusing to actually get the correct file path"
Users want a standardised API for accessing different kinds of datasets, for example file-backed ones. They want to rely on that API when using DataCatalog and datasets, but nothing currently requires datasets to follow one: "We have MlflowArtifactDataset which is a wrapper for any AbstractDataset which logs the dataset automatically in mlflow as an artifact when its save method is called. The lack of a formal AbstractDataset API for file paths leads to inconsistencies, relying heavily on the convention that file paths are included as a hidden property _file_path in the dataset’s implementation. Formalizing this attribute as a public property would enhance reliability and convenience across Kedro’s framework. Otherwise, the potential difficulties in maintaining this might arise with community-maintained or experimental datasets, as it would be super hard to enforce that"
Users find it challenging to access critical metadata such as file paths directly through the public API, which often requires delving into less transparent, potentially private API elements. This adds complexity to what could otherwise be straightforward data management tasks.
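The kind of custom logic described in the points above can be sketched roughly as follows. This is an illustration of the workaround pattern, not code taken from Kedro or kedro-mlflow: `resolve_filepath`, the stand-in classes, and the candidate attribute names other than `_file_path` (which is mentioned in the quote above) are assumptions made for the example.

```python
from typing import Any, Optional

# Attribute names a wrapper might have to probe. Only "_file_path" comes from
# the discussion above; the others are assumed conventions for illustration.
CANDIDATE_ATTRS = ("filepath", "_filepath", "_file_path", "_path")


def resolve_filepath(dataset: Any) -> Optional[str]:
    """Best-effort lookup of a dataset's path via public or private attributes."""
    for name in CANDIDATE_ATTRS:
        value = getattr(dataset, name, None)
        if value is not None:
            return str(value)
    # No known attribute found, e.g. an in-memory dataset with no file behind it.
    return None


class StandInPrivatePathDataset:
    """Stand-in for a dataset that keeps its path in a private attribute."""

    def __init__(self, path: str) -> None:
        self._file_path = path


class StandInMemoryDataset:
    """Stand-in for a dataset that is not backed by a file at all."""

    def __init__(self, data: Any) -> None:
        self._data = data
```

The lookup silently depends on naming conventions that no interface enforces, so a community-maintained dataset that stores its path under a different name would simply return `None` here. A public, mandatory filepath property would remove the need for this probing.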
Relates to #1936

Links referenced in the context:
Type-dependent filepath lookup in kedro-mlflow: https://github.com/Galileo-Galilei/kedro-mlflow/blob/64b8e94e1dafa02d979e7753dab9b9dfd4d7341c/kedro_mlflow/io/artifacts/mlflow_artifact_dataset.py#L48
MlflowArtifactDataset documentation: https://kedro-mlflow.readthedocs.io/en/stable/source/07_python_objects/01_DataSets.html