[DataCatalog]: Spike - Catalog serialization and deserialization support #3932

ElenaKhaustova · 2024-06-05T23:43:57Z

Description

Users report a lack of persistence in the add workflow: there is no built-in way to save a modified catalog.
Users ask for an API to save and load catalogs after compilation or modification by converting them to YAML and back.
Users have trouble loading pickled DataCatalog objects when the Kedro version at load time differs from the one used for pickling. They need a way to serialize and deserialize the DataCatalog object that does not depend on the Kedro version.

We propose to explore the feasibility of implementing to_yaml() and from_yaml() methods for the DataCatalog object to facilitate serialization and deserialization without dependency on Kedro versions.

Context

User feedback:

The add workflow lacks persistence, so a modified catalog cannot be saved: "You have a catalog and then you start adding extra stuff to it, currently we just throw away those added things when they close a notebook."
A catalog-to-YAML function is needed to save a modified catalog: "People have always asked for it. Could I have a catalog to YAML function so that you could actually spit out the YAML files that are needed to do this again later on?"
Competitors let users compile the catalog and inspect the result: "I would point to the DPC compile workflow. And actually, if you do DBT run it does DBT compile first and then runs the compiled outputs. Whereas in Kedro, you have your very concise complicated YAML and it will all that compilation happens at run time and there's no way for the user to see it."
When pickling a DataCatalog object, users have difficulty loading it back if the Kedro version is different: "Serialization is an issue because I often pickle a catalog (mostly as part of a mlflow model). Pickling the catalog is really something that leads to a lot of problems because if I don't have the exact same Kedro version when I want to load the catalog, if the object has any change inside - private method or attribute it will lead to error."

https://github.com/Galileo-Galilei/kedro-mlflow/blob/64b8e94e1dafa02d979e7753dab9b9dfd4d7341c/kedro_mlflow/mlflow/kedro_pipeline_model.py#L143

# pseudo code
import pickle
pickled = pickle.dumps(catalog)
pickle.loads(pickled)  # this will fail if I reload with a newer kedro version and any attribute (even private) has changed. This breaks much more often than we should expect.

"It would be much more robust to be able to do this":

# pseudo code
catalog.serialize("path/catalog.yml")  # name TBD: serialize? to_config? to_yaml? to_json? to_dict?
catalog = DataCatalog.deserialize("path/catalog.yml")  # much more robust since it is not stored as python object -> maybe catalog.from_config?
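A minimal sketch of the idea in the pseudo code above, written against toy classes rather than Kedro itself. `ToyDataset`, `ToyCatalog`, `save_catalog` and `load_catalog` are hypothetical names introduced only for illustration, and JSON from the standard library stands in for YAML so the example has no dependencies. The point is that only constructor arguments are written to disk, so reloading does not depend on the internal attributes of the objects.

```python
import json
import tempfile
from pathlib import Path


class ToyDataset:
    """Hypothetical stand-in for a dataset; not a Kedro class."""

    def __init__(self, filepath: str, sep: str = ","):
        self.filepath = filepath
        self.sep = sep

    def to_config(self) -> dict:
        # Store only the constructor arguments, never internal state.
        return {"type": "ToyDataset", "filepath": self.filepath, "sep": self.sep}


class ToyCatalog:
    """Hypothetical stand-in for a catalog; not Kedro's DataCatalog."""

    def __init__(self, datasets: dict):
        self.datasets = datasets

    def to_config(self) -> dict:
        return {name: ds.to_config() for name, ds in self.datasets.items()}

    @classmethod
    def from_config(cls, config: dict) -> "ToyCatalog":
        datasets = {}
        for name, ds_config in config.items():
            # Everything except "type" is passed back to the constructor.
            kwargs = {k: v for k, v in ds_config.items() if k != "type"}
            datasets[name] = ToyDataset(**kwargs)
        return cls(datasets)


def save_catalog(catalog: ToyCatalog, path: Path) -> None:
    # Write the plain-dict configuration as text instead of pickling objects.
    Path(path).write_text(json.dumps(catalog.to_config(), indent=2))


def load_catalog(path: Path) -> ToyCatalog:
    return ToyCatalog.from_config(json.loads(Path(path).read_text()))


catalog = ToyCatalog({"cars": ToyDataset("data/cars.csv", sep=";")})
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "catalog.json"
    save_catalog(catalog, path)
    restored = load_catalog(path)
```

Because the file holds only plain data, a later version of the classes can still rebuild the catalog as long as the constructor arguments stay compatible.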

Extra context: #3995 (comment)

astrojuanlu · 2024-06-06T07:32:29Z

Very similar to DataCatalog.from_file proposal discussed in #2967

datajoely · 2024-10-17T15:49:49Z

I like to_yaml() and from_yaml() personally.

It would be nice if we preserved comments and the way the user organised their files before. I appreciate this increases complexity - but it does match the mental model of how the user thinks about their project.
I'm working with Pydantic a lot at the moment, and I wonder if it makes sense to use it or at least take some inspiration from it.

ElenaKhaustova · 2024-11-13T15:49:34Z

From the user feedback, we can define three main pain points to address:

Compiling the catalog into a format that makes it easy to inspect, for example to check that all factories are resolved as expected
Saving/loading the catalog configuration only, without pickling
Saving/loading a modified catalog, including configuration and data

The first two pain points can be addressed by:

Implementing a catalog.to_config() method (since we already have catalog.from_config()) - [DataCatalog]: Catalog to config #4329
Implementing a method to save and load the catalog configuration obtained from catalog.to_config() - [DataCatalog]: Save/load catalog obtained from to_config #4330

The third one requires points 1 and 2 to be solved first, plus a solution for saving the data itself.

The plan for now is to address 1 and 2 first.
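To make the first pain point concrete, here is a rough sketch of what "compiling" a catalog could mean: expanding factory patterns into explicit entries so they can be inspected before a run. This is a toy illustration, not Kedro's resolution logic. `pattern_to_regex`, `fill` and `compile_catalog` are hypothetical helpers, and the sketch assumes each placeholder appears only once in a pattern key.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    # Turn "{name}_csv" into a regex with one named group per placeholder.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        # Odd positions hold placeholder names, even positions literal text.
        regex += f"(?P<{part}>.+?)" if i % 2 else re.escape(part)
    return re.compile(f"^{regex}$")


def fill(value, fields: dict):
    # Recursively substitute "{placeholder}" in strings inside the template.
    if isinstance(value, str):
        for key, val in fields.items():
            value = value.replace("{" + key + "}", val)
        return value
    if isinstance(value, dict):
        return {k: fill(v, fields) for k, v in value.items()}
    return value


def compile_catalog(config: dict, dataset_names: list) -> dict:
    # Explicit entries are kept as-is; pattern entries are expanded per name.
    resolved = {k: v for k, v in config.items() if "{" not in k}
    patterns = {k: v for k, v in config.items() if "{" in k}
    for name in dataset_names:
        if name in resolved:
            continue
        for pattern, template in patterns.items():
            match = pattern_to_regex(pattern).match(name)
            if match:
                resolved[name] = fill(template, match.groupdict())
                break
    return resolved


config = {
    "companies": {"type": "pandas.CSVDataset", "filepath": "data/companies.csv"},
    "{name}_csv": {"type": "pandas.CSVDataset", "filepath": "data/{name}.csv"},
}
compiled = compile_catalog(config, ["companies", "cars_csv", "unmatched"])
```

The resulting `compiled` dictionary contains only explicit entries, which is the kind of output a to_config() method could return and a save/load method could then write to a file.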

ElenaKhaustova added the Issue: Feature Request New feature or improvement to existing feature label Jun 5, 2024

ElenaKhaustova added this to the Redesign the API for IO (catalog) milestone Jun 5, 2024

ElenaKhaustova added this to Kedro Framework Jun 5, 2024

iamelijahko mentioned this issue Jun 6, 2024

Research summary of insights for redesigning Kedro's data catalog API #3934

Open

This was referenced Jun 6, 2024

[DataCatalog]: Pretty printing #3913

Closed

[DataCatalog]: Make catalog a standalone package #3941

Open

github-actions bot mentioned this issue Jul 1, 2024

Monthly issue metrics report #3975

Open

astrojuanlu mentioned this issue Aug 2, 2024

Design DataCatalog2.0 #3995

Open

3 tasks

merelcht changed the title ~~[DataCatalog]: Catalog serialization and deserialization support~~ [DataCatalog]: Spike - Catalog serialization and deserialization support Oct 21, 2024

Galileo-Galilei mentioned this issue Nov 2, 2024

[DataCatalog]: Lazy dataset loading #4270

Merged

7 tasks

merelcht moved this to To Do in Kedro Framework Nov 4, 2024

astrojuanlu assigned ElenaKhaustova Nov 6, 2024
