Implement `KedroDataCatalog.to_config()` method as part of catalog serialization/deserialization feature #3932
Context
Requirements:
The catalog already has `from_config()`, so `KedroDataCatalog.to_config()` has to output a configuration that can be fed back into the existing `KedroDataCatalog.from_config()` method to load it.
We want to solve this problem at the framework level and avoid modifying existing datasets where possible.
Implementation
Solution description
We consider 3 different ways of loading datasets:
1. Lazy datasets loaded from the config — in this case, we store the dataset configuration at the catalog level; the dataset object is not instantiated.
2. Materialized datasets loaded from the config — we store the dataset configuration at the catalog level and use the `dataset.from_config()` method to instantiate the dataset, which calls the underlying dataset constructor.
3. Materialized datasets added to the catalog — instantiated dataset objects are passed to the catalog; the dataset configuration is not stored at the catalog level.

Case 1 can be solved at the catalog level. Cases 2 and 3 require retrieving the dataset configuration from the instantiated dataset object.
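The three entry points above can be sketched with a stub catalog. This is illustrative only, not the `KedroDataCatalog` API; it shows why only cases 1 and 2 leave a config at the catalog level, while case 3 does not:

```python
class TinyCatalog:
    """Stub catalog, not the real KedroDataCatalog."""

    def __init__(self):
        self._configs = {}    # dataset configs kept at the catalog level
        self._datasets = {}   # instantiated dataset objects

    def add_lazy(self, name, config):
        # Case 1: store the config only; instantiate on first access.
        self._configs[name] = config

    def add_from_config(self, name, config):
        # Case 2: store the config and instantiate right away
        # (a plain dict stands in for Dataset.from_config here).
        self._configs[name] = config
        self._datasets[name] = dict(config)

    def add(self, name, dataset):
        # Case 3: only the instantiated object is passed in;
        # no configuration ever reaches the catalog.
        self._datasets[name] = dataset

cat = TinyCatalog()
cat.add_lazy("a", {"type": "pandas.CSVDataset", "filepath": "a.csv"})
cat.add_from_config("b", {"type": "pandas.CSVDataset", "filepath": "b.csv"})
cat.add("c", object())
```

Only `a` and `b` have configs the catalog itself can emit; for `c`, the configuration must be recovered from the object, which is what the `AbstractDataset.to_config()` step below addresses.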
Solution for cases 2 and 3 that avoids modifying existing datasets (as per the requirements)
1. Save the constructor call arguments at the level of `AbstractDataset`, in the `_init_args` field.
2. Implement `AbstractDataset.to_config()` to retrieve the configuration from an instantiated dataset object based on the object's `_init_args`.
3. Implement `KedroDataCatalog.to_config()`.
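Steps 1 and 2 can be sketched with a wrapper installed via `__init_subclass__` that records constructor arguments. Everything here (including the `CSVDataset` example class) is an illustrative sketch, not the actual Kedro implementation:

```python
import functools
import inspect

class AbstractDataset:
    def __init_subclass__(cls, **kwargs):
        # Runs for every subclass: wrap its __init__ so the call args
        # are recorded in _init_args before the real constructor runs.
        super().__init_subclass__(**kwargs)
        init = cls.__init__

        @functools.wraps(init)
        def captured_init(self, *args, **kw):
            # Bind positional args to parameter names so the captured
            # config is keyword-based, like a YAML catalog entry.
            bound = inspect.signature(init).bind(self, *args, **kw)
            bound.apply_defaults()
            self._init_args = {k: v for k, v in bound.arguments.items() if k != "self"}
            init(self, *args, **kw)

        cls.__init__ = captured_init

    def to_config(self):
        # Rebuild a from_config-style entry: class path plus captured kwargs.
        return {"type": f"{type(self).__module__}.{type(self).__name__}", **self._init_args}

class CSVDataset(AbstractDataset):  # illustrative dataset
    def __init__(self, filepath, load_args=None):
        self.filepath = filepath
        self.load_args = load_args

ds = CSVDataset("data/raw/iris.csv")
```

Because `__init_subclass__` lives on the parent, no existing dataset needs to change: every subclass gets the capture behavior automatically.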
Once cases 2 and 3 are solved, we can implement a common solution at the catalog level. For that, we need to handle both lazy and materialized datasets: retrieve the configuration from the catalog for the former, and via `AbstractDataset.to_config()` for the latter.
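The catalog-level merge can be sketched as follows; the helper name and argument shapes are assumptions, not the Kedro API:

```python
def catalog_to_config(lazy_configs, materialized_datasets):
    """Merge configs of lazy datasets (already stored at the catalog level)
    with configs recovered from materialized dataset objects."""
    config = dict(lazy_configs)               # case 1: config already known
    for name, dataset in materialized_datasets.items():
        config[name] = dataset.to_config()    # cases 2 and 3: ask the object
    return config

class _StubDataset:  # stands in for an AbstractDataset with to_config()
    def to_config(self):
        return {"type": "pandas.ParquetDataset", "filepath": "b.pq"}

merged = catalog_to_config(
    {"a": {"type": "pandas.CSVDataset", "filepath": "a.csv"}},
    {"b": _StubDataset()},
)
```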
After the configuration is retrieved, we need to "unresolve" the credentials and keep them in a separate dictionary, as we did when instantiating the catalog. For that, a `CatalogConfigResolver.unresolve_config_credentials()` method can be implemented to undo the result of `CatalogConfigResolver._resolve_config_credentials()`.
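A sketch of what "unresolving" could look like; the naming scheme for the extracted credential keys is an assumption:

```python
def unresolve_config_credentials(config):
    """Pull inline credential dicts out of each dataset config and replace
    them with a reference name, mirroring how from_config(catalog, credentials)
    accepts credentials as a separate dict."""
    catalog, credentials = {}, {}
    for ds_name, ds_conf in config.items():
        ds_conf = dict(ds_conf)  # don't mutate the caller's config
        if isinstance(ds_conf.get("credentials"), dict):
            cred_name = f"{ds_name}_credentials"  # assumed naming scheme
            credentials[cred_name] = ds_conf["credentials"]
            ds_conf["credentials"] = cred_name
        catalog[ds_name] = ds_conf
    return catalog, credentials

cfg = {
    "reviews": {
        "type": "pandas.SQLTableDataset",
        "table_name": "reviews",
        "credentials": {"con": "postgresql://user:secret@host/db"},
    }
}
catalog_cfg, creds = unresolve_config_credentials(cfg)
```

This keeps secrets out of the emitted catalog config, so it can be committed while the credentials dictionary is stored separately.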
Excluding parameters
We need to exclude parameters, as they're treated as `MemoryDataset`s at the catalog level.
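For example, entries following Kedro's parameter naming convention (`parameters` and the `params:` prefix) would be skipped when emitting the config; the filter function here is a sketch:

```python
def is_parameter(dataset_name):
    # Parameters live in the catalog as MemoryDatasets, so to_config()
    # should filter them out rather than try to serialize them.
    return dataset_name == "parameters" or dataset_name.startswith("params:")

names = ["companies", "params:model_options", "parameters"]
serializable = [n for n in names if not is_parameter(n)]
```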
Not covered cases
Non-serializable objects, or objects that require additional logic implemented at the dataset level to save/load them (see examples below):
The solution will require extending the parent `AbstractDataset.to_config()` at the dataset level to serialize those objects. This can be addressed one by one in separate PRs.
SharedMemoryDataset - not expected to be saved and loaded.
Modifications of datasets in the catalog other than replacement — we briefly discussed this with @idanov and agreed not to consider this case for now, as we still insist on replacing datasets in the catalog rather than modifying their properties.
> Non-serializable objects or objects that require additional logic implemented at the level of the dataset to save/load them

Wouldn't it be possible to force datasets to only have static, primitive properties in the `__init__` method so that serialising them is trivial?
That would be an ideal option, as a common solution would work out of the box without corner cases. However, it would require more significant changes on the datasets' side.
As a temporary solution without a breaking change, we can try extending the parent `AbstractDataset.to_config()` at the dataset level for those datasets and serializing such objects. However, I cannot guarantee that we'll be able to cover all the cases.
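As an illustration of that temporary route, a dataset taking a non-serializable argument (here a class object, as `PartitionedDataset`-style datasets do) could override `to_config()` to emit an importable string instead. Everything below is a standalone sketch, not the Kedro implementation:

```python
import json

class PartitionedLikeDataset:  # illustrative, minimal
    def __init__(self, path, dataset_type):
        self.path = path
        self.dataset_type = dataset_type  # a class object: not YAML-serializable

    def to_config(self):
        cls = self.dataset_type
        return {
            "type": f"{type(self).__module__}.{type(self).__name__}",
            "path": self.path,
            # Serialize the class as "module.QualName" so a from_config
            # counterpart could re-import it with importlib.
            "dataset_type": f"{cls.__module__}.{cls.__qualname__}",
        }

class CSVStub:  # stands in for a concrete dataset class
    pass

ds = PartitionedLikeDataset("data/parts", CSVStub)
serialized = json.dumps(ds.to_config())  # succeeds: all values are primitives
```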
Additional notes

Referenced code: kedro/kedro/io/kedro_data_catalog.py, line 268 in 9464dc7 (the existing `from_config` method).
Capturing the call args relies on `AbstractDataset.__init_subclass__`, which allows changing the behavior of subclasses from inside `AbstractDataset`: https://docs.python.org/3/reference/datamodel.html#customizing-class-creation
Examples of non-serializable constructor arguments:
- `Connection` objects and credentials (`from google.oauth2.credentials import Credentials`): https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-5.1.0/_modules/kedro_datasets/pandas/gbq_dataset.html#GBQQueryDataset
- `type[AbstractDataset]` arguments: https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-5.1.0/_modules/kedro_datasets/partitions/incremental_dataset.html#IncrementalDataset

`LambdaDataset` is not a concern anymore, since "Can we remove LambdaDataset?" #4292.

Issues blocking further implementation
- "versioned flag and dataset parameter" #4326: currently solved by adding logic to update `VERSIONED_FLAG_KEY` if `version` is provided.
- Which `save_version` should we save and load back: "Discrepancy between setting save_version via catalog constructor and when passing datasets" #4327: needs a discussion.

Tested with
`CachedDataset`, `PartitionedDataset`, `IncrementalDataset`, `MemoryDataset` and various other kedro datasets.

How to test
example.txt