
Allow new attributes to be added to DataSets #1076

Closed

yetudada opened this issue Nov 30, 2021 · 18 comments
Labels
Issue: Feature Request · Stage: Technical Design 🎨 · Stage: User Research 🔬 · Type: Parent Issue

Comments

@yetudada
Contributor

Problem

The Data Catalog does not allow users to add custom attributes or metadata to datasets. This means that users have to:

  • Create their own custom datasets (link)
  • Use comments to help them keep track of associated metadata.

This request has presented itself in different ways, most commonly as a request to add dataset parameters instead of using parameters.yml for parameters that are directly related to a dataset (link). It has also appeared as users wanting to add optional tags to datasets and have a way to interact with those user-defined tags (link).

User evidence

Create their own custom datasets
"I have certain attributes to track within my datasets and have created custom DataSets to get around this issue. Now that hooks are out most of my reasons for custom DataSets are gone, and I can achieve the same thing with an after_node_run hook, but I still cannot attach custom attributes to datasets." - Waylon Walker (link)

Use comments to help them keep track of associated metadata
"Am I the only one that leaves random notes in the catalog next to at least the raw datasets?
Was thinking if there would be any usage of having a meta or info field." - Andrej Marsic

Connect Parameters and the Data Catalog
"There could easily be other cases where a node might need to know what parameter(s) a dataset was loaded with. You can always just duplicate the list in both yaml files, but it's more ideal to have the parameter specified in only one place, especially it's some parameter you can play around with." - D. Sturgeon (link)

Possible implementation

My suggestion is for the following:

  • Users can add whatever metadata fields they want
  • Users can interact with the fields in code, which helps the cases in (link) and (link)

It could look something like this:

dataset_1:
  type: ...
  filepath: ...
  attribute_1: ...
  attribute_2: ...

or even:

dataset_1:
  type: ...
  filepath: ...
  parameters: # or whatever this name is
    attribute_1: ...
    attribute_2: ...

Questions

  • Would layers be one of these metadata fields?
  • How can users abuse this?
@yetudada added the Issue: Feature Request label on Nov 30, 2021
@datajoely
Contributor

I'll also highlight the great work dbt does with adding documentation and tests to their sources and models:
https://docs.getdbt.com/docs/building-a-dbt-project/using-sources#testing-and-documenting-sources

@WaylonWalker
Contributor

I completely missed this issue. I'm excited to see this one happen. We had another use case this week: we are using some versioned datasets and would like to be able to create a plugin to clean up some of the old versions, inspired by sanoid. It would be nice if we could pop some extra config into the catalog to specify a policy for how many versions to keep; then we could create a separate plugin that cleans these up in a scheduled run, and keep all of our configuration for each dataset together.

my_dataset:
  versioned: true
  extras: # whatever place you want to give us to configure plugins/extras
    cleanup_old_versions_plugin:
      frequent: 4
      daily: 7
      year: 1
      month: 1
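
A purely hypothetical sketch of the kind of plugin described here, reading the raw catalog YAML directly; the extras / cleanup_old_versions_plugin keys are just the proposal above, not an existing Kedro or plugin API:

import yaml

# conf/base/catalog.yml is the default catalog location in a Kedro project.
with open("conf/base/catalog.yml") as f:
    catalog = yaml.safe_load(f)

for name, entry in catalog.items():
    policy = (entry.get("extras") or {}).get("cleanup_old_versions_plugin")
    if entry.get("versioned") and policy:
        # A real plugin would delete old dataset versions according to this policy.
        print(f"{name}: keep {policy}")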
  

@deepyaman
Member

Several other libraries allow users to add a metadata key to datasets, including Kubeflow Pipelines. I thought Azure ML Pipelines also allowed this; will update if I find it.

@antonymilne
Contributor

DVC allows this through a meta key: https://dvc.org/doc/user-guide/project-structure/dvc-files#specification.

@datajoely
Contributor

datajoely commented Aug 24, 2022

So my next question is - then what? The user has provided some arbitrary metadata:

  • Is it only available via hooks or the code API at runtime?
  • Do we expose it in Viz? If so, how?
  • Can users access this information within Nodes? My gut feeling is that they will ask for it immediately.
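
On the first question, a purely hypothetical sketch of what hook-level access could look like (nothing here existed at the time, and catalog._get_dataset is a private helper used only for brevity):

from kedro.framework.hooks import hook_impl


class MetadataInspectorHooks:
    @hook_impl
    def after_catalog_created(self, catalog):
        # Assumes datasets would keep the custom attributes from their catalog
        # entry on a `metadata` attribute, which was only a proposal at this point.
        for name in catalog.list():
            dataset = catalog._get_dataset(name)  # private API, illustration only
            meta = getattr(dataset, "metadata", None)
            if meta:
                print(f"{name}: {meta}")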

@merelcht
Member

merelcht commented Aug 25, 2022

This issue was discussed in a Technical Design session:

  • The team agrees that we should allow new attributes to be added to DataSets
  • The best way to do this would be through an attribute key in the catalog entry to be dealt with by the plugin owner. We don't want loads of random options in the catalog that we need to deal with on the Kedro side. Data validation would also be the responsibility of the plugin that deals with the data.

Open questions that need to be answered:

  • Should we go for a top-level key or rather free-form keys e.g:
my_dataset:
  type: ...
  filepath: ...
  metadata:  # <- this would be the top-level key
    kedro-viz:
      layer: ...
    kedro-mlflow: ...
    kedro-great-expectations: ...

or

my_dataset:
  type: ...
  filepath: ...
  kedro-viz:
    layer: ...
  kedro-mlflow: ...
  kedro-great-expectations: ...

A top-level key would make it clearer that any entries under it aren't core Kedro, but custom attributes that will be handled by a plugin. However, the top-level key does add extra nesting, and especially for the layer attribute this could be annoying for users.

  • If we go for a top-level key, what should this be called? metadata, custom_data, .... ?

Next step:

Design an implementation proposal for allowing new attributes to be added to DataSets. Make sure to address the questions above. Discuss this proposal with plugin developers e.g. @WaylonWalker and @Galileo-Galilei

@foxale

foxale commented Sep 30, 2022

This is exactly what we were looking for today. We were thinking about parametrizing the pandas-profiling package in a way that allows you to decide which dataset to profile and how.

The top-level key looks like a great solution. I would argue that the layer parameter makes more sense there, especially since kedro-viz is in fact a separate plugin.

@GabrieleCacchioni

This is something my team and I would love. Use case: display in kedro viz, possibly by showing datasets with a certain tag close to each other (hard to do if you choose to allow nested tags, I believe).

@Galileo-Galilei
Member

Hi, I'd love to see this feature implemented. My main use case would be to add a schema (backed by pandera, pydantic, great expectations...) to my catalog. This would enable:

  • documenting data (anyone reading the catalog can see the schema)
  • validating data at runtime
  • running the pipeline with fake (generated) data to perform pipeline consistency tests.

I can create an extra configuration file where the user declares all these attributes, but it would be a much better user experience to be able to declare this schema in the catalog.
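
As a rough sketch of what this could enable (the metadata key and its pandera sub-key are assumptions here, not an existing API), a hook could collect schemas declared in the catalog and validate data as it is loaded:

import pandera as pa
from kedro.framework.hooks import hook_impl


class PanderaValidationHooks:
    def __init__(self):
        self._schemas = {}

    @hook_impl
    def after_catalog_created(self, catalog):
        # Collect any column spec declared under the (assumed) metadata.pandera key.
        for name in catalog.list():
            dataset = catalog._get_dataset(name)  # private API, illustration only
            meta = getattr(dataset, "metadata", None) or {}
            columns = meta.get("pandera", {}).get("columns")
            if columns:  # e.g. {"col1": "int64", "col2": "float64"}
                self._schemas[name] = pa.DataFrameSchema(
                    {col: pa.Column(dtype) for col, dtype in columns.items()}
                )

    @hook_impl
    def after_dataset_loaded(self, dataset_name, data):
        # Validate the loaded data against the declared schema, if any.
        schema = self._schemas.get(dataset_name)
        if schema is not None:
            schema.validate(data)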

Regarding the open questions:

  • I'd go for a top-level key rather than free-form keys. This will make it explicit to users that the information stored there is not provided by Kedro (it will save you a lot of time on questions not actually related to Kedro, or on apparent "inconsistencies" between your documentation and their catalog caused by third-party plugins).
  • For this key, I prefer the name metadata, which is more explicit than the other proposals and something of a standard, but honestly I'd be fine with other options.

@merelcht Do you want me to draft a PR for this? I don't want to waste my time and yours if you don't have time to review or if you think the technical design is not mature enough, but if you think we have a chance to include it in the core library in, say, a couple of months, I'd be glad to provide support for this.

@merelcht
Member

merelcht commented Jan 30, 2023

Hi @Galileo-Galilei, thanks for responding to this issue. I'd be very happy for you to start drafting an implementation. We agreed with the team that this is functionality we want to support. We simply haven't been able to prioritise this yet. So if we can speed this up by getting your help, I'm all for it! 👍 😄

@merelcht
Member

Hi @Galileo-Galilei, we'll be prioritising this task in our next sprint, so I just wanted to check if you'd made any progress or have any findings we should know about? In #2409 you mentioned you might need to change the AbstractDataSet; why is that?

@Galileo-Galilei
Member

Hi @merelcht, as usual I overestimated the time I can allocate for this PR, sorry 😅

I played a bit with the syntax, and I ultimately wanted to add an extra metadata argument to the AbstractDataSet.__init__ method, but I was afraid that it would force me to modify all the datasets.

@merelcht
Member

> Hi @merelcht, as usual I overestimated the time I can allocate for this PR, sorry 😅
>
> I played a bit with the syntax, and I ultimately wanted to add an extra metadata argument to the AbstractDataSet.__init__ method, but I was afraid that it would force me to modify all the datasets.

No worries @Galileo-Galilei, and thanks for getting back to my question! I personally think it would make sense to have the attribute on the individual datasets, but not on the AbstractDataSet. What do you think about that?
It's still likely that we'll update all the datasets, but at least it wouldn't be forced by a change to the AbstractDataSet.
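
To make that concrete, a minimal sketch of the "metadata on the individual dataset" option; the metadata parameter is still only a proposal at this point, not an existing signature:

from typing import Any, Dict

import pandas as pd
from kedro.io import AbstractDataSet


class MyCSVDataSet(AbstractDataSet):
    def __init__(self, filepath: str, metadata: Dict[str, Any] = None):
        self._filepath = filepath
        # Kedro itself would ignore this; plugins or user code could read it.
        self.metadata = metadata

    def _load(self) -> pd.DataFrame:
        return pd.read_csv(self._filepath)

    def _save(self, data: pd.DataFrame) -> None:
        data.to_csv(self._filepath, index=False)

    def _describe(self) -> Dict[str, Any]:
        return {"filepath": self._filepath, "metadata": self.metadata}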

@inigohidalgo
Contributor

Hiya, just jumping in to add some feedback that arose from a conversation with @datajoely:

At the moment I'm looking into adding validation to some datasets, potentially using pandera. Pandera supports YAML syntax (like so) for defining these validations, so if I've understood the purpose of this feature correctly, it could be a good match. One potential source of friction I see is that things like validations could change a lot more often than the dataset definition itself, and are also very verbose, so I was wondering whether it would make sense to provide these additional attributes from a different file (e.g. validations.yml in this case), which could then be merged with catalog.yml at catalog creation.

My perspective was something like this, relying on OmegaConf's subkey merging:

catalog.yml:

sales_intermediate:
  filename: ...
  layer: ...
  save_kwargs: ...

validations.yml:

sales_intermediate:
  expected_schema:
    col1: ...
    col2: ...

And Joel's suggestion was something like this:

sales_intermediate:
  filename: ...
  layer: ...
  save_kwargs: ...
  validations: ${../validations.yml}

@antonymilne
Contributor

Hi @inigohidalgo, I agree it would be great to do this and indeed OmegaConf should enable us to inject yaml like this. Since we merge together all files matching the catalog pattern (by default "catalog*", "catalog*/**", "**/catalog*") it shouldn't make any difference which of those files you define your validation rules in, so long as it matches that pattern. So you'd have something like:

# catalog.yml
sales_intermediate:
  filename: ...
  metadata:
    validations: ${sales_intermediate_validation}  # no need to say which file this is defined in

# catalog_something.yml
sales_intermediate_validation:
  schema_type: dataframe
  version: {PANDERA_VERSION}
  columns:
    column1:
      title: null
      description: null
      dtype: int64
      nullable: false

The only catch I can think of here is that we might need something to prevent kedro from trying to process sales_intermediate_validation as a dataset definition (which would fail), e.g. you might need to prefix its name with _ so it gets ignored (we already do this for yaml anchors - see paragraph just above this).

@inigohidalgo
Contributor

inigohidalgo commented Mar 30, 2023

In a way this is the same concept as YAML anchors, which we already use in our use cases for dataset definitions; we could do this without OmegaConf even. I was wondering if you'd be able to do this merge the way I outlined it in my example, leveraging OmegaConf's nested sub-config merge. Laying it out like this should get around the issue of Kedro processing it as a dataset (I'm making an assumption about the order of operations in Kedro: that it does all the config merging and interpolating first, and then instantiates the Catalog from the final catalog dictionary):

sales_intermediate:
  # from catalog.yml
  filename: ...
  layer: ...
  save_kwargs: ...
  # from validations.yml
  expected_schema: 
    col1: ...
    col2: ...

Note: I haven't read the entire issue in depth, but I notice you used a sub-key metadata: to indicate this additional metadata, which would mean that for validations.yml to match the structure you wrote, it would need to be:

sales_intermediate:
  metadata:
    pandera_schema:
      version: ...
      columns: ...

But the point is the same :)

@antonymilne
Contributor

antonymilne commented Mar 30, 2023

Oh I see - sorry I misunderstood your previous example. You're right that what I wrote above is just a cross-file extension of what yaml anchors can do.

I didn't think of doing it the way you suggested here but it makes perfect sense. Aside from the fact that metadata doesn't exist yet, this almost works perfectly already and should be easy for us to enable. Here's a toy example where you split the definition of a catalog entry into two files:

# catalog_types.yml
example_iris_data:
  type: pandas.CSVDataSet

# catalog_filepaths.yml
example_iris_data:
  filepath: data/01_raw/iris.csv

Right now the config loader will throw an error about duplicate keys, but if we remove OmegaConfigLoader's call to self._check_duplicates(seen_file_to_keys) (which is probably just inherited from the original ConfigLoader and shouldn't be there, for exactly this reason), then it will merge these two entries together to give the full one, as you'd expect:

example_iris_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/iris.csv

So yes, in future, I think this should work exactly like you say 🙂 The only thing you'd need to be careful of is to make sure that your validations.yml file gets picked up by the config loader by naming it something that matches the pattern (by default "catalog*", "catalog*/**", "**/catalog*", but this should be customisable).
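
For the "customisable" part, something along these lines in settings.py should do it; treat the exact config_patterns keys as an assumption to verify against the docs for your Kedro version:

from kedro.config import OmegaConfigLoader

CONFIG_LOADER_CLASS = OmegaConfigLoader
CONFIG_LOADER_ARGS = {
    "config_patterns": {
        # Default catalog patterns, plus an extra one so validations*.yml files are merged in too.
        "catalog": ["catalog*", "catalog*/**", "**/catalog*", "validations*"],
    }
}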

@yetudada
Contributor Author

@WaylonWalker @inigohidalgo @foxale @GabrieleCacchioni We're excited to announce that this feature has been added and will be released soon. #2537 introduced the changes: the metadata attribute is ignored by Kedro itself, but may be consumed by users or external plugins.

It works in the following way:

shuttles:
  type: pandas.ExcelDataSet
  filepath: data/01_raw/shuttles.xlsx
  metadata:
    plugin_1: 
      attribute: ...
    plugin_2:
      attribute: ...

I'll close this issue but please shout when you use it ❤️ Thank you so much for giving us feedback on this.
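
As a minimal sketch of how a plugin or hook might consume this (assuming, per #2537, the dataset object keeps the catalog's metadata dict on a metadata attribute; the plugin_1 key simply mirrors the example above):

from kedro.framework.hooks import hook_impl


class Plugin1Hooks:
    @hook_impl
    def after_catalog_created(self, catalog):
        # `catalog.datasets` gives attribute-style access to the instantiated datasets.
        shuttles = catalog.datasets.shuttles
        plugin_config = (getattr(shuttles, "metadata", None) or {}).get("plugin_1")
        if plugin_config:
            # Configure the (hypothetical) plugin_1 behaviour for this dataset.
            print(f"plugin_1 config for shuttles: {plugin_config}")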
