Improve `catalog.list` or alternative for dataset factory? #3312

noklam · 2023-11-15T11:48:07Z

Description

Background: https://linen-slack.kedro.org/t/16064885/when-i-say-catalog-list-in-a-kedro-jupter-lab-instance-it-do#ad3bb4aa-f6f9-44c6-bb84-b25163bfe85c

With dataset factories, the "definition" of a dataset is not known until the pipeline is run. Users working in a Jupyter notebook expect catalog.list to show the full list of datasets.

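The current workaround to see the datasets for the __default__ pipeline looks like this:

# `pipelines` is the dict of registered pipelines available in a Kedro notebook session.
# Calling catalog.exists() makes the catalog resolve any factory pattern matching the name.
for dataset in pipelines["__default__"].data_sets():
    catalog.exists(dataset)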

Context

When using CLI commands such as kedro catalog list, we match the factory patterns in the catalog against the datasets used in the pipeline. In the interactive flow, no such matching is done yet.
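The matching step described above can be sketched with the standard library alone. This is only an illustration, not Kedro's own matching code: `pattern_to_regex`, `resolve_names` and the sample dataset names are hypothetical, and placeholders are assumed to match any non-empty string.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Turn a factory pattern such as "{experiment}.model" into a compiled regex."""
    regex = ""
    # Splitting on a capturing group keeps the "{placeholder}" tokens in the result.
    for part in re.split(r"(\{\w+\})", pattern):
        if re.fullmatch(r"\{\w+\}", part):
            # Placeholder: capture it under its own name.
            regex += f"(?P<{part[1:-1]}>.+)"
        else:
            # Literal text: escape regex metacharacters such as ".".
            regex += re.escape(part)
    return re.compile(regex)


def resolve_names(explicit, patterns, pipeline_datasets):
    """Return explicit catalog entries plus pipeline datasets matched by a pattern."""
    resolved = list(explicit)
    compiled = [pattern_to_regex(p) for p in patterns]
    for name in sorted(pipeline_datasets):
        if name in resolved:
            continue
        if any(rx.fullmatch(name) for rx in compiled):
            resolved.append(name)
    return resolved


# Hypothetical catalog contents and pipeline dataset names.
explicit = ["companies"]
patterns = ["{experiment}.model"]
pipeline_datasets = {"companies", "baseline.model", "tuned.model", "scratch"}

print(resolve_names(explicit, patterns, pipeline_datasets))
```

In this sketch, "scratch" matches no pattern and has no explicit entry, so it is left out of the listing. The key point is that resolution needs the pipeline's dataset names as input, which the catalog alone does not have.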

Possible Implementation

Could check dataset existence when the session is created. We need to verify if that has any unexpected side effects.

This ticket is still open in scope and we don't have a specific implementation in mind. Whoever picks it up can evaluate different approaches, considering side effects and avoiding coupling with other components.

Possible Alternatives

- catalog.list(pipeline=<name>): not a good solution, because the catalog wouldn't have access to a pipeline.
- Do something similar to what happens when kedro catalog list is called.


datajoely · 2023-11-15T13:13:31Z

could we have something like catalog.resolve(pipeline:Optional[str]).list()?

merelcht · 2023-11-20T14:56:44Z

This Viz issue is related: kedro-org/kedro-viz#1480

MarcelBeining · 2024-02-16T09:36:36Z

> could we have something like catalog.resolve(pipeline:Optional[str]).list()?

That would be perfect! We would need such a thing

noklam · 2024-02-16T17:00:06Z

@MarcelBeining Can you explain a bit more about why you need this? I am thinking about this again because I am trying to build a plugin for Kedro, and this would come in handy for compiling a static version of the configuration.

MarcelBeining · 2024-03-12T14:08:46Z

@noklam We are trying to find Kedro datasets for which we have not written a data test, so we iterate over catalog.list(). However, when we use dataset factories, the datasets captured by a factory are not listed in catalog.list().

noklam · 2024-03-12T14:48:55Z

@MarcelBeining Did I understand the question correctly as:

Find which datasets are not written in catalog.yml yet? I have some WIP in https://github.com/noklam/kedro-inspect which explores this idea, but I haven't finished it.

Do kedro catalog resolve or kedro catalog list help you? If not, what is missing?

MarcelBeining · 2024-03-12T16:04:11Z

@noklam "Find which datasets are not written in catalog.yml yet, including those resolved by dataset factories": yes.

kedro catalog resolve shows what I need, but it is a CLI command and I need it within Python (of course one could use os.system etc., but a simple extension of catalog.list() should not be that hard).

noklam · 2024-03-12T16:06:58Z

@MarcelBeining Are you integrating this with some extra functionality? How do you consume this information, if it is OK to share?

ianwhale · 2024-05-17T12:57:57Z

@noklam

Adding on from our discussion on Slack,

kedro catalog resolve does what I'd want.

But I'd also like that information easily consumable in a notebook (for example).

So if my catalog stores models like:

"{experiment}.model":
  type: pickle.PickleDataset
  filepath: data/06_models/{experiment}/model.pickle
  versioned: true

I would want to be able to (somehow) do something like:

models = {}

for model_dataset in [d for d in catalog.list(*~*magic*~*) if ".model" in d]:
    models[model_dataset] = catalog.load(model_dataset)

It's a small thing, but I was kind of surprised that my {experiment}.model entries were not listed at all in catalog.list().
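To make the request concrete, here is a rough sketch of what "resolving" the pattern above could produce for a few dataset names. It uses only the standard library and is not Kedro's implementation: `resolve_entry` and the sample names are hypothetical, while the pattern and template are copied from the catalog entry above.

```python
import re

PATTERN = "{experiment}.model"
TEMPLATE = {
    "type": "pickle.PickleDataset",
    "filepath": "data/06_models/{experiment}/model.pickle",
    "versioned": True,
}


def resolve_entry(name, pattern, template):
    """Return the concrete config for `name`, or None if the pattern does not match."""
    # tokens alternates literal text and placeholder names: ["", "experiment", ".model"]
    tokens = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(tok) if i % 2 == 0 else f"(?P<{tok}>.+)"
        for i, tok in enumerate(tokens)
    )
    match = re.fullmatch(regex, name)
    if match is None:
        return None
    values = match.groupdict()
    # Substitute the captured placeholder values into every string field.
    return {
        key: value.format(**values) if isinstance(value, str) else value
        for key, value in template.items()
    }


# Hypothetical dataset names, as they might appear in a pipeline.
resolved = {}
for name in ["baseline.model", "tuned.model", "companies"]:
    config = resolve_entry(name, PATTERN, TEMPLATE)
    if config is not None:
        resolved[name] = config

for name, config in resolved.items():
    print(name, "->", config["filepath"])
```

With a resolved mapping like this, the loop over `[d for d in ... if ".model" in d]` becomes a plain dictionary iteration.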

noklam · 2024-05-24T11:34:52Z

Another one, bumped to high priority as discussed in Slack.
https://linen-slack.kedro.org/t/18841749/hi-i-have-a-dataset-factory-specified-and-when-i-do-catalog-#eb609fb2-fce6-434d-a652-ffb62eb41e7b

noklam · 2024-06-03T13:50:37Z

What if DataCatalog is iterable?

for dataset in data_catalog:
    ...

datajoely · 2024-06-03T15:01:29Z

I think it's neat @noklam , but I don't know if it's discoverable.

To me DataCatalog.list() feels more powerful in the IDE than list(DataCatalog)...

astrojuanlu · 2024-06-03T15:39:39Z

why_not_both.gif

def list(self):
    ...

def __iter__(self):
    # __iter__ must return an iterator, so wrap the list in iter()
    return iter(self.list())
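A runnable toy version of the "both" idea, for illustration only: `ToyCatalog` is a hypothetical stand-in, not Kedro's DataCatalog. Note that `__iter__` has to return an iterator, so the list is wrapped in `iter()`.

```python
from typing import Iterator, List


class ToyCatalog:
    """Hypothetical stand-in for a catalog that supports list() and iteration."""

    def __init__(self, names):
        self._names = list(names)

    def list(self) -> List[str]:
        # Return a copy so callers cannot mutate the internal state.
        return list(self._names)

    def __iter__(self) -> Iterator[str]:
        # Delegate to list() so both access styles always agree.
        return iter(self.list())


toy = ToyCatalog(["companies", "baseline.model"])

print(toy.list())
for name in toy:
    print(name)
```

Delegating `__iter__` to `list()` keeps a single source of truth, so `toy.list()` stays discoverable in the IDE while `for name in toy` works as well.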

Galileo-Galilei · 2024-06-03T20:42:45Z

I've also wanted to be able to iterate through the datasets for a while, but it raises some unanswered questions:

Should we iterate over catalog (maybe more intuitive) or catalog.datasets, as described in [DataCatalog]: Iterate through datasets objects in the catalog #3916 (maybe more accurate, especially with regard to the "resolving" issue discussed below)?
How would this loop handle dataset factories? We could eventually replace:
- catalog.list() with [dataset.name for dataset in catalog]
- catalog.search (which does not exist but is suggested in [DataCatalog]: Add functionality to search datasets in the catalog #3917) with [dataset.name for dataset in catalog if re.match(regex, dataset.name)]

But we always face the same issue: we would first need to "resolve" the dataset factories relative to a pipeline. That would eventually give [dataset.name for dataset in catalog.resolve(pipeline)], but is it really a better or more intuitive syntax? I personally find it quite neat, but arguably beginners would prefer a "native" method.

The real advantage of doing so is that we would not need to create a search method covering every type of supported search (by extension, by regex, and so on, as suggested in the corresponding issue), because iteration is easily customizable, so it is less of a maintenance burden in the end.
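As a sketch of that argument: once the resolved datasets can be iterated, each kind of search is a one-line comprehension. Everything here is hypothetical; `DatasetInfo` and the `resolved` list stand in for whatever iterating a resolved catalog would yield.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetInfo:
    """Hypothetical record for one resolved dataset."""

    name: str
    type: str
    filepath: str


# Stand-in for the result of iterating over a resolved catalog.
resolved = [
    DatasetInfo("companies", "pandas.CSVDataset", "data/01_raw/companies.csv"),
    DatasetInfo("baseline.model", "pickle.PickleDataset", "data/06_models/baseline/model.pickle"),
    DatasetInfo("tuned.model", "pickle.PickleDataset", "data/06_models/tuned/model.pickle"),
]

# Search by name with a regex.
by_regex = [d.name for d in resolved if re.search(r"\.model$", d.name)]
# Search by attribute (dataset type).
by_type = [d.name for d in resolved if d.type == "pickle.PickleDataset"]
# Search by file extension.
by_extension = [d.name for d in resolved if d.filepath.endswith(".csv")]

print(by_regex, by_type, by_extension)
```

None of these filters needs a dedicated method on the catalog, which is the maintenance saving being described.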

noklam · 2024-06-03T21:33:24Z

catalog.list already supports regex; isn't that identical to what you suggest as catalog.search?

datajoely · 2024-06-04T00:09:27Z

@noklam you can only search by name; namespaces aren't really supported and you can't search by attribute.

noklam · 2024-06-04T14:02:34Z

A namespace is just a prefix string, so it works pretty well. I do believe there are benefits to improving it, but I think we should at least add an example of the existing feature, since @Galileo-Galilei told me he was not aware of it, and most likely very few people are.

#3924

noklam added the Issue: Feature Request New feature or improvement to existing feature label Nov 15, 2023

noklam added this to Kedro Framework Nov 15, 2023

github-actions bot mentioned this issue Dec 1, 2023

Monthly issue metrics report #3375

Open

takikadiri mentioned this issue Dec 1, 2023

Best-effort dataset factories takikadiri/kedro-boot#3

Merged

ankatiyar added this to the Dataset Factory Improvements milestone Mar 4, 2024

astrojuanlu mentioned this issue Jun 3, 2024

[DataCatalog]: Add functionality to search datasets in the catalog #3917

Open

noklam mentioned this issue Jun 4, 2024

Add an example of catalog.list(<regex>) and replace io to catalog in docs #3924

Merged

7 tasks

kacper-ki mentioned this issue Jul 16, 2024

Factory datasets are not getting validated Galileo-Galilei/kedro-pandera#80

Closed

This was referenced Oct 22, 2024

ThreadRunner Dataset DatasetAlreadyExistsError: Dataset has already been registered #4250

Closed

DataCatalog to support listing generic datasets #4184

Closed

noklam changed the title ~~How to improve catalog.list or alternative for dataset factory?~~ Improve catalog.list or alternative for dataset factory? Oct 23, 2024
