
Figure out a better convention for parsing ML metadata from containers #906

Closed
neuromage opened this issue Mar 5, 2019 · 4 comments

@neuromage (Contributor)

Right now, we parse a specific output path, /output/mlmetadata/*. This is error-prone, since users may choose that location for something else. We should find a better way of parsing metadata produced by container steps.

@Ark-kun (Contributor) commented Mar 7, 2019

If we use a product-prefixed name like kfp_metadata, I think it's pretty safe.

Generally, I'd prefer to treat the mlmetadata like any other parameter/artifact output. Then we can detect it by the output name.
I'd prefer that there be no hardcoded paths inside the container. The program should receive all paths from the caller. If someone still wants to hardcode a path, it's better to do that in the component.yaml file, as that's easier to change later.
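For example, something like this (the image, program, and flag names here are hypothetical):

outputs:
- name: Metadata
  type: Metadata
implementation:
  container:
    image: gcr.io/my-project/trainer:latest
    command: [python, train.py, --metadata-path, {outputPath: Metadata}]

The system resolves the {outputPath: Metadata} placeholder to a concrete path and passes it to the program, so nothing inside the container needs to know the path in advance.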

WDYT?

@neuromage (Contributor, Author)

Yep, hardcoded paths aren't great. This is going away soon now that we've settled on using an SDK to record output artifacts. Let's leave this open until that's fixed so users are aware of it.

I'm curious though, why was it ever ok to do this with the artifacts like metrics and ui metadata? Won't we run into the same problem in those cases as well?

@Ark-kun (Contributor) commented Apr 17, 2019

> I'm curious though, why was it ever ok to do this with the artifacts like metrics and ui metadata? Won't we run into the same problem in those cases as well?

It was never OK for me. I objected in both cases. I also spent quite a lot of time trying to undo the damage (e.g. making the UI metadata and Metrics just normal artifacts), but some of my efforts were blocked.

@Ark-kun (Contributor) commented Apr 17, 2019

If we only want per-task metadata support, I'd like it to just be an explicit output with the name and type "Metadata":

outputs:
- name: Metadata
  type: Metadata

The program then just writes the metadata to the provided path.

If we want per-output metadata support, I'd like it to work as follows:

Variant 1: Explicit extra outputs with the "metadata" type and suffix:

outputs:
- name: Trained model
  type: TFModel
- name: Trained model metadata
  type: Metadata

The program then writes each output to the provided path. The metadata can also easily be passed to downstream components, which can examine it (e.g. look at the accuracy and decide to do additional training), as sketched below.
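A downstream component could then declare the metadata as a regular input (again, the image, program, and flag names are hypothetical):

inputs:
- name: Trained model metadata
  type: Metadata
implementation:
  container:
    image: gcr.io/my-project/evaluator:latest
    command: [python, evaluate.py, --metadata-path, {inputPath: Trained model metadata}]

The {inputPath: ...} placeholder works like {outputPath: ...}: the system materializes the upstream metadata file and hands its path to the program.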

Variant 2: Implicit metadata outputs

If your program writes the output data to model_path, then it should write the metadata to model_path + ".metadata.json". The system will notice the file and take care of it. With GCS or volumes, the file is available in real time.
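Concretely, the on-disk layout would look like this (hypothetical paths):

/outputs/model                  <- primary output, written by the program
/outputs/model.metadata.json    <- sidecar metadata, picked up by the system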

With these approaches, we can support all programming languages without writing SDKs for them or forcing every component author to install our code inside their containers.
