
Figure out a better convention for parsing ML metadata from containers #906

Closed
neuromage opened this issue Mar 5, 2019 · 4 comments

@neuromage (Contributor)

Right now, we parse a specific output path, /output/mlmetadata/*. This is error-prone, since users may choose that location for something else. We should find a better way of parsing metadata produced by container steps.

@Ark-kun (Contributor) commented Mar 7, 2019

If we use a product-prefixed name like kfp_metadata, I think it's pretty safe.

Generally, I'd prefer to treat the mlmetadata like any other parameter/artifact output. Then we can detect it by the output name.
I'd prefer that there be no hardcoded paths inside the container. The program should receive all paths from the caller. If someone still wants to hardcode a path, it's better to do that in the component.yaml file, as that's easier to change later.
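For example, something like this (the image, program, and flag names here are hypothetical):

outputs:
- name: Metadata
  type: Metadata
implementation:
  container:
    image: gcr.io/my-project/trainer:latest
    command: [python, train.py, --metadata-path, {outputPath: Metadata}]

The system resolves the {outputPath: Metadata} placeholder to a concrete path and passes it to the program, so nothing inside the container needs to know the path in advance.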

WDYT?

@neuromage (Contributor, Author)

Yep, hardcoded paths aren't great. This is going away soon now that we've settled on using an SDK to record output artifacts. Let's leave this open until that's fixed so users are aware of it.

I'm curious though, why was it ever ok to do this with the artifacts like metrics and ui metadata? Won't we run into the same problem in those cases as well?

@Ark-kun (Contributor) commented Apr 17, 2019

> I'm curious though, why was it ever ok to do this with the artifacts like metrics and ui metadata? Won't we run into the same problem in those cases as well?

It was never OK for me. I objected in both cases. I also spent quite a lot of time trying to undo the damage (e.g. making the UI metadata and Metrics just normal artifacts), but some of my efforts were blocked.

@Ark-kun (Contributor) commented Apr 17, 2019

If we only want per-task metadata support, I'd like it to just be an explicit output with the name and type "Metadata":

outputs:
- name: Metadata
  type: Metadata

The program then just writes the metadata to the provided path.

If we want per-output metadata support, I'd like it to work as follows:

Variant 1: Explicit extra outputs with the "metadata" type and suffix:

outputs:
- name: Trained model
  type: TFModel
- name: Trained model metadata
  type: Metadata

The program then writes each output to the provided path. The metadata can also easily be passed to downstream components, which can examine it (e.g. look at the accuracy and decide to do additional training), as sketched below.
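A downstream component could then declare the metadata as a regular input (again, the image, program, and flag names are hypothetical):

inputs:
- name: Trained model metadata
  type: Metadata
implementation:
  container:
    image: gcr.io/my-project/evaluator:latest
    command: [python, evaluate.py, --metadata-path, {inputPath: Trained model metadata}]

The {inputPath: ...} placeholder works like {outputPath: ...}: the system materializes the upstream metadata file and hands its path to the program.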

Variant 2: Implicit metadata outputs

If your program writes the output data to model_path, then it should write the metadata to model_path + ".metadata.json". The system will notice the file and take care of it. With GCS or volumes, the file is available in real time.
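Concretely, the on-disk layout would look like this (hypothetical paths):

/outputs/model                  <- primary output, written by the program
/outputs/model.metadata.json    <- sidecar metadata, picked up by the system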

With these approaches, we can support all programming languages without writing SDKs for them or forcing every component author to install our code inside their containers.
