
Metadata not working in Kubeflow Pipelines? #216

Closed
rummens opened this issue Jun 7, 2019 · 6 comments

Comments

@rummens
Contributor

rummens commented Jun 7, 2019

Problem

I have deployed the taxi_simple example pipeline in KFP running on K8s outside of GCP. The artefacts are shared using a PV because of the current Beam limitation for S3.

I have noticed that the metadata features are not working. For example, every time I rerun the same pipeline (no code or data changes), every component is executed again and recomputes everything. I assumed that metadata would recognize that the pipeline had already run and skip it (reusing the outputs of the "old" run).

Metadata is already integrated into Kubeflow Pipelines (kubeflow/pipelines#884).

Ideas

I have the following ideas:

  1. I have to configure something that I am not aware of.
  2. For some reason metadata only works when artefacts are stored in object storage.

Can someone explain to me how metadata works at a high level? Who checks whether there was a previous run? I assume each component checks for itself, or is there a separate "metadata" component that runs these checks?

Thank you very much for your support!

@neuromage

@rummens This is a current limitation of how metadata is integrated in Kubeflow Pipelines. We're actively working on improving the metadata support so it's on par with the Airflow version. Currently, we record output artifacts, but we don't actually support caching using the recorded metadata. This will change very soon though. We're expecting to have metadata + caching mechanisms converge in both the Airflow and KFP versions of TFX pipelines over the next couple of months.
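
For anyone wondering what "recording output artifacts" looks like at the ML Metadata (MLMD) layer underneath TFX, here is a minimal sketch using the ml_metadata Python client. It is only an illustration; the type and property names are made up and not the ones TFX registers internally.

```python
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Local SQLite-backed store, just for demonstration.
config = metadata_store_pb2.ConnectionConfig()
config.sqlite.filename_uri = "/tmp/mlmd.sqlite"
config.sqlite.connection_mode = metadata_store_pb2.SqliteMetadataSourceConfig.READWRITE_OPENCREATE
store = metadata_store.MetadataStore(config)

# Register an artifact type once.
examples_type = metadata_store_pb2.ArtifactType()
examples_type.name = "Examples"
examples_type.properties["span"] = metadata_store_pb2.INT
examples_type_id = store.put_artifact_type(examples_type)

# Record one output artifact produced by a component run.
examples = metadata_store_pb2.Artifact()
examples.type_id = examples_type_id
examples.uri = "/mnt/pipeline-pv/csv_example_gen/examples/1"
examples.properties["span"].int_value = 1
[examples_id] = store.put_artifacts([examples])
```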

@rummens
Contributor Author

rummens commented Jun 8, 2019

Thank you for the update. Would you be so kind as to explain to me how metadata works at a high level?
Does every component store a reference to its inputs/outputs in the metadata store and then, when run again, check whether there is an entry for itself? Does it then compare the recorded input data with the actual data and, if they match, stop so that the next component can start?

Or is there something else on top of the whole pipeline? Because if every component does its own checks, every component still has to be started (pod creation etc.), which consumes time and resources.

Is my understanding kind of correct? ;-)
Thanks

@1025KB
Collaborator

1025KB commented Jun 9, 2019

Each component has a driver and an executor; the metadata part is handled in the driver, and if the cache hits, the executor is skipped. For Kubeflow, we don't have the driver fully supported yet.
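
Roughly, the driver/executor split described above can be pictured like this. This is pure pseudocode: none of the function or attribute names below come from the TFX codebase, they only illustrate the control flow.

```python
# Pseudocode sketch of the driver -> executor -> publisher flow.
# All names here (run_component, find_matching_execution, publish_execution,
# exec_properties, ...) are placeholders, not actual TFX APIs.
def run_component(component, input_artifacts, metadata_store):
    # Driver: look for a previous execution of this component whose recorded
    # inputs and execution properties match the current ones.
    cached = metadata_store.find_matching_execution(
        component_id=component.id,
        input_artifacts=input_artifacts,
        exec_properties=component.exec_properties,
    )
    if cached is not None:
        # Cache hit: reuse the previously published outputs and skip the executor.
        return cached.output_artifacts

    # Cache miss: run the executor, which does the actual work.
    output_artifacts = component.executor.Do(input_artifacts, component.exec_properties)

    # Publisher: record the execution and its outputs so a later run can hit the cache.
    metadata_store.publish_execution(component.id, input_artifacts, output_artifacts)
    return output_artifacts
```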

@zhitaoli
Contributor

This is our plan for addressing this issue:

  1. To support another upcoming orchestrator (local), we are going to refactor the driver/executor/publisher interaction in Airflow so that their invocation sequence does not require Airflow-based operators;
  2. The container in TFX/Kubeflow will become "fatter" so that it also runs the driver and publisher; it will use a database connection string to reach the metadata store;
  3. In the long run, we will make sure that a) metadata can also be reached through an RPC address instead of a DB connection string, and b) we will use separate sidecar containers for the driver and publisher (to allow "bring your own image"); see the sketch of the two connection options below.
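
For reference, the two ways of reaching the metadata store mentioned in 2) and 3a) roughly correspond to the following two MLMD client configurations. This is only a sketch; the hostnames, ports and database names are placeholders.

```python
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Option A: direct database connection (the "DB connection string" case).
# Host, database and user below are placeholders.
db_config = metadata_store_pb2.ConnectionConfig()
db_config.mysql.host = "mysql.kubeflow.svc.cluster.local"
db_config.mysql.port = 3306
db_config.mysql.database = "metadb"
db_config.mysql.user = "root"
store = metadata_store.MetadataStore(db_config)

# Option B: gRPC address of a metadata service (the "RPC address" case).
grpc_config = metadata_store_pb2.MetadataStoreClientConfig()
grpc_config.host = "metadata-grpc-service.kubeflow"
grpc_config.port = 8080
store = metadata_store.MetadataStore(grpc_config)
```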

cc @ruoyu90 @neuromage @rcrowe-google

@rummens
Contributor Author

rummens commented Jun 11, 2019

Awesome, thanks for the update. Is there any ETA you can provide?
Also, if there is anything I can do to help, don’t hesitate to ask ;-)

@neuromage

Since 0.14.0, TFX on Kubeflow Pipelines uses the same metadata-driven orchestration as the Airflow version, so caching should now work. I'll close this issue given that this is now done.

One missing piece is ensuring the container version is recorded as part of the execution metadata, so that caching will take this into account. We're working on adding that.
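
As a rough illustration of what recording the container version could look like at the MLMD level, the sketch below attaches an image tag to an execution as a custom property. The execution type and the property name container_image are assumptions for illustration, not necessarily what TFX records.

```python
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Local SQLite-backed store, just for demonstration.
config = metadata_store_pb2.ConnectionConfig()
config.sqlite.filename_uri = "/tmp/mlmd.sqlite"
config.sqlite.connection_mode = metadata_store_pb2.SqliteMetadataSourceConfig.READWRITE_OPENCREATE
store = metadata_store.MetadataStore(config)

# Illustrative execution type with a custom property for the container image.
trainer_run_type = metadata_store_pb2.ExecutionType()
trainer_run_type.name = "Trainer.run"
trainer_run_type.properties["container_image"] = metadata_store_pb2.STRING
type_id = store.put_execution_type(trainer_run_type)

execution = metadata_store_pb2.Execution()
execution.type_id = type_id
# Recording the exact image tag/digest lets the cache distinguish runs made
# with different container versions.
execution.properties["container_image"].string_value = "tensorflow/tfx:0.14.0"
[execution_id] = store.put_executions([execution])
```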
