Metadata not working in Kubeflow Pipelines? #216
Comments
@rummens This is a current limitation of how metadata is integrated in Kubeflow Pipelines. We're actively working on improving the metadata support so it's on par with the Airflow version. Currently, we record output artifacts, but we don't actually support caching using the recorded metadata. This will change very soon though. We're expecting to have metadata + caching mechanisms converge in both the Airflow and KFP versions of TFX pipelines over the next couple of months.
Thank you for the update. Would you be so kind as to explain to me how metadata works at a high level? Or is there something else on top of the whole pipeline? Because if every component does its own checks, every component has to be started (pod creation etc.), which consumes time and resources. Is my understanding roughly correct? ;-)
Each component has a driver and an executor; the metadata handling happens in the driver, and on a cache hit the executor is skipped. For Kubeflow, we don't have the driver fully supported yet.
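For illustration, a rough sketch of that driver/executor flow in Python. This is not the actual TFX driver code; the helpers `lookup_cached_outputs`, `publish_execution`, and `run_executor` are hypothetical placeholders for the MLMD bookkeeping and executor invocation.

```python
# Rough sketch of the driver/executor split described above -- not the actual
# TFX driver implementation.

def lookup_cached_outputs(store, component, inputs, exec_properties):
    """Hypothetical: return previously recorded outputs for an identical run, else None."""
    return None  # placeholder


def publish_execution(store, component, inputs, outputs, exec_properties):
    """Hypothetical: record the execution and its output artifacts in MLMD."""


def run_executor(component, inputs, exec_properties):
    """Hypothetical: invoke the component's executor."""
    return {}


def run_component(metadata_store, component, inputs, exec_properties):
    # Driver: check MLMD for a previous execution of this component with
    # identical inputs and execution properties.
    cached_outputs = lookup_cached_outputs(
        metadata_store, component, inputs, exec_properties)
    if cached_outputs is not None:
        # Cache hit: reuse the recorded output artifacts and skip the executor.
        return cached_outputs

    # Cache miss: run the executor, then record the execution and its outputs
    # in MLMD so future runs can reuse them.
    outputs = run_executor(component, inputs, exec_properties)
    publish_execution(metadata_store, component, inputs, outputs, exec_properties)
    return outputs
```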
This is our plan for addressing this issue:
Awesome, thanks for the update. Is there any ETA you can provide?
Since 0.14.0, TFX on Kubeflow Pipelines uses the same metadata-driven orchestration as the Airflow version, so caching should now work. I'll close this issue given that this is now done. One missing piece is ensuring the container version is recorded as part of the execution metadata, so that caching will take this into account. We're working on adding that. |
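For reference, a minimal sketch of what enabling caching looks like on the pipeline side. This assumes TFX around 0.14; exact module paths and arguments may differ between versions, and the components list is left empty here for brevity.

```python
# Minimal sketch (TFX ~0.14; module paths/arguments may differ by version).
from tfx.orchestration import pipeline
from tfx.orchestration.kubeflow import kubeflow_dag_runner


def create_pipeline(components):
    return pipeline.Pipeline(
        pipeline_name='taxi_simple',
        pipeline_root='/mnt/pipeline-root',  # e.g. a directory on the shared PV
        components=components,
        # With enable_cache=True the drivers consult MLMD and reuse recorded
        # outputs for components whose inputs and properties are unchanged.
        enable_cache=True,
    )


if __name__ == '__main__':
    # components would be the usual taxi example components (ExampleGen, ...);
    # left empty here to keep the sketch short.
    kubeflow_dag_runner.KubeflowDagRunner().run(create_pipeline(components=[]))
```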
Problem
I have deployed the taxi_simple example pipeline in KFP running on K8s outside of GCP. The artefacts are shared using a PV because of the current Beam limitation for S3.
I have noticed that the metadata features are not working. For example, every time I rerun the same pipeline (no code or data changes), every component is executed again and recomputes everything. I assumed that metadata would recognize that the pipeline already ran and skip it (reusing the outputs of the "old" run).
Metadata is integrated into Kubeflow Pipelines (kubeflow/pipelines#884).
Ideas
I have the following ideas:
Can someone explain to me how metadata works at a high level? Who checks whether there was a previous run? I assume each component checks for itself, or is there another "metadata" component that runs these checks?
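For reference, this is how I would try to inspect what (if anything) gets recorded, using the ml-metadata client. The SQLite path below is made up for the sketch; a KFP deployment normally talks to its MySQL-backed metadata service instead, so the connection config would differ.

```python
# Sketch: inspect what the pipeline recorded in ML Metadata (MLMD).
# Assumes a SQLite-backed metadata store; the path is hypothetical.
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

config = metadata_store_pb2.ConnectionConfig()
config.sqlite.filename_uri = '/mnt/metadata/metadata.db'  # hypothetical path
config.sqlite.connection_mode = (
    metadata_store_pb2.SqliteMetadataSourceConfig.READWRITE_OPENCREATE)

store = metadata_store.MetadataStore(config)

# Each component run should appear as an execution; its inputs and outputs
# are recorded as artifacts.
for execution in store.get_executions():
    print(execution.id, execution.properties)
for artifact in store.get_artifacts():
    print(artifact.id, artifact.uri)
```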
Thank you very much for your support!