[Feature] Supports parameterized S3Artifactory for Pipeline and ContainerOp in kfp package #1064
Conversation
Hi @eterna2. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@eterna2, I am interested in your use case here. Are you going to export your own data from the container? Or is this change only for metadata and metrics? Have you checked that argo supports configuring the default artifact repo in a configmap? What happens if we allow users to change the config in the KFP cluster? What else is missing from your perspective?
Unfortunately, the hard-coded values in the pipeline.yaml override that config. Hard-coding strikes again...
That would be great.
Do you need different configs for different ops? We're gearing more towards having a configurable default storage solution which will be used by all ops (and probably all pipelines). Maybe we should move this artifact config to the compiler arguments?
Or, like @hongye-sun mentioned, moving the artifact config application to the backend. What do you think?
For now only metadata and metrics, but ideally it would be for all the artifacts in the pipeline (I hope the dsl for persistent volume gets through). I am supporting different teams which may be using the same pipeline. However, I want to separate all artifacts and side effects for different teams into different buckets. So I want to parameterize the artifact config at run time.
Realistically, I don't need different configs for different ops, but depending on the use case I might want to do it at the pipeline or containerop level. For compiler args, I am thinking why not leave the fields blank by default (hence we can use the configmap for the workflow controller). Pipeline-level config is needed just because I want to parameterize it so I can set where to save at run time to support different teams. ContainerOp-level is more for running a single component (there are some use cases where we just want to submit a single job). I will probably be building some UI over (or customizing) the default kf UI so some of these artifact settings are not exposed (filled automatically via the individual jwt cookie).
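As a rough sketch of the per-op case described here (assuming this PR exposes dsl.ArtifactLocation and an artifact_location argument on ContainerOp; the image, bucket, and secret names below are placeholders):

```python
from kubernetes.client.models import V1SecretKeySelector
from kfp import dsl

# A sketch, not the confirmed API: a per-ContainerOp artifact location,
# pointing this op's outputs at a team-specific bucket.
op = dsl.ContainerOp(
    name="single-job",
    image="some/image",  # placeholder image
    artifact_location=dsl.ArtifactLocation.s3(
        bucket="team-a-bucket",  # placeholder bucket
        endpoint="minio-service.kubeflow:9000",
        insecure=True,
        access_key_secret=V1SecretKeySelector(
            name="mlpipeline-minio-artifact", key="accesskey"),
        secret_key_secret=V1SecretKeySelector(
            name="mlpipeline-minio-artifact", key="secretkey"),
    ),
)
```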
Yes. I'm in favor of that.
Just to add more context and my wishlist (actually what I wanted to do after this PR): Currently, I am using kf pipelines to run simulations. But because there is currently no way to declare the output artifacts, the logs and results are separately saved to some bucket by some scripts in the container. This creates unnecessary complexity as I need to maintain a config for the s3 client inside the container. Probably solved by #998, and #801 in the future (for downstream processing of the simulation results, which can be very big). I also hope there is an easier way to declare ui-metadata and integrate with the front-end. Currently, I have to write scripts in my container to generate these JSON files. Preferably, I hope I can make it more declarative in ContainerOp rather than in the lightweight python components or inside the container - it mixes the business logic with the operation code. Something like this, where additional scripts are added to convert files to metadata or metrics:

```python
ContainerOp(..., output_artifacts={
    '/results/abc.csv': TableView(name='foo', columns=[...]),
    '/metrics/f1.txt': Metrics('f1', ...),
    '/logs/*': Tarball('logs'),
})
```
cc @neuromage
@eterna2, thanks for sharing your thoughts here. We also have some internal discussion about artifacts and metadata. Argo's artifact support is great for archiving purposes. However, it's not so great for data passing and realtime debugging/visualization. The way it is implemented is inefficient for passing large data between steps; we recommend using a volume to pass large data. For data visualization, an Argo artifact only becomes available once the step is finished. In a data processing pipeline, a single step can take a long time to finish. We'd like the visualization to be in realtime, so that users can see the data earlier and decide whether to stop or continue the execution. Another aspect is that artifacts are not always produced from a file/folder inside a container. They can be produced by external jobs; an artifact can be an S3 file, an HDFS directory, or a SQL table, etc. I don't have an answer yet, but with all the above requirements in mind, I am leaning toward letting container code write the artifact to remote storage (GCS, S3, NFS, BigQuery table, etc.) and report the location and extra metadata to KFP via an API or the DSL definition. If we go with this approach, we will probably follow what the tensorflow filesystem SDK does to abstract the file IO API from the underlying storage layer. We might continue to support the artifact feature as an option if users don't care about realtime visualization and just want to use native file IO APIs. WDYT?
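For illustration, this is the style of storage-agnostic file IO the TensorFlow filesystem layer offers (a sketch; the bucket paths are placeholders, and S3/HDFS support depends on the TensorFlow build):

```python
import tensorflow as tf

# tf.io.gfile dispatches on the URI scheme, so the same code can write to
# a local path, GCS, S3, or HDFS without a storage-specific client.
for path in ['/tmp/metrics.json', 'gs://my-bucket/metrics.json']:
    with tf.io.gfile.GFile(path, 'w') as f:
        f.write('{"f1": 0.87}')
```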
I totally agree with the requirements - in fact I needed them yesterday. Especially a button to kill the job: we are currently looking at the streaming log to see how the simulation is running, and I have to manually kill the pod if it is going south.
Yup. This is probably a better way. My main concern was mostly about having 2 different places to specify where and how to upload the artifacts. I have some initial thoughts though. Could we orchestrate the IO with a configmap volume mount? Mostly because I just want to have a single version of the docker image without embedding IO code inside. (Actually, it is to use existing images without building another one - easier to convince my data scientists to transition to kf.) We probably want to use the kfp sdk to generate this configmap to export the artifacts? As for streaming vis, this is something we really want. We are doing a combination of geospatial and resource-allocation simulation. Originally, I was thinking of just adding a sidecar with either a grpc server or just a cron job to read from some log output and serve as a proxy, then building a dashboard pointed at the sidecar proxy.
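A rough sketch of that sidecar idea (assuming kfp's dsl.Sidecar and ContainerOp.add_sidecar; the images and the simple HTTP log server are placeholders, not a real dashboard):

```python
from kfp import dsl

@dsl.pipeline(name='sim', description='simulation with a log-serving sidecar')
def sim_pipeline():
    sim = dsl.ContainerOp(name='simulation', image='my-sim:latest')  # placeholder image
    # The sidecar shares the pod (and can share a volume) with the main
    # container, so it can serve the streaming logs while the step runs.
    sim.add_sidecar(dsl.Sidecar(
        name='log-proxy',
        image='python:3.7',
        command=['python', '-m', 'http.server', '8080'],
    ))
```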
The terminate run button was recently added in #1039. It should be available in a recent KFP release. Could you elaborate a little bit on the configmap volume mount? I don't understand how it works. Have you considered having an NFS volume mounted in both the task pod and the UI pod to solve the realtime viz problem? You could also use the same volume to pass data between steps.
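For the NFS suggestion, a sketch of mounting a pre-provisioned shared PVC into a task (the claim name 'shared-nfs' and the image are placeholders; assumes the ContainerOp volume helpers of the kfp SDK of that era):

```python
from kubernetes import client as k8s_client
from kfp import dsl

op = dsl.ContainerOp(name='train', image='my-image')  # placeholder image
# Attach an existing NFS-backed PersistentVolumeClaim to the task pod...
op.add_volume(k8s_client.V1Volume(
    name='shared',
    persistent_volume_claim=k8s_client.V1PersistentVolumeClaimVolumeSource(
        claim_name='shared-nfs')))
# ...and mount it where the step writes its intermediate data.
op.add_volume_mount(k8s_client.V1VolumeMount(
    name='shared', mount_path='/mnt/shared'))
```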
The important note is that volume-based artifact passing can still be used even when Argo-style artifacts are specified. The Argo pipeline can just specify that component1 has an artifact output (data grabbed from a local path) and component2 has an artifact input (data placed into a local path), and the data should be passed between them. Removing the explicit storage configuration (but not the artifact definitions or local paths) from the shared pipeline.yaml files is the first step toward storage system flexibility. It even becomes possible to switch the storage solution (volume-based vs. GCS vs. Minio) right before launching the run.
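One way to picture that first step: post-process a compiled workflow to drop the hard-coded storage details while keeping the artifact names and local paths, so the controller's configmap default applies. This is an illustrative sketch, not existing KFP functionality:

```python
import yaml  # PyYAML

with open('pipeline.yaml') as f:
    workflow = yaml.safe_load(f)

# Remove the explicit s3 storage config from each output artifact;
# the artifact definitions and local paths stay intact.
for template in workflow['spec']['templates']:
    for artifact in template.get('outputs', {}).get('artifacts', []):
        artifact.pop('s3', None)

with open('pipeline-no-storage.yaml', 'w') as f:
    yaml.safe_dump(workflow, f)
```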
The motivation is to be able to create ContainerOp with existing images (pre-kubeflow and pre-argo era), instead of building a new image with the IO code required to handle the artifacts - I want some way to execute this IO code without needing to build a new image. OK, I just figured out I don't need a configmap anymore - I can do it with an exit handler instead. But does this mean I would need to use an NFS?

```python
import kfp.components as comp
from kfp.dsl import ContainerOp, ExitHandler

def io_handler():
    """Custom handling of artifacts."""
    import json
    with open("/metrics.json", "w") as f:
        json.dump({}, f)  # json.dump takes (obj, fp)

# func_to_container_op returns a factory; call it to get an op instance.
io_op = comp.func_to_container_op(io_handler)

# Inside a @dsl.pipeline function:
with ExitHandler(exit_op=io_op()):
    op = ContainerOp(name="main", image="some_image_wo_io_codes")
```

Never thought of that, but I think this is a good idea. Might need to try it out - just worried about the write speed; AWS EFS is notoriously slow.
@Ark-kun Empty
@eterna2 Please remove the stray temporary file and I guess we're done. I still do not think that per-op artifact configuration is a good idea, but I won't hold this PR any more. I'd like to work with you in the future to switch to the configMap-based per-run configuration so that nothing gets hard-coded in the pipeline package.
@gaoning777 @hongye-sun Do you see any remaining problems?
/lgtm
/approve
@eterna2, thanks for the PR.
```python
@dsl.pipeline(name='foo', description='hello world')
def foo_pipeline(tag: str, namespace: str = "kubeflow", bucket: str = "foobar"):
```
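The quoted lines are truncated in this view. A hedged reconstruction of how such a sample might continue, using the dsl.ArtifactLocation and PipelineConf API this PR introduces (the secret names mirror KFP's minio defaults and are assumptions here):

```python
from kubernetes.client.models import V1SecretKeySelector
from kfp import dsl

@dsl.pipeline(name='foo', description='hello world')
def foo_pipeline(tag: str, namespace: str = "kubeflow", bucket: str = "foobar"):
    # Parameterize the artifact location for every op in this pipeline
    # from the pipeline arguments, instead of hard-coding it at compile time.
    location = dsl.ArtifactLocation.s3(
        bucket=bucket,
        endpoint="minio-service.%s:9000" % namespace,
        insecure=True,
        access_key_secret=V1SecretKeySelector(
            name="mlpipeline-minio-artifact", key="accesskey"),
        secret_key_secret=V1SecretKeySelector(
            name="mlpipeline-minio-artifact", key="secretkey"))
    dsl.get_pipeline_conf().set_artifact_location(location)

    op = dsl.ContainerOp(name="echo", image="busybox:%s" % tag,
                         command=["echo", "hello world"])
```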
Do you mind adding a sample similar to this one in the basic sample folder in another PR? Thanks.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: hongye-sun. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/hold Waiting for #1331 release to be merged
/hold cancel
Motivation
I am running a kubeflow pipeline deployment with my custom helm chart and a minio s3 gateway to my custom bucket. This bucket has a different name from the default one in kfp, hence I need some way to parameterize the s3 artifact configs.

Status

Features

TLDR changes
- argo-models is now a dependency in setup.py (argo v2.2.1)
- ArtifactLocation added, and PipelineConf extended to support artifact location (using the argo-models definitions)
- ArtifactLocation can also be set per ContainerOp
- kfp.aws fixed (found that it has a bug, and was not imported into the unit test)