
[Feature] Supports parameterized S3Artifactory for Pipeline and ContainerOp in kfp package #1064

Merged

Conversation

eterna2
Contributor

@eterna2 eterna2 commented Mar 29, 2019

Motivation

I am running a Kubeflow Pipelines deployment with my custom Helm chart and a MinIO S3 gateway to my custom bucket. This bucket has a different name from the default one in kfp, so I need some way to parameterize the S3 artifact configs.

Status

  • Waiting for Review

Features

  • kfp can now declare a custom artifact location inside a pipeline or containerop.
from kfp import dsl
from kubernetes.client.models import V1SecretKeySelector


@dsl.pipeline(name='foo', description='hello world')
def foo_pipeline(namespace: str):

    # configures artifact location
    artifact_location = dsl.ArtifactLocation.s3(
                            bucket="foobar",
                            endpoint="minio-service.%s:9000" % namespace,  # parameterized namespace
                            insecure=True,
                            access_key_secret=V1SecretKeySelector(name="minio", key="accesskey"),
                            secret_key_secret={"name": "minio", "key": "secretkey"}  # accepts dict also
    )

    # set pipeline level artifact location
    conf = dsl.get_pipeline_conf().set_artifact_location(artifact_location)
    
    # use pipeline level artifact location (i.e. minio-service)
    op1 = dsl.ContainerOp(name='foo', image='bash:latest')

    # use containerop level artifact location (i.e. aws)
    op2 = dsl.ContainerOp(
                        name='foo', 
                        image='bash:latest',
                        # configures artifact location
                        artifact_location=dsl.ArtifactLocation.s3(
                            bucket="foobar",
                            endpoint="s3.amazonaws.com",
                            insecure=False,
                            access_key_secret=V1SecretKeySelector(name="s3-secret", key="accesskey"),
                            secret_key_secret=V1SecretKeySelector(name="s3-secret", key="secretkey"))
    )
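
For reference, a minimal sketch of compiling such a pipeline (the output file name is illustrative):

import kfp.compiler as compiler

# compile the pipeline into an Argo workflow package;
# the artifact locations declared above are embedded in the generated YAML
compiler.Compiler().compile(foo_pipeline, 'foo_pipeline.yaml')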

TL;DR changes

  • argo-models is now a dependency in setup.py (argo v2.2.1)
  • Added static class ArtifactLocation
    • to help generate the artifact location for S3
    • to help generate artifacts for workflow templates
  • Updated PipelineConf to support artifact location
  • Updated the k8s helper and related code to support OpenAPI objects (I accidentally used openapi-generator instead of swagger-codegen for argo-models)
  • Added unit tests for ArtifactLocation
  • Fixed the unit test for kfp.aws (found that it had a bug and was not imported into the test suite)


@k8s-ci-robot
Contributor

Hi @eterna2. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


@hongye-sun
Contributor

@eterna2, I am interested in your use case here. Are you going to export your own data from the container? Or is this change only for metadata and metrics?

Have you checked that Argo supports configuring the default artifact repo in a configmap? What happens if we allow users to change the config in the KFP cluster? What else is missing from your perspective?

@Ark-kun
Contributor

Ark-kun commented Mar 29, 2019

Have you checked that Argo supports configuring the default artifact repo in a configmap?

Unfortunately, the hard-coded values in the pipeline.yaml override that config. Hard-coding strikes again...

What happens if we allow users to change the config in the KFP cluster? What else is missing from your perspective?

That would be great.

@Ark-kun
Contributor

Ark-kun commented Mar 29, 2019

I need some way to parameterize the s3 artifact configs.

Do you need different configs for different ops? We're gearing more towards having a configurable default storage solution which will be used by all ops (and probably all pipelines).

Maybe we should move this artifact config to the compiler arguments?

compiler.Compile(pipeline_func, storage_system=ArgoArtifacts(S3Artifactory(...)))

Or, like @hongye-sun mentioned, moving the artifact config application to the backend.

What do you think?

@eterna2
Contributor Author

eterna2 commented Mar 29, 2019

@hongye-sun @Ark-kun

@eterna2, I am interested in your use case here. Are you going to export your own data from the container? Or is this change only for metadata and metrics?

For now only metadata and metrics, but ideally for all the artifacts in the pipeline (I hope the DSL for persistent volumes gets through).

I am supporting different teams which may be using the same pipeline. However, I want to separate all artifacts and side effects for different teams into different buckets, so I want to parameterize the artifact config at run time.

Do you need different configs for different ops? We're gearing more towards having a configurable default storage solution which will be used by all ops (and probably all pipelines).

Maybe we should move this artifact config to the compiler arguments?

Realistically, I don't need different configs for different ops, but depending on the use case I might want to set it at the pipeline or ContainerOp level.

For compiler args, I am thinking: why not leave the fields blank by default (so the workflow controller's configmap is used)?

Pipeline-level config is needed just because I want to parameterize it so I can set where to save at run time, to support different teams.

ContainerOp-level config is more for running a single component (there are some use cases where we just want to submit a single job).

I will probably be building some UI over (or customizing) the default KF UI so that some of these artifact settings are not exposed (they would be filled in automatically via the individual JWT cookie).
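
For illustration, a minimal sketch of that per-team, run-time parameterization (the pipeline, bucket, and secret names here are illustrative):

from kfp import dsl
from kubernetes.client.models import V1SecretKeySelector


@dsl.pipeline(name='per-team', description='per-team artifact bucket')
def per_team_pipeline(namespace: str = 'kubeflow', bucket: str = 'team-a-artifacts'):
    # bucket and namespace are pipeline parameters, so the artifact
    # destination can be chosen per team when the run is submitted
    artifact_location = dsl.ArtifactLocation.s3(
        bucket=bucket,
        endpoint='minio-service.%s:9000' % namespace,
        insecure=True,
        access_key_secret=V1SecretKeySelector(name='minio', key='accesskey'),
        secret_key_secret=V1SecretKeySelector(name='minio', key='secretkey'))

    dsl.get_pipeline_conf().set_artifact_location(artifact_location)

    dsl.ContainerOp(name='simulate', image='bash:latest')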

@Ark-kun
Contributor

Ark-kun commented Mar 29, 2019

I am thinking why not set the fields blank for default

Yes. I'm in favor of that.
That's what Argo has traditionally been doing. https://github.com/argoproj/argo/blob/master/examples/artifact-passing.yaml

@eterna2
Contributor Author

eterna2 commented Mar 30, 2019

Just to add more context and my wishlist (actually what I wanted to do after this PR):

Currently, I am using KF Pipelines to run simulations. But because there is currently no way to declare the output artifacts, the logs and results are separately saved to some bucket by scripts in the container.

This creates unnecessary complexity, as I need to maintain a config for the S3 client inside the container. Probably solved by #998. And #801 in the future (for downstream processing of the simulation results, which can be very big).

I also hope there is an easier way to declare ui-metadata and integrate with the front-end. Currently, I have to write scripts in my container to generate this JSON.

Preferably, I hope I can make it more declarative in ContainerOp rather than in the lightweight Python components or inside the container - otherwise it mixes the business logic with the operational code.

Something like this, where additional scripts are added to convert files to metadata, metrics, etc.:

ContainerOp(..., output_artifacts={
    '/results/abc.csv': TableView(name='foo', columns=[...]),
    '/metrics/f1.txt': Metrics('f1', ...),
    '/logs/*': Tarball('logs'),
})

@hongye-sun
Contributor

cc @neuromage

@eterna2, thanks for sharing your thoughts here. We also have some internal discussions about artifacts and metadata. Argo's artifact support is great for archiving purposes. However, it's not so great for data passing and realtime debugging/visualization. The way it is implemented is inefficient for passing large data between steps; we recommend using a volume to pass large data. For data visualization, an Argo artifact is only available once the step has finished. In a data processing pipeline, a single step can take a long time to finish. We'd like the visualization to be realtime, so that users can see the data earlier and decide whether to stop or continue the execution. Another aspect is that artifacts are not always produced from a file/folder inside a container. They can be produced by external jobs, and an artifact can be an S3 file, an HDFS directory, a SQL table, etc.

I don't have an answer yet, but with all the above requirements in mind, I am leaning toward letting container code write the artifact to a remote storage (GCS, S3, NFS, BigQuery table, etc.) and report the location and extra metadata to KFP via an API or DSL definition. If we go with this approach, we will probably follow what the TensorFlow filesystem SDK does to abstract the file IO API from the underlying storage layer.
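
For illustration, a minimal sketch of the kind of storage-agnostic IO this would give the container code, assuming tf.io.gfile as the abstraction layer (the URI is illustrative):

import tensorflow as tf

# the same file API works for local paths, GCS (gs://), HDFS, etc.;
# the container only needs to be told the destination URI
output_uri = 'gs://my-bucket/runs/123/metrics.json'
with tf.io.gfile.GFile(output_uri, 'w') as f:
    f.write('{}')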

We might continue to support the artifact feature as an option if users don't care about realtime visualization and just want to use the native file IO API.

WDYT?

@eterna2
Contributor Author

eterna2 commented Apr 4, 2019

@hongye-sun

I totally agree with the requirements - in fact, I needed them yesterday, especially a button to kill the job. We are currently looking at the streaming logs to see how the simulation is running, and I have to manually kill the pod if it is going south.

requirements in mind, I am leaning toward letting container code write the artifact to a remote storage (GCS, S3, NFS, BigQuery table, etc.) and report the location and extra metadata to KFP via an API or DSL definition. If we go with this approach, we will probably follow what the TensorFlow filesystem SDK does to abstract the file IO API from the underlying storage layer.

Yup. This is probably a better way. My main concerns were mostly about having two different places to specify where and how to upload the artifacts.

I have some initial thoughts though. Could we orchestrate the IO with a configmap volume mount? Mostly because I just want to have a single version of the Docker image without embedding IO code inside. (Actually, it is to use existing images without building another one - it's easier to convince my data scientists to transition to KF.)

We probably want to use the kfp SDK to generate this configmap to export the artifacts?

As for streaming viz, this is something we really want. We are doing a combination of geospatial and resource-allocation simulation. Originally, I was thinking of just adding a sidecar with either a gRPC server or just a cron job to read from some log output and serve as a proxy.

Then build a dashboard and point to the sidecar proxy.

@hongye-sun
Contributor

@eterna2

The terminate run button was recently added in #1039. It should be available in a recent KFP release.

Could you elaborate a little bit about the configmap volume mount? I don't understand how it works.

Have you considered having an NFS volume mounted in both the task pod and the UI pod to solve the realtime viz problem? You can also use the same volume to pass data between steps.
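
For illustration, a minimal sketch of mounting a shared NFS-backed PVC into a step via the kfp DSL (the claim name and mount path are illustrative):

from kfp import dsl
from kubernetes import client as k8s_client

op = dsl.ContainerOp(name='simulate', image='bash:latest')

# mount a shared NFS-backed PVC so other steps (and potentially the UI pod)
# can read the data while this step is still running
op.add_volume(k8s_client.V1Volume(
    name='shared-data',
    persistent_volume_claim=k8s_client.V1PersistentVolumeClaimVolumeSource(
        claim_name='nfs-pvc')))
op.add_volume_mount(k8s_client.V1VolumeMount(
    name='shared-data', mount_path='/mnt/shared'))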

@Ark-kun
Contributor

Ark-kun commented Apr 4, 2019

Argo's artifact support is great for archiving purposes. However, it's not so great for data passing and realtime debugging/visualization. The way it is implemented is inefficient for passing large data between steps; we recommend using a volume to pass large data.

The important note is that volume-based artifact passing can still be used even when Argo-style artifacts are specified. The Argo pipeline can just specify that component1 has an artifact output (data grabbed from a local path), that component2 has an artifact input (data placed into a local path), and that the data should be passed between them.
In the backend, instead of using Argo's artifact-passing implementation, we can actually use volumes to pass the data around. That's without any changes to the pipeline.yaml.

Removing the explicit storage configuration (but not the artifact definitions or local paths) from the shared pipeline.yaml files is the first step toward storage-system flexibility. It would even be possible to switch the storage solution (volume-based vs. GCS vs. Minio) right before launching the run.

@eterna2
Contributor Author

eterna2 commented Apr 5, 2019

@hongye-sun

Could you elaborate a little bit about the configmap volume mount? I don't understand how it works.

The motivation is to be able to create a ContainerOp with existing images (from the pre-Kubeflow and pre-Argo era), instead of building a new image with the IO code required to handle the artifacts - I want some way to execute this IO code without needing to build a new image.

OK, I just figured out I don't need a configmap anymore. I can do it with an exit handler instead. But does this mean I would need to use NFS?

import kfp.components as comp
from kfp.dsl import ContainerOp, ExitHandler

def io_handler():
  """custom handling of artifacts"""
  import json
  
  with open("/metrics.json", "w") as f:
    json.dump({}, f)

io_op = comp.func_to_container_op(io_handler)()  # call the factory to create the actual op

with ExitHandler(exit_op=io_op):
  op = ContainerOp(name="simulation", image="some_image_wo_io_codes")

Have you considered having an NFS volume mounted in both the task pod and the UI pod to solve the realtime viz problem? You can also use the same volume to pass data between steps.

Never thought of that. But I think this is a good idea. Might need to try it out - I'm just worried about the write speed; AWS EFS is notoriously slow.

@eterna2 eterna2 force-pushed the eterna2/parameterized-artifact-builder branch from 1941469 to f791406 Compare April 5, 2019 12:30
@eterna2 eterna2 force-pushed the eterna2/parameterized-artifact-builder branch from bcdc841 to 470b49d Compare May 14, 2019 02:58
@eterna2 eterna2 force-pushed the eterna2/parameterized-artifact-builder branch from 470b49d to 80cda79 Compare May 14, 2019 03:04
@eterna2
Contributor Author

eterna2 commented May 14, 2019

@Ark-kun
Rebased onto the latest HEAD, fixed merge conflicts, and removed the PendingDeprecation warning.

Empty artifact_location will now return an empty artifact config.

Review thread on sdk/python/pipeline.yaml (outdated, resolved)
@Ark-kun
Contributor

Ark-kun commented May 14, 2019

@eterna2 Please remove the stray temporary file and I guess we're done.
Your PR is very nice and clean now.

I still do not think that per-op artifact configuration is a good idea, but I won't hold this PR any more.

I'd like to work with you in future to switch to the configMap-based per-run configuration so that nothing gets hard-coded in the pipeline package.

@Ark-kun
Contributor

Ark-kun commented May 14, 2019

@gaoning777 @hongye-sun Do you see any remaining problems?

@Ark-kun
Contributor

Ark-kun commented May 14, 2019

/lgtm
/ok-to-test

@hongye-sun hongye-sun left a comment

/approve

@eterna2, thanks for the PR.



@dsl.pipeline(name='foo', description='hello world')
def foo_pipeline(tag: str, namespace: str = "kubeflow", bucket: str = "foobar"):

Do you mind adding a sample similar to this one in the basic sample folder in another PR? Thanks.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hongye-sun

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@Ark-kun
Contributor

Ark-kun commented May 14, 2019

/hold

Waiting for #1331 release to be merged

@Ark-kun
Contributor

Ark-kun commented May 14, 2019

/hold cancel
