-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CRD for scheduling Argo workflows (API spec) #2
Conversation
We named this CRD a "schedule": It schedules workflows for now but we will add support for other K8 resources (e.g., jobs). It supports: - executing pipelines using a cron schedule - executing pipelines using a periodic interval - An optional start date and an end date. - Backfilling (by specifying a start date in the past). - A maximum number of concurrent workflows.
/cc @jessesuen |
@jlewi: GitHub didn't allow me to request PR reviews from the following users: jessesuen. Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Review status: 0 of 33 files reviewed at latest revision, all discussions resolved. pkg/apis/schedule/register.go, line 18 at r1 (raw file):
"Schedule" seems like an incomplete name? Why not WorkflowSchedule since you use Workflow in other places (e.g. the Type). pkg/apis/schedule/v1alpha1/types.go, line 27 at r1 (raw file):
I believe phases have been dropped in favor of conditions. pkg/apis/schedule/v1alpha1/types.go, line 62 at r1 (raw file):
What's meant by human readable? Does this mean its not used for anything? pkg/apis/schedule/v1alpha1/types.go, line 65 at r1 (raw file):
Why would you enable/disable the schedule as opposed to just deleting it to disable it and creating it to enable it? pkg/apis/schedule/v1alpha1/types.go, line 79 at r1 (raw file):
Why is there a limit on history? Whose keeping track of history? pkg/apis/schedule/v1alpha1/types.go, line 97 at r1 (raw file):
Why are you exposing Index? Why would operation/parameterization of a workflow depend on this? pkg/apis/schedule/v1alpha1/types.go, line 124 at r1 (raw file):
Do you really need PeriodicSchedule? Is CronSchedule not sufficient? pkg/apis/schedule/v1alpha1/types.go, line 134 at r1 (raw file):
no end time pkg/apis/schedule/v1alpha1/types.go, line 207 at r1 (raw file):
What is this for? pkg/apis/schedule/v1alpha1/types.go, line 213 at r1 (raw file):
Why do you have phase? Why not just conditions as noted above? Comments from Reviewable |
Review status: 0 of 45 files reviewed at latest revision, 10 unresolved discussions, all commit checks successful. pkg/apis/schedule/register.go, line 18 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
Done. Renamed to ScheduledWorkflow as we discussed. pkg/apis/schedule/v1alpha1/types.go, line 27 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
Done. Switched to using conditions. pkg/apis/schedule/v1alpha1/types.go, line 62 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
Removing it. I don't think it belongs in a CRD spec. pkg/apis/schedule/v1alpha1/types.go, line 65 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
This is needed for the following use case:
Here is another example:
This is similar to the backfill functionality of Airflow. pkg/apis/schedule/v1alpha1/types.go, line 79 at r1 (raw file):
This is very convenient when using kubectl so that the user does not have to query the list of workflows separately with the correct labels. This is similar to the way the Argo resource returns the graph of PODs so that users can easily investigate the ones that failed. We need to limit the size of the history returned by the "kubectl get" call since ScheduledWorkflows may run forever In the future, we may also decide to garbage collect the workflows that are completed, and only keep the X most recent ones in the K8 key value store. This is what the CronJob does: https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/batch/types.go#L276 pkg/apis/schedule/v1alpha1/types.go, line 97 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
This is one feature explicitly requested by users working on ML workflows. They want a value, substituted in the parameters of the workflow, that is unique, short, and indicating the order in which the workflows executed. A unique time can be very verbose: 20180504121645 A UUID is way too long. This index (suffixed to the workflow name in the Argo workflow spec), is ideal for these users. They use it to generate unique directories to store data in GCS for instance. pkg/apis/schedule/v1alpha1/types.go, line 124 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
We only provided Cron at first, but discovered an issue. Cron does not let you schedule a job, for instance, every 5 hours. It can handle every 6 hours: 0, 6, 12, 18, 0, 6, 12 But if you try for every 5 hours, you end up with: 0, 5, 10, 15, 20, <--- NOT 5 HOURS --> 0, 5, 10 This may be why Airflow supports both Cron and an interval as well: https://airflow.apache.org/scheduler.html#scheduling-triggers pkg/apis/schedule/v1alpha1/types.go, line 134 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
Fixed. pkg/apis/schedule/v1alpha1/types.go, line 207 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
These are fields copied from the Argo status for convenience (and so that we can have a fancy looking CLI like Argo). The ScheduledWorkflowStatus provides a WorkflowHistory object. The WorkflowHistory object has a list of WorkflowStatuses for active Argo Workflows. The Workflow Status contains fields copied directly from the Argo status (except for the scheduledTime). pkg/apis/schedule/v1alpha1/types.go, line 213 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
The field is copied from Argo so we cannot change the type. It is very convenient for the user to have "kubectl get ScheduledWorkflow" return a list of recently executed workflows and their status. I changed to a Condition above for our own CRD. Comments from Reviewable |
Review status: 0 of 45 files reviewed at latest revision, 4 unresolved discussions. pkg/apis/schedule/v1alpha1/types.go, line 27 at r1 (raw file): Previously, vicaire (Pascal Vicaire) wrote…
Done. pkg/apis/schedule/v1alpha1/types.go, line 62 at r1 (raw file): Previously, vicaire (Pascal Vicaire) wrote…
Done. pkg/apis/schedule/v1alpha1/types.go, line 65 at r1 (raw file): Previously, vicaire (Pascal Vicaire) wrote…
But I thought the whole point of the CRD and the database was that you are persisting actual job runs to a database. So when you recreate it you should check the database for existing runs. I think this goes to the point about the CRD being edge triggered vs. level triggered Suppose a ScheduledWorkflow has a schedule of If you're level triggered then on May 10 we know we want the following Workflows to exist It doesn't matter who/when/or how those runs were created. If any of them are missing we know that we should go and start them. Thus, it doesn't matter whether the schedule was deleted and then recreated. I think your implementation is currently edge triggered. You are relying on observing transitions from one time to another. I think this is going to be more brittle. If you with a level design then I don't think you need the on/off bit. pkg/apis/schedule/v1alpha1/types.go, line 97 at r1 (raw file): Previously, vicaire (Pascal Vicaire) wrote…
Why not use a random number? Or it least make it look random? If you return a monotonically increasing number users will likely start to make inferences about what that index means and that will limit your implementation. e.g. users will likely make assumptions like the numbers should be continuous and that they should match the trigger times of the runs. Comments from Reviewable |
Review status: 0 of 45 files reviewed at latest revision, 4 unresolved discussions. pkg/apis/schedule/v1alpha1/types.go, line 65 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
We want this implementation to be independent of the DB. Additionally, we need it to scale. The user could have a ScheduledWorkflow running for 3 years for every hour. As time progresses, verifying the complete list would become less and less efficient and slow down the controller (particularly if many instances of ScheduledWorkflow are running). Finally, the workflow resources may be garbage collected (they should be). If the ScheduledWorkflow runs for 3 years, I don't think the correct functioning of this CRD should require that all the workflow objects be preserved in the K8 key-value store. This is why we implemented this in a way similar to CronJob, keeping track only of scheduledTime of the latest workflow. We can implement something to address your use case later on, but that does not seem to fit this CRD very well. I would see it more as a CRD that executes workflows based on a list of possible values for the parameters of the workflow, where these values are times. The number of items in the list would have to be limited. pkg/apis/schedule/v1alpha1/types.go, line 97 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
Returning a monotonically increasing number is here the requirement. Say the user creates a unique directory for each run of the workflow. They want the directories to be:
We can easily add an substitution for a predictable random looking number (based on a hash). But we do need the monotically increasing, guaranteed to be unique index. Comments from Reviewable |
Review status: 0 of 45 files reviewed at latest revision, 4 unresolved discussions. pkg/apis/schedule/v1alpha1/types.go, line 65 at r1 (raw file): Previously, vicaire (Pascal Vicaire) wrote…
As discussed offline the real issue here is making backfilling reliable. Right now we update last scheduled time as soon as a new workflow is started. But we shouldn't update it until that workflow completed (successfully or failed) so that controller can ensure work was actually completed. As discussed offline please file an issue address this in a future PR. pkg/apis/schedule/v1alpha1/types.go, line 97 at r1 (raw file): Previously, vicaire (Pascal Vicaire) wrote…
SGTM Comments from Reviewable |
/lgtm Please file an issue to track making backfilling reliable. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jlewi The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Issue filed to make backfills reliable: |
* add metric sample code in roc component * Change mlpipeline artifacts output path to tmp folder * Revert /tmp path change and add output_dir in notebook sample * Revert /tmp path #2
* add metric sample code in roc component * Change mlpipeline artifacts output path to tmp folder * Revert /tmp path change and add output_dir in notebook sample * Revert /tmp path kubeflow#2
* initial watson pipeline * updates * updates * Update creds * updated dockerhub base files * update watson pipeline (kubeflow#2) Removing references to wml-base image, and recreating it with python-slim as base image * address review comment (kubeflow#3) * change s3 to cos (kubeflow#4)
# This is the 1st commit message: Add initial scripts # This is the commit message kubeflow#2: Add working pytest script # This is the commit message kubeflow#3: Add initial scripts # This is the commit message kubeflow#4: Add environment variable files # This is the commit message kubeflow#5: Remove old cluster script
# This is the 1st commit message: Add initial scripts # This is the commit message kubeflow#2: Add working pytest script # This is the commit message kubeflow#3: Add initial scripts # This is the commit message kubeflow#4: Add environment variable files # This is the commit message kubeflow#5: Remove old cluster script
* # This is a combination of 5 commits. # This is the 1st commit message: Add initial scripts # This is the commit message #2: Add working pytest script # This is the commit message #3: Add initial scripts # This is the commit message #4: Add environment variable files # This is the commit message #5: Remove old cluster script * Add initial scripts Add working pytest script Add initial scripts Add environment variable files Remove old cluster script Update pipeline credentials to OIDC Add initial scripts Add working pytest script Add initial scripts Add working pytest script * Remove debugging mark * Update example EKS cluster name * Remove quiet from Docker build * Manually pass env * Update env list vars as string * Update use array directly * Update variable array to export * Update to using read for splitting * Move to helper script * Update export from CodeBuild * Add wait for minio * Update kubectl wait timeout * Update minor changes for PR * Update integration test buildspec to quiet build * Add region to delete EKS * Add wait for pods * Updated README * Add fixed interval wait * Fix CodeBuild step order * Add file lock for experiment ID * Fix missing pytest parameter * Update run create only once * Add filelock to conda env * Update experiment name ensuring creation each time * Add try/catch with create experiment * Remove caching from KFP deployment * Remove disable KFP caching * Move .gitignore changes to inside component * Add blank line to default .gitignore
# This is the 1st commit message: parent 1551e34 author Dustin Luong <dusluong@amazon.com> 1593625846 -0700 committer Dustin Luong <dusluong@amazon.com> 1595961236 -0700 Unit test passes Updated unit tests, working on integration tests Updated READMEs, Logged rules status and errors to UI squash a94f809 Unfinished sample pipeline for debugger squash 8128572 Edge case: empty rules config Edge case: empty rules config Cleaned up example pipeline and fixed empty case of debug rules Finished integration test, debug rule logs only print after training has completed, refactored various code Unhardcoded rules registry New sample pipeline, minor changes to utils Refactored wait_for_debug_rules, added unit tests, updated readme for debugger demo, fixed typos and small errors rm .gz Changed defaults for train.template.yaml, updated example pipeline, removed exceptions from utils which are handled by boto3 removed trust.json minor clean ups Minor fixes Refactored code to incorporate changes from design review, notably removing collection_config # This is the commit message kubeflow#2: custom_rules.py # This is the commit message kubeflow#3: Added tensorboard # This is the commit message kubeflow#4: restored run_integration_tests # This is the commit message kubeflow#5: removed custom rule
* # This is a combination of 5 commits. # This is the 1st commit message: Add initial scripts # This is the commit message kubeflow#2: Add working pytest script # This is the commit message kubeflow#3: Add initial scripts # This is the commit message kubeflow#4: Add environment variable files # This is the commit message kubeflow#5: Remove old cluster script * Add initial scripts Add working pytest script Add initial scripts Add environment variable files Remove old cluster script Update pipeline credentials to OIDC Add initial scripts Add working pytest script Add initial scripts Add working pytest script * Remove debugging mark * Update example EKS cluster name * Remove quiet from Docker build * Manually pass env * Update env list vars as string * Update use array directly * Update variable array to export * Update to using read for splitting * Move to helper script * Update export from CodeBuild * Add wait for minio * Update kubectl wait timeout * Update minor changes for PR * Update integration test buildspec to quiet build * Add region to delete EKS * Add wait for pods * Updated README * Add fixed interval wait * Fix CodeBuild step order * Add file lock for experiment ID * Fix missing pytest parameter * Update run create only once * Add filelock to conda env * Update experiment name ensuring creation each time * Add try/catch with create experiment * Remove caching from KFP deployment * Remove disable KFP caching * Move .gitignore changes to inside component * Add blank line to default .gitignore
Added namespaced pipelines, with UI and API changes, as well as the ability to share pipelines. Fixing API Tests. npm formatting end 2 end tests fixes No Shared Pipelines Created name_namespace_index This solves for quite a big bug ... dropping index on pipelines.name there will be a composite index on name and namespace Kubeflow PIpelines Multi-User Backend fixing ui tests fixed formatting issues formatting code Adding KFP SDK changes correcting test cases Fixes for integration tests (single-user) npm formatting kubeflow#2 corrected more test cases new test cases & code quality correcting nits npm format go mod tidy added update to view-edit role
change argocompiler dependency to my own repo
* # This is a combination of 6 commits. # This is the 1st commit message: Modify agent test case for code coverage (kubeflow#1849) * Modifies the test case for sync models config Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> # This is the commit message kubeflow#2: Add test cases for agent storage utils (kubeflow#1849) * Add test case for FileExists function * Add test case for RemoveDir function Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> # This is the commit message kubeflow#3: Add test case for agent storage utils * Add test case for GetProvider function Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> # This is the commit message kubeflow#4: Add test case for gcs model downloader (kubeflow#1849) * Add test case for gcs model downloader in agent Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> # This is the commit message kubeflow#5: Add test cases for agent downloader Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> # This is the commit message kubeflow#6: Add test cases for configmap (kubeflow#1849) * Add test cases for v1beta1 configmap Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> * Add test cases for inference service defaults (kubeflow#1849) * Add test cases for all model runtimes * Add test cases for all runtime defaults Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> * fmt (kubeflow#1849) Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> * Modify agent test case for code coverage (kubeflow#1849) * Modifies the test case for sync models config Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> Add test cases for agent storage utils (kubeflow#1849) * Add test case for FileExists function * Add test case for RemoveDir function Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> Add test case for agent storage utils * Add test case for GetProvider function Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> Add test case for gcs model downloader (kubeflow#1849) * Add test case for gcs model downloader in agent Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> Add test cases for agent downloader Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> Add test cases for configmap (kubeflow#1849) * Add test cases for v1beta1 configmap Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> Add test cases for inference service defaults (kubeflow#1849) * Add test cases for all model runtimes * Add test cases for all runtime defaults Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> Add test cases for predictor model (kubeflow#1849) * Add test cases for isFrameworkSupported function Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> Add test cases for sklearn predictor (kubeflow#1849) * Add test cases for GetProtocol function Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> Add test cases for configmap (kubeflow#1849) * Add test case for creating empty model config Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> Add test cases for utils (kubeflow#1849) * Add test cases for IncludesArg function * Add test cases for IsGPUEnabled function * Add test cases for FirstNonNilError function * Add test cases for RemoveString function * Add test cases for IsPrefixSupported function Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> * Add test cases for creds_utils (kubeflow#1849) * Add test cases for set_gcs_credentials function * Add test cases for create_secret function * Add test cases for set_service_account function * Add test cases for create_service_account function * Add test cases for patch_service_account function * Add test cases for get_creds_name_from_config_map function Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> * Add test cases for creds_utils (kubeflow#1849) * Add test cases for set_gcs_credentials function * Add test cases for create_secret function * Add test cases for set_service_account function * Add test cases for create_service_account function * Add test cases for patch_service_account function * Add test cases for get_creds_name_from_config_map function Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> Add test cases for creds_utils (kubeflow#1849) * Add test cases for set_s3_credentials function * Add test cases for set_azure_credentials function Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> Add test cases for v1beta1 component (kubeflow#1849) * Add test cases for validateStorageSpec function * Add test cases for validateLogger function * Add test cases for FirstNonNilComponent function Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> Add test cases for inference service status (kubeflow#1849) * Add test cases for PropagateRawStatus function * Add test cases for PropagateStatus function Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> Add test cases for predictor (kubeflow#1849) * Add test cases for GetPredictorImplementations function * Add test cases for GetPredictorImplementation function Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> Add test cases for agent_injector (kubeflow#1849) * Add test cases for getLoggerConfigs function * Add test cases for getAgentConfigs function Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> Add test cases for batcher_injector (kubeflow#1849) * Add test cases for getBatcherConfigs function Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> * Add test cases for storage initializer injector (kubeflow#1849) * Add test cases for getStorageInitializerConfigs function * Add test cases for parsePvcUri function Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> Add test cases for storage initializer injector (kubeflow#1849) * Add test cases for parsePvcUri function Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> fmt (kubeflow#1849) Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> Add test cases for controller utils (kubeflow#1849) * Add test cases for GetDeploymentMode function Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> Remove double import of same package (kubeflow#1849) Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> temp commit Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> Updated coverage for inference_service_default_test Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> Added scripts for code coverage Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> Updated make to track coverage including subpackages Added more coverage Added ignore to client package - generated code Added coverage script to workflow Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> * Temporarily commenting couple of test cases Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> Temporary changes to debug e2e Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> Commented configmap test Reverted accidental commit of generated code. Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> Added -v to debug failing tests Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> Updated tests to remove dependency on k8s cluster Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com> * Updated readme to show coverage
We named this CRD a "schedule": It schedules workflows for now but we will add support for other K8 resources (e.g., jobs). It supports:
Most of the files are automatically generated except for the files in:
./hack/custom-boilerplate.go.txt
./hack/update-codegen.sh
./hack/verify-codegen.sh
./pkg/apis/schedule/register.go
./pkg/apis/schedule/v1alpha1/doc.go
./pkg/apis/schedule/v1alpha1/register.go
./pkg/apis/schedule/v1alpha1/types.go
This change is