`PCF-ADR-0005-mcad-api-groups-migration.md` (+16 −63 lines changed)
@@ -37,69 +37,7 @@ Downstream projects and end-users will have to address the impacts manually to p

## How

-A project like [TorchX](https://github.com/pytorch/torchx), which integrates with MCAD[^1] via the AppWrapper API, has its own release cadence, and it'd be challenging to have it migrated within a single migration cycle without breaking compatibility.
-The following plan proposes a progressive migration path, implemented in three phases, as follows:
-
-### Phase 1: Add Dual API Groups Support
-
-The new API groups are introduced in MCAD, along with the existing ones.
-Functionally, the MCAD and InstaScale controllers are updated, so they can equally reconcile the old and new APIs.
-
-To implement that dual API groups support, it may be necessary to duplicate the MCAD API packages, as well as the generated client packages.
-Indeed, [controller-runtime](https://github.com/kubernetes-sigs/controller-runtime) APIs, which are used by InstaScale, are designed to use the API structs, and require the mapping between the API structs and the Group/Version/Kind (GVK)s to be a bijection.
-Otherwise stated, controller-runtime mandates that the scheme holding the mapping between the API structs and the GVKs have a single GVK entry per API struct.
-
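To make that bijection constraint concrete, here is a minimal sketch of the dual registration in Go; the group names (`mcad.ibm.com`, `workload.codeflare.dev`) and import paths are illustrative assumptions, not the actual MCAD layout:

```go
// Illustrative sketch only: registering the duplicated AppWrapper structs,
// one copy per API group, so the scheme keeps a single GVK per struct.
package scheme

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"

	// Assumed package layout: the legacy types and their duplicated copies.
	oldv1beta1 "github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/apis/controller/v1beta1"
	newv1beta1 "github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/apis/workload/v1beta1"
)

var (
	oldGV = schema.GroupVersion{Group: "mcad.ibm.com", Version: "v1beta1"}           // assumed old group
	newGV = schema.GroupVersion{Group: "workload.codeflare.dev", Version: "v1beta1"} // assumed new group
)

// AddToScheme registers both copies of the types. Sharing a single struct
// between the two groups would break the struct-to-GVK bijection that
// controller-runtime relies on, hence the package duplication.
func AddToScheme(s *runtime.Scheme) {
	s.AddKnownTypes(oldGV, &oldv1beta1.AppWrapper{}, &oldv1beta1.AppWrapperList{})
	metav1.AddToGroupVersion(s, oldGV)
	s.AddKnownTypes(newGV, &newv1beta1.AppWrapper{}, &newv1beta1.AppWrapperList{})
	metav1.AddToGroupVersion(s, newGV)
}
```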
-This boils down to the following tasks:
-
-* Duplicate the API packages, with the new API groups, and mark the old ones as deprecated
-* Generate the new CRDs, by leveraging https://github.com/project-codeflare/multi-cluster-app-dispatcher/pull/456
-* Mark the old CRDs as internal, using the `operators.operatorframework.io/internal-objects` annotation (a sketch follows this list)
-* Generate the new client packages, by leveraging https://github.com/project-codeflare/multi-cluster-app-dispatcher/issues/514
-* Duplicate the MCAD controllers, to reconcile the new APIs
-* Add RBAC for the new API groups to the MCAD manifests (both in the Helm charts and in the CodeFlare operator)
-* Update tests, documentation and samples to use the new API groups
-* Upgrade InstaScale to the newer MCAD version
-* Duplicate the InstaScale controller, to reconcile the new AppWrapper API
-* Add RBAC for the new API groups to the InstaScale manifests (both in the Helm charts and in the CodeFlare operator)
-
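As a hedged sketch of the `operators.operatorframework.io/internal-objects` task above, assuming `appwrappers.mcad.ibm.com` and `schedulingspecs.mcad.ibm.com` as the old CRD names (both illustrative):

```yaml
# Illustrative sketch: hiding the deprecated CRDs from the OperatorHub UI via
# the ClusterServiceVersion annotation. CRD and CSV names are assumptions.
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: codeflare-operator.v0.6.0  # illustrative
  annotations:
    operators.operatorframework.io/internal-objects: >-
      ["appwrappers.mcad.ibm.com", "schedulingspecs.mcad.ibm.com"]
```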
-Note: all the duplications will be cleaned up in Phase 3.
-
-The internal APIs may directly be migrated to the new API groups, without implementing dual support for them.
-
-### Phase 2: Progressive Migration
-
-The migration should be announced on the usual communication channels.
-
-The following downstream projects must be migrated to the new API groups (the user-facing change is sketched after the list):
-
-* The TorchX MCAD scheduler: https://github.com/pytorch/torchx/blob/main/torchx/schedulers/kubernetes_mcad_scheduler.py
-* The CodeFlare SDK: https://github.com/project-codeflare/codeflare-sdk
-* The ODH Data Science Pipeline operator: https://github.com/opendatahub-io/data-science-pipelines-operator
-* The KubeRay documentation: https://ray-project.github.io/kuberay/guidance/kuberay-with-MCAD (it does not mention the AppWrapper API group, but its links to the project materials can be updated)
-
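For each of these projects, and for end-users, the visible change is essentially the `apiVersion` of the submitted resources; a sketch, assuming `mcad.ibm.com` and `workload.codeflare.dev` as the old and new group names:

```yaml
# Before the migration (old API group, assumed name):
apiVersion: mcad.ibm.com/v1beta1
kind: AppWrapper
metadata:
  name: example-appwrapper
---
# After the migration (new API group, assumed name); everything else is unchanged:
apiVersion: workload.codeflare.dev/v1beta1
kind: AppWrapper
metadata:
  name: example-appwrapper
```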
-### Phase 3: Decommission old API Groups
-
-Once all the downstream projects have migrated, and existing users have acknowledged the migration plan, the following clean-up tasks should be performed:
-
-* Delete the old MCAD API packages
-* Delete the old MCAD client packages
-* Delete the old MCAD CRDs
-* Remove RBAC for the old API groups from the MCAD manifests (both in the Helm charts and in the CodeFlare operator; see the fragment after this list)
-* Delete the duplicated MCAD controllers
-* Delete the duplicated InstaScale controllers
-* Remove RBAC for the old API groups from the InstaScale manifests
-
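To make the RBAC bookkeeping concrete, here is a hedged sketch of the ClusterRole fragment that Phase 1 would add and this phase would trim back down; the group names and resource list are illustrative assumptions, not actual MCAD manifests:

```yaml
# Illustrative ClusterRole fragment for the dual-support window.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: mcad-appwrapper-controller  # illustrative name
rules:
  # Old API group (assumed name): this rule is what Phase 3 removes.
  - apiGroups: ["mcad.ibm.com"]
    resources: ["appwrappers", "appwrappers/status"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  # New API group (assumed name): this rule is kept.
  - apiGroups: ["workload.codeflare.dev"]
    resources: ["appwrappers", "appwrappers/status"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```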
-## Open Questions
-
-* Is dual-mode support necessary for the quota APIs?
-  While they'll eventually be public APIs, these are rather new, and may not actually be used yet.
-
-## Alternatives
-
-Given the TorchX MCAD scheduler is currently delivered as part of the [CodeFlare fork of TorchX](https://github.com/project-codeflare/torchx), all the impacted components are managed internally.
+Given the [TorchX MCAD scheduler](https://pytorch.org/torchx/latest/schedulers/kubernetes_mcad.html) is currently delivered as part of the [CodeFlare fork of TorchX](https://github.com/project-codeflare/torchx), all the impacted components are managed internally.

That means a _one-shot_ migration can be a viable alternative, assuming we accept that the development branches of these components may transiently break within the span of that _one-shot_ migration release cycle.

As an example, this _one-shot_ migration could be achieved during the next development cycle of the CodeFlare stack, i.e., the upcoming v0.7.0 release, in the following order:
@@ -115,6 +53,21 @@ The ODH Data Science Pipeline operator update can be done as soon as ODH upgrade

The KubeRay documentation can be updated independently.

+## Alternatives
+
+A _progressive_ migration path could be implemented, i.e., one that does not require all the downstream projects to handle the impacts in lockstep with MCAD.
+The goal would be to start that migration path as soon as possible, and have each downstream project, and end-user, walk it at their own pace, when they decide to upgrade their dependency.
+
+That strategy would require the new API groups to be introduced in MCAD, along with the existing ones.
+Functionally, the MCAD and InstaScale controllers would be updated, so they can equally reconcile the old and new APIs.
+
+To implement that dual API groups support, it would be necessary to duplicate the MCAD API packages, as well as the generated client packages.
+Indeed, [controller-runtime](https://github.com/kubernetes-sigs/controller-runtime) APIs, which are used by InstaScale, are designed to use the API structs, and require the mapping between the API structs and the Group/Version/Kind (GVK)s to be a bijection.
+Otherwise stated, controller-runtime mandates that the scheme holding the mapping between the API structs and the GVKs have a single GVK entry per API struct.
+
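As a sketch of that dual reconciliation, the duplicated controller wiring could look as follows; the type names, import paths, and the `legacy` flag are assumptions for illustration, not the actual MCAD or InstaScale code:

```go
// Illustrative sketch: one reconciler instance per API group, since each
// duplicated struct carries its own GVK in the scheme.
package main

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	oldv1beta1 "github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/apis/controller/v1beta1" // assumed path
	newv1beta1 "github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/apis/workload/v1beta1"   // assumed path
)

// appWrapperReconciler holds the shared reconciliation logic; the legacy
// flag records which API group the instance watches (illustrative).
type appWrapperReconciler struct {
	client client.Client
	legacy bool
}

func (r *appWrapperReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Shared reconciliation logic for both API groups would live here.
	return ctrl.Result{}, nil
}

func setupControllers(mgr ctrl.Manager) error {
	// Watch AppWrappers from the old, deprecated API group.
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&oldv1beta1.AppWrapper{}).
		Complete(&appWrapperReconciler{client: mgr.GetClient(), legacy: true}); err != nil {
		return err
	}
	// Watch AppWrappers from the new API group with the same logic.
	return ctrl.NewControllerManagedBy(mgr).
		For(&newv1beta1.AppWrapper{}).
		Complete(&appWrapperReconciler{client: mgr.GetClient(), legacy: false})
}
```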
+While that would give downstream projects, and end-users, extra flexibility to migrate, it would increase the complexity of the migration.
+Duplicated code would have to live in the MCAD codebase for a potentially unbounded period of time.