Commit 71afd67

astefanutti authored and anishasthana committed
Promote one-shot migration as proposed plan
1 parent 2e051b7 commit 71afd67

File tree: 1 file changed (+16 −63 lines)

PCF-ADR-0005-mcad-api-groups-migration.md

Lines changed: 16 additions & 63 deletions
@@ -37,69 +37,7 @@ Downstream projects and end-users will have to address the impacts manually to p
 
 ## How
 
-A project like [TorchX](https://github.com/pytorch/torchx), which integrates with MCAD[^1] via the AppWrapper API, has its own release cadence, and it'd be challenging to have it migrated within a single migration cycle, without breaking compatibility.
-
-[^1]: https://pytorch.org/torchx/latest/schedulers/kubernetes_mcad.html
-
-The following plan proposes a progressive migration path, implemented in three phases, as follows:
-
-### Phase 1: Add Dual API Groups Support
-
-The new API groups are introduced in MCAD, along with the existing ones.
-Functionally, the MCAD and InstaScale controllers are updated, so they can equally reconcile the old and new APIs.
-
-To implement that dual API groups support, it may be necessary to duplicate the MCAD API packages, as well as the generated client packages.
-Indeed, [controller-runtime](https://github.com/kubernetes-sigs/controller-runtime) APIs, which are used by InstaScale, are designed to use the API structs, and require the mapping between the API structs and the Group/Version/Kind(GVK)s to be an injection.
-Otherwise stated, controller-runtime mandates the scheme holding the mapping between the API structs and the GVKs to have a single GVK entry per API struct.
-
-This boils down to the following tasks:
-
-* Duplicate the API packages, with the new API groups, and mark the old ones as deprecated
-* Generate the new CRDs by leveraging: https://github.com/project-codeflare/multi-cluster-app-dispatcher/pull/456
-* Mark the old CRDs as internal, using the `operators.operatorframework.io/internal-objects` annotation
-* Generate the new client packages, by leveraging: https://github.com/project-codeflare/multi-cluster-app-dispatcher/issues/514
-* Duplicate the MCAD controllers, to reconcile the new APIs
-* Add RBAC for the new API groups to MCAD manifests (both in Helm charts and in the CodeFlare operator)
-* Update tests, documentation and samples to use the new API groups
-* Upgrade InstaScale to the newer MCAD version
-* Duplicate the InstaScale controller, to reconcile the new AppWrapper API
-* Add RBAC for the new API groups to InstaScale manifests (both in Helm charts and in the CodeFlare operator)
-
-Note: all the duplications will be cleaned up in Phase 3.
-
-The internal APIs may directly be migrated to the new API groups, without implementing dual support for them.
-
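The Phase 1 duplication removed above follows from the single-GVK constraint. A minimal Go sketch of what that dual registration could look like, assuming hypothetical group names (`mcad.ibm.com` for the old group, `workload.codeflare.dev` for the new one — neither is named in this diff) and placeholder structs standing in for the generated MCAD API packages:

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// Placeholder structs standing in for the duplicated AppWrapper API packages.
// A single struct cannot be registered under both groups, because
// controller-runtime resolves an object's GVK from the scheme and requires
// that resolution to be unambiguous.
type OldAppWrapper struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
}

type NewAppWrapper struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
}

// Simplified deep copies, sufficient for this sketch.
func (in *OldAppWrapper) DeepCopyObject() runtime.Object { out := *in; return &out }
func (in *NewAppWrapper) DeepCopyObject() runtime.Object { out := *in; return &out }

func buildScheme() *runtime.Scheme {
	scheme := runtime.NewScheme()
	oldGV := schema.GroupVersion{Group: "mcad.ibm.com", Version: "v1beta1"}           // assumed old group
	newGV := schema.GroupVersion{Group: "workload.codeflare.dev", Version: "v1beta1"} // assumed new group
	// One GVK entry per API struct: the old struct under the old group, the
	// duplicated struct under the new group.
	scheme.AddKnownTypes(oldGV, &OldAppWrapper{})
	scheme.AddKnownTypes(newGV, &NewAppWrapper{})
	return scheme
}

func main() {
	// Both groups are now served, each by its own struct.
	fmt.Println(buildScheme().AllKnownTypes())
}
```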
-### Phase 2: Progressive Migration
-
-The migration should be announced on the usual communication channels.
-
-The following downstream projects must be migrated to using the new API groups:
-
-* The TorchX MCAD scheduler: https://github.com/pytorch/torchx/blob/main/torchx/schedulers/kubernetes_mcad_scheduler.py
-* The CodeFlare SDK: https://github.com/project-codeflare/codeflare-sdk
-* ODH Data Science Pipeline operator: https://github.com/opendatahub-io/data-science-pipelines-operator
-* KubeRay documentation: https://ray-project.github.io/kuberay/guidance/kuberay-with-MCAD (no mention of the AppWrapper API group, but the links to the project materials can be updated)
-
-### Phase 3: Decommission old API Groups
-
-Once all the downstream projects have migrated, and existing users acknowledged the migration plan, the following clean-up tasks should be performed:
-
-* Delete the old MCAD API packages
-* Delete the old MCAD client packages
-* Delete the old MCAD CRDs
-* Remove RBAC for the old API groups from MCAD manifests (both in Helm charts and in the CodeFlare operator)
-* Delete the duplicated MCAD controllers
-* Delete the duplicated InstaScale controllers
-* Remove RBAC for the old API groups from InstaScale manifests
-
-## Open Questions
-
-* Is dual mode support necessary for the quota APIs?
-While they'll eventually be public APIs, these are rather new, and may not be actually used yet.
-
-## Alternatives
-
-Given the TorchX MCAD scheduler is currently delivered as part of the [CodeFlare fork of TorchX](https://github.com/project-codeflare/torchx), all the impacted components are managed internally.
+Given the [TorchX MCAD scheduler](https://pytorch.org/torchx/latest/schedulers/kubernetes_mcad.html) is currently delivered as part of the [CodeFlare fork of TorchX](https://github.com/project-codeflare/torchx), all the impacted components are managed internally.
 That means a _one-shot_ migration can be a viable alternative, assuming we accept the development branches of these components may transiently break, within the span of that _one-shot_ migration release cycle.
 
 As an example, this _one-shot_ migration could be achieved during the next development cycle of the CodeFlare stack, i.e., the upcoming v0.7.0 release, in the following order:
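For the downstream projects, the one-shot migration is mostly a mechanical rewrite of the API group in the objects they emit; the AppWrapper spec itself is unchanged. A hedged sketch, reusing the assumed group names from the earlier snippet (neither appears in this diff):

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// migrateAppWrapper rewrites only the API group of an AppWrapper manifest;
// everything under spec is carried over untouched.
func migrateAppWrapper(obj *unstructured.Unstructured) {
	gvk := obj.GroupVersionKind()
	if gvk.Group == "mcad.ibm.com" { // assumed old group
		gvk.Group = "workload.codeflare.dev" // assumed new group
		obj.SetGroupVersionKind(gvk)
	}
}

func main() {
	aw := &unstructured.Unstructured{}
	aw.SetGroupVersionKind(schema.GroupVersionKind{
		Group: "mcad.ibm.com", Version: "v1beta1", Kind: "AppWrapper",
	})
	aw.SetName("example")
	migrateAppWrapper(aw)
	fmt.Println(aw.GetAPIVersion()) // workload.codeflare.dev/v1beta1
}
```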
@@ -115,6 +53,21 @@ The ODH Data Science Pipeline operator update can be done as soon as ODH upgrade
 
 The KubeRay documentation can be updated independently.
 
+## Alternatives
+
+A _progressive_ migration path could be implemented, i.e., one that does not require all the downstream projects to handle the impacts in locked steps with MCAD.
+The goal would be to have that migration path started as soon as possible, and have each downstream project, and end-user, walk the migration path at their own pace, when they decide to upgrade their dependency.
+
+That strategy would require the new API groups to be introduced in MCAD, along with the existing ones.
+Functionally, the MCAD and InstaScale controllers would be updated, so they can equally reconcile the old and new APIs.
+
+To implement that dual API groups support, it would be necessary to duplicate the MCAD API packages, as well as the generated client packages.
+Indeed, [controller-runtime](https://github.com/kubernetes-sigs/controller-runtime) APIs, which are used by InstaScale, are designed to use the API structs, and require the mapping between the API structs and the Group/Version/Kind (GVK)s to be a bijection.
+Otherwise stated, controller-runtime mandates the scheme holding the mapping between the API structs and the GVKs to have a single GVK entry per API struct.
+
+While that'd give extra flexibility for downstream projects, and end-users, to migrate, it would increase the complexity of the migration.
+Duplicated code would have to live in the MCAD codebase for a potentially unbounded period of time.
+
 ## Stakeholder Impacts
 
 | Group | Key Contacts | Date | Impacted? |
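On the controller side, the added Alternatives text implies one reconciler registration per API group, since each duplicated struct is a distinct type. A sketch of that shape with controller-runtime, using placeholder types (the real MCAD and InstaScale reconcilers are not shown in this diff); the `OldAppWrapper`/`NewAppWrapper` placeholders from the first snippet would satisfy `client.Object` here:

```go
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// appWrapperReconciler is a placeholder for the MCAD/InstaScale reconcilers.
type appWrapperReconciler struct {
	client.Client
}

func (r *appWrapperReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Reconcile the AppWrapper named by req.NamespacedName, for whichever
	// API group this controller watches.
	return ctrl.Result{}, nil
}

// setupAppWrapperControllers registers one controller per API group: the
// scheme holds a single GVK per struct, so the old and new AppWrapper types
// each need their own watch, as the duplication tasks above describe.
func setupAppWrapperControllers(mgr ctrl.Manager, oldObj, newObj client.Object) error {
	if err := ctrl.NewControllerManagedBy(mgr).
		Named("old-appwrapper"). // explicit names, since both kinds are "AppWrapper"
		For(oldObj).             // deprecated group, dropped again when decommissioned
		Complete(&appWrapperReconciler{mgr.GetClient()}); err != nil {
		return err
	}
	return ctrl.NewControllerManagedBy(mgr).
		Named("new-appwrapper").
		For(newObj). // new group
		Complete(&appWrapperReconciler{mgr.GetClient()})
}
```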
