Commit cfe55f4

Document RHOAI 2.21 setup (#191)
1 parent b76eec8 commit cfe55f4

20 files changed, +769 -13 lines changed

SETUP.md

Lines changed: 5 additions & 5 deletions
@@ -45,16 +45,16 @@ Instructions are provided for the following Red Hat OpenShift AI ***stable*** re
     + [RHOAI 2.16 Uninstall](./setup.RHOAI-v2.16/UNINSTALL.md)

 Instructions are provided for the following Red Hat OpenShift AI ***fast*** releases:
++ Red Hat OpenShift AI 2.21
+    + [RHOAI 2.21 Cluster Setup](./setup.RHOAI-v2.21/CLUSTER-SETUP.md)
+    + [RHOAI 2.21 Team Setup](./setup.RHOAI-v2.21/TEAM-SETUP.md)
+    + [UPGRADING from RHOAI 2.20](./setup.RHOAI-v2.21/UPGRADE-FAST.md)
+    + [RHOAI 2.21 Uninstall](./setup.RHOAI-v2.21/UNINSTALL.md)
 + Red Hat OpenShift AI 2.20
     + [RHOAI 2.20 Cluster Setup](./setup.RHOAI-v2.20/CLUSTER-SETUP.md)
     + [RHOAI 2.20 Team Setup](./setup.RHOAI-v2.20/TEAM-SETUP.md)
     + [UPGRADING from RHOAI 2.19](./setup.RHOAI-v2.20/UPGRADE-FAST.md)
     + [RHOAI 2.20 Uninstall](./setup.RHOAI-v2.20/UNINSTALL.md)
-+ Red Hat OpenShift AI 2.19
-    + [RHOAI 2.19 Cluster Setup](./setup.RHOAI-v2.19/CLUSTER-SETUP.md)
-    + [RHOAI 2.19 Team Setup](./setup.RHOAI-v2.19/TEAM-SETUP.md)
-    + [UPGRADING from RHOAI 2.18](./setup.RHOAI-v2.19/UPGRADE-FAST.md)
-    + [RHOAI 2.19 Uninstall](./setup.RHOAI-v2.19/UNINSTALL.md)

 ## Kubernetes

setup.RHOAI-v2.21/CLUSTER-SETUP.md

Lines changed: 171 additions & 0 deletions
@@ -0,0 +1,171 @@
# Cluster Setup

The cluster setup installs Red Hat OpenShift AI and configures Scheduler Plugins, Kueue,
cluster roles, and priority classes.

## Priorities

Create `default-priority`, `high-priority`, and `low-priority` priority classes:
```sh
oc apply -f setup.RHOAI-v2.21/mlbatch-priorities.yaml
```

## Scheduler Configuration

MLBatch configures Kubernetes scheduling to accomplish two objectives:
+ Obtaining gang (all or nothing) scheduling for multi-Pod workloads.
+ Packing Pods whose GPU request is less than the number of GPUs on a Node to
  maximize the number of Nodes available for Pods that request all the GPUs on a Node.

This is done by installing the Coscheduling out-of-tree scheduler plugin and configuring
the default NodeResourcesFit scheduler plugin to pack in the GPU dimension.

```sh
helm install scheduler-plugins --namespace scheduler-plugins --create-namespace \
  scheduler-plugins/manifests/install/charts/as-a-second-scheduler/ \
  --set-json pluginConfig='[{"args":{"scoringStrategy":{"resources":[{"name":"nvidia.com/gpu","weight":1}],"requestedToCapacityRatio":{"shape":[{"utilization":0,"score":0},{"utilization":100,"score":10}]},"type":"RequestedToCapacityRatio"}},"name":"NodeResourcesFit"},{"args":{"permitWaitingTimeSeconds":300},"name":"Coscheduling"}]'
```
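
To build intuition for why this configuration packs GPUs, the sketch below evaluates the linear shape from `pluginConfig` (utilization 0 scores 0, utilization 100 scores 10) for a hypothetical 8-GPU Node. The helper name `gpu_pack_score` and the example numbers are illustrative assumptions. The real scheduler rescales this value and combines it with other plugin scores, so this is arithmetic intuition, not the scheduler's implementation.

```sh
# Hypothetical helper (not part of MLBatch): score a Node on the linear shape
# [(utilization 0 -> score 0), (utilization 100 -> score 10)] from pluginConfig.
# $1 = GPUs already requested on the Node
# $2 = GPUs requested by the incoming Pod
# $3 = GPU capacity of the Node
gpu_pack_score() {
  awk -v used="$1" -v req="$2" -v cap="$3" \
    'BEGIN { printf "%.2f\n", (used + req) / cap * 10 }'
}

gpu_pack_score 4 2 8   # Node with 4 of 8 GPUs already requested
gpu_pack_score 0 2 8   # empty 8-GPU Node
```

The partly used Node scores higher (7.50 versus 2.50), so a 2-GPU Pod is steered toward it and the empty Node stays available for Pods that request all 8 GPUs.
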
Patch scheduler-plugins pod priorities:
```sh
oc patch deployment -n scheduler-plugins --type=json --patch-file setup.RHOAI-v2.21/scheduler-priority-patch.yaml scheduler-plugins-controller
oc patch deployment -n scheduler-plugins --type=json --patch-file setup.RHOAI-v2.21/scheduler-priority-patch.yaml scheduler-plugins-scheduler
```

## Red Hat OpenShift AI

Create the Red Hat OpenShift AI subscription:
```sh
oc apply -f setup.RHOAI-v2.21/mlbatch-subscription.yaml
```
Create the mlbatch NetworkPolicy in the `redhat-ods-applications` namespace:
```sh
oc apply -f setup.RHOAI-v2.21/mlbatch-network-policy.yaml
```
Identify the install plan:
```sh
oc get ip -n redhat-ods-operator
```
```
NAMESPACE             NAME            CSV                     APPROVAL   APPROVED
redhat-ods-operator   install-kmh8w   rhods-operator.2.21.0   Manual     false
```
Approve the install plan, replacing the generated plan name below with the actual
value:
```sh
oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' install-kmh8w
```
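
Copying the generated plan name by hand is easy to get wrong, so a small filter can extract it from the table output. The following is a minimal sketch: `unapproved_plans` is a hypothetical helper, not something MLBatch or `oc` provides, and it assumes the column headers shown in the sample output above.

```sh
# Hypothetical helper (not part of MLBatch): read `oc get ip` table output on
# stdin and print the NAME of every install plan whose APPROVED column is false.
unapproved_plans() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "NAME") name_col = i; next }
    $NF == "false" { print $name_col }
  '
}

# Sample input copied from the example output in this document.
printf '%s\n' \
  'NAMESPACE NAME CSV APPROVAL APPROVED' \
  'redhat-ods-operator install-kmh8w rhods-operator.2.21.0 Manual false' \
  | unapproved_plans
```

On a live cluster the same filter would be fed from `oc get ip -n redhat-ods-operator`. Review the printed name before passing it to `oc patch`.
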
Create DSC Initialization:
```sh
oc apply -f setup.RHOAI-v2.21/mlbatch-dsci.yaml
```
Create Data Science Cluster:
```sh
oc apply -f setup.RHOAI-v2.21/mlbatch-dsc.yaml
```
The provided DSCI and DSC are intended to install a minimal set of Red Hat OpenShift
AI managed components: `codeflare`, `kueue`, `ray`, and `trainingoperator`. The
remaining components such as `dashboard` can be optionally enabled.

The configuration of the managed components differs from the default Red Hat OpenShift
AI configuration as follows:
- Kubeflow Training Operator:
  - `gang-scheduler-name` is set to `scheduler-plugins-scheduler`,
- Kueue:
  - `manageJobsWithoutQueueName` is enabled,
  - `batch/job` integration is disabled,
  - `waitForPodsReady` is disabled,
  - `fairSharing` is enabled,
  - `enableClusterQueueResources` metrics is enabled,
- Codeflare operator:
  - the AppWrapper controller is enabled and configured as follows:
    - `userRBACAdmissionCheck` is disabled,
    - `schedulerName` is set to `scheduler-plugins-scheduler`,
    - `queueName` is set to `default-queue`,
    - `slackQueueName` is set to `slack-cluster-queue`

## Autopilot

Helm chart values and customization instructions can be found [in the official documentation](https://github.com/IBM/autopilot/blob/main/helm-charts/autopilot/README.md). As-is, Autopilot will run on GPU nodes.

- Add the Autopilot Helm repository:

```bash
helm repo add autopilot https://ibm.github.io/autopilot/
helm repo update
```

- Install the chart (the command is idempotent). The config file customizes the Helm values and is optional.

```bash
helm upgrade autopilot autopilot/autopilot --install --namespace=autopilot --create-namespace -f your-config.yml
```

### Enabling Prometheus metrics

After completing the installation, manually label the namespace so that Prometheus can scrape its metrics:

```bash
oc label ns autopilot openshift.io/cluster-monitoring=true
```

The `ServiceMonitor` labeling is not required.

## Kueue Configuration

Create Kueue's default flavor:
```sh
oc apply -f setup.RHOAI-v2.21/default-flavor.yaml
```

## Cluster Role

Create `mlbatch-edit` role:
```sh
oc apply -f setup.RHOAI-v2.21/mlbatch-edit-role.yaml
```

## Slack Cluster Queue

Create the designated slack `ClusterQueue` which will be used to automate
minor adjustments to cluster capacity caused by node failures and
scheduler maintenance.
```sh
oc apply -f- << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: slack-cluster-queue
spec:
  namespaceSelector: {}
  cohort: default-cohort
  preemption:
    withinClusterQueue: LowerOrNewerEqualPriority
    reclaimWithinCohort: Any
    borrowWithinCohort:
      policy: Never
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "nvidia.com/roce_gdr", "pods"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 8000m
      - name: "memory"
        nominalQuota: 128Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 8
      - name: "nvidia.com/roce_gdr"
        nominalQuota: 1
      - name: "pods"
        nominalQuota: 100
EOF
```
Edit the above quantities to adjust the quota to the desired
values. Pod counts are optional and can be omitted from the list of
covered resources. The `lendingLimit` for each resource will be
dynamically adjusted by the MLBatch system to reflect reduced cluster
capacity. See [QUOTA_MAINTENANCE.md](../QUOTA_MAINTENANCE.md) for a
detailed discussion of the role of the slack `ClusterQueue`.
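
As a rough illustration of how a `lendingLimit` could track reduced capacity, the sketch below assumes the simplest possible rule: nominal quota minus lost capacity, floored at zero. That rule and the helper name `slack_lending_limit` are assumptions made for this example only. The behavior actually implemented by MLBatch is the one described in QUOTA_MAINTENANCE.md.

```sh
# Assumption for illustration only: lendingLimit = nominalQuota - lost capacity,
# floored at zero. The authoritative description is in QUOTA_MAINTENANCE.md.
# $1 = nominalQuota of the slack ClusterQueue for one resource
# $2 = amount of that resource currently unavailable (e.g. GPUs on failed Nodes)
slack_lending_limit() {
  limit=$(( $1 - $2 ))
  if [ "$limit" -lt 0 ]; then
    limit=0
  fi
  echo "$limit"
}

slack_lending_limit 8 3    # 3 of the 8 slack GPUs are unavailable
slack_lending_limit 8 12   # more capacity lost than the slack quota covers
```

Under this assumed rule, a slack quota of 8 GPUs with 3 unavailable leaves 5 lendable, and the limit never drops below zero.
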

setup.RHOAI-v2.21/TEAM-SETUP.md

Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
# Team Setup

A *team* in MLBatch is a group of users that share a resource quota.

Before setting up your teams and quotas, please read [QUOTA_MAINTENANCE.md](../QUOTA_MAINTENANCE.md)
for a discussion of our recommended best practices.

Setting up a new team requires the cluster admin to create a project,
a user group, a quota, a queue, and the required role bindings as described below.

Create project:
```sh
oc new-project team1
```
Create user group:
```sh
oc adm groups new team1-edit-group
```
Add users to the group, for example:
```sh
oc adm groups add-users team1-edit-group user1
```
Bind cluster role to group in namespace:
```sh
oc adm policy add-role-to-group mlbatch-edit team1-edit-group --role-namespace="" --namespace team1
```

Specify the intended quota for the namespace by creating a `ClusterQueue`:
```sh
oc apply -f- << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team1-cluster-queue
spec:
  namespaceSelector: {}
  cohort: default-cohort
  preemption:
    withinClusterQueue: LowerOrNewerEqualPriority
    reclaimWithinCohort: Any
    borrowWithinCohort:
      policy: Never
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "nvidia.com/roce_gdr", "pods"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 8000m
        # borrowingLimit: 0
        # lendingLimit: 0
      - name: "memory"
        nominalQuota: 128Gi
        # borrowingLimit: 0
        # lendingLimit: 0
      - name: "nvidia.com/gpu"
        nominalQuota: 16
        # borrowingLimit: 0
        # lendingLimit: 0
      - name: "nvidia.com/roce_gdr"
        nominalQuota: 4
        # borrowingLimit: 0
        # lendingLimit: 0
      - name: "pods"
        nominalQuota: 100
        # borrowingLimit: 0
        # lendingLimit: 0
EOF
```
Edit the above quantities to adjust the quota to the desired values. Pod counts
are optional and can be omitted from the list of covered resources.

Uncomment all `borrowingLimit` lines to prevent this namespace from borrowing
quota from other namespaces. Uncomment all `lendingLimit` lines to prevent other
namespaces from borrowing quota from this namespace.
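
If the `ClusterQueue` manifest is kept in a file, the uncommenting step can be scripted. The following is a minimal sketch: `uncomment_limit` is a hypothetical helper, and it assumes the commented lines have the form `# borrowingLimit: 0` or `# lendingLimit: 0` exactly as in the manifest above.

```sh
# Hypothetical helper (not part of MLBatch): uncomment every "# <field>:" line
# read from stdin, keeping the original indentation.
# $1 = field name, either borrowingLimit or lendingLimit
uncomment_limit() {
  sed -E "s/^([[:space:]]*)# ($1):/\\1\\2:/"
}

# Sample input: one resource entry from the ClusterQueue manifest.
printf '%s\n' \
  '      - name: "nvidia.com/gpu"' \
  '        nominalQuota: 16' \
  '        # borrowingLimit: 0' \
  '        # lendingLimit: 0' \
  | uncomment_limit borrowingLimit
```

Only the `borrowingLimit` line is uncommented here, which corresponds to preventing the team from borrowing while still allowing others to borrow from it. Piping the result through `uncomment_limit lendingLimit` as well would enable both limits.
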

Create a `LocalQueue` to bind the `ClusterQueue` to the namespace:
```sh
oc apply -n team1 -f- << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: default-queue
spec:
  clusterQueue: team1-cluster-queue
EOF
```
We recommend naming the local queue `default-queue` as `AppWrappers` will
default to this queue name.

setup.RHOAI-v2.21/UNINSTALL.md

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
# Uninstall

***First, remove all team projects and corresponding cluster queues.***

Then to uninstall the MLBatch controllers and reclaim the corresponding
namespaces, run:
```sh
# OpenShift AI uninstall
oc delete dsc mlbatch-dsc
oc delete dsci mlbatch-dsci
oc delete subscription -n redhat-ods-operator rhods-operator
oc delete csv -n redhat-ods-operator -l operators.coreos.com/rhods-operator.redhat-ods-operator
oc delete crd featuretrackers.features.opendatahub.io \
  dscinitializations.dscinitialization.opendatahub.io \
  datascienceclusters.datasciencecluster.opendatahub.io
oc delete operators rhods-operator.redhat-ods-operator
oc delete operatorgroup -n redhat-ods-operator rhods-operator
oc delete namespace redhat-ods-applications redhat-ods-monitoring redhat-ods-operator

# Coscheduler uninstall
helm uninstall -n scheduler-plugins scheduler-plugins
oc delete namespace scheduler-plugins
```

setup.RHOAI-v2.21/UPGRADE-FAST.md

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
# Upgrading from RHOAI 2.20

These instructions assume you installed and configured RHOAI 2.20 following
the MLBatch [install instructions for RHOAI-v2.20](../setup.RHOAI-v2.20/CLUSTER-SETUP.md).

Your subscription will have automatically created an unapproved
install plan to upgrade to RHOAI 2.21.

Before beginning, verify that the expected install plan exists:
```sh
oc get ip -n redhat-ods-operator
```
Typical output would be:
```sh
NAME            CSV                     APPROVAL   APPROVED
install-kpzzl   rhods-operator.2.21.0   Manual     false
install-nqrbp   rhods-operator.2.20.0   Manual     true
```

There are no MLBatch modifications to the default RHOAI configuration maps
beyond those already made in previous installs. Therefore, you can simply
approve the install plan, replacing the example plan name below with the actual
value on your cluster:
```sh
oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' install-kpzzl
```
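
Because several install plans can be listed during an upgrade, it can help to select the plan by the CSV it installs rather than by eye. The following is a minimal sketch: `plan_for_csv` is a hypothetical helper, not part of MLBatch or `oc`, and it assumes the `NAME` and `CSV` column headers shown in the typical output above.

```sh
# Hypothetical helper (not part of MLBatch): read `oc get ip` table output on
# stdin and print the NAME of the install plan whose CSV column equals $1.
plan_for_csv() {
  awk -v csv="$1" '
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "NAME") name_col = i
        if ($i == "CSV")  csv_col = i
      }
      next
    }
    $csv_col == csv { print $name_col }
  '
}

# Sample input copied from the typical output in this document.
printf '%s\n' \
  'NAME CSV APPROVAL APPROVED' \
  'install-kpzzl rhods-operator.2.21.0 Manual false' \
  'install-nqrbp rhods-operator.2.20.0 Manual true' \
  | plan_for_csv rhods-operator.2.21.0
```

Selecting by CSV guards against approving a plan for an unexpected version. On a live cluster the input would come from `oc get ip -n redhat-ods-operator`.
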

setup.RHOAI-v2.21/default-flavor.yaml

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor

setup.RHOAI-v2.21/mlbatch-dsc.yaml

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: mlbatch-dsc
spec:
  components:
    codeflare:
      managementState: Managed
    dashboard:
      managementState: Removed
    datasciencepipelines:
      managementState: Removed
    kserve:
      managementState: Removed
      serving:
        ingressGateway:
          certificate:
            type: SelfSigned
        managementState: Removed
        name: knative-serving
    kueue:
      managementState: Managed
    modelmeshserving:
      managementState: Removed
    ray:
      managementState: Managed
    trainingoperator:
      managementState: Managed
    trustyai:
      managementState: Removed
    workbenches:
      managementState: Removed

setup.RHOAI-v2.21/mlbatch-dsci.yaml

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
apiVersion: dscinitialization.opendatahub.io/v1
kind: DSCInitialization
metadata:
  name: mlbatch-dsci
spec:
  applicationsNamespace: redhat-ods-applications
  monitoring:
    managementState: Managed
    namespace: redhat-ods-monitoring
  serviceMesh:
    managementState: Removed
  trustedCABundle:
    customCABundle: ""
    managementState: Managed
