project-codeflare
diff --git a/SETUP.md b/SETUP.md
Lines changed: 5 additions & 5 deletions
diff --git a/setup.RHOAI-v2.21/CLUSTER-SETUP.md b/setup.RHOAI-v2.21/CLUSTER-SETUP.md
Lines changed: 171 additions & 0 deletions
diff --git a/setup.RHOAI-v2.21/TEAM-SETUP.md b/setup.RHOAI-v2.21/TEAM-SETUP.md
Lines changed: 91 additions & 0 deletions
diff --git a/setup.RHOAI-v2.21/UNINSTALL.md b/setup.RHOAI-v2.21/UNINSTALL.md
Lines changed: 23 additions & 0 deletions
diff --git a/setup.RHOAI-v2.21/UPGRADE-FAST.md b/setup.RHOAI-v2.21/UPGRADE-FAST.md
Lines changed: 26 additions & 0 deletions
diff --git a/setup.RHOAI-v2.21/default-flavor.yaml b/setup.RHOAI-v2.21/default-flavor.yaml
Lines changed: 4 additions & 0 deletions
diff --git a/setup.RHOAI-v2.21/mlbatch-dsc.yaml b/setup.RHOAI-v2.21/mlbatch-dsc.yaml
Lines changed: 32 additions & 0 deletions
diff --git a/setup.RHOAI-v2.21/mlbatch-dsci.yaml b/setup.RHOAI-v2.21/mlbatch-dsci.yaml
Lines changed: 14 additions & 0 deletions
@@ -45,16 +45,16 @@ Instructions are provided for the following Red Hat OpenShift AI ***stable*** re
    + [RHOAI 2.16 Uninstall](./setup.RHOAI-v2.16/UNINSTALL.md)
 
 Instructions are provided for the following Red Hat OpenShift AI ***fast*** releases:
++ Red Hat OpenShift AI 2.21
+   + [RHOAI 2.21 Cluster Setup](./setup.RHOAI-v2.21/CLUSTER-SETUP.md)
+   + [RHOAI 2.21 Team Setup](./setup.RHOAI-v2.21/TEAM-SETUP.md)
+   + [UPGRADING from RHOAI 2.20](./setup.RHOAI-v2.21/UPGRADE-FAST.md)
+   + [RHOAI 2.21 Uninstall](./setup.RHOAI-v2.21/UNINSTALL.md)
 + Red Hat OpenShift AI 2.20
    + [RHOAI 2.20 Cluster Setup](./setup.RHOAI-v2.20/CLUSTER-SETUP.md)
    + [RHOAI 2.20 Team Setup](./setup.RHOAI-v2.20/TEAM-SETUP.md)
    + [UPGRADING from RHOAI 2.19](./setup.RHOAI-v2.20/UPGRADE-FAST.md)
    + [RHOAI 2.20 Uninstall](./setup.RHOAI-v2.20/UNINSTALL.md)
-+ Red Hat OpenShift AI 2.19
-   + [RHOAI 2.19 Cluster Setup](./setup.RHOAI-v2.19/CLUSTER-SETUP.md)
-   + [RHOAI 2.19 Team Setup](./setup.RHOAI-v2.19/TEAM-SETUP.md)
-   + [UPGRADING from RHOAI 2.18](./setup.RHOAI-v2.19/UPGRADE-FAST.md)
-   + [RHOAI 2.19 Uninstall](./setup.RHOAI-v2.19/UNINSTALL.md)
 
 ## Kubernetes
 
 
@@ -0,0 +1,171 @@
+# Cluster Setup
+
+The cluster setup installs Red Hat OpenShift AI and configures Scheduler Plugins, Kueue,
+cluster roles, and priority classes.
+
+## Priorities
+
+Create `default-priority`, `high-priority`, and `low-priority` priority classes:
+```sh
+oc apply -f setup.RHOAI-v2.21/mlbatch-priorities.yaml
+```
+
+## Scheduler Configuration
+
+MLBatch configures Kubernetes scheduling to accomplish two objectives:
++ Obtaining gang (all or nothing) scheduling for multi-Pod workloads.
++ Packing Pods whose GPU request is less than the number of GPUs on a Node to
+  maximize the number of Nodes available for Pods that request all the GPUs on a Node.
+
+This is done by installing the Coscheduling out-of-tree scheduler plugin and configuring
+the default NodeResourcesFit scheduler plugin to pack in the GPU dimension.
+
+
+```sh
+helm install scheduler-plugins --namespace scheduler-plugins --create-namespace \
+  scheduler-plugins/manifests/install/charts/as-a-second-scheduler/ \
+  --set-json pluginConfig='[{"args":{"scoringStrategy":{"resources":[{"name":"nvidia.com/gpu","weight":1}],"requestedToCapacityRatio":{"shape":[{"utilization":0,"score":0},{"utilization":100,"score":10}]},"type":"RequestedToCapacityRatio"}},"name":"NodeResourcesFit"},{"args":{"permitWaitingTimeSeconds":300},"name":"Coscheduling"}]'
+```
+Patch scheduler-plugins pod priorities:
+```sh
+oc patch deployment -n scheduler-plugins --type=json --patch-file setup.RHOAI-v2.21/scheduler-priority-patch.yaml scheduler-plugins-controller
+oc patch deployment -n scheduler-plugins --type=json --patch-file setup.RHOAI-v2.21/scheduler-priority-patch.yaml scheduler-plugins-scheduler
+```
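The `pluginConfig` value passed to `helm install` above is dense on one line. As a reading aid, the sketch below writes the same JSON in expanded form and compacts it back with `tr`. The variable name `plugin_config` is illustrative, and stripping all whitespace is only safe here because none of the string values contain spaces.

```sh
# Expanded form of the pluginConfig JSON used by the helm install command.
# tr removes spaces and newlines, yielding the single-line value.
plugin_config=$(cat <<'EOF' | tr -d ' \n'
[
  {
    "args": {
      "scoringStrategy": {
        "resources": [{"name": "nvidia.com/gpu", "weight": 1}],
        "requestedToCapacityRatio": {
          "shape": [
            {"utilization": 0, "score": 0},
            {"utilization": 100, "score": 10}
          ]
        },
        "type": "RequestedToCapacityRatio"
      }
    },
    "name": "NodeResourcesFit"
  },
  {
    "args": {"permitWaitingTimeSeconds": 300},
    "name": "Coscheduling"
  }
]
EOF
)

# Print the compacted value for inspection.
printf '%s\n' "$plugin_config"
```

The resulting string can be supplied as `--set-json pluginConfig="$plugin_config"` in place of the inline literal.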
+
+
+
+## Red Hat OpenShift AI
+
+Create the Red Hat OpenShift AI subscription:
+```sh
+oc apply -f setup.RHOAI-v2.21/mlbatch-subscription.yaml
+```
+Create the mlbatch NetworkPolicy in the redhat-ods-applications namespace:
+```sh
+oc apply -f setup.RHOAI-v2.21/mlbatch-network-policy.yaml
+```
+Identify the install plan:
+```sh
+oc get ip -n redhat-ods-operator
+```
+```
+NAMESPACE             NAME            CSV                     APPROVAL   APPROVED
+redhat-ods-operator   install-kmh8w   rhods-operator.2.21.0   Manual     false
+```
+Approve the install plan, replacing the generated plan name below with the actual
+value:
+```sh
+oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' install-kmh8w
+```
+Create DSC Initialization:
+```sh
+oc apply -f setup.RHOAI-v2.21/mlbatch-dsci.yaml
+```
+Create Data Science Cluster:
+```sh
+oc apply -f setup.RHOAI-v2.21/mlbatch-dsc.yaml
+```
+The provided DSCI and DSC are intended to install a minimal set of Red Hat OpenShift
+AI managed components: `codeflare`, `kueue`, `ray`, and `trainingoperator`. The
+remaining components such as `dashboard` can be optionally enabled.
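One possible way to enable an optional component after installation is a merge patch that flips its `managementState`. The sketch below only prints the command for review. It assumes the DSC is named `mlbatch-dsc` as in `mlbatch-dsc.yaml`; the helper name `component_patch` is hypothetical, and whether `dashboard` needs further configuration on a given cluster has not been verified here.

```sh
# Build a merge patch that sets one DSC component's managementState.
# Usage: component_patch COMPONENT STATE
component_patch() {
  printf '{"spec":{"components":{"%s":{"managementState":"%s"}}}}' "$1" "$2"
}

# Print (do not run) the patch command for the dashboard component.
echo "oc patch dsc mlbatch-dsc --type merge --patch '$(component_patch dashboard Managed)'"
```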
+
+The configuration of the managed components differs from the default Red Hat OpenShift
+AI configuration as follows:
+- Kubeflow Training Operator:
+  - `gang-scheduler-name` is set to `scheduler-plugins-scheduler`,
+- Kueue:
+  - `manageJobsWithoutQueueName` is enabled,
+  - `batch/job` integration is disabled,
+  - `waitForPodsReady` is disabled,
+  - `fairSharing` is enabled,
+  - `enableClusterQueueResources` metrics are enabled,
+- Codeflare operator:
+  - the AppWrapper controller is enabled and configured as follows:
+    - `userRBACAdmissionCheck` is disabled,
+    - `schedulerName` is set to `scheduler-plugins-scheduler`,
+    - `queueName` is set to `default-queue`,
+    - `slackQueueName` is set to `slack-cluster-queue`
+
+## Autopilot
+
+Helm chart values and customization instructions can be found [in the official documentation](https://github.com/IBM/autopilot/blob/main/helm-charts/autopilot/README.md). As-is, Autopilot will run on GPU nodes.
+
+- Add the Autopilot Helm repository
+
+```bash
+helm repo add autopilot https://ibm.github.io/autopilot/
+helm repo update
+```
+
+- Install the chart (idempotent command). The config file customizes the Helm values and is optional.
+
+```bash
+helm upgrade autopilot autopilot/autopilot --install --namespace=autopilot --create-namespace -f your-config.yml
+```
+
+### Enabling Prometheus metrics
+
+After completing the installation, label the namespace so that Prometheus can scrape its metrics:
+
+```bash
+oc label ns autopilot openshift.io/cluster-monitoring=true
+```
+
+The `ServiceMonitor` labeling is not required.
+
+## Kueue Configuration
+
+Create Kueue's default flavor:
+```sh
+oc apply -f setup.RHOAI-v2.21/default-flavor.yaml
+```
+
+## Cluster Role
+
+Create `mlbatch-edit` role:
+```sh
+oc apply -f setup.RHOAI-v2.21/mlbatch-edit-role.yaml
+```
+
+## Slack Cluster Queue
+
+Create the designated slack `ClusterQueue`, which will be used to automate
+minor adjustments to cluster capacity caused by node failures and
+scheduler maintenance.
+```sh
+oc apply -f- << EOF
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: ClusterQueue
+metadata:
+  name: slack-cluster-queue
+spec:
+  namespaceSelector: {}
+  cohort: default-cohort
+  preemption:
+    withinClusterQueue: LowerOrNewerEqualPriority
+    reclaimWithinCohort: Any
+    borrowWithinCohort:
+      policy: Never
+  resourceGroups:
+  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "nvidia.com/roce_gdr", "pods"]
+    flavors:
+    - name: default-flavor
+      resources:
+      - name: "cpu"
+        nominalQuota: 8000m
+      - name: "memory"
+        nominalQuota: 128Gi
+      - name: "nvidia.com/gpu"
+        nominalQuota: 8
+      - name: "nvidia.com/roce_gdr"
+        nominalQuota: 1
+      - name: "pods"
+        nominalQuota: 100
+EOF
+```
+Edit the above quantities to adjust the quota to the desired
+values. Pod counts are optional and can be omitted from the list of
+covered resources.  The `lendingLimit` for each resource will be
+dynamically adjusted by the MLBatch system to reflect reduced cluster
+capacity. See [QUOTA_MAINTENANCE.md](../QUOTA_MAINTENANCE.md) for a
+detailed discussion of the role of the slack `ClusterQueue`.
@@ -0,0 +1,91 @@
+# Team Setup
+
+A *team* in MLBatch is a group of users that share a resource quota.
+
+Before setting up your teams and quotas, please read [QUOTA_MAINTENANCE.md](../QUOTA_MAINTENANCE.md)
+for a discussion of our recommended best practices.
+
+
+Setting up a new team requires the cluster admin to create a project,
+a user group, a quota, a queue, and the required role bindings as described below.
+
+Create project:
+```sh
+oc new-project team1
+```
+Create user group:
+```sh
+oc adm groups new team1-edit-group
+```
+Add users to the group, for example:
+```sh
+oc adm groups add-users team1-edit-group user1
+```
+Bind cluster role to group in namespace:
+```sh
+oc adm policy add-role-to-group mlbatch-edit team1-edit-group --role-namespace="" --namespace team1
+```
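The project, group, and role-binding steps above follow one naming pattern, so they can be generated per team. The sketch below prints the commands rather than running them, so they can be reviewed first. The function name `team_setup_commands` and the `team2`, `alice`, and `bob` values are hypothetical; the `<team>-edit-group` naming mirrors the `team1` example.

```sh
# Print (do not run) the project, group, and role-binding commands for one team.
# Usage: team_setup_commands TEAM [USER...]
team_setup_commands() {
  team=$1
  shift
  group="${team}-edit-group"
  echo "oc new-project ${team}"
  echo "oc adm groups new ${group}"
  # One add-users command per user given on the command line.
  for user in "$@"; do
    echo "oc adm groups add-users ${group} ${user}"
  done
  echo "oc adm policy add-role-to-group mlbatch-edit ${group} --role-namespace=\"\" --namespace ${team}"
}

team_setup_commands team2 alice bob
```

Once the printed commands look right, the output can be piped to `sh` by a cluster admin.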
+
+Specify the intended quota for the namespace by creating a `ClusterQueue`:
+```sh
+oc apply -f- << EOF
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: ClusterQueue
+metadata:
+  name: team1-cluster-queue
+spec:
+  namespaceSelector: {}
+  cohort: default-cohort
+  preemption:
+    withinClusterQueue: LowerOrNewerEqualPriority
+    reclaimWithinCohort: Any
+    borrowWithinCohort:
+      policy: Never
+  resourceGroups:
+  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "nvidia.com/roce_gdr", "pods"]
+    flavors:
+    - name: default-flavor
+      resources:
+      - name: "cpu"
+        nominalQuota: 8000m
+        # borrowingLimit: 0
+        # lendingLimit: 0
+      - name: "memory"
+        nominalQuota: 128Gi
+        # borrowingLimit: 0
+        # lendingLimit: 0
+      - name: "nvidia.com/gpu"
+        nominalQuota: 16
+        # borrowingLimit: 0
+        # lendingLimit: 0
+      - name: "nvidia.com/roce_gdr"
+        nominalQuota: 4
+        # borrowingLimit: 0
+        # lendingLimit: 0
+      - name: "pods"
+        nominalQuota: 100
+        # borrowingLimit: 0
+        # lendingLimit: 0
+EOF
+```
+Edit the above quantities to adjust the quota to the desired values. Pod counts
+are optional and can be omitted from the list of covered resources.
+
+Uncomment all `borrowingLimit` lines to prevent this namespace from borrowing
+quota from other namespaces. Uncomment all `lendingLimit` lines to prevent other
+namespaces from borrowing quota from this namespace.
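If the manifest is kept in a file, the commented limits can be enabled with `sed` instead of by hand. This is a minimal sketch: the helper name `uncomment_limit` is hypothetical, and it assumes the comment lines have exactly the `# borrowingLimit: 0` form shown in the manifest above.

```sh
# Uncomment every "# <field>: ..." line for the given field, keeping indentation.
# Usage: uncomment_limit FIELD < manifest.yaml
uncomment_limit() {
  sed "s/^\([[:space:]]*\)# \($1:\)/\1\2/"
}

# Demo on a fragment of the ClusterQueue manifest: only borrowingLimit is enabled.
uncomment_limit borrowingLimit <<'EOF'
      - name: "cpu"
        nominalQuota: 8000m
        # borrowingLimit: 0
        # lendingLimit: 0
EOF
```

In the demo, the `borrowingLimit` line loses its `# ` prefix while the `lendingLimit` line stays commented. Chaining a second call with `lendingLimit` enables both.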
+
+Create a `LocalQueue` to bind the `ClusterQueue` to the namespace:
+```sh
+oc apply -n team1 -f- << EOF
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: LocalQueue
+metadata:
+  name: default-queue
+spec:
+  clusterQueue: team1-cluster-queue
+EOF
+```
+We recommend naming the local queue `default-queue` as `AppWrappers` will
+default to this queue name.
+
@@ -0,0 +1,23 @@
+# Uninstall
+
+***First, remove all team projects and corresponding cluster queues.***
+
+Then, to uninstall the MLBatch controllers and reclaim the corresponding
+namespaces, run:
+```sh
+# OpenShift AI uninstall
+oc delete dsc mlbatch-dsc
+oc delete dsci mlbatch-dsci
+oc delete subscription -n redhat-ods-operator rhods-operator
+oc delete csv -n redhat-ods-operator -l operators.coreos.com/rhods-operator.redhat-ods-operator
+oc delete crd featuretrackers.features.opendatahub.io \
+  dscinitializations.dscinitialization.opendatahub.io \
+  datascienceclusters.datasciencecluster.opendatahub.io
+oc delete operators rhods-operator.redhat-ods-operator
+oc delete operatorgroup -n redhat-ods-operator rhods-operator
+oc delete namespace redhat-ods-applications redhat-ods-monitoring redhat-ods-operator
+
+# Coscheduler uninstall
+helm uninstall -n scheduler-plugins scheduler-plugins
+oc delete namespace scheduler-plugins
+```
@@ -0,0 +1,26 @@
+# Upgrading from RHOAI 2.20
+
+These instructions assume you installed and configured RHOAI 2.20 following
+the MLBatch [install instructions for RHOAI-v2.20](../setup.RHOAI-v2.20/CLUSTER-SETUP.md).
+
+Your subscription will have automatically created an unapproved
+install plan to upgrade to RHOAI 2.21.
+
+Before beginning, verify that the expected install plan exists:
+```sh
+oc get ip -n redhat-ods-operator
+```
+Typical output would be:
+```
+NAME            CSV                     APPROVAL   APPROVED
+install-kpzzl   rhods-operator.2.21.0   Manual     false
+install-nqrbp   rhods-operator.2.20.0   Manual     true
+```
+
+There are no MLBatch modifications to the default RHOAI configuration maps
+beyond those already made in previous installs. Therefore, you can simply
+approve the install plan, replacing the example plan name below with the actual
+value on your cluster:
+```sh
+oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' install-kpzzl
+```
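Because the plan name is generated, it can help to extract it from the `oc get ip` output rather than copy it by hand. The sketch below is one way to do that with `awk`; the function name `pending_install_plan` is hypothetical, and it assumes the table headers `NAME`, `CSV`, and `APPROVED` appear as in the sample output above.

```sh
# Print the NAME of the not-yet-approved install plan for a given CSV.
# Reads `oc get ip` table output on stdin; columns are located by header name.
pending_install_plan() {
  awk -v csv="$1" '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    $(col["CSV"]) == csv && $(col["APPROVED"]) == "false" { print $(col["NAME"]) }
  '
}

# Demo on the sample output shown above.
pending_install_plan rhods-operator.2.21.0 <<'EOF'
NAME            CSV                     APPROVAL   APPROVED
install-kpzzl   rhods-operator.2.21.0   Manual     false
install-nqrbp   rhods-operator.2.20.0   Manual     true
EOF
```

On the sample data this prints `install-kpzzl`. Against a live cluster, the same function would be fed with `oc get ip -n redhat-ods-operator | pending_install_plan rhods-operator.2.21.0`.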
@@ -0,0 +1,4 @@
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: ResourceFlavor
+metadata:
+  name: default-flavor
@@ -0,0 +1,32 @@
+apiVersion: datasciencecluster.opendatahub.io/v1
+kind: DataScienceCluster
+metadata:
+  name: mlbatch-dsc
+spec:
+  components:
+    codeflare:
+      managementState: Managed
+    dashboard:
+      managementState: Removed
+    datasciencepipelines:
+      managementState: Removed
+    kserve:
+      managementState: Removed
+      serving:
+        ingressGateway:
+          certificate:
+            type: SelfSigned
+        managementState: Removed
+        name: knative-serving
+    kueue:
+      managementState: Managed
+    modelmeshserving:
+      managementState: Removed
+    ray:
+      managementState: Managed
+    trainingoperator:
+      managementState: Managed
+    trustyai:
+      managementState: Removed
+    workbenches:
+      managementState: Removed
@@ -0,0 +1,14 @@
+apiVersion: dscinitialization.opendatahub.io/v1
+kind: DSCInitialization
+metadata:
+  name: mlbatch-dsci
+spec:
+  applicationsNamespace: redhat-ods-applications
+  monitoring:
+    managementState: Managed
+    namespace: redhat-ods-monitoring
+  serviceMesh:
+    managementState: Removed
+  trustedCABundle:
+    customCABundle: ""
+    managementState: Managed