Rhoai 210 #3

Merged
merged 49 commits into from
Jul 17, 2024
50942b2
210 updates
dmarcus-wire Jul 3, 2024
0c2cd69
updates for 210
dmarcus-wire Jul 3, 2024
627bd17
lint
dmarcus-wire Jul 3, 2024
d5a7f85
update docs for 2.10 from 2.9
dmarcus-wire Jul 3, 2024
c2bae4f
updated readme
dmarcus-wire Jul 3, 2024
e0def2f
fix: lint
codekow Jul 5, 2024
7b1394a
Merge branch 'main' into rhoai-210
codekow Jul 5, 2024
d57bdcb
fix: lint
codekow Jul 5, 2024
c4662c0
update: minor text edits
codekow Jul 5, 2024
4484617
update: docs
codekow Jul 5, 2024
17a4bc0
update: web-term
codekow Jul 5, 2024
ecd27b7
update: no startingCSV
codekow Jul 5, 2024
e9f8acb
update: docs
codekow Jul 5, 2024
77d52b2
update: wordlist
codekow Jul 5, 2024
eff5e24
update: yaml
codekow Jul 5, 2024
9b71b7a
update: docs
codekow Jul 5, 2024
78488f6
update: nvidia configs
codekow Jul 5, 2024
8c6b998
fix: htpasswd name
codekow Jul 5, 2024
f9d724b
update: serverless og
codekow Jul 5, 2024
5582bc1
fix: lint
codekow Jul 5, 2024
c99d0b5
update: all the things
codekow Jul 5, 2024
58b02f7
fix: one path
codekow Jul 5, 2024
8a12c75
update: dsci
codekow Jul 5, 2024
0234077
add: answer key
codekow Jul 6, 2024
e3791b4
update: docs
codekow Jul 6, 2024
84062a3
update: gpu sample in sandbox
codekow Jul 8, 2024
2ee2df9
add: authorino instance ns
codekow Jul 8, 2024
324162f
add: datasci label
codekow Jul 8, 2024
49a58ce
update: comment out patch
codekow Jul 8, 2024
49ede99
add: link
codekow Jul 8, 2024
0fc63d4
update: docs
codekow Jul 8, 2024
4dd3afe
fix: lint
codekow Jul 8, 2024
52b2202
fix: kubeadmin
codekow Jul 8, 2024
f2c0f04
fix: kubeadmin
codekow Jul 8, 2024
2952674
add: shortcut functions
codekow Jul 8, 2024
8d92ebb
add: upgrade note
codekow Jul 8, 2024
e02740f
update: easy button
codekow Jul 8, 2024
052883a
update: easy button
codekow Jul 8, 2024
f52b1b5
fix: JSON to YAML
codekow Jul 11, 2024
a74803e
cleanup: machineset
codekow Jul 11, 2024
7988518
clean: nfd instance
codekow Jul 11, 2024
fd606a2
clean: nfd instance
codekow Jul 11, 2024
ea22ba1
fix: htpasswd
codekow Jul 11, 2024
b6eb786
cleanup: gpu stuff
codekow Jul 15, 2024
14a3a38
update procedure
dmarcus-wire Jul 17, 2024
636d68f
update expected output
dmarcus-wire Jul 17, 2024
fdb31b1
update expected output label gpu
dmarcus-wire Jul 17, 2024
ea46bff
wordlist
dmarcus-wire Jul 17, 2024
6ca3393
wordlist
dmarcus-wire Jul 17, 2024
364 changes: 256 additions & 108 deletions .wordlist-md

Large diffs are not rendered by default.

14 changes: 7 additions & 7 deletions README.md
Original file line number Diff line number Diff line change
@@ -3,19 +3,19 @@
This is the Hobbyist Guide to Installing and Configuring RHOAI for customers. Bring your towel. This repo is intentionally imperative to aggregate the various official docs into a single markdown and paves the way for declarative automation in the [ai-gitops-catalog](https://github.com/redhat-na-ssa/demo-ai-gitops-catalog).

- OCP Instance: AWS with OpenShift Open Environment
-- OCP Version: 4.14.27
-- RHOAI Version: 2.9.1
+- OCP Version: 4.15
+- RHOAI Version: 2.10

```shell
.
├── LICENSE
├── README.md
├── notes
-│ ├── 00_FEATURES.md # Overview of the features in RHOAI 2.9
-│ ├── 01_DASHBOARD.md # Deep dive into the RHOAI 2.9 Dashboard
+│ ├── 00_FEATURES.md # Overview of the features in RHOAI
+│ ├── 01_DASHBOARD.md # Deep dive into the RHOAI dashboard
│ ├── 02_CHECKLIST.md # Technical overview for RHOAI install/config
-│ ├── 03_CHECKLIST_PROCEDURE.md # Detailed steps that accompany 02_CHECKLIST.md
-│ ├── 04_TUTORIAL_FRAUD.md # Notes for the Fraud Detection demo
-│ ├── 05_TUTORIAL_DISTR_WORKLOADS.md # Notes for the Distributed Workloads demo
+│ ├── 03_CHECKLIST_PROCEDURE.md # Additional detailed steps
+│ ├── 04_TUTORIAL_FRAUD.md # Notes for the fraud detection demo
+│ ├── 05_TUTORIAL_DISTR_WORKLOADS.md # Notes for the distributed workloads demo
│ └── configs # These are the config files used in the 03_CHECKLIST_PROCEDURE.md
```
4 changes: 4 additions & 0 deletions configs/authorino-instance-ns.yaml
@@ -0,0 +1,4 @@
apiVersion: v1
kind: Namespace
metadata:
  name: redhat-ods-applications-auth-provider
2 changes: 1 addition & 1 deletion configs/authorino-subscription.yaml
@@ -9,4 +9,4 @@ spec:
   name: authorino-operator
   source: redhat-operators
   sourceNamespace: openshift-marketplace
-  startingCSV: authorino-operator.v1.0.1
+  # startingCSV: authorino-operator.v1.0.1
73 changes: 73 additions & 0 deletions configs/files/ocp-machineset.yaml
@@ -0,0 +1,73 @@
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  annotations:
    capacity.cluster-autoscaler.kubernetes.io/labels: kubernetes.io/arch=amd64
    machine.openshift.io/GPU: "0"
    machine.openshift.io/memoryMb: "16384"
    machine.openshift.io/vCPU: "4"
  creationTimestamp: "2024-05-28T17:18:56Z"
  generation: 2
  labels:
    machine.openshift.io/cluster-api-cluster: rhoai29-cd8g7
  name: rhoai29-cd8g7-worker-us-east-2a-gpu
  namespace: openshift-machine-api
  resourceVersion: "629586"
  uid: eeb16140-46fa-4363-8792-1a0022699bb8
spec:
  replicas: 2
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: rhoai29-cd8g7
      machine.openshift.io/cluster-api-machineset: rhoai29-cd8g7-worker-us-east-2a-gpu
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: rhoai29-cd8g7
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: rhoai29-cd8g7-worker-us-east-2a-gpu
    spec:
      lifecycleHooks: {}
      metadata: {}
      providerSpec:
        value:
          ami:
            id: ami-049d8fda91038a0fd
          apiVersion: machine.openshift.io/v1beta1
          blockDevices:
          - ebs:
              encrypted: true
              iops: 0
              kmsKey:
                arn: ""
              volumeSize: 120
              volumeType: gp3
          credentialsSecret:
            name: aws-cloud-credentials
          deviceIndex: 0
          iamInstanceProfile:
            id: rhoai29-cd8g7-worker-profile
          instanceType: g4dn.xlarge
          kind: AWSMachineProviderConfig
          metadata:
            creationTimestamp: null
          metadataServiceOptions: {}
          placement:
            availabilityZone: us-east-2a
            region: us-east-2
          securityGroups:
          - filters:
            - name: tag:Name
              values:
              - rhoai29-cd8g7-worker-sg
          subnet:
            filters:
            - name: tag:Name
              values:
              - rhoai29-cd8g7-private-us-east-2a
          tags:
          - name: kubernetes.io/cluster/rhoai29-cd8g7
            value: owned
          userDataSecret:
            name: worker-user-data
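
Note: the manifest above was exported from a live cluster, so it still carries server-populated metadata (`creationTimestamp`, `generation`, `resourceVersion`, `uid`). A minimal sketch of one way to drop those fields before reusing the file follows. `strip_server_fields` is a hypothetical helper, not part of this PR, and it only filters text on stdin.

```shell
#!/bin/bash

# Hypothetical helper (an assumption for illustration, not in the PR):
# delete server-populated metadata lines from a manifest read on stdin.
# Note: this also removes the nested "creationTimestamp: null" line.
strip_server_fields(){
  sed -E '/^[[:space:]]*(creationTimestamp|generation|resourceVersion|uid):/d'
}
```

Possible usage: `strip_server_fields < configs/files/ocp-machineset.yaml`. Cluster-specific values such as the `rhoai29-cd8g7` infrastructure ID and the AMI ID are left untouched and would still need to be changed by hand.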
12 changes: 12 additions & 0 deletions configs/fix-kubeadmin.yaml
@@ -0,0 +1,12 @@
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: fix-rhoai-kubeadmin
subjects:
  - kind: User
    apiGroup: rbac.authorization.k8s.io
    name: 'kube:admin'
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
153 changes: 153 additions & 0 deletions configs/functions.sh
@@ -0,0 +1,153 @@
#!/bin/bash

ocp_control_nodes_not_schedulable(){
  oc patch schedulers.config.openshift.io/cluster --type merge --patch '{"spec":{"mastersSchedulable": false}}'
}

ocp_control_nodes_schedulable(){
  oc patch schedulers.config.openshift.io/cluster --type merge --patch '{"spec":{"mastersSchedulable": true}}'
}

ocp_gpu_taint_nodes(){
  oc adm taint node -l node-role.kubernetes.io/gpu nvidia-gpu-only=:NoSchedule --overwrite
  oc adm drain -l node-role.kubernetes.io/gpu --ignore-daemonsets --delete-emptydir-data
  oc adm uncordon -l node-role.kubernetes.io/gpu
}

ocp_gpu_untaint_nodes(){
  oc adm taint node -l node-role.kubernetes.io/gpu nvidia-gpu-only=:NoSchedule-
}

ocp_gpu_label_nodes_from_nfd(){
  oc label node -l nvidia.com/gpu.machine node-role.kubernetes.io/gpu=''
}

ocp_aws_clone_worker_machineset(){
  [ -z "${1}" ] && \
  echo "
    usage: ocp_aws_clone_worker_machineset < instance type, default g4dn.4xlarge > < machine set name >
  "

  INSTANCE_TYPE=${1:-g4dn.4xlarge}
  # default short name is the instance family, e.g. g4dn.4xlarge -> g4dn
  SHORT_NAME=${2:-${INSTANCE_TYPE%.*}}

  MACHINE_SET_NAME=$(oc -n openshift-machine-api get machinesets.machine.openshift.io -o name | grep "${SHORT_NAME}" | head -n1)
  MACHINE_SET_WORKER=$(oc -n openshift-machine-api get machinesets.machine.openshift.io -o name | grep worker | head -n1)

  # check for an existing instance machine set
  if [ -n "${MACHINE_SET_NAME}" ]; then
    echo "Exists: machineset - ${MACHINE_SET_NAME}"
  else
    echo "Creating: machineset - ${SHORT_NAME}"
    oc -n openshift-machine-api \
      get "${MACHINE_SET_WORKER}" -o yaml | \
        sed '/machine/ s/-worker/-'"${INSTANCE_TYPE}"'/g
          /^  name:/ s/cluster-.*/'"${SHORT_NAME}"'/g
          /name/ s/-worker/-'"${SHORT_NAME}"'/g
          s/instanceType.*/instanceType: '"${INSTANCE_TYPE}"'/
          /cluster-api-autoscaler/d
          s/replicas.*/replicas: 0/' | \
      oc apply -f -

    # look the name up again: it was empty before the machine set existed,
    # and the patch below needs a real resource name
    MACHINE_SET_NAME=$(oc -n openshift-machine-api get machinesets.machine.openshift.io -o name | grep "${SHORT_NAME}" | head -n1)
  fi

  # cosmetic pretty
  oc -n openshift-machine-api \
    patch "${MACHINE_SET_NAME}" \
    --type=merge --patch '{"spec":{"template":{"spec":{"metadata":{"labels":{"node-role.kubernetes.io/'"${SHORT_NAME}"'":""}}}}}}'
}

ocp_aws_cluster_autoscaling(){
  oc apply -k https://github.com/redhat-na-ssa/demo-ai-gitops-catalog/components/configs/cluster/autoscale/overlays/gpus-accelerator-label?ref=v0.04

  ocp_aws_create_gpu_machineset g4dn.4xlarge
  ocp_create_machineset_autoscale 0 3

  # scale workers to 1
  WORKER_MS="$(oc -n openshift-machine-api get machineset -o name | grep worker)"
  ocp_scale_machineset 1 "${WORKER_MS}"

  ocp_control_nodes_not_schedulable
}

ocp_aws_create_gpu_machineset(){
  # https://aws.amazon.com/ec2/instance-types/g4
  # single gpu: g4dn.{2,4,8,16}xlarge
  # multi gpu: g4dn.12xlarge
  # practical: g4ad.4xlarge
  # a100 (MIG): p4d.24xlarge
  # h100 (MIG): p5.48xlarge

  # https://aws.amazon.com/ec2/instance-types/dl1
  # 8 x gaudi: dl1.24xlarge

  INSTANCE_TYPE=${1:-g4dn.4xlarge}

  ocp_aws_clone_worker_machineset "${INSTANCE_TYPE}"

  MACHINE_SET_TYPE=$(oc -n openshift-machine-api get machinesets.machine.openshift.io -o name | grep "${INSTANCE_TYPE%.*}" | head -n1)

  echo "Patching: ${MACHINE_SET_TYPE}"

  # cosmetic
  oc -n openshift-machine-api \
    patch "${MACHINE_SET_TYPE}" \
    --type=merge --patch '{"spec":{"template":{"spec":{"metadata":{"labels":{"node-role.kubernetes.io/gpu":""}}}}}}'

  # taint nodes for gpu-only workloads
  oc -n openshift-machine-api \
    patch "${MACHINE_SET_TYPE}" \
    --type=merge --patch '{"spec":{"template":{"spec":{"taints":[{"key":"nvidia-gpu-only","value":"","effect":"NoSchedule"}]}}}}'

  # should use the default profile
  # oc -n openshift-machine-api \
  #   patch "${MACHINE_SET_TYPE}" \
  #   --type=merge --patch '{"spec":{"template":{"spec":{"metadata":{"labels":{"nvidia.com/device-plugin.config":"no-time-sliced"}}}}}}'

  # should help auto provisioner
  oc -n openshift-machine-api \
    patch "${MACHINE_SET_TYPE}" \
    --type=merge --patch '{"spec":{"template":{"spec":{"metadata":{"labels":{"cluster-api/accelerator":"nvidia-gpu"}}}}}}'

  oc -n openshift-machine-api \
    patch "${MACHINE_SET_TYPE}" \
    --type=merge --patch '{"metadata":{"labels":{"cluster-api/accelerator":"nvidia-gpu"}}}'

  oc -n openshift-machine-api \
    patch "${MACHINE_SET_TYPE}" \
    --type=merge --patch '{"spec":{"template":{"spec":{"providerSpec":{"value":{"instanceType":"'"${INSTANCE_TYPE}"'"}}}}}}'
}

ocp_create_machineset_autoscale(){
  MACHINE_MIN=${1:-0}
  MACHINE_MAX=${2:-4}
  MACHINE_SETS=${3:-$(oc -n openshift-machine-api get machinesets.machine.openshift.io -o name | sed 's@.*/@@' )}

  for set in ${MACHINE_SETS}
  do
cat << YAML | oc apply -f -
apiVersion: "autoscaling.openshift.io/v1beta1"
kind: "MachineAutoscaler"
metadata:
  name: "${set}"
  namespace: "openshift-machine-api"
spec:
  minReplicas: ${MACHINE_MIN}
  maxReplicas: ${MACHINE_MAX}
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: "${set}"
YAML
  done
}

ocp_scale_machineset(){
  REPLICAS=${1:-1}
  MACHINE_SETS=${2:-$(oc -n openshift-machine-api get machineset -o name)}

  # scale workers
  echo "${MACHINE_SETS}" | \
    xargs \
      oc -n openshift-machine-api \
      scale --replicas="${REPLICAS}"
}
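
Note on `ocp_aws_clone_worker_machineset`: the renaming is done entirely by `sed` on the exported worker manifest, which is easy to misread. The sketch below isolates that rewrite so it can be tried without a cluster. `rename_machineset` is a hypothetical name, and the sketch deliberately uses only a subset of the rules from the function above (the `/^ name:/` rule is left out), so treat it as an illustration of the behavior rather than the exact script.

```shell
#!/bin/bash

# Hypothetical helper (not in the PR): apply a subset of the sed rules from
# ocp_aws_clone_worker_machineset to a manifest read on stdin.
rename_machineset(){
  INSTANCE_TYPE=${1:-g4dn.4xlarge}
  # strip the size suffix to get the instance family: g4dn.4xlarge -> g4dn
  SHORT_NAME=${2:-${INSTANCE_TYPE%.*}}

  # 1. lines containing "machine": -worker becomes -<instance type>
  # 2. remaining lines containing "name": -worker becomes -<short name>
  # 3. instanceType is overwritten
  # 4. autoscaler annotations are dropped
  # 5. the clone starts with zero replicas
  sed '/machine/ s/-worker/-'"${INSTANCE_TYPE}"'/g
    /name/ s/-worker/-'"${SHORT_NAME}"'/g
    s/instanceType.*/instanceType: '"${INSTANCE_TYPE}"'/
    /cluster-api-autoscaler/d
    s/replicas.*/replicas: 0/'
}
```

Feeding it a line such as `  name: demo-abc12-worker-us-east-2a` (a made-up cluster name) with `rename_machineset g4dn.4xlarge` yields `  name: demo-abc12-g4dn-us-east-2a`, while a `machine.openshift.io/cluster-api-machineset:` label with the same value becomes `demo-abc12-g4dn.4xlarge-us-east-2a`. The machine set name therefore carries the short name and the selector labels carry the full instance type, which is why the later lookups `grep` for `${INSTANCE_TYPE%.*}`.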
9 changes: 0 additions & 9 deletions configs/htpass-secret.yaml

This file was deleted.

4 changes: 2 additions & 2 deletions configs/htpass-cr.yaml → configs/htpasswd-cr.yaml
@@ -5,11 +5,11 @@ metadata:
 spec:
   identityProviders:
   # This provider name is prefixed to provider user names to form an identity name.
-  - name: my_htpasswd_provider
+  - name: htpasswd
     # Controls how mappings are established between this provider’s identities and User objects.
     mappingMethod: claim
     type: HTPasswd
     htpasswd:
       fileData:
         # An existing secret containing a file generated using htpasswd.
-        name: htpass-secret
+        name: htpasswd-secret
9 changes: 9 additions & 0 deletions configs/htpasswd-secret.yaml
@@ -0,0 +1,9 @@
apiVersion: v1
kind: Secret
metadata:
  name: htpasswd-secret
  namespace: openshift-config
type: Opaque
stringData:
  htpasswd: |
    # <htpasswd_file_contents>
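
Note: the `# <htpasswd_file_contents>` placeholder has to be replaced with real `htpasswd` output before the identity provider can authenticate anyone. A minimal sketch of one way to fill it in follows, assuming a file that was already generated with the `htpasswd` tool. `render_htpasswd_secret` and the `users.htpasswd` default are hypothetical names, not part of this PR.

```shell
#!/bin/bash

# Hypothetical helper (not in the PR): print the Secret manifest with the
# contents of an existing htpasswd file embedded under stringData.htpasswd.
render_htpasswd_secret(){
  HTPASSWD_FILE=${1:-users.htpasswd}

  # each line of the file is indented by four spaces so that it sits inside
  # the YAML block scalar opened by "htpasswd: |"
  cat << YAML
apiVersion: v1
kind: Secret
metadata:
  name: htpasswd-secret
  namespace: openshift-config
type: Opaque
stringData:
  htpasswd: |
$(sed 's/^/    /' "${HTPASSWD_FILE}")
YAML
}
```

Possible usage: `render_htpasswd_secret users.htpasswd | oc apply -f -`. The secret name has to stay in sync with `fileData.name` in `configs/htpasswd-cr.yaml`, which this PR sets to `htpasswd-secret`.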
2 changes: 0 additions & 2 deletions configs/nfd-instance.yaml
@@ -4,8 +4,6 @@ metadata:
   name: nfd-instance
   namespace: openshift-nfd
 spec:
-  customConfig:
-    configData:
   operand:
     image: 'registry.redhat.io/openshift4/ose-node-feature-discovery@sha256:96984b49c21fa4b76e8ca26735521a0a32daa4c5e330397641fe47ae4d774df4'
     servicePort: 12000