Commit: Update gitops approach to cover latest features & readme (#377)

sarvesh-cast authored Sep 26, 2024
1 parent 7707f93 commit 65f04d9
Showing 6 changed files with 351 additions and 41 deletions.
154 changes: 137 additions & 17 deletions examples/eks/eks_cluster_gitops/README.md
## EKS and CAST AI example for GitOps onboarding flow

The following example shows how to onboard an EKS cluster to CAST AI using the GitOps flow.
## GitOps flow

Terraform managed ==> IAM roles, CAST AI Node Configurations, CAST AI Node Templates and CAST AI Autoscaler policies

Helm managed ==> All Castware components such as `castai-agent`, `castai-cluster-controller`, `castai-evictor`, `castai-spot-handler`, `castai-kvisor`, `castai-workload-autoscaler`, `castai-pod-pinner`, `castai-egressd` are installed by other means (e.g. ArgoCD, manual Helm releases, etc.)


```
+---------------------------------------------+
| Start                                       |
+---------------------------------------------+
                      |
                      | AWS CLI
                      v
+---------------------------------------------+
| 1. Check whether the EKS auth mode is       |
|    API / API_AND_CONFIG_MAP                 |
+---------------------------------------------+
          |                       |
         YES                      NO
          |                       |
          v                       v
+--------------------+  +---------------------------------+
| No action needed   |  | 2. Add the CAST AI role to the  |
| from the user      |  |    aws-auth configmap           |
+--------------------+  +---------------------------------+
          |                       |
          +-----------+-----------+
                      |
                      | TERRAFORM
                      v
+---------------------------------------------+
| 3. Update TF vars                           |
| 4. Terraform init & apply                   |
+---------------------------------------------+
                      |
                      | GITOPS
                      v
+---------------------------------------------+
| 5. Deploy Helm charts of castai-agent,      |
|    castai-cluster-controller,               |
|    castai-evictor, castai-spot-handler,     |
|    castai-kvisor,                           |
|    castai-workload-autoscaler,              |
|    castai-pod-pinner                        |
+---------------------------------------------+
                      |
                      v
+---------------------------------------------+
| End                                         |
+---------------------------------------------+
```

Steps to successfully onboard an EKS cluster to CAST AI using the GitOps flow:

Prerequisites:
- CAST AI account
- Obtained CAST AI [API Access key](https://docs.cast.ai/docs/authentication#obtaining-api-access-key) with Full Access

### Step 1: Get EKS cluster authentication mode
```
CLUSTER_NAME=""
REGION=""
current_auth_mode=$(aws eks describe-cluster --name "$CLUSTER_NAME" --region "$REGION" --query 'cluster.accessConfig.authenticationMode' --output text)
echo "Authentication mode is $current_auth_mode"
```
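The YES/NO branch in the flow above can be sketched in shell. This is illustrative only: the echo messages are made up, and in practice `current_auth_mode` comes from the Step 1 command rather than being hard-coded.

```shell
# Illustrative value; in practice this is set by the describe-cluster call in Step 1.
current_auth_mode="API_AND_CONFIG_MAP"

case "$current_auth_mode" in
  *API*)  # "API" or "API_AND_CONFIG_MAP": access entries are supported
    echo "Auth mode $current_auth_mode: no aws-auth change needed, skip Step 2" ;;
  *)      # "CONFIG_MAP": the CAST AI role must be added to aws-auth
    echo "Auth mode $current_auth_mode: add the CAST AI role to aws-auth (Step 2)" ;;
esac
```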


### Step 2: If the EKS auth mode is API/API_AND_CONFIG_MAP, this step can be SKIPPED.
#### Add the CAST AI role to the aws-auth configmap. The configmap may already contain other entries, so append the role below to `mapRoles` rather than replacing it.
```yaml
apiVersion: v1
data:
  mapRoles: |
    - rolearn: arn:aws:iam::028075177508:role/castai-eks-<clustername>
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
```


### Step 3 & 4: Update TF vars, then run Terraform init, plan & apply
After a successful apply, the cluster will show as `Connecting` in the CAST AI console. \
Note the generated `CASTAI_CLUSTER_ID` from the Terraform outputs.
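The variable names below are the ones referenced by this example's `castai.tf`; all values are illustrative placeholders (a sketch of `tf.vars`, not a complete file — the example may define additional variables, such as the CAST AI API token):

```hcl
aws_account_id             = "123456789012"
aws_cluster_region         = "eu-central-1"
aws_cluster_name           = "my-eks-cluster"
vpc_id                     = "vpc-0123456789abcdef0"
subnets                    = ["subnet-0123456789abcdef0", "subnet-0fedcba9876543210"]
cluster_security_group_id  = "sg-0123456789abcdef0"
node_security_group_id     = "sg-0fedcba9876543210"
delete_nodes_on_disconnect = true
```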


### Step 5: Deploy Helm charts of CAST AI components
Components: `castai-cluster-controller`, `castai-evictor`, `castai-spot-handler`, `castai-kvisor`, `castai-workload-autoscaler`, `castai-pod-pinner` \
After all CAST AI components are installed in the cluster, its status in the CAST AI console changes from `Connecting` to `Connected`, which means the cluster onboarding completed successfully.

```bash
CASTAI_API_KEY=""
CASTAI_CLUSTER_ID=""
CAST_CONFIG_SOURCE="castai-cluster-controller"
#### Mandatory Component: Castai-agent
helm upgrade -i castai-agent castai-helm/castai-agent -n castai-agent \
--set apiKey=$CASTAI_API_KEY \
--set provider=eks \
--create-namespace
#### Mandatory Component: castai-cluster-controller
helm upgrade -i cluster-controller castai-helm/castai-cluster-controller -n castai-agent \
--set castai.apiKey=$CASTAI_API_KEY \
--set castai.clusterID=$CASTAI_CLUSTER_ID \
--set autoscaling.enabled=true
#### castai-spot-handler
helm upgrade -i castai-spot-handler castai-helm/castai-spot-handler -n castai-agent \
--set castai.clusterID=$CASTAI_CLUSTER_ID \
--set castai.provider=aws
#### castai-evictor
helm upgrade -i castai-evictor castai-helm/castai-evictor -n castai-agent --set replicaCount=0
#### castai-pod-pinner
helm upgrade -i castai-pod-pinner castai-helm/castai-pod-pinner -n castai-agent \
--set castai.apiKey=$CASTAI_API_KEY \
--set castai.clusterID=$CASTAI_CLUSTER_ID \
--set replicaCount=0
#### castai-workload-autoscaler
helm upgrade -i castai-workload-autoscaler castai-helm/castai-workload-autoscaler -n castai-agent \
--set castai.apiKeySecretRef=$CAST_CONFIG_SOURCE \
--set castai.configMapRef=$CAST_CONFIG_SOURCE
#### castai-kvisor
helm upgrade -i castai-kvisor castai-helm/castai-kvisor -n castai-agent \
--set castai.apiKey=$CASTAI_API_KEY \
--set castai.clusterID=$CASTAI_CLUSTER_ID \
--set controller.extraArgs.kube-linter-enabled=true \
--set controller.extraArgs.image-scan-enabled=true \
--set controller.extraArgs.kube-bench-enabled=true \
--set controller.extraArgs.kube-bench-cloud-provider=eks
```
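Most of the commands above repeat the same `--set castai.apiKey=...` and `--set castai.clusterID=...` pairs; a small helper can build them once. This is a sketch, not part of the example: the function name and placeholder values are illustrative.

```shell
# Placeholders; use the real values noted in steps 3-4.
CASTAI_API_KEY="example-key"
CASTAI_CLUSTER_ID="example-id"

# Build the --set arguments shared by most releases above.
common_sets() {
  printf -- '--set castai.apiKey=%s --set castai.clusterID=%s' \
    "$CASTAI_API_KEY" "$CASTAI_CLUSTER_ID"
}

echo "helm upgrade -i castai-pod-pinner castai-helm/castai-pod-pinner -n castai-agent $(common_sets)"
```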

## Steps Overview

1. If the EKS auth mode is not API/API_AND_CONFIG_MAP, update the [aws-auth](https://docs.aws.amazon.com/eks/latest/userguide/add-user-role.html) configmap with the instance profile used by CAST AI. This instance profile is used by CAST AI managed nodes to communicate with the EKS control plane. An example entry can be found [here](https://github.com/castai/terraform-provider-castai/blob/157babd57b0977f499eb162e9bee27bee51d292a/examples/eks/eks_cluster_assumerole/eks.tf#L28-L38).
2. Configure the `tf.vars.example` file with the required values. If the EKS cluster is already managed by Terraform, you could instead directly reference those resources.
3. Run `terraform init`
4. Run `terraform apply` and make a note of the `cluster_id` output value. At this stage the cluster shows as `Connecting` in the CAST AI console.
5. Install CAST AI components using Helm. Use the `cluster_id` and `api_key` values to configure the Helm releases:
   - Set the `castai.apiKey` property to `api_key`
   - Set the `castai.clusterID` property to `cluster_id`
6. After all CAST AI components are installed in the cluster, its status in the CAST AI console changes from `Connecting` to `Connected`, which means the cluster onboarding completed successfully.


## Importing already onboarded cluster to Terraform

This example can also be used to import an EKS cluster to Terraform that was already onboarded to the CAST AI console through the [script](https://docs.cast.ai/docs/cluster-onboarding#how-it-works).
To import an existing cluster, follow steps 1-3 above and change the `castai_node_configuration.default` Node Configuration name.
This allows managing an already onboarded cluster's CAST AI Node Configurations and Node Templates through IaC.
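For instance, the Node Configuration from this example's `castai.tf` could be renamed as follows (a sketch: only the `name` value differs from the example, and `default-tf` is an arbitrary choice):

```hcl
resource "castai_node_configuration" "default" {
  cluster_id     = castai_eks_cluster.my_castai_cluster.id
  name           = "default-tf" # any name other than the one created by the onboarding script
  disk_cpu_ratio = 0
  min_disk_size  = 100
  subnets        = var.subnets
  eks {
    security_groups = [
      var.cluster_security_group_id,
      var.node_security_group_id
    ]
    instance_profile_arn = module.castai-eks-role-iam.instance_profile_arn
  }
}
```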
182 changes: 174 additions & 8 deletions examples/eks/eks_cluster_gitops/castai.tf
# Create IAM resources required for connecting cluster to CAST AI.
locals {
  resource_name_postfix = var.aws_cluster_name
  account_id            = data.aws_caller_identity.current.account_id
  partition             = data.aws_partition.current.partition

  instance_profile_role_name = "castai-eks-${local.resource_name_postfix}-node-role"
  iam_role_name              = "castai-eks-${local.resource_name_postfix}-cluster-role"
  iam_inline_policy_name     = "CastEKSRestrictedAccess"
  role_name                  = "castai-eks-role"
}

data "aws_caller_identity" "current" {}

data "aws_partition" "current" {}

data "aws_eks_cluster" "existing_cluster" {
  name = var.aws_cluster_name
}

# Configure EKS cluster connection using CAST AI eks-cluster module.
resource "castai_eks_clusterid" "cluster_id" {
  account_id   = data.aws_caller_identity.current.account_id
  region       = var.aws_cluster_region
  cluster_name = var.aws_cluster_name
}

resource "castai_eks_user_arn" "castai_user_arn" {
  cluster_id = castai_eks_clusterid.cluster_id.id
}

module "castai-eks-role-iam" {
  source = "castai/eks-role-iam/castai"

  aws_account_id     = data.aws_caller_identity.current.account_id
  aws_cluster_region = var.aws_cluster_region
  aws_cluster_name   = var.aws_cluster_name
  aws_cluster_vpc_id = var.vpc_id

  castai_user_arn = castai_eks_user_arn.castai_user_arn.arn

  create_iam_resources_per_cluster = true
}

# Create an access entry when the EKS auth mode is API or API_AND_CONFIG_MAP.
locals {
  access_entry = can(regex("API", data.aws_eks_cluster.existing_cluster.access_config[0].authentication_mode))
}

resource "aws_eks_access_entry" "access_entry" {
  count         = local.access_entry ? 1 : 0
  cluster_name  = local.resource_name_postfix
  principal_arn = module.castai-eks-role-iam.instance_profile_role_arn
  type          = "EC2_LINUX"
}

# Connect EKS cluster to CAST AI
resource "castai_eks_cluster" "my_castai_cluster" {
  account_id                 = var.aws_account_id
  region                     = var.aws_cluster_region
  name                       = local.resource_name_postfix
  delete_nodes_on_disconnect = var.delete_nodes_on_disconnect
  assume_role_arn            = module.castai-eks-role-iam.role_arn
}

# Creates node configuration
resource "castai_node_configuration" "default" {
  cluster_id     = castai_eks_cluster.my_castai_cluster.id
  name           = "default"
  disk_cpu_ratio = 0
  min_disk_size  = 100
  subnets        = var.subnets
  eks {
    security_groups = [
      var.cluster_security_group_id,
      var.node_security_group_id
    ]
    instance_profile_arn = module.castai-eks-role-iam.instance_profile_arn
  }
}

# Promotes node configuration as default node configuration
resource "castai_node_configuration_default" "this" {
  cluster_id       = castai_eks_cluster.my_castai_cluster.id
  configuration_id = castai_node_configuration.default.id
}

resource "castai_node_template" "default_by_castai" {
  cluster_id = castai_eks_cluster.my_castai_cluster.id

  name             = "default-by-castai"
  is_default       = true
  is_enabled       = true
  configuration_id = castai_node_configuration.default.id
  should_taint     = false

  constraints {
    on_demand = true
  }
}

resource "castai_node_template" "example_spot_template" {
  cluster_id = castai_eks_cluster.my_castai_cluster.id

  name             = "example_spot_template"
  is_default       = false
  is_enabled       = true
  configuration_id = castai_node_configuration.default.id
  should_taint     = true

  custom_labels = {
    type = "spot"
  }

  custom_taints {
    key    = "dedicated"
    value  = "spot"
    effect = "NoSchedule"
  }

  constraints {
    spot                                        = true
    use_spot_fallbacks                          = true
    fallback_restore_rate_seconds               = 1800
    enable_spot_diversity                       = true
    spot_diversity_price_increase_limit_percent = 20
    spot_interruption_predictions_enabled       = true
    spot_interruption_predictions_type          = "aws-rebalance-recommendations"
    is_gpu_only                                 = false
    min_cpu                                     = 2
    max_cpu                                     = 16
    min_memory                                  = 4096
    max_memory                                  = 24576
    architectures                               = ["amd64"]
    azs                                         = ["eu-central-1a", "eu-central-1b"]
    customer_specific                           = "disabled"

    instance_families {
      exclude = ["m5"]
    }

    custom_priority {
      instance_families = ["c5"]
      spot              = true
    }
  }
}

resource "castai_autoscaler" "castai_autoscaler_policy" {
  cluster_id = castai_eks_cluster.my_castai_cluster.id

  autoscaler_settings {
    enabled                                 = true
    is_scoped_mode                          = false
    node_templates_partial_matching_enabled = false

    unschedulable_pods {
      enabled = true
    }

    cluster_limits {
      enabled = false

      cpu {
        min_cores = 1
        max_cores = 200
      }
    }

    node_downscaler {
      enabled = true

      empty_nodes {
        enabled = true
      }

      evictor {
        enabled         = true
        aggressive_mode = false
        cycle_interval  = "60s"
        dry_run         = false

        node_grace_period_minutes = 10
        scoped_mode               = false
      }
    }
  }
}