Terraform module for connecting a GKE cluster to CAST AI

Website: https://www.cast.ai

Using the module

A module to connect a GKE cluster to CAST AI.

Requires castai/castai and hashicorp/google providers to be configured.

For Phase 2 onboarding, credentials from terraform-gke-iam are required.

module "castai_gke_cluster" {
  source = "castai/gke-cluster/castai"

  project_id           = var.project_id
  gke_cluster_name     = var.cluster_name
  gke_cluster_location = module.gke.location # cluster region or zone

  gke_credentials            = module.castai_gke_iam.private_key
  delete_nodes_on_disconnect = var.delete_nodes_on_disconnect
  autoscaler_policies_json   = var.autoscaler_policies_json # deprecated since 6.3.x; prefer autoscaler_settings

  default_node_configuration = module.castai_gke_cluster.castai_node_configurations["default"]

  node_configurations = {
    default = {
      disk_cpu_ratio = 25
      subnets        = [module.vpc.subnets_ids[0]]
      tags = {
        "node-config" : "default"
      }

      max_pods_per_node = 110
      network_tags      = ["dev"]
      disk_type         = "pd-balanced"

    }
  }
  node_templates = {
    spot_tmpl = {
      configuration_id = module.castai_gke_cluster.castai_node_configurations["default"]

      should_taint = true

      custom_labels = {
        custom-label-key-1 = "custom-label-value-1"
        custom-label-key-2 = "custom-label-value-2"
      }

      custom_taints = [
        {
          key   = "custom-taint-key-1"
          value = "custom-taint-value-1"
        },
        {
          key   = "custom-taint-key-2"
          value = "custom-taint-value-2"
        }
      ]

      constraints = {
        fallback_restore_rate_seconds = 1800
        spot                          = true
        use_spot_fallbacks            = true
        min_cpu                       = 4
        max_cpu                       = 100
        instance_families = {
          exclude = ["e2"]
        }
        compute_optimized_state = "disabled"
        storage_optimized_state = "disabled"
        is_gpu_only             = false
        architectures           = ["amd64"]
      }

      custom_instances_enabled                      = true
      custom_instances_with_extended_memory_enabled = true
    }
  }

  autoscaler_settings = {
    enabled                                 = true
    node_templates_partial_matching_enabled = false

    unschedulable_pods = {
      enabled = true

      headroom = {
        enabled           = true
        cpu_percentage    = 10
        memory_percentage = 10
      }

      headroom_spot = {
        enabled           = true
        cpu_percentage    = 10
        memory_percentage = 10
      }
    }

    node_downscaler = {
      enabled = true

      empty_nodes = {
        enabled = true
      }

      evictor = {
        aggressive_mode           = false
        cycle_interval            = "5m10s"
        dry_run                   = false
        enabled                   = true
        node_grace_period_minutes = 10
        scoped_mode               = false
      }
    }

    cluster_limits = {
      enabled = true

      cpu = {
        max_cores = 20
        min_cores = 1
      }
    }
  }
}
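
The module block above assumes that the providers and the IAM module it references are already declared elsewhere in the configuration. A minimal sketch of that wiring might look like the following. The variable names, the data sources, and the source and inputs of the castai_gke_iam module are assumptions for illustration and should be checked against the provider and module documentation. The nested kubernetes block is the helm provider 2.x form.

```hcl
# Hypothetical wiring; the names below are illustrative and not part of this module.
provider "castai" {
  api_token = var.castai_api_token
}

provider "google" {
  project = var.project_id
}

# Access token and cluster endpoint used to authenticate the helm provider.
data "google_client_config" "default" {}

data "google_container_cluster" "this" {
  name     = var.cluster_name
  location = module.gke.location
  project  = var.project_id
}

provider "helm" {
  kubernetes {
    host                   = "https://${data.google_container_cluster.this.endpoint}"
    token                  = data.google_client_config.default.access_token
    cluster_ca_certificate = base64decode(data.google_container_cluster.this.master_auth[0].cluster_ca_certificate)
  }
}

# Assumed source and inputs for the IAM module that supplies Phase 2 credentials.
module "castai_gke_iam" {
  source = "castai/gke-iam/castai"

  project_id       = var.project_id
  gke_cluster_name = var.cluster_name
}
```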

Migrating from 3.x.x to 4.x.x

Version 4.x.x changes:

  • Removed custom_label attribute in castai_node_template resource. Use custom_labels instead.

Old configuration:

module "castai-gke-cluster" {
  node_templates = {
    spot_tmpl = {
      custom_label = {
        key = "custom-label-key-1"
        value = "custom-label-value-1"
      }
    }
  }
}

New configuration:

module "castai-gke-cluster" {
  node_templates = {
    spot_tmpl = {
      custom_labels = {
        custom-label-key-1 = "custom-label-value-1"
      }
    }
  }
}

Migrating from 4.x.x to 5.x.x

Version 5.x.x changed:

  • Removed the compute_optimized and storage_optimized attributes from the constraints object of the castai_node_template resource. Use compute_optimized_state and storage_optimized_state instead.

Old configuration:

module "castai-gke-cluster" {
  node_templates = {
    spot_tmpl = {
      constraints = {
        compute_optimized = false
        storage_optimized = true
      }
    }
  }
}

New configuration:

module "castai-gke-cluster" {
  node_templates = {
    spot_tmpl = {
      constraints = {
        compute_optimized_state = "disabled"
        storage_optimized_state = "enabled"
      }
    }
  }
}

Migrating from 6.1.x to 6.3.x

Version 6.3.x changed:

  • Deprecated autoscaler_policies_json attribute. Use autoscaler_settings instead.

Old configuration:

module "castai-gke-cluster" {
  autoscaler_policies_json = <<-EOT
    {
        "enabled": true,
        "unschedulablePods": {
            "enabled": true
        },
        "nodeDownscaler": {
            "enabled": true,
            "emptyNodes": {
                "enabled": true
            },
            "evictor": {
                "aggressiveMode": false,
                "cycleInterval": "5m10s",
                "dryRun": false,
                "enabled": true,
                "nodeGracePeriodMinutes": 10,
                "scopedMode": false
            }
        },
        "nodeTemplatesPartialMatchingEnabled": false,
        "clusterLimits": {
            "cpu": {
                "maxCores": 20,
                "minCores": 1
            },
            "enabled": true
        }
    }
  EOT
}

New configuration:

module "castai-gke-cluster" {
  autoscaler_settings = {
    enabled                                 = true
    node_templates_partial_matching_enabled = false

    unschedulable_pods = {
      enabled = true
    }

    node_downscaler = {
      enabled = true

      empty_nodes = {
        enabled = true
      }

      evictor = {
        aggressive_mode           = false
        cycle_interval            = "5m10s"
        dry_run                   = false
        enabled                   = true
        node_grace_period_minutes = 10
        scoped_mode               = false
      }
    }

    cluster_limits = {
      enabled = true

      cpu = {
        max_cores = 20
        min_cores = 1
      }
    }
  }
}

Examples

Usage examples are located in the Terraform provider repo.

Requirements

| Name | Version |
|------|---------|
| terraform | >= 0.13 |
| castai | ~> 7.17 |
| google | >= 2.49 |
| helm | >= 2.0.0 |
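
The version constraints above can be mirrored in the calling configuration. The sketch below shows one way to do that; the castai/castai and hashicorp/google source addresses come from the usage section, while hashicorp/helm is an assumed registry address.

```hcl
terraform {
  required_version = ">= 0.13"

  required_providers {
    castai = {
      source  = "castai/castai"
      version = "~> 7.17"
    }
    google = {
      source  = "hashicorp/google"
      version = ">= 2.49"
    }
    helm = {
      source  = "hashicorp/helm" # assumed registry address
      version = ">= 2.0.0"
    }
  }
}
```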

Providers

| Name | Version |
|------|---------|
| castai | ~> 7.17 |
| helm | >= 2.0.0 |
| null | n/a |

Modules

No modules.

Resources

| Name | Type |
|------|------|
| castai_autoscaler.castai_autoscaler_policies | resource |
| castai_gke_cluster.castai_cluster | resource |
| castai_node_configuration.this | resource |
| castai_node_configuration_default.this | resource |
| castai_node_template.this | resource |
| castai_workload_scaling_policy.this | resource |
| helm_release.castai_agent | resource |
| helm_release.castai_cloud_proxy | resource |
| helm_release.castai_cluster_controller | resource |
| helm_release.castai_cluster_controller_self_managed | resource |
| helm_release.castai_evictor | resource |
| helm_release.castai_evictor_ext | resource |
| helm_release.castai_evictor_self_managed | resource |
| helm_release.castai_kvisor | resource |
| helm_release.castai_kvisor_self_managed | resource |
| helm_release.castai_pod_pinner | resource |
| helm_release.castai_pod_pinner_self_managed | resource |
| helm_release.castai_spot_handler | resource |
| helm_release.castai_workload_autoscaler | resource |
| helm_release.castai_workload_autoscaler_self_managed | resource |
| null_resource.wait_for_cluster | resource |

Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| agent_values | List of YAML formatted string values for agent helm chart | list(string) | [] | no |
| agent_version | Version of castai-agent helm chart. Default latest | string | null | no |
| api_grpc_addr | CAST AI GRPC API address | string | "api-grpc.cast.ai:443" | no |
| api_url | URL of alternative CAST AI API to be used during development or testing | string | "https://api.cast.ai" | no |
| autoscaler_policies_json | Optional JSON object to override CAST AI cluster autoscaler policies. Deprecated, use autoscaler_settings instead. | string | null | no |
| autoscaler_settings | Optional autoscaler policy definitions to override current autoscaler settings | any | null | no |
| castai_api_token | Optional CAST AI API token created in console.cast.ai API Access keys section. Used only when wait_for_cluster_ready is set to true | string | "" | no |
| castai_components_labels | Optional additional Kubernetes labels for CAST AI pods | map(any) | {} | no |
| cloud_proxy_grpc_url_override | Override for the castai-cloud-proxy gRPC URL | string | null | no |
| cloud_proxy_values | List of YAML formatted strings with castai-cloud-proxy values | list(string) | [] | no |
| cloud_proxy_version | Version of the castai-cloud-proxy Helm chart. Defaults to latest. | string | null | no |
| cluster_controller_values | List of YAML formatted string values for cluster-controller helm chart | list(string) | [] | no |
| cluster_controller_version | Version of castai-cluster-controller helm chart. Default latest | string | null | no |
| default_node_configuration | ID of the default node configuration | string | "" | no |
| default_node_configuration_name | Name of the default node configuration | string | "" | no |
| delete_nodes_on_disconnect | Optionally delete CAST AI created nodes when the cluster is destroyed | bool | false | no |
| evictor_ext_values | List of YAML formatted strings with evictor-ext values | list(string) | [] | no |
| evictor_ext_version | Version of castai-evictor-ext chart. Default latest | string | null | no |
| evictor_values | List of YAML formatted string values for evictor helm chart | list(string) | [] | no |
| evictor_version | Version of castai-evictor chart. Default latest | string | null | no |
| gke_cluster_location | Location of the cluster to be connected to CAST AI. Can be region or zone for zonal clusters | string | n/a | yes |
| gke_cluster_name | Name of the cluster to be connected to CAST AI. | string | n/a | yes |
| gke_credentials | Optional GCP Service account credentials.json | string | n/a | yes |
| grpc_url | gRPC endpoint used by pod-pinner | string | "grpc.cast.ai:443" | no |
| install_cloud_proxy | Optional flag for installation of castai-cloud-proxy | bool | false | no |
| install_security_agent | Optional flag for installation of security agent (https://docs.cast.ai/product-overview/console/security-insights/) | bool | false | no |
| install_workload_autoscaler | Optional flag for installation of workload autoscaler (https://docs.cast.ai/docs/workload-autoscaling-configuration) | bool | false | no |
| kvisor_controller_extra_args | Extra arguments for the kvisor controller. Optionally enable kvisor to lint Kubernetes YAML manifests, scan workload images and check if workloads pass CIS Kubernetes Benchmarks as well as NSA, OWASP and PCI recommendations. | map(string) | { "image-scan-enabled": "true", "kube-bench-enabled": "true", "kube-linter-enabled": "true" } | no |
| kvisor_values | List of YAML formatted string values for kvisor helm chart | list(string) | [] | no |
| kvisor_version | Version of kvisor chart. If not provided, latest version will be used. | string | null | no |
| node_configurations | Map of GKE node configurations to create | any | {} | no |
| node_templates | Map of node templates to create | any | {} | no |
| pod_pinner_values | List of YAML formatted string values for pod-pinner helm chart | list(string) | [] | no |
| pod_pinner_version | Version of pod-pinner helm chart. Default latest | string | null | no |
| project_id | The project id from GCP | string | n/a | yes |
| self_managed | Whether CAST AI components' upgrades are managed by a customer; by default upgrades are managed by the CAST AI central system. | bool | false | no |
| spot_handler_values | List of YAML formatted string values for spot-handler helm chart | list(string) | [] | no |
| spot_handler_version | Version of castai-spot-handler helm chart. Default latest | string | null | no |
| wait_for_cluster_ready | Wait for cluster to be ready before finishing the module execution. This option requires castai_api_token to be set | bool | false | no |
| workload_autoscaler_values | List of YAML formatted strings with cluster-workload-autoscaler values | list(string) | [] | no |
| workload_autoscaler_version | Version of castai-workload-autoscaler helm chart. Default latest | string | null | no |
| workload_scaling_policies | Map of workload scaling policies to create | any | {} | no |
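
Several of the optional inputs listed above interact, most notably wait_for_cluster_ready, which requires castai_api_token. The following sketch shows how a few of them could be combined. var.castai_api_token, the label key and value, and the commented-out version string are placeholders and not values taken from this module.

```hcl
module "castai_gke_cluster" {
  source = "castai/gke-cluster/castai"

  # Required inputs.
  project_id           = var.project_id
  gke_cluster_name     = var.cluster_name
  gke_cluster_location = module.gke.location
  gke_credentials      = module.castai_gke_iam.private_key

  # Block until the cluster is ready; per the inputs table this needs an API token.
  wait_for_cluster_ready = true
  castai_api_token       = var.castai_api_token # hypothetical variable

  # Optional components, both disabled by default.
  install_workload_autoscaler = true
  install_security_agent      = true

  # Extra labels for CAST AI pods; the key and value are placeholders.
  castai_components_labels = {
    team = "platform"
  }

  # Pin a chart instead of tracking latest; the version string is a placeholder.
  # agent_version = "x.y.z"
}
```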

Outputs

| Name | Description |
|------|-------------|
| castai_node_configurations | Map of node configuration IDs by name |
| castai_node_templates | Map of node templates by name |
| cluster_id | CAST AI cluster ID, which can be used for accessing cluster data using the API |
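
As an illustration, the outputs can be re-exported from the root module or passed to other resources. This sketch assumes the module is instantiated as castai_gke_cluster and that a node configuration named default was defined; the output names on the left are hypothetical.

```hcl
# Expose the CAST AI cluster ID, for example for API calls outside Terraform.
output "castai_cluster_id" {
  value = module.castai_gke_cluster.cluster_id
}

# Look up one node configuration ID by its name in the returned map.
output "default_node_configuration_id" {
  value = module.castai_gke_cluster.castai_node_configurations["default"]
}
```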