A comprehensive, production-ready Helm chart for deploying Apache Kafka clusters using the Strimzi Kafka Operator on Kubernetes. This chart provides extensive parameterization and supports multiple environments with flexible configuration options.
- Multi-Environment Support: Dedicated configurations for `nonprod`, `staging`, and `prod` environments
- Flexible Kafka Versions: Support for Kafka 3.8.0, 3.9.0, and newer versions
- KRaft Mode Only: Modern ZooKeeper-less Kafka deployment (ZooKeeper is deprecated)
- Flexible Node Roles: Support for Controller, Broker, or Dual-role nodes per Strimzi KRaft docs
- Dynamic Scaling: Optional Horizontal Pod Autoscaler (HPA) support
- Flexible Listener Configuration: List-based approach supporting unlimited listeners with different types, ports, and authentication
- Storage Flexibility: Configurable persistent storage with optional storage classes
- Node Scheduling: Comprehensive affinity and toleration support for EKS nodegroups
- Security: TLS encryption, SCRAM-SHA-512 authentication, and ACL-based authorization
- Monitoring: Built-in JMX Prometheus Exporter metrics for Kafka, Cruise Control, and Connect
- Large Message Support: Pre-configured for handling messages up to 10MB
- Auto-Rebalancing: Cruise Control integration with customizable rebalancing goals
- External DNS Integration: Automatic DNS record management for external listeners
- Rack Awareness: Multi-AZ deployment support with automatic replica selector configuration
- Resource Management: Configurable CPU and memory limits/requests
- Pod Disruption Budgets: Built-in availability protection
- Comprehensive Testing: Helm tests for connectivity validation
- CI/CD Integration: Automated security scanning, template validation, and publishing
This Helm chart implements enterprise-grade security with comprehensive hardening:
- Multiple Authentication Methods: TLS certificates and SCRAM-SHA-512 support
- Fine-grained ACLs: Least-privilege access control with deny-by-default
- Superuser Management: Explicit superuser configuration for admin access
- User Management: Comprehensive KafkaUser resources with role-based permissions
- TLS Everywhere: Mandatory TLS encryption on all listeners
- Strong Cipher Suites: TLS 1.3 preferred with secure protocol configuration
- Certificate Management: Support for Strimzi CA and external cert-manager integration
- Client Authentication: Required client certificate or password authentication
- Secure Defaults: Auto-topic creation disabled, unclean leader election prevented
- High Availability: 5-replica default with `min.insync.replicas` set to 3
- Audit Logging: Security event logging and monitoring integration
- Network Isolation: Listener-based network segmentation
Complete Security Guide - Comprehensive security configuration, best practices, and troubleshooting
- Kubernetes 1.21+
- Helm 3.8+
- Strimzi Kafka Operator installed in your cluster
- For external listeners: Ingress controller (e.g., NGINX) and ExternalDNS (optional)
- For EKS: Appropriate nodegroups with taints/labels configured
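If the Strimzi operator is not already present, it can be installed from the upstream Strimzi chart repository. A typical installation (the repository URL and `watchAnyNamespace` value are upstream defaults; adjust namespace and scope to your cluster):

```bash
# Install the Strimzi Cluster Operator from the official chart repository
helm repo add strimzi https://strimzi.io/charts/
helm repo update

# Watch all namespaces so the operator can manage Kafka clusters anywhere
helm install strimzi-kafka-operator strimzi/strimzi-kafka-operator \
  -n strimzi-system --create-namespace \
  --set watchAnyNamespace=true
```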
- Add the Helm repository (if using a repository):

  helm repo add strimzi-kafka /path/to/chart
  helm repo update

- Install with default values:

  helm install my-kafka strimzi-kafka/strimzi-kafka -n kafka-system --create-namespace

- Install for a specific environment:

  # Non-production (release name: kafka-nonprod, namespace: om-kafka)
  helm install kafka-nonprod . -f values-nonprod.yaml -n om-kafka --create-namespace

  # Staging (release name: kafka-staging, namespace: om-kafka-staging)
  helm install kafka-staging . -f values-staging.yaml -n om-kafka-staging --create-namespace

  # Production (release name: kafka-prod, namespace: om-kafka-prod)
  helm install kafka-prod . -f values-prod.yaml -n om-kafka-prod --create-namespace
For easier multi-environment deployment, use the provided script:
# Make the script executable
chmod +x scripts/deploy.sh
# Deploy to different environments
./scripts/deploy.sh nonprod om-kafka kafka-nonprod
./scripts/deploy.sh staging om-kafka-staging kafka-staging
./scripts/deploy.sh prod om-kafka-prod kafka-prod
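The script body is not reproduced in this section; a minimal sketch of such a wrapper, assuming only the argument order used above (environment, namespace, release name) - everything else in the body is hypothetical:

```bash
#!/usr/bin/env bash
# deploy.sh <environment> <namespace> <release-name> -- hypothetical sketch
set -euo pipefail

ENV="${1:?environment required (nonprod|staging|prod)}"
NS="${2:?namespace required}"
RELEASE="${3:?release name required}"

# Each environment maps to its own values file, e.g. values-prod.yaml
helm upgrade --install "${RELEASE}" . \
  -f "values-${ENV}.yaml" \
  -n "${NS}" --create-namespace
```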
This Helm chart provides flexible naming with sensible defaults that follow Helm best practices:
- Cluster Name: Defaults to `{{ .Release.Name }}` (your Helm release name)
- Namespace: Defaults to `{{ .Release.Namespace }}` (your deployment namespace)
- Node Pools: Prefixed with the cluster name (e.g., `my-kafka-dual-role`)
- Secrets: Prefixed with the cluster name (e.g., `my-kafka-kafka-tls-secret`)
kafkaCluster:
name: "my-custom-kafka" # Override cluster name
namespace: "custom-namespace" # Override namespace
nodePools:
- name: "custom-dual-role" # Will become: my-custom-kafka-custom-dual-role
- Flexible: Override names when needed for specific requirements
- Consistent: Follows Helm conventions by default
- Predictable: Resource names are clearly prefixed and organized
- Simple: Works out-of-the-box with sensible defaults
# This creates a Kafka cluster named "my-kafka" in namespace "kafka-system"
helm install my-kafka . -n kafka-system --create-namespace
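To confirm the derived names after installation (the resource kinds below are the standard Strimzi CRDs):

```bash
# Cluster and node-pool names are derived from the release name
kubectl get kafka,kafkanodepool -n kafka-system

# Secrets are prefixed with the cluster name
kubectl get secrets -n kafka-system | grep my-kafka
```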
| Parameter | Description | Default | Environment Specific |
|---|---|---|---|
| `kafkaCluster.version` | Kafka version | `3.9.0` | ✅ |
| `kafkaCluster.replicas` | Number of Kafka brokers | `3` | ✅ |
Note:
- The chart uses Helm built-in variables for naming: `{{ .Release.Name }}` for the cluster name and `{{ .Release.Namespace }}` for the namespace
- Deploy to the desired namespace using `helm install --namespace <namespace> <release-name>`
- The release name automatically becomes your Kafka cluster name
The chart provides comprehensive image configuration options for all Strimzi components, supporting both public and private registries with flexible override capabilities.
Set default image settings that apply to all components unless overridden:
global:
# Default Image Configuration
defaultImageRegistry: "quay.io" # Default: quay.io
defaultImageRepository: "strimzi" # Default: strimzi
defaultImageTag: "0.47.0-kafka-3.9.0" # Strimzi version with Kafka version
# Global Image Pull Configuration
imagePullPolicy: "IfNotPresent" # Always, Never, IfNotPresent
imagePullSecrets: # Global pull secrets
- name: "private-registry-secret"
- name: "ecr-registry-secret"
Override image settings for specific components:
# Strimzi Operator Image Configuration
strimzi:
operator:
image:
registry: "my-registry.com" # Override global registry
repository: "custom-strimzi" # Override global repository
name: "operator" # Operator image name
tag: "0.47.0-custom" # Override global tag
pullPolicy: "Always" # Override global pull policy
pullSecrets: # Additional component secrets
- name: "operator-registry-secret"
# Kafka Cluster Image Configuration
kafkaCluster:
image:
registry: "" # Empty = use global default
repository: "" # Empty = use global default
tag: "" # Empty = use global default
pullPolicy: "" # Empty = use global default
pullSecrets: [] # Additional Kafka secrets
# Kafka Connect Image Configuration
kafkaConnects:
- name: my-connect-cluster
image:
registry: "my-registry.com"
repository: "custom-kafka"
name: "kafka-connect" # Connect image name
tag: "3.9.0-custom"
pullPolicy: "Always"
pullSecrets:
- name: "connect-registry-secret"
global:
defaultImageRegistry: "quay.io"
defaultImageRepository: "strimzi"
defaultImageTag: "0.47.0-kafka-3.9.0"
imagePullPolicy: "IfNotPresent"
imagePullSecrets: [] # No secrets for public registry
global:
defaultImageRegistry: "123456789012.dkr.ecr.us-east-1.amazonaws.com"
defaultImageRepository: "strimzi"
defaultImageTag: "0.47.0-kafka-3.9.0"
imagePullPolicy: "Always" # Always pull latest for production
imagePullSecrets:
- name: "ecr-registry-secret"
For private registries, create image pull secrets:
# For ECR (AWS)
kubectl create secret docker-registry ecr-registry-secret \
--docker-server=123456789012.dkr.ecr.us-east-1.amazonaws.com \
--docker-username=AWS \
--docker-password=$(aws ecr get-login-password --region us-east-1) \
--namespace=kafka-system
# For Docker Hub
kubectl create secret docker-registry dockerhub-secret \
--docker-server=docker.io \
--docker-username=myusername \
--docker-password=mypassword \
--namespace=kafka-system
# For Harbor/Custom Registry
kubectl create secret docker-registry harbor-secret \
--docker-server=harbor.company.com \
--docker-username=myusername \
--docker-password=mypassword \
--namespace=kafka-system
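To sanity-check that a pull secret holds the intended registry credentials (shown for the ECR secret created above):

```bash
# Decode the Docker config stored in the pull secret
kubectl get secret ecr-registry-secret -n kafka-system \
  -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
```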
This chart provides comprehensive security features with production-grade hardening. For complete security documentation, see SECURITY.md.
kafkaCluster:
listeners:
# Internal TLS listener
- name: "tls"
port: 9093
type: internal
tls: true
authentication:
type: tls
# SCRAM-SHA-512 listener for applications
- name: "scram"
port: 9094
type: internal
tls: true
authentication:
type: scram-sha-512
kafkaCluster:
authorization:
type: simple
superUsers:
- CN=kafka-admin-prod
- kafka-superuser-prod
config:
# CRITICAL: Security hardening
auto.create.topics.enable: false
allow.everyone.if.no.acl.found: false
ssl.client.auth: required
unclean.leader.election.enable: false
# High availability
default.replication.factor: 5
min.insync.replicas: 3
kafkaUsers:
- name: app-service-user
authentication:
type: scram-sha-512
authorization:
type: simple
acls:
# Read access to specific topics
- resource:
type: topic
name: "app-events"
patternType: literal
operations: [Describe, Read]
host: "*"
# Consumer group access
- resource:
type: group
name: "app-service-group"
patternType: literal
operations: [Read]
host: "*"
# values-dev.yaml
kafkaCluster:
config:
auto.create.topics.enable: true
allow.everyone.if.no.acl.found: true
listeners:
- name: "internal"
port: 9092
type: internal
tls: false
authentication: {}
# values-prod.yaml
kafkaCluster:
config:
auto.create.topics.enable: false
allow.everyone.if.no.acl.found: false
ssl.client.auth: required
ssl.enabled.protocols: TLSv1.3,TLSv1.2
ssl.protocol: TLSv1.3
listeners:
- name: "tls"
port: 9093
type: internal
tls: true
authentication:
type: tls
- name: "external"
port: 9095
type: ingress
tls: true
authentication:
type: scram-sha-512
configuration:
class: "nginx"
annotations:
nginx.ingress.kubernetes.io/ssl-redirect: "true"
cert-manager.io/cluster-issuer: "letsencrypt-prod"
Feature | Development | Staging | Production |
---|---|---|---|
TLS Encryption | Optional | Required | Required |
Client Authentication | None | SCRAM/TLS | TLS + SCRAM |
ACL Authorization | Disabled | Basic | Comprehensive |
Auto Topic Creation | Enabled | Disabled | Disabled |
Superuser Access | Open | Limited | Restricted |
Certificate Management | Self-signed | Strimzi CA | cert-manager |
Complete Security Guide - Authentication, authorization, TLS configuration, ACL patterns, troubleshooting, and security checklists.
The chart supports modern Kubernetes-style ingress configuration with optional annotations and flexible TLS settings. This provides better control over ingress resources and follows Kubernetes best practices.
kafkaCluster:
listeners:
external:
ingress:
# Ingress class name (required)
className: "nginx" # or "alb", "traefik", etc.
# Ingress host (required)
host: "kafka.example.com"
# Optional annotations - if {} then disabled/not required
annotations:
external-dns.alpha.kubernetes.io/hostname: "kafka.example.com"
external-dns.alpha.kubernetes.io/ttl: "60"
nginx.ingress.kubernetes.io/ssl-redirect: "true"
cert-manager.io/cluster-issuer: "letsencrypt-prod"
# TLS Configuration
tls:
enabled: true # true or false
secretName: "kafka-tls-secret" # optional - auto-generated if not specified
# Broker configuration - inherits className, annotations, and TLS from parent
brokers:
# Host pattern: broker-{broker}-{parent.host} (e.g., broker-0-kafka.example.com)
# Brokers automatically inherit all parent ingress settings
hostPattern: "broker-{broker}-kafka.example.com" # Optional override
Non-Production: Simplified setup with TLS disabled
kafkaCluster:
listeners:
external:
ingress:
className: "nginx"
host: "kafka.dev.example.com"
annotations:
external-dns.alpha.kubernetes.io/hostname: "kafka.dev.example.com"
tls:
enabled: false # Simplified for development
brokers:
# Brokers inherit all settings from parent
# Host pattern: broker-0-kafka.dev.example.com, broker-1-kafka.dev.example.com, etc.
Production: Full security with TLS and cert-manager
kafkaCluster:
listeners:
external:
ingress:
className: "nginx"
host: "kafka.prod.example.com"
annotations:
external-dns.alpha.kubernetes.io/hostname: "kafka.prod.example.com"
cert-manager.io/cluster-issuer: "letsencrypt-prod"
nginx.ingress.kubernetes.io/ssl-redirect: "true"
nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
tls:
enabled: true
secretName: "kafka-prod-tls"
brokers:
# Brokers inherit all settings from parent including TLS and cert-manager
# Host pattern: broker-0-kafka.prod.example.com, broker-1-kafka.prod.example.com, etc.
Annotations are completely optional. If you set `annotations: {}` or omit the annotations section entirely, no annotations will be applied to the ingress resources. This provides maximum flexibility for different deployment scenarios.
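For example, a minimal listener definition with annotations disabled entirely, following the same schema as above:

```yaml
kafkaCluster:
  listeners:
    external:
      ingress:
        className: "nginx"
        host: "kafka.internal.example.com"   # illustrative hostname
        annotations: {}                      # no annotations applied
        tls:
          enabled: false
```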
This chart supports KRaft mode only (ZooKeeper is deprecated). You can configure flexible node roles as per the Strimzi KRaft documentation:
- Controller: Manages cluster metadata and leader elections
- Broker: Handles client requests and data storage
- Dual-role: Both controller and broker functions (suitable for dev/test)
Development/Testing - Dual-role nodes:
kafkaCluster:
nodePools:
- name: "kafka-dual-role"
replicas: 3 # Minimum for KRaft quorum
roles:
- broker
- controller
Production - Dedicated roles (recommended):
kafkaCluster:
nodePools:
# Dedicated controllers for metadata management
- name: "kafka-controllers"
replicas: 3 # Odd number for quorum (3 or 5)
roles:
- controller
resources:
requests:
memory: "2Gi"
cpu: "500m"
# Dedicated brokers for client traffic
- name: "kafka-brokers"
replicas: 6 # Scale based on throughput needs
roles:
- broker
resources:
requests:
memory: "8Gi"
cpu: "2000m"
- Non-Production: Dual-role nodes for cost efficiency
- Staging: Mixed setup to test production patterns
- Production: Dedicated roles for optimal performance and isolation
The chart supports flexible node selection through both simple `nodeSelector` and advanced `affinity` configurations, plus tolerations that apply to all Kafka components (brokers, entity operator, Kafka exporter, Cruise Control) unless explicitly overridden at the component level.
Key Benefits:
- ✅ DRY Principle: Configure once, apply everywhere
- ✅ Consistent Scheduling: All components use the same node selection by default
- ✅ Easy Management: Change global settings to affect all components
- ✅ Selective Override: Override only specific components when needed
How it Works:
- Global Configuration: Set `global.nodeSelector`, `global.affinity`, and `global.tolerations`
- Automatic Inheritance: All Kafka components inherit these settings
- Component Override: Override at the component level only when different behavior is needed
❌ Before (Repetitive):
# Repeated in every component - hard to maintain!
kafkaCluster:
nodePools:
- template:
pod:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: eks.amazonaws.com/nodegroup
operator: In
values: ["kafka-nodes"]
entityOperator:
template:
pod:
affinity: # Same config repeated again!
nodeAffinity: # ... 15 lines of duplicate config
cruiseControl:
template:
pod:
affinity: # Same config repeated again!
nodeAffinity: # ... 15 lines of duplicate config
✅ After (Clean & DRY):
# Configure once globally
global:
nodeSelector:
eks.amazonaws.com/nodegroup: "kafka-nodes"
# All components automatically inherit - no repetition!
kafkaCluster:
nodePools:
- template:
pod:
affinity: {} # Inherits from global
entityOperator:
template:
pod:
affinity: {} # Inherits from global
cruiseControl:
template:
pod:
affinity: {} # Inherits from global
# Override only when needed
kafkaExporter:
template:
pod:
nodeSelector:
special-node: "monitoring" # Override for this component only
Simple Node Selection (Recommended):
Use `nodeSelector` for straightforward node targeting:
global:
nodeSelector:
eks.amazonaws.com/nodegroup: "kafka-nodegroup"
node-type: "kafka-dedicated"
kubernetes.io/arch: "amd64"
Advanced Node Selection:
Use `affinity.nodeAffinity` for complex node selection logic:
global:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: eks.amazonaws.com/nodegroup
operator: In
values:
- "kafka-nodegroup-1"
- "kafka-nodegroup-2"
global:
# Global Affinity Configuration (applied to all components unless overridden)
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: eks.amazonaws.com/nodegroup
operator: In
values:
- "kafka-nodegroup"
- key: node-type
operator: In
values:
- "kafka-dedicated"
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: strimzi.io/cluster
operator: In
values:
- "{{ .Values.global.clusterName }}"
topologyKey: kubernetes.io/hostname
# Global Tolerations Configuration (applied to all components unless overridden)
tolerations:
- key: "dedicated"
operator: "Equal"
value: "kafka"
effect: "NoSchedule"
- key: "identifier"
operator: "Equal"
value: "kafka-nodegroup-taint"
effect: "NoSchedule"
Individual components can override global affinity and tolerations:
kafkaCluster:
nodePools:
- template:
pod:
# Override global affinity for this node pool
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: node-type
operator: In
values:
- "kafka-broker-only"
# Override global tolerations for this node pool
tolerations:
- key: "broker-dedicated"
operator: "Equal"
value: "true"
effect: "NoSchedule"
entityOperator:
template:
pod:
# Entity operator can have different affinity/tolerations
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: node-type
operator: In
values:
- "kafka-management"
Non-Production: Basic node affinity, no pod anti-affinity
global:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: eks.amazonaws.com/nodegroup
operator: In
values:
- "dev-nodegroup"
tolerations:
- key: "dev-taint"
operator: "Equal"
value: "kafka"
effect: "NoSchedule"
Production: Strict node affinity and pod anti-affinity
global:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: eks.amazonaws.com/nodegroup
operator: In
values:
- "kafka-dedicated-nodegroup"
- key: node-type
operator: In
values:
- "kafka-dedicated"
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution: # Hard anti-affinity for production
- labelSelector:
matchExpressions:
- key: strimzi.io/cluster
operator: In
values:
- "kafka-cluster-prod"
topologyKey: kubernetes.io/hostname
tolerations:
- key: "dedicated"
operator: "Equal"
value: "kafka"
effect: "NoSchedule"
kafkaCluster:
nodePools:
- storage:
enabled: true
type: jbod # or persistent-claim
volumes:
- id: 1
type: persistent-claim
size: 100Gi # Configurable per environment
deleteClaim: false
kraftMetadata: shared
storageClass: gp3 # Optional: specify storage class
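If the `gp3` storage class referenced above does not exist in the cluster yet, it must be created separately. A sketch for EKS, assuming the AWS EBS CSI driver is installed:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com               # AWS EBS CSI driver
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer    # bind when the broker pod is scheduled
allowVolumeExpansion: true                 # allow growing broker volumes later
```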
The chart now supports flexible list-based listeners - add as many listeners as needed with different configurations:
kafkaCluster:
listeners:
# Internal TLS listener (always recommended)
- name: "tls"
port: 9093
type: internal
tls: true
authentication:
type: tls
# External ingress listener
- name: "external"
port: 9095
type: ingress
tls: true
authentication:
type: scram-sha-512
configuration:
class: "nginx"
bootstrap:
host: "kafka.example.com"
annotations:
external-dns.alpha.kubernetes.io/hostname: "kafka.example.com"
brokers:
hostTemplate: "kafka-{id}.example.com"
annotations:
external-dns.alpha.kubernetes.io/hostname: "kafka-{id}.example.com"
tls:
secretName: "kafka-tls-secret"
brokerSecretName: "kafka-broker-tls"
kafkaCluster:
listeners:
# Internal TLS for inter-broker communication
- name: "tls"
port: 9093
type: internal
tls: true
authentication:
type: tls
# Internal SCRAM for applications
- name: "scram"
port: 9094
type: internal
tls: true
authentication:
type: scram-sha-512
# External ingress for web clients
- name: "web-clients"
port: 9095
type: ingress
tls: true
authentication:
type: scram-sha-512
configuration:
class: "nginx"
bootstrap:
host: "kafka-web.example.com"
brokers:
hostTemplate: "kafka-web-{id}.example.com"
# External LoadBalancer for internal services
- name: "internal-services"
port: 9096
type: loadbalancer
tls: true
authentication:
type: tls
configuration:
loadBalancerSourceRanges:
- "10.0.0.0/8"
- "172.16.0.0/12"
# NodePort for development access
- name: "dev-access"
port: 9097
type: nodeport
tls: false
authentication:
type: scram-sha-512
configuration:
nodePort: 32000
- Unlimited Flexibility: Add any number of listeners with different configurations
- Purpose-Specific: Create listeners for different client types (web, mobile, internal services)
- Granular Security: Different authentication methods per listener
- Multi-Environment: Different ingress hosts and TLS configurations per listener
- Scalable: Easy to add new listeners without restructuring existing ones
Configure pod scheduling for EKS nodegroups:
kafkaCluster:
nodePools:
- template:
pod:
affinity:
enabled: true
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: eks.amazonaws.com/nodegroup
operator: In
values:
- "kafka-nodegroup"
podAntiAffinity:
enabled: true # Recommended for production
tolerations:
enabled: true
tolerations:
- key: "kafka-dedicated"
operator: "Equal"
value: "true"
effect: "NoSchedule"
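Before relying on this configuration, confirm that the target nodegroup actually carries the referenced labels and taints (`<node-name>` is a placeholder):

```bash
# Nodes in the dedicated Kafka nodegroup
kubectl get nodes -l eks.amazonaws.com/nodegroup=kafka-nodegroup

# Inspect a node's taints to match them against the tolerations above
kubectl get node <node-name> -o jsonpath='{.spec.taints}'
```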
- Cannot scale Kafka brokers: Broker scaling is exclusively managed by the Strimzi operator
- Cannot target StatefulSets: HPA cannot manage StatefulSets created by the operator
- Cannot handle data rebalancing: Scaling brokers requires data rebalancing coordination
HPA can only be used for ancillary components not managed by the Strimzi operator:
# ✅ VALID: HPA for Kafka Connect (separate deployment)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: kafka-connect-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: my-connect-cluster-connect # Kafka Connect deployment
minReplicas: 2
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
# ✅ VALID: HPA for MirrorMaker2 (separate deployment)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: mirrormaker2-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: my-mm2-cluster-mirrormaker2
minReplicas: 1
maxReplicas: 5
To scale Kafka brokers, update the `replicas` configuration and apply a Helm upgrade:
# values-prod.yaml
kafkaCluster:
replicas: 5 # Scale from 3 to 5 brokers
# Apply the change
helm upgrade my-kafka . -f values-prod.yaml
The Strimzi operator will:
- Create new broker pods
- Update cluster metadata
- Trigger automatic rebalancing (if Cruise Control is enabled)
- Ensure data distribution across new brokers
Instead of HPA, monitor these metrics for manual scaling decisions:
# Monitor broker CPU/Memory usage
kubectl top pods -l strimzi.io/cluster=my-kafka
# Check disk usage per broker
kubectl exec my-kafka-dual-role-0 -c kafka -- df -h /var/lib/kafka
# Monitor partition distribution
kubectl exec my-kafka-dual-role-0 -c kafka -- bin/kafka-topics.sh \
--bootstrap-server localhost:9092 --describe --under-replicated-partitions
Enable comprehensive monitoring with JMX Prometheus Exporter:
kafkaCluster:
metricsConfig:
enabled: true
type: jmxPrometheusExporter
configMapName: kafka-metrics
configMapKey: kafka-metrics-config.yml
kafkaExporter:
enabled: true # Additional Kafka-specific metrics
cruiseControl:
enabled: true
metricsConfig:
enabled: true
type: jmxPrometheusExporter
configMapName: cruise-control-metrics
configMapKey: metrics-config.yml
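The referenced ConfigMaps contain standard JMX Prometheus Exporter rule files. A minimal sketch of the `kafka-metrics` ConfigMap (the two rules are illustrative excerpts; production rule sets are much longer):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kafka-metrics
data:
  kafka-metrics-config.yml: |
    lowercaseOutputName: true
    rules:
      # Broker-level byte rates (BytesInPerSec / BytesOutPerSec)
      - pattern: kafka.server<type=BrokerTopicMetrics, name=(BytesInPerSec|BytesOutPerSec)><>Count
        name: kafka_server_brokertopicmetrics_$1_total
        type: COUNTER
      # Under-replicated partitions gauge
      - pattern: kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value
        name: kafka_server_replicamanager_underreplicatedpartitions
        type: GAUGE
```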
Define users and topics declaratively:
kafkaUsers:
- name: app-user
authentication:
type: scram-sha-512
authorization:
type: simple
acls:
- resource:
type: topic
name: "app-*"
patternType: prefix
operations: [Describe, Read, Write]
host: "*"
kafkaTopics:
- name: user-events
partitions: 12
replicas: 3
config:
retention.ms: 604800000 # 7 days
segment.bytes: 1073741824
compression.type: lz4
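After applying, the Topic Operator reconciles these resources into real topics, which can be verified with:

```bash
# KafkaTopic resources and their readiness
kubectl get kafkatopic -n kafka-system

# Partition and replica layout reported by the operator
kubectl describe kafkatopic user-events -n kafka-system
```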
Deploy Kafka Connect clusters with custom connectors:
kafkaConnects:
- name: analytics-connect
replicas: 3
version: "3.9.0"
resources:
requests:
memory: 4Gi
cpu: 2
limits:
memory: 4Gi
cpu: 4
bootstrapServers: "kafka.example.com:443"
build:
output:
type: docker
image: "registry.example.com/kafka-connect:latest"
plugins:
- name: opensearch-sink
artifacts:
- type: zip
url: "https://github.com/Aiven-Open/opensearch-connector-for-apache-kafka/releases/download/v3.1.1/opensearch-connector-for-apache-kafka-3.1.1.zip"
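Once the Connect cluster is running, individual connectors are deployed as KafkaConnector resources. A sketch for the OpenSearch sink built above (the connector class and options are illustrative; check the connector's own documentation). Note that the KafkaConnect resource must carry the `strimzi.io/use-connector-resources: "true"` annotation for these to be reconciled:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: opensearch-sink
  labels:
    strimzi.io/cluster: analytics-connect   # must match the KafkaConnect name
spec:
  class: io.aiven.kafka.connect.opensearch.OpensearchSinkConnector
  tasksMax: 2
  config:
    topics: "user-events"
    connection.url: "https://opensearch.example.com:9200"   # illustrative endpoint
```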
Cruise Control provides intelligent rebalancing and optimization for Kafka clusters. This section covers goal templates aligned with Strimzi defaults and operational guidance.
kafkaCluster:
cruiseControl:
enabled: true
config:
# Strimzi default goals - optimized for production stability
default.goals: >
RackAwareGoal,
ReplicaCapacityGoal,
DiskCapacityGoal,
NetworkInboundCapacityGoal,
NetworkOutboundCapacityGoal,
CpuCapacityGoal,
ReplicaDistributionGoal,
PotentialNwOutGoal,
DiskUsageDistributionGoal,
NetworkInboundUsageDistributionGoal,
NetworkOutboundUsageDistributionGoal,
CpuUsageDistributionGoal,
LeaderReplicaDistributionGoal,
LeaderBytesInDistributionGoal
# Hard goals that cannot be violated
hard.goals: >
RackAwareGoal,
ReplicaCapacityGoal,
DiskCapacityGoal,
NetworkInboundCapacityGoal,
NetworkOutboundCapacityGoal,
CpuCapacityGoal
# Self-healing configuration
self.healing.goals: >
RackAwareGoal,
ReplicaCapacityGoal,
DiskCapacityGoal
# Anomaly detection
anomaly.detection.goals: >
RackAwareGoal,
ReplicaCapacityGoal,
DiskCapacityGoal,
NetworkInboundCapacityGoal,
NetworkOutboundCapacityGoal,
CpuCapacityGoal
# KafkaRebalance templates for different scenarios
kafkaRebalances:
# Full cluster rebalance (use sparingly)
- name: "full-rebalance"
mode: "full"
goals:
- RackAwareGoal
- ReplicaCapacityGoal
- DiskCapacityGoal
- NetworkInboundCapacityGoal
- NetworkOutboundCapacityGoal
- CpuCapacityGoal
- ReplicaDistributionGoal
- PotentialNwOutGoal
- DiskUsageDistributionGoal
- NetworkInboundUsageDistributionGoal
- NetworkOutboundUsageDistributionGoal
- CpuUsageDistributionGoal
- LeaderReplicaDistributionGoal
- LeaderBytesInDistributionGoal
# Add brokers rebalance (when scaling up)
- name: "add-brokers-rebalance"
mode: "add-brokers"
goals:
- RackAwareGoal
- ReplicaCapacityGoal
- DiskCapacityGoal
- ReplicaDistributionGoal
- DiskUsageDistributionGoal
# Remove brokers rebalance (when scaling down)
- name: "remove-brokers-rebalance"
mode: "remove-brokers"
goals:
- RackAwareGoal
- ReplicaCapacityGoal
- DiskCapacityGoal
- ReplicaDistributionGoal
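A KafkaRebalance first generates an optimization proposal; no partitions move until the proposal is approved. The standard Strimzi annotation flow:

```bash
# Wait for the proposal to reach ProposalReady
kubectl get kafkarebalance full-rebalance -o wide

# Approve the proposal to start the rebalance
kubectl annotate kafkarebalance full-rebalance strimzi.io/rebalance=approve

# Stop a running rebalance if needed
kubectl annotate kafkarebalance full-rebalance strimzi.io/rebalance=stop --overwrite
```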
kafkaCluster:
cruiseControl:
enabled: true
config:
# Conservative goals for minimal cluster disruption
default.goals: >
RackAwareGoal,
ReplicaCapacityGoal,
DiskCapacityGoal,
ReplicaDistributionGoal
hard.goals: >
RackAwareGoal,
ReplicaCapacityGoal,
DiskCapacityGoal
- Kafka Version Upgrades

  kafkaCluster:
    cruiseControl:
      enabled: false # Disable during upgrade

- Broker Configuration Changes
  - JVM settings modifications
  - Storage configuration changes
  - Network configuration updates

- Cluster Maintenance
  - Node maintenance windows
  - Kubernetes cluster upgrades
  - Storage system maintenance

- Emergency Situations
  - Broker failures requiring immediate attention
  - Network partitions or connectivity issues
  - Data corruption incidents
# 1. Ensure all brokers are healthy
kubectl get pods -l strimzi.io/cluster=my-kafka
# 2. Check for under-replicated partitions
kubectl exec my-kafka-dual-role-0 -c kafka -- bin/kafka-topics.sh \
--bootstrap-server localhost:9092 --describe --under-replicated-partitions
# 3. Re-enable Cruise Control
helm upgrade my-kafka . -f values-prod.yaml # With cruiseControl.enabled: true
# 4. Wait for Cruise Control to start
kubectl get pods -l strimzi.io/name=my-kafka-cruise-control
Goal | Purpose | Impact | Recommended Use |
---|---|---|---|
RackAwareGoal | Ensures replicas are distributed across racks/AZs | High | Always include (hard goal) |
ReplicaCapacityGoal | Prevents brokers from exceeding replica limits | High | Always include (hard goal) |
DiskCapacityGoal | Prevents disk space exhaustion | High | Always include (hard goal) |
NetworkInboundCapacityGoal | Balances network inbound traffic | Medium | Production clusters |
NetworkOutboundCapacityGoal | Balances network outbound traffic | Medium | Production clusters |
CpuCapacityGoal | Balances CPU utilization | Medium | Production clusters |
ReplicaDistributionGoal | Evenly distributes replicas | Low | General optimization |
LeaderReplicaDistributionGoal | Evenly distributes leader replicas | Low | Performance optimization |
DiskUsageDistributionGoal | Balances disk usage | Low | Storage optimization |
# Cruise Control status
kubectl get pods -l strimzi.io/name=my-kafka-cruise-control
# Active rebalances
kubectl get kafkarebalance
# Cruise Control logs
kubectl logs deployment/my-kafka-cruise-control
# Anomaly detector status
kubectl exec my-kafka-cruise-control-xxx -- curl -s localhost:9090/kafkacruisecontrol/state
Monitor these Cruise Control metrics:
- `kafka_cruisecontrol_anomaly_detector_mean_time_between_anomalies_ms`
- `kafka_cruisecontrol_executor_execution_stopped`
- `kafka_cruisecontrol_monitor_sampling_rate`
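As a sketch, these metrics can drive Prometheus alerting; the threshold, duration, and labels below are assumptions to adapt:

```yaml
groups:
  - name: cruise-control
    rules:
      # Fires when Cruise Control has stopped an in-flight rebalance execution
      - alert: CruiseControlExecutionStopped
        expr: kafka_cruisecontrol_executor_execution_stopped > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Cruise Control stopped a rebalance execution"
```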
- Cruise Control Pod Not Starting

  # Check logs for configuration errors
  kubectl logs deployment/my-kafka-cruise-control

  # Verify the Kafka cluster is healthy
  kubectl get kafka my-kafka -o yaml

- Rebalance Stuck in Pending State

  # Check rebalance status
  kubectl describe kafkarebalance my-rebalance

  # Check Cruise Control logs
  kubectl logs deployment/my-kafka-cruise-control | grep -i rebalance

- Goals Cannot Be Satisfied

  # Review goal configuration
  kubectl get kafka my-kafka -o yaml | grep -A 20 cruiseControl

  # Check cluster capacity
  kubectl exec my-kafka-cruise-control-xxx -- curl -s \
    "localhost:9090/kafkacruisecontrol/load?json=true"
This section provides comprehensive examples of KafkaUser resources with SCRAM authentication and role-based ACL configurations.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
name: app-producer
labels:
strimzi.io/cluster: my-kafka
spec:
authentication:
type: scram-sha-512
authorization:
type: simple
acls:
# Producer permissions for specific topics
- resource:
type: topic
name: orders
operations: [Write, Describe]
- resource:
type: topic
name: payments
operations: [Write, Describe]
# Schema Registry permissions (if using Confluent Schema Registry)
- resource:
type: topic
name: _schemas
operations: [Read, Write, Describe]
# Consumer group for monitoring/health checks
- resource:
type: group
name: app-producer-monitoring
operations: [Read]
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
name: app-consumer
labels:
strimzi.io/cluster: my-kafka
spec:
authentication:
type: scram-sha-512
authorization:
type: simple
acls:
# Consumer permissions for specific topics
- resource:
type: topic
name: orders
operations: [Read, Describe]
- resource:
type: topic
name: payments
operations: [Read, Describe]
# Consumer group permissions
- resource:
type: group
name: order-processing-service
operations: [Read]
- resource:
type: group
name: payment-processing-service
operations: [Read]
# Offset management
- resource:
type: topic
name: __consumer_offsets
operations: [Read]
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
name: kafka-admin
labels:
strimzi.io/cluster: my-kafka
spec:
authentication:
type: scram-sha-512
authorization:
type: simple
acls:
# Full cluster administration
- resource:
type: cluster
operations: [All]
# All topic operations
- resource:
type: topic
name: "*"
operations: [All]
# All consumer group operations
- resource:
type: group
name: "*"
operations: [All]
# Transaction operations (for exactly-once semantics)
- resource:
type: transactionalId
name: "*"
operations: [All]
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
name: data-engineer
labels:
strimzi.io/cluster: my-kafka
spec:
authentication:
type: scram-sha-512
authorization:
type: simple
acls:
# Read access to all data topics
- resource:
type: topic
name: data.*
patternType: prefix
operations: [Read, Describe]
# Write access to analytics topics
- resource:
type: topic
name: analytics.*
patternType: prefix
operations: [Write, Create, Describe]
# Consumer groups for data processing
- resource:
type: group
name: data-engineering.*
patternType: prefix
operations: [Read]
# Schema Registry access
- resource:
type: topic
name: _schemas
operations: [Read, Write, Describe]
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
name: monitoring-user
labels:
strimzi.io/cluster: my-kafka
spec:
authentication:
type: scram-sha-512
authorization:
type: simple
acls:
# Read-only access to all topics for monitoring
- resource:
type: topic
name: "*"
operations: [Read, Describe]
# Consumer groups for monitoring tools
- resource:
type: group
name: monitoring.*
patternType: prefix
operations: [Read]
# Cluster metadata access
- resource:
type: cluster
operations: [Describe]
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
name: developer
labels:
strimzi.io/cluster: my-kafka
spec:
authentication:
type: scram-sha-512
authorization:
type: simple
acls:
# Access only to development topics
- resource:
type: topic
name: dev.*
patternType: prefix
operations: [Read, Write, Create, Describe]
# Test topics access
- resource:
type: topic
name: test.*
patternType: prefix
operations: [Read, Write, Create, Delete, Describe]
# Development consumer groups
- resource:
type: group
name: dev.*
patternType: prefix
operations: [Read, Write, Create, Delete]
# Limited cluster access
- resource:
type: cluster
operations: [Describe]
Prefix patterns (`patternType: prefix`) have significant security implications:
- Overly Broad Access

  # ❌ DANGEROUS: Grants access to ALL topics starting with "app"
  - resource:
      type: topic
      name: app
      patternType: prefix
    operations: [Read, Write]

  # ✅ BETTER: More specific prefix
  - resource:
      type: topic
      name: app.orders.
      patternType: prefix
    operations: [Read, Write]

- Unintended Topic Access

  # ❌ PROBLEM: "user" prefix matches "user-data", "users", "user-profiles", etc.
  - resource:
      type: topic
      name: user
      patternType: prefix

  # ✅ SOLUTION: Use specific naming with delimiters
  - resource:
      type: topic
      name: user.
      patternType: prefix # Matches "user.profile", "user.settings", etc.

- Best Practices for Prefix Patterns
  - Use clear naming conventions with delimiters (`.`, `-`, `_`)
  - Test prefix patterns in development environments
  - Regularly audit ACL permissions
  - Prefer specific topic names over broad prefixes when possible
# Get the generated password for a KafkaUser
kubectl get secret app-producer -o jsonpath='{.data.password}' | base64 -d
# Get the SASL JAAS configuration
kubectl get secret app-producer -o jsonpath='{.data.sasl\.jaas\.config}' | base64 -d
# Complete connection details
kubectl get secret app-producer -o yaml
Java Application Example:
# application.properties
bootstrap.servers=my-kafka-kafka-bootstrap:9094
security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
username="app-producer" \
password="<password-from-secret>";
ssl.truststore.location=/path/to/truststore.jks
ssl.truststore.password=<truststore-password>
Python Application Example:
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['my-kafka-kafka-bootstrap:9094'],  # SCRAM listener port
    security_protocol='SASL_SSL',
    sasl_mechanism='SCRAM-SHA-512',
    sasl_plain_username='app-producer',
    sasl_plain_password='<password-from-secret>',
    ssl_cafile='/path/to/ca.crt',  # cluster CA certificate for server verification
)
Operation | Description | Typical Use Case |
---|---|---|
Read | Read messages from topics | Consumers |
Write | Write messages to topics | Producers |
Create | Create topics/consumer groups | Admin operations |
Delete | Delete topics/consumer groups | Admin operations |
Alter | Modify topic/group configurations | Admin operations |
Describe | Get metadata about resources | Monitoring, clients |
ClusterAction | Cluster-level operations | Admin operations |
All | All operations | Full admin access |
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
name: order-service-producer
labels:
strimzi.io/cluster: my-kafka
spec:
authentication:
type: scram-sha-512
authorization:
type: simple
acls:
# Only write to order events topic
- resource:
type: topic
name: order.events
operations: [Write, Describe]
# Transactional ID for exactly-once semantics
- resource:
type: transactionalId
name: order-service-tx
operations: [Write, Describe]
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
name: notification-service-consumer
labels:
strimzi.io/cluster: my-kafka
spec:
authentication:
type: scram-sha-512
authorization:
type: simple
acls:
# Only read from order events
- resource:
type: topic
name: order.events
operations: [Read, Describe]
# Specific consumer group
- resource:
type: group
name: notification-service
operations: [Read]
- Access Denied Errors

  # Check the user's ACLs
  kubectl describe kafkauser app-producer

  # Verify the topic exists and the user has access
  kubectl exec my-kafka-dual-role-0 -c kafka -- bin/kafka-acls.sh \
    --bootstrap-server localhost:9092 \
    --list --principal User:app-producer

- Consumer Group Authorization Failures

  # Verify consumer group ACLs
  kubectl exec my-kafka-dual-role-0 -c kafka -- bin/kafka-consumer-groups.sh \
    --bootstrap-server localhost:9092 --list

- Testing ACL Permissions

  # Test producer permissions
  kubectl exec my-kafka-dual-role-0 -c kafka -- bin/kafka-console-producer.sh \
    --bootstrap-server localhost:9092 \
    --topic test-topic \
    --producer.config /tmp/client.properties

  # Test consumer permissions
  kubectl exec my-kafka-dual-role-0 -c kafka -- bin/kafka-console-consumer.sh \
    --bootstrap-server localhost:9092 \
    --topic test-topic \
    --consumer.config /tmp/client.properties \
    --from-beginning
This section provides guidance on choosing between cert-manager and Strimzi's built-in Certificate Authority (CA) for TLS certificate management, including rotation flows and environment-specific recommendations.
Feature | Strimzi CA | Cert-Manager | Recommendation |
---|---|---|---|
Setup Complexity | Simple (built-in) | Moderate (external dependency) | Strimzi CA for simple setups |
Certificate Rotation | Automatic | Automatic | Both support auto-rotation |
External Trust | Manual trust distribution | Industry-standard CAs | Cert-Manager for external clients |
Multi-Cluster | Per-cluster CA | Centralized management | Cert-Manager for multi-cluster |
Compliance | Self-signed certificates | Trusted CA certificates | Cert-Manager for compliance |
Operational Overhead | Low | Medium | Strimzi CA for internal-only |
Client Configuration | Custom truststore required | Standard CA trust | Cert-Manager for ease of use |
kafkaCluster:
listeners:
- name: "tls"
port: 9093
type: internal
tls: true
authentication:
type: tls
# Uses Strimzi-generated CA by default
# Optional: Customize CA certificate validity
clusterCa:
renewalDays: 30 # Renew 30 days before expiration
validityDays: 365 # Certificate valid for 1 year
generateCertificateAuthority: true
clientsCa:
renewalDays: 30
validityDays: 365
generateCertificateAuthority: true
Automatic Rotation (Recommended):
kafkaCluster:
clusterCa:
generateCertificateAuthority: true
renewalDays: 30
validityDays: 365
# Automatic rotation enabled by default
clientsCa:
generateCertificateAuthority: true
renewalDays: 30
validityDays: 365
Manual Rotation Process:
# 1. Check current certificate status
kubectl get secret my-kafka-cluster-ca-cert -o yaml
# 2. Trigger manual rotation (if needed)
kubectl annotate kafka my-kafka strimzi.io/force-renew=true
# 3. Monitor rotation progress
kubectl get kafka my-kafka -o yaml | grep -A 10 status
# 4. Verify new certificates
kubectl get secret my-kafka-cluster-ca-cert -o jsonpath='{.data.ca\.crt}' | base64 -d | openssl x509 -text -noout
# Install cert-manager
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml
# Verify installation
kubectl get pods -n cert-manager
Let's Encrypt Production:
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: letsencrypt-prod
spec:
acme:
server: https://acme-v02.api.letsencrypt.org/directory
email: admin@example.com
privateKeySecretRef:
name: letsencrypt-prod
solvers:
- http01:
ingress:
class: nginx
- dns01:
route53:
region: us-east-1
accessKeyID: AKIAIOSFODNN7EXAMPLE
secretAccessKeySecretRef:
name: route53-credentials
key: secret-access-key
Private CA Issuer:
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: private-ca-issuer
spec:
ca:
secretName: private-ca-key-pair
kafkaCluster:
listeners:
- name: "external"
port: 9095
type: ingress
tls: true
authentication:
type: tls
configuration:
class: "nginx"
bootstrap:
host: "kafka.example.com"
annotations:
cert-manager.io/cluster-issuer: "letsencrypt-prod"
tls:
enabled: true
secretName: "kafka-bootstrap-tls" # Managed by cert-manager
brokers:
generateDynamic: true
maxBrokers: 3
hostPattern: "kafka-{broker}.example.com"
annotations:
cert-manager.io/cluster-issuer: "letsencrypt-prod"
tls:
enabled: true
secretName: "kafka-brokers-tls" # Managed by cert-manager
# Disable Strimzi CA for external listeners
clusterCa:
generateCertificateAuthority: false
secretName: "external-cluster-ca" # Provide your own CA
Bootstrap Certificate:
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: kafka-bootstrap-cert
namespace: kafka-system
spec:
secretName: kafka-bootstrap-tls
issuerRef:
name: letsencrypt-prod
kind: ClusterIssuer
dnsNames:
- kafka.example.com
- kafka-bootstrap.example.com
Broker Certificates (Template):
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: kafka-broker-0-cert
namespace: kafka-system
spec:
secretName: kafka-broker-0-tls
issuerRef:
name: letsencrypt-prod
kind: ClusterIssuer
dnsNames:
- kafka-0.example.com
- kafka-broker-0.example.com
graph TD
A[Certificate Expiry Approaching] --> B[Strimzi Operator Detects]
B --> C[Generate New CA Certificate]
C --> D[Update Cluster CA Secret]
D --> E[Rolling Restart Brokers]
E --> F[Update Client Certificates]
F --> G[Rotation Complete]
G --> H[Clients Auto-Reconnect]
H --> I[Verify New Certificates]
Monitoring Strimzi CA Rotation:
# Check CA certificate expiry
kubectl get secret my-kafka-cluster-ca-cert -o jsonpath='{.data.ca\.crt}' | \
base64 -d | openssl x509 -enddate -noout
# Monitor rotation events
kubectl get events --field-selector reason=CaCertRenewed
# Check operator logs
kubectl logs deployment/strimzi-cluster-operator -n strimzi-system
graph TD
A[Certificate Expiry Approaching] --> B[Cert-Manager Detects]
B --> C[Request New Certificate from CA]
C --> D[Update Certificate Secret]
D --> E[Ingress Controller Reloads]
E --> F[Kafka Brokers Reload TLS]
F --> G[Rotation Complete]
G --> H[Clients Use New Certificates]
H --> I[Verify Certificate Chain]
Monitoring Cert-Manager Rotation:
# Check certificate status
kubectl get certificates -n kafka-system
# Check certificate expiry
kubectl describe certificate kafka-bootstrap-cert -n kafka-system
# Monitor cert-manager logs
kubectl logs deployment/cert-manager -n cert-manager
# Check certificate events
kubectl get events --field-selector involvedObject.kind=Certificate
Recommendation: Strimzi CA
kafkaCluster:
clusterCa:
generateCertificateAuthority: true
validityDays: 90 # Shorter validity for dev
clientsCa:
generateCertificateAuthority: true
validityDays: 90
Rationale:
- ✅ Simple setup, no external dependencies
- ✅ Fast iteration and testing
- ✅ Self-contained environment
- ⚠️ Requires custom truststore configuration
Recommendation: Cert-Manager with Staging CA
kafkaCluster:
listeners:
- name: "external"
type: ingress
configuration:
bootstrap:
annotations:
cert-manager.io/cluster-issuer: "letsencrypt-staging"
Rationale:
- ✅ Tests production-like certificate management
- ✅ Validates cert-manager integration
- ✅ Uses the staging CA (higher rate limits)
- ✅ Prepares for production deployment
Recommendation: Cert-Manager with Production CA
kafkaCluster:
listeners:
- name: "external"
type: ingress
configuration:
bootstrap:
annotations:
cert-manager.io/cluster-issuer: "letsencrypt-prod"
# Additional production annotations
nginx.ingress.kubernetes.io/ssl-redirect: "true"
nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
Rationale:
- ✅ Trusted certificates for external clients
- ✅ Compliance with security policies
- ✅ Automatic renewal and rotation
- ✅ Industry-standard certificate management
- ⚠️ Additional operational complexity
kafkaCluster:
listeners:
# Internal listeners use Strimzi CA
- name: "tls-internal"
port: 9093
type: internal
tls: true
# Uses Strimzi CA (default)
# External listeners use cert-manager
- name: "external"
port: 9095
type: ingress
tls: true
configuration:
bootstrap:
annotations:
cert-manager.io/cluster-issuer: "letsencrypt-prod"
tls:
enabled: true
secretName: "kafka-external-tls" # Managed by cert-manager
# Keep Strimzi CA for internal communication
clusterCa:
generateCertificateAuthority: true
clientsCa:
generateCertificateAuthority: true
- Certificate Expiry

  # Check certificate validity
  kubectl get secret my-kafka-cluster-ca-cert -o jsonpath='{.data.ca\.crt}' | \
    base64 -d | openssl x509 -dates -noout

  # Force certificate renewal
  kubectl annotate kafka my-kafka strimzi.io/force-renew=true

- Client Trust Issues

  # Extract the CA certificate for the client truststore
  kubectl get secret my-kafka-cluster-ca-cert -o jsonpath='{.data.ca\.crt}' | \
    base64 -d > kafka-ca.crt

  # Create a Java truststore
  keytool -import -trustcacerts -alias kafka-ca -file kafka-ca.crt \
    -keystore kafka-truststore.jks -storepass changeit

- Certificate Not Issued

  # Check certificate status
  kubectl describe certificate kafka-bootstrap-cert

  # Check cert-manager logs
  kubectl logs deployment/cert-manager -n cert-manager

  # Check ACME challenge status
  kubectl get challenges

- DNS Validation Failures

  # Check DNS propagation
  dig kafka.example.com

  # Verify DNS01 solver configuration
  kubectl describe clusterissuer letsencrypt-prod
- Certificate Rotation
  - Set appropriate renewal periods (30 days before expiry)
  - Monitor certificate expiry dates
  - Test rotation procedures regularly
- Key Management
  - Protect private keys with appropriate RBAC
  - Use separate CAs for different environments
  - Implement certificate pinning where appropriate
- Monitoring
  - Set up alerts for certificate expiry
  - Monitor certificate rotation events
  - Validate certificate chains regularly
- Documentation
  - Document certificate management procedures
  - Maintain a certificate inventory
  - Document client configuration requirements
- Testing
  - Test certificate rotation in non-production
  - Validate client reconnection behavior
  - Test certificate validation failures
- Automation
  - Automate certificate deployment
  - Implement certificate monitoring
  - Automate client truststore updates
The chart includes pre-configured values for different environments:
- 3 replicas
- 4Gi memory per broker
- 100Gi storage per broker
- 1-hour log retention
- HPA disabled
- Basic monitoring
- 3 replicas (HPA: 3-8)
- 6Gi memory per broker
- 200Gi storage per broker
- 24-hour log retention
- HPA enabled
- Pod anti-affinity enabled
- Enhanced monitoring
- 5 replicas (HPA: 5-15)
- 8Gi memory per broker
- 500Gi storage per broker
- 7-day log retention
- HPA enabled with conservative scaling
- Pod anti-affinity enforced
- Comprehensive monitoring
- Multiple Connect clusters
- Enhanced security settings
# Deploy to non-production (cluster name: kafka-nonprod)
helm install kafka-nonprod . \
-f values-nonprod.yaml \
-n om-kafka \
--create-namespace
# Deploy to staging with custom overrides (cluster name: kafka-staging)
helm install kafka-staging . \
-f values-staging.yaml \
--set kafkaCluster.replicas=4 \
--set hpa.maxReplicas=10 \
-n om-kafka-staging \
--create-namespace
# Deploy to production with additional security (cluster name: kafka-prod)
helm install kafka-prod . \
-f values-prod.yaml \
--set kafkaCluster.config.auto.create.topics.enable=false \
--set kafkaCluster.authorization.type=simple \
-n om-kafka-prod \
--create-namespace
Run built-in connectivity tests:
# Test the deployment
helm test kafka-nonprod -n om-kafka
# View test logs
kubectl logs -n om-kafka kafka-nonprod-test-kafka-connection
- Create a test producer:

  kubectl run kafka-producer -ti --image=quay.io/strimzi/kafka:0.47.0-kafka-3.9.0 \
    --rm=true --restart=Never -- bin/kafka-console-producer.sh \
    --bootstrap-server om-kafka-cluster-kafka-bootstrap:9092 --topic test-topic

- Create a test consumer:

  kubectl run kafka-consumer -ti --image=quay.io/strimzi/kafka:0.47.0-kafka-3.9.0 \
    --rm=true --restart=Never -- bin/kafka-console-consumer.sh \
    --bootstrap-server om-kafka-cluster-kafka-bootstrap:9092 --topic test-topic --from-beginning
The chart automatically configures JMX Prometheus Exporter for:
- Kafka Brokers: Core Kafka metrics, JVM metrics, and custom business metrics
- Cruise Control: Rebalancing and optimization metrics
- Kafka Connect: Connector performance and health metrics
- Kafka Exporter: Additional Kafka-specific metrics
Import the provided Grafana dashboards:
# Import the Kafka cluster health dashboard
kubectl apply -f grafana/kafka-cluster-health.json
Add the following to your Prometheus configuration:
scrape_configs:
- job_name: 'kafka'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- Pods stuck in Pending state:
  - Check node affinity and tolerations
  - Verify nodegroup labels and taints
  - Ensure sufficient resources in the cluster
- External listeners not accessible:
  - Verify the ingress controller is running
  - Check DNS resolution for broker hosts
  - Validate TLS certificates
- Storage issues:
  - Verify the storage class exists
  - Check PVC creation and binding
  - Ensure sufficient storage quota
- Authentication failures:
  - Verify user secrets are created
  - Check ACL configurations
  - Validate certificate trust chains
# Check Kafka cluster status
kubectl get kafka -n om-kafka
# View Kafka cluster details
kubectl describe kafka om-kafka-cluster -n om-kafka
# Check pod logs
kubectl logs -n om-kafka om-kafka-cluster-kafka-0
# View Strimzi operator logs
kubectl logs -n strimzi-operator deployment/strimzi-cluster-operator
# Check external DNS records
kubectl logs -n external-dns deployment/external-dns
For production workloads, consider these optimizations:
- JVM Tuning:

  jvmOptions:
    xms: "4g"
    xmx: "4g"
    additionalOptions:
      - "-XX:+UseG1GC"
      - "-XX:MaxGCPauseMillis=20"
      - "-XX:InitiatingHeapOccupancyPercent=35"

- Kafka Configuration:

  config:
    num.io.threads: 16
    num.network.threads: 8
    socket.send.buffer.bytes: 102400
    socket.receive.buffer.bytes: 102400
    socket.request.max.bytes: 104857600
    num.replica.fetchers: 4

- Storage Optimization:

  storage:
    type: jbod
    volumes:
      - type: persistent-claim
        size: 1000Gi
        class: io1 # High-IOPS storage class
- Update the chart values:

  kafkaCluster:
    version: "3.9.0" # New version

- Apply the upgrade:

  helm upgrade kafka-nonprod . -f values-nonprod.yaml -n om-kafka

- Monitor the rolling update:

  kubectl get pods -n om-kafka -w
# Upgrade to a new chart version
helm upgrade kafka-nonprod . -f values-nonprod.yaml -n om-kafka
# Rollback if needed
helm rollback kafka-nonprod 1 -n om-kafka
# Uninstall the Helm release
helm uninstall kafka-nonprod -n om-kafka
# Clean up persistent volumes (if needed)
kubectl delete pvc -l strimzi.io/cluster=om-kafka-cluster -n om-kafka
# Remove the namespace
kubectl delete namespace om-kafka
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
The chart provides flexible metrics configuration through conditional ConfigMaps that are created only when needed, eliminating unnecessary overhead and keeping your `values.yaml` clean.
Instead of hardcoding 160+ lines of metrics configuration in `values.yaml`, ConfigMaps are created dynamically based on your metrics requirements:
# Optional standalone metrics configurations
kafkaMetrics:
enabled: true # Create Kafka JMX metrics ConfigMap
configMapName: kafka-metrics
configMapKey: kafka-metrics-config.yml
cruiseControlMetrics:
enabled: true # Create Cruise Control metrics ConfigMap
configMapName: cruise-control-metrics
configMapKey: cruise-control-metrics-config.yml
- Conditional Creation: ConfigMaps are created only when `enabled: true`
- Reduced Overhead: No hardcoded metrics configuration in `values.yaml`
- Flexible Configuration: Enable only the metrics you need
- Default Configurations: Production-ready JMX Prometheus Exporter patterns included
- Easy Management: Simple enable/disable per metrics type
| Metrics Type | Purpose | Default Enabled |
|---|---|---|
| `kafkaMetrics` | Kafka broker JMX metrics with comprehensive patterns | ✅ |
| `cruiseControlMetrics` | Cruise Control rebalancing metrics | ✅ |
Note: Kafka Exporter has built-in metrics collection and doesn't require a separate ConfigMap.
The metrics ConfigMaps automatically integrate with their respective components:
kafkaCluster:
metricsConfig:
enabled: true
configMapName: kafka-metrics # References kafkaMetrics ConfigMap
configMapKey: kafka-metrics-config.yml
cruiseControl:
metricsConfig:
enabled: true
configMapName: cruise-control-metrics # References cruiseControlMetrics ConfigMap
configMapKey: cruise-control-metrics-config.yml
This chart maintains high quality standards through comprehensive automated testing and security scanning:
- CodeQL Analysis: Automated security vulnerability scanning
- Checkov Security Scan: Infrastructure-as-Code security best practices validation
- SARIF Integration: Security findings uploaded to GitHub Security tab
- Template Validation: All Helm templates validated across multiple configurations
- Dependency Management: Automated Strimzi operator dependency updates
- Multi-Scenario Testing:
- Default configuration rendering
- Rack awareness enabled scenarios
- Node selector and affinity configurations
- Complex scheduling combinations
- Automated Versioning: Semantic versioning with automated bumps
- Chart Publishing: Automated chart packaging and publishing to GitHub Packages
- Release Notes: Auto-generated release documentation
- Template Coverage: 100% of templates tested in CI
- Security Rating: A+ security score with zero known vulnerabilities
- Maintainability: Clean, well-documented, and modular code structure
All workflows run on every pull request and main branch push, ensuring consistent quality and security standards.
This Helm chart is licensed under the Apache License 2.0. See the LICENSE file for details.
For issues and questions:
- Check the troubleshooting section
- Review Strimzi documentation
- Open an issue in the repository
- Contact the maintainers
Note: This chart is designed for production use with comprehensive configuration options. Always test in a non-production environment before deploying to production.