## Description
I did some research based on the linked Cloudera and OpenShift documentation and measured the CPU and memory consumption of our operators / products.
I ran the [waterlevel](https://docs.stackable.tech/stackablectl/stable/demos/nifi-kafka-druid-water-level-data.html) demo.
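For reference, the demo should be installable via stackablectl (command as per the linked docs; adjust to your stackablectl version):

```bash
# Spins up all operators and products used by the water-level demo
stackablectl demo install nifi-kafka-druid-water-level-data
```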
- Install a metrics server (ping @maltesander if you need a metrics-server YAML that works with kind; I unfortunately cannot upload it here). One common approach is sketched below the list.
- Exec into a node and check running processes via `top`: `docker exec -it kind-worker2 bash`
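A minimal sketch of getting metrics-server to run on kind (not the exact YAML mentioned above; the extra flag is commonly needed because kind's kubelets use self-signed certificates):

```bash
# Install the upstream metrics-server manifests
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Let metrics-server talk to kind's self-signed kubelet certificates
kubectl patch -n kube-system deployment metrics-server --type=json \
  -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'
```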
I used kind in order to run everything locally on a laptop; the metrics server provides the CPU/memory columns below:
```
NAMESPACE NAME PF READY RESTARTS STATUS CPU MEM %CPU/R %CPU/L %MEM/R %MEM/L IP NODE AGE
default airflow-operator-deployment-55c59c64c9-9lr8f ● 1/1 0 Running 1 5 n/a n/a n/a n/a 10.244.2.2 kind-worker2 93m
default commons-operator-deployment-67bb7859bf-gwwrh ● 1/1 0 Running 0 8 n/a n/a n/a n/a 10.244.3.2 kind-worker 92m
default create-druid-ingestion-job-xm6wh ● 0/1 7 Completed 0 0 n/a n/a n/a n/a 10.244.3.9 kind-worker 86m
default create-nifi-ingestion-job-4x5xz ● 0/1 7 Completed 0 0 n/a n/a n/a n/a 10.244.1.10 kind-worker3 86m
default druid-broker-default-0 ● 1/1 0 Running 12 444 n/a n/a n/a n/a 10.244.3.13 kind-worker 81m
default druid-coordinator-default-0 ● 1/1 0 Running 49 484 n/a n/a n/a n/a 10.244.3.14 kind-worker 81m
default druid-historical-default-0 ● 1/1 0 Running 5 610 n/a n/a n/a n/a 10.244.3.12 kind-worker 81m
default druid-middlemanager-default-0 ● 1/1 0 Running 23 2088 n/a n/a n/a n/a 10.244.3.11 kind-worker 81m
default druid-middlemanager-default-1 ● 1/1 0 Running 15 2046 n/a n/a n/a n/a 10.244.1.17 kind-worker3 81m
default druid-operator-deployment-55dfc87cb5-58szg ● 1/1 0 Running 1 10 n/a n/a n/a n/a 10.244.1.2 kind-worker3 92m
default druid-router-default-0 ● 1/1 0 Running 5 249 n/a n/a n/a n/a 10.244.3.15 kind-worker 81m
default hbase-operator-deployment-7599b6cdd6-2zkqx ● 1/1 0 Running 0 4 n/a n/a n/a n/a 10.244.1.3 kind-worker3 92m
default hdfs-operator-deployment-7456467d65-w2gcs ● 1/1 0 Running 0 6 n/a n/a n/a n/a 10.244.2.3 kind-worker2 91m
default hive-operator-deployment-6d9d69b69c-prgp7 ● 1/1 0 Running 1 5 n/a n/a n/a n/a 10.244.3.3 kind-worker 91m
default kafka-broker-default-0 ● 2/2 0 Running 21 773 8 n/a 37 n/a 10.244.1.13 kind-worker3 86m
default kafka-operator-deployment-56fb5fdf9c-6jn4h ● 1/1 0 Running 0 7 n/a n/a n/a n/a 10.244.2.4 kind-worker2 91m
default minio-druid-7496648fdf-hb6d4 ● 1/1 0 Running 1 85 n/a n/a 4 n/a 10.244.3.8 kind-worker 87m
default nifi-node-default-0 ● 1/1 0 Running 180 3353 36 4 81 81 10.244.2.17 kind-worker2 86m
default nifi-operator-deployment-cff9497b5-kfxv8 ● 1/1 0 Running 0 6 n/a n/a n/a n/a 10.244.3.4 kind-worker 90m
default opa-operator-deployment-57bfbbc89c-v46jv ● 1/1 0 Running 0 4 n/a n/a n/a n/a 10.244.1.4 kind-worker3 90m
default postgresql-superset-0 ● 1/1 0 Running 5 35 2 n/a 13 n/a 10.244.2.10 kind-worker2 87m
default secret-operator-daemonset-7pdgb ● 3/3 0 Running 1 21 n/a n/a n/a n/a 10.244.2.5 kind-worker2 90m
default secret-operator-daemonset-868hw ● 3/3 0 Running 1 21 n/a n/a n/a n/a 10.244.1.5 kind-worker3 90m
default secret-operator-daemonset-mn2lz ● 3/3 0 Running 1 21 n/a n/a n/a n/a 10.244.3.5 kind-worker 90m
default setup-superset-hshsh ● 0/1 5 Completed 0 0 n/a n/a n/a n/a 10.244.3.10 kind-worker 86m
default spark-k8s-operator-deployment-887f994ff-t5hf6 ● 1/1 0 Running 0 5 n/a n/a n/a n/a 10.244.3.6 kind-worker 90m
default superset-6dx72 ● 0/1 0 Completed 0 0 n/a n/a n/a n/a 10.244.1.8 kind-worker3 86m
default superset-druid-connection-import-cgmdc ● 0/1 0 Completed 0 0 n/a n/a n/a n/a 10.244.1.16 kind-worker3 81m
default superset-node-default-0 ● 2/2 0 Running 4 174 n/a n/a n/a n/a 10.244.1.15 kind-worker3 82m
default superset-operator-deployment-7c4c46c5d-rgbsq ● 1/1 0 Running 0 7 n/a n/a n/a n/a 10.244.2.6 kind-worker2 89m
default trino-operator-deployment-cf4748586-d796c ● 1/1 0 Running 1 5 n/a n/a n/a n/a 10.244.1.6 kind-worker3 88m
default zookeeper-operator-deployment-76956ccffb-rrbr6 ● 1/1 0 Running 0 9 n/a n/a n/a n/a 10.244.1.7 kind-worker2 88m
default zookeeper-server-default-0 ● 1/1 0 Running 13 269 n/a n/a n/a n/a 10.244.1.11 kind-worker3 86m
```
This was captured after the cluster had stabilized. CPU is given in millicores, so a value of 1000 corresponds to one core. Memory is specified in MB.
The operators require almost no CPU/memory, even when reconciling. This can of course change if we have e.g. 1,000,000 NiFi clusters to reconcile, so the supported scale should be specified.
In general I would recommend about 1/5th or even 1/10th of a core and 50 to 100 MB of memory for each operator (this should be tested more rigorously); a sketch of how such limits could be applied follows.
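As an illustration, such defaults could be set on a running operator deployment like this (a sketch only: the deployment name is taken from the listing above, and the values reflect the rough recommendation, not verified limits):

```bash
# Request ~1/10 core and 64Mi, cap at ~1/5 core and 128Mi (illustrative values)
kubectl set resources deployment zookeeper-operator-deployment \
  --requests=cpu=100m,memory=64Mi \
  --limits=cpu=200m,memory=128Mi
```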
The biggest product consumers were NiFi with about 3.3 GB of memory and the two Druid MiddleManagers with about 2 GB of memory each.
`docker stats`:

```
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
5b13a847151c kind-control-plane 12.80% 1.012GiB / 31.25GiB 3.24% 20.2MB / 130MB 118MB / 12.2MB 299
7f32cf575b9b kind-worker3 8.13% 4.13GiB / 31.25GiB 13.21% 3.37GB / 1.39GB 6.75MB / 98.3kB 754
362844dd5007 kind-worker2 15.81% 4.882GiB / 31.25GiB 15.62% 2.56GB / 1.39GB 6.44MB / 98.3kB 464
63e79a8f56d5 kind-worker 5.71% 4.493GiB / 31.25GiB 14.38% 2.56GB / 46.6MB 3.4MB / 98.3kB 1257
```
Docker reports 31.25 GiB as the limit for each node (the overall memory of my machine), so the memory percentages should be multiplied by roughly 4 assuming each node would only get a quarter of that (minus OS overhead etc.): e.g. kind-worker2's 15.62% would correspond to about 62% of a ~7.8 GiB node.
I did some testing applying CRs / triggering reconciliation (via `docker exec -it kind-worker2 bash` and running `top`):
Applying a new cluster every 0.5 seconds, the operator's CPU usage did not exceed 2 percent:
```
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
22932 1000 20 0 1130012 22788 14256 S 2.0 0.1 0:00.37 stackable-zooke
```
Without the 0.5s sleep it did not exceed 7 percent:
```
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
24402 1000 20 0 1131640 24244 14048 S 7.3 0.1 0:00.51 stackable-zooke
```
Memory stayed unchanged.
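The load test was essentially a loop like the following (a reconstruction, not the exact script; it assumes the v1alpha1 `ZookeeperCluster` schema, which may differ per operator release):

```bash
#!/usr/bin/env bash
# Apply a fresh ZookeeperCluster CR every 0.5s while watching the operator in top.
# replicas is set to 0 so the load hits the operator/API server, not the kubelets.
# Depending on the operator release, a version/image field may also be required.
for i in $(seq 1 1000); do
  kubectl apply -f - <<EOF
apiVersion: zookeeper.stackable.tech/v1alpha1
kind: ZookeeperCluster
metadata:
  name: test-zk-$i
spec:
  servers:
    roleGroups:
      default:
        replicas: 0
EOF
  sleep 0.5 # remove for the unthrottled variant
done
```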
## What do others do?
- Cloudera: https://docs.cloudera.com/documentation/enterprise/release-notes/topics/hardware_requirements_guide.html
- OpenShift: https://docs.openshift.com/container-platform/3.11/install/prerequisites.html
They basically list recommended system requirements (CPU, memory, disk etc.) for each role of each product.
Additionally, they recommend e.g. increasing heap/memory depending on incoming connections etc. and sometimes provide hints on where to increase this. For disk they recommend standard disks vs. SSDs etc. depending on the product.
## What do we need to specify?

### Stackable Data Platform
I think by running some tests with multiple clusters, we can establish concrete values for CPU and memory for each operator and/or the whole SDP.
### Products
I think we cannot specify anything reliable for the products. We can specify e.g. the minimum requirements to run our demos, or a minimal example.
But we do not know the customers' data, queries etc., so we cannot give a proper estimation. I would refer here to the products and their own requirements/scaling documentation.
Cloudera e.g. lists hardware requirements (heap/memory, CPU, disk) for products.
I assume these come from experience / testing; I could not verify that the values are taken from any product website (e.g. HBase).
### Upper limits for resources / clusters
I tested with 1000 custom resources (ZooKeeper, but with replicas set to 0) applied one after another in a script (like the sketch above), and there were no issues (check the usages above). The API server is doing more work than the operator...
This should still be tested for every operator.
### Minimum requirements for SDP and demos
I think, at least for kind and what we are all testing every day, a laptop with 32 GB of memory (possibly even 16 GB) and 4 to 8 cores can easily run any demo we have produced so far.
### Upper limits - How many resources can a single operator instance support?
Looking at the CPU and memory usage of the ZooKeeper operator, I think it can handle more resources than will ever be present in any cluster.
But it probably makes sense to set an upper limit per cluster just to be on the safe side.
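If such a cap should actually be enforced rather than just documented, a Kubernetes object-count quota could express it (a sketch; the `count/...` resource name assumes the ZookeeperCluster CRD from above, and 100 is an arbitrary number):

```bash
# Limit the number of ZookeeperCluster CRs in a namespace (illustrative cap)
kubectl apply -f - <<EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: zookeeper-cluster-cap
  namespace: default
spec:
  hard:
    count/zookeeperclusters.zookeeper.stackable.tech: "100"
EOF
```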
## Recommended Documentation Layout
The operator and product requirements should be captured in a table per operator/product with the following columns (see the mock-up after the list):
- Role (not required for operators)
- Heap/Memory
- CPU
- Disk
- Other Recommendations
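A hypothetical mock-up of such a table (all values are placeholders, not measurements):

| Role | Heap/Memory | CPU | Disk | Other Recommendations |
| --- | --- | --- | --- | --- |
| broker | 1 GB heap / 2 GB | 1 core | 10 GB (SSD) | increase heap with the number of client connections |
| worker | 2 GB heap / 4 GB | 2 cores | 50 GB (standard disk) | scale with ingestion volume |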
## What is left TODO / Acceptance (must be further refined and split up into tickets)
### Operators
- Document operator system requirements #289
- Define/Discuss if the operator requirements should be provided per operator or for the whole SDP
- Optional: Test the upper limits of operators. Will the operator crash before the cluster ;-) ?
- Document the minimal/maximal requirements (per release - needs to be versioned) to be referred to from GTCs and other contract documents
### Products
- Define/Discuss the minimal requirements for each product. Is it sufficient to e.g. see a UI, or should there be a "task" running without crashing?
- Test e.g. the demos/stacks with limited resources to get closer to a real minimal requirement for the SDP cluster
- Test the defined minimal requirements for each of our products in combination with the SDP (this should probably be an epic itself?)
- Is there any rule of thumb for scaling? E.g. 10 clients -> 10GB memory etc.
- Document the minimal/maximal requirements (per release - needs to be versioned) to be referred to from GTCs and other contract documents
### Cluster
- Which k8s implementations can we support? Is there a feature set that we can say we need? (I.e. we need namespaces, a PV provider that can do read-write-many...)
- What should a cluster look like to benefit from HA (number of nodes etc.)?
- K8s autoscaling settings for resilience / failover / peaks?
- Suggested number of nodes and node size depending on the products / components?
### Misc
- The Stackable support team is willing to accept incoming support issues that run on clusters using the documented minimal requirements
- The documentation describes the supported operating system versions (at least for stackablectl), Kubernetes products and versions (managed and on-prem), and SDP component versions
Hopefully this gets merged properly with #258.