 // PodSandboxStats provides the resource usage statistics for a pod.
+// The linux or windows field will be populated depending on the platform.
 message PodSandboxStats {
     // Information of the pod.
-    // Corresponds to PodRef in SummaryAPI
     PodSandboxAttributes attributes = 1;
-    // CPU usage gathered from the pod.
-    // Corresponds to Stats SummaryAPI CPUStats field
-    CpuUsage cpu = 2;
-    // Memory usage gathered from the pod.
-    // Corresponds to Stats SummaryAPI MemoryStats field
-    MemoryUsage memory = 3;
-    // TODO: do we want a start time field?
-    // The time at which data collection for the pod-scoped (e.g. network) stats was (re)started.
-    // int64 timestamp = 1;
-    // Stats of containers in the measured pod.
-    repeated ContainerStats containers
-    // Stats pertaining to CPU resources consumed by pod cgroup (which includes all containers' resource usage and pod overhead).
-    NetworkStats network = 4;
-    // Note the specific omission of VolumeStats and EphemeralStorage
-    // Each of these fields will be calculated Kubelet-level
-    // ProcessStats pertaining to processes.
-    ProcessStats process = 5;
+    // Stats from linux.
+    LinuxPodSandboxStats linux = 2;
+    // Stats from windows.
+    WindowsPodSandboxStats windows = 3;
 }

-// NetworkStats contains data about network resources.
-message NetworkStats {
-    // The time at which these stats were updated.
-    int64 timestamp = 1;
+// LinuxPodSandboxStats provides the resource usage statistics for a pod sandbox on linux.
+message LinuxPodSandboxStats {
+    // CPU usage gathered for the pod sandbox.
+    CpuUsage cpu = 1;
+    // Memory usage gathered for the pod sandbox.
+    MemoryUsage memory = 2;
+    // Network usage gathered for the pod sandbox
+    NetworkUsage network = 3;
+    // Stats pertaining to processes in the pod sandbox.
+    ProcessUsage process = 4;
+    // Stats of containers in the measured pod sandbox.
+    repeated ContainerStats containers = 5;
+}

-    // Stats for the default interface, if found
-    InterfaceStats default_interface = 2;
+// WindowsPodSandboxStats provides the resource usage statistics for a pod sandbox on windows
+message WindowsPodSandboxStats {
+    // TODO: Add stats relevant to windows.
+}

-    repeated InterfaceStats interfaces = 3;
+// NetworkUsage contains data about network resources.
+message NetworkUsage {
+    // The time at which these stats were updated.
+    int64 timestamp = 1;
+    // Stats for the default network interface.
+    NetworkInterfaceUsage default_interface = 2;
+    // Stats for all found network interfaces, excluding the default.
+    repeated NetworkInterfaceUsage interfaces = 3;
 }

-// InterfaceStats contains resource value data about interface.
-type InterfaceStats struct {
-    // The name of the interface
+// NetworkInterfaceUsage contains resource value data about a network interface.
+message NetworkInterfaceUsage {
+    // The name of the network interface.
     string name = 1;
     // Cumulative count of bytes received.
-    Uint64Value rx_bytes = 2;
+    UInt64Value rx_bytes = 2;
     // Cumulative count of receive errors encountered.
-    Uint64Value rx_errors = 2;
+    UInt64Value rx_errors = 3;
     // Cumulative count of bytes transmitted.
-    Uint64Value tx_bytes = 2;
+    UInt64Value tx_bytes = 4;
     // Cumulative count of transmit errors encountered.
-    Uint64Value tx_errors = 2;
+    UInt64Value tx_errors = 5;
 }

-// ProcessStats are stats pertaining to processes.
-message ProcessStats {
-    // Number of processes in the pod.
-    Uint64Value process_count = 1;
+// ProcessUsage are stats pertaining to processes.
+message ProcessUsage {
+    // The time at which these stats were updated.
+    int64 timestamp = 1;
+    // Number of processes.
+    UInt64Value process_count = 2;
 }
```
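
Because `rx_bytes` and `tx_bytes` in `NetworkInterfaceUsage` are cumulative counts, a consumer has to difference two samples to obtain a throughput figure. The sketch below is illustrative only. It models the proposed messages as plain Python dataclasses rather than generated protobuf classes, it assumes `timestamp` is expressed in nanoseconds (the diff does not state the unit), and the helper name `rx_rate_bytes_per_sec` is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NetworkInterfaceUsage:
    # Mirrors the proposed message; UInt64Value wrappers are modeled as Optional[int].
    name: str
    rx_bytes: Optional[int] = None
    rx_errors: Optional[int] = None
    tx_bytes: Optional[int] = None
    tx_errors: Optional[int] = None


@dataclass
class NetworkUsage:
    timestamp: int  # ASSUMPTION: nanoseconds; the unit is not given in the diff
    default_interface: Optional[NetworkInterfaceUsage] = None
    interfaces: List[NetworkInterfaceUsage] = field(default_factory=list)


def rx_rate_bytes_per_sec(prev: NetworkUsage, curr: NetworkUsage) -> Optional[float]:
    """Receive rate on the default interface between two samples, or None if unknown."""
    a, b = prev.default_interface, curr.default_interface
    if a is None or b is None or a.rx_bytes is None or b.rx_bytes is None:
        return None
    elapsed_ns = curr.timestamp - prev.timestamp
    delta = b.rx_bytes - a.rx_bytes
    if elapsed_ns <= 0 or delta < 0:
        # Time did not advance, or the counter was reset: no meaningful rate.
        return None
    return delta / (elapsed_ns / 1e9)
```

Returning `None` on a counter reset, instead of a negative rate, keeps a restarted sandbox from producing a misleading spike; whether the Kubelet would handle resets this way is not specified here.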
@@ -561,18 +510,18 @@ The table above describes the various metrics that are in this endpoint.
 Each compliant CRI implementation must:
 - Have a location broadcast about where these metrics can be gathered from. The endpoint name need not be `/metrics/cadvisor`, nor must the metrics be gathered from the same port as they were from cAdvisor.
 - Implement *all* metrics within the set of metrics that are decided on.
-  - **TODO** How will we decide this set? We could support all, or take polls from the community and come up with a set of sufficiently useful metrics.
+    - **TODO** How will we decide this set? We could support all, or take polls from the community and come up with a set of sufficiently useful metrics.
 - Pass a set of tests in the critest suite that verify they report the correct values for *all* supported metrics labels (to ensure continued conformance and standardization).

 Below is the proposed strategy for doing so:

 1. The Alpha release will strictly cover research, performance testing and the creation of conformance tests.
-  - Initial research on the set of metrics required should be done. This will, possibly, allow the community to declare metrics that are not required to be moved to the CRI implementations.
-  - Testing on how performant cAdvisor+Kubelet are today should be done, to find a target, acceptable threshold of performance for the CRI implementations
-  - Creation of tests verifying the metrics are reported correctly should be created and verified with the existing cAdvisor implementation.
+    - Initial research on the set of metrics required should be done. This will, possibly, allow the community to declare metrics that are not required to be moved to the CRI implementations.
+    - Testing on how performant cAdvisor+Kubelet are today should be done, to find a target, acceptable threshold of performance for the CRI implementations
+    - Creation of tests verifying the metrics are reported correctly should be created and verified with the existing cAdvisor implementation.
 2. For the Beta release, add initial support for CRI implementations to report these metrics
-  - This set of metrics will be based on the research done in alpha
-  - Each will be validated against the conformance and performance tests created in alpha.
+    - This set of metrics will be based on the research done in alpha
+    - Each will be validated against the conformance and performance tests created in alpha.
 3. For the GA release, the CRI implementation should be the source of truth for all pod and container level metrics that external parties rely on (no matter how many endpoints the Kubelet advertises).

 #### cAdvisor
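
The hunk above requires each CRI implementation to expose every metric in the agreed set and to pass conformance tests. As a rough sketch of what the simplest such check could look like, the code below extracts metric names from a Prometheus text-format payload and reports which required names are absent. The function names are hypothetical, the required set is a placeholder (the KEP still marks that decision as a TODO), and this is not the critest implementation.

```python
from typing import Iterable, Set


def exposed_metric_names(exposition: str) -> Set[str]:
    """Collect the metric names found in a Prometheus text-format payload."""
    names: Set[str] = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Sample lines look like `name{label="v"} 1.0` or `name 1.0`.
        parts = line.split("{", 1)[0].split()
        if parts:
            names.add(parts[0])
    return names


def missing_metrics(required: Iterable[str], exposition: str) -> Set[str]:
    """Return the required metric names that the payload does not expose."""
    return set(required) - exposed_metric_names(exposition)
```

A real conformance test would also have to compare label sets and values, which this name-only check does not attempt.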
@@ -618,7 +567,7 @@ As a requirement for the Beta stage, cAdvisor must support optionally collecting
 ### Version Skew Strategy

 - Breaking changes between versions will be mitigated by the FeatureGate.
-  - By the time the FeatureGate is deprecated, it is expected the transition between CRI and cAdvisor is complete, and CRI has had at least one release to expose the required metrics (to allow for `n-1` CRI skew).
+    - By the time the FeatureGate is deprecated, it is expected the transition between CRI and cAdvisor is complete, and CRI has had at least one release to expose the required metrics (to allow for `n-1` CRI skew).
 - In general, CRI should be updated in tandem with or before the Kubelet.

 ## Production Readiness Review Questionnaire
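
One way to read the version skew strategy in the hunk above is that the stats source is chosen by the feature gate, with cAdvisor remaining the fallback while an `n-1` runtime may not yet expose the required metrics. The sketch below illustrates that reading only. The function name, the `cri_reports_required` flag, and the fallback behavior are all assumptions and are not taken from the KEP or the Kubelet.

```python
from typing import Callable, Dict

Stats = Dict[str, int]


def select_stats_provider(
    gate_enabled: bool,
    cri_reports_required: bool,
    from_cri: Callable[[], Stats],
    from_cadvisor: Callable[[], Stats],
) -> Callable[[], Stats]:
    """Pick the stats source for the Kubelet (hypothetical helper).

    ASSUMPTION: cAdvisor stays the fallback unless the feature gate is on
    and the runtime is new enough to report the required metrics.
    """
    if gate_enabled and cri_reports_required:
        return from_cri
    return from_cadvisor
```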
@@ -775,13 +724,13 @@ operations covered by [existing SLIs/SLOs]?**
 Think about adding additional work or introducing new steps in between
 (e.g. need to do X to start a container), etc. Please describe the details.
 - The process of collecting and reporting the metrics should not differ too much between cAdvisor and the CRI implementation:
-  - At a high level, both need to watch the changes to the stats (from cgroups, disk and network stats)
-  - Once collected, the CRI implementation will need to report them (both through the CRI and eventually through the prometheus endpoint).
-  - Both of these steps are already done by cAdvisor, so the work is changing hands, but not fundamentally changing.
+    - At a high level, both need to watch the changes to the stats (from cgroups, disk and network stats)
+    - Once collected, the CRI implementation will need to report them (both through the CRI and eventually through the prometheus endpoint).
+    - Both of these steps are already done by cAdvisor, so the work is changing hands, but not fundamentally changing.
 - It is possible the Alpha iteration of this KEP may affect CPU/memory usage on the node:
     - This may come because cAdvisor's performance has been fine-tuned, and changing the location of work may lose some optimizations.
-  - However, it is explicitly stated that a requirement for transition from Alpha->Beta is little to no performance degradation.
-  - The existence of the feature gate will allow users to mitigate this potential blip in performance (by not opting-in).
+    - However, it is explicitly stated that a requirement for transition from Alpha->Beta is little to no performance degradation.
+    - The existence of the feature gate will allow users to mitigate this potential blip in performance (by not opting-in).
 ***Will enabling / using this feature result in non-negligible increase of
 resource usage (CPU, RAM, disk, IO, ...) in any components?**
 - It most likely will reduce resource utilization. Right now, there is duplicate work being done between CRI and cAdvisor.
@@ -818,6 +767,6 @@ Note: This is by design as this will enable to decouple runtime implementation d
 ## Alternatives

 - Instead of teaching CRI how to do *everything* cAdvisor does, we could instead have cAdvisor not do the work the CRI stats end up doing (specifically when reporting disk stats, which are the most expensive operation to report).
-  - However, this doesn't address the anti-pattern of having multiple parties confusingly responsible for a wide array of metrics and other issues described.
+    - However, this doesn't address the anti-pattern of having multiple parties confusingly responsible for a wide array of metrics and other issues described.
 - Have cAdvisor implement the summary API. A cAdvisor daemonset could be a drop-in replacement for the summary API.
 - Don't keep supporting the summary API. Replace it with a "better" format, like prometheus. Or help users migrate to equivalent APIs that container runtimes already expose for monitoring.