Record pod Ready status, or all pod status conditions #32941

Open
gdvalle opened this issue May 8, 2024 · 15 comments
Labels: discussion needed, enhancement, receiver/k8scluster

Comments

@gdvalle commented May 8, 2024

Component(s)

receiver/k8scluster

Is your feature request related to a problem? Please describe.

I would like to be able to see a pod's Ready status in metrics.

Describe the solution you'd like

A named metric analogous to k8s.container.ready, like k8s.pod.ready.

Or, a more generic solution where all pod status conditions are represented, e.g. k8s.pod.status.condition. I might prefer this as it captures more states.

For either approach, I would like to include attributes for the detail fields (message, reason).

Describe alternatives you've considered

Using the existing k8s.container.ready metric. It's unfortunately not complete because of pod ReadinessGates.

Additional context

See #32933 for an impl of k8s.pod.ready, currently sans attributes.

If we can agree, I think the condition status metric makes more sense. I think we need agreement on how to represent it: other things in this receiver (e.g. k8s.pod.phase) use values to represent states, so 0=false, 1=true, 2=unknown could fit, but the ergonomics are a bit poor.
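
To make that concrete, a quick sketch (the type attribute key is hypothetical; the k8s.pod.phase values follow the receiver's documented mapping, 1=Pending through 5=Unknown):

k8s.pod.phase{} 2                               # 2 = Running
k8s.pod.status.condition{type:"Ready"} 1        # 1 = true under the 0/1/2 scheme
k8s.pod.status.condition{type:"PodScheduled"} 1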

gdvalle added the enhancement and needs triage labels on May 8, 2024
github-actions bot commented May 8, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@TylerHelmuth (Member)

This feels reasonable, especially if there is existing precedent in KSM.

Is PodStatus already watchable via the k8sobjectsreceiver?

TylerHelmuth added the discussion needed label and removed the needs triage label on May 10, 2024
@gdvalle (Author) commented May 15, 2024

Is PodStatus already watchable via the k8sobjectsreceiver?

Do you mean k8sclusterreceiver? It has existing watch support for Pod, which should cover the Status struct AFAIK.

As for k8sobjectsreceiver, I'm not certain, but glancing at the field selector language I'm not sure how you would get the Ready condition out of the conditions list.
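
For illustration, roughly what I mean (a sketch of a k8sobjects config; Kubernetes field selectors only cover a handful of scalar fields, and I'm not aware of any syntax that reaches into the status.conditions list):

receivers:
  k8sobjects:
    objects:
      - name: pods
        mode: pull
        # works: status.phase is a supported scalar field selector
        field_selector: status.phase=Running
        # there is no selector like status.conditions[type=Ready] to pull
        # the Ready condition out of the list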

@povilasv (Contributor)

I think, given we have k8s.container.ready, we could add k8s.pod.ready.

Regarding pod.status.condition: does Kube State Metrics have it? It would be interesting to see what they do around that and what kind of values they use.

@sirianni (Contributor)

I think we need agreement on how to represent it: other things in this receiver (e.g. k8s.pod.phase) use values to represent states, so 0=false, 1=true, 2=unknown could fit, but the ergonomics are a bit poor.

Please no.

See #29157 for a generic k8s.pod.state metric modeled after kube-state-metrics.

@sirianni (Contributor)

In general, I feel like the piecemeal and fragmented approach to Kubernetes metrics in OTel is going to make it impossible to have any vendors build meaningful user experiences on top of this dataset, ultimately delaying any possible standardization in this space.

@povilasv (Contributor)

In this case k8s.pod.state sounds to me more like a resource attribute than a metric? If so, IMO we can add it as a resource attribute.

Regarding:

I think we need agreement on how to represent it: other things in this receiver (e.g. k8s.pod.phase) use values to represent states, so 0=false, 1=true, 2=unknown could fit, but the ergonomics are a bit poor.

I guess you're against this enum approach? IMO this should then be added to the spec (how to correctly represent enums in OTel), so we can solve it once and for all.

@TylerHelmuth (Member)

In general, I feel like the piecemeal and fragmented approach to Kubernetes metrics in OTel is going to make it impossible to have any vendors build meaningful user experiences on top of this dataset, ultimately delaying any possible standardization in this space.

We have had an uptick in k8s component feature requests; it would be very nice to make progress on open-telemetry/semantic-conventions#1032

@sirianni (Contributor)

Thanks @TylerHelmuth. In general, my team has been happy with the metrics collected by kubeletstatsreceiver and how they are modeled. They are struggling significantly with the "state" metrics that come from k8sclusterreceiver. We are coming from a Datadog background.

@gdvalle (Author) commented May 20, 2024

I guess you're against this enum approach? IMO this should then be added to the spec (how to correctly represent enums in OTel), so we can solve it once and for all.

I made a bit of a side comment there. I personally feel the value-as-attribute approach to enums has the better user experience, and that's what we should favor, but the clear downside is extraneous metric volume when reporting zeroes. Machines use the numeric enum, but humans want the string.
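
To make the trade-off concrete (a sketch; the status attribute key is hypothetical):

# numeric enum: one compact series, but the value needs a decoder ring
k8s.pod.status.condition{type:"Ready"} 1

# value-as-attribute: readable, but emits zero-valued series for inactive states
k8s.pod.status.condition{type:"Ready", status:"true"} 1
k8s.pod.status.condition{type:"Ready", status:"false"} 0
k8s.pod.status.condition{type:"Ready", status:"unknown"} 0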

I think some of this can be resolved in spec, and some ideas are outlined in open-telemetry/opentelemetry-specification#1711, but given there's no consensus for some time, I'd suggest we move forward with existing patterns here. Whichever direction we go there's a need to review and refactor existing metrics in the collector.

More than we need something ideal, we need something usable.

In this case k8s.pod.state sounds to me more like a resource attribute than a metric? If so, IMO we can add it as a resource attribute.

I think the problem with resource attributes for that (as I understand it) is that they're effectively metadata, emitted to enrich an actual data point. In the pod state example, the state change itself is the point.

Can we move forward, regardless of the enum end-game? I think we need to choose whether we represent pod ready status as e.g. k8s.pod.status.condition (generic) or k8s.pod.ready (cherry-picked/named). I think that means I am proposing k8s.pod.status.condition be part of the k8s semconv. WDYT?

@povilasv (Contributor)

I agree on:

I'd suggest we move forward with existing patterns here. Whichever direction we go there's a need to review and refactor existing metrics in the collector.

I personally would be okay with k8s.pod.ready metric.

Regarding k8s.pod.status.condition, it would be interesting to see some example data points, to understand what kind of data it would actually produce and what Kube State Metrics does in this regard.

@gdvalle (Author) commented May 28, 2024

Regarding k8s.pod.status.condition, it would be interesting to see some example data points, to understand what kind of data it would actually produce and what Kube State Metrics does in this regard.

Whoops, thought I answered your question earlier, but apparently didn't hit send.
KSM seems to use named conditions, kube_pod_status_ready and kube_pod_status_scheduled, so they only support whatever someone bothers to implement.
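Roughly what KSM emits, for reference (one series per possible value, with the active one set to 1):

kube_pod_status_ready{namespace="default", pod="my-pod", condition="true"} 1
kube_pod_status_ready{namespace="default", pod="my-pod", condition="false"} 0
kube_pod_status_ready{namespace="default", pod="my-pod", condition="unknown"} 0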

The generic approach would take any condition and record points for it. Here's a sample from a GKE cluster:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: null
    message: 'Pod is in NEG "Key{\"k8s1-7929355b-foo-80-6e399202\",
      zone: \"foo-b\"}". NEG is not attached to any BackendService with health
      checking. Marking condition "cloud.google.com/load-balancer-neg-ready" to True.'
    reason: LoadBalancerNegWithoutHealthCheck
    status: "True"
    type: cloud.google.com/load-balancer-neg-ready
  - lastProbeTime: null
    lastTransitionTime: "2024-05-28T19:14:30Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-05-28T19:14:36Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-05-28T19:14:35Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-05-28T19:14:20Z"
    status: "True"
    type: PodScheduled

The enum is the status field, so Unknown, True, or False (docs).
We can use reason and message fields as attributes.
To illustrate:

k8s.pod.status.condition{type:"Ready", reason:"", message:""} 1
k8s.pod.status.condition{type:"cloud.google.com/load-balancer-neg-ready", reason:"LoadBalancerNegWithoutHealthCheck", message:'Pod is in NEG "Key{\"k8s1-7929355b-foo-80-6e399202\",
      zone: \"foo-b\"}". NEG is not attached to any BackendService with health
      checking. Marking condition "cloud.google.com/load-balancer-neg-ready" to True.'} 1

Edit: as pointed out below, message is probably too high cardinality for many systems, and has less value, so nix that.
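
For concreteness, here's a rough Go sketch (not the receiver's actual code; the metric name and attribute keys are the ones proposed above) of how each condition could be mapped to a data point with the collector's pdata API:

package sketch

import (
	"go.opentelemetry.io/collector/pdata/pmetric"
	corev1 "k8s.io/api/core/v1"
)

// conditionValue maps the Kubernetes condition status enum to the numeric
// convention discussed above (0=false, 1=true, 2=unknown).
func conditionValue(s corev1.ConditionStatus) int64 {
	switch s {
	case corev1.ConditionFalse:
		return 0
	case corev1.ConditionTrue:
		return 1
	default: // corev1.ConditionUnknown
		return 2
	}
}

// recordPodConditions appends one k8s.pod.status.condition data point per
// condition on the pod. message is omitted due to its unbounded cardinality;
// reason is kept as a machine-oriented attribute.
func recordPodConditions(ms pmetric.MetricSlice, pod *corev1.Pod) {
	m := ms.AppendEmpty()
	m.SetName("k8s.pod.status.condition")
	gauge := m.SetEmptyGauge()
	for _, cond := range pod.Status.Conditions {
		dp := gauge.DataPoints().AppendEmpty()
		dp.SetIntValue(conditionValue(cond.Status))
		dp.Attributes().PutStr("type", string(cond.Type))
		dp.Attributes().PutStr("reason", cond.Reason)
	}
}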

@povilasv (Contributor)

IMO the message field is a big no: too much cardinality, and the value is too huge. Prometheus and other systems won't handle this.

Also, this looks more like events to me, rather than a metric that tracks state changes?

Maybe instead it should be k8s.pod.status.last_condition or something like that, and it would show the last condition from the list?

@gdvalle (Author) commented May 29, 2024

IMO the message field is a big no: too much cardinality, and the value is too huge. Prometheus and other systems won't handle this.

I think it has potentially unbounded cardinality! Agree it's not a fit. I think reason is the field we want as it's machine-oriented.

Also, this looks more like events to me, rather than a metric that tracks state changes?

k8s.pod.ready represents the Ready condition. You could also emit these as events/logs, but I think they're far more useful as metrics, unless maybe you want to read the message field. The broader idea is that Ready isn't special, and we can represent all of these conditions without inventing a new metric name for each.

Maybe instead it should be k8s.pod.status.last_condition or something like that, and it would show the last condition from the list?

The k8scluster receiver works by watching pod resources, so we emit the current condition each time it changes. I don't think it makes sense to call it "last condition"?

@dfsdevops

Regarding k8s.pod.status.condition, it would be interesting to see some example data points, to understand what kind of data it would actually produce and what Kube State Metrics does in this regard.

Whoops, thought I answered your question earlier, but apparently didn't hit send. KSM seems to use named conditions, kube_pod_status_ready and kube_pod_status_scheduled, so they only support whatever someone bothers to implement.

What about something like each status having its own enableable metric? I like the idea of k8s.pod.status_ready being enabled by default, with other status conditions included optionally via config. Right now some gymnastics have to be performed to tell whether a running pod is actually fully ready, because you have to cross-reference the container metrics. As a user, having these is valuable in my mind: it could be useful for benchmarking, or just for getting a timeline of how long it takes pods to reach certain statuses over time.
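
Roughly what that could look like in the receiver config (a sketch; the metric names are the hypothetical ones from this comment, following the receiver's existing per-metric enabled flags):

receivers:
  k8s_cluster:
    metrics:
      k8s.pod.status_ready:
        enabled: true    # on by default
      k8s.pod.status_scheduled:
        enabled: false   # opt-in via config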
