
kubelet not using full timestamp precision in its log records #414

Closed
LHozzan opened this issue Nov 14, 2024 · 23 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation.

Comments

@LHozzan

LHozzan commented Nov 14, 2024

What happened?

We frequently see our log processor detecting duplicated entries. The entries aren't really duplicated, but they are generated so fast that, from the log processor's perspective, they look duplicated, e.g.:

...
E1112 06:20:07.634790 2937 controller.go:195] "Failed to update lease" err="Put \"https://api.REDACTED.com/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-10-250-0-11.eu-central-1.compute.internal?timeout=10s\": read tcp 10.250.0.11:41222->REDACTED:443: read: connection reset by peer"
E1112 06:20:07.634790 2937 controller.go:195] "Failed to update lease" err="Put \"https://api.REDACTED.com/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-10-250-0-11.eu-central-1.compute.internal?timeout=10s\": read tcp 10.250.0.11:41222->REDACTED:443: read: connection reset by peer"
E1112 06:20:07.634790 2937 controller.go:195] "Failed to update lease" err="Put \"https://api.REDACTED.com/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-10-250-0-11.eu-central-1.compute.internal?timeout=10s\": read tcp 10.250.0.11:41222->REDACTED:443: read: connection reset by peer"
...

even though the events actually occur at distinct instants within a very short time period, e.g.:

...
E1112 06:20:07.634790012 2937 controller.go:195] "Failed to update lease" err="Put \"https://api.REDACTED.com/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-10-250-0-11.eu-central-1.compute.internal?timeout=10s\": read tcp 10.250.0.11:41222->REDACTED:443: read: connection reset by peer"
E1112 06:20:07.634790089 2937 controller.go:195] "Failed to update lease" err="Put \"https://api.REDACTED.com/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-10-250-0-11.eu-central-1.compute.internal?timeout=10s\": read tcp 10.250.0.11:41222->REDACTED:443: read: connection reset by peer"
E1112 06:20:07.634790167 2937 controller.go:195] "Failed to update lease" err="Put \"https://api.REDACTED.com/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-10-250-0-11.eu-central-1.compute.internal?timeout=10s\": read tcp 10.250.0.11:41222->REDACTED:443: read: connection reset by peer"
...

As you can see, in the first example (current state) kubelet has only microsecond precision. We would like to be able to configure kubelet so that log entries carry nanosecond precision (second example, goal state).
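
For illustration, here is a minimal Go sketch (not klog's actual implementation) of the difference between the two precisions; the formats below only approximate the header shown above:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Two events captured a few hundred nanoseconds apart.
	t1 := time.Now()
	t2 := t1.Add(77 * time.Nanosecond)

	// Microsecond precision (roughly what the current header shows):
	// the two events usually render identically.
	fmt.Println(t1.Format("15:04:05.000000"))
	fmt.Println(t2.Format("15:04:05.000000"))

	// Nanosecond precision (RFC3339Nano, the requested goal state):
	// the two events stay distinguishable.
	fmt.Println(t1.Format(time.RFC3339Nano))
	fmt.Println(t2.Format(time.RFC3339Nano))
}
```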

What did you expect to happen?

It should be possible to configure kubelet to use full-precision timestamps (RFC3339Nano) in its log records.

How can we reproduce it (as minimally and precisely as possible)?

This situation happens randomly, but most often on bigger Kubernetes nodes with a huge number of small pods during reconciliation (e.g. upgrading many pods in different namespaces at the same time).

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Client Version: v1.31.2
Kustomize Version: v5.4.2
Server Version: v1.30.4

Cloud provider

AWS, Azure

OS version

# On Linux:
$ cat /etc/os-release
ID=gardenlinux
NAME="Garden Linux"
PRETTY_NAME="Garden Linux 1592.1"
HOME_URL="https://gardenlinux.io"
SUPPORT_URL="https://github.com/gardenlinux/gardenlinux"
BUG_REPORT_URL="https://github.com/gardenlinux/gardenlinux/issues"
GARDENLINUX_CNAME=aws-gardener_prod-amd64-1592.1
GARDENLINUX_FEATURES=log,sap,ssh,_boot,_nopkg,_prod,_slim,base,server,cloud,aws,gardener
GARDENLINUX_VERSION=1592.1
GARDENLINUX_COMMIT_ID=ec945aa9
GARDENLINUX_COMMIT_ID_LONG=ec945aa995d0f08d64303ff6045b313b40b665fb
$ uname -a
Linux ip-10-250-0-11.eu-central-1.compute.internal 6.6.47-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.6.47-0gardenlinux0 (2024-08-19) x86_64 GNU/Linux

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@LHozzan LHozzan added the kind/bug Categorizes issue or PR as related to a bug. label Nov 14, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 14, 2024
@k8s-ci-robot

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.


@HirazawaUi

/sig node
/remove-kind bug
/kind feature

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. kind/feature Categorizes issue or PR as related to a new feature. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. kind/bug Categorizes issue or PR as related to a bug. labels Nov 14, 2024
@BenTheElder
Member

/sig instrumentation
/remove-sig node

this isn't really kubelet specific, it's more of a k8s.io/klog feature request

IMHO: I'd recommend disabling log de-duping on your logging provider for kubelet logs. If kubelet starts generating spurious duplicate logs, that is a bug. But logging in the format that has been in use for a decade now is expected behavior. Attempting to dedupe kubelet logs is probably a waste of resources aside from potentially hiding recurring events / errors.

@k8s-ci-robot k8s-ci-robot added sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. and removed sig/node Categorizes an issue or PR as relevant to SIG Node. labels Nov 18, 2024
@LHozzan
Author

LHozzan commented Nov 18, 2024

Hi @BenTheElder.

this isn't really kubelet specific, it's more of a k8s.io/klog feature request

Thank you for the correction. I would like to kindly ask you to move the issue to a better location. I will follow it.

IMHO: I'd recommend disabling log de-duping on your logging provider for kubelet logs. If kubelet starts generating spurious duplicate logs, that is a bug. But logging in the format that has been in use for a decade now is expected behavior.

Thank you for the feedback. Unfortunately, the timestamps are not precise enough for us to be sure whether these are really duplicated log entries or just identical log messages emitted by kubelet in quick succession.
So this is the first step - get a more precise timestamp - and then we can continue with the investigation.

Attempting to dedupe kubelet logs is probably a waste of resources aside from potentially hiding recurring events / errors.

I fully agree, but due to the missing timestamp precision, this is currently impossible.

@BenTheElder
Member

Thank you for the correction. I would like to kindly ask you to move the issue to a better location. I will follow it

https://github.com/kubernetes/klog

I fully agree, but due to the missing timestamp precision, this is currently impossible

Er ... no, I'm saying you should simply disable any attempt to deduplicate kubelet logs in the log collector. Any time kubelet produced pointless duplicate logs, that would be considered a bug to fix, and in any other case you're just spending compute time on hiding intentionally repeated messages. This would still be true with even more precise timestamps, since there's no guarantee about how slowly something is legitimately logged twice. It would be far less likely that the entire line is identical with nanosecond precision, but in that case it would still make sense to simply disable the log deduplication behavior in your pipeline outside of kubelet, because you are now expecting it will never detect a duplicate anyhow...

@LHozzan
Author

LHozzan commented Nov 19, 2024

https://github.com/kubernetes/klog

Thank you; unfortunately, I do not have the rights to do this. I would like to kindly ask you to move the issue there.

no, I'm saying you should simply disable any attempt to deduplicate kubelet logs in the log collector ...

Yes, this is our default state: no deduplication. If the same log record exists many times, the log collector attempts to deliver every occurrence. But the log collector (FluentBit) calculates a unique ID from the log record and attempts to store it in the DB (OpenSearch). And here is our case: if FluentBit attempts to store the "same" log record, OpenSearch rejects it, and after several attempts FluentBit drops it. In the meantime FluentBit keeps retrying the delivery, but it keeps failing, because the same log record already exists in OpenSearch and OpenSearch keeps rejecting it. As a consequence, the affected FluentBit instance's retry metric starts rising and Prometheus starts firing an alert, because FluentBit has a problem delivering log records.
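
For illustration, a rough Go sketch of why a content-derived document ID collides for byte-identical log lines (this is not FluentBit's actual ID algorithm, just the general idea of hashing the record content):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// docID derives a document ID from the record content only.
func docID(record string) string {
	sum := sha256.Sum256([]byte(record))
	return fmt.Sprintf("%x", sum[:8])
}

func main() {
	// Two distinct events whose rendered log lines are byte-identical
	// because the timestamp is truncated to microseconds.
	a := `E1112 06:20:07.634790 2937 controller.go:195] "Failed to update lease" ...`
	b := `E1112 06:20:07.634790 2937 controller.go:195] "Failed to update lease" ...`

	// Same bytes -> same ID -> the second document is rejected as a duplicate.
	fmt.Println(docID(a) == docID(b)) // true
}
```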

And here is another problem: there can be many different reasons why FluentBit retries a delivery, so the operations team must check every occurrence in the FluentBit logs to find out the reason for the retry.

The first step is simple - ask the Kubernetes community to improve the logging mechanism in kubelet and get more accurate data.

The second step will be:

  • if the events really do occur many times in quick succession, fine, we will get more precise log entries without any "fog"
  • if the events are really duplicated by kubelet (which you mentioned as a possibility too), I will open a new issue and cooperate on solving the problem

@dims
Member

dims commented Nov 19, 2024

/transfer klog

@k8s-ci-robot k8s-ci-robot transferred this issue from kubernetes/kubernetes Nov 19, 2024
@pohly

pohly commented Nov 19, 2024

Like @BenTheElder I also don't think that the time stamp should be used to make log entries unique. There is no guarantee that "time stamp" + "message" + "key/value" is unique. A log consumer which makes that assumption should add its own unique ID (for example, a running counter) instead.

@BenTheElder
Member

Yes, this is our default state: no deduplication. If the same log record exists many times, the log collector attempts to deliver every occurrence. But the log collector (FluentBit) calculates a unique ID from the log record and attempts to store it in the DB (OpenSearch). And here is our case: if FluentBit attempts to store the "same" log record, OpenSearch rejects it, and after several attempts FluentBit drops it. In the meantime FluentBit keeps retrying the delivery, but it keeps failing, because the same log record already exists in OpenSearch and OpenSearch keeps rejecting it. As a consequence, the affected FluentBit instance's retry metric starts rising and Prometheus starts firing an alert, because FluentBit has a problem delivering log records.

And here is another problem: there can be many different reasons why FluentBit retries a delivery, so the operations team must check every occurrence in the FluentBit logs to find out the reason for the retry.

That all sounds like bugs in the log pipeline; attempting to circumvent them by pushing new complexity into the logging used by all Kubernetes components for the better part of a decade doesn't seem like the best way, on balance, to handle this.

Like @BenTheElder I also don't think that the time stamp should be used to make log entries unique. There is no guarantee that "time stamp" + "message" + "key/value" is unique. A log consumer which makes that assumption should add its own unique ID (for example, a running counter) instead.

Right, this can be handled by inserting a monotonic counter based on the line number in the log pipeline without modifying kubelet / klog.
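
A conceptual sketch of that idea in Go, assuming the collector can tag each line with a sequence number before any content-based ID is computed (in FluentBit this would typically be done with a filter; the seq field name here is made up for illustration):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	// Read log lines from stdin and prepend a monotonically increasing
	// sequence number, so even byte-identical lines become unique
	// before any content-based ID is derived downstream.
	scanner := bufio.NewScanner(os.Stdin)
	var seq uint64
	for scanner.Scan() {
		seq++
		fmt.Printf("seq=%d %s\n", seq, scanner.Text())
	}
}
```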

Without more demand for this functionality, I don't know that it's a good tradeoff to implement this, and we certainly wouldn't change the default lightly because there is a lot of tooling out there parsing these logs.

The consistency of logs emitted by core kubernetes components across providers/installers/distros and the predictable ability to parse them correctly is a feature of its own currently.

@LHozzan
Author

LHozzan commented Nov 20, 2024

Thank you for your feedback. We have been using that timestamp precision so far without any problems, because it is very unlikely (yes, there is nothing "for sure") that two log records will be emitted at the same time, down to the same nanosecond. I mean, this is far less likely:

...
06:20:07.634790012
06:20:07.634790012
06:20:07.634790012
...

than this:

...
06:20:07.634790
06:20:07.634790
06:20:07.634790
...

So I am nearly sure that if this situation

...
06:20:07.634790012
06:20:07.634790012
06:20:07.634790012
...

happens, there is a very high chance of duplication on the kubelet or systemd-journald side.
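
A rough birthday-style estimate of that intuition (assuming, unrealistically, that the events in a burst are spread uniformly over the available timestamp slots):

```go
package main

import "fmt"

// collisionProb returns the probability that at least two of k events
// share the same timestamp slot, given n distinguishable slots.
func collisionProb(k int, n float64) float64 {
	p := 1.0
	for i := 0; i < k; i++ {
		p *= (n - float64(i)) / n
	}
	return 1 - p
}

func main() {
	const k = 3 // e.g. three lease-update failures in one burst

	// If the whole burst lands inside one microsecond, microsecond
	// timestamps give a single slot and a collision is certain.
	// Nanosecond timestamps give 1000 slots for the same burst:
	fmt.Printf("1µs burst, ns timestamps: %.4f\n", collisionProb(k, 1e3))

	// The same burst spread over one millisecond, still with ns timestamps:
	fmt.Printf("1ms burst, ns timestamps: %.2e\n", collisionProb(k, 1e6))
}
```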

Without more demand for this functionality, I don't know that it's a good tradeoff to implement this, and we certainly wouldn't change the default lightly because there is a lot of tooling out there parsing these logs.

I agree, the default timestamp format should stay as is, but if anybody else needs full timestamp precision, this feature would then be available.

@pohly

pohly commented Nov 20, 2024

👎 from me for adding more code and thus complexity to klog for something that I consider a workaround for a deficiency in the log consumer.

@dims
Member

dims commented Nov 20, 2024

I agree @pohly

@LHozzan
Author

LHozzan commented Nov 21, 2024

Hi guys.

It seems that you have a better idea of where the problem might be. So, let's jump back to the beginning and skip our logging stack; I used it only to locate the log records which it marked as duplicated.

I was on the node where the records were generated and I used something like this: journalctl --since "2024-11-20 05:00:00" --until "2024-11-20 07:00:00" -u kubelet.service.

[screenshots: KubeletTimestamp1, KubeletTimestamp2]

Are the marked log records really duplicated by kubelet, or are they two different events which were emitted very quickly? I would like to kindly ask you to explain your reasoning to me. It seems that I am missing something.

Thank you!

@pohly

pohly commented Nov 21, 2024

Probably these are two different records. We are working on contextual logging, which could be used to add more information about which stream was being watched. As it stands, the kubelet is watching several and they all go down at roughly the same time, thus the high chance that they get the same time stamp.
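
For reference, a minimal sketch of what contextual logging with k8s.io/klog/v2 could look like; the "stream" key and the values are hypothetical, not kubelet's actual code, but they show how per-stream key/value pairs would distinguish otherwise identical messages:

```go
package main

import (
	"context"
	"errors"

	"k8s.io/klog/v2"
)

func watchStream(ctx context.Context, name string) {
	// Each watcher derives a logger enriched with its own identity.
	logger := klog.FromContext(ctx).WithValues("stream", name)

	// When several streams fail at the same instant, the extra key/value
	// pair distinguishes the otherwise identical log lines.
	logger.Error(errors.New("connection reset by peer"), "Failed to update lease")
}

func main() {
	ctx := klog.NewContext(context.Background(), klog.Background())
	for _, s := range []string{"lease", "pods", "nodes"} {
		watchStream(ctx, s)
	}
	klog.Flush()
}
```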

@LHozzan
Author

LHozzan commented Nov 21, 2024

We are working on contextual logging, which could be used to add more information about which stream was being watched.

OK, thank you. Could you please share the issue where these activities are tracked? It seems that I missed it.

As it stands, the kubelet is watching several and they all go down at roughly the same time, thus the high chance that they get the same time stamp.

OK, thank you for the explanation. If timestamp precision will not be helpful in these situations, I am afraid that without adding another attribute to the log record, this will not be possible to solve. I mean adding something like an "event counter" or some kind of ID to differentiate each log record, no matter how quickly they are emitted.

@pohly

pohly commented Nov 21, 2024

Could you please share the issue where these activities are tracked?

kubernetes/enhancements#3077

I mean adding something like an "event counter" or some kind of ID to differentiate each log record, no matter how quickly they are emitted.

Just to be clear, we are not going to add anything like that to our log output. Each log entry as written is a new log entry.

@LHozzan
Author

LHozzan commented Nov 21, 2024

kubernetes/enhancements#3077

Thank you.

Just to be clear, we are not going to add anything like that to our log output. Each log entry as written is a new log entry.

I am curious - how can I, as an end user, be sure that no duplication is occurring? I do not want to be paranoid, but blindly trusting the output from klog isn't a good idea for me.

@pohly

pohly commented Nov 21, 2024

Why should duplication occur?

As a user, you have to trust the code that you are using and the people who wrote it. As a developer, you can review the source code.

@LHozzan
Author

LHozzan commented Nov 21, 2024

Why should duplication occur?

This is exactly the question I am trying to get answered in this issue. As you can see from my screenshots, as an end user I see identical log records and I have no bullet-proof evidence whether they are duplicated log records or not. As I mentioned previously, they could be log records from events which are generated very quickly, but again, I have no bullet-proof evidence either way.

Unfortunately, my Golang knowledge and experience are very limited. This is the reason why I opened the issue and am trying to get answers.

As a user, you have to trust the code that you are using and the people who wrote it.

In the current age, when projects can be compromised, I consider my paranoia a good starting point for being curious about how things work inside. But this is off topic, only to illustrate that a healthy amount of mistrust is required.

OK, let's get back to my case - are these records duplicated or not? If not, I would like to kindly ask you to provide bullet-proof evidence which I can confirm from my side.

@dims
Member

dims commented Nov 21, 2024

@LHozzan if you need something specific, go ahead and fork this project to suit your needs. @pohly has already responded to the original request with

👎 from me for adding more code and thus complexity to klog for something that I consider a workaround for a deficiency in the log consumer.

/close

@k8s-ci-robot

@dims: Closing this issue.

In response to this:

@LHozzan if you need something specific, go ahead and fork this project to suit your needs. @pohly has already responded to the original request with

👎 from me for adding more code and thus complexity to klog for something that I consider a workaround for a deficiency in the log consumer.

/close


@BenTheElder
Member

I would like to kindly ask you to provide bullet-proof evidence which I can confirm from my side.

This seems excessive.

The point is this library is not intended to be used by code that is generating pointless duplicates.
As discussed above, we have no reason to believe that kubelet is spamming logs; these are in fact different events (though they could do with more detail, and there is a broader effort to implement richer contextual logs).

If you have evidence that kubelet is logging multiple times for the same underlying event, please file a bug with details and we will work with the relevant owners to fix that.

But asking us to provide "bullet proof evidence" that this doesn't exist is an unreasonable request.

IMHO: There's no reason to deduplicate kubelet logs. Every log line from kubelet is meaningful, and on the off chance that it isn't, that is because it's evidence of a bug, and you should report the bug to the project so we can fix it.

As explained above, the "identical" logs are likely telling you that the same thing is happening repeatedly to different streams; deduplicating those is losing info, but adding nanoseconds isn't meaningfully adding any.

You can reasonably assume that every kubelet log is a unique entry/event and if something is logged repeatedly then it did in fact repeat.

@LHozzan
Author

LHozzan commented Nov 22, 2024

The point is this library is not intended to be used by code that is generating pointless duplicates.
As discussed above, we have no reason to believe that kubelet is spamming logs; these are in fact different events (though they could do with more detail, and there is a broader effort to implement richer contextual logs).

I provided proof that duplication is occurring. As a user, I really don't know what the source of the duplication is (kubelet itself? another library in the project? the environment? journald? ... or something totally different).
You only write that the library has no way of producing this duplication, without providing any proof from your side.

If you have evidence that kubelet is logging multiple times for the same underlying event, please file a bug with details and we will work with the relevant owners to fix that.

This is it! If I had it, I would definitely provide it. Again, as a user, I really don't know what the source of the duplication is.

But asking us to provide "bullet proof evidence" that this doesn't exist is an unreasonable request.

If that is what you think, then I am sure that any investigation of this problem is only a waste of time.

IMHO: There's no reason to deduplicate kubelet logs. Every log line from kubelet is meaningful, and on the off chance that it isn't, that is because it's evidence of a bug, and you should report the bug to the project so we can fix it.

Again, as a user, I really don't know what the source of the duplication is.

As explained above, the "identical" logs are likely telling you that the same thing is happening repeatedly to different streams; deduplicating those is losing info, but adding nanoseconds isn't meaningfully adding any.

I agree "are likely", but not sure.

I know that there can be parsers in other log collectors which expect a certain log format. This is the reason why I asked you to add a more precise timestamp. Fine, you are right, this again isn't bullet-proof, because events can happen within nanoseconds or even picoseconds. Instead of a counter-suggestion, you repeatedly only claim that this is "unlikely".

As explained above, the "identical" logs are likely telling you that the same thing is happening repeatedly to different streams; deduplicating those is losing info, but adding nanoseconds isn't meaningfully adding any.

I know something, somewhere, is wrong. Where and what?

You can reasonably assume that every kubelet log is a unique entry/event and if something is logged repeatedly then it did in fact repeat.

This was my assumption too, until I discovered duplicated log records.


OK, let's summarize it:

  • we discovered a problem, provided proof, and made a suggestion
  • we don't know where exactly the problem is
  • you are not willing to continue with the investigation, because you are sure that the Kubernetes project is flawless

Fine, for us this is actually only an annoyance, and we can wait for somebody else who has the same problems and merge our efforts to find the source of the problem. Of course, only if we are still using Kubernetes by that time, because:

if you need something specific, go ahead and fork this project to suit your needs. pohly has already responded to the original request with

is only a longer way of writing "take it or leave it". So, it is more convenient to wait for somebody else to solve the problem than to waste time. This was my first issue in the project, but after this experience, you can be sure that it will be the last!

I am willing to put in some effort only if the issue still exists, the problem is critical for us, and of course, if I am still using Kubernetes.

Have a nice day!
