Feat: consider adopting procfs lib FS.Meminfo() for memory collector #2957

Closed
tjhop opened this issue Mar 15, 2024 · 10 comments

Comments

@tjhop
Contributor

tjhop commented Mar 15, 2024

As of #2952, the node exporter has been bumped to use procfs lib v0.13.0, which has a fix for safer meminfo parsing from /proc/meminfo. This means it's possible to move away from the custom meminfo parsing the node exporter currently does and use the updated library's parsing instead.

Considerations:
The node exporter memory collector's Update() func expects the platform implementations to return memory info as a map[string]float64. Even if we adopt the library's updated memory info parsing, we would then need to convert the struct into that map type. This can be done with a quick JSON Marshal/Unmarshal dance (playground), if we're willing to pull in encoding/json as a dependency. I'd really rather avoid manually/explicitly parsing out the struct fields, as it feels fragile and prone to breakage on procfs updates, so ideas welcome.
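
The Marshal/Unmarshal conversion mentioned above could look roughly like the following. This is a minimal sketch under stated assumptions, not the node exporter's code: `meminfo` is a hypothetical, abbreviated stand-in for the pointer-typed fields of `procfs.Meminfo`, and `structToMap` is an illustrative helper name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// meminfo is a hypothetical stand-in for procfs.Meminfo; the field set is
// abbreviated for illustration.
type meminfo struct {
	MemTotalBytes  *uint64
	MemFreeBytes   *uint64
	SwapTotalBytes *uint64
}

// structToMap round-trips v through JSON to obtain a map keyed by field name.
// Fields that are nil pointers encode as JSON null and are dropped.
func structToMap(v any) (map[string]float64, error) {
	raw, err := json.Marshal(v)
	if err != nil {
		return nil, err
	}
	// Decode into *float64 so that null can be told apart from 0.
	var decoded map[string]*float64
	if err := json.Unmarshal(raw, &decoded); err != nil {
		return nil, err
	}
	out := make(map[string]float64, len(decoded))
	for name, val := range decoded {
		if val != nil {
			out[name] = *val
		}
	}
	return out, nil
}

func main() {
	total, free := uint64(16<<30), uint64(4<<30)
	m, err := structToMap(meminfo{MemTotalBytes: &total, MemFreeBytes: &free})
	if err != nil {
		panic(err)
	}
	fmt.Println(m)
}
```

Note that with this approach a field renamed upstream would change a map key at runtime rather than fail the build.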

I'm willing to implement the changes if the concepts here are accepted 👍

tjhop changed the title from "Feat: considering adopting procfs lib FS.Meminfo() for memory collector" to "Feat: consider adopting procfs lib FS.Meminfo() for memory collector" on Mar 18, 2024
@discordianfish
Member

I don't think we should marshal them to a map, though. We should finally make the meminfo metrics follow the best practices more closely, e.g. using labels for metrics that can be summed up. For that, I'd suggest creating a new meminfo collector and deprecating the old one, then in the next major release enabling the new one by default and disabling the deprecated one.

@SuperQ wdyt?

@rexagod
Contributor

rexagod commented May 28, 2024

I can work on that if @SuperQ is +1, and @tjhop has no plans in the near future to take this up.

@tjhop
Contributor Author

tjhop commented May 29, 2024

Thanks @rexagod! I was mostly waiting on the green light to proceed; I'm still willing to take this on. However, I would be very happy/grateful if you would be willing to help review the PR once it's pushed, and/or open a PR against my branch if you want to collaborate more.

Initial thoughts/questions for feedback:

  • do we want a feature flag to toggle the new collector? I would think so

  • prometheus/procfs is clearly pretty *nix oriented, do we also convert to the proposed new metrics format for darwin/netbsd/openbsd? I would think so, for at least consistency reasons

  • similar question above for the meminfo numa collector -- should it also get normalized to the new format?

  • "I'd suggest creating a new meminfo collector and deprecate the old one" -- should the metrics stay in the memory_ subsystem namespace still?

  • "E.g using labels for metrics that can be summed up" -- this is a great idea. The metric naming docs provide the following guidance:

    As a rule of thumb, either the sum() or the avg() over all dimensions of a given metric should be meaningful (though not necessarily useful). If it is not meaningful, split the data up into multiple metrics. For example, having the capacity of various queues in one metric is good, while mixing the capacity of a queue with the current number of elements in the queue is not.

    With this in mind, how many metrics/labels do we want to have? Some metrics in the darwin/netbsd/openbsd meminfo collectors are counters, should they remain counters (and thus a separate metric)?

  • There's lots of downstream repos that will likely need to be updated to account for these changes (monitoring mixin rules, etc), and likely not all of them under the purview of the prometheus project itself. How to best communicate intended changes?

(sorry for the stream of consciousness, like I said, initial thoughts 🙃 )

@discordianfish
Member

> do we want a feature flag to toggle the new collector? I would think so

If it's a new collector, it can be disabled/enabled, so no 'feature flag' specifically.

> prometheus/procfs is clearly pretty *nix oriented, do we also convert to the proposed new metrics format for darwin/netbsd/openbsd? I would think so, for at least consistency reasons

If we can support the other OSes with the new collector, cool. If not, we can add support for that later.

> similar question above for the meminfo numa collector -- should it also get normalized to the new format?

If it fits the scope of the new collector, why not.

> I'd suggest creating a new meminfo collector and deprecate the old one -- should the metrics stay in the memory_ subsystem namespace still?

Yes, I'd say we make the collectors mutually exclusive so you can use the same metric names where it makes sense.

> With this in mind, how many metrics/labels do we want to have? Some metrics in the darwin/netbsd/openbsd meminfo collectors are counters, should they remain counters (and thus a separate metric)?

The general best practices apply, so yeah, we shouldn't mix counters and gauges. Only things where sum() makes sense should be labels in the same metric.

> There's lots of downstream repos that will likely need to be updated to account for these changes (monitoring mixin rules, etc), and likely not all of them under the purview of the prometheus project itself. How to best communicate intended changes?

That's why I suggest a new collector (and marking the old one deprecated eventually): downstream projects can still use the old one but get warnings that it is deprecated.

@rexagod
Contributor

rexagod commented Jun 8, 2024

I'd be happy to review your PR, @tjhop! Feel free to tag me there once it's up! Godspeed! 👋🏼

tjhop added a commit to tjhop/node_exporter that referenced this issue Jun 8, 2024
Part of prometheus#2957

Adds a new collector named `meminfo_procfs` that exposes memory metrics
in a format that attempts to be more in line with upstream conventions --
memory metrics are now exposed under a single metric named
`node_memory_bytes` that has a single label called `field`,
corresponding to the name of the field in `/proc/meminfo` that the value
represents. Label values for the `field` label are named according to
the struct field in the procfs.Meminfo struct, and the values always use
the byte-normalized counterpart fields in the procfs.Meminfo struct,
resulting in a transition such as the following:
`node_memory_Active_anon_bytes -> node_memory_bytes{field="ActiveAnon"}`

Notes:
currently linux only, as that is the focus of the procfs lib. once
consensus has been reached here on new metric name/labels/format, I can
expand coverage for darwin/openbsd/netbsd and forward-port the existing
meminfo collector for those platforms to use the updated format.

Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
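
The naming transition described in this commit message can be illustrated with a small sketch. It only formats strings; it does not use the Prometheus client library, and the helper names `legacyName` and `labeledSample` are hypothetical.

```go
package main

import "fmt"

// legacyName builds the per-field metric name used by the existing collector,
// e.g. key "Active_anon" -> node_memory_Active_anon_bytes.
func legacyName(key string) string {
	return "node_memory_" + key + "_bytes"
}

// labeledSample renders one sample of the proposed single metric as a text
// line, with the procfs.Meminfo field name carried in the `field` label.
func labeledSample(field string, value float64) string {
	return fmt.Sprintf("node_memory_bytes{field=%q} %g", field, value)
}

func main() {
	fmt.Println(legacyName("Active_anon"))
	fmt.Println(labeledSample("ActiveAnon", 1024))
}
```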
@SuperQ
Member

SuperQ commented Jun 9, 2024

> I'd really rather avoid manually/explicitly parsing out the struct fields as it feels fragile and prone to breakage on procfs updates, so ideas welcome.

This is actually quite intentional, and the recommended way to do things in Go. Struct breakage is explicit at compile time, so it's quite stable.

Dynamic mapping, while common and convenient for the developer, is fragile. I much prefer explicit struct-to-metric mapping, as is done in other collectors. For example, take a look at the xfrm collector. It appears verbose, but it's explicit and compile-time safe.

I don't see a major need to create a new collector. Just convert the existing dynamic mapping to an explicit mapping.
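
As a rough illustration of what an explicit struct-to-metric mapping might look like, here is a minimal sketch. It is not the xfrm collector or the eventual PR: `meminfo` is a hypothetical, abbreviated stand-in for `procfs.Meminfo`, and `sample` and `explicitSamples` are illustrative names.

```go
package main

import "fmt"

// meminfo is a hypothetical, abbreviated stand-in for procfs.Meminfo.
type meminfo struct {
	MemTotalBytes   *uint64
	MemFreeBytes    *uint64
	ActiveAnonBytes *uint64
}

// sample is an illustrative name/value pair, not a node exporter type.
type sample struct {
	Name  string
	Value float64
}

// explicitSamples names every struct field directly. If a field were renamed
// or removed in the struct, this function would stop compiling.
func explicitSamples(mi meminfo) []sample {
	var out []sample
	// The nil guards are needed before dereferencing the pointer fields.
	if mi.MemTotalBytes != nil {
		out = append(out, sample{"node_memory_MemTotal_bytes", float64(*mi.MemTotalBytes)})
	}
	if mi.MemFreeBytes != nil {
		out = append(out, sample{"node_memory_MemFree_bytes", float64(*mi.MemFreeBytes)})
	}
	if mi.ActiveAnonBytes != nil {
		out = append(out, sample{"node_memory_Active_anon_bytes", float64(*mi.ActiveAnonBytes)})
	}
	return out
}

func main() {
	total := uint64(8 << 30)
	for _, s := range explicitSamples(meminfo{MemTotalBytes: &total}) {
		fmt.Println(s.Name, s.Value)
	}
}
```

The cost is one stanza per field; the benefit is that a mismatch with the library's struct shows up as a build error instead of a missing or renamed metric.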

@discordianfish
Member

> I don't see a major need to create a new collector. Just convert the existing dynamic mapping to an explicit mapping.

Depends on whether we want to fix/change the metric names

@tjhop
Contributor Author

tjhop commented Jun 10, 2024

@SuperQ I've grown to agree with you since my last comment re: explicit struct mapping, and have taken that approach in the PR.

I'm happy to re-scope #3043 to just refactoring the existing meminfo collector while we further discuss whether or not to refactor the memory metrics and how to label them 👍

@SuperQ
Member

SuperQ commented Jun 11, 2024

Yea, let's just do the minimal migration and do any metric renaming as a separate task. Thanks!

tjhop added a commit to tjhop/node_exporter that referenced this issue Jun 12, 2024
Part of prometheus#2957

Prometheus' procfs lib supports collecting memory info and we're using a
new enough version of the lib that has it available, so this converts
the meminfo collector for Linux to use data from procfs lib instead. The
bits I've touched for darwin/openbsd/netbsd are with intent to preserve
the original struct implementation/backwards compatibility.

Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
tjhop added a commit to tjhop/node_exporter that referenced this issue Jun 14, 2024
SuperQ pushed a commit that referenced this issue Jul 14, 2024
* ref!: convert linux meminfo implementation to use procfs lib

Part of #2957

Prometheus' procfs lib supports collecting memory info and we're using a
new enough version of the lib that has it available, so this converts
the meminfo collector for Linux to use data from procfs lib instead. The
bits I've touched for darwin/openbsd/netbsd are with intent to preserve
the original struct implementation/backwards compatibility.

Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>

* fix: meminfo debug log unsupported value

Fixes:

```
ts=2024-06-11T19:04:55.591Z caller=meminfo.go:44 level=debug collector=meminfo msg="Set node_mem" memInfo="unsupported value type"
```

Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>

* fix: don't coerce nil Meminfo entries to 0, leave out if nil

Nil entries in procfs.Meminfo fields indicate that the value isn't
present on the system. Coercing those nil values to `0` introduces new
metrics that should not be present on such systems and can break some
queries.

Addresses PR feedback:
#3049 (comment)
#3049 (comment)

Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>

---------

Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
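
The difference between coercing nil to `0` and leaving the metric out can be sketched as follows. The helper names are hypothetical; only the pointer-typed (`*uint64`) field convention is taken from the commit message above.

```go
package main

import "fmt"

// valueOrZero shows the coercing behaviour the fix moved away from: an absent
// field becomes indistinguishable from a field that is really 0.
func valueOrZero(p *uint64) float64 {
	if p == nil {
		return 0
	}
	return float64(*p)
}

// valueOrSkip reports ok=false for an absent field so that the caller can
// leave the metric out entirely.
func valueOrSkip(p *uint64) (value float64, ok bool) {
	if p == nil {
		return 0, false
	}
	return float64(*p), true
}

func main() {
	var absent *uint64 // field not reported by this system
	zero := uint64(0)  // field reported with a real value of 0

	fmt.Println(valueOrZero(absent), valueOrZero(&zero)) // same value for both

	if _, ok := valueOrSkip(absent); !ok {
		fmt.Println("absent field: no metric emitted")
	}
	if v, ok := valueOrSkip(&zero); ok {
		fmt.Println("present field:", v)
	}
}
```

With the skipping variant, a query that depends on whether a series exists at all behaves the same as it did before the migration.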
@tjhop
Contributor Author

tjhop commented Aug 2, 2024

Circling back to this -- the node exporter has been updated to use procfs lib for the meminfo collector, so I believe the core of this issue is complete.

Are we ok with opening a new issue if/when it's time to discuss renaming the metrics?

tjhop closed this as completed Aug 15, 2024
v-zhuravlev pushed a commit to grafana/node_exporter that referenced this issue Nov 1, 2024
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>