Getting messages "was collected before with the same name and label values" #135

Open
Zubrania opened this issue Apr 13, 2018 · 16 comments

@Zubrania

Hello

I'm getting messages like "was collected before with the same name and label values" in /var/log/messages, and the file is constantly growing.

[root]# lustre_exporter --version
lustre_exporter, version 2.0.0 (branch: HEAD, revision: 6177537)
build user: prometheus@rc-lustre-oss-2.dev.net
build date: 20171205-22:27:17
go version: go1.8.3

@joehandzik
Contributor

@Zubrania Do you have any information about your cluster configuration? Is this happening on all nodes, or just some of them? Also, what version of Lustre are you using? Older versions are pretty tied to the 1.0.0 release that we have.

@Zubrania
Author

@joehandzik The cluster consists of 2 MGS/MDS servers and 4 OSS servers.
The Lustre version is 2.10.3, on RHEL 7.4.

@erijpkema

I'm also seeing this with Lustre version 2.10.4 on RHEL 7.5.
We're running 3 filesystems with 2 MGS/MDS and 4 OSS servers each.
Example from an OST:

>  was collected before with the same name and label values
* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:".0" > label:<name:"operation" value:"destroy" > label:<name:"target" value:"dh2-OST0017" > counter:<value:0 >  was collected before with the same name and label values
* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:".0" > label:<name:"operation" value:"create" > label:<name:"target" value:"dh2-OST0017" > counter:<value:0 >  was collected before with the same name and label values
* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:".0" > label:<name:"operation" value:"get_info" > label:<name:"target" value:"dh2-OST0017" > counter:<value:0 >  was collected before with the same name and label values
* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:".0" > label:<name:"operation" value:"set_info" > label:<name:"target" value:"dh2-OST0017" > counter:<value:0 >  was collected before with the same name and label values
* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:".0" > label:<name:"operation" value:"quotactl" > label:<name:"target" value:"dh2-OST0017" > counter:<value:0 >  was collected before with the same name and label values

And from an MDT:

0 >  was collected before with the same name and label values
* collected metric lustre_job_stats_total label:<name:"component" value:"mdt" > label:<name:"jobid" value:".0" > label:<name:"operation" value:"getxattr" > label:<name:"target" value:"dh1-MDT0000" > counter:<value:0 >  was collected before with the same name and label values
* collected metric lustre_job_stats_total label:<name:"component" value:"mdt" > label:<name:"jobid" value:".0" > label:<name:"operation" value:"setxattr" > label:<name:"target" value:"dh1-MDT0000" > counter:<value:0 >  was collected before with the same name and label values
* collected metric lustre_job_stats_total label:<name:"component" value:"mdt" > label:<name:"jobid" value:".0" > label:<name:"operation" value:"statfs" > label:<name:"target" value:"dh1-MDT0000" > counter:<value:0 >  was collected before with the same name and label values
* collected metric lustre_job_stats_total label:<name:"component" value:"mdt" > label:<name:"jobid" value:".0" > label:<name:"operation" value:"sync" > label:<name:"target" value:"dh1-MDT0000" > counter:<value:0 >  was collected before with the same name and label values
* collected metric lustre_job_stats_total label:<name:"component" value:"mdt" > label:<name:"jobid" value:".0" > label:<name:"operation" value:"samedir_rename" > label:<name:"target" value:"dh1-MDT0000" > counter:<value:0 >  was collected before with the same name and label values
* collected metric lustre_job_stats_total label:<name:"component" value:"mdt" > label:<name:"jobid" value:".0" > label:<name:"operation" value:"crossdir_rename" > label:<name:"target" value:"dh1-MDT0000" > counter:<value:0 >  was collected before with the same name and label values

I'm running the exporter with the following command:
/usr/local/prometheus/lustre_exporter --collector.ost=core --collector.mdt=core --collector.mgs=extended --collector.generic=core

It was built today from the git source in the golang:1.9-stretch Docker image. (Docker was only used for building.)

@wutz
Contributor

wutz commented Oct 31, 2018

I noticed that the metric label is jobid=.0, which causes the reported error: the process name is missing.
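For readers unfamiliar with the message: it appears when one scrape yields two samples with the same metric name and identical label values. Below is a minimal Python sketch of that uniqueness check. It is not the exporter's or the Prometheus client library's code. The sample data is made up to mirror the log lines above, and the idea that two job_stats entries both end up as jobid ".0" is an assumption used only for illustration.

```python
from collections import Counter


def find_duplicate_series(samples):
    """Return the (name, labels) keys that occur more than once in one scrape."""
    counts = Counter(
        (name, tuple(sorted(labels.items()))) for name, labels, _value in samples
    )
    return [key for key, n in counts.items() if n > 1]


# Hypothetical scrape: two entries whose job IDs both collapse to ".0".
samples = [
    ("lustre_job_stats_total",
     {"component": "ost", "jobid": ".0", "operation": "create",
      "target": "dh2-OST0017"}, 0),
    ("lustre_job_stats_total",
     {"component": "ost", "jobid": ".0", "operation": "create",
      "target": "dh2-OST0017"}, 0),
    ("lustre_job_stats_total",
     {"component": "ost", "jobid": ".0", "operation": "destroy",
      "target": "dh2-OST0017"}, 0),
]

duplicates = find_duplicate_series(samples)
for name, labels in duplicates:
    print(name, dict(labels))
```

Only the "create" series is flagged here, because it is the only one that appears twice with the same label values.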

@ldd91

ldd91 commented Mar 21, 2019

I'm also seeing this with Lustre version 2.12.0 on RHEL 7.5, but only on the MDS node; the OSS node is normal.

@ldd91

ldd91 commented Mar 21, 2019

@wutzx Have you solved this problem?

@wutz
Contributor

wutz commented Mar 21, 2019

@ldd91 You can clone https://github.com/wutzx/lustre_exporter, which includes my PR, and build it.

@ldd91

ldd91 commented Mar 21, 2019

@wutzx I cloned your PR and built it, but hit an error:
[root@k8sv2node1 lustre_exporter]# make

formatting code
linting code
WARNING: Linters are now vendored by default, --update ignored. The original
behaviour can be re-enabled with --no-vendored-linters.

To request an update for a vendored linter file an issue at:
https://github.com/alecthomas/gometalinter/issues/new

WARNING: deadline exceeded by linter vetshadow (try increasing --deadline)
WARNING: deadline exceeded by linter varcheck (try increasing --deadline)
WARNING: deadline exceeded by linter interfacer (try increasing --deadline)
make: *** [gometalinter] Error 2
[root@k8sv2node1 lustre_exporter]# ll
total 388
-rw-r--r-- 1 root root 18526 Mar 21 17:51 CHANGELOG.md
-rw-r--r-- 1 root root 2428 Mar 21 17:51 Gopkg.lock
-rw-r--r-- 1 root root 731 Mar 21 17:51 Gopkg.toml
-rw-r--r-- 1 root root 11357 Mar 21 17:51 LICENSE
-rw-r--r-- 1 root root 6488 Mar 21 17:51 lustre_exporter.go
-rw-r--r-- 1 root root 312818 Mar 21 17:51 lustre_exporter_test.go
-rw-r--r-- 1 root root 2051 Mar 21 17:51 Makefile
drwxr-xr-x 4 root root 4096 Mar 21 17:51 proc
-rw-r--r-- 1 root root 2896 Mar 21 17:51 README.md
drwxr-xr-x 2 root root 4096 Mar 21 17:51 sources
drwxr-xr-x 3 root root 4096 Mar 21 17:51 sys
drwxr-xr-x 2 root root 4096 Mar 21 17:51 systemd
drwxr-xr-x 5 root root 4096 Mar 21 17:51 vendor
-rw-r--r-- 1 root root 6 Mar 21 17:51 VERSION

@wutz
Contributor

wutz commented Mar 21, 2019 via email

@ldd91

ldd91 commented Mar 21, 2019

@wutzx Thank you, I tried it and it works.

@lszentannai

Hi,

I still have this problem using commit 6177537, running Lustre 2.10.5 on CentOS 7.5.
It happens whether I use procname_uid or SLURM_JOB_ID.

Is there any fix for this problem?

Thanks,
Lorand Szentannai

@wutz
Contributor

wutz commented Apr 30, 2019

@lszentannai You can try my fork https://github.com/wutz/lustre_exporter

@lszentannai

@wutz Thanks for the quick reply. I tried your fork too, with the same result.

@wutz
Contributor

wutz commented Apr 30, 2019

You can run grep job_id /proc/fs/lustre/obdfilter/*/job_stats to list all job IDs, and check whether they match the regexp at:

https://github.com/HewlettPackard/lustre_exporter/pull/137/files#diff-fde95e813ded08bf1be0acad8e83c4cfR665
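As a complement to the grep above, here is a small Python sketch that counts how often each job_id value occurs in a job_stats dump, so repeated or unusual IDs stand out. The sample text, the pattern JOB_ID_LINE and the helper job_id_counts are illustrative assumptions: this is not the regexp from the linked PR, and real job_stats entries carry more fields than shown.

```python
import re
from collections import Counter

# Illustrative job_stats excerpt (assumed layout); real files live under
# /proc/fs/lustre/obdfilter/*/job_stats and have more fields per entry.
SAMPLE = """\
job_stats:
- job_id:          293395
  snapshot_time:   1556622106
- job_id:          .0
  snapshot_time:   1556622107
- job_id:          293395
  snapshot_time:   1556622108
"""

# Hypothetical pattern: one "job_id:" line, value captured as non-whitespace.
JOB_ID_LINE = re.compile(r"^-?[ \t]*job_id:[ \t]*(\S*)[ \t]*$", re.MULTILINE)


def job_id_counts(text):
    """Count occurrences of each job_id value in a job_stats dump."""
    return Counter(JOB_ID_LINE.findall(text))


counts = job_id_counts(SAMPLE)
repeated = sorted(job for job, n in counts.items() if n > 1)
print(counts)
print("repeated:", repeated)
```

To try it on a real node, read the file contents into a string and pass it to job_id_counts instead of SAMPLE.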

@lszentannai

lszentannai commented Apr 30, 2019

It looks like it's not matching the last jobid.
Changing the regex to (?ms:job_id:.*?(-|\\z|$)) makes it match, but that doesn't help.

I get the same messages again, like:

Apr 30 13:01:46 oss-1 lustre_exporter[23544]: ter:<value:0 > was collected before with the same name and label values\n* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:"293395" > label:<name:"operation" value:"punch" > label:<name:"target" value:"scratch-OST0008" > counter:<value:0 > was collected before with the same name and label values\n* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:"293395" > label:<name:"operation" value:"destroy" > label:<name:"target" value:"scratch-OST0008" > counter:<value:0 > was collected before with the same name and label values\n* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:"293395" > label:<name:"operation" value:"create" > label:<name:"target" value:"scratch-OST0008" > counter:<value:0 > was collected before with the same name and label values\n* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:"293395" > label:<name:"operation" value:"get_info" > label:<name:"target" value:"scratch-OST0008" > counter:<value:0 > was collected before with the same name and label values\n* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:"293395" > label:<name:"operation" value:"set_info" > label:<name:"target" value:"scratch-OST0008" > counter:<value:0 > was collected before with the same name and label values\n* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:"293395" > label:<name:"operation" value:"quotactl" > label:<name:"target" value:"scratch-OST0008" > counter:<value:0 > was collected before with the same name and label values\n* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:"289305" > label:<name:"operation" value:"getattr" > label:<name:"target" 
value:"scratch-OST0008" > counter:<value:0 > was collected before with the same name and label values\n* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:"289305" > label:<name:"operation" value:"setattr" > label:<name:"target" value:"scratch-OST0008" > counter:<value:0 > was collected before with the same name and label values\n* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:"289305" > label:<name:"operation" value:"statfs" > label:<name:"target" value:"scratch-OST0008" > counter:<value:0 > was collected before with the same name and label values\n* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:"289305" > label:<name:"operation" value:"sync" > label:<name:"target" value:"scratch-OST0008" > counter:<value:0 > was collected before with the same name and label values\n* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:"289305" > label:<name:"operation" value:"punch" > label:<name:"target" value:"scratch-OST0008" > counter:<value:0 > was collected before with the same name and label values\n* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:"289305" > label:<name:"operation" value:"destroy" > label:<name:"target" value:"scratch-OST0008" > counter:<value:0 > was collected before with the same name and label values\n* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:"289305" > label:<name:"operation" value:"create" > label:<name:"target" value:"scratch-OST0008" > counter:<value:0 > was collected before with the same name and label values\n* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:"289305" > label:<name:"operation" value:"get_info" > label:<name:"target" value:"scratch-OST0008" 
> counter:<value:0 > was collected before with the same name and label values\n* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:"289305" > label:<name:"operation" value:"set_info" > label:<name:"target" value:"scratch-OST0008" > counter:<value:0 > was collected before with the same name and label values\n* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:"289305" > label:<name:"operation" value:"quotactl" > label:<name:"target" value:"scratch-OST0008" > counter:<value:0 > was collected before with the same name and label values\n* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:"290119" > label:<name:"operation" value:"getattr" > label:<name:"target" value:"scratch-OST0008" > counter:<value:0 > was collected before with the same name and label values\n* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:"290119" > label:<name:"operation" value:"setattr" > label:<name:"target" value:"scratch-OST0008" > counter:<value:0 > was collected before with the same name and label values\n* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:"290119" > label:<name:"operation" value:"statfs" > label:<name:"target" value:"scratch-OST0008" > counter:<value:0 > was collected before with the same name and label values\n* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:"290119" > label:<name:"operation" value:"sync" > label:<name:"target" value:"scratch-OST0008" > counter:<value:0 > was collected before with the same name and label values\n* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:"290119" > label:<name:"operation" value:"punch" > label:<name:"target" value:"scratch-OST0008" > counter:<value:0 > 
was collected before with the same name and label values\n* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:"290119" > label:<name:"operation" value:"destroy" > label:<name:"target" value:"scratch-OST0008" > counter:<value:0 > was collected before with the same name and label values\n* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:"290119" > label:<name:"operation" value:"create" > label:<name:"target" value:"scratch-OST0008" > counter:<value:0 > was collected before with the same name and label values\n* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:"290119" > label:<name:"operation" value:"get_info" > label:<name:"target" value:"scratch-OST0008" > counter:<value:0 > was collected before with the same name and label values\n* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:"290119" > label:<name:"operation" value:"set_info" > label:<name:"target" value:"scratch-OST0008" > counter:<value:0 > was collected before with the same name and label values\n* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:"290119" > label:<name:"operation" value:"quotactl" > label:<name:"target" value:"scratch-OST0008" > counter:<value:0 > was collected before with the same name and label values\n* collected metric lustre_job_stats_total label:<name:"component" value:"ost" > label:<name:"jobid" value:"290074" > label:<name:"operation" value:"getattr" > label:<name:"target" value:"scratch-OST0008" > counter:<value:0 > was collected before with the same name
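A message this long is hard to read by eye. The Python sketch below is one way to condense it into a per-target, per-jobid list of the operations reported as duplicates. The helper names (ENTRY, summarize, fake_entry) are hypothetical. The excerpt is synthetic text in the same shape as the message above, and the pattern assumes the label order shown there (jobid, operation, target).

```python
import re
from collections import defaultdict

# Matches the jobid/operation/target labels of one "collected metric" entry.
ENTRY = re.compile(
    r'label:<name:"jobid" value:"(?P<jobid>[^"]*)" > '
    r'label:<name:"operation" value:"(?P<operation>[^"]*)" > '
    r'label:<name:"target" value:"(?P<target>[^"]*)" >'
)


def summarize(log_text):
    """Map (target, jobid) to the sorted operations reported as duplicates."""
    summary = defaultdict(set)
    for m in ENTRY.finditer(log_text):
        summary[(m["target"], m["jobid"])].add(m["operation"])
    return {key: sorted(ops) for key, ops in summary.items()}


def fake_entry(jobid, operation, target="scratch-OST0008"):
    """Build one synthetic entry shaped like the syslog message."""
    return (
        '* collected metric lustre_job_stats_total '
        'label:<name:"component" value:"ost" > '
        f'label:<name:"jobid" value:"{jobid}" > '
        f'label:<name:"operation" value:"{operation}" > '
        f'label:<name:"target" value:"{target}" > '
        'counter:<value:0 > was collected before with the same name and label values'
    )


# Entries are joined by a literal backslash-n, as in the syslog line.
excerpt = "\\n".join([
    fake_entry("293395", "punch"),
    fake_entry("293395", "destroy"),
    fake_entry("289305", "getattr"),
])

summary = summarize(excerpt)
for (target, jobid), ops in summary.items():
    print(target, jobid, ops)
```

On a real host the input would be the syslog text itself, for example the lines from /var/log/messages that contain "collected metric".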

@gabrieleiannetti

gabrieleiannetti commented Dec 10, 2021

The situation has improved for jobstats that do not have any UID set...

...
lustre_job_read_samples_total{component="ost",jobid="loop4",target="hebe-OST0263"} 575
lustre_job_read_samples_total{component="ost",jobid="loop4..0",target="hebe-OST0263"} 1395
lustre_job_read_samples_total{component="ost",jobid="loop4.0",target="hebe-OST0263"} 2.989077e+06
lustre_job_read_samples_total{component="ost",jobid="loop4.00",target="hebe-OST0263"} 408
lustre_job_read_samples_total{component="ost",jobid="loop40",target="hebe-OST0263"} 546
lustre_job_read_samples_total{component="ost",jobid="loop40.",target="hebe-OST0263"} 136
lustre_job_read_samples_total{component="ost",jobid="loop40.0",target="hebe-OST0263"} 3.263271e+06
lustre_job_read_samples_total{component="ost",jobid="loop400",target="hebe-OST0263"} 157
lustre_job_read_samples_total{component="ost",jobid="loop4000",target="hebe-OST0263"} 15
...

GSI-HPC@6adb8e0
