Grafana dashboard #45

knweiss · 2017-05-11T10:21:01Z

Is anyone already working on a (public) Lustre dashboard for Grafana using
Prometheus as a data source?

My search in the Grafana dashboard community didn't turn up a match.

joehandzik · 2017-05-11T14:27:50Z

There aren't any public dashboards at the moment, no. We're still trying to decide what our strategy is with that particular distribution option (we have some dashboards internally, but they are earmarked for a separate project that we can't open source yet). If you happen to work on something though, we'd be happy to update our README to point to it!

mjtrangoni · 2017-05-12T18:51:38Z

Hi @joehandzik,

I found these images upstream, but without the dashboard source.

I have to check if those metrics are already exported/supported or not.

joehandzik · 2017-05-12T19:06:08Z

Thanks for taking a look, @mjtrangoni! At a glance, it looks like we support most or all of those metrics, so it should be doable to emulate using our Prometheus metrics. Feel free to file an issue if we're missing anything critical.

diegolmoreno · 2018-11-21T12:59:12Z

Hello, is this question still open? I have developed some Grafana dashboards that I could share with the project if needed.

jcftang · 2018-11-22T06:57:59Z

Maybe post a link to it from the grafana registry?

mtds · 2018-11-23T15:57:28Z

I have made something (screenshot included) but I am not sure it's the right way to visualize Lustre metrics, so I never published it on the Grafana web site:

Any thoughts?

jcftang · 2018-11-25T14:52:38Z

LGTM. I have a similar dashboard at my site with very similar graphs.

diegolmoreno · 2018-11-26T08:26:10Z

In my case I have 4 different panels, but they also combine other non-Lustre exporters for server load or disk utilization:

Lustre overview with:

Number of active jobs
Number of exports connected to MDS
MD operations per filesystem, MDS & CPU
Data bandwidth
Data IOPS
Usage and capacity
Top metadata and data jobstats

Lustre advanced adding to the latter:

Disk IO busy percentage
LNET rate statistics
Disk IO size ratio

General Lustre jobstats: Just a subset of the first panel
Lustre jobstats by jobid: Details per server or Lustre target for specific jobIDs.

I'll probably need to triage the non-Lustre exporter stats from the panels (disk utilization, CPU) or we just assume that someone running the Lustre exporter might also be running this exporter as well.

mtds · 2018-11-26T14:40:34Z

@diegolmoreno just a couple of questions:

Which version of Lustre are you using? We are quite far behind current development, still on 2.5.3
(though we plan to switch to 2.10 soon).
I have seen that you are using LSF for job scheduling. Are those read/write metrics extracted
from the Lustre exporter itself, or did you do an additional integration with other exporters?

The Lustre dashboard I have developed is indeed not particularly useful to my colleagues,
and the reason is simple: it's complicated to link the job activities (we're using Slurm)
with Lustre itself, and in particular with the I/O load on specific OSSs.

There are a ton of metrics exported, but aggregation proved to be quite difficult, and
getting insights about the state of the entire file system is devilishly complicated.

If anybody is willing to add comments/insights/prometheus queries on this thread I would
be grateful :-)

diegolmoreno · 2018-11-26T15:04:24Z

@mtds

This is with Lustre 2.7 and I have another 2.10 filesystem where I'm planning to deploy this setup by the end of this week. If I see any issues I will report them here.
No extras to get LSF integrated with this Lustre exporter and Grafana. Just enabling LSF jobstats as per the Lustre Operations Manual description. I used SLURM in the past and I did the same as for LSF.

Jobstats and Grafana have been, without any doubt, the most painful part of this setup. With sometimes up to 4k jobs running on the cluster, the amount of stats generated on the jobstats side is massive. For that reason you cannot be very greedy when trying to integrate jobstats in Grafana: you need to find a good compromise between resolution and information, or your query will time out. We created a specific jobstats panel, and from there we can go to another panel with all the details about one job in particular.
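A minimal sketch of one such compromise: let Prometheus rank the jobs and return only the busiest few, instead of plotting one series per job. The metric name `lustre_job_read_bytes_total` and the `jobid` label are assumptions about what the exporter exposes at a given site, so check them against your own `/metrics` output before using the resulting expression in a Grafana panel.

```python
def top_jobs_query(metric, k=10, window="5m", job_label="jobid"):
    """Build a PromQL expression that returns the k busiest jobs by rate."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Sum over all servers and targets first so each job yields one series,
    # then keep only the k largest series.
    return f"topk({k}, sum by ({job_label}) (rate({metric}[{window}])))"


# Metric and label names below are assumptions, not confirmed exporter names.
print(top_jobs_query("lustre_job_read_bytes_total", k=5))
```

A longer `window` and a smaller `k` both reduce the load on the panel, at the cost of resolution and detail.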

joehandzik · 2018-11-26T15:08:14Z

Hey all,

Sorry for the radio silence across a lot of this repo, we've been heads down on some different work for a bit.

In general, given the variety of dashboards that folks are discussing here, I think we'd be more than happy to collect links to wherever others are uploading their dashboards (Grafana's website or even your own repos, similar to what @jcftang suggested). That way the dashboards can evolve separately from our codebase here, and it provides flexibility to link to dashboards that go beyond the Lustre exporter (like @diegolmoreno's).

Based on the response that @diegolmoreno just sent, some of the same issues we've run into have been experienced by others. You really need to balance needs vs wants, as was mentioned. We have waffled back and forth on whether it was a design "mistake" to enable the exporter to generate that massive amount of data, but it seems folks beyond just us are finding a way to use it. Prometheus itself seems to do a decent job of keeping up with the metrics that Lustre can spit out; IMO the problem is that Grafana chokes on large amounts of data. It's worth ongoing investigation for sure.

Appreciate the ongoing discussion here.

mtds · 2018-11-26T16:22:34Z

As @joehandzik said, Grafana is not meant to present a dashboard with dozens of panels
and thousands of metrics. I also believe that is not good from a monitoring point of view: no human
can keep up with that much information, and the idea should be to easily spot
a trend or a potential issue.

I have also enabled Lustre job stats on our (Slurm) side but I found it difficult to integrate them
correctly into a dashboard. Mostly, the problem is that we are talking about 'transient' data:
the stat counters about "Job activity" are zeroed over an interval of one hour (this interval
is also configurable, of course).

On the Prometheus side, if you want to push these metrics you may fall back on 'batch jobs' with a
Pushgateway, but the problem still remains: I cannot add a single time series per jobID, as this
would kill any kind of Prometheus instance sooner or later.

I have used a method which is also proposed on the 'Lustre Statistics' wiki page and it
is described in these slides while the script itself is available here.

The output could be useful on the fly:

>>> lctl get_param *.*.job_stats | perl /usr/local/sbin/show_high_jobstats.pl -s 1000000000
[...]
Job 24141632 has done read operations with sum of 2,547,593,216 bytes.
Job 24141632 has done write operations with sum of 59,826,833,885 bytes.
Job 24381103 has done read operations with sum of 1,786,433,536 bytes.
Job 24381103 has done write operations with sum of 1,149,587,115 bytes.
Job 24381143 has done read operations with sum of 1,780,846,592 bytes.
Job 24381143 has done write operations with sum of 1,149,688,421 bytes.

Also, the type of operations executed by a certain job (above a certain threshold) can
be looked up:

>>> lctl get_param *.*.job_stats | perl show_high_jobstats.pl -o 100000

Job 24141632 has done 237650 write operations.
List of found job IDs: 24141632

but I still did not find an easy way to integrate this kind of information into a dashboard in a meaningful way.
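One possible direction, as a rough sketch only: parse the script output shown above, keep just the few heaviest jobs, and write them in the Prometheus text exposition format (for example for node exporter's textfile collector). The metric name `lustre_top_job_bytes` and the helper names are hypothetical. Capping the number of jobs bounds how many series exist per scrape, but series still churn as job IDs come and go.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   Job 24141632 has done read operations with sum of 2,547,593,216 bytes.
LINE_RE = re.compile(
    r"^Job (?P<jobid>\S+) has done (?P<op>read|write) operations "
    r"with sum of (?P<bytes>[\d,]+) bytes\.$"
)


def parse_jobstats(text):
    """Return {jobid: {"read": bytes, "write": bytes}} from the script output."""
    totals = defaultdict(dict)
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            totals[m["jobid"]][m["op"]] = int(m["bytes"].replace(",", ""))
    return dict(totals)


def top_jobs(totals, n=5):
    """Keep the n jobs with the largest read + write byte count."""
    ranked = sorted(totals.items(), key=lambda kv: sum(kv[1].values()), reverse=True)
    return ranked[:n]


def to_exposition(ranked, metric="lustre_top_job_bytes"):
    """Render the ranked jobs as Prometheus text exposition lines."""
    lines = [f"# TYPE {metric} gauge"]
    for jobid, ops in ranked:
        for op, value in sorted(ops.items()):
            lines.append(f'{metric}{{jobid="{jobid}",op="{op}"}} {value}')
    return "\n".join(lines) + "\n"


SAMPLE = """\
Job 24141632 has done read operations with sum of 2,547,593,216 bytes.
Job 24141632 has done write operations with sum of 59,826,833,885 bytes.
Job 24381103 has done read operations with sum of 1,786,433,536 bytes.
Job 24381103 has done write operations with sum of 1,149,587,115 bytes.
"""

print(to_exposition(top_jobs(parse_jobstats(SAMPLE), n=1)), end="")
```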

Another issue is aggregation: how do you aggregate info across all the OSSs for certain metrics?
Prometheus queries can trip you up easily, and I mostly found examples about CPU load,
which is fine but a very narrow case.
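For the aggregation question, a hedged sketch of the usual PromQL pattern: take `rate()` of a counter, then `sum by (...)` over the labels you want to keep. The small helper below builds such an expression and reads the result through the Prometheus HTTP API (`/api/v1/query`). The metric name `lustre_read_bytes_total` and the server address are assumptions; `instance` is the standard Prometheus label for the scraped host, so grouping by it gives one value per server.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def aggregate_query(metric, by=("instance",), window="5m"):
    """Build 'sum by (<labels>) (rate(<metric>[<window>]))'."""
    labels = ", ".join(by)
    return f"sum by ({labels}) (rate({metric}[{window}]))"


def query_url(base_url, promql):
    """URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(payload):
    """Turn an instant-vector response into {sorted label tuple: float}."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    result = {}
    for sample in payload["data"]["result"]:
        key = tuple(sorted(sample["metric"].items()))
        # Each sample value is a [timestamp, "value-as-string"] pair.
        result[key] = float(sample["value"][1])
    return result


def fetch(base_url, promql, timeout=10):
    """Run the query against a live Prometheus server (address is site-specific)."""
    with urlopen(query_url(base_url, promql), timeout=timeout) as resp:
        return parse_vector(json.load(resp))


# Assumed metric name; one series per server after aggregation.
print(aggregate_query("lustre_read_bytes_total"))
```

Dropping the `by` labels down to a single filesystem-level label (if the exporter provides one) would give a whole-filesystem total instead of a per-server one.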

diegolmoreno · 2019-01-15T13:34:14Z

Hi all,

First of all, sorry for the radio silence over the last couple of months. I had other things to do, and some dashboards needed refactoring to be shareable. I've finally uploaded our 4 dashboards to the Grafana repository. They use the Prometheus data exported by the Lustre exporter here but also by node exporter (CPU, load, memory, etc.). Just be aware that there is a lot of information: some people will find all they need in just 1 or 2 dashboards, others might end up removing some panels; it's up to you. The dashboards:

https://grafana.com/dashboards/9658: Lustre overview. Information usable for 90% of the admins. It also has some jobstats information which, as we discussed, is tricky. The time shift between measures is quite high (up to 1 hour) but we sometimes have more than 9k jobs running at the same time...
https://grafana.com/dashboards/9670: Just a subset of the Lustre jobstats in dashboard 9658. Made in case it's preferable to work with jobstats on one specific dashboard and to isolate specific jobs that need further analysis. It has a link that, with the proper job ID filled in the JOBID box, goes to the next dashboard with the details on one specific job.
https://grafana.com/dashboards/9671: Dashboard with details on one specific jobid (either coming from the previous dashboard or just as part of the browser address with the proper variable set to the jobid under review)
https://grafana.com/dashboards/9666: Lots of information for advanced debugging on what's going on when there are issues on the servers or the filesystem. To some extent it's an extension of the very first dashboard.

We hope this helps the Lustre community as much as this exporter helped us. Comments and bug reports are welcome.

sjpb · 2021-02-02T11:42:02Z

@diegolmoreno thanks for the dashboards. I had a flick through the dashboard definitions and I couldn't see anything which used the metrics exported when using the client flag? Do the dashboards only show metrics from the ost/mdt/mgs/mds side of things?

diegolmoreno · 2021-02-02T13:27:20Z

Unfortunately there's no client stats dashboard, since I only run the exporter on the servers and not on the thousands of clients I have, though this could be a nice option in the future.

