Grafana dashboard #45

Open
knweiss opened this issue May 11, 2017 · 15 comments

Comments

@knweiss

knweiss commented May 11, 2017

Is anyone already working on a (public) Lustre dashboard for Grafana using
Prometheus as a data source?

My search in the Grafana dashboard community didn't turn up a match.

@joehandzik
Contributor

There aren't any public dashboards at the moment, no. We're still trying to decide what our strategy is with that particular distribution option (we have some dashboards internally, but they are earmarked for a separate project that we can't open source yet). If you happen to work on something though, we'd be happy to update our README to point to it!

@mjtrangoni
Contributor

Hi @joehandzik,

I found these images upstream, but without a dashboard source.

I have to check if those metrics are already exported/supported or not.

@joehandzik
Contributor

Thanks for taking a look, @mjtrangoni! At a glance, it looks like we support most or all of those metrics, so it should be doable to emulate using our Prometheus metrics. Feel free to file an issue if we're missing anything critical.

@diegolmoreno

Hello, is this question still open? I have developed some Grafana dashboards that I could share with the project if needed.

@jcftang

jcftang commented Nov 22, 2018

Maybe post a link to it from the grafana registry?

@mtds

mtds commented Nov 23, 2018

I have made something (screenshot included) but I am not sure it's the right way to visualize Lustre metrics, so I never published it on the Grafana web site:

[Screenshot: lustre_grafana]

Any thoughts?

@jcftang

jcftang commented Nov 25, 2018

LGTM, I have a similar dashboard at my site with very very similar graphs

@diegolmoreno

In my case I have 4 different panels, but they also combine data from other, non-Lustre exporters for server load or disk utilization:

  • Lustre overview with:
      • Number of active jobs
      • Number of exports connected to MDS
      • MD operations per filesystem, MDS & CPU
      • Data bandwidth
      • Data IOPS
      • Usage and capacity
      • Top metadata and data jobstats
  • Lustre advanced, which adds to the overview:
      • Disk IO busy percentage
      • LNET rate statistics
      • Disk IO size ratio
  • General Lustre jobstats: just a subset of the first panel.
  • Lustre jobstats by jobid: details per server or Lustre target for specific job IDs.

I'll probably need to strip the non-Lustre exporter stats (disk utilization, CPU) out of the panels, or we just assume that anyone running the Lustre exporter is also running those other exporters.

[Three screenshots, taken 2018-11-21 at 13:41:36, 13:42:23, and 13:43:26]

@mtds

mtds commented Nov 26, 2018

@diegolmoreno just a couple of questions:

  1. Which version of Lustre are you using? We are quite far behind current development, still on 2.5.3
    (though we plan to switch to 2.10 soon).
  2. I have seen you are using LSF for job scheduling. Are those read/write metrics extracted
    from the Lustre exporter itself, or did you do an additional integration with other exporters?

The Lustre dashboard I have developed is indeed not particularly useful to my colleagues,
and the reason is simple: it's complicated to link the job activities (we're using Slurm)
with Lustre itself, and in particular with the I/O load on specific OSSs.

There are a ton of metrics exported, but aggregation proved to be quite difficult, and
getting insights about the state of the entire file system is devilishly complicated.

If anybody is willing to add comments/insights/prometheus queries on this thread I would
be grateful :-)

@diegolmoreno

@mtds

  1. This is with Lustre 2.7 and I have another 2.10 filesystem where I'm planning to deploy this setup by the end of this week. If I see any issues I will report them here.

  2. No extras to get LSF integrated with this Lustre exporter and Grafana. Just enabling LSF jobstats as per the Lustre Operations Manual description. I used SLURM in the past and I did the same as for LSF.

Jobstats and Grafana have been, without any doubt, the most painful part of this setup. The reason is that, with sometimes up to 4k jobs running on the cluster, the amount of stats generated on the jobstats side is massive. For that reason you cannot be very greedy when trying to integrate jobstats in Grafana: you need to find a good compromise between resolution and information, or your query will time out. We created a specific jobstats panel, and from there we can go to another panel with all the details about one particular job.
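
In practice that compromise often means plotting only the heaviest few jobs rather than every job ID. A minimal sketch of that selection, assuming the per-job byte totals have already been fetched into a dictionary (the job IDs and numbers below are made up for illustration):

```python
import heapq

def top_jobs(bytes_by_job, k=10):
    """Return the k jobs with the largest byte totals, largest first."""
    return heapq.nlargest(k, bytes_by_job.items(), key=lambda item: item[1])

# Hypothetical per-job totals; real values would come from the exporter.
sample = {"1001": 5_000, "1002": 750_000, "1003": 42, "1004": 90_000}
for job_id, total in top_jobs(sample, k=2):
    print(job_id, total)
```

On the query side, the same idea is usually expressed by wrapping the expression in PromQL's `topk(...)`, so that Grafana only receives a handful of series.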

@joehandzik
Contributor

Hey all,

Sorry for the radio silence across a lot of this repo, we've been heads down on some different work for a bit.

In general, given the variety of dashboards that folks are discussing here, I think we'd be more than happy to collect links to wherever others are uploading their dashboards (Grafana's website or even your own repos, similar to what @jcftang suggested). That way the dashboards can evolve separately from our codebase here, and it provides flexibility to link to dashboards that go beyond the Lustre exporter (like @diegolmoreno's).

Based on the response that @diegolmoreno just sent, some of the same issues we've run into have been experienced by others. You really need to balance needs vs wants, as was mentioned. We have waffled back and forth on whether it was a design "mistake" to enable the exporter to generate that massive amount of data, but it seems folks beyond just us are finding a way to use it. It seems like Prometheus itself does a decent job of keeping up with the metrics that Lustre can spit out; IMO the problem is that Grafana chokes on large amounts of data. It's worth ongoing investigation for sure.

Appreciate the ongoing discussion here.

@mtds

mtds commented Nov 26, 2018

As @joehandzik said, Grafana is not meant to present a dashboard with dozens of panels
and thousands of metrics. I also believe that would not be good from a monitoring point of view: no human
could keep up with such a huge amount of information, and the idea should be to easily spot
a trend or a potential issue.

I have also enabled Lustre job stats on our (Slurm) side but I found it difficult to integrate them
correctly into a dashboard. Mostly, the problem is that we are talking about 'transient' data:
the stat counters about "Job activity" are zeroed over an interval of one hour (this interval
is also configurable, of course).

On the Prometheus side, if you want to push these metrics you may fall back on 'batch jobs' via a
Pushgateway, but the problem remains: I cannot add a single time series per jobID, as this
would kill any kind of Prometheus instance sooner or later.
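
To put rough numbers on that cardinality concern: the series count grows as the product of job IDs, Lustre targets, and per-job counters. A back-of-the-envelope sketch, where every figure is an assumption chosen for illustration rather than a measurement from any site:

```python
def estimated_series(jobs, targets, counters_per_job):
    """Upper bound on time series if every job touches every target."""
    return jobs * targets * counters_per_job

# Assumed figures: 4,000 concurrent jobs, 50 OSTs/MDTs, 10 counters each.
print(estimated_series(jobs=4_000, targets=50, counters_per_job=10))
```

With these assumed inputs the bound is two million series at a single point in time, and because job IDs keep changing, new series keep being created as old ones go stale.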

I have used a method which is also proposed on the 'Lustre Statistics' wiki page and it
is described in these slides while the script itself is available here.

The output could be useful on the fly:

>>> lctl get_param *.*.job_stats | perl /usr/local/sbin/show_high_jobstats.pl -s 1000000000
[...]
Job 24141632 has done read operations with sum of 2,547,593,216 bytes.
Job 24141632 has done write operations with sum of 59,826,833,885 bytes.
Job 24381103 has done read operations with sum of 1,786,433,536 bytes.
Job 24381103 has done write operations with sum of 1,149,587,115 bytes.
Job 24381143 has done read operations with sum of 1,780,846,592 bytes.
Job 24381143 has done write operations with sum of 1,149,688,421 bytes.

Also, the type of operations executed by a certain job (above a certain threshold) can
be looked up:

>>> lctl get_param *.*.job_stats | perl show_high_jobstats.pl -o 100000

Job 24141632 has done 237650 write operations.
List of found job IDs: 24141632

but I still did not find an easy way to integrate this kind of information into a dashboard in a meaningful way.
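
As a starting point for feeding such numbers into something other than a terminal, the per-job summing and thresholding can be approximated in a few lines of Python. This is only a sketch, not a port of the Perl script: it assumes the YAML-like `job_stats` layout with `job_id:` entries followed by `read`/`write` (or `read_bytes`/`write_bytes`) lines carrying a `sum:` field, and the sample text below is invented for illustration.

```python
import re
from collections import defaultdict

JOB_RE = re.compile(r"^\s*-?\s*job_id:\s*(\S+)")
# Matches e.g. "read_bytes: { samples: 3, unit: bytes, ..., sum: 4096 }"
IO_RE = re.compile(r"^\s*(read|write)(?:_bytes)?:\s*\{.*\bsum:\s*(\d+)")

def sum_job_bytes(text):
    """Sum read/write bytes per job ID across all targets in the text."""
    totals = defaultdict(lambda: {"read": 0, "write": 0})
    job = None
    for line in text.splitlines():
        m = JOB_RE.match(line)
        if m:
            job = m.group(1)
            continue
        m = IO_RE.match(line)
        if m and job is not None:
            totals[job][m.group(1)] += int(m.group(2))
    return dict(totals)

def heavy_jobs(totals, threshold):
    """Keep jobs whose read or write total exceeds the threshold."""
    return {job: t for job, t in totals.items()
            if t["read"] > threshold or t["write"] > threshold}

# Invented sample in the assumed layout (one target, two jobs).
sample = """\
obdfilter.fs-OST0000.job_stats=
job_stats:
- job_id:          24141632
  snapshot_time:   1543230000
  read_bytes:      { samples: 2, unit: bytes, min: 4096, max: 4096, sum: 8192 }
  write_bytes:     { samples: 1, unit: bytes, min: 1048576, max: 1048576, sum: 1048576 }
- job_id:          24381103
  snapshot_time:   1543230000
  read_bytes:      { samples: 1, unit: bytes, min: 512, max: 512, sum: 512 }
  write_bytes:     { samples: 0, unit: bytes, min: 0, max: 0, sum: 0 }
"""
print(heavy_jobs(sum_job_bytes(sample), threshold=1_000_000))
```

The resulting dictionary could then be written out in whatever form a dashboard data source accepts, which keeps the per-job detail out of Prometheus itself.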

Another issue is aggregation: how to aggregate info about all the OSSs for certain metrics?
Prometheus queries can trip you up easily, and I mostly found examples about CPU load,
which is fine but a very narrow case.
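
One common pattern for this kind of roll-up is to take a per-target rate first and then sum by a server-level label, which in PromQL is typically written as `sum by (instance) (rate(<counter>[5m]))`. The Python sketch below mirrors that second step on already-computed rates; the label names and values are assumptions for illustration, not the exporter's confirmed schema.

```python
from collections import defaultdict

def sum_by(samples, label):
    """Sum sample values grouped by one label, like PromQL's `sum by`."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels[label]] += value
    return dict(totals)

# Hypothetical per-OST write rates in bytes/s, two OSTs on the first OSS.
rates = [
    ({"instance": "oss1", "target": "fs-OST0000"}, 100.0),
    ({"instance": "oss1", "target": "fs-OST0001"}, 50.0),
    ({"instance": "oss2", "target": "fs-OST0002"}, 25.0),
]
print(sum_by(rates, "instance"))
```

Taking the rate before the sum matters for counters: summing raw counters first hides the resets of individual targets.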

@diegolmoreno

Hi all,

First of all, sorry for the radio silence in the last couple of months. I had other stuff to do and some dashboards needed refactoring to be shareable. I've finally uploaded to the Grafana repository our 4 dashboards, which use the Prometheus data exported by the Lustre exporter here but also by node exporter (CPU, load, memory, etc.). Just be aware that there is a lot of information: some people will find all they need in just 1 or 2 dashboards, others might end up removing some panels; it's up to you. The dashboards:

  • https://grafana.com/dashboards/9658: Lustre overview. Information usable for 90% of the admins. It also has some jobstats information which, as we discussed, is tricky. The time shift between measurements is quite high (up to 1 hour) but we sometimes have more than 9k jobs running at the same time...

  • https://grafana.com/dashboards/9670: It's just a subset of the Lustre jobstats in dashboard 9658. Made in case it's preferable to only work with jobstats on one specific dashboard and to isolate specific jobs that need further analysis. It has a link that, with the proper job ID filled in the JOBID box, goes to the next dashboard with the details on one specific job.

  • https://grafana.com/dashboards/9671: Dashboard with details on one specific jobid (either coming from the previous dashboard or just as part of the browser address with the proper variable set to the jobid under review)

  • https://grafana.com/dashboards/9666: Lots of information for advanced debugging on what's going on when there are issues on the servers or the filesystem. To some extent it's an extension of the very first dashboard.
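
For the per-job dashboard, setting the variable in the browser address relies on Grafana's `var-<name>=<value>` URL parameters. A small sketch of building such a link; the host, dashboard path, and variable name here are hypothetical and should be replaced with the values from the imported dashboard:

```python
from urllib.parse import urlencode

def job_dashboard_url(base_url, dashboard_path, job_id, variable="jobid"):
    """Build a dashboard URL that presets one template variable."""
    query = urlencode({f"var-{variable}": job_id})
    return f"{base_url.rstrip('/')}/{dashboard_path.strip('/')}?{query}"

# Host, dashboard path, and variable name are all hypothetical.
print(job_dashboard_url("https://grafana.example.org",
                        "d/abc123/lustre-jobstats-by-jobid", "24141632"))
```

Using `urlencode` keeps the link valid even when a job ID contains characters that need escaping, such as the dots some jobid formats use.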

We hope this helps the Lustre community as much as this exporter helped us. Comments and bug reports are welcome.

@sjpb

sjpb commented Feb 2, 2021

@diegolmoreno thanks for the dashboards. I had a flick through the dashboard definitions and I couldn't see anything which used the metrics exported when using the client flag? Do the dashboards only show metrics from the ost/mdt/mgs/mds side of things?

@diegolmoreno

diegolmoreno commented Feb 2, 2021 via email


7 participants