Capture tool error stats in Grafana #178
We did this at one point. It was not very advanced, but I did not find the data to be very actionable, unfortunately. Maybe for you all it would be nice to say "oh, this destination is misbehaving", but since we only had 1-2 destinations it was not so interesting. Edit: I am not trying to dissuade you, just wanted to let you know our experience. I hope you find ways to make this actionable.
I'm thinking more about usage metrics, but if you pool enough usage, server/cluster issues could certainly be detected (Main's jobs go to many queues...). Even something very simple would be useful, but I'm a bit naive about what details could actually be parsed out. Something like this, rather than waiting for bug reports or questions?
Then see if we can find patterns in how failure rates differ over time and get an early warning for tools with away-from-historical-mean failure/success spikes, triggering an email once we know what the "normal" variances are. This goes a few steps beyond tool tests: these are actual user usage metrics, big picture. They could indicate server problems, but also tools that may need more love: tool form help, defensive warnings for bad/incomplete input, or a tutorial/FAQ. I can think of lots of ways to use that kind of data to make decisions, and I'm totally open to other ideas. The goal is just to get some basic usage metrics. If a tool is failing too much, we should find out why and try to remedy that, or at least be aware of it so we can prioritize what to work on (including "soft" changes: form help/tips, FAQs, tutorials, form element placement). Thoughts?
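The "away-from-historical-mean spike" idea above could be sketched roughly as follows. This is a minimal illustration, not anything that exists in Galaxy or gxadmin; the function name, data shapes, and z-score threshold are all hypothetical choices for the example.

```python
from statistics import mean, stdev

def flag_anomalous_tools(history, current, z_threshold=3.0):
    """Flag tools whose current failure rate deviates from the
    historical mean by more than z_threshold standard deviations.

    history: dict mapping tool_id -> list of past daily failure rates
    current: dict mapping tool_id -> today's failure rate
    """
    flagged = {}
    for tool_id, rates in history.items():
        if len(rates) < 2 or tool_id not in current:
            continue  # need at least two points to compute a stdev
        mu, sigma = mean(rates), stdev(rates)
        if sigma == 0:
            continue  # no historical variance to compare against
        z = (current[tool_id] - mu) / sigma
        if abs(z) > z_threshold:
            flagged[tool_id] = z
    return flagged
```

In practice the "trigger an email" part would hang off the returned dict; Grafana's own alerting could also do this once the failure-rate series is in there, which would avoid hand-rolling the statistics at all.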
I think those are all really good questions and actionable at that!
These are answerable questions, though; I didn't have this clear a picture in my head for the original issue. Additionally, I did not find the time to ask all of these good questions when we originally did it, and we did not have the volume of data needed to make it useful.
It's not super common, but when it does happen, it usually manifests as a flurry of bug reports, and then firefighting proceeds (sometimes immediately, sometimes not). There has to be a better way of proactively tracking failures than relying on bug reports. Plus, bug report counts don't give the actual failure numbers -- only how many people reported the problem, and even that isn't really captured well anywhere (that could be another set of data points, parsed from submitted bug reports). Talking with @natefoo, it seems the first step is to get the data into Grafana; then we can test out ways to graph/interpret it. So not knowing exactly how we'll use it at first is OK.
Yep, exactly what I'd suggest as well; it's easy enough to start collecting the data and figure it out later :) There are probably two options, @natefoo: it looks like I already added an 'influx' type backend for error reporting, which is what we used when we last tried this, though I don't think that reports which queue a job went to. Or you could write some SQL queries, helpfully add them to gxadmin, and then run them with telegraf ;)
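To make the second option concrete, the query added to gxadmin would be something in the spirit of the SQL below. This is only an illustration against a toy in-memory table: Galaxy's real `job` table is PostgreSQL and has a richer schema (timestamps, destinations, users), so the columns here are a simplified assumption.

```python
import sqlite3

# Toy stand-in for the Galaxy `job` table (real schema is richer;
# this is an assumption for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job (id INTEGER, tool_id TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO job VALUES (?, ?, ?)",
    [(1, "bwa", "ok"), (2, "bwa", "error"), (3, "hisat2", "ok"),
     (4, "hisat2", "ok"), (5, "bwa", "error")],
)

# Failures vs. total per tool -- the kind of number telegraf could
# ship to Grafana on a schedule.
rows = conn.execute("""
    SELECT tool_id,
           SUM(CASE WHEN state = 'error' THEN 1 ELSE 0 END) AS failures,
           COUNT(*) AS total
    FROM job
    GROUP BY tool_id
    ORDER BY tool_id
""").fetchall()
for tool_id, failures, total in rows:
    print(tool_id, failures, total)
```

A real version would also filter by a time window (e.g. jobs created in the last day) so each telegraf run emits a fresh data point rather than an all-time total.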
This was exactly the reason I tried measuring it once, without any specific agenda; I was just worried that I was missing a lot of failures because people don't report them as much as they should.
Grafana has some status now (tools failing 100% of the time, one-week "blocks" of time, only tracking usegalaxy.org). The gxadmin utility has more options and could be tuned further. https://stats.galaxyproject.org/d/Q3_EmS_Wk/main-stats?orgId=1 We can work from that to get the rest. Maybe create a toy database that isn't private, run some public data/workflows, hope to produce failures, and customize on a smaller dataset.
@natefoo Last time we talked the first step was to load the error data into Grafana so we could all test out different ways of graphing, setting alerts, etc. Any updates about when that could happen? Or is the data in there but I missed it?
Capturing this in a ticket so we don't lose track of status