-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add web UI to display the state of Thanos bucket #1248
Conversation
🎉 Awesome! I will review tomorrow (: |
Just a heads up, the program crashed a few times for me and I'm fixing the error handling right now to make it a bit more robust. Will update the PR later today. |
Heads up! Google is being sort of evil and won't allow the JS to be vendored locally. I didn't see that coming :( How serious of an issue is it for Thanos? I personally would hate it to be not able to run this without internet or relying on big G. I might have to rewrite the UI now, but thankfully its not much code. /shrug https://developers.google.com/chart/interactive/faq?csw=1#offline |
Wow, that is quite unexpected! Tracking everywhere ;p I think this is totally fine for first iteration. You need internet usually to access object storage as well, right? Let's stick to this and add GH issue improving this later. What do you think? @jaseemabid ? |
FWIW I'm in the same boat, self-hosting Thanos is important together with not requiring internet access. |
Works for now @bwplotka, but I'd probably rewrite it soon. Could you review the backend code for now? We could get some feedback from users (give it a try in one of your envs?) and use all that to iterate on a v2. I know that we might need some kind of filtering on label, maybe zoom into specific timeranges and better toooltips for example. |
I've had this version run on prod for a few days now, I'll report tomorrow if it had any errors or crashes and the resource usage. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! It looks great, but.. it does not work for me. ): Plus some minor suggestions.
Am i doing something wrong? Did not look into details yet. Nothing in dev browser console. I got nice syncing meta
message and then Cannot read property 'length' of null
):
Logs:
level=info ts=2019-06-18T14:34:17.31119598Z caller=main.go:149 msg="Tracing will be disabled"
level=info ts=2019-06-18T14:34:17.311764023Z caller=factory.go:39 msg="loading bucket configuration"
level=info ts=2019-06-18T14:34:17.311845218Z caller=bucket.go:334 msg="Listening for query and metrics" address=0.0.0.0:8080
level=info ts=2019-06-18T14:34:17.312509383Z caller=bucket.go:393 msg="synchronizing block metadata"
level=info ts=2019-06-18T14:35:21.963786639Z caller=bucket.go:415 msg="downloaded block metadata"
Also, failing unit test indicates you need to invoke make docs
(: https://circleci.com/gh/improbable-eng/thanos/3882?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link.
I ran it as:
export GOOGLE_APPLICATION_CREDENTIALS="my-service-account"
go run $(go list ./cmd/thanos/...) bucket web --objstore.config="type: GCS
config:
bucket: thanos-stable"
Did not reviewed the HTML, JS code yet, but looks fine from first glance.
cmd/thanos/bucket.go
Outdated
"github.com/go-kit/kit/log" | ||
opentracing "github.com/opentracing/opentracing-go" | ||
"github.com/prometheus/common/route" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our Go linter is crippled so you need to remove newlines and do make format
(or goimports
) again.. ):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can manually fix this now, but make format
isn't fixing this for me and with the latest version of goimports, this is how it formats bu default. Do you have a local config override?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, yes it's manual work for now until #984 is fixed.
cmd/thanos/bucket.go
Outdated
defer iterCancel() | ||
|
||
// If there are no errors and the data isn't stale, wait a while before updating again. | ||
if bucketUI.Err == nil && time.Since(bucketUI.RefreshedAt) < (5*time.Minute) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it mean that delay cannot be lower than 5 minutes? Worth to add to flag description and maybe add some validation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to keep this loop really simple and not make it download a lot of data from s3. Since compact activity is a fairly slow operation, I guess data that's stale by a few minutes is not a big issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed this check and added a simple warning if the user updates at a frequency less than 5m.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This minimum value makes sense, but yes we can check it early on instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cmd/thanos/bucket.go
Outdated
delay := time.Duration(interval) * time.Minute | ||
|
||
return runutil.Repeat(delay, ctx.Done(), func() error { | ||
iterCtx, iterCancel := context.WithTimeout(ctx, 5*time.Minute) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we might need to configure timeout at some point as well? Also it's tricky as timeout will mean that we will wait until ... next interval which can be in X minutes? We could simplify flow here a bit I think. I would retry until success OR timeout with hard failure on server. Continuing and waiting for next interval tick is like retrying with backoff to me (:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not addressed (:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bwplotka I'vent seen that UI error before and I wonder if its because of some difference in the way data is returned from GCS vs s3? |
There is none as we have everything in our How can I give you more data for this error? |
It is some JS error, so wonder if it's not some JS versions mismatch? Or browser? I use chrome on linux. |
This is roughly how the generated JS block looks like for me;
There is no length function in any of the new code we added, so it might be somewhere in the library and its hard to debug without a line number. Could you privately share the html file with me? I'd debug and throw it away. |
Nice, will give it a go |
For small environments with a unique label to identify each Prometheus replica, set the title with a `--label`. If multiple labels are required to duplicate shards, use the default to get an external legend. See screenshots in PR thanos-io#1248 for details.
Quick update on this! @bwplotka mentioned that using a single label to annotate the timelines won't work for everyone so now we have an option to customize it! ⚙️ From 0057ae6,
If you have a unique label (like I do), pass If you need multiple labels, you can skip the Admittedly this is only a first draft and in @bwplotka's words, a shows some very ametuer web design skills! But, almost all of our thanos instability issues were due to huge compact backlog and this is the graph that helped us track missing blocks and make it work reliably again. It's not pretty, but this UI might help you debug some nasty issues. I'd love to get this out to a bunch of users and get some feedback before reiterating again. Anyone with interest and skills in modern frontend programming could rewrite this in one afternoon and I'd be very happy when that happens. Till then hopefully this will be of some use to a bunch of people 🤞 |
For small environments with a unique label to identify each Prometheus replica, set the title with a `--label`. If multiple labels are required to duplicate shards, use the default to get an external legend. See screenshots in PR thanos-io#1248 for details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉 it works! Thanks!
LGTM, just some minor suggestions, before we merge. Let me know what you think (:
Also can we add CHANGELOG entry?
cmd/thanos/bucket.go
Outdated
delay := time.Duration(interval) * time.Minute | ||
|
||
return runutil.Repeat(delay, ctx.Done(), func() error { | ||
iterCtx, iterCancel := context.WithTimeout(ctx, 5*time.Minute) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not addressed (:
pkg/ui/templates/bucket.html
Outdated
|
||
<script src="{{ pathPrefix }}/static/vendor/mustache/mustache.min.js?v={{ buildVersion }}"></script> | ||
<script src="{{ pathPrefix }}/static/vendor/js/jquery.selection.js?v={{ buildVersion }}"></script> | ||
<!-- <script src="{{ pathPrefix }}/static/vendor/js/jquery.hotkeys.js?v={{ buildVersion }}"></script> --> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this commented thing needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed bebabfa
Exposes a web interface for the state of Thanos buckets just like `pprof web`. This is an early preview, the code is rough on the edges and could use some review. See the screenshots for some examples. * What works? * - Can display the timeline of blocks per shard - Can print error messages if any - Reloads data from remote storage every 30m - Reloads browser slightly smarter than that * TODOS * - Is any of the run.Group, dummy actor etc needed? - Some download code is duplicated from `ls -o json`; refactor as needed - Some web code is duplicated from query UI; refactor as needed - Do we need any of the route prefix? * Notes for the reviewer * - goimports is slightly off on master, the changes in kingpin is not intentional - Multiple `client.NewBucket` will panic; the downloader avoids that. - `downloader` can be rewritten if required - Asset reloading doesn't work with `DEBUG=1`, I had to add an extra `-debug`
1. Groups seems like a much better pattern than spawning adhoc goroutines; in this case its desirable that a group of threads shutdown cleanly if any fail. 2. runutil has some nice utilities, better than sleep loops 3. Closures in Go is painful to read, create a client at the top level and pass it as argument instead. 4. Log and retry download errors than propagate it all the way up.
This needed some minor tweaks to work on mac, the sed command ./scripts/genflagdocs.sh only works with gnu sed instead of the BSD version.
This is unfortunate since ideally the loop should handle errors that can be retried and bubble up unexpected errors. The bucket iterator fails with a remote read error while a new block is being uploaded and over the last couple of days, this program crashed multiple times with the same error. See logs for example. Ideally this patch should solve it. ``` level=error ts=2019-06-19T09:22:22.810872915Z caller=bucket.go:411 err="meta.json bkt get for 01DDQ6D62GFFHQKYCJ7TCHP307: The specified key does not exist." msg="Failed to downloaded block metadata" level=error ts=2019-06-19T09:22:22.81573321Z caller=runutil.go:88 msg="function failed. Retrying in next tick" err="meta.json bkt get for 01DDQ6D62GFFHQKYCJ7TCHP307: The specified key does not exist." level=error ts=2019-06-19T09:23:09.970878178Z caller=main.go:194 msg="running command failed" err="meta.json bkt get for 01DDQ6D62GFFHQKYCJ7TCHP307: The specified key does not exist." ```
For small environments with a unique label to identify each Prometheus replica, set the title with a `--label`. If multiple labels are required to duplicate shards, use the default to get an external legend. See screenshots in PR thanos-io#1248 for details.
Download errors are ignored for now, but context cancellations will be propagated up and the process would die. I wish I had simple supervision trees here.
Every `duration` (defaults to 30m), we try to download the blocks within `timeout` (defaults to 5m). This avoids the complicated interaction between multiple timers - the previous version reduced the 1 minute between `Repeat` attempts from the timeout and that wasn't very smart.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bwplotka I guess this is good to go! This is probably the longest time or number of iterations I've ever had to spend on a PR! 😂You were super helpful! Thank you so much :D |
<div class="container-fluid" id="Compactions" style="height: 100%; width: 100%;"></div> | ||
|
||
<!-- | ||
TODO(jaseemabid): The legend is only a small temporary workaround till we have better ways of showing the labels. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perfect 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot @jaseemabid this is amazing contribution and I am pretty sure everyone loves that (: I mean I know that at least 3 people already use it although it's still in PR ;p
This is probably the longest time or iterations I've ever had to spend on a PR! joyYou were super helpful! Thank you so much :D
You can blame me, I am too picky ): and we can blame that weird JS library as well (: so all good 👍
Amazing work 💪 , merging on green.
I'm equally picky with PRs and I'm honestly happy that you did what you did. Cheers! 🍺 |
Exposes a web interface for the state of Thanos buckets just like
pprof web
.This is an early preview, the code is rough on the edges and could use some
review. See the screenshots for some examples.
What works?
TODOS
ls -o json
; refactor as neededNotes for the reviewer
client.NewBucket
will panic; the downloader avoids that.downloader
can be rewritten if requiredDEBUG=1
, I had to add an extra-debug
Default view:
Loading indication:
Other errors: