Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add web UI to display the state of Thanos bucket #1248

Merged
merged 16 commits into from
Jun 28, 2019
Merged

Conversation

jaseemabid
Copy link
Contributor

@jaseemabid jaseemabid commented Jun 12, 2019

Exposes a web interface for the state of Thanos buckets just like pprof web.
This is an early preview, the code is rough on the edges and could use some
review. See the screenshots for some examples.

What works?

  • Can display the timeline of blocks per shard
  • Can print error messages if any
  • Reloads data from remote storage every 30m
  • Reloads browser slightly smarter than that

TODOS

  • Some download code is duplicated from ls -o json; refactor as needed
  • Some web code is duplicated from query UI; refactor as needed
  • Do we need any of the route prefix?
  • Unfortunately learned a bit late that the JS libraries cannot be vendored. That sucks :(
  • The tooltip is pretty basic and can contain a lot more useful info.
  • Allow some kind of filtering and other bells and whistles to the UI, but this can be part of later PR.

Notes for the reviewer

  • goimports is slightly off on master, the changes in kingpin is not intentional
  • Multiple client.NewBucket will panic; the downloader avoids that.
  • downloader can be rewritten if required
  • Asset reloading doesn't work with DEBUG=1, I had to add an extra -debug

Default view:
Screenshot 2019-06-12 at 18 07 52

Loading indication:
Screenshot 2019-06-12 at 18 08 05

Other errors:
Screenshot 2019-06-12 at 18 19 08

@bwplotka
Copy link
Member

🎉 Awesome! I will review tomorrow (:

@jaseemabid
Copy link
Contributor Author

Just a heads up, the program crashed a few times for me and I'm fixing the error handling right now to make it a bit more robust. Will update the PR later today.

@jaseemabid
Copy link
Contributor Author

jaseemabid commented Jun 14, 2019

Heads up! Google is being sort of evil and won't allow the JS to be vendored locally. I didn't see that coming :(

How serious of an issue is it for Thanos? I personally would hate it to be not able to run this without internet or relying on big G. I might have to rewrite the UI now, but thankfully its not much code. /shrug

https://developers.google.com/chart/interactive/faq?csw=1#offline

image

@bwplotka
Copy link
Member

Wow, that is quite unexpected! Tracking everywhere ;p

I think this is totally fine for first iteration. You need internet usually to access object storage as well, right? Let's stick to this and add GH issue improving this later. What do you think? @jaseemabid ?

@filippog
Copy link
Contributor

How serious of an issue is it for Thanos? I personally would hate it to be not able to run this without internet or relying on big G. I might have to rewrite the UI now, but thankfully its not much code. /shrug

FWIW I'm in the same boat, self-hosting Thanos is important together with not requiring internet access.

@jaseemabid
Copy link
Contributor Author

Works for now @bwplotka, but I'd probably rewrite it soon. Could you review the backend code for now? We could get some feedback from users (give it a try in one of your envs?) and use all that to iterate on a v2. I know that we might need some kind of filtering on label, maybe zoom into specific timeranges and better toooltips for example.

@jaseemabid
Copy link
Contributor Author

I've had this version run on prod for a few days now, I'll report tomorrow if it had any errors or crashes and the resource usage.

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! It looks great, but.. it does not work for me. ): Plus some minor suggestions.

Am i doing something wrong? Did not look into details yet. Nothing in dev browser console. I got nice syncing meta message and then Cannot read property 'length' of null ):

Logs:

level=info ts=2019-06-18T14:34:17.31119598Z caller=main.go:149 msg="Tracing will be disabled"
level=info ts=2019-06-18T14:34:17.311764023Z caller=factory.go:39 msg="loading bucket configuration"
level=info ts=2019-06-18T14:34:17.311845218Z caller=bucket.go:334 msg="Listening for query and metrics" address=0.0.0.0:8080
level=info ts=2019-06-18T14:34:17.312509383Z caller=bucket.go:393 msg="synchronizing block metadata"
level=info ts=2019-06-18T14:35:21.963786639Z caller=bucket.go:415 msg="downloaded block metadata"

image

Also, failing unit test indicates you need to invoke make docs (: https://circleci.com/gh/improbable-eng/thanos/3882?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link.

I ran it as:

export GOOGLE_APPLICATION_CREDENTIALS="my-service-account"
go run $(go list ./cmd/thanos/...) bucket web --objstore.config="type: GCS
config:
  bucket: thanos-stable"

Did not reviewed the HTML, JS code yet, but looks fine from first glance.

"github.com/go-kit/kit/log"
opentracing "github.com/opentracing/opentracing-go"
"github.com/prometheus/common/route"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our Go linter is crippled so you need to remove newlines and do make format (or goimports) again.. ):

Copy link
Contributor Author

@jaseemabid jaseemabid Jun 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can manually fix this now, but make format isn't fixing this for me and with the latest version of goimports, this is how it formats bu default. Do you have a local config override?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, yes it's manual work for now until #984 is fixed.

cmd/thanos/bucket.go Outdated Show resolved Hide resolved
cmd/thanos/bucket.go Outdated Show resolved Hide resolved
cmd/thanos/bucket.go Outdated Show resolved Hide resolved
cmd/thanos/bucket.go Outdated Show resolved Hide resolved
defer iterCancel()

// If there are no errors and the data isn't stale, wait a while before updating again.
if bucketUI.Err == nil && time.Since(bucketUI.RefreshedAt) < (5*time.Minute) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it mean that delay cannot be lower than 5 minutes? Worth to add to flag description and maybe add some validation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to keep this loop really simple and not make it download a lot of data from s3. Since compact activity is a fairly slow operation, I guess data that's stale by a few minutes is not a big issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this check and added a simple warning if the user updates at a frequency less than 5m.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This minimum value makes sense, but yes we can check it early on instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

83dda59 simplifies the loop again and a7012e0 made the arguments configurable. Does this look ok?

cmd/thanos/bucket.go Outdated Show resolved Hide resolved
delay := time.Duration(interval) * time.Minute

return runutil.Repeat(delay, ctx.Done(), func() error {
iterCtx, iterCancel := context.WithTimeout(ctx, 5*time.Minute)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might need to configure timeout at some point as well? Also it's tricky as timeout will mean that we will wait until ... next interval which can be in X minutes? We could simplify flow here a bit I think. I would retry until success OR timeout with hard failure on server. Continuing and waiting for next interval tick is like retrying with backoff to me (:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not addressed (:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bwplotka 83dda59 simplifies the loop again and a7012e0 made the arguments configurable.

cmd/thanos/bucket.go Outdated Show resolved Hide resolved
cmd/thanos/bucket.go Outdated Show resolved Hide resolved
@jaseemabid
Copy link
Contributor Author

@bwplotka I'vent seen that UI error before and I wonder if its because of some difference in the way data is returned from GCS vs s3?

@bwplotka
Copy link
Member

There is none as we have everything in our pkg/objstore abstraction. Maybe something in meta.json that is unusual?

How can I give you more data for this error?

@bwplotka
Copy link
Member

It is some JS error, so wonder if it's not some JS versions mismatch? Or browser? I use chrome on linux.

@jaseemabid
Copy link
Contributor Author

@bwplotka I've pushed 2 more commits to make the error handling even more robust. More details in the commit message.

ed0590d Add runutil.Forever since runutil.Repeat stops at first error
566ac2b Make thanos bucket web refresh forever ignoring errors ..

@jaseemabid
Copy link
Contributor Author

This is roughly how the generated JS block looks like for me;

   var blocks = [{"version":1,"ulid":"01CT...;
         err =  null ,
         refreshedAt = "2019-06-19T12:58:31.770801942Z";

There is no length function in any of the new code we added, so it might be somewhere in the library and its hard to debug without a line number. Could you privately share the html file with me? I'd debug and throw it away.

@bwplotka
Copy link
Member

Nice, will give it a go

pkg/runutil/runutil.go Outdated Show resolved Hide resolved
jaseemabid added a commit to monzo/thanos that referenced this pull request Jun 27, 2019
For small environments with a unique label to identify each Prometheus replica,
set the title with a `--label`. If multiple labels are required to duplicate
shards, use the default to get an external legend. See screenshots in PR
thanos-io#1248 for details.
@jaseemabid
Copy link
Contributor Author

Quick update on this! @bwplotka mentioned that using a single label to annotate the timelines won't work for everyone so now we have an option to customize it! ⚙️

From 0057ae6,

Allow customizing timeline title with label

For small environments with a unique label to identify each Prometheus replica, set the title with a --label. If multiple labels are required to duplicate shards, use the default to get an external legend.

If you have a unique label (like I do), pass --label prometheus_replica and you get something like this. ✨

Screenshot 2019-06-27 at 18 12 49

If you need multiple labels, you can skip the --label option and get a full legend below like this 🎹

Screenshot 2019-06-27 at 18 17 49

Admittedly this is only a first draft and in @bwplotka's words, a shows some very ametuer web design skills!

But, almost all of our thanos instability issues were due to huge compact backlog and this is the graph that helped us track missing blocks and make it work reliably again. It's not pretty, but this UI might help you debug some nasty issues.

I'd love to get this out to a bunch of users and get some feedback before reiterating again. Anyone with interest and skills in modern frontend programming could rewrite this in one afternoon and I'd be very happy when that happens. Till then hopefully this will be of some use to a bunch of people 🤞

jaseemabid added a commit to monzo/thanos that referenced this pull request Jun 27, 2019
For small environments with a unique label to identify each Prometheus replica,
set the title with a `--label`. If multiple labels are required to duplicate
shards, use the default to get an external legend. See screenshots in PR
thanos-io#1248 for details.
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉 it works! Thanks!

LGTM, just some minor suggestions, before we merge. Let me know what you think (:

Also can we add CHANGELOG entry?

delay := time.Duration(interval) * time.Minute

return runutil.Repeat(delay, ctx.Done(), func() error {
iterCtx, iterCancel := context.WithTimeout(ctx, 5*time.Minute)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not addressed (:

pkg/runutil/runutil.go Outdated Show resolved Hide resolved

<script src="{{ pathPrefix }}/static/vendor/mustache/mustache.min.js?v={{ buildVersion }}"></script>
<script src="{{ pathPrefix }}/static/vendor/js/jquery.selection.js?v={{ buildVersion }}"></script>
<!-- <script src="{{ pathPrefix }}/static/vendor/js/jquery.hotkeys.js?v={{ buildVersion }}"></script> -->
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this commented thing needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed bebabfa

pkg/ui/templates/bucket.html Show resolved Hide resolved
pkg/ui/templates/bucket.html Show resolved Hide resolved
Exposes a web interface for the state of Thanos buckets just like `pprof web`.
This is an early preview, the code is rough on the edges and could use some
review. See the screenshots for some examples.

* What works? *

- Can display the timeline of blocks per shard
- Can print error messages if any
- Reloads data from remote storage every 30m
- Reloads browser slightly smarter than that

* TODOS *

- Is any of the run.Group, dummy actor etc needed?
- Some download code is duplicated from `ls -o json`; refactor as needed
- Some web code is duplicated from query UI; refactor as needed
- Do we need any of the route prefix?

* Notes for the reviewer *

- goimports is slightly off on master, the changes in kingpin is not intentional
- Multiple `client.NewBucket` will panic; the downloader avoids that.
- `downloader` can be rewritten if required
- Asset reloading doesn't work with `DEBUG=1`, I had to add an extra `-debug`
1. Groups seems like a much better pattern than spawning adhoc goroutines; in
   this case its desirable that a group of threads shutdown cleanly if any fail.

2. runutil has some nice utilities, better than sleep loops

3. Closures in Go is painful to read, create a client at the top level and pass
   it as argument instead.

4. Log and retry download errors than propagate it all the way up.
This needed some minor tweaks to work on mac, the sed command
./scripts/genflagdocs.sh only works with gnu sed instead of the BSD version.
This is unfortunate since ideally the loop should handle errors that can be
retried and bubble up unexpected errors.

The bucket iterator fails with a remote read error while a new block is being
uploaded and over the last couple of days, this program crashed multiple times
with the same error. See logs for example. Ideally this patch should solve it.

```
level=error ts=2019-06-19T09:22:22.810872915Z caller=bucket.go:411 err="meta.json bkt get for 01DDQ6D62GFFHQKYCJ7TCHP307: The specified key does not exist." msg="Failed to downloaded block metadata"
level=error ts=2019-06-19T09:22:22.81573321Z caller=runutil.go:88 msg="function failed. Retrying in next tick" err="meta.json bkt get for 01DDQ6D62GFFHQKYCJ7TCHP307: The specified key does not exist."
level=error ts=2019-06-19T09:23:09.970878178Z caller=main.go:194 msg="running command failed" err="meta.json bkt get for 01DDQ6D62GFFHQKYCJ7TCHP307: The specified key does not exist."
```
For small environments with a unique label to identify each Prometheus replica,
set the title with a `--label`. If multiple labels are required to duplicate
shards, use the default to get an external legend. See screenshots in PR
thanos-io#1248 for details.
Download errors are ignored for now, but context cancellations will be
propagated up and the process would die. I wish I had simple supervision trees
here.
Every `duration` (defaults to 30m), we try to download the blocks within
`timeout` (defaults to 5m). This avoids the complicated interaction between
multiple timers - the previous version reduced the 1 minute between `Repeat`
attempts from the timeout and that wasn't very smart.
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise:

shipit

@jaseemabid
Copy link
Contributor Author

jaseemabid commented Jun 28, 2019

@bwplotka I guess this is good to go!

This is probably the longest time or number of iterations I've ever had to spend on a PR! 😂You were super helpful! Thank you so much :D

<div class="container-fluid" id="Compactions" style="height: 100%; width: 100%;"></div>

<!--
TODO(jaseemabid): The legend is only a small temporary workaround till we have better ways of showing the labels.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perfect 👍

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @jaseemabid this is amazing contribution and I am pretty sure everyone loves that (: I mean I know that at least 3 people already use it although it's still in PR ;p

This is probably the longest time or iterations I've ever had to spend on a PR! joyYou were super helpful! Thank you so much :D

You can blame me, I am too picky ): and we can blame that weird JS library as well (: so all good 👍

Amazing work 💪 , merging on green.

@jaseemabid
Copy link
Contributor Author

I'm equally picky with PRs and I'm honestly happy that you did what you did. Cheers! 🍺

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants