Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add prometheus metrics #604

Merged
merged 9 commits into from
May 14, 2021
Merged

Add prometheus metrics #604

merged 9 commits into from
May 14, 2021

Conversation

oliver006
Copy link
Contributor

This PR adds a flag to the tile38 server to expose Prometheus metrics via a configurable http endpoint. By default, it's disabled and no metrics are exposed.

The changes in main.go here add a new http listener if the cmd line flag is set.
You can run it like this:
go run cmd/tile38-server/main.go --metrics-addr=0.0.0.0:4321

I utilized the existing code from stats.go to boostrap some basic metrics and then broke out the strings/objects/point counts per collection to show what's possible. I've not yet exposed all existing SERVER EXT metrics but the path forward is pretty clear.
You can get see the metrics by running:

curl -s localhost:4321/metrics
...
# HELP tile38_collection_points Total number of points per collection
# TYPE tile38_collection_points gauge
tile38_collection_points{col="fleet"} 2
tile38_collection_points{col="fleet2"} 4
tile38_collection_points{col="fleet3"} 1
...

I added measuring of cmd processing time here - it uses a Prometheus summary to record the timings of each cmd processed and exposes them in 50%, 90%, 95%, and 99% quantiles.

Example (I ran scan only once so the data exposed is not that exciting):

curl -s localhost:4321/metrics  | grep cmd_duration_
# HELP tile38_cmd_duration_seconds
# TYPE tile38_cmd_duration_seconds summary
tile38_cmd_duration_seconds{cmd="scan",quantile="0.5"} 0.000230938
tile38_cmd_duration_seconds{cmd="scan",quantile="0.9"} 0.000230938
tile38_cmd_duration_seconds{cmd="scan",quantile="0.95"} 0.000230938
tile38_cmd_duration_seconds{cmd="scan",quantile="0.99"} 0.000230938
tile38_cmd_duration_seconds_sum{cmd="scan"} 0.000230938
tile38_cmd_duration_seconds_count{cmd="scan"} 1

This is a start but I think a lot more metrics can be exposed similar to the way it's done in this PR.

Let me know what you think!

@stephenlacy
Copy link
Contributor

Are the vendor and gofmt changes required for this pr? Seems excessive

@oliver006
Copy link
Contributor Author

The vendor changes are due to pulling in prometheus/client_golang (i put the vendor update in its own commit, happy to combine it though) and the go fmt change is just cleaning up 5 lines in one file - doesn't seem excessive to me.

@stephenlacy
Copy link
Contributor

Just that the vendor additions did +47,426 −1,303 changes

@tidwall
Copy link
Owner

tidwall commented May 7, 2021

The prometheus/client_golang is a heavyweight package for sure, but it's par for the course for Tile38.
Short of removing vending, I'm not sure what else can be done about the additions.

@oliver006
Copy link
Contributor Author

@tidwall - right, you could remove vendoring the dependencies but that's a bigger change and definitely out of scope for this PR (if you even want to remove vendoring at all).

@oliver006
Copy link
Contributor Author

@tidwall - what would be helpful is know if this PR is going in the right direction.
E.g. that using the cmd line flag and spinning up the additional listener makes sense and generally is in line with how you think things should look like.

If you're ok with this approach then I'll clean this PR up a bit, add a few more metrics and a test or so and get this ready for a real review.

@tidwall
Copy link
Owner

tidwall commented May 10, 2021

A cursory look is a-ok. I'll take a deeper dive in the morning.

@tidwall
Copy link
Owner

tidwall commented May 10, 2021

The tests are failing at tests/mock_test.go:55. Your new metricsAddr param is missing

The make test allows for running tests locally.

@tidwall
Copy link
Owner

tidwall commented May 10, 2021

Other than the broken make test and missing lock in the Collect function I good with the PR.

@oliver006
Copy link
Contributor Author

Thanks @tidwall !
I'll make the suggested changes later today and also do the remaining cleanup/polish/tests/etc.

@oliver006 oliver006 changed the title [wip] Add prometheus metrics Add prometheus metrics May 12, 2021
@oliver006
Copy link
Contributor Author

@tidwall - addressed your comments, added a few more metrics and also tests, this is ready for a real review now.

One more question: should I add a blurb to the README about how to enable Prometheus metrics? Don't really see a section where command line flags are covered so not really sure where this would fit. Maybe a short section right after To run a single server: ?

@tidwall
Copy link
Owner

tidwall commented May 13, 2021

Should I add a blurb to the README about how to enable Prometheus metrics?

I think that would be useful. Perhaps above the Network protocols section.

It would also make a great topic on the documentation page

There's one test failure, but otherwise your code looks good.

@oliver006
Copy link
Contributor Author

Fixed the tests and added instructions on how to expose the metrics (added it as a sub-item to the "Running" section).

I'll see that I get the documentation page updated later this week, or next.

@tidwall tidwall merged commit 559081e into tidwall:master May 14, 2021
@oliver006
Copy link
Contributor Author

@tidwall
Copy link
Owner

tidwall commented May 15, 2021

Thanks again. I plan on including it in a new version next week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants