Move from slog to tracing #317

Closed
XAMPPRocky opened this issue Jul 7, 2021 · 5 comments · Fixed by #457
Labels
area/performance Anything to do with Quilkin being slow, or making it go faster. area/user-experience Pertaining to developers trying to use Quilkin, e.g. cli interface, configuration, etc good first issue Good for newcomers help wanted Extra attention is needed priority/medium Issues that we want to resolve, but don't require immediate resolution.

Comments

@XAMPPRocky
Collaborator

tracing is a diagnostics framework built and maintained by the Tokio team. It handles logging, but it provides a lot more because it's designed for general instrumentation; for example, there's a companion crate that uses its output to produce flamegraphs from the events. So using tracing would give us a lot more insight into what's happening at runtime than our current logging solution.

@XAMPPRocky XAMPPRocky added good first issue Good for newcomers help wanted Extra attention is needed area/performance Anything to do with Quilkin being slow, or making it go faster. area/user-experience Pertaining to developers trying to use Quilkin, e.g. cli interface, configuration, etc priority/medium Issues that we want to resolve, but don't require immediate resolution. labels Jul 7, 2021
@markmandel
Member

I'm curious: a while ago I was looking into https://github.com/tikv/pprof-rs to get better visibility into where the bottlenecks are in the codebase. My thought was that we then wouldn't have to manually instrument the codebase. Would something like that be used in conjunction with something like tracing, or are they more like competitors?

(I've been meaning to ping you on what tooling you like for performance tracing for Rust).

@XAMPPRocky
Collaborator Author

Would something like that be used in conjunction with something like tracing, or are they more of competitors?

Well, these two particular libraries are probably competitors right now, because pprof uses backtraces to figure out what happened, whereas tracing uses crafted events. For our use case (async) I think tracing is the way forward, for quite a few reasons. One, it's what the Rust and Tokio developers are focusing on (you can read more about that here: http://blog.pnkfx.org/blog/2021/04/26/road-to-turbowish-part-1-goals/ ). Two, it's like log in that it provides a clear separation between producing events and subscribing to them (it's essentially log + slog + spans). So, for example, you would be able to create a flamegraph from the same events that you write to your logging output, which IMO is a lot better than backtrace-based profiling, because that can be quite noisy with async code, since the tasks are passed into tokio several layers deep in the stack.
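
To make the "separate producing from subscribing" point concrete, here is a minimal, std-only sketch. It is a toy model, not the real `tracing` API: `Signal`, `Subscriber`, `LogLines`, `FoldedStacks` and `dispatch` are all hypothetical names. One stream of span enter/exit signals and events is emitted once and handed to two consumers: one renders log lines, the other aggregates the same events into the "folded stack" text that flamegraph tools read. The real companion crate (tracing-flame) records time spent in spans rather than event counts; counting just keeps the sketch deterministic.

```rust
use std::collections::BTreeMap;

/// What instrumented code emits. Hypothetical, not tracing's API.
#[derive(Clone, Debug)]
pub enum Signal {
    Enter(&'static str),
    Exit(&'static str),
    Event(&'static str),
}

/// Anything that consumes signals: a log writer, a profiler, an exporter...
pub trait Subscriber {
    fn on_signal(&mut self, signal: &Signal);
}

/// Consumer 1: one plain log line per event.
#[derive(Default)]
pub struct LogLines {
    pub lines: Vec<String>,
}

impl Subscriber for LogLines {
    fn on_signal(&mut self, signal: &Signal) {
        if let Signal::Event(msg) = signal {
            self.lines.push(format!("INFO {msg}"));
        }
    }
}

/// Consumer 2: counts events per span stack, in folded-stack form.
#[derive(Default)]
pub struct FoldedStacks {
    stack: Vec<&'static str>,
    counts: BTreeMap<String, u64>,
}

impl Subscriber for FoldedStacks {
    fn on_signal(&mut self, signal: &Signal) {
        match signal {
            Signal::Enter(name) => self.stack.push(*name),
            Signal::Exit(_) => {
                self.stack.pop();
            }
            Signal::Event(_) => {
                *self.counts.entry(self.stack.join(";")).or_insert(0) += 1;
            }
        }
    }
}

impl FoldedStacks {
    /// One `stack count` line per distinct stack, e.g. "recv;route 2".
    pub fn render(&self) -> String {
        self.counts
            .iter()
            .map(|(stack, n)| format!("{stack} {n}\n"))
            .collect()
    }
}

/// Emits each signal once and hands it to every subscriber.
pub fn dispatch(signals: &[Signal], subscribers: &mut [&mut dyn Subscriber]) {
    for signal in signals {
        for sub in subscribers.iter_mut() {
            sub.on_signal(signal);
        }
    }
}
```

Because the instrumented side only emits signals, adding a third consumer (say, an exporter) would not require touching the instrumented code.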

@markmandel
Member

From a logging point of view, as long as we still get structured JSON logging, I'm happy 😄, and it looks like https://docs.rs/tracing-subscriber/0.2.19/tracing_subscriber/fmt/format/struct.Json.html or tracing-serde will help solve this. (Also, our INFO and ERROR level logging is quite quiet, so I don't think much adjustment is needed.)
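
For reference, the formatter linked above is enabled with `tracing_subscriber::fmt().json().init()` (it needs tracing-subscriber's `json` feature). The std-only sketch below does not use that crate; it only illustrates why structured events map cleanly onto JSON: the message and each field stay separate keys instead of being interpolated into one string. `json_string`, `event_to_json` and the key layout are assumptions for illustration and will not match the real formatter's output byte for byte.

```rust
/// Quotes and escapes `s` as a JSON string literal.
fn json_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters must be \u-escaped in JSON.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Renders one structured event as a single JSON line (hypothetical layout).
pub fn event_to_json(level: &str, target: &str, message: &str, fields: &[(&str, &str)]) -> String {
    // The message is just another field; every key/value stays separate.
    let mut parts = vec![format!("\"message\":{}", json_string(message))];
    for (k, v) in fields {
        parts.push(format!("{}:{}", json_string(k), json_string(v)));
    }
    format!(
        "{{\"level\":{},\"target\":{},\"fields\":{{{}}}}}",
        json_string(level),
        json_string(target),
        parts.join(",")
    )
}
```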

where as tracing uses crafted events

That's the only concern I really have - how much work does it take to build out crafted events? Could we possibly miss a bottleneck / potential performance improvement spot if we have to manually craft our events?

That all being said - the integration with opentelemetry and distributed tracing is certainly very interesting (I've been thinking a bit about how we could do tracing across proxies, so this seems prescient):
https://github.com/tokio-rs/tracing/blob/master/examples/examples/opentelemetry.rs

As well as the general focus on how to debug async operations. 👍🏻

This is super interesting! I chose slog because it was simple, and closest to what I knew from doing similar things in Go - but the use cases I had in the past were different.

The other thing I like about this is that it decouples our logging from the quilkin library: if someone consumes Quilkin as a library and wants to align the logging output with what they do (which may differ from our own), they can do that through their own collector.
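
A std-only sketch of that decoupling, loosely modelled on `tracing::subscriber::set_global_default` (that name is real; everything else here, including `emit`, `forward_packet` and `Capture`, is hypothetical). Library code only calls `emit`; the application installs a subscriber once, and until it does, events are silently dropped, which is also what tracing does when no subscriber is set.

```rust
use std::sync::{Arc, Mutex, OnceLock};

/// Hypothetical subscriber trait; tracing's real `Subscriber` trait is richer.
pub trait Subscriber: Send + Sync {
    fn event(&self, level: &str, message: &str);
}

static GLOBAL: OnceLock<Box<dyn Subscriber>> = OnceLock::new();

/// Called once by the application. Fails if a subscriber is already installed.
pub fn set_global_default(sub: Box<dyn Subscriber>) -> Result<(), &'static str> {
    GLOBAL.set(sub).map_err(|_| "a global subscriber is already set")
}

/// What library code calls. With no subscriber installed, this is a no-op.
pub fn emit(level: &str, message: &str) {
    if let Some(sub) = GLOBAL.get() {
        sub.event(level, message);
    }
}

// Library side (hypothetical proxy function): emits events, never picks an output.
pub fn forward_packet(len: usize) -> usize {
    emit("DEBUG", &format!("forwarding {len} bytes"));
    len
}

// Application side: decides where events go (here, an in-memory buffer).
pub struct Capture(pub Arc<Mutex<Vec<String>>>);

impl Subscriber for Capture {
    fn event(&self, level: &str, message: &str) {
        self.0.lock().unwrap().push(format!("[{level}] {message}"));
    }
}
```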

I would recommend that one of us do a small POC first and see what the end results are. If we are all happy, let's move ahead with it?

I really liked this article for getting my head around how tracing works:
https://betterprogramming.pub/production-grade-logging-in-rust-applications-2c7fffd108a6

👍🏻 on my end!

@XAMPPRocky
Collaborator Author

That's the only concern I really have - how much work does it take to build out crafted events? Could we possibly miss a bottleneck / potential performance improvement spot if we have to manually craft our events?

It's not a lot of work to get something useful. You just tag every function you want to instrument with #[tracing::instrument], replace slog's macros with tracing's and call it a day.
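
Roughly speaking, `#[tracing::instrument]` wraps a function body in a span named after the function, records its arguments as span fields, and keeps the span entered while the body runs, so every event inside carries that context. The macro swap is along the lines of turning slog's `info!(log, "received packet"; "len" => len)` into tracing's `info!(len, "received packet")`. The std-only sketch below writes the span handling out by hand so the mechanism is visible; `enter_span`, `SpanGuard`, `info` and `take_output` are hypothetical stand-ins, not tracing's API, and the output format is only loosely modelled on tracing-subscriber's `fmt` output. For `async fn`s the real attribute instruments the returned future instead of holding a guard across `.await`, which is one reason to prefer it over hand-written guards in async code.

```rust
use std::cell::RefCell;

thread_local! {
    // Spans currently entered on this thread, with their rendered fields.
    static SPANS: RefCell<Vec<String>> = RefCell::new(Vec::new());
    // Lines "logged" on this thread, kept in memory so they can be inspected.
    static OUTPUT: RefCell<Vec<String>> = RefCell::new(Vec::new());
}

/// Exits its span when dropped, including on early return or `?`.
pub struct SpanGuard;

impl Drop for SpanGuard {
    fn drop(&mut self) {
        SPANS.with(|s| {
            s.borrow_mut().pop();
        });
    }
}

/// Hypothetical stand-in for creating and entering a span.
pub fn enter_span(name: &str, fields: &[(&str, String)]) -> SpanGuard {
    let rendered: Vec<String> = fields.iter().map(|(k, v)| format!("{k}={v}")).collect();
    let span = if rendered.is_empty() {
        name.to_string()
    } else {
        format!("{name}{{{}}}", rendered.join(" "))
    };
    SPANS.with(|s| s.borrow_mut().push(span));
    SpanGuard
}

/// Hypothetical stand-in for `info!`: prefixes the message with the span context.
pub fn info(message: &str) {
    let context = SPANS.with(|s| s.borrow().join(":"));
    OUTPUT.with(|o| o.borrow_mut().push(format!("INFO {context}: {message}")));
}

/// Drains everything logged so far on this thread.
pub fn take_output() -> Vec<String> {
    OUTPUT.with(|o| std::mem::take(&mut *o.borrow_mut()))
}

// Hand-written approximation of annotating `handle_packet` with #[tracing::instrument].
pub fn handle_packet(len: usize) -> usize {
    // Bind to `_span`, not `_`: `let _ = ...` would drop the guard immediately.
    let _span = enter_span("handle_packet", &[("len", len.to_string())]);
    info("received packet");
    route(len)
}

pub fn route(len: usize) -> usize {
    let _span = enter_span("route", &[]);
    info("selected endpoint");
    len
}
```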

@markmandel
Member

Yeah - this is very exciting. I like this a lot.

One thing that came up recently in a chat I was having - since we can export this to OpenTelemetry, we can export traces with routing tokens to a distributed tracing store (Honeycomb, Google Cloud monitoring, Grafana Tempo, Datadog etc) -- then you can start to tie player data to tracing info and metrics, which can be another avenue to identify bad actors.
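
As a hedged illustration of tying routing tokens to traces: once spans carry the token as an ordinary field (for example via the tracing-opentelemetry layer and an exporter), the trace store can group spans by it. The std-only sketch below assumes a hypothetical exported `SpanRecord` shape and a simple per-token count threshold (`noisy_tokens` is also hypothetical); a real rule for spotting bad actors would be up to whoever operates the trace store.

```rust
use std::collections::HashMap;

/// Hypothetical shape of an exported span: the routing token is just a field.
pub struct SpanRecord {
    pub name: &'static str,
    pub token: String,
}

/// Returns tokens seen in more than `max_spans` spans, sorted for stable output.
pub fn noisy_tokens(records: &[SpanRecord], max_spans: usize) -> Vec<String> {
    let mut per_token: HashMap<&str, usize> = HashMap::new();
    for record in records {
        *per_token.entry(record.token.as_str()).or_insert(0) += 1;
    }
    let mut flagged: Vec<String> = per_token
        .into_iter()
        .filter(|&(_, count)| count > max_spans)
        .map(|(token, _)| token.to_string())
        .collect();
    flagged.sort();
    flagged
}
```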

This has a lot of very interesting and useful applications both for performance testing as well as at scale 👍🏻


3 participants