RSDK-13768 Support Event Tracing for Windows (ETW) to improve logging performance#6026
Open
EvanDorsky wants to merge 33 commits into
Open
RSDK-13768 Support Event Tracing for Windows (ETW) to improve logging performance#6026EvanDorsky wants to merge 33 commits into
EvanDorsky wants to merge 33 commits into
Conversation
Introduces sessionController interface and logmanSessionController that shells out to logman.exe to manage an ETW session. Used by the upcoming ETW appender to capture provider events to a .etl file. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
logging/windows_etw_logger.go registers an ETW provider with a pinned GUID, attaches it as an Appender, and starts a logman-managed session that captures events to a .etl file at the path the caller supplies. Each zap entry becomes a TraceLogging-style event with structured fields (time, level, logger, caller, message, fields-as-JSON). logging/noop_etw_logger.go provides the matching no-op for non-Windows builds. Promotes github.com/Microsoft/go-winio from indirect to direct dependency (its pkg/etw is what the appender uses). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds RegisterETWLogger alongside the existing RegisterEventLogger call. Output path is $VIAM_HOME/logs/viam-agent-trace.etl (or <UserHomeDir>/.viam/logs/viam-agent-trace.etl when VIAM_HOME is unset), matching the existing rutils.ViamDotDir convention. Both appenders run in parallel; eventlog stays in place. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Exercises Write, Close, and Write-after-close on the appender directly (no session) plus a smoke test of RegisterETWLogger end-to-end. Both paths must not panic; Write-after-close is a no-op via pkg/etw's provider.enabled guard. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Default flush behavior (-ft 0) only writes buffers to disk when full or when the session stops. A hard kill of viam-server therefore loses the events leading up to the crash, which is the most diagnostically valuable window. -ft 1 caps that loss at ~1 second of events for negligible cost. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
logman create trace -ets silently drops some flags (e.g. -ft, -mode bincirc) — the runtime session ends up with default values regardless of what we asked for. Switching to the persistent-definition path (create without -ets, then start separately) honors all flags. stop+delete first makes Start idempotent across any prior state, and ensures config changes between viam-server versions overwrite stale definitions instead of being silently ignored. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
logman's flag for file format is -f, not -mode. Passing -mode bincirc was a typo'd flag that logman silently ignored, which is why every prior run showed Circular: Off in logman query — we'd never actually asked for circular mode. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
|
Code LGTM assuming fully tested 🪟 cc @dgottlieb who set up the original Event Logger infra for RDK (#4713) Should we consider doing the same in Agent, so we don't have Agent logging to the Event Logger only and RDK logging to ETW only?
Why does the 1st plot have both blue (eventlog.tsv) & orange (trace.tsv)? |
Member
I'm unfamiliar with these windows APIs and whether there are more interesting constraints, but yes. I'd expect both to log to the same thing. |
Member
Author
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Background
Motivation for this PR was to speed up logging on Windows, which was found to be very slow on some Kongsberg machines (occasionally taking over 1 second for a single log line).
It looks like @aldenh-viam's fixes addressed whatever was causing these excessively slow logs, but the new ETW-based path for logging is still more modern and more performant than the old Event Logger path.
Changes
Windows now logs by default using the ETW path instead of the Event Logging path. Event Logging will still be used as a fallback if ETW fails to initialize.
ETW works differently from Event Logging. With ETW, there are two steps required to write logs to disk:
If there is no trace session for a given provider, logs are just thrown away.
In this implementation,
viam-servermanages the trace session itself by making calls tologman, a mercifully straightforward tool for managing ETW subscribers. This is the lifecycle:viam-serverwith the GUID66AFF7FE-2451-47AA-A0E3-8E3D2E432B30viam-server-traceand kill themviam-servercalledviam-server-traceviam-server-trace.Log retention and rotation
logmanwrites logs to disk in the default log directory$VIAM_HOME/logsinto files namedviam-server-trace-YYYY-MM-DDThh-mm-ss.etl.A new
.etlfile is created every timeviam-serveris launched, and each.etlfile has a 128 MB maximum size (a change from Event Logging which has a default 20 MB limit). If an individual file reaches the size limit, logs will begin to roll over in a circular manner with the newest logs replacing the oldest logs.On startup,
viam-serverchecks the disk usage of all.etlfiles in the logs folder. If the total usage is above the limit of 384 MB, it will delete log files (starting with the oldest) until total usage is below the limit.I chose this strategy because
logmanoffers two write modes - circular mode and append mode - which cannot be combined. Circular mode always overwrites an existing file, and append mode will just stop logging when the file reaches its size limit.With this strategy:
The default values were just chosen by me so I'm very open to changing them, or changing the strategy if there's something that would better fit the use cases for local logs.
Viewing ETW logs
PerfView can be used to collect and view ETW traces.
Also,
web/cmd/winlogproc, provided here, can be used to convert.etltraces into friendlier.tsvfiles for inspection / automated processing:This command will pull the latest Event Logging logs and ETW logs into your current directory, and process them into sorted
.tsvfiles. There are arguments for specifying the time window of logs to pull.Performance improvements
Trace logging seems less likely to have instances of "really slow logs" (max log duration is called out with the vertical dotted lines)


Aside from that, baseline performance looks slightly improved. These plots were made with logs from two separate runs of
viam-server, one with only ETW logging enabled and one with only Event Logging enabled. The machine config was otherwise the same, a modified kongsberg config with a log duration tester.Testing
There are some new tests for the ETW logger along the same lines as the existing EventLogger test.
There is also an e2e test (currently gated behind the
etw_e2eflag) that stands up a server, logs both Event Logging and ETW logs, and then pulls the logs and compares them line-by-line. I used this test while developing but then flagged it out, mostly because there's no way right now to run both Event Logging and ETW logs at the same time so the test is guaranteed to fail.Future work
ETW has a bunch of other features that we could start using to do very detailed performance characterization in Windows, at least in
viam-server. It shouldn't be too hard to add them now that the scaffolding is in.Other paths considered
I prototyped a NetAppender-style queue for logging with Event Logging and found that this actually made logging slightly slower, so I abandoned that path.