Skip to content

RSDK-13768 Support Event Tracing for Windows (ETW) to improve logging performance#6026

Open
EvanDorsky wants to merge 33 commits into
viamrobotics:mainfrom
EvanDorsky:winio-etw-appender
Open

RSDK-13768 Support Event Tracing for Windows (ETW) to improve logging performance#6026
EvanDorsky wants to merge 33 commits into
viamrobotics:mainfrom
EvanDorsky:winio-etw-appender

Conversation

@EvanDorsky
Copy link
Copy Markdown
Member

@EvanDorsky EvanDorsky commented May 19, 2026

Background

Motivation for this PR was to speed up logging on Windows, which was found to be very slow on some Kongsberg machines (occasionally taking over 1 second for a single log line).

It looks like @aldenh-viam's fixes addressed whatever was causing these excessively slow logs, but the new ETW-based path for logging is still more modern and more performant than the old Event Logger path.

Changes

Windows now logs by default using the ETW path instead of the Event Logging path. Event Logging will still be used as a fallback if ETW fails to initialize.

ETW works differently from Event Logging. With ETW, there are two steps required to write logs to disk:

  1. The application that will emit logs registers a trace provider (with a GUID) which it will send logs to.
  2. A trace session must subscribe to the trace provider. The trace session receives the logs and can write them to disk.

If there is no trace session for a given provider, logs are just thrown away.

In this implementation, viam-server manages the trace session itself by making calls to logman, a mercifully straightforward tool for managing ETW subscribers. This is the lifecycle:

  1. On startup, register the log provider viam-server with the GUID 66AFF7FE-2451-47AA-A0E3-8E3D2E432B30
  2. Check for any existing ETW sessions named viam-server-trace and kill them
  3. Create a new ETW session subscribed to viam-server called viam-server-trace
  4. On shutdown, kill any ETW sessions called viam-server-trace.

Log retention and rotation

logman writes logs to disk in the default log directory $VIAM_HOME/logs into files named viam-server-trace-YYYY-MM-DDThh-mm-ss.etl.

A new .etl file is created every time viam-server is launched, and each .etl file has a 128 MB maximum size (a change from Event Logging which has a default 20 MB limit). If an individual file reaches the size limit, logs will begin to roll over in a circular manner with the newest logs replacing the oldest logs.

On startup, viam-server checks the disk usage of all .etl files in the logs folder. If the total usage is above the limit of 384 MB, it will delete log files (starting with the oldest) until total usage is below the limit.

I chose this strategy because logman offers two write modes - circular mode and append mode - which cannot be combined. Circular mode always overwrites an existing file, and append mode will just stop logging when the file reaches its size limit.

With this strategy:

  • maximum filesystem usage is capped at ~512 MB (128 + 384)
  • many past runs can be retained as long as the logs are not too loud
  • in the worst case, with very loud logging, the past 4 runs will still be retained
  • the implementation was fairly straightforward
  • logs should never be dropped during a run, because rotation only happens at startup

The default values were just chosen by me so I'm very open to changing them, or changing the strategy if there's something that would better fit the use cases for local logs.

Viewing ETW logs

PerfView can be used to collect and view ETW traces.

Also, web/cmd/winlogproc, provided here, can be used to convert .etl traces into friendlier .tsv files for inspection / automated processing:

go run .\web\cmd\winlogproc

This command will pull the latest Event Logging logs and ETW logs into your current directory, and process them into sorted .tsv files. There are arguments for specifying the time window of logs to pull.

Performance improvements

Trace logging seems less likely to have instances of "really slow logs" (max log duration is called out with the vertical dotted lines)
plot zoomed out
plot zoomed in

Aside from that, baseline performance looks slightly improved. These plots were made with logs from two separate runs of viam-server, one with only ETW logging enabled and one with only Event Logging enabled. The machine config was otherwise the same, a modified kongsberg config with a log duration tester.

Testing

There are some new tests for the ETW logger along the same lines as the existing EventLogger test.

There is also an e2e test (currently gated behind the etw_e2e flag) that stands up a server, logs both Event Logging and ETW logs, and then pulls the logs and compares them line-by-line. I used this test while developing but then flagged it out, mostly because there's no way right now to run both Event Logging and ETW logs at the same time so the test is guaranteed to fail.

Future work

ETW has a bunch of other features that we could start using to do very detailed performance characterization in Windows, at least in viam-server. It shouldn't be too hard to add them now that the scaffolding is in.

Other paths considered

I prototyped a NetAppender-style queue for logging with Event Logging and found that this actually made logging slightly slower, so I abandoned that path.

EvanDorsky and others added 30 commits April 29, 2026 12:30
Introduces sessionController interface and logmanSessionController that
shells out to logman.exe to manage an ETW session. Used by the upcoming
ETW appender to capture provider events to a .etl file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
logging/windows_etw_logger.go registers an ETW provider with a pinned GUID,
attaches it as an Appender, and starts a logman-managed session that
captures events to a .etl file at the path the caller supplies. Each zap
entry becomes a TraceLogging-style event with structured fields (time,
level, logger, caller, message, fields-as-JSON).

logging/noop_etw_logger.go provides the matching no-op for non-Windows
builds.

Promotes github.com/Microsoft/go-winio from indirect to direct dependency
(its pkg/etw is what the appender uses).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds RegisterETWLogger alongside the existing RegisterEventLogger call.
Output path is $VIAM_HOME/logs/viam-agent-trace.etl (or
<UserHomeDir>/.viam/logs/viam-agent-trace.etl when VIAM_HOME is unset),
matching the existing rutils.ViamDotDir convention. Both appenders run
in parallel; eventlog stays in place.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Exercises Write, Close, and Write-after-close on the appender directly
(no session) plus a smoke test of RegisterETWLogger end-to-end. Both
paths must not panic; Write-after-close is a no-op via pkg/etw's
provider.enabled guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Default flush behavior (-ft 0) only writes buffers to disk when full or
when the session stops. A hard kill of viam-server therefore loses the
events leading up to the crash, which is the most diagnostically valuable
window. -ft 1 caps that loss at ~1 second of events for negligible cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
logman create trace -ets silently drops some flags (e.g. -ft, -mode
bincirc) — the runtime session ends up with default values regardless
of what we asked for. Switching to the persistent-definition path
(create without -ets, then start separately) honors all flags.

stop+delete first makes Start idempotent across any prior state, and
ensures config changes between viam-server versions overwrite stale
definitions instead of being silently ignored.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
logman's flag for file format is -f, not -mode. Passing -mode bincirc
was a typo'd flag that logman silently ignored, which is why every
prior run showed Circular: Off in logman query — we'd never actually
asked for circular mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@EvanDorsky EvanDorsky requested review from aldenh-viam and cheukt May 19, 2026 19:40
@viambot viambot added the safe to test This pull request is marked safe to test from a trusted zone label May 19, 2026
@EvanDorsky EvanDorsky marked this pull request as draft May 19, 2026 19:48
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels May 19, 2026
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels May 19, 2026
@EvanDorsky EvanDorsky marked this pull request as ready for review May 19, 2026 21:09
@aldenh-viam
Copy link
Copy Markdown
Contributor

Code LGTM assuming fully tested 🪟

cc @dgottlieb who set up the original Event Logger infra for RDK (#4713)

Should we consider doing the same in Agent, so we don't have Agent logging to the Event Logger only and RDK logging to ETW only?

These plots were made with logs from two separate runs of viam-server, one with only ETW logging enabled and one with only Event Logging enabled.

Why does the 1st plot have both blue (eventlog.tsv) & orange (trace.tsv)?

@dgottlieb
Copy link
Copy Markdown
Member

Should we consider doing the same in Agent, so we don't have Agent logging to the Event Logger only and RDK logging to ETW only?

I'm unfamiliar with these windows APIs and whether there are more interesting constraints, but yes. I'd expect both to log to the same thing.

@EvanDorsky
Copy link
Copy Markdown
Member Author

Ah good point, I'll see what it takes to migrate agent to this.

Why does the 1st plot have both blue (eventlog.tsv) & orange (trace.tsv)?

That's my fault, the plots are hard to interpret - both plots have data from eventlog and trace, but they're really quite close, so they overlap and it looks like there's only data from trace since it plots on top. If you zoom in really close on the second plot, you can see that eventlog (blue) has a slightly longer tail than trace:

image

On the first plot it's obvious that there's data from both log types because I put vertical lines by the min and max value for each - which shows that the slowest log for eventlog is slower than for trace.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

safe to test This pull request is marked safe to test from a trusted zone

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants