Skip to content

RSDK-13386 Do not emit logs from testAppender if the associated test has completed#6025

Open
benjirewis wants to merge 5 commits into
viamrobotics:mainfrom
benjirewis:avoid-late-oue-logs
Open

RSDK-13386 Do not emit logs from testAppender if the associated test has completed#6025
benjirewis wants to merge 5 commits into
viamrobotics:mainfrom
benjirewis:avoid-late-oue-logs

Conversation

@benjirewis
Copy link
Copy Markdown
Member

@benjirewis benjirewis commented May 19, 2026

RSDK-13386

What

Adds logic to the testAppender to avoid emitting logs if the associated test has already completed.

Why

These logs could emit after tests have already completed, causing data races like the following:

WARNING: DATA RACE
Read at 0x00c0000f300b by goroutine 116:
  testing.(*common).destination()
      /home/testbot/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.1.linux-arm64/src/testing/testing.go:1055 +0x17c
  testing.(*common).log()
      /home/testbot/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.1.linux-arm64/src/testing/testing.go:1027 +0x98
  testing.(*common).Log()
      /home/testbot/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.1.linux-arm64/src/testing/testing.go:1180 +0x6c
  go.viam.com/rdk/logging.(*testAppender).Write()
      /__w/rdk/rdk/logging/testing.go:73 +0x534
  go.viam.com/rdk/logging.(*impl).Write()
      /__w/rdk/rdk/logging/impl.go:313 +0x44c
  go.viam.com/rdk/logging.(*impl).Infow()
      /__w/rdk/rdk/logging/impl.go:483 +0x98
  go.viam.com/rdk/module/modmanager.(*Manager).startModule.(*Manager).newOnUnexpectedExitHandler.func2()
      /__w/rdk/rdk/module/modmanager/manager.go:939 +0x628
  go.viam.com/utils/pexec.(*managedProcess).manage.func3()
      /home/testbot/go/pkg/mod/go.viam.com/utils@v0.4.19/pexec/managed_process.go:351 +0xfc

Previous write at 0x00c0000f300b by goroutine 27:
  testing.tRunner.func1()
      /home/testbot/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.1.linux-arm64/src/testing/testing.go:1921 +0x5e8
  runtime.deferreturn()
      /home/testbot/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.1.linux-arm64/src/runtime/panic.go:589 +0x5c
  testing.(*T).Run.gowrap1()
      /home/testbot/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.1.linux-arm64/src/testing/testing.go:1997 +0x3c

Goroutine 116 (running) created at:
  go.viam.com/utils/pexec.(*managedProcess).manage()
      /home/testbot/go/pkg/mod/go.viam.com/utils@v0.4.19/pexec/managed_process.go:348 +0x2bc
  go.viam.com/utils/pexec.(*managedProcess).start.func1()
      /home/testbot/go/pkg/mod/go.viam.com/utils@v0.4.19/pexec/managed_process.go:288 +0x54
  go.viam.com/utils.ManagedGo.func1()
      /home/testbot/go/pkg/mod/go.viam.com/utils@v0.4.19/runtime.go:184 +0x34
  go.viam.com/utils.PanicCapturingGoWithCallback.func1()
      /home/testbot/go/pkg/mod/go.viam.com/utils@v0.4.19/runtime.go:176 +0x6c

Goroutine 27 (finished) created at:
  testing.(*T).Run()
      /home/testbot/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.1.linux-arm64/src/testing/testing.go:1997 +0x6e0
  testing.runTests.func1()
      /home/testbot/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.1.linux-arm64/src/testing/testing.go:2477 +0x74
  testing.tRunner()
      /home/testbot/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.1.linux-arm64/src/testing/testing.go:1934 +0x164
  testing.runTests()
      /home/testbot/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.1.linux-arm64/src/testing/testing.go:2475 +0x734
  testing.(*M).Run()
      /home/testbot/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.1.linux-arm64/src/testing/testing.go:2337 +0xaf4
  main.main()
      _testmain.go:73 +0x100
==================

That data race is complaining about the fact that a reader is trying to check if the testing object is done while the testing object is writing done to be true. In other words, a test is logging at manager.go:939 after the test has finished.

We should modify our test appender to be aware of when the associated test has completed.

Testing

Verified that if I put a time.Sleep(20 * time.Millisecond) before the Restart context canceled log and run go test -v -count=100 ./module/modmanager -run TestCrashShortCircuit -race, I get a race like the one shown above (or a panic complaining about a test logging after the test completes). I do not get the same race when running with my changes.

@benjirewis benjirewis requested a review from jmatth May 19, 2026 19:12
@viambot viambot added the safe to test This pull request is marked safe to test from a trusted zone label May 19, 2026
@benjirewis
Copy link
Copy Markdown
Member Author

Test failures are from tests that assert on the now-missing logs; looking into it.

@dgottlieb
Copy link
Copy Markdown
Member

dgottlieb commented May 20, 2026

@benjirewis just throwing out a half baked idea (given the test failure fallout): instead of removing the logs, you could have the problematic test pass in a custom logger that finds a way to avoid the problem. Either with an heuristic (e.g: set the level to errors only) or with a more formal switch/Close method on test end that informs the logger to stop emitting logs. Or some combination of the two.

@benjirewis benjirewis changed the title RSDK-13386 Remove OUE logs that may emit after tests complete RSDK-13386 Do not emit logs from testAppender if the associated test has completed May 20, 2026
@benjirewis
Copy link
Copy Markdown
Member Author

with a more formal switch/Close method on test end that informs the logger to stop emitting logs

Thanks! I think this is a good solution and should better help avoid this issue in the future. Threw some synchronization into testAppender that seems to get rid of my local race reproduction. Let me know what you think.

@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels May 20, 2026
@benjirewis benjirewis requested a review from dgottlieb May 20, 2026 16:25
Comment thread logging/testing.go Outdated
func (tapp *testAppender) Write(entry zapcore.Entry, fields []zapcore.Field) error {
tapp.tb.Helper()

log := func(toPrint []string) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks a bit unwieldy. Was there a reason to only hold the lock around the tapp.tb.Log call? Particularly given you've already went for a RWMutex. All of the work here seems to be CPU (read: non-blocking) stuff, so locking is broadly safe. And because it's RW, we'll fully utilize the CPU to the same degree we already are.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was there a reason to only hold the lock around the tapp.tb.Log call?

Switched to RLocking around the whole method. I had started off by minimally holding the lock around the read of completed and the call that would cause the race with test completion (tapp.tb.Log). So, no, not much of a reason beyond making a "small" change. RLocking around the entire method also avoids creating a log entry that we're never going to log, which is nice.

Also some questions that are just for my own edification:

This looks a bit unwieldy.

As in the callback with mutex logic inside could be easily miswritten or misunderstood?

All of the work here seems to be CPU (read: non-blocking) stuff, so locking is broadly safe

In other words: because there's no network or disk IO, for example, in this method, there's no chance of holding the lock for any longer than it takes to the CPU to build up a list of strings for fields and encode the zap entry?

And because it's RW, we'll fully utilize the CPU to the same degree we already are

Because multiple concurrent Writes can RLock?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in the callback with mutex logic inside could be easily miswritten or misunderstood?

To me, having to define a lambda to wrap a small set of instructions for a small handful of early returns I consider unwieldy. Sometimes unwieldy is the only solution that meets other constraints.

Sometimes there's a risk of misunderstanding code written that way (e.g: the processResource lambda -- which further has a section of goutils.PanicCapturingGo). But that's when the function gets large. In this case it's small enough to be hard to misread.

For what it's worth, I think having three early returns that each have to remember to log is also unwieldy. But in that case I preferred the early returns with unwieldy penalty over if blocks (and perhaps even nested if-blocks).

In other words: because there's no network or disk IO, for example, in this method, there's no chance of holding the lock for any longer than it takes to the CPU to build up a list of strings for fields and encode the zap entry?

Yep exactly. And when reasoning about I/O in the worst case sense "holding the lock for any longer than" could actually mean "hold the lock forever because the internet died".

Because multiple concurrent Writes can RLock?

Yep

Comment thread logging/testing.go
func NewTestAppender(tb testing.TB) Appender {
return &testAppender{tb}
tapp := &testAppender{tb: tb}
tb.Cleanup(func() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the consequence of every test appender buying into this behavior?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Probably negligibly) increased latency of Write calls and inability for any test to observe any log that is emitted after the test completes.

w.r.t. a doNotLogAfterCompletion parameter or something similar (not that that's what you're suggesting) I'm struggling to come up with a test where you'd want a logger that is allowed to emit logs after the test completes.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Probably negligibly) increased latency of Write calls and inability for any test to observe any log that is emitted after the test completes.

Not worried about that one -- as you so annotated.

I'm struggling to come up with a test where you'd want a logger that is allowed to emit logs after the test completes.

Getting closer -- I think the philosophical "problem" this patch creates is that we can mask tests that leak goroutines (which subsequently log). Whether tests leak goroutines directly or tests correctly call "Close", but Close is buggy.

goleak itself is pretty poor at identifying leaks because it only works on the test suite boundary. And not per-test. So it's almost necessary for short-lived goroutine leaks to require being the last test run to show up.

@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels May 20, 2026
Copy link
Copy Markdown
Member

@dgottlieb dgottlieb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I assume one thing worth checking is the test in question is failing because it's supposed to be leaking a goroutine and not waiting for it.

Assuming that's true, I'm fine with disabling the testing log output at test finish. But I do want to make sure the choice is intentional to be giving up coverage it is empircally giving us of (mostly test-specific) goroutine leaks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

safe to test This pull request is marked safe to test from a trusted zone

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants