Description
We have recently started seeing reports of startup issues. These manifest as needing to wait a long time, or needing to install a couple of times (which triggers process restarts). Log files tend to be a bit muddled (partly due to #1205), but they often show issues around the socket.
I believe this is a return of The Monterey Bug (#813 and internal issue).
I spent a while staring at how launcher starts up, and I'm sure it's still kinda racy. There is a huge amount of complexity in our runtime.
See launcher/pkg/osquery/runtime/runner.go, lines 354 to 365 in 7a926d0.
I believe the first steps are to:
- Move enrollment info off the socket: I think a big part of the startup contention is that osquery enrollment goes through launcher, which needs to talk to osquery to get the extra details. While this isn't inherently a race, it's very finicky. I've come to think this is way too clever, and we should simply invoke osquery in single mode and run what we need. See "Get enrollment details via exec, not the thrift socket" (#1213).
- osquery-go should enforce a single user: One of the deep issues here is that osquery/thrift only supports a single user of the thrift socket at a time. And if you have goroutines, it's very easy to get that confused. See the work in "Allow for passing in an ExtensionManagerClient to NewExtensionManagerServer" (osquery/osquery-go#107) and "Add context and lock functionality to client interface" (osquery/osquery-go#108), both extracted from "Add mutexes to osquery-go" (osquery/osquery-go#99).
I suspect that will completely remove the underlying startup issue.
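For the first step above, a minimal sketch of getting details via exec instead of the thrift socket might look like the following. It assumes `osqueryd -S --json <query>` runs a single query in shell mode and prints a JSON array of rows; the binary path and the example query are assumptions, and `queryViaExec` and `parseRows` are hypothetical helpers, not existing launcher functions.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"os/exec"
	"time"
)

// parseRows decodes the JSON array of rows that osquery prints with --json.
func parseRows(out []byte) ([]map[string]string, error) {
	var rows []map[string]string
	if err := json.Unmarshal(out, &rows); err != nil {
		return nil, fmt.Errorf("decoding osquery output: %w", err)
	}
	return rows, nil
}

// queryViaExec runs one query in a short-lived osquery process, so nothing
// touches the extension socket. ctx bounds how long the process may run.
func queryViaExec(ctx context.Context, osquerydPath, sql string) ([]map[string]string, error) {
	out, err := exec.CommandContext(ctx, osquerydPath, "-S", "--json", sql).Output()
	if err != nil {
		return nil, fmt.Errorf("running osquery: %w", err)
	}
	return parseRows(out)
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Assumed install path; adjust for the platform.
	rows, err := queryViaExec(ctx, "/usr/local/bin/osqueryd", "SELECT version FROM osquery_info")
	if err != nil {
		fmt.Println("could not get details:", err)
		return
	}
	fmt.Println(rows)
}
```

Since each call is its own process, there is no shared socket to contend over during startup, at the cost of a process launch per query.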
I think we should also rewrite the entire runtime to be much simpler. At the very core, it's:
- Starting osquery
- Registering kolide_grpc
- Registering kolide_tables
- Providing a Querier to the rest of the code
But it has mountains of spaghetti that have accrued over the years.