
Inconsistent issues when starting up #1211

Closed

Description

We have recently started seeing reports of startup issues. These manifest as needing to wait a long time, or needing to install a couple of times (which triggers process restarts). Log files tend to be a bit muddled (partly due to #1205), but they often show issues around the socket.

I believe this is a return of The Monterey Bug (#813 and internal issue).

I spent a while staring at how launcher starts up, and I'm sure it's still kinda racy. There is a huge amount of complexity in our runtime.

// Here be dragons
//
// There are two thorny issues. First, we "invert" control of
// the osquery process. We don't really know when osquery will
// be running, so we need a bunch of retries on these connections
//
// Second, because launcher supplements the enroll
// information, this Start function must return fast enough
// that osquery can use the registered tables for
// enrollment. *But* there's been a lot of racy behaviors,
// likely due to time spent registering tables, and subtle
// ordering issues.
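As an illustration of the "bunch of retries" that comment describes, here is a minimal sketch of a bounded retry loop. The `retry` helper, the attempt count, and the delay are assumptions made for this example, not launcher's actual code; the `connect` closure stands in for opening the osquery socket.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retry calls fn until it succeeds, the attempts run out, or ctx is done.
// It returns the number of attempts made. Hypothetical helper for illustration.
func retry(ctx context.Context, attempts int, delay time.Duration, fn func() error) (int, error) {
	var err error
	for i := 1; i <= attempts; i++ {
		if err = fn(); err == nil {
			return i, nil
		}
		// Wait before the next attempt, but stop early if the caller cancels.
		select {
		case <-ctx.Done():
			return i, ctx.Err()
		case <-time.After(delay):
		}
	}
	return attempts, fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}

// demoAttempts simulates a socket that only becomes ready on the third try.
func demoAttempts() (int, error) {
	calls := 0
	connect := func() error {
		calls++
		if calls < 3 {
			return errors.New("socket not ready")
		}
		return nil
	}
	return retry(context.Background(), 5, time.Millisecond, connect)
}

func main() {
	n, err := demoAttempts()
	fmt.Println("attempts:", n, "err:", err)
}
```

A retry loop like this hides the inverted control flow rather than removing it, which is part of why the ordering remains fragile.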

I believe the first steps are to:

  1. Move enrollment info off the socket: I think a big part of the startup contention is that osquery enrollment goes through launcher, which in turn needs to talk to osquery to get the extra details. While this isn't inherently a race, it's very finicky. I've come to think this is way too clever, and we should simply invoke osquery in single mode and run what we need. See "Get enrollment details via exec, not the thrift socket" (#1213).
  2. osquery-go should enforce a single user: One of the deep issues here is that osquery/thrift only supports a single user of the thrift socket at a time, and with goroutines it's very easy to get that confused. See the work in "Allow for passing in an ExtensionManagerClient to NewExtensionManagerServer" (osquery/osquery-go#107) and "Add context and lock functionality to client interface" (osquery/osquery-go#108), both extracted from "Add mutexes to osquery-go" (osquery/osquery-go#99).
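To make the first step concrete, here is a rough sketch of fetching enrollment details from a short-lived osquery process instead of over the thrift socket. Everything here is an assumption for illustration: the `runner` type, `enrollmentDetails`, the query, and the binary path are hypothetical, and the `-S --json` arguments reflect my understanding of osquery's shell mode rather than what #1213 actually does. The runner is injectable so the parsing can be exercised without osquery installed.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"os/exec"
	"time"
)

// runner abstracts process execution (hypothetical, for testability).
type runner func(ctx context.Context, bin string, args ...string) ([]byte, error)

// execRunner runs the real binary and returns its stdout.
func execRunner(ctx context.Context, bin string, args ...string) ([]byte, error) {
	return exec.CommandContext(ctx, bin, args...).Output()
}

// enrollmentDetails runs one query in a separate osquery process and
// returns the first row. The flags passed here are an assumption.
func enrollmentDetails(ctx context.Context, run runner, osquerydPath, sql string) (map[string]string, error) {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	out, err := run(ctx, osquerydPath, "-S", "--json", sql)
	if err != nil {
		return nil, fmt.Errorf("running osquery: %w", err)
	}
	var rows []map[string]string
	if err := json.Unmarshal(out, &rows); err != nil {
		return nil, fmt.Errorf("parsing osquery output: %w", err)
	}
	if len(rows) == 0 {
		return nil, errors.New("osquery returned no rows")
	}
	return rows[0], nil
}

// fakeRunner returns canned JSON so the sketch runs anywhere.
func fakeRunner(_ context.Context, _ string, _ ...string) ([]byte, error) {
	return []byte(`[{"hostname":"example-host"}]`), nil
}

// demoHostname exercises the parsing path with the fake runner.
func demoHostname() (string, error) {
	row, err := enrollmentDetails(context.Background(), fakeRunner,
		"/path/to/osqueryd", "SELECT hostname FROM system_info")
	if err != nil {
		return "", err
	}
	return row["hostname"], nil
}

func main() {
	h, err := demoHostname()
	fmt.Println(h, err)
}
```

Swapping `fakeRunner` for `execRunner` would run the real binary. The point of the design is that enrollment no longer depends on the socket being ready.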

I suspect that will completely remove the underlying startup issue.
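For the second step, the general idea of enforcing a single user can be sketched as a mutex wrapped around the client. This is not the osquery-go API: `rawClient`, `lockedClient`, and `fakeClient` are hypothetical stand-ins, and the actual changes live in the osquery-go pull requests referenced above.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// rawClient stands in for a thrift-backed client that is not safe for
// concurrent use (hypothetical interface).
type rawClient interface {
	Query(sql string) (string, error)
}

// lockedClient serializes every call so only one goroutine uses the
// underlying client at a time.
type lockedClient struct {
	mu    sync.Mutex
	inner rawClient
}

func (c *lockedClient) Query(sql string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.inner.Query(sql)
}

// fakeClient counts how often two calls were in flight at once.
type fakeClient struct {
	inFlight int32
	overlaps int32
}

func (f *fakeClient) Query(sql string) (string, error) {
	if atomic.AddInt32(&f.inFlight, 1) > 1 {
		atomic.AddInt32(&f.overlaps, 1)
	}
	time.Sleep(time.Millisecond) // simulate a round trip on the socket
	atomic.AddInt32(&f.inFlight, -1)
	return "ok", nil
}

// countOverlaps runs concurrent queries through the locked wrapper and
// reports how many overlapping calls reached the fake client.
func countOverlaps(workers int) int32 {
	fake := &fakeClient{}
	client := &lockedClient{inner: fake}

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, _ = client.Query("select 1")
		}()
	}
	wg.Wait()
	return atomic.LoadInt32(&fake.overlaps)
}

func main() {
	fmt.Println("overlapping calls:", countOverlaps(8))
}
```

Because every call goes through the mutex, the fake client never sees two calls at once. The cost is that a slow query blocks all other users of the socket.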

I think we should also rewrite the entire runtime to be much simpler. At its core, it's:

  1. Starting osquery
  2. Registering kolide_grpc
  3. Registering kolide_tables
  4. Providing a Querier to the rest of the code

But it has mountains of spaghetti that have accrued over the years.
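A simplified runtime along those lines could be little more than an ordered list of steps that stops at the first failure. The sketch below is purely illustrative: the `step` type and `startRuntime` are hypothetical names, and the step bodies are no-ops standing in for the real work.

```go
package main

import "fmt"

// step is one named stage of startup (hypothetical type).
type step struct {
	name string
	run  func() error
}

// startRuntime runs the steps strictly in order and stops at the first
// error, returning the names of the steps that completed.
func startRuntime(steps []step) ([]string, error) {
	var done []string
	for _, s := range steps {
		if err := s.run(); err != nil {
			return done, fmt.Errorf("%s: %w", s.name, err)
		}
		done = append(done, s.name)
	}
	return done, nil
}

// demoSteps mirrors the four stages listed above with no-op bodies.
func demoSteps() []step {
	ok := func() error { return nil }
	return []step{
		{"start osquery", ok},
		{"register kolide_grpc", ok},
		{"register kolide_tables", ok},
		{"provide Querier", ok},
	}
}

func main() {
	done, err := startRuntime(demoSteps())
	fmt.Println(done, err)
}
```

Keeping the order explicit in one place makes the sequencing easy to see, in contrast to ordering that emerges from retries and goroutine timing.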
