[external-api] Add health field to update status#10271

Open
karencfv wants to merge 5 commits into oxidecomputer:main from karencfv:include-health-in-update-status-api

@karencfv
Contributor

@karencfv karencfv commented Apr 15, 2026

This PR is the last piece of a minimal system health check for update status. It adds a new field, is_system_healthy, to the system/update/status API; the value is true or false based on the information in the latest inventory collection. Once #10027 is merged, we'll include stale sagas as well.

Disclaimer: I used the Claude Code skill to make the endpoint edit and to write part of the code (I'm trying to learn how to use it here). I checked the code several times and tested it manually, but thought I'd mention it.

Manual tests:

There are unhealthy services

$ ./target/debug/omdb db inventory collections show latest --db-url "postgresql://root@[::1]:59809/omicron?sslmode=disable"
<...>
    zpools
      0be6eab2-9e27-4c3e-bbaf-11435e393ed2: total size: 16 GiB health: online
      4ac3f3b4-a423-46cb-93d1-bc393545b9e1: total size: 16 GiB health: online
      77468dca-740c-49f3-b10e-a21a3d9e6462: total size: 16 GiB health: online
<...>
SMF SERVICES STATUS
    4 SMF services enabled but not online at 2026-04-16T06:27:35.387Z
        FMRI                                ZONE       STATE       
        svc:/site/fake-service2:default     global     maintenance 
        svc:/site/fake-service3:default     global     offline     
        svc:/site/fake-service4:default     global     degraded    
        svc:/site/fake-service:default      global     maintenance
<...>
$ curl -b cookies.txt -H "api-version: 2026041500.0.0"   http://127.0.0.1:12220/v1/system/update/status | jq
  % Total    % Received % Xferd  Average Speed  Time    Time    Time   Current
                                 Dload  Upload  Total   Spent   Left   Speed
100    189 100    189   0      0    959      0                              0
{
  "target_release": null,
  "components_by_release_version": {
    "install dataset": 8,
    "unknown": 11
  },
  "time_last_step_planned": "2026-04-16T07:08:43.121286Z",
  "suspended": false,
  "is_system_healthy": false
}

Everything is happy!

$ ./target/debug/omdb db inventory collections show latest --db-url "postgresql://root@[::1]:59809/omicron?sslmode=disable"
<...>
    zpools
      337ab774-358d-4cb4-bdf4-5672caa90d5f: total size: 16 GiB health: online
      c8118f52-a5f4-451a-87ce-cf331b80988c: total size: 16 GiB health: online
      e2b28628-9c8e-4be3-9086-5c52082c3f85: total size: 16 GiB health: online
<...>
SMF SERVICES STATUS
    0 SMF services enabled but not online at 2026-04-16T07:11:45.570Z
<...>
$ curl -b cookies.txt -H "api-version: 2026041500.0.0"   http://127.0.0.1:12220/v1/system/update/status | jq
  % Total    % Received % Xferd  Average Speed  Time    Time    Time   Current
                                 Dload  Upload  Total   Spent   Left   Speed
100    188 100    188   0      0   1197      0                              0
{
  "target_release": null,
  "components_by_release_version": {
    "install dataset": 8,
    "unknown": 11
  },
  "time_last_step_planned": "2026-04-16T07:11:46.268131Z",
  "suspended": false,
  "is_system_healthy": true
}
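As a rough illustration of what the two manual runs exercise, here is a minimal sketch of deriving one boolean from inventory data. Everything in it is an assumption for illustration: `InventorySnapshot`, `ZpoolHealth`, `SmfState`, `SmfService`, `svc`, and `not_online` are hypothetical stand-ins, not omicron's real types, and the rule itself (every zpool online, no SMF service enabled but not online) is only inferred from the omdb output above.

```rust
// Hypothetical, simplified stand-ins for inventory data (not omicron's real types).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ZpoolHealth {
    Online,
    Degraded,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SmfState {
    Online,
    Offline,
    Degraded,
    Maintenance,
}

struct SmfService {
    fmri: String,
    enabled: bool,
    state: SmfState,
}

struct InventorySnapshot {
    zpools: Vec<ZpoolHealth>,
    smf_services: Vec<SmfService>,
}

// Convenience constructor used only to keep the example short.
fn svc(fmri: &str, enabled: bool, state: SmfState) -> SmfService {
    SmfService { fmri: fmri.to_string(), enabled, state }
}

// FMRIs of services that are enabled but not online.
fn not_online(inv: &InventorySnapshot) -> Vec<&str> {
    inv.smf_services
        .iter()
        .filter(|s| s.enabled && s.state != SmfState::Online)
        .map(|s| s.fmri.as_str())
        .collect()
}

// Assumed rule: healthy only if every zpool is online and no enabled
// SMF service is in a state other than online.
fn is_system_healthy(inv: &InventorySnapshot) -> bool {
    inv.zpools.iter().all(|h| *h == ZpoolHealth::Online) && not_online(inv).is_empty()
}

fn main() {
    // Mirrors the first manual test: three online zpools, four bad services.
    let inv = InventorySnapshot {
        zpools: vec![ZpoolHealth::Online; 3],
        smf_services: vec![
            svc("svc:/site/fake-service2:default", true, SmfState::Maintenance),
            svc("svc:/site/fake-service3:default", true, SmfState::Offline),
            svc("svc:/site/fake-service4:default", true, SmfState::Degraded),
            svc("svc:/site/fake-service:default", true, SmfState::Maintenance),
        ],
    };
    println!("is_system_healthy: {}", is_system_healthy(&inv));
    println!("not online: {:?}", not_online(&inv));
}
```

Keeping the list of offending FMRIs available alongside the boolean is a deliberate choice in this sketch, since it makes a `false` result explainable.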

Closes: #9418

@karencfv karencfv marked this pull request as ready for review April 16, 2026 07:37
@david-crespo
Contributor

It worries me slightly to tell the user the system is unhealthy at times when that's expected.

@karencfv
Contributor Author

> It worries me slightly to tell the user the system is unhealthy at times when that's expected.

I totally get it. My first instinct was to call this "is_system_updateable" or something like that. We discussed this somewhere, probably during a meeting; I looked for the discussion but couldn't find it. I don't remember the specifics, but I think the reasoning behind this naming was to make sure users don't ignore the issue if they encounter an "unhealthy" system, and that they do call support.

Maybe @davepacheco can expand

An idea was floated that the console could hide the status while an update is ongoing. @david-crespo, what is your take on this?

@david-crespo
Contributor

That’s interesting, so it would be healthy/unhealthy, unless fewer than 100% of components are on the target version, in which case we’re “updating” or something. I guess I wonder what “unhealthy” is supposed to tell the user. I’d much rather have it in the form of an active problem.
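To make that idea concrete, here is a minimal sketch of how a client such as the console might pick what to display. It rests on assumptions: `DisplayStatus` and `display_status` are hypothetical names, the target release is modeled as a plain version string, and the keys of `components_by_release_version` are assumed to match that string when a target is set.

```rust
use std::collections::BTreeMap;

// Hypothetical display states; not part of the actual API.
#[derive(Debug, PartialEq, Eq)]
enum DisplayStatus {
    Updating,
    Healthy,
    Unhealthy,
}

// Show "updating" while a target release is set and fewer than 100% of
// components are on it; otherwise fall back to the health boolean.
fn display_status(
    target_release: Option<&str>,
    components_by_release_version: &BTreeMap<String, u32>,
    is_system_healthy: bool,
) -> DisplayStatus {
    let total: u32 = components_by_release_version.values().sum();
    let on_target = target_release
        .and_then(|t| components_by_release_version.get(t))
        .copied()
        .unwrap_or(0);

    if target_release.is_some() && on_target < total {
        DisplayStatus::Updating
    } else if is_system_healthy {
        DisplayStatus::Healthy
    } else {
        DisplayStatus::Unhealthy
    }
}

fn main() {
    // Values taken from the first manual test in the PR description:
    // no target release, 8 + 11 components, is_system_healthy = false.
    let mut components = BTreeMap::new();
    components.insert("install dataset".to_string(), 8);
    components.insert("unknown".to_string(), 11);
    println!("{:?}", display_status(None, &components, false));
}
```

With no target release set, as in the manual tests, this falls straight through to the health boolean, so hiding only applies while an update is actually in flight.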

@karencfv
Contributor Author

karencfv commented Apr 17, 2026

The idea of this work is to replace the health check script the support team currently runs before and after each update, until we have a proper FM implementation. We want it tied specifically to the update process (see https://rfd.shared.oxide.computer/rfd/0612). More detail is in #9876.

Perhaps we can chat further on the topic at the next update sync to make sure we're all on the same page?

@david-crespo
Contributor

david-crespo commented Apr 17, 2026

That's helpful, I'll read that issue. Off the top of my head, I think it would feel better to me (and possibly be more useful to support) to have all the sub-checks as separate booleans rather than synthesizing them all into one big AND. And it doesn't really feel that update-specific, even though it's used during update. So maybe it belongs in its own endpoint?
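A minimal sketch of what separate sub-checks could look like, as opposed to a single synthesized boolean. The struct and field names (`HealthChecks`, `zpools_online`, `smf_services_online`, `no_stale_sagas`) are hypothetical and only loosely based on the checks mentioned in this thread; the real set of checks and their names are not settled here.

```rust
// Hypothetical response shape: one boolean per sub-check.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct HealthChecks {
    zpools_online: bool,
    smf_services_online: bool,
    no_stale_sagas: bool,
}

impl HealthChecks {
    // The "one big AND" is still available as a derived value.
    fn all_ok(&self) -> bool {
        self.zpools_online && self.smf_services_online && self.no_stale_sagas
    }

    // Names of the sub-checks that failed, so a client can show the
    // specific problem rather than a bare "unhealthy".
    fn failing(&self) -> Vec<&'static str> {
        let checks = [
            ("zpools_online", self.zpools_online),
            ("smf_services_online", self.smf_services_online),
            ("no_stale_sagas", self.no_stale_sagas),
        ];
        checks
            .iter()
            .filter(|(_, ok)| !*ok)
            .map(|(name, _)| *name)
            .collect()
    }
}

fn main() {
    let checks = HealthChecks {
        zpools_online: true,
        smf_services_online: false,
        no_stale_sagas: true,
    };
    println!("all_ok: {}", checks.all_ok());
    println!("failing: {:?}", checks.failing());
}
```

The trade-off in this sketch is that the aggregate can always be computed from the parts, while the parts cannot be recovered from the aggregate.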

Development

Successfully merging this pull request may close these issues.

Expose health check information via "update status" API