Skip to content

Refine update delivery messages from MGS to SP #1684

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Sep 8, 2022
Merged

Conversation

jgallagher
Copy link
Contributor

The largest change is that UpdateStart has been broken into
UpdatePrepare followed by a series of UpdatePrepareStatus messages
(until the SP responds that update preparation is done). This allows for
slow update prep process, such as erasing the host boot flash (which can
take multiple minutes).

Minor changes include:

  • A new "update abort" message to abandon an in-progress (presumably
    from a failed MGS) update
  • Update messages now include a random stream-id to allow the SP to
    better correlate them and reject unrelated update packets

In total, these are the changes necessary to support delivering host boot flash updates via faux-mgs. Delivering via MGS proper will require a small subsequent PR to add or modify an API (see the new TODO in http_entrypoints.rs).

The largest change is that `UpdateStart` has been broken into
`UpdatePrepare` followed by a series of `UpdatePrepareStatus` messages
(until the SP responds that update preparation is done). This allows for
slow update prep process, such as erasing the host boot flash (which can
take multiple minutes).

Minor changes include:

* A new "update abort" message to abandon an in-progress (presumably
  from a failed MGS) update
* Update messages now include a random `stream-id` to allow the SP to
  better correlate them and reject unrelated update packets
}

// We can't break out of the for loop without retrying usize::MAX times,
// which isn't phsyically possible (but the compiler doesn't know that).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stupid compiler!

@@ -240,6 +327,18 @@ impl SingleSp {
Ok(data)
}

/// Abort an in-progress update.
pub async fn update_abort(&self, component: SpComponent) -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A random thought here. It's possible an abort gets delayed in the network and comes in after a new stream starts. I don't think this is a major problem, but it may be worth considering in a follow up if we want to request the current stream-id from the aborting MGS, and then cancel using that so if we start a new one we don't have this issue.

Copy link
Contributor

@andrewjstone andrewjstone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@jgallagher jgallagher merged commit e6eccff into main Sep 8, 2022
@jgallagher jgallagher deleted the mgs-host-boot-update branch September 8, 2022 19:27
jgallagher added a commit to oxidecomputer/hubris that referenced this pull request Sep 13, 2022
This tracks the corresponding changes to MGS messaging merged in
oxidecomputer/omicron#1684:

1. `UpdateStart` has been broken into `UpdatePrepare` and
   `UpdatePrepareStatus` (which MGS will continue to send periodically
   until we respond that preparation is done), allowing for updates that
   have a potentially-long running prep step (like updating host flash,
   which can take up to several minutes to erase!).
2. Update messages now include a stream-id that we use to correlate
   related messages; we reject update messages that don't match our
   current stream ID.
3. Add handling for the new `UpdateAbort` abort message to cancel an
   in-progress update.

By far the most complex bit of this is 1: I've moved setting the system
timer out of `mgs_gimlet` (which previously set it only in relation to
flushing serial console uart packets out to MGS) and into `main`: now
the MGS handler only returns the deadline it wants `main` to set. If
we're in the process of prepping for a host flash update (i.e., we need
to erase the host flash), we'll set our deadline to 1 tick from now.
When it fires, we'll erase 8 sectors (takes about 1 second, worst case),
then return and allow `main` to check for other work. This allows us to
continue to be responsive to incoming notifications, importantly network
requests (allowing us to respond to the `UpdatePrepareStatus` messages
in a timely way!).
jgallagher added a commit to oxidecomputer/hubris that referenced this pull request Sep 14, 2022
This tracks the corresponding changes to MGS messaging merged in
oxidecomputer/omicron#1684:

1. `UpdateStart` has been broken into `UpdatePrepare` and
   `UpdatePrepareStatus` (which MGS will continue to send periodically
   until we respond that preparation is done), allowing for updates that
   have a potentially-long running prep step (like updating host flash,
   which can take up to several minutes to erase!).
2. Update messages now include a stream-id that we use to correlate
   related messages; we reject update messages that don't match our
   current stream ID.
3. Add handling for the new `UpdateAbort` abort message to cancel an
   in-progress update.

By far the most complex bit of this is 1: I've moved setting the system
timer out of `mgs_gimlet` (which previously set it only in relation to
flushing serial console uart packets out to MGS) and into `main`: now
the MGS handler only returns the deadline it wants `main` to set. If
we're in the process of prepping for a host flash update (i.e., we need
to erase the host flash), we'll set our deadline to 1 tick from now.
When it fires, we'll erase 8 sectors (takes about 1 second, worst case),
then return and allow `main` to check for other work. This allows us to
continue to be responsive to incoming notifications, importantly network
requests (allowing us to respond to the `UpdatePrepareStatus` messages
in a timely way!).
jgallagher added a commit to oxidecomputer/hubris that referenced this pull request Sep 14, 2022
* mgmt-gateway: update host flash

This tracks the corresponding changes to MGS messaging merged in
oxidecomputer/omicron#1684:

1. `UpdateStart` has been broken into `UpdatePrepare` and
   `UpdatePrepareStatus` (which MGS will continue to send periodically
   until we respond that preparation is done), allowing for updates that
   have a potentially-long running prep step (like updating host flash,
   which can take up to several minutes to erase!).
2. Update messages now include a stream-id that we use to correlate
   related messages; we reject update messages that don't match our
   current stream ID.
3. Add handling for the new `UpdateAbort` abort message to cancel an
   in-progress update.

By far the most complex bit of this is 1: I've moved setting the system
timer out of `mgs_gimlet` (which previously set it only in relation to
flushing serial console uart packets out to MGS) and into `main`: now
the MGS handler only returns the deadline it wants `main` to set. If
we're in the process of prepping for a host flash update (i.e., we need
to erase the host flash), we'll set our deadline to 1 tick from now.
When it fires, we'll erase 8 sectors (takes about 1 second, worst case),
then return and allow `main` to check for other work. This allows us to
continue to be responsive to incoming notifications, importantly network
requests (allowing us to respond to the `UpdatePrepareStatus` messages
in a timely way!).
FawazTirmizi pushed a commit to rivosinc/hubris that referenced this pull request Oct 3, 2022
* mgmt-gateway: update host flash

This tracks the corresponding changes to MGS messaging merged in
oxidecomputer/omicron#1684:

1. `UpdateStart` has been broken into `UpdatePrepare` and
   `UpdatePrepareStatus` (which MGS will continue to send periodically
   until we respond that preparation is done), allowing for updates that
   have a potentially-long running prep step (like updating host flash,
   which can take up to several minutes to erase!).
2. Update messages now include a stream-id that we use to correlate
   related messages; we reject update messages that don't match our
   current stream ID.
3. Add handling for the new `UpdateAbort` abort message to cancel an
   in-progress update.

By far the most complex bit of this is 1: I've moved setting the system
timer out of `mgs_gimlet` (which previously set it only in relation to
flushing serial console uart packets out to MGS) and into `main`: now
the MGS handler only returns the deadline it wants `main` to set. If
we're in the process of prepping for a host flash update (i.e., we need
to erase the host flash), we'll set our deadline to 1 tick from now.
When it fires, we'll erase 8 sectors (takes about 1 second, worst case),
then return and allow `main` to check for other work. This allows us to
continue to be responsive to incoming notifications, importantly network
requests (allowing us to respond to the `UpdatePrepareStatus` messages
in a timely way!).
FawazTirmizi pushed a commit to rivosinc/hubris that referenced this pull request Oct 3, 2022
* mgmt-gateway: update host flash

This tracks the corresponding changes to MGS messaging merged in
oxidecomputer/omicron#1684:

1. `UpdateStart` has been broken into `UpdatePrepare` and
   `UpdatePrepareStatus` (which MGS will continue to send periodically
   until we respond that preparation is done), allowing for updates that
   have a potentially-long running prep step (like updating host flash,
   which can take up to several minutes to erase!).
2. Update messages now include a stream-id that we use to correlate
   related messages; we reject update messages that don't match our
   current stream ID.
3. Add handling for the new `UpdateAbort` abort message to cancel an
   in-progress update.

By far the most complex bit of this is 1: I've moved setting the system
timer out of `mgs_gimlet` (which previously set it only in relation to
flushing serial console uart packets out to MGS) and into `main`: now
the MGS handler only returns the deadline it wants `main` to set. If
we're in the process of prepping for a host flash update (i.e., we need
to erase the host flash), we'll set our deadline to 1 tick from now.
When it fires, we'll erase 8 sectors (takes about 1 second, worst case),
then return and allow `main` to check for other work. This allows us to
continue to be responsive to incoming notifications, importantly network
requests (allowing us to respond to the `UpdatePrepareStatus` messages
in a timely way!).
leftwo pushed a commit that referenced this pull request Mar 26, 2025
Crucible changes are:
Make `ImpactedBlocks` an absolute range (#1685)
update omicron (#1687)
Update Rust crate chrono to v0.4.40 (#1682)
Update Rust crate tempfile to v3.19.0 (#1676)
Log loops during crutest replay (#1674)
Refactor live-repair start and send (#1673)
Update Rust crate libc to v0.2.171 (#1684)
Update Rust crate clap to v4.5.32 (#1683)
Update Rust crate async-trait to 0.1.88 (#1680)
Update Rust crate bytes to v1.10.1 (#1681)
Update Rust crate anyhow to v1.0.97 (#1679)
Update Rust crate http to v1 (#1678)
Update Rust crate tokio-util to v0.7.14 (#1675)
Update Rust crate thiserror to v2 (#1657)
Update Rust crate omicron-zone-package to 0.12.0 (#1647)
Update Rust crate opentelemetry to 0.28.0 (#1543)
Update Rust crate itertools to 0.14.0 (#1646)
Update Rust crate clearscreen to v4 (#1656)
nightly test polish (#1665)
Update Rust crate strum_macros to 0.27 (#1654)
Update Rust crate strum to 0.27 (#1653)
Update Rust crate rusqlite to 0.34 (#1652)
Better return semantics from should_send
Trying to simplify enqueue
Remove unnecessary argument
Remove unused code from `ImpactedBlocks` (#1668)
Log when pantry/volume layers activate things (#1670)
Fix unused import (#1669)
Use `RangeSet` instead of tracking every complete job (#1663)
Update Rust crate dropshot to 0.16.0 (#1645)
Simplify pending work tracking (#1662)
Remove rusqlite dependency from crucible-common (#1659)
Update Rust crate reedline to 0.38.0 (#1651)
Update Rust crate proptest to 1.6.0 (#1648)
Update Rust crate bytes to v1.10.0 (#1644)
Update Rust crate tracing-subscriber to 0.3.19 (#1643)
Update Rust crate tracing to v0.1.41 (#1642)
Update Rust crate toml to v0.8.20 (#1641)
Update Rust crate thiserror to v1.0.69 (#1640)
Update Rust crate serde_json to v1.0.139 (#1639)
Update Rust crate serde to v1.0.218 (#1638)
Update Rust crate reqwest to v0.12.12 (#1637)
Update Rust crate libc to v0.2.170 (#1636)
Update Rust crate indicatif to 0.17.11 (#1635)
Update Rust crate httptest to 0.16.3 (#1634)
Update Rust crate clap to v4.5.31 (#1632)
Update Rust crate chrono to v0.4.39 (#1631)
Update Rust crate byte-unit to 5.1.6 (#1630)
Update Rust crate async-trait to 0.1.86 (#1629)
Update Rust crate csv to 1.3.1 (#1633)
Update Rust crate anyhow to v1.0.96 (#1628)
Fix a test flake (#1626)
Bitpack per-block slot data in memory (#1625)
Use `div_ceil` instead of `(x + 7) / 8` (#1624)
Add repair server dynamometer (#1618)

Propolis changes are:
Paper over minor() differences
Update crucible (#886)
Test oximeter metrics end to end (#855)
server: improve behavior of starting VMs that are waiting for Crucible activation (#873)
server: extend API request queue's memory of prior requests (#880)
Use production propolis-server flags in PHD CI builds (#878)

Added a new field to the Board struct that propolis requires.
leftwo added a commit that referenced this pull request Mar 27, 2025
Crucible changes are:
Make `ImpactedBlocks` an absolute range (#1685)
update omicron (#1687)
Update Rust crate chrono to v0.4.40 (#1682)
Update Rust crate tempfile to v3.19.0 (#1676)
Log loops during crutest replay (#1674)
Refactor live-repair start and send (#1673)
Update Rust crate libc to v0.2.171 (#1684)
Update Rust crate clap to v4.5.32 (#1683)
Update Rust crate async-trait to 0.1.88 (#1680)
Update Rust crate bytes to v1.10.1 (#1681)
Update Rust crate anyhow to v1.0.97 (#1679)
Update Rust crate http to v1 (#1678)
Update Rust crate tokio-util to v0.7.14 (#1675)
Update Rust crate thiserror to v2 (#1657)
Update Rust crate omicron-zone-package to 0.12.0 (#1647) Update Rust
crate opentelemetry to 0.28.0 (#1543)
Update Rust crate itertools to 0.14.0 (#1646)
Update Rust crate clearscreen to v4 (#1656)
nightly test polish (#1665)
Update Rust crate strum_macros to 0.27 (#1654)
Update Rust crate strum to 0.27 (#1653)
Update Rust crate rusqlite to 0.34 (#1652)
Better return semantics from should_send
Trying to simplify enqueue
Remove unnecessary argument
Remove unused code from `ImpactedBlocks` (#1668)
Log when pantry/volume layers activate things (#1670) Fix unused import
(#1669)
Use `RangeSet` instead of tracking every complete job (#1663) Update
Rust crate dropshot to 0.16.0 (#1645)
Simplify pending work tracking (#1662)
Remove rusqlite dependency from crucible-common (#1659) Update Rust
crate reedline to 0.38.0 (#1651)
Update Rust crate proptest to 1.6.0 (#1648)
Update Rust crate bytes to v1.10.0 (#1644)
Update Rust crate tracing-subscriber to 0.3.19 (#1643) Update Rust crate
tracing to v0.1.41 (#1642)
Update Rust crate toml to v0.8.20 (#1641)
Update Rust crate thiserror to v1.0.69 (#1640)
Update Rust crate serde_json to v1.0.139 (#1639)
Update Rust crate serde to v1.0.218 (#1638)
Update Rust crate reqwest to v0.12.12 (#1637)
Update Rust crate libc to v0.2.170 (#1636)
Update Rust crate indicatif to 0.17.11 (#1635)
Update Rust crate httptest to 0.16.3 (#1634)
Update Rust crate clap to v4.5.31 (#1632)
Update Rust crate chrono to v0.4.39 (#1631)
Update Rust crate byte-unit to 5.1.6 (#1630)
Update Rust crate async-trait to 0.1.86 (#1629)
Update Rust crate csv to 1.3.1 (#1633)
Update Rust crate anyhow to v1.0.96 (#1628)
Fix a test flake (#1626)
Bitpack per-block slot data in memory (#1625)
Use `div_ceil` instead of `(x + 7) / 8` (#1624)
Add repair server dynamometer (#1618)

Propolis changes are:
Paper over minor() differences
Update crucible (#886)
Test oximeter metrics end to end (#855)
server: improve behavior of starting VMs that are waiting for Crucible
activation (#873) server: extend API request queue's memory of prior
requests (#880) Use production propolis-server flags in PHD CI builds
(#878)

---------

Co-authored-by: Alan Hanson <alan@oxide.computer>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants