-
Notifications
You must be signed in to change notification settings - Fork 45
[sled agent] Compare dladm results with datalink mgmt cache file #1685
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like more detailed error messages.
<begin soapbox>
For the Unexpected object in dladm cache
, this does seem a little specific, and by adding this in, we are hard coding in the sled agent both the path to the dladm cache file and the format of that file. Since this is all our stuff, we can probably get away with such shenanigans, but their may come a time when this comes back to haunt us.
<end soapbox>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks. I think the error message could be more pedantic about exactly what the developer needs to do to work around this, but otherwise LGTM.
Also, it occurs to me that this really should resolve #1679. We delete all "visible" OPTE ports when the sled agent starts. Anything that's "defunct" will be deleted by this change. In that case, I don't see how a datalink with the relevant naming scheme could already exist when we ask to create it. Does that sound right to you? |
I think it should resolve the symptoms of #1679 , but I do not think it resolves the cause. I figured we'd want to end up in a state where devices "either are visible in dladm, and can be tracked by the sled agent, or are fully deleted" -- this "half-deleted state" seems like the crux of the issue to me. Furthermore, if we move to a world where Sled Agent does not try to delete VNICs on reboot, but tries to re-construct a meaningful view of the world, I don't really know how to handle these types of defunct devices. TL;DR: Symptoms should go away short-term, but I want us to fix the underlying datalink issue before marking #1679 resolved, unless we have appropriate tracking elsewhere. |
@smklein @leftwo Robert brought up the same stability concerns that Alan mentioned. The basic problem is that this PR relies on private and unstable interfaces and formats. I think a reasonable alternative approach is to just skip any name that results in I this alternative is acceptable we have two options:
Of course we need to root-cause why these links exist at all. But in the meantime? |
Ack; if we're okay punting this to later on in runtime, I can close this PR. The catch-EEXIST-and-retry seems like a viable workaround too. |
Crucible changes are: Make `ImpactedBlocks` an absolute range (#1685) update omicron (#1687) Update Rust crate chrono to v0.4.40 (#1682) Update Rust crate tempfile to v3.19.0 (#1676) Log loops during crutest replay (#1674) Refactor live-repair start and send (#1673) Update Rust crate libc to v0.2.171 (#1684) Update Rust crate clap to v4.5.32 (#1683) Update Rust crate async-trait to 0.1.88 (#1680) Update Rust crate bytes to v1.10.1 (#1681) Update Rust crate anyhow to v1.0.97 (#1679) Update Rust crate http to v1 (#1678) Update Rust crate tokio-util to v0.7.14 (#1675) Update Rust crate thiserror to v2 (#1657) Update Rust crate omicron-zone-package to 0.12.0 (#1647) Update Rust crate opentelemetry to 0.28.0 (#1543) Update Rust crate itertools to 0.14.0 (#1646) Update Rust crate clearscreen to v4 (#1656) nightly test polish (#1665) Update Rust crate strum_macros to 0.27 (#1654) Update Rust crate strum to 0.27 (#1653) Update Rust crate rusqlite to 0.34 (#1652) Better return semantics from should_send Trying to simplify enqueue Remove unnecessary argument Remove unused code from `ImpactedBlocks` (#1668) Log when pantry/volume layers activate things (#1670) Fix unused import (#1669) Use `RangeSet` instead of tracking every complete job (#1663) Update Rust crate dropshot to 0.16.0 (#1645) Simplify pending work tracking (#1662) Remove rusqlite dependency from crucible-common (#1659) Update Rust crate reedline to 0.38.0 (#1651) Update Rust crate proptest to 1.6.0 (#1648) Update Rust crate bytes to v1.10.0 (#1644) Update Rust crate tracing-subscriber to 0.3.19 (#1643) Update Rust crate tracing to v0.1.41 (#1642) Update Rust crate toml to v0.8.20 (#1641) Update Rust crate thiserror to v1.0.69 (#1640) Update Rust crate serde_json to v1.0.139 (#1639) Update Rust crate serde to v1.0.218 (#1638) Update Rust crate reqwest to v0.12.12 (#1637) Update Rust crate libc to v0.2.170 (#1636) Update Rust crate indicatif to 0.17.11 (#1635) Update Rust crate httptest to 0.16.3 (#1634) Update Rust crate clap to v4.5.31 (#1632) Update Rust crate chrono to v0.4.39 (#1631) Update Rust crate byte-unit to 5.1.6 (#1630) Update Rust crate async-trait to 0.1.86 (#1629) Update Rust crate csv to 1.3.1 (#1633) Update Rust crate anyhow to v1.0.96 (#1628) Fix a test flake (#1626) Bitpack per-block slot data in memory (#1625) Use `div_ceil` instead of `(x + 7) / 8` (#1624) Add repair server dynamometer (#1618) Propolis changes are: Paper over minor() differences Update crucible (#886) Test oximeter metrics end to end (#855) server: improve behavior of starting VMs that are waiting for Crucible activation (#873) server: extend API request queue's memory of prior requests (#880) Use production propolis-server flags in PHD CI builds (#878) Added a new field to the Board struct that propolis requires.
Crucible changes are: Make `ImpactedBlocks` an absolute range (#1685) update omicron (#1687) Update Rust crate chrono to v0.4.40 (#1682) Update Rust crate tempfile to v3.19.0 (#1676) Log loops during crutest replay (#1674) Refactor live-repair start and send (#1673) Update Rust crate libc to v0.2.171 (#1684) Update Rust crate clap to v4.5.32 (#1683) Update Rust crate async-trait to 0.1.88 (#1680) Update Rust crate bytes to v1.10.1 (#1681) Update Rust crate anyhow to v1.0.97 (#1679) Update Rust crate http to v1 (#1678) Update Rust crate tokio-util to v0.7.14 (#1675) Update Rust crate thiserror to v2 (#1657) Update Rust crate omicron-zone-package to 0.12.0 (#1647) Update Rust crate opentelemetry to 0.28.0 (#1543) Update Rust crate itertools to 0.14.0 (#1646) Update Rust crate clearscreen to v4 (#1656) nightly test polish (#1665) Update Rust crate strum_macros to 0.27 (#1654) Update Rust crate strum to 0.27 (#1653) Update Rust crate rusqlite to 0.34 (#1652) Better return semantics from should_send Trying to simplify enqueue Remove unnecessary argument Remove unused code from `ImpactedBlocks` (#1668) Log when pantry/volume layers activate things (#1670) Fix unused import (#1669) Use `RangeSet` instead of tracking every complete job (#1663) Update Rust crate dropshot to 0.16.0 (#1645) Simplify pending work tracking (#1662) Remove rusqlite dependency from crucible-common (#1659) Update Rust crate reedline to 0.38.0 (#1651) Update Rust crate proptest to 1.6.0 (#1648) Update Rust crate bytes to v1.10.0 (#1644) Update Rust crate tracing-subscriber to 0.3.19 (#1643) Update Rust crate tracing to v0.1.41 (#1642) Update Rust crate toml to v0.8.20 (#1641) Update Rust crate thiserror to v1.0.69 (#1640) Update Rust crate serde_json to v1.0.139 (#1639) Update Rust crate serde to v1.0.218 (#1638) Update Rust crate reqwest to v0.12.12 (#1637) Update Rust crate libc to v0.2.170 (#1636) Update Rust crate indicatif to 0.17.11 (#1635) Update Rust crate httptest to 0.16.3 (#1634) Update Rust crate clap to v4.5.31 (#1632) Update Rust crate chrono to v0.4.39 (#1631) Update Rust crate byte-unit to 5.1.6 (#1630) Update Rust crate async-trait to 0.1.86 (#1629) Update Rust crate csv to 1.3.1 (#1633) Update Rust crate anyhow to v1.0.96 (#1628) Fix a test flake (#1626) Bitpack per-block slot data in memory (#1625) Use `div_ceil` instead of `(x + 7) / 8` (#1624) Add repair server dynamometer (#1618) Propolis changes are: Paper over minor() differences Update crucible (#886) Test oximeter metrics end to end (#855) server: improve behavior of starting VMs that are waiting for Crucible activation (#873) server: extend API request queue's memory of prior requests (#880) Use production propolis-server flags in PHD CI builds (#878) --------- Co-authored-by: Alan Hanson <alan@oxide.computer>
This is highly related to #1679 , but doesn't fully resolve the problem.
It's also related to #1682 , which pokes slightly more at the underlying problem.
Instead, this PR focuses on making that problem much more visible.
Under some circumstances, it's possible for datalink devices to appear in the datalink management cache file, but not in the output of dladm. Regardless of how this happens, it results in a scenario where:
This PR attempts to compare the datalink cache file with sled agent's expected state, monitoring for devices that exist, even if they should have been deleted.
If this situation is encountered, a fairly verbose error is propagated upwards, with instructions on how-to-fix. If we're confident that we can delete datalinks without resulting in this "zombie" state, we can remove this workaround.