Skip to content

[sled agent] Compare dladm results with datalink mgmt cache file #1685

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

smklein
Copy link
Collaborator

@smklein smklein commented Sep 7, 2022

This is highly related to #1679 , but doesn't fully resolve the problem.
It's also related to #1682 , which pokes slightly more at the underlying problem.

Instead, this PR focuses on making that problem much more visible.

Under some circumstances, it's possible for datalink devices to appear in the datalink management cache file, but not in the output of dladm. Regardless of how this happens, it results in a scenario where:

This PR attempts to compare the datalink cache file with sled agent's expected state, monitoring for devices that exist, even if they should have been deleted.

If this situation is encountered, a fairly verbose error is propagated upwards, with instructions on how-to-fix. If we're confident that we can delete datalinks without resulting in this "zombie" state, we can remove this workaround.

@smklein smklein requested a review from bnaecker September 7, 2022 22:10
@smklein smklein added networking Related to the networking. Sled Agent Related to the Per-Sled Configuration and Management labels Sep 7, 2022
@smklein smklein requested a review from leftwo September 7, 2022 22:11
Copy link
Contributor

@leftwo leftwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like more detailed error messages.

<begin soapbox>
For the Unexpected object in dladm cache, this does seem a little specific, and by adding this in, we are hard coding in the sled agent both the path to the dladm cache file and the format of that file. Since this is all our stuff, we can probably get away with such shenanigans, but their may come a time when this comes back to haunt us.
<end soapbox>

Copy link
Collaborator

@bnaecker bnaecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks. I think the error message could be more pedantic about exactly what the developer needs to do to work around this, but otherwise LGTM.

@bnaecker
Copy link
Collaborator

bnaecker commented Sep 8, 2022

Also, it occurs to me that this really should resolve #1679. We delete all "visible" OPTE ports when the sled agent starts. Anything that's "defunct" will be deleted by this change. In that case, I don't see how a datalink with the relevant naming scheme could already exist when we ask to create it. Does that sound right to you?

@smklein
Copy link
Collaborator Author

smklein commented Sep 8, 2022

Also, it occurs to me that this really should resolve #1679. We delete all "visible" OPTE ports when the sled agent starts. Anything that's "defunct" will be deleted by this change. In that case, I don't see how a datalink with the relevant naming scheme could already exist when we ask to create it. Does that sound right to you?

I think it should resolve the symptoms of #1679 , but I do not think it resolves the cause.

I figured we'd want to end up in a state where devices "either are visible in dladm, and can be tracked by the sled agent, or are fully deleted" -- this "half-deleted state" seems like the crux of the issue to me. Furthermore, if we move to a world where Sled Agent does not try to delete VNICs on reboot, but tries to re-construct a meaningful view of the world, I don't really know how to handle these types of defunct devices.

TL;DR: Symptoms should go away short-term, but I want us to fix the underlying datalink issue before marking #1679 resolved, unless we have appropriate tracking elsewhere.

@smklein smklein enabled auto-merge (squash) September 8, 2022 17:36
@bnaecker
Copy link
Collaborator

bnaecker commented Sep 8, 2022

@smklein @leftwo Robert brought up the same stability concerns that Alan mentioned.

The basic problem is that this PR relies on private and unstable interfaces and formats. I think a reasonable alternative approach is to just skip any name that results in EEXIST from the attempt to create the link. The sled-agent creates names with an incrementing counter, such as opte0, opte1, etc. It currently deletes all interfaces that appear in dladm output now, which is anything that's not defunct. If the sled-agent tries to create opteN, and gets EEXIST, I think we can be pretty confident we're in this situation of a defunct link. We can log a warning/error message for visibility, but I see no reason the sled-agent cannot just then use the name opte{N+1}. There's an issue of the number of times we allow EEXIST before "truly" failing, but I think we can find a reasonable place to draw the line.

I this alternative is acceptable we have two options:

  1. Integrate this, and revert it as soon as possible.
  2. Don't integrate it, and instead take an alternative approach instead.

Of course we need to root-cause why these links exist at all. But in the meantime?

@smklein smklein disabled auto-merge September 8, 2022 17:40
@smklein
Copy link
Collaborator Author

smklein commented Sep 8, 2022

@smklein @leftwo Robert brought up the same stability concerns that Alan mentioned.

The basic problem is that this PR relies on private and unstable interfaces and formats. I think a reasonable alternative approach is to just skip any name that results in EEXIST from the attempt to create the link. The sled-agent creates names with an incrementing counter, such as opte0, opte1, etc. It currently deletes all interfaces that appear in dladm output now, which is anything that's not defunct. If the sled-agent tries to create opteN, and gets EEXIST, I think we can be pretty confident we're in this situation of a defunct link. We can log a warning/error message for visibility, but I see no reason the sled-agent cannot just then use the name opte{N+1}. There's an issue of the number of times we allow EEXIST before "truly" failing, but I think we can find a reasonable place to draw the line.

I this alternative is acceptable we have two options:

  1. Integrate this, and revert it as soon as possible.
  2. Don't integrate it, and instead take an alternative approach instead.

Of course we need to root-cause why these links exist at all. But in the meantime?

Ack; if we're okay punting this to later on in runtime, I can close this PR. The catch-EEXIST-and-retry seems like a viable workaround too.

@smklein smklein closed this Sep 8, 2022
@smklein smklein deleted the debug-dladm-issues branch September 8, 2022 17:45
leftwo pushed a commit that referenced this pull request Mar 26, 2025
Crucible changes are:
Make `ImpactedBlocks` an absolute range (#1685)
update omicron (#1687)
Update Rust crate chrono to v0.4.40 (#1682)
Update Rust crate tempfile to v3.19.0 (#1676)
Log loops during crutest replay (#1674)
Refactor live-repair start and send (#1673)
Update Rust crate libc to v0.2.171 (#1684)
Update Rust crate clap to v4.5.32 (#1683)
Update Rust crate async-trait to 0.1.88 (#1680)
Update Rust crate bytes to v1.10.1 (#1681)
Update Rust crate anyhow to v1.0.97 (#1679)
Update Rust crate http to v1 (#1678)
Update Rust crate tokio-util to v0.7.14 (#1675)
Update Rust crate thiserror to v2 (#1657)
Update Rust crate omicron-zone-package to 0.12.0 (#1647)
Update Rust crate opentelemetry to 0.28.0 (#1543)
Update Rust crate itertools to 0.14.0 (#1646)
Update Rust crate clearscreen to v4 (#1656)
nightly test polish (#1665)
Update Rust crate strum_macros to 0.27 (#1654)
Update Rust crate strum to 0.27 (#1653)
Update Rust crate rusqlite to 0.34 (#1652)
Better return semantics from should_send
Trying to simplify enqueue
Remove unnecessary argument
Remove unused code from `ImpactedBlocks` (#1668)
Log when pantry/volume layers activate things (#1670)
Fix unused import (#1669)
Use `RangeSet` instead of tracking every complete job (#1663)
Update Rust crate dropshot to 0.16.0 (#1645)
Simplify pending work tracking (#1662)
Remove rusqlite dependency from crucible-common (#1659)
Update Rust crate reedline to 0.38.0 (#1651)
Update Rust crate proptest to 1.6.0 (#1648)
Update Rust crate bytes to v1.10.0 (#1644)
Update Rust crate tracing-subscriber to 0.3.19 (#1643)
Update Rust crate tracing to v0.1.41 (#1642)
Update Rust crate toml to v0.8.20 (#1641)
Update Rust crate thiserror to v1.0.69 (#1640)
Update Rust crate serde_json to v1.0.139 (#1639)
Update Rust crate serde to v1.0.218 (#1638)
Update Rust crate reqwest to v0.12.12 (#1637)
Update Rust crate libc to v0.2.170 (#1636)
Update Rust crate indicatif to 0.17.11 (#1635)
Update Rust crate httptest to 0.16.3 (#1634)
Update Rust crate clap to v4.5.31 (#1632)
Update Rust crate chrono to v0.4.39 (#1631)
Update Rust crate byte-unit to 5.1.6 (#1630)
Update Rust crate async-trait to 0.1.86 (#1629)
Update Rust crate csv to 1.3.1 (#1633)
Update Rust crate anyhow to v1.0.96 (#1628)
Fix a test flake (#1626)
Bitpack per-block slot data in memory (#1625)
Use `div_ceil` instead of `(x + 7) / 8` (#1624)
Add repair server dynamometer (#1618)

Propolis changes are:
Paper over minor() differences
Update crucible (#886)
Test oximeter metrics end to end (#855)
server: improve behavior of starting VMs that are waiting for Crucible activation (#873)
server: extend API request queue's memory of prior requests (#880)
Use production propolis-server flags in PHD CI builds (#878)

Added a new field to the Board struct that propolis requires.
leftwo added a commit that referenced this pull request Mar 27, 2025
Crucible changes are:
Make `ImpactedBlocks` an absolute range (#1685)
update omicron (#1687)
Update Rust crate chrono to v0.4.40 (#1682)
Update Rust crate tempfile to v3.19.0 (#1676)
Log loops during crutest replay (#1674)
Refactor live-repair start and send (#1673)
Update Rust crate libc to v0.2.171 (#1684)
Update Rust crate clap to v4.5.32 (#1683)
Update Rust crate async-trait to 0.1.88 (#1680)
Update Rust crate bytes to v1.10.1 (#1681)
Update Rust crate anyhow to v1.0.97 (#1679)
Update Rust crate http to v1 (#1678)
Update Rust crate tokio-util to v0.7.14 (#1675)
Update Rust crate thiserror to v2 (#1657)
Update Rust crate omicron-zone-package to 0.12.0 (#1647) Update Rust
crate opentelemetry to 0.28.0 (#1543)
Update Rust crate itertools to 0.14.0 (#1646)
Update Rust crate clearscreen to v4 (#1656)
nightly test polish (#1665)
Update Rust crate strum_macros to 0.27 (#1654)
Update Rust crate strum to 0.27 (#1653)
Update Rust crate rusqlite to 0.34 (#1652)
Better return semantics from should_send
Trying to simplify enqueue
Remove unnecessary argument
Remove unused code from `ImpactedBlocks` (#1668)
Log when pantry/volume layers activate things (#1670) Fix unused import
(#1669)
Use `RangeSet` instead of tracking every complete job (#1663) Update
Rust crate dropshot to 0.16.0 (#1645)
Simplify pending work tracking (#1662)
Remove rusqlite dependency from crucible-common (#1659) Update Rust
crate reedline to 0.38.0 (#1651)
Update Rust crate proptest to 1.6.0 (#1648)
Update Rust crate bytes to v1.10.0 (#1644)
Update Rust crate tracing-subscriber to 0.3.19 (#1643) Update Rust crate
tracing to v0.1.41 (#1642)
Update Rust crate toml to v0.8.20 (#1641)
Update Rust crate thiserror to v1.0.69 (#1640)
Update Rust crate serde_json to v1.0.139 (#1639)
Update Rust crate serde to v1.0.218 (#1638)
Update Rust crate reqwest to v0.12.12 (#1637)
Update Rust crate libc to v0.2.170 (#1636)
Update Rust crate indicatif to 0.17.11 (#1635)
Update Rust crate httptest to 0.16.3 (#1634)
Update Rust crate clap to v4.5.31 (#1632)
Update Rust crate chrono to v0.4.39 (#1631)
Update Rust crate byte-unit to 5.1.6 (#1630)
Update Rust crate async-trait to 0.1.86 (#1629)
Update Rust crate csv to 1.3.1 (#1633)
Update Rust crate anyhow to v1.0.96 (#1628)
Fix a test flake (#1626)
Bitpack per-block slot data in memory (#1625)
Use `div_ceil` instead of `(x + 7) / 8` (#1624)
Add repair server dynamometer (#1618)

Propolis changes are:
Paper over minor() differences
Update crucible (#886)
Test oximeter metrics end to end (#855)
server: improve behavior of starting VMs that are waiting for Crucible
activation (#873) server: extend API request queue's memory of prior
requests (#880) Use production propolis-server flags in PHD CI builds
(#878)

---------

Co-authored-by: Alan Hanson <alan@oxide.computer>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
networking Related to the networking. Sled Agent Related to the Per-Sled Configuration and Management
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants