Skip to content

Conversation

@gurasinghMS
Copy link
Contributor

@gurasinghMS gurasinghMS commented Jan 16, 2026

This PR fixes a couple of problems with the driver.

1. Namespace persisting and being reused after namespace removal from the controller

Figjamexport2.
It addresses the situation where a namespace is removed from the controller (and therefore from the disk exposed via VTL0), but the driver continues to hold a reference to that namespace. In this state, the driver would return a stale or invalid Namespace object—one that can no longer perform valid I/O.

The approach in this PR is to avoid keeping a strong Arc in the driver. Instead, the driver stores only a Weak for any namespace it hands out. When the disk is dropped, the strong reference disappears, causing the driver’s weak reference to naturally invalidate via normal Arc/Weak semantics. This eliminates the dangling reference issue without requiring additional bookkeeping.

As discussed during office hours, the window for failure with this approach is expected to be extremely narrow, as the driver relies on strict single-ownership of the Namespace object. To further enforce this single-ownership model, this PR introduces a non-cloneable wrapper around Arc. This wrapper exposes the same interface as Namespace and is consumed by the disk, but it prevents cloning of the underlying Arc anywhere along the creation path. This ensures the driver remains the sole authority over Namespace ownership and lifecycle.

2. Namespace being reused after disk removal from VTL2 settings.

image

This is the issue being described in #2656 . This PR brings back the check to verify single-ownership of the Namespace objects. This time around it is possible to verify using the above described Arc/Weak semantics.

@gurasinghMS gurasinghMS changed the title [wip] handle dangling references when controller namespaces are removed (Part 2) [wip] handle dangling references when controller namespaces are removed Jan 16, 2026
@alandau
Copy link
Contributor

alandau commented Jan 16, 2026

In general, Arc (shared ownership) is more brittle than unique ownership, and here we're going a step further of sometimes storing an Arc and sometimes a Weak, which is even more brittle. That said, this looks like an elegant way out, if you can prove (via tests and/or code review) that it covers all situations.

@gurasinghMS
Copy link
Contributor Author

In general, Arc (shared ownership) is more brittle than unique ownership, and here we're going a step further of sometimes storing an Arc and sometimes a Weak, which is even more brittle. That said, this looks like an elegant way out, if you can prove (via tests and/or code review) that it covers all situations.

Like you, I also have some concerns around the brittleness of this approach. Although in the classical sense of memory safety and availability I think rust should have us covered!

return Ok(handle.namespace.clone());
}
pub async fn namespace(&mut self, nsid: u32) -> Result<NamespaceHandle, NamespaceError> {
if let Some(namespace) = self.namespaces.get_mut(&nsid) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there is a more idiomatic way to do this ..... it looks a little off to me.

@gurasinghMS gurasinghMS changed the title [wip] handle dangling references when controller namespaces are removed [nvme_driver] update driver to use Arc / Weak semantics to enforce single-owner semantics for namespaces Jan 20, 2026
@gurasinghMS gurasinghMS changed the title [nvme_driver] update driver to use Arc / Weak semantics to enforce single-owner semantics for namespaces [nvme_driver] update driver to use Arc / Weak semantics to enforce unique ownership for namespaces Jan 20, 2026
@gurasinghMS gurasinghMS changed the title [nvme_driver] update driver to use Arc / Weak semantics to enforce unique ownership for namespaces [nvme_driver] update driver to use Arc / Weak semantics to enforce unique ownership for namespaces by nsid Jan 20, 2026
@gurasinghMS gurasinghMS marked this pull request as ready for review January 20, 2026 23:26
@gurasinghMS gurasinghMS requested review from a team as code owners January 20, 2026 23:26
Copilot AI review requested due to automatic review settings January 20, 2026 23:26
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the NVMe driver to use Arc/Weak semantics to address two critical issues: (1) namespace persistence after removal from the controller, and (2) namespace reuse after disk removal from VTL2 settings. The solution introduces a non-cloneable NamespaceHandle wrapper around Arc<Namespace> to enforce single-ownership, while the driver stores only Weak references internally.

Changes:

  • Introduced NamespaceHandle wrapper to prevent cloning and enforce single ownership of namespaces
  • Implemented WeakOrStrong enum to manage namespace lifecycle during normal operation and save/restore
  • Updated all callsites to use NamespaceHandle instead of Arc<Namespace>

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
vm/devices/storage/disk_nvme/nvme_driver/src/namespace.rs Added NamespaceHandle wrapper with delegated methods; made check_active() public; renamed error variant
vm/devices/storage/disk_nvme/nvme_driver/src/driver.rs Implemented WeakOrStrong enum; updated namespace() method with duplicate detection; modified save() logic
vm/devices/storage/disk_nvme/nvme_driver/src/lib.rs Updated exports to use NamespaceHandle instead of Namespace
vm/devices/storage/disk_nvme/src/lib.rs Updated NvmeDisk to use NamespaceHandle
openhcl/underhill_core/src/nvme_manager/*.rs Updated all interfaces to use NamespaceHandle

Comment on lines 677 to 680
let is_weak = namespace.is_weak(); // This value will change after invokeing get_arc().
if let Some(ns) = namespace.get_arc()
&& ns.check_active().is_ok()
&& is_weak
Copy link

Copilot AI Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition is_weak in the save logic appears incorrect. When saving right after a restore (before any namespace() calls), all namespaces will be Strong, causing is_weak to be false for all of them. This means none of the restored namespaces will be saved, which breaks save-restore cycles.

The logic should save any namespace that can be successfully obtained via get_arc() and passes the check_active() test, regardless of whether it started as Weak or Strong. Consider removing the && is_weak condition or inverting it to && !is_weak if the intent is to only save Strong references (pre-handout namespaces).

Suggested change
let is_weak = namespace.is_weak(); // This value will change after invokeing get_arc().
if let Some(ns) = namespace.get_arc()
&& ns.check_active().is_ok()
&& is_weak
if let Some(ns) = namespace.get_arc()
&& ns.check_active().is_ok()

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

@gurasinghMS gurasinghMS Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point so I am keeping this comment open. The idea behind NOT saving strong ref namespaces is that they were never requested for by a driver and hence there is no reasonable expectation that they should be available post restore.
Current code cuts down the size of saved data. As copilot pointed out, repeated back-to-back servicing where the namespace has not yet been requested could cause missed saves. However, I don't think we have a reasonable expectation for hammering a system with servicing operations .

@gurasinghMS gurasinghMS force-pushed the handle-controller-namespaces-leaving branch from e903762 to 0af769f Compare January 21, 2026 00:17
@gurasinghMS gurasinghMS changed the title [nvme_driver] update driver to use Arc / Weak semantics to enforce unique ownership for namespaces by nsid nvme_driver: update driver to use Arc / Weak semantics to enforce unique ownership for namespaces by nsid Jan 21, 2026
@github-actions
Copy link

@github-actions
Copy link

@github-actions
Copy link

@github-actions
Copy link

@github-actions
Copy link

Copy link
Contributor

@alandau alandau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yay! All tests passed! Perhaps do a dummy change (or trigger another test run without a change if there's a way to do it) to make sure it's not flaky and won't hunt us down the road.

@mattkur
Copy link
Contributor

mattkur commented Jan 25, 2026

Yay! All tests passed! Perhaps do a dummy change (or trigger another test run without a change if there's a way to do it) to make sure it's not flaky and won't hunt us down the road.

I triggered a new test run by merging in main. Once you're comfy with this, @alandau, I'll approve and this can go in. Thanks for taking on the review of this change :)

@alandau
Copy link
Contributor

alandau commented Jan 26, 2026

@mattkur, it was consistently failing on every run, and now it passed on 2 runs in a row, and the branch is out of date again, so another run will be needed before merging. I think this is good to go, unless @gurasinghMS has reservations.

@alandau
Copy link
Contributor

alandau commented Jan 26, 2026

openvmm_openhcl_linux_x64_many_nvme_devices_servicing_very_heavy failed - not sure if it's related. The hot-add-remove test passed.

@github-actions
Copy link

@mattkur mattkur merged commit 34cc2bb into microsoft:main Jan 26, 2026
80 of 82 checks passed
gurasinghMS added a commit to gurasinghMS/openvmm that referenced this pull request Jan 26, 2026
… unique ownership for namespaces by nsid (microsoft#2657)

This PR fixes a couple of problems with the driver.
## 1. `Namespace` persisting and being reused after namespace removal
from the controller

![Figjamexport2](https://github.com/user-attachments/assets/3a098676-9c4d-428d-981c-6880ddaa7262).
It addresses the situation where a namespace is removed from the
controller (and therefore from the disk exposed via VTL0), but the
driver continues to hold a reference to that namespace. In this state,
the driver would return a stale or invalid Namespace object—one that can
no longer perform valid I/O.

The approach in this PR is to avoid keeping a strong Arc<Namespace> in
the driver. Instead, the driver stores only a Weak<Namespace> for any
namespace it hands out. When the disk is dropped, the strong reference
disappears, causing the driver’s weak reference to naturally invalidate
via normal Arc/Weak semantics. This eliminates the dangling reference
issue without requiring additional bookkeeping.

As discussed during office hours, the window for failure with this
approach is expected to be extremely narrow, as the driver relies on
strict single-ownership of the Namespace object. To further enforce this
single-ownership model, this PR introduces a non-cloneable wrapper
around Arc<Namespace>. This wrapper exposes the same interface as
Namespace and is consumed by the disk, but it prevents cloning of the
underlying Arc<Namespace> anywhere along the creation path. This ensures
the driver remains the sole authority over Namespace ownership and
lifecycle.

## 2. `Namespace` being reused after disk removal from VTL2 settings.
<img width="7096" height="2979" alt="image"
src="https://github.com/user-attachments/assets/f519db98-447c-4a95-a7a2-c70536e07ae0"
/>

This is the issue being described in microsoft#2656 . This PR brings back the
check to verify single-ownership of the `Namespace` objects. This time
around it is possible to verify using the above described `Arc`/`Weak`
semantics.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Matt LaFayette (Kurjanowicz) <mattkur@microsoft.com>
Co-authored-by: Alex Landau <alexlandau@microsoft.com>
(cherry picked from commit 34cc2bb)
mattkur added a commit that referenced this pull request Jan 28, 2026
… unique ownership for namespaces by nsid (#2657) (#2697)

Clean cherry pick of PR #2657

This PR fixes a couple of problems with the driver. 
## 1. `Namespace` persisting and being reused after namespace removal
from the controller

![Figjamexport2](https://github.com/user-attachments/assets/3a098676-9c4d-428d-981c-6880ddaa7262).
It addresses the situation where a namespace is removed from the
controller (and therefore from the disk exposed via VTL0), but the
driver continues to hold a reference to that namespace. In this state,
the driver would return a stale or invalid Namespace object—one that can
no longer perform valid I/O.

The approach in this PR is to avoid keeping a strong Arc<Namespace> in
the driver. Instead, the driver stores only a Weak<Namespace> for any
namespace it hands out. When the disk is dropped, the strong reference
disappears, causing the driver’s weak reference to naturally invalidate
via normal Arc/Weak semantics. This eliminates the dangling reference
issue without requiring additional bookkeeping.

As discussed during office hours, the window for failure with this
approach is expected to be extremely narrow, as the driver relies on
strict single-ownership of the Namespace object. To further enforce this
single-ownership model, this PR introduces a non-cloneable wrapper
around Arc<Namespace>. This wrapper exposes the same interface as
Namespace and is consumed by the disk, but it prevents cloning of the
underlying Arc<Namespace> anywhere along the creation path. This ensures
the driver remains the sole authority over Namespace ownership and
lifecycle.

## 2. `Namespace` being reused after disk removal from VTL2 settings.
<img width="7096" height="2979" alt="image"
src="https://github.com/user-attachments/assets/f519db98-447c-4a95-a7a2-c70536e07ae0"
/>

This is the issue being described in #2656 . This PR brings back the
check to verify single-ownership of the `Namespace` objects. This time
around it is possible to verify using the above described `Arc`/`Weak`
semantics.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Matt LaFayette (Kurjanowicz) <mattkur@microsoft.com>
Co-authored-by: Alex Landau <alexlandau@microsoft.com>
@mattkur
Copy link
Contributor

mattkur commented Jan 28, 2026

Backported to release/1.7.2511 in #2697

@mattkur mattkur added backported_1.7.2511 PR that has been backported to release/1.7.2511 and removed backport_1.7.2511 labels Jan 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backported_1.7.2511 PR that has been backported to release/1.7.2511

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants