[PCIe] Community Guidelines and Roadmap #4894
base: poc/pcie
* **overheads:** supporting the full PCI specification might negatively impact the boot time and
memory overheads of Firecracker VMs.
* We can mitigate this by allowing for completely disabling PCIe support via VM configuration
Could this be done on the fly on a per-VM basis, or would it require using a specific build of Firecracker?
We were thinking of a build-time flag. That way, enabling PCIe is a conscious decision; once it is enabled, Firecracker will make use of it.
The idea is that it would be a per-VM, per-device configuration, without requiring a special build. On VMs not using any PCI devices, the PCI bus will not be emulated so the VM should not change compared to how it is today.
Ideally, we would like PCI to introduce no regressions so that we could drop MMIO entirely in the future, but we expect some tradeoffs between the performance and feature set of PCI and the lightweight nature of MMIO, so we may end up supporting both use cases.
if possible, merged in rust-vmm, unless explicit exemption is granted by the maintainers.
* Contributors should provide design documents in case of features spanning multiple PRs to receive
early guidance from maintainers.
* Contributors should not leave open PRs stale for more than two weeks.
Will the maintainers commit to not let the PRs go stale from their end?
Should've just read ahead 😄
sure
we do.
Maintainers will review new PRs to the feature branch within one week.
Of course, we are peers in this relationship. That is why we committed to answering within one week (half the time allowed to contributors): it is our duty to set the example and guarantee a nice developer experience.
Each milestone identifies a point in the project where a merge of the developed features in the main branch is possible.
In order to accept the merge:

* All Firecracker features and architectures are supported for PCIe (for example, Snapshot Resume, and ARM).
I assume that this comes with the caveat that consensus hasn't been reached for snapshot / resume?
Correct. Some exceptions can be made if a path forward is identified and agreed upon.
The way milestones are defined is to get the bulk of the PCI code into firecracker and possibly rust-vmm early on, so that VFIO and passthrough devices would be a small incremental and modular change on top of it (ideally just wiring a new device).
* All Firecracker features and architectures are supported for PCIe (for example, Snapshot Resume, and ARM).
* All functional and security tests should pass with the PCIe feature enabled on all supported devices.
* Open-source performance tests should not regress with the PCIe feature enabled compared to MMIO devices.
* Internal performance tests should not regress with the PCIe feature enabled.
Could this be more specific? Does this mean that performance of MMIO devices should not be affected if PCIe devices are also attached to the VM?
Both that virtio-pci devices need to have at least the same performance as virtio-mmio devices, and that the use of just mmio devices should not be impacted by the PCI changes.
Sounds good, could this be written explicitly in the doc?
PCIe Support in Firecracker Community Roadmap
This document describes the high-level changes required to support PCIe and device passthrough in Firecracker
and the main responsibilities of the maintainers and the community to achieve the success of the initiative.
This document will be discussed during the November 6, 2024 meeting.
I will upload this document as a PR to the poc/pcie
branch so that everybody will have the opportunity to leave comments along the way.
Motivation
Firecracker currently supports only MMIO devices.
By adding support for PCIe we would get the following benefits:
if we add support for multiple buses.
Challenges
Supporting PCIe in Firecracker and, in particular, device pass-through, introduces new challenges. Namely:
memory overheads of Firecracker VMs.
when more lightweight virtualization is preferred.
entire physical memory of the VM to allow for DMA from the device.
security posture of firecracker.
to be carefully evaluated.
therefore snapshot/resume will not be supported for active/online passed-through devices.
Contribution Guidelines
Before diving deeper into the required changes in Firecracker, it’s important to be clear on the
responsibility split between the community contributors and the maintainers.
As this is a community-driven initiative, it will be the responsibility of contributors to propose designs,
make changes, and work with the upstream rust-vmm community.
Maintainers of Firecracker will provide guidance, code reviews, project organization, facilitate rust-vmm
interactions, and automated testing of the new features.
Contributors
features/pcie
which maintainers will set up, with all the required CI artifacts and infrastructure.
merged into the feature branch.
For example, if we need to rework FC device management to support PCI, the development will need to be done in main,
and then merged to the PCIe feature branch.
if possible, merged in rust-vmm, unless explicit exemption is granted by the maintainers.
early guidance from maintainers.
Maintainers
(every 3 weeks or on-demand in case of dependencies).
poc/pcie.
The POC is just a scrappy implementation and will need to be rewritten from scratch to meet the quality
and security bars of Firecracker.
PCIe support (e.g. guest kernels)
Two approvals from maintainers are required to merge a PR.
Maintainers should provide the required approvals or guidance to unblock the PR within two weeks.
before every merge of the feature branch in main.
Any finding will be shared with the community to help address the issues.
Acceptance Criteria
A proposal of the different milestones of the project is defined in the following sections.
Each milestone identifies a point in the project where a merge of the developed features in the main branch is possible.
In order to accept the merge:
In case of regressions, details and reproducers will be shared with the community.
In case of blockers, details will be shared with the community.
Exceptions can be granted if there is a path forward towards mitigation (for example, in the case of VFIO support).
Milestones
This section describes a proposed high-level plan of action to be discussed with the community.
A more detailed plan will need to be provided by contributors before starting the implementation,
which maintainers will help refine.
0. Proof of Concept and Definition of Goals
It is important that both maintainers and the community build confidence with the changes
and verify that it’s possible to achieve the respective goals with this solution.
For this reason, the Firecracker team has built a public proof-of-concept with basic PCI passthrough and virtio-pci support:
poc/pcie.
The implementation of the POC is scrappy and would require a complete rewrite from scratch that meets
Firecracker quality and security bars, but it showcases the main features (and drawbacks) of
PCIe-passthrough and virtio-pci devices.
Before starting the actual implementation below, we need to be able to answer:
1. virtio-pci support
The first milestone will be the support of the virtio-pci transport layer for virtio.
This is not strictly required for PCIe device passthrough, but we believe it is the easiest way to get
the bulk of the PCI code merged into firecracker and rust-vmm, as there shouldn’t be any concerns from
the security and over-subscription point of view.
With this milestone, Firecracker customers will be able to configure any device to be attached on the
PCI bus instead of the MMIO bus through a per-device config.
If no device in the VM uses PCI, no PCI bus will be created and there will be no changes over the current state.
PCI support will be a first-class citizen of Firecracker and will be compiled in the official releases of Firecracker.
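
The per-device bus selection described above can be illustrated with a minimal sketch. It is not Firecracker code or its API: the `DeviceConfig` type, the `transport` field, and the helper names are hypothetical, chosen only to show that a VM whose devices all stay on MMIO never needs a PCI bus.

```python
from dataclasses import dataclass

# Hypothetical transport names; the actual configuration key is not defined yet.
VALID_TRANSPORTS = ("mmio", "pci")


@dataclass(frozen=True)
class DeviceConfig:
    """Illustrative per-device configuration (not a real Firecracker type)."""
    device_id: str
    transport: str = "mmio"  # MMIO stays the default, matching today's behavior

    def __post_init__(self):
        if self.transport not in VALID_TRANSPORTS:
            raise ValueError(f"unknown transport: {self.transport!r}")


def needs_pci_bus(devices):
    """A PCI bus is only created if at least one device asks for it."""
    return any(d.transport == "pci" for d in devices)


def group_by_transport(devices):
    """Split device ids by the bus they would be attached to."""
    groups = {t: [] for t in VALID_TRANSPORTS}
    for d in devices:
        groups[d.transport].append(d.device_id)
    return groups
```

With this shape, a VM configured only with default devices would take exactly the same path as today, while a single device with `transport="pci"` would be enough to trigger creation of the PCI bus.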
Maintainers will:
A proposed high-level plan for the contributions is presented below.
A more detailed plan will need to be provided by contributors before starting the implementation.
A good starting point is cloud-hypervisor implementation.
allowing for up to 2048 interrupt lines per device
Open questions:
Will it require using rust-vmm crates not yet used in Firecracker (vm-devices, vm-allocator, ...)?
How much work will it be to refactor FC device management to start using those crates as well?
2. PCIe-passthrough support design
The second milestone will be the design of the support of VFIO-based PCI-passthrough
which will allow passing to the guest any PCIe device from the host.
This design will need to answer the still open questions around snapshot/resume and VM oversubscriptability,
and will guide the implementation of the following milestones.
In particular, the main problems to solve are:
to remove sensitive information from it, protecting it from speculative execution attacks.
To enable prototyping of this milestone, maintainers will set up test artifacts and infrastructure to
test on Nvidia GPUs on PR and nightly.
Maintainers will also start early consultation with Amazon Security to identify additional requirements.
3. Basic PCIe-passthrough support implementation
This proposed milestone will cover the basic implementation of PCIe device-passthrough via VFIO.
With this milestone, Firecracker customers will be able to attach any number of VFIO devices of any kind to the VM before boot.
However, customers will not be able to oversubscribe memory of VMs with PCI-passthrough devices,
as the entire guest physical memory needs to be allocated for DMA.
It should be possible, depending on the investigations in milestone 2, to snapshot/resume a VM with an offlined VFIO device.
We expect this change to be fairly modular and self-contained as it builds upon the first milestone,
adding just an additional device type.
We expect the biggest hurdles for this change to be the thorough security review, as it changes the current Firecracker threat model, and the considerations around its usefulness for internal customers.
Furthermore, a path forward towards full oversubscribability needs to be identified and prototyped for this milestone to be accepted.
4. Over-subscriptable PCIe-passthrough VMs
Depending on the investigations in milestone 2, we need to implement a way to oversubscribe memory
from VMs with PCI-passthrough devices.
The challenge is that the hypervisor needs to know in advance which guest physical memory ranges will be used by DMA.
One way to do it would be to ask the guest to configure a virtual IOMMU to enable DMA from the device.
In this case, the hypervisor will know which memory ranges the guest is using for DMA so that they can be granularly pre-allocated.
This could be done through the virtio-iommu device.
One alternative could be PCI ATS/PRI or using a swiotlb in the guest.
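
To make the virtual IOMMU idea more concrete, here is a minimal sketch of the bookkeeping a hypervisor could do if the guest announced its DMA mappings. It does not model the virtio-iommu protocol or any VFIO call: the `DmaMapTracker` class, its method names, and the 4 KiB page size are assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page granularity for this sketch


class DmaMapTracker:
    """Tracks guest-announced DMA mappings (hypothetical helper, not a real API)."""

    def __init__(self):
        # I/O virtual address -> (guest physical address, length in bytes)
        self._mappings = {}

    def map(self, iova, gpa, length):
        """Record that the device may DMA to [gpa, gpa + length) via iova."""
        if length <= 0 or iova % PAGE_SIZE or gpa % PAGE_SIZE or length % PAGE_SIZE:
            raise ValueError("mappings must be non-empty and page-aligned")
        self._mappings[iova] = (gpa, length)

    def unmap(self, iova):
        """Forget a mapping; its memory no longer has to stay resident for DMA."""
        self._mappings.pop(iova, None)

    def pinned_ranges(self):
        """Merged, sorted guest-physical ranges [start, end) that need backing."""
        ranges = sorted((gpa, gpa + length) for gpa, length in self._mappings.values())
        merged = []
        for start, end in ranges:
            if merged and start <= merged[-1][1]:
                # Overlapping or adjacent: extend the previous range.
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        return merged

    def pinned_bytes(self):
        """Total guest memory that must be pre-allocated for DMA."""
        return sum(end - start for start, end in self.pinned_ranges())
```

Under these assumptions, only the ranges returned by `pinned_ranges()` would need to be pre-allocated, instead of the entire guest physical memory, which is what would leave the rest of the VM memory available for oversubscription.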