Skip to content

feat: Intel AMX support #5065

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Mar 10, 2025
Merged

Conversation

zulinx86
Copy link
Contributor

@zulinx86 zulinx86 commented Mar 5, 2025

This is the first PR for Intel Sapphire Rapids (EC2 7th-gen Intel instance type) support.

Note that this PR focuses on Intel AMX support and any integration tests for Intel Sapphire Rapids will be added in the upcoming PR.

Changes

  • Request permission for Intel AMX via arch_prctl() before KVM_GET_SUPPORTED_CPUID to boot guests successfully
  • Change XSAVE type from kvm_xsave to support snapshot restore of Intel AMX

Reason

Intel AMX (Advanced Matrix Extensions) is introduced in Intel Sapphire Rapids and a new instruction set for deep learning / AI workloads. Intel AMX is supported in XSAVE/XRSTOR that are instructions to save/restore extensional CPU features' states into memory (e.g. for context switch). Which states to be saved/restored is configured by writing a bit vector to XCR0 via XSETBV instruction. Intel AMX introduces two new bits (TILECFG and TILEDATA) in (1) the bit vector (XCR0.TILECFG[bit 17] and XCR0.TILEDATA[bit 18]) as well as (2) CPUID to enumerate their supports (CPUID.(EAX=0DH,ECX=0):EAX.TILECFG[bit 17] and CPUID.(EAX=0DH,ECX=0):EAX.TILEDATA[bit 18]).

Since the memory size required to save TILEDATA state is 8KB and it is larger than previously statically allocated memory size (4KB), Linux kernel decided to disable TILEDATA by default and allows userspace applications to enable it dynamically via arch_prctl() syscall. This default disabling behavior is also the case with KVM. To enable TILEDATA for guests, VMM has to call arch_prctl(ARCH_REQ_XCOMP_GUEST_PERM, ...), which makes KVM_GET_SUPPORTED_CPUID return a value with CPUID.(EAX=0DH,ECX=0):EAX.TILEDATA[bit 18] set. Conversely, without arch_prctl() prior to KVM_GET_SUPPORTED_CPUID, it returns an inconsistent state where CPUID.(EAX=0DH,ECX=0):EAX.TILECFG[bit 17] is set but CPUID.(EAX=0DH,ECX=0):EAX.TILEDATA[bit 18] is not set. If such a AMX-half-enabled CPUID is passed to KVM_SET_CPUID2 as it is, guests will crash with general protection fault during boot (See Appendix). This is because Linux kernel attempts to execute XSETBV instruction with all XSAVE feature bits enumerated on CPUID during boot and XSETBV only accepts either of both Intel AMX bits enabled or disabled. This bug of KVM_GET_SUPPORTED_CPUID returning such a half-enabled state was fixed in kernel v6.4. But in any case, Firecracker supports the CPU template feature that enables users to mask CPU features (including Intel AMX), so we enable TILEDATA by default to make it work even on earlier kernels.

To support a dynamically-sized XSTATE buffer, the Linux kernel extended the existing kvm_xsave by adding a flexible array member (FAM) in the end. Along with it, KVM_GET_XSAVE2 API was added and KVM_SET_XSAVE API was extended. To support these changes, rust-vmm added (1) kvm_xsave2 that holds kvm_xsave and the length of the FAM, (2) Xsave as FamStructWrapper of kvm_xsave2, (3) get_xsave2() for KVM_GET_XSAVE2 and (4) set_xsave2() to take Xsave and call KVM_SET_XSAVE. Accordingly, use these methods and structs to support Intel AMX in Firecracker.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • [ ] I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • [ ] I have mentioned all user-facing changes in CHANGELOG.md.
  • [ ] If a specific issue led to this PR, this PR closes the issue.
  • [ ] When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • [ ] I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.

Appendix: GP fault on guest boot without Intel AMX enablement

[    0.048097] general protection fault, maybe for address 0x0: 0000 [#1] PREEMPT SMP NOPTI
[    0.048097] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.1.102 #1
[    0.048097] RIP: 0010:fpu__init_cpu_xstate+0x69/0xb0
[    0.048097] Code: 35 2d fd 00 b9 c4 01 00 00 48 89 c2 48 c1 ea 20 0f 30 65 48 89 05 07 36 ff 7e 48 8b 05 48 3d fd 00 31 c9 48 89 c2 48 c1 ea 20 <0f> 01 d1 48 8b 05 5d 7b 2d 01 a8 08 75 0a 48 8b 5d f8 c9 e9 1f 8e
[    0.048097] RSP: 0000:ffffffff82203e68 EFLAGS: 00010246
[    0.048097] RAX: 00000000000202e7 RBX: 0000000000000206 RCX: 0000000000000000
[    0.048097] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000770eb0
[    0.048097] RBP: ffffffff82203e70 R08: 0000000000000000 R09: 00000000ffffefff
[    0.048097] R10: ffffffff82203d28 R11: ffffffff822b0568 R12: 0000000000000000
[    0.048097] R13: 00000000000202e7 R14: 0000000000000200 R15: 0000000000000000
[    0.048097] FS:  0000000000000000(0000) GS:ffff888007a00000(0000) knlGS:0000000000000000
[    0.048097] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.048097] CR2: ffff888002800000 CR3: 0000000002210001 CR4: 0000000000770eb0
[    0.048097] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.048097] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[    0.048097] PKRU: 55555554
[    0.048097] Call Trace:
[    0.048097]  <TASK>
[    0.048097]  ? show_trace_log_lvl+0x1c5/0x28b
[    0.048097]  ? show_trace_log_lvl+0x1c5/0x28b
[    0.048097]  ? fpu__init_system_xstate+0x16d/0x3bb
[    0.048097]  ? show_regs.part.0+0x1e/0x24
[    0.048097]  ? die_addr.cold+0x8/0xd
[    0.048097]  ? exc_general_protection+0x1a7/0x3f0
[    0.048097]  ? asm_exc_general_protection+0x27/0x30
[    0.048097]  ? fpu__init_cpu_xstate+0x69/0xb0
[    0.048097]  ? fpu__init_cpu_xstate+0x31/0xb0
[    0.048097]  fpu__init_system_xstate+0x16d/0x3bb
[    0.048097]  fpu__init_system+0x136/0x161
[    0.048097]  arch_cpu_finalize_init+0x22/0x4c
[    0.048097]  start_kernel+0x400/0x499
[    0.048097]  x86_64_start_reservations+0x24/0x2a
[    0.048097]  x86_64_start_kernel+0x72/0x7c
[    0.048097]  secondary_startup_64_no_verify+0xcd/0xdb
[    0.048097]  </TASK>
[    0.048097] ---[ end trace 0000000000000000 ]---

With this PR, the GP fault doesn't happen. (Note that a functional test for CPU feature set will be added in an upcoming PR for functional integration tests support.)

[    0.048697] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.048697] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.048697] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[    0.048697] x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
[    0.048697] x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
[    0.048697] x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
[    0.048697] x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User registers'
[    0.048697] x86/fpu: Supporting XSAVE feature 0x20000: 'AMX Tile config'
[    0.048697] x86/fpu: Supporting XSAVE feature 0x40000: 'AMX Tile data'
[    0.048697] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[    0.048697] x86/fpu: xstate_offset[5]:  832, xstate_sizes[5]:   64
[    0.048697] x86/fpu: xstate_offset[6]:  896, xstate_sizes[6]:  512
[    0.048697] x86/fpu: xstate_offset[7]: 1408, xstate_sizes[7]: 1024
[    0.048697] x86/fpu: xstate_offset[9]: 2432, xstate_sizes[9]:    8
[    0.048697] x86/fpu: xstate_offset[17]: 2496, xstate_sizes[17]:   64
[    0.048697] x86/fpu: xstate_offset[18]: 2560, xstate_sizes[18]: 8192
...
Ubuntu 24.04.1 LTS ubuntu-fc-uvm ttyS0
...
root@ubuntu-fc-uvm:~#

Copy link

codecov bot commented Mar 5, 2025

Codecov Report

Attention: Patch coverage is 74.71264% with 22 lines in your changes missing coverage. Please review.

Project coverage is 83.15%. Comparing base (db6d29b) to head (d13c537).
Report is 8 commits behind head on main.

Files with missing lines Patch % Lines
src/vmm/src/arch/x86_64/xstate.rs 59.52% 17 Missing ⚠️
src/vmm/src/vstate/vm/x86_64.rs 77.77% 4 Missing ⚠️
src/vmm/src/vstate/kvm.rs 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5065      +/-   ##
==========================================
- Coverage   83.18%   83.15%   -0.03%     
==========================================
  Files         247      248       +1     
  Lines       26816    26901      +85     
==========================================
+ Hits        22306    22370      +64     
- Misses       4510     4531      +21     
Flag Coverage Δ
5.10-c5n.metal 83.52% <52.87%> (-0.11%) ⬇️
5.10-m5n.metal 83.50% <52.87%> (-0.12%) ⬇️
5.10-m6a.metal 82.71% <52.87%> (-0.12%) ⬇️
5.10-m6g.metal 79.60% <0.00%> (ø)
5.10-m6i.metal 83.49% <52.87%> (-0.12%) ⬇️
5.10-m7g.metal 79.60% <0.00%> (ø)
6.1-c5n.metal 83.57% <65.51%> (-0.06%) ⬇️
6.1-m5n.metal 83.55% <65.51%> (-0.06%) ⬇️
6.1-m6a.metal 82.76% <65.51%> (-0.06%) ⬇️
6.1-m6g.metal 79.60% <0.00%> (ø)
6.1-m6i.metal 83.55% <65.51%> (-0.06%) ⬇️
6.1-m7g.metal 79.60% <0.00%> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@zulinx86 zulinx86 force-pushed the sapphire_rapids branch 6 times, most recently from d7875f2 to 2ae659a Compare March 6, 2025 05:29
@zulinx86 zulinx86 added the Status: Awaiting review Indicates that a pull request is ready to be reviewed label Mar 6, 2025
@zulinx86 zulinx86 requested a review from Manciukic March 6, 2025 10:14
@zulinx86 zulinx86 force-pushed the sapphire_rapids branch 2 times, most recently from 9ed7e6c to c7331eb Compare March 7, 2025 07:30
@zulinx86 zulinx86 requested review from pb8o and Manciukic March 7, 2025 07:30
Copy link
Contributor

@roypat roypat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we already want to add a changelog entry for this? I know the title says "make perf tests work", but really what we're doing is making firecracker support AMX xP

@zulinx86
Copy link
Contributor Author

zulinx86 commented Mar 7, 2025

Do we already want to add a changelog entry for this? I know the title says "make perf tests work", but really what we're doing is making firecracker support AMX xP

I'd like to declare official support of Intel AMX when snapshot restore is supported :)

@zulinx86 zulinx86 force-pushed the sapphire_rapids branch 3 times, most recently from 27ae1dc to 9e074a2 Compare March 7, 2025 09:07
Manciukic
Manciukic previously approved these changes Mar 7, 2025
@zulinx86 zulinx86 changed the title Enable to run performance tests (other than snapshot restore) on Intel Sapphire Rapids feat: Intel AMX support Mar 10, 2025
@zulinx86 zulinx86 force-pushed the sapphire_rapids branch 3 times, most recently from e49f406 to 3d9e5de Compare March 10, 2025 10:14
@zulinx86 zulinx86 requested a review from xmarcalx as a code owner March 10, 2025 10:14
@zulinx86 zulinx86 requested a review from kalyazin as a code owner March 10, 2025 10:14
@zulinx86 zulinx86 force-pushed the sapphire_rapids branch 2 times, most recently from 17ceb3a to fc8effa Compare March 10, 2025 10:43
@zulinx86 zulinx86 requested review from roypat and Manciukic March 10, 2025 10:45
Copy link
Contributor

@Manciukic Manciukic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall lgtm, but I think that fam_len thing is so confusing right now. If we don't want/can't change the api, I think we should better explain in the comment what is happening because it makes my head spin every time I look at it

@zulinx86 zulinx86 force-pushed the sapphire_rapids branch 2 times, most recently from ea4225f to 0e8ec6c Compare March 10, 2025 13:01
Bindings for ARCH_REQ_XCOMP_GUEST_PERM and ARCH_XCOMP_TILEDATA are
required to enable Intel AMX's XTILEDATA for XSAVE. Note that the
required bits were added in kernel v6.4+.

Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
Eq and PartialEq are not necessary for KvmError, rather disallows me to
add error variants wrapping std::io::Error to handle syscall errors.

Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
Intel AMX (Advanced Matrix Extensions) was introduced in Intel Sapphire
Rapids to accelerate deep learning and AI workloads. Since it requires
a larger area to save its state, the TILEDATA feature is disabled by
default. We request permission for it by default because it can be
disabled via CPU template. Otherwise, kernels prior to v6.4 have a bug
where KVM_GET_SUPPORTED_CPUID returns an inconsistent state of TILECFG
enabled but TILEDATA disabled by default, causing guest's #GP fault on
xsetbv instruction.

Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
roypat
roypat previously approved these changes Mar 10, 2025
Intel AMX is an XSTATE feature and TILEDATA is disabled by default
because it requires a larger area to save its state than the traditional
4096 bytes. Instead, Linux kernel allows VMMs to request the guest
permission via `arch_prctl()`. As such, the size of the XSTATE buffer
required to save XSTASTE is dynamic. To support dynamically-sized
buffer, `KVM_CAP_XSAVE2` was introduced with `KVM_GET_XSAVE2`.
Accordingly, kvm-bindings added `Xsave` that is an alias of
`FamStructWrapper` for the `kvm_xsave` struct with FAM in the end, and
kvm-ioctls added `get_xsave2()` for `KVM_GET_XSAVE2` and `set_xsave2()`
to take `Xsave` to call `KVM_SET_XSAVE`.

Change the type of `xsave` in `VcpuState` from `kvm_xsave` to `Xsave`.
Use `get_xsave2()` and `set_xsave2()`.

Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
KVM_GET_XSAVE2 is called when taking a snapshot, so it has to be allowed
by seccomp filter.

Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
Intel AMX support was introduced but it is only supported on Intel
Sapphire Rapids at the moment. We have to skip Intel AMX tests on older
processors.

Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
To check that Intel AMX is indeed supported inside guest, check related
features are listed in CPUID output.

Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
Now all the required changes for Intel AMX have been done :)

Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
@roypat roypat merged commit 1accef6 into firecracker-microvm:main Mar 10, 2025
3 of 5 checks passed
@zulinx86 zulinx86 deleted the sapphire_rapids branch March 10, 2025 16:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Awaiting review Indicates that a pull request is ready to be reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants