
Commit e854660

x86/sme: Use percpu boolean to control wbinvd during kexec
On SME platforms, dirty cachelines with and without the encryption bit for
the same physical memory can coexist at the hardware level, and the CPU can
flush them back to memory in random order. During kexec, the caches must be
flushed before jumping to the new kernel to avoid silent memory corruption
when a cacheline with a different encryption property is written back over
whatever encryption properties the new kernel is using.

TDX also needs a cache flush during kexec for the same reason. It would be
good to implement a generic way to flush the cache instead of scattering
checks for each feature all around.

During kexec, the kexec-ing CPU sends IPIs to all remote CPUs to stop them
before it boots to the new kernel. For SME the kernel basically encrypts
all memory, including the kernel itself, by manipulating page tables. A
simple memory write from the kernel could dirty cachelines.

Currently, the kernel uses WBINVD to flush the cache for SME during kexec
in two places:

 1) in stop_this_cpu(), for all remote CPUs when the kexec-ing CPU stops
    them;
 2) in relocate_kernel(), where the kexec-ing CPU jumps to the new kernel.

Unlike SME, TDX can only dirty cachelines when it is used (i.e., when
SEAMCALLs are performed). Since there are no more SEAMCALLs after the
aforementioned WBINVDs, leverage them for TDX.

In order to have a generic way to cover both SME and TDX, use a percpu
boolean to indicate that the cache may be in an incoherent state and thus
a cache flush is needed during kexec, and turn on the boolean for SME. TDX
can then leverage it by also turning the boolean on.

A percpu boolean isn't strictly needed for SME since it is activated at
early boot time and on all CPUs; a global flag would be sufficient. But
using a percpu variable has two benefits. Foremost, the state that is
being tracked here (a percpu cache coherency situation requiring a flush)
is percpu, so a percpu state is a more direct and natural fit. Secondly,
it will fit TDX's usage better, since the percpu variable can be set when
a CPU makes a SEAMCALL and cleared when another WBINVD on the CPU obviates
the need for a kexec-time WBINVD. Saving the kexec-time WBINVD is
valuable, because there is an existing race[*] where kexec could proceed
while another CPU is active. WBINVD could make this race worse, so it's
worth skipping it when possible.

Today the first WBINVD in stop_this_cpu() is performed when the platform
supports SME, and the second WBINVD is done in relocate_kernel() when SME
is activated by the kernel. Make things simple by also doing the second
WBINVD when the platform supports SME. This allows the kernel to simply
turn on this percpu boolean when bringing up a CPU by checking whether the
platform supports SME.

No other functional change intended.

Also, currently machine_kexec() has a comment to explain why no function
call is allowed after load_segments(). After changing to use the new
percpu boolean to control whether to perform WBINVD when calling
relocate_kernel(), that comment isn't needed anymore. But it is still a
useful comment, so expand the comment around load_segments() to mention
that due to call depth tracking no function call can be made after
load_segments().

[*] The "race" in native_stop_other_cpus()

Commit 1f5e7eb ("x86/smp: Make stop_other_cpus() more robust") introduced
a new 'cpus_stop_mask' to resolve an "intermittent lockups on poweroff"
issue which was caused by the WBINVD in stop_this_cpu(). Specifically, the
new cpumask resolved the below problem mentioned in that commit:

    CPU0                                    CPU1

     stop_other_cpus()
       send_IPIs(REBOOT);                   stop_this_cpu()
       while (num_online_cpus() > 1);         set_online(false);
       proceed... -> hang
                                              wbinvd()

While it fixed the reported issue, that commit explained the new cpumask
"cannot plug all holes either". This is because it doesn't address the
"race" mentioned in #3 of the comment in native_stop_other_cpus():

    /*
     * 1) Send an IPI on the reboot vector to all other CPUs.
     *
     *    The other CPUs should react on it after leaving critical
     *    sections and re-enabling interrupts. They might still hold
     *    locks, but there is nothing which can be done about that.
     *
     * 2) Wait for all other CPUs to report that they reached the
     *    HLT loop in stop_this_cpu()
     *
     * 3) If #2 timed out send an NMI to the CPUs which did not
     *    yet report
     *
     * 4) Wait for all other CPUs to report that they reached the
     *    HLT loop in stop_this_cpu()
     *
     * #3 can obviously race against a CPU reaching the HLT loop late.
     * That CPU will have reported already and the "have all CPUs
     * reached HLT" condition will be true despite the fact that the
     * other CPU is still handling the NMI. Again, there is no
     * protection against that as "disabled" APICs still respond to
     * NMIs.
     */

Consider the case below:

    CPU 0                                   CPU 1

     native_stop_other_cpus()               stop_this_cpu()

       // sends REBOOT IPI to stop remote CPUs
                                              ...
                                              wbinvd();

       // wait times out, try NMI
       if (!cpumask_empty(&cpus_stop_mask)) {
         for_each_cpu(cpu, &cpus_stop_mask) {

                                              ...
                                              cpumask_clear_cpu(cpu,
                                                      &cpus_stop_mask);
                                              hlt;

           // send NMI          --->          wakeup from hlt and run
                                              stop_this_cpu():

           // WAIT CPUs TO STOP
           while (!cpumask_empty(
               &cpus_stop_mask) && ...)
         }
       ...
       proceed ...                            wbinvd();
                                              ...
                                              hlt;

The "WAIT CPUs TO STOP" is supposed to wait until all remote CPUs are
stopped, but it actually quits immediately because the remote CPUs have
already been cleared from cpus_stop_mask when stop_this_cpu() was called
from the REBOOT IPI.

Doing WBINVD in stop_this_cpu() could potentially increase the chance of
triggering the above "race", although it is still rare in practice.

Signed-off-by: Kai Huang <kai.huang@intel.com>
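The TDX side is not part of this commit, but as a rough sketch of the
usage pattern the message describes, a SEAMCALL wrapper could set the
percpu flag and an explicit cache flush could clear it. The names
example_seamcall(), example_flush_cache() and do_seamcall() below are
hypothetical; only cache_state_incoherent, this_cpu_write() and wbinvd()
correspond to the kernel code touched by this patch:

    /*
     * Illustrative sketch only, not code from this commit: how a later
     * user such as TDX could drive the percpu flag. The wrappers and
     * do_seamcall() are hypothetical; cache_state_incoherent is the
     * percpu boolean introduced here.
     */
    #include <linux/percpu.h>
    #include <asm/processor.h>       /* DECLARE_PER_CPU(bool, cache_state_incoherent) */
    #include <asm/special_insns.h>   /* wbinvd() */

    static u64 example_seamcall(u64 fn)
    {
            /*
             * A SEAMCALL may leave dirty cachelines of TDX private
             * memory behind, so mark this CPU's cache as possibly
             * incoherent before making the call.
             */
            this_cpu_write(cache_state_incoherent, true);

            return do_seamcall(fn);  /* hypothetical low-level SEAMCALL helper */
    }

    static void example_flush_cache(void)
    {
            /*
             * After an explicit WBINVD there is nothing left to flush
             * on this CPU, so the kexec-time WBINVD can be skipped by
             * clearing the flag.
             */
            wbinvd();
            this_cpu_write(cache_state_incoherent, false);
    }

The kexec path then only needs to test the flag, as the stop_this_cpu()
and relocate_kernel() hunks below do.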
1 parent 5e0c2d5 commit e854660

6 files changed, 44 insertions(+), 23 deletions(-)

arch/x86/include/asm/kexec.h (1 addition, 1 deletion)

@@ -115,7 +115,7 @@ relocate_kernel_fn(unsigned long indirection_page,
                    unsigned long pa_control_page,
                    unsigned long start_address,
                    unsigned int preserve_context,
-                   unsigned int host_mem_enc_active);
+                   unsigned int cache_incoherent);
 #endif
 extern relocate_kernel_fn relocate_kernel;
 #define ARCH_HAS_KIMAGE_ARCH

arch/x86/include/asm/processor.h (2 additions, 0 deletions)

@@ -732,6 +732,8 @@ void __noreturn stop_this_cpu(void *dummy);
 void microcode_check(struct cpuinfo_x86 *prev_info);
 void store_cpu_caps(struct cpuinfo_x86 *info);
 
+DECLARE_PER_CPU(bool, cache_state_incoherent);
+
 enum l1tf_mitigations {
         L1TF_MITIGATION_OFF,
         L1TF_MITIGATION_FLUSH_NOWARN,

arch/x86/kernel/cpu/amd.c (16 additions, 0 deletions)

@@ -492,6 +492,22 @@ static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
 {
         u64 msr;
 
+        /*
+         * Mark using wbinvd is needed during kexec on processors that
+         * support SME. This provides support for performing a successful
+         * kexec when going from SME inactive to SME active (or vice-versa).
+         *
+         * The cache must be cleared so that if there are entries with the
+         * same physical address, both with and without the encryption bit,
+         * they don't race each other when flushed and potentially end up
+         * with the wrong entry being committed to memory.
+         *
+         * Test the CPUID bit directly because the machine might've cleared
+         * X86_FEATURE_SME due to cmdline options.
+         */
+        if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+                __this_cpu_write(cache_state_incoherent, true);
+
         /*
          * BIOS support is required for SME and SEV.
          *   For SME: If BIOS has enabled SME then adjust x86_phys_bits by

arch/x86/kernel/machine_kexec_64.c (11 additions, 5 deletions)

@@ -29,6 +29,7 @@
 #include <asm/set_memory.h>
 #include <asm/cpu.h>
 #include <asm/efi.h>
+#include <asm/processor.h>
 
 #ifdef CONFIG_ACPI
 /*
@@ -346,15 +347,15 @@ void __nocfi machine_kexec(struct kimage *image)
 {
         unsigned long reloc_start = (unsigned long)__relocate_kernel_start;
         relocate_kernel_fn *relocate_kernel_ptr;
-        unsigned int host_mem_enc_active;
+        unsigned int cache_incoherent;
         int save_ftrace_enabled;
         void *control_page;
 
         /*
-         * This must be done before load_segments() since if call depth tracking
-         * is used then GS must be valid to make any function calls.
+         * This must be done before load_segments(), since it resets
+         * GS to 0 and percpu data needs the correct GS to work.
          */
-        host_mem_enc_active = cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT);
+        cache_incoherent = this_cpu_read(cache_state_incoherent);
 
 #ifdef CONFIG_KEXEC_JUMP
         if (image->preserve_context)
@@ -398,6 +399,11 @@ void __nocfi machine_kexec(struct kimage *image)
          *
          * I take advantage of this here by force loading the
          * segments, before I zap the gdt with an invalid value.
+         *
+         * load_segments() resets GS to 0. Don't make any function call
+         * after here since call depth tracking uses percpu variables to
+         * operate (relocate_kernel() is explicitly ignored by call depth
+         * tracking).
          */
         load_segments();
         /*
@@ -412,7 +418,7 @@ void __nocfi machine_kexec(struct kimage *image)
                         virt_to_phys(control_page),
                         image->start,
                         image->preserve_context,
-                        host_mem_enc_active);
+                        cache_incoherent);
 
 #ifdef CONFIG_KEXEC_JUMP
         if (image->preserve_context)

arch/x86/kernel/process.c (3 additions, 13 deletions)

@@ -87,6 +87,8 @@ EXPORT_PER_CPU_SYMBOL(cpu_tss_rw);
 DEFINE_PER_CPU(bool, __tss_limit_invalid);
 EXPORT_PER_CPU_SYMBOL_GPL(__tss_limit_invalid);
 
+DEFINE_PER_CPU(bool, cache_state_incoherent);
+
 /*
  * this gets called so that we can store lazy state into memory and copy the
  * current task into the new thread.
@@ -818,19 +820,7 @@ void __noreturn stop_this_cpu(void *dummy)
         disable_local_APIC();
         mcheck_cpu_clear(c);
 
-        /*
-         * Use wbinvd on processors that support SME. This provides support
-         * for performing a successful kexec when going from SME inactive
-         * to SME active (or vice-versa). The cache must be cleared so that
-         * if there are entries with the same physical address, both with and
-         * without the encryption bit, they don't race each other when flushed
-         * and potentially end up with the wrong entry being committed to
-         * memory.
-         *
-         * Test the CPUID bit directly because the machine might've cleared
-         * X86_FEATURE_SME due to cmdline options.
-         */
-        if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+        if (this_cpu_read(cache_state_incoherent))
                 wbinvd();
 
         /*

arch/x86/kernel/relocate_kernel_64.S (11 additions, 4 deletions)

@@ -60,7 +60,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
          * %rsi pa_control_page
          * %rdx start address
          * %rcx preserve_context
-         * %r8  host_mem_enc_active
+         * %r8  cache_incoherent
          */
 
         /* Save the CPU context, used for jumping back */
@@ -117,7 +117,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
         /*
          * %rdi indirection page
          * %rdx start address
-         * %r8  host_mem_enc_active
+         * %r8  cache_incoherent
          * %r9  page table page
          * %r11 preserve_context
          * %r13 original CR4 when relocate_kernel() was invoked
@@ -179,14 +179,21 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
         movq    %r9, %cr3
 
         /*
+         * If the memory cache is in incoherent state, e.g., due to
+         * memory encryption, do wbinvd to flush cache.
+         *
          * If SME is active, there could be old encrypted cache line
          * entries that will conflict with the now unencrypted memory
          * used by kexec. Flush the caches before copying the kernel.
+         *
+         * Note SME sets this flag to true when the platform supports
+         * SME, so the wbinvd is performed even SME is not activated
+         * by the kernel. But this has no harm.
          */
         testq   %r8, %r8
-        jz      .Lsme_off
+        jz      .Lnowbinvd
         wbinvd
-.Lsme_off:
+.Lnowbinvd:
 
         call    swap_pages