drm/xe: Take PM ref SVM copy to SRAM #29

sys-i915-oscijenkins · 2025-04-17T13:57:14Z

From git@z Thu Jan 1 00:00:00 1970
Subject: [PATCH] drm/xe: Take PM ref SVM copy to SRAM
From: Matthew Brost <matthew.brost@intel.com>
Date: Wed, 16 Apr 2025 11:43:54 -0700
Message-Id: <20250416184354.419272-1-matthew.brost@intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit

It is possible, however unlikely, for a CPU access to memory that
currently resides in the GPU to trigger a fault while no PM reference is
held. Ensure a PM ref is held when doing an SVM copy to SRAM.

Fixes the below splat found in local testing:
[ 1269.500163] ------------[ cut here ]------------
[ 1269.500167] xe 0000:03:00.0: [drm] Missing outer runtime PM protection
[ 1269.500184] WARNING: CPU: 8 PID: 38648 at drivers/gpu/drm/xe/xe_pm.c:664 xe_pm_runtime_get_noresume+0x86/0xb0 [xe]
[ 1269.500226] Modules linked in: xe drm_gpusvm drm_gpuvm drm_ttm_helper ttm drm_exec gpu_sched drm_suballoc_helper drm_buddy drm_kms_helper snd_hda_codec_realtek snd_hda_codec_generic snd_hda_scodec_component snd_hda_intel snd_intel_dspcfg snd_hda_codec x86_pkg_temp_thermal snd_hwdep coretemp snd_hda_core i2c_i801 i2c_mux snd_pcm wmi_bmof i2c_smbus mei_pxp mei_hdcp video wmi mei_me mei fuse igb e1000e i2c_algo_bit ptp ghash_clmulni_intel pps_core intel_lpss_pci
[ 1269.500257] CPU: 8 UID: 0 PID: 38648 Comm: xe_exec_system_ Tainted: G W 6.15.0-rc2-xe+ #158 PREEMPT(undef)
[ 1269.500260] Tainted: [W]=WARN
[ 1269.500261] Hardware name: Intel Corporation Raptor Lake Client Platform/RPL-S ADP-S DDR5 UDIMM CRB, BIOS RPLSFWI1.R00.3492.A00.2211291114 11/29/2022
[ 1269.500262] RIP: 0010:xe_pm_runtime_get_noresume+0x86/0xb0 [xe]
[ 1269.500293] Code: ee 31 c0 48 85 db 48 0f 44 f8 4c 8b 67 50 4d 85 e4 74 2e e8 6c 0b 9a e1 4c 89 e2 48 c7 c7 80 d5 4e a0 48 89 c6 e8 aa 51 11 e1 <0f> 0b eb c1 48 8b 47 08 f0 ff 80 f8 02 00 00 5b 41 5c c3 cc cc cc
[ 1269.500294] RSP: 0000:ffffc9000ed439c0 EFLAGS: 00010282
[ 1269.500297] RAX: 0000000000000000 RBX: ffff888113568000 RCX: 0000000000000000
[ 1269.500298] RDX: 0000000000000002 RSI: 0000000000000001 RDI: 00000000ffffffff
[ 1269.500299] RBP: ffff888111bdf600 R08: ffff88888d5fffe8 R09: 00000000fffdffff
[ 1269.500300] R10: ffff88888c800000 R11: ffff88888d300000 R12: ffff888103b3dd10
[ 1269.500301] R13: ffffc9000ed43a70 R14: ffff88813e5a52c0 R15: ffff88813e5a52c0
[ 1269.500302] FS: 00007f1e596a3940(0000) GS:ffff88890ac15000(0000) knlGS:0000000000000000
[ 1269.500304] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1269.500305] CR2: 0000563cf48c3ee8 CR3: 000000031055e001 CR4: 0000000000f70ef0
[ 1269.500306] PKRU: 55555554
[ 1269.500307] Call Trace:
[ 1269.500308] <TASK>
[ 1269.500310] xe_sched_job_create+0x159/0x330 [xe]
[ 1269.500342] xe_bb_create_migration_job+0x7c/0x510 [xe]
[ 1269.500359] ? rcu_is_watching+0x11/0x50
[ 1269.500363] ? __kmalloc_cache_noprof+0x255/0x330
[ 1269.500366] ? xelp_pte_encode_addr+0x34/0x1d0 [xe]
[ 1269.500394] xe_migrate_vram+0x2c5/0x620 [xe]
[ 1269.500558] ? __iommu_dma_map+0x99/0x170
[ 1269.500569] xe_svm_copy+0x486/0x620 [xe]
[ 1269.500613] drm_gpusvm_migrate_to_ram+0x290/0x330 [drm_gpusvm]
[ 1269.500624] do_swap_page+0xff7/0x2440
[ 1269.500633] ? __pfx_default_wake_function+0x10/0x10
[ 1269.500640] ? rcu_is_watching+0x11/0x50
[ 1269.500646] __handle_mm_fault+0x617/0x950
[ 1269.500658] handle_mm_fault+0xbf/0x250
[ 1269.500664] do_user_addr_fault+0x177/0x6a0
[ 1269.500672] exc_page_fault+0x63/0x1c0
[ 1269.500678] asm_exc_page_fault+0x26/0x30
[ 1269.500681] RIP: 0033:0x7f1e5b8b1b0f
[ 1269.500684] Code: 15 00 49 8d 0c 1a 49 39 d4 49 89 4c 24 60 0f 95 c2 48 29 d8 0f b6 d2 48 83 c8 01 48 c1 e2 02 48 09 da 48 83 ca 01 49 89 52 08 <48> 89 41 08 49 8d 4a 10 eb af 48 8d 0d 78 ea 12 00 ba 64 10 00 00
[ 1269.500687] RSP: 002b:00007fff7ccb74c0 EFLAGS: 00010206
[ 1269.500690] RAX: 0000000000521121 RBX: 0000000000001010 RCX: 0000563cf48c3ee0
[ 1269.500692] RDX: 0000000000001011 RSI: ffffffffffffff20 RDI: 0000000000000000
[ 1269.500695] RBP: 00007fff7ccb7540 R08: 0000000000000000 R09: 0000000000000001
[ 1269.500697] R10: 0000563cf48c2ed0 R11: 0000000000000206 R12: 00007f1e5ba11ac0
[ 1269.500699] R13: 0000000000001000 R14: 0000000000000000 R15: 00007f1e5ba11b20
[ 1269.500708] </TASK>
[ 1269.500710] irq event stamp: 176580299
[ 1269.500712] hardirqs last enabled at (176580305): [<ffffffff813447f6>] __up_console_sem+0x66/0x70
[ 1269.500716] hardirqs last disabled at (176580310): [<ffffffff813447db>] __up_console_sem+0x4b/0x70
[ 1269.500719] softirqs last enabled at (176580168): [<ffffffff812a637e>] __irq_exit_rcu+0xbe/0x110
[ 1269.500723] softirqs last disabled at (176579351): [<ffffffff812a637e>] __irq_exit_rcu+0xbe/0x110
[ 1269.500726] ---[ end trace 0000000000000000 ]---

Fixes: c5b3eb5 ("drm/xe: Add GPUSVM device memory copy vfunc functions")
Cc: stable@vger.kernel.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
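
The diff itself is not shown on this page, so the following is only a user-space sketch of the pattern the commit message describes, not the actual change. It models the condition from the splat: an inner "noresume" get that complains when no outer PM reference is held, and a copy path that takes an outer reference only for copies to SRAM. Every name here (`mock_dev`, `mock_pm_get`, `mock_pm_put`, `mock_pm_get_noresume`, `mock_create_job`, `mock_svm_copy`) is hypothetical and stands in for driver internals that are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a device's runtime-PM bookkeeping. */
struct mock_dev {
	int pm_refs;	/* runtime-PM references currently held */
	int warnings;	/* times the "missing outer PM protection" check fired */
};

/* Outer reference: in a real driver this could also resume the device. */
static void mock_pm_get(struct mock_dev *d)
{
	d->pm_refs++;
}

static void mock_pm_put(struct mock_dev *d)
{
	assert(d->pm_refs > 0);
	d->pm_refs--;
}

/*
 * Inner reference: only valid when the caller already holds an outer
 * reference. This mirrors the situation reported in the splat above.
 */
static void mock_pm_get_noresume(struct mock_dev *d)
{
	if (d->pm_refs == 0)
		d->warnings++;	/* "Missing outer runtime PM protection" */
	d->pm_refs++;
}

/* Job creation takes an inner reference for the lifetime of the job. */
static void mock_create_job(struct mock_dev *d)
{
	mock_pm_get_noresume(d);
	/* ... the copy job would run here ... */
	mock_pm_put(d);
}

/*
 * Copy path. A copy to SRAM can be entered from a CPU page fault, where
 * nothing guarantees an outer reference, so one is taken around the job.
 * take_outer_ref exists only so both behaviours can be compared.
 */
static void mock_svm_copy(struct mock_dev *d, bool to_sram, bool take_outer_ref)
{
	bool hold = to_sram && take_outer_ref;

	if (hold)
		mock_pm_get(d);
	mock_create_job(d);
	if (hold)
		mock_pm_put(d);
}
```

Calling `mock_svm_copy(&dev, true, false)` on a zero-initialised `struct mock_dev` increments `warnings`, while `mock_svm_copy(&dev, true, true)` does not; in both cases the reference count returns to zero afterwards.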

As a lockmap takes a reference for every ww_mutex used together, this
can be an arbitrarily large number and under control of userspace --
easily overflowing the arbitrary limit of 4096. However, the pin_count
(used for detecting unexpected lock dropping) is a full 32b despite
nesting being extremely rare (see lockdep_pin_lock).

References: https://gitlab.freedesktop.org/drm/intel/-/issues/8028
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20190425092004.9995-33-chris@chris-wilson.co.uk
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
[Joonas: Converting to pin_count:11 as per addition of sync:1]
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

We have recently turned on ftrace-dump-on-oops for i915's CI and an
issue we have encountered is that the trace buffer size greatly exceeds
the pstore capabilities; we get the tail of the oops but not the
introduction. Currently the global buffer size is controllable on the
cmdline, but at the request of our CI sysadmin, we would like to add a
control to the Kconfig as well. The rationale being the cmdline carries
the temporary hacks that we want to eradicate, and we want to track the
permanent configuration in .config.

I have kept the Kconfig option hidden from the user as the default
should suffice for the majority of users; reserving the configuration
for those that eschew the cmdline option.

v2: Add an expert prompt to stop the default value overriding .config
changes.

References: https://gitlab.freedesktop.org/drm/intel/-/issues/8029
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Tomi Sarvela <tomi.p.sarvela@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

Most systems keep the last messages from the panic, and we value the
stacktrace most, so dump it last in order to preserve it for
post-mortems.

References: https://gitlab.freedesktop.org/drm/intel/-/issues/8030
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Martin Peres <martin.peres@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180903131745.30593-1-chris@chris-wilson.co.uk
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

Under CI testing, it is common for the cpus to overheat with the
continuous workloads and end up being throttled. As the cpus still
function, it is less of a critical error meriting urgent action, but an
expected yet significant condition (pr_note).

References: https://gitlab.freedesktop.org/drm/intel/-/issues/8031
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Petri Latvala <petri.latvala@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

References: https://gitlab.freedesktop.org/drm/intel/-/issues/8032
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Petri Latvala <petri.latvala@intel.com>
[danvet: Rebase]
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

There's the hung_task_panic sysctl, but that's a bit of an extreme
measure. As a fallback, at least taint the machine.

Our CI uses this to decide when a reboot is necessary, plus to figure
out whether the kernel is still happy.

v2: Works much better when I put the else { add_taint() } at the right
place.

References: https://gitlab.freedesktop.org/drm/intel/-/issues/8034
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Paul E. McKenney" <paulmck@linux.ibm.com>
Cc: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: "Liu, Chuansheng" <chuansheng.liu@intel.com>
Acked-by: Chris Wilson <chris@chris-wilson.co.uk> (for core-for-CI)
Link: https://patchwork.freedesktop.org/patch/msgid/20190502204648.5537-1-daniel.vetter@ffwll.ch
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

There are the soft/hardlockup_panic sysctls, but that's a bit of an
extreme measure. As a fallback, at least taint the machine.

Our CI uses this to decide when a reboot is necessary, plus to figure
out whether the kernel is still happy.

References: https://gitlab.freedesktop.org/drm/intel/-/issues/8035
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Vincent Whitchurch <vincent.whitchurch@axis.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Sinan Kaya <okaya@kernel.org>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Acked-by: Chris Wilson <chris@chris-wilson.co.uk> (for core-for-CI)
Link: https://patchwork.freedesktop.org/patch/msgid/20190502194208.3535-2-daniel.vetter@ffwll.ch
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

We can't allow spam in CI.

Update 26th June 2018: This is still an issue:

Update 23rd May 2019: You guessed it, still occurring.

[ 224.739686] ------------[ cut here ]------------
[ 224.739712] WARNING: CPU: 3 PID: 2982 at net/sched/sch_generic.c:461 dev_watchdog+0x1fd/0x210
[ 224.739714] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_pcm i915 asix usbnet mii mei_me mei prime_numbers i2c_hid pinctrl_sunrisepoint pinctrl_intel btusb btrtl btbcm btintel bluetooth ecdh_generic
[ 224.739775] CPU: 3 PID: 2982 Comm: gem_exec_suspen Tainted: G U W 4.18.0-rc2-CI-Patchwork_9414+ #1
[ 224.739777] Hardware name: Dell Inc. XPS 13 9350/, BIOS 1.4.12 11/30/2016
[ 224.739780] RIP: 0010:dev_watchdog+0x1fd/0x210
[ 224.739781] Code: 49 63 4c 24 f0 eb 92 4c 89 ef c6 05 21 46 ad 00 01 e8 77 ee fc ff 89 d9 48 89 c2 4c 89 ee 48 c7 c7 88 4c 14 82 e8 a3 fe 84 ff <0f> 0b eb be 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 c7 47
[ 224.739866] RSP: 0018:ffff88027dd83e40 EFLAGS: 00010286
[ 224.739869] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000102
[ 224.739871] RDX: 0000000080000102 RSI: ffffffff820c8c6c RDI: 00000000ffffffff
[ 224.739873] RBP: ffff8802644c1540 R08: 0000000071be9b33 R09: 0000000000000000
[ 224.739874] R10: ffff88027dd83dc0 R11: 0000000000000000 R12: ffff8802644c1588
[ 224.739876] R13: ffff8802644c1160 R14: 0000000000000001 R15: ffff88026a5dc728
[ 224.739878] FS: 00007f18f4887980(0000) GS:ffff88027dd80000(0000) knlGS:0000000000000000
[ 224.739880] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.739881] CR2: 00007f4c627ae548 CR3: 000000022ca1a002 CR4: 00000000003606e0
[ 224.739883] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 224.739885] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 224.739886] Call Trace:
[ 224.739888] <IRQ>
[ 224.739892] ? qdisc_reset+0xe0/0xe0
[ 224.739894] ? qdisc_reset+0xe0/0xe0
[ 224.739897] call_timer_fn+0x93/0x360
[ 224.739903] expire_timers+0xc1/0x1d0
[ 224.739908] run_timer_softirq+0xc7/0x170
[ 224.739916] __do_softirq+0xd9/0x505
[ 224.739923] irq_exit+0xa9/0xc0
[ 224.739926] smp_apic_timer_interrupt+0x9c/0x2d0
[ 224.739929] apic_timer_interrupt+0xf/0x20
[ 224.739931] </IRQ>
[ 224.739934] RIP: 0010:delay_tsc+0x2e/0xb0
[ 224.739936] Code: 49 89 fc 55 53 bf 01 00 00 00 e8 6d 2c 78 ff e8 88 9d b6 ff 41 89 c5 0f ae e8 0f 31 48 c1 e2 20 48 09 c2 48 89 d5 eb 16 f3 90 <bf> 01 00 00 00 e8 48 2c 78 ff e8 63 9d b6 ff 44 39 e8 75 36 0f ae
[ 224.740021] RSP: 0018:ffffc900002f7d48 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
[ 224.740024] RAX: 0000000080000000 RBX: 0000000649565ca9 RCX: 0000000000000001
[ 224.740026] RDX: 0000000080000001 RSI: ffffffff820c8c6c RDI: 00000000ffffffff
[ 224.740027] RBP: 00000006493ea9ce R08: 000000005e81e2ee R09: 0000000000000000
[ 224.740029] R10: 0000000000000120 R11: 0000000000000000 R12: 00000000002ad8d6
[ 224.740030] R13: 0000000000000003 R14: 0000000000000004 R15: ffff88025caf5408
[ 224.740040] ? delay_tsc+0x66/0xb0
[ 224.740045] hibernation_debug_sleep+0x1c/0x30
[ 224.740048] hibernation_snapshot+0x2c1/0x690
[ 224.740053] hibernate+0x142/0x2a4
[ 224.740057] state_store+0xd0/0xe0
[ 224.740063] kernfs_fop_write+0x104/0x190
[ 224.740068] __vfs_write+0x31/0x180
[ 224.740072] ? rcu_read_lock_sched_held+0x6f/0x80
[ 224.740075] ? rcu_sync_lockdep_assert+0x29/0x50
[ 224.740078] ? __sb_start_write+0x152/0x1f0
[ 224.740080] ? __sb_start_write+0x168/0x1f0
[ 224.740084] vfs_write+0xbd/0x1a0
[ 224.740088] ksys_write+0x50/0xc0
[ 224.740094] do_syscall_64+0x55/0x190
[ 224.740097] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 224.740099] RIP: 0033:0x7f18f400a281
[ 224.740100] Code: c3 0f 1f 84 00 00 00 00 00 48 8b 05 59 8d 20 00 c3 0f 1f 84 00 00 00 00 00 8b 05 8a d1 20 00 85 c0 75 16 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 57 f3 c3 0f 1f 44 00 00 41 54 55 49 89 d4 53
[ 224.740186] RSP: 002b:00007fffd1f4fec8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 224.740189] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f18f400a281
[ 224.740190] RDX: 0000000000000004 RSI: 00007f18f448069a RDI: 0000000000000006
[ 224.740192] RBP: 00007fffd1f4fef0 R08: 0000000000000000 R09: 0000000000000000
[ 224.740194] R10: 0000000000000000 R11: 0000000000000246 R12: 000055e795d03400
[ 224.740195] R13: 00007fffd1f50500 R14: 0000000000000000 R15: 0000000000000000
[ 224.740205] irq event stamp: 1582591
[ 224.740207] hardirqs last enabled at (1582590): [<ffffffff810f9f9c>] vprintk_emit+0x4bc/0x4d0
[ 224.740210] hardirqs last disabled at (1582591): [<ffffffff81a0111c>] error_entry+0x7c/0x100
[ 224.740212] softirqs last enabled at (1582568): [<ffffffff81c0034f>] __do_softirq+0x34f/0x505
[ 224.740215] softirqs last disabled at (1582571): [<ffffffff8108c959>] irq_exit+0xa9/0xc0
[ 224.740218] WARNING: CPU: 3 PID: 2982 at net/sched/sch_generic.c:461 dev_watchdog+0x1fd/0x210
[ 224.740219] ---[ end trace 6e41d690e611c338 ]---

References: https://gitlab.freedesktop.org/drm/intel/-/issues/8037
References: https://bugzilla.kernel.org/show_bug.cgi?id=196399
Acked-by: Martin Peres <martin.peres@linux.intel.com>
Cc: Martin Peres <martin.peres@linux.intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20170718082110.12524-1-daniel.vetter@ffwll.ch
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

Since the kernel now uses hashed pointers for raw addresses, it is very
hard to gauge the relative placement within a section, and since the
hash value will never match up with any contents, using it provides no
information relevant for slab debugging. Show the relative offset into
each section, so that some reference for the hexdump is provided.

References: https://gitlab.freedesktop.org/drm/intel/-/issues/8038
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

If the MSI is already enabled, trying to enable it again results in an
-EINVAL and on the first attempt a WARN. That WARN causes our CI to
abort the run [on each first attempt to suspend]:

<4> [463.142025] WARNING: CPU: 0 PID: 2225 at drivers/pci/msi.c:1074 __pci_enable_msi_range+0x3cb/0x420
<4> [463.142026] Modules linked in: snd_hda_intel i915 snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic mei_hdcp x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul snd_intel_dspcfg ghash_clmulni_intel snd_hda_codec btusb btrtl btbcm btintel e1000e bluetooth snd_hwdep snd_hda_core ptp ecdh_generic snd_pcm ecc pps_core mei_me mei prime_numbers [last unloaded: i915]
<4> [463.142045] CPU: 0 PID: 2225 Comm: kworker/u8:14 Tainted: G U 5.7.0-rc2-CI-CI_DRM_8350+ #1
<4> [463.142046] Hardware name: Intel Corporation NUC7i5BNH/NUC7i5BNB, BIOS BNKBL357.86A.0060.2017.1214.2013 12/14/2017
<4> [463.142049] Workqueue: events_unbound async_run_entry_fn
<4> [463.142051] RIP: 0010:__pci_enable_msi_range+0x3cb/0x420
<4> [463.142053] Code: 76 58 49 8d 56 48 48 89 df e8 31 73 fd ff e9 20 fe ff ff 31 f6 48 89 df e8 c2 e9 fd ff e9 d6 fe ff ff 45 89 fc e9 1a ff ff ff <0f> 0b 41 bc ea ff ff ff e9 0d ff ff ff 41 bc ea ff ff ff e9 02 ff
<4> [463.142054] RSP: 0018:ffffc90000593cd0 EFLAGS: 00010202
<4> [463.142056] RAX: 0000000000000010 RBX: ffff888274051000 RCX: 0000000000000000
<4> [463.142057] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff888274051000
<4> [463.142058] RBP: ffff888238aa1018 R08: 0000000000000001 R09: 0000000000000001
<4> [463.142060] R10: ffffc90000593d90 R11: 00000000c79cdfd5 R12: ffff8882740510b0
<4> [463.142061] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001
<4> [463.142062] FS: 0000000000000000(0000) GS:ffff888276c00000(0000) knlGS:0000000000000000
<4> [463.142064] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [463.142065] CR2: 000055706f347d80 CR3: 0000000005610003 CR4: 00000000003606f0
<4> [463.142066] Call Trace:
<4> [463.142073] pci_enable_msi+0x11/0x20
<4> [463.142077] azx_resume+0x1ab/0x200 [snd_hda_intel]
<4> [463.142080] ? pci_pm_thaw+0x80/0x80
<4> [463.142084] dpm_run_callback+0x64/0x280
<4> [463.142089] device_resume+0xd4/0x1c0
<4> [463.142093] ? dpm_watchdog_set+0x60/0

While this would appear to be a bug in snd-hda, it does appear
inconsequential, at least for gfx-ci. Downgrade the warning to an info,
like the other already-enabled error for MSI-X.

References: https://gitlab.freedesktop.org/drm/intel/-/issues/8041
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1687
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20200423082753.3899018-1-chris@chris-wilson.co.uk
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

References: https://gitlab.freedesktop.org/drm/intel/-/issues/8046
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2805
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

References: https://gitlab.freedesktop.org/drm/intel/-/issues/8047
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2874
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

In typical cases PCIe tunneling is needed to make the devices fully
usable for the host system. However, it poses a security issue because
they can also use DMA to access the host memory. We already have two
ways of preventing this, one an IOMMU that is enabled on recent systems
by default and the second is the "authorized" attribute under each
connected device that needs to be written by userspace before a PCIe
tunnel is created.

This option adds one more by adding a Kconfig option, which is enabled
by default, that can be used to make kernel binaries where PCIe
tunneling is completely disabled.

Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
References: https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_134314v1/bat-mtlp-9/boot0.txt
References: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/11261
Signed-off-by: Imre Deak <imre.deak@intel.com>
Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240604161618.1958674-1-imre.deak@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

This reverts commit 560af5d.

Locking in i915_pmu.c interacting with perf is completely wrong. It's
using spinlock_t everywhere when it should actually use raw_spinlock_t
since perf is already holding raw_spinlock in the caller. This started
to be checked with commit 560af5d ("lockdep: Enable
PROVE_RAW_LOCK_NESTING with PROVE_LOCKING."), but should only be a real
issue when PREEMPT_RT is enabled: in that config, the spinlock_t can
sleep and creates issues.

Reworking the locks in i915_pmu.c is not very simple as changing locks
to raw_spinlock_t cascades to too many locks, which is both a) not
desired from an RT perspective and b) hard to get right as it calls
into other parts of the driver that have other requirements.

Example backtrace:

<4> [141.043897] =============================
<4> [141.043922] [ BUG: Invalid wait context ]
<4> [141.043940] 6.13.0-rc2-CI_DRM_15820-g78bd7a249aa0+ #1 Not tainted
<4> [141.043964] -----------------------------
<4> [141.043981] swapper/0/0 is trying to lock:
<4> [141.044000] ffff88810861b910 (&pmu->lock){....}-{3:3}, at: i915_pmu_enable+0x48/0x3a0 [i915]
<4> [141.044194] other info that might help us debug this:
<4> [141.044217] context-{5:5}
<4> [141.044229] 1 lock held by swapper/0/0:
<4> [141.044248] #0: ffff88885f432038 (&cpuctx_lock){....}-{2:2}, at: __perf_install_in_context+0x3f/0x360
<4> [141.044297] stack backtrace:
<4> [141.044312] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.13.0-rc2-CI_DRM_15820-g78bd7a249aa0+ #1
<4> [141.044353] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P LP5x T3 RVP, BIOS MTLPFWI1.R00.3471.D91.2401310918 01/31/2024
<4> [141.044405] Call Trace:
<4> [141.044419] <TASK>
<4> [141.044431] dump_stack_lvl+0x91/0xf0
<4> [141.044454] dump_stack+0x10/0x20
<4> [141.044472] __lock_acquire+0x990/0x2820
<4> [141.044498] lock_acquire+0xc9/0x300
<4> [141.044518] ? i915_pmu_enable+0x48/0x3a0 [i915]
<4> [141.044689] _raw_spin_lock_irqsave+0x49/0x80
<4> [141.044713] ? i915_pmu_enable+0x48/0x3a0 [i915]
<4> [141.044903] i915_pmu_enable+0x48/0x3a0 [i915]
<4> [141.045112] ? __lock_acquire+0x455/0x2820
<4> [141.045142] i915_pmu_event_add+0x71/0x90 [i915]

More time is needed to get this fixed properly, but let's not pile
regressions on top.

Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241211121703.2890150-1-luciano.coelho@intel.com
[ Reword commit message, giving more detail on what the issue is ]
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
References: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/13311
Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

This patch re-enables D3Cold by default on BMG.

If issues on runtime_pm resume are seen and the D3cold->D0 transition
is suspected to block the device or cause memory corruptions, D3cold
can be disabled for confirmation with either:

1. at runtime: echo 0 > /sys/bus/pci/devices/<addr>/vram_d3cold_threshold
2. at boot: pcie_port_pm=off

Upon confirmation of D3Cold related bug, please file a bug to the link
below.

Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250308005636.1475420-2-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

# Conflicts:
#	include/drm/drm_kunit_helpers.h

# Conflicts:
#	drivers/gpu/drm/i915/display/intel_bw.c
#	drivers/gpu/drm/i915/display/intel_vblank.c
#	drivers/gpu/drm/i915/i915_drv.h
#	drivers/gpu/drm/xe/xe_device_types.h

# Conflicts:
#	drivers/gpu/drm/xe/xe_device.c
#	drivers/gpu/drm/xe/xe_guc_pc.c
#	drivers/gpu/drm/xe/xe_pci.c
#	drivers/gpu/drm/xe/xe_survivability_mode.c
#	drivers/gpu/drm/xe/xe_survivability_mode.h
#	drivers/gpu/drm/xe/xe_wa_oob.rules


ickle and others added 24 commits April 7, 2025 12:56

HAX net/phy: Suppress WARN for calling stop while halted

016489e

Merge remote-tracking branch 'drm-misc/drm-misc-fixes' into drm-tip

e58e53d

Merge remote-tracking branch 'drm-misc/drm-misc-next' into drm-tip

d3910dd

Merge remote-tracking branch 'drm-intel/drm-intel-next' into drm-tip

3bace4c

Merge remote-tracking branch 'drm-intel/drm-intel-gt-next' into drm-tip

8437a14

Merge remote-tracking branch 'drm-intel/topic/core-for-CI' into drm-tip

dc48b24

Merge remote-tracking branch 'drm-xe/topic/xe-for-CI' into drm-tip

7062d04

drm-tip: 2025y-04m-11d-11h-44m-23s UTC integration manifest

21c916e

sys-i915-oscijenkins force-pushed the drm-tip branch 6 times, most recently from ae8b802 to d161139 Compare April 18, 2025 16:35

sys-i915-oscijenkins force-pushed the drm-tip branch 30 times, most recently from 13b0995 to cdf42c8 Compare May 2, 2025 04:34
