drm/xe: Take PM ref SVM copy to SRAM #29

Open
wants to merge 24 commits into
base: drm-tip

sys-i915-oscijenkins

From git@z Thu Jan 1 00:00:00 1970
Subject: [PATCH] drm/xe: Take PM ref SVM copy to SRAM
From: Matthew Brost <matthew.brost@intel.com>
Date: Wed, 16 Apr 2025 11:43:54 -0700
Message-Id: <20250416184354.419272-1-matthew.brost@intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit

It is possible, however unlikely, for the CPU to access memory that
currently resides on the GPU, triggering a fault while no PM reference
is held. Ensure a PM ref is held when doing an SVM copy to SRAM.

Fixes the below splat found in local testing:
[ 1269.500163] ------------[ cut here ]------------
[ 1269.500167] xe 0000:03:00.0: [drm] Missing outer runtime PM protection
[ 1269.500184] WARNING: CPU: 8 PID: 38648 at drivers/gpu/drm/xe/xe_pm.c:664 xe_pm_runtime_get_noresume+0x86/0xb0 [xe]
[ 1269.500226] Modules linked in: xe drm_gpusvm drm_gpuvm drm_ttm_helper ttm drm_exec gpu_sched drm_suballoc_helper drm_buddy drm_kms_helper snd_hda_codec_realtek snd_hda_codec_generic snd_hda_scodec_component snd_hda_intel snd_intel_dspcfg snd_hda_codec x86_pkg_temp_thermal snd_hwdep coretemp snd_hda_core i2c_i801 i2c_mux snd_pcm wmi_bmof i2c_smbus mei_pxp mei_hdcp video wmi mei_me mei fuse igb e1000e i2c_algo_bit ptp ghash_clmulni_intel pps_core intel_lpss_pci
[ 1269.500257] CPU: 8 UID: 0 PID: 38648 Comm: xe_exec_system_ Tainted: G        W           6.15.0-rc2-xe+ #158 PREEMPT(undef)
[ 1269.500260] Tainted: [W]=WARN
[ 1269.500261] Hardware name: Intel Corporation Raptor Lake Client Platform/RPL-S ADP-S DDR5 UDIMM CRB, BIOS RPLSFWI1.R00.3492.A00.2211291114 11/29/2022
[ 1269.500262] RIP: 0010:xe_pm_runtime_get_noresume+0x86/0xb0 [xe]
[ 1269.500293] Code: ee 31 c0 48 85 db 48 0f 44 f8 4c 8b 67 50 4d 85 e4 74 2e e8 6c 0b 9a e1 4c 89 e2 48 c7 c7 80 d5 4e a0 48 89 c6 e8 aa 51 11 e1 <0f> 0b eb c1 48 8b 47 08 f0 ff 80 f8 02 00 00 5b 41 5c c3 cc cc cc
[ 1269.500294] RSP: 0000:ffffc9000ed439c0 EFLAGS: 00010282
[ 1269.500297] RAX: 0000000000000000 RBX: ffff888113568000 RCX: 0000000000000000
[ 1269.500298] RDX: 0000000000000002 RSI: 0000000000000001 RDI: 00000000ffffffff
[ 1269.500299] RBP: ffff888111bdf600 R08: ffff88888d5fffe8 R09: 00000000fffdffff
[ 1269.500300] R10: ffff88888c800000 R11: ffff88888d300000 R12: ffff888103b3dd10
[ 1269.500301] R13: ffffc9000ed43a70 R14: ffff88813e5a52c0 R15: ffff88813e5a52c0
[ 1269.500302] FS: 00007f1e596a3940(0000) GS:ffff88890ac15000(0000) knlGS:0000000000000000
[ 1269.500304] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1269.500305] CR2: 0000563cf48c3ee8 CR3: 000000031055e001 CR4: 0000000000f70ef0
[ 1269.500306] PKRU: 55555554
[ 1269.500307] Call Trace:
[ 1269.500308]  <TASK>
[ 1269.500310] xe_sched_job_create+0x159/0x330 [xe]
[ 1269.500342] xe_bb_create_migration_job+0x7c/0x510 [xe]
[ 1269.500359] ? rcu_is_watching+0x11/0x50
[ 1269.500363] ? __kmalloc_cache_noprof+0x255/0x330
[ 1269.500366] ? xelp_pte_encode_addr+0x34/0x1d0 [xe]
[ 1269.500394] xe_migrate_vram+0x2c5/0x620 [xe]
[ 1269.500558] ? __iommu_dma_map+0x99/0x170
[ 1269.500569] xe_svm_copy+0x486/0x620 [xe]
[ 1269.500613] drm_gpusvm_migrate_to_ram+0x290/0x330 [drm_gpusvm]
[ 1269.500624] do_swap_page+0xff7/0x2440
[ 1269.500633] ? __pfx_default_wake_function+0x10/0x10
[ 1269.500640] ? rcu_is_watching+0x11/0x50
[ 1269.500646] __handle_mm_fault+0x617/0x950
[ 1269.500658] handle_mm_fault+0xbf/0x250
[ 1269.500664] do_user_addr_fault+0x177/0x6a0
[ 1269.500672] exc_page_fault+0x63/0x1c0
[ 1269.500678] asm_exc_page_fault+0x26/0x30
[ 1269.500681] RIP: 0033:0x7f1e5b8b1b0f
[ 1269.500684] Code: 15 00 49 8d 0c 1a 49 39 d4 49 89 4c 24 60 0f 95 c2 48 29 d8 0f b6 d2 48 83 c8 01 48 c1 e2 02 48 09 da 48 83 ca 01 49 89 52 08 <48> 89 41 08 49 8d 4a 10 eb af 48 8d 0d 78 ea 12 00 ba 64 10 00 00
[ 1269.500687] RSP: 002b:00007fff7ccb74c0 EFLAGS: 00010206
[ 1269.500690] RAX: 0000000000521121 RBX: 0000000000001010 RCX: 0000563cf48c3ee0
[ 1269.500692] RDX: 0000000000001011 RSI: ffffffffffffff20 RDI: 0000000000000000
[ 1269.500695] RBP: 00007fff7ccb7540 R08: 0000000000000000 R09: 0000000000000001
[ 1269.500697] R10: 0000563cf48c2ed0 R11: 0000000000000206 R12: 00007f1e5ba11ac0
[ 1269.500699] R13: 0000000000001000 R14: 0000000000000000 R15: 00007f1e5ba11b20
[ 1269.500708]  </TASK>
[ 1269.500710] irq event stamp: 176580299
[ 1269.500712] hardirqs last  enabled at (176580305): [<ffffffff813447f6>] __up_console_sem+0x66/0x70
[ 1269.500716] hardirqs last disabled at (176580310): [<ffffffff813447db>] __up_console_sem+0x4b/0x70
[ 1269.500719] softirqs last  enabled at (176580168): [<ffffffff812a637e>] __irq_exit_rcu+0xbe/0x110
[ 1269.500723] softirqs last disabled at (176579351): [<ffffffff812a637e>] __irq_exit_rcu+0xbe/0x110
[ 1269.500726] ---[ end trace 0000000000000000 ]---

Fixes: c5b3eb5 ("drm/xe: Add GPUSVM device memory copy vfunc functions")
Cc: stable@vger.kernel.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
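
The pattern described in this message, holding a PM reference for the whole duration of a copy back to system memory, can be sketched in user-space C. This is only an illustration under stated assumptions, not the driver change itself: `fake_device`, `pm_get`, `pm_put`, and `copy_to_sram` are hypothetical names invented here, and the refcount is a plain integer rather than the kernel's runtime PM machinery.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a device with a runtime PM usage count. */
struct fake_device {
	int pm_refcount;	/* > 0 means the device must stay awake */
};

static void pm_get(struct fake_device *dev)
{
	dev->pm_refcount++;
}

static void pm_put(struct fake_device *dev)
{
	assert(dev->pm_refcount > 0);
	dev->pm_refcount--;
}

/*
 * Copy device memory back to system memory. The reference is taken
 * before the device is touched and dropped on every exit path.
 * Returns 0 on success, -1 on invalid arguments.
 */
static int copy_to_sram(struct fake_device *dev, void *dst,
			const void *src, size_t len)
{
	int ret = 0;

	pm_get(dev);
	if (!dst || !src) {
		ret = -1;
		goto out;
	}
	/* A reference is held here, so the device cannot be suspended. */
	assert(dev->pm_refcount > 0);
	memcpy(dst, src, len);
out:
	pm_put(dev);
	return ret;
}
```

The single `out:` label keeps the get/put pair balanced even on the error path, which is the property the warning in the splat checks for.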

ickle and others added 24 commits April 7, 2025 12:56
As a lockmap takes a reference for every ww_mutex used together, this
can be an arbitrarily large number and under control of userspace --
easily overflowing the arbitrary limit of 4096. However, the pin_count
(used for detecting unexpected lock dropping) is a full 32b despite
nesting being extremely rare (see lockdep_pin_lock).

References: https://gitlab.freedesktop.org/drm/intel/-/issues/8028
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20190425092004.9995-33-chris@chris-wilson.co.uk
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
[Joonas: Converting to pin_count:11 as per addition of sync:1]
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
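
As a rough illustration of the trade-off in the message above, the counters can be pictured as bitfields sharing one 32-bit word: narrowing the rarely nested pin count leaves more bits for the reference count. Only `pin_count:11` and `sync:1` come from the note above; the struct name, the `references:20` width, and the helper are assumptions made for this sketch and do not reproduce the actual lockdep structure.

```c
#include <assert.h>

/* Hypothetical layout; only pin_count:11 and sync:1 come from the note. */
struct lock_counts {
	unsigned int references:20;	/* assumed width: the remaining bits */
	unsigned int sync:1;
	unsigned int pin_count:11;
};

/* Largest value each counter can hold before it would wrap. */
enum {
	MAX_REFERENCES = (1u << 20) - 1,
	MAX_PIN_COUNT  = (1u << 11) - 1,
};

/*
 * Take one more reference.
 * Returns 0 on success, -1 if the field would overflow.
 */
static int take_reference(struct lock_counts *c)
{
	if (c->references == MAX_REFERENCES)
		return -1;
	c->references++;
	return 0;
}
```

With these assumed widths, a lock map can exceed the old 4096 limit while the pin count still allows up to 2047 nested pins.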
We have recently turned on ftrace-dump-on-oops for i915's CI and an
issue we have encountered is that the trace buffer size greatly exceeds
the pstore capabilities; we get the tail of the oops but not the
introduction.

Currently the global buffer size is controllable on the cmdline, but at
the request of our CI sysadmin, we would like to add a control to the
Kconfig as well. The rationale being the cmdline carries the temporary
hacks that we want to eradicate, and we want to track the permanent
configuration in .config.

I have kept the Kconfig option hidden from the user as the default
should suffice for the majority of users; reserving the configuration
for those that eschew the cmdline option.

v2: Add an expert prompt to stop the default value overriding .config
changes.

References: https://gitlab.freedesktop.org/drm/intel/-/issues/8029
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Tomi Sarvela <tomi.p.sarvela@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Most systems keep the last messages from the panic, and we value the
stacktrace most, so dump it last in order to preserve it for
post-mortems.

References: https://gitlab.freedesktop.org/drm/intel/-/issues/8030
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Martin Peres <martin.peres@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180903131745.30593-1-chris@chris-wilson.co.uk
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Under CI testing, it is common for the cpus to overheat with the
continuous workloads and end up being throttled. As the cpus still
function, it is less of a critical error meriting urgent action, but an
expected yet significant condition (pr_notice).

References: https://gitlab.freedesktop.org/drm/intel/-/issues/8031
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Petri Latvala <petri.latvala@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
References: https://gitlab.freedesktop.org/drm/intel/-/issues/8032
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Petri Latvala <petri.latvala@intel.com>
[danvet: Rebase]
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
There's the hung_task_panic sysctl, but that's a bit of an extreme measure.
As a fallback, at least taint the machine.

Our CI uses this to decide when a reboot is necessary, plus to figure
out whether the kernel is still happy.

v2: Works much better when I put the else { add_taint() } at the right
place.

References: https://gitlab.freedesktop.org/drm/intel/-/issues/8034
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Paul E. McKenney" <paulmck@linux.ibm.com>
Cc: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: "Liu, Chuansheng" <chuansheng.liu@intel.com>
Acked-by: Chris Wilson <chris@chris-wilson.co.uk> (for core-for-CI)
Link: https://patchwork.freedesktop.org/patch/msgid/20190502204648.5537-1-daniel.vetter@ffwll.ch
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
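
The control flow described above (panic if the sysctl asks for it, otherwise fall back to tainting) can be modelled in a few lines of user-space C. This is a sketch under assumptions: `machine_state`, `hang_action`, and `report_hung_task` are hypothetical names, and the real code calls into the kernel's panic and taint facilities rather than setting a flag.

```c
#include <assert.h>
#include <stdbool.h>

enum hang_action { HANG_PANIC, HANG_TAINT };

/* Hypothetical model of the relevant machine state. */
struct machine_state {
	bool panic_on_hang;	/* models the hung_task_panic sysctl */
	bool tainted;		/* models the kernel taint flag */
};

/*
 * Decide what to do when a hung task is detected.
 * Panicking never returns in a real kernel, so the taint is only
 * applied on the fallback path.
 */
static enum hang_action report_hung_task(struct machine_state *m)
{
	if (m->panic_on_hang)
		return HANG_PANIC;

	m->tainted = true;
	return HANG_TAINT;
}
```

A CI harness can then poll the taint state to decide whether the machine needs a reboot, without requiring the more drastic panic setting.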
There are the soft/hardlockup_panic sysctls, but those are a bit of an
extreme measure. As a fallback, at least taint the machine.

Our CI uses this to decide when a reboot is necessary, plus to figure
out whether the kernel is still happy.

References: https://gitlab.freedesktop.org/drm/intel/-/issues/8035
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Vincent Whitchurch <vincent.whitchurch@axis.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Sinan Kaya <okaya@kernel.org>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Acked-by: Chris Wilson <chris@chris-wilson.co.uk> (for core-for-CI)
Link: https://patchwork.freedesktop.org/patch/msgid/20190502194208.3535-2-daniel.vetter@ffwll.ch
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
We can't allow spam in CI.

Update 26th June 2018: This is still an issue:
Update 23rd May 2019: You guessed it, still occurring.

[  224.739686] ------------[ cut here ]------------
[  224.739712] WARNING: CPU: 3 PID: 2982 at net/sched/sch_generic.c:461 dev_watchdog+0x1fd/0x210
[  224.739714] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_pcm i915 asix usbnet mii mei_me mei prime_numbers i2c_hid pinctrl_sunrisepoint pinctrl_intel btusb btrtl btbcm btintel bluetooth ecdh_generic
[  224.739775] CPU: 3 PID: 2982 Comm: gem_exec_suspen Tainted: G     U  W         4.18.0-rc2-CI-Patchwork_9414+ #1
[  224.739777] Hardware name: Dell Inc. XPS 13 9350/, BIOS 1.4.12 11/30/2016
[  224.739780] RIP: 0010:dev_watchdog+0x1fd/0x210
[  224.739781] Code: 49 63 4c 24 f0 eb 92 4c 89 ef c6 05 21 46 ad 00 01 e8 77 ee fc ff 89 d9 48 89 c2 4c 89 ee 48 c7 c7 88 4c 14 82 e8 a3 fe 84 ff <0f> 0b eb be 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 c7 47
[  224.739866] RSP: 0018:ffff88027dd83e40 EFLAGS: 00010286
[  224.739869] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000102
[  224.739871] RDX: 0000000080000102 RSI: ffffffff820c8c6c RDI: 00000000ffffffff
[  224.739873] RBP: ffff8802644c1540 R08: 0000000071be9b33 R09: 0000000000000000
[  224.739874] R10: ffff88027dd83dc0 R11: 0000000000000000 R12: ffff8802644c1588
[  224.739876] R13: ffff8802644c1160 R14: 0000000000000001 R15: ffff88026a5dc728
[  224.739878] FS:  00007f18f4887980(0000) GS:ffff88027dd80000(0000) knlGS:0000000000000000
[  224.739880] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  224.739881] CR2: 00007f4c627ae548 CR3: 000000022ca1a002 CR4: 00000000003606e0
[  224.739883] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  224.739885] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  224.739886] Call Trace:
[  224.739888]  <IRQ>
[  224.739892]  ? qdisc_reset+0xe0/0xe0
[  224.739894]  ? qdisc_reset+0xe0/0xe0
[  224.739897]  call_timer_fn+0x93/0x360
[  224.739903]  expire_timers+0xc1/0x1d0
[  224.739908]  run_timer_softirq+0xc7/0x170
[  224.739916]  __do_softirq+0xd9/0x505
[  224.739923]  irq_exit+0xa9/0xc0
[  224.739926]  smp_apic_timer_interrupt+0x9c/0x2d0
[  224.739929]  apic_timer_interrupt+0xf/0x20
[  224.739931]  </IRQ>
[  224.739934] RIP: 0010:delay_tsc+0x2e/0xb0
[  224.739936] Code: 49 89 fc 55 53 bf 01 00 00 00 e8 6d 2c 78 ff e8 88 9d b6 ff 41 89 c5 0f ae e8 0f 31 48 c1 e2 20 48 09 c2 48 89 d5 eb 16 f3 90 <bf> 01 00 00 00 e8 48 2c 78 ff e8 63 9d b6 ff 44 39 e8 75 36 0f ae
[  224.740021] RSP: 0018:ffffc900002f7d48 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
[  224.740024] RAX: 0000000080000000 RBX: 0000000649565ca9 RCX: 0000000000000001
[  224.740026] RDX: 0000000080000001 RSI: ffffffff820c8c6c RDI: 00000000ffffffff
[  224.740027] RBP: 00000006493ea9ce R08: 000000005e81e2ee R09: 0000000000000000
[  224.740029] R10: 0000000000000120 R11: 0000000000000000 R12: 00000000002ad8d6
[  224.740030] R13: 0000000000000003 R14: 0000000000000004 R15: ffff88025caf5408
[  224.740040]  ? delay_tsc+0x66/0xb0
[  224.740045]  hibernation_debug_sleep+0x1c/0x30
[  224.740048]  hibernation_snapshot+0x2c1/0x690
[  224.740053]  hibernate+0x142/0x2a4
[  224.740057]  state_store+0xd0/0xe0
[  224.740063]  kernfs_fop_write+0x104/0x190
[  224.740068]  __vfs_write+0x31/0x180
[  224.740072]  ? rcu_read_lock_sched_held+0x6f/0x80
[  224.740075]  ? rcu_sync_lockdep_assert+0x29/0x50
[  224.740078]  ? __sb_start_write+0x152/0x1f0
[  224.740080]  ? __sb_start_write+0x168/0x1f0
[  224.740084]  vfs_write+0xbd/0x1a0
[  224.740088]  ksys_write+0x50/0xc0
[  224.740094]  do_syscall_64+0x55/0x190
[  224.740097]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  224.740099] RIP: 0033:0x7f18f400a281
[  224.740100] Code: c3 0f 1f 84 00 00 00 00 00 48 8b 05 59 8d 20 00 c3 0f 1f 84 00 00 00 00 00 8b 05 8a d1 20 00 85 c0 75 16 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 57 f3 c3 0f 1f 44 00 00 41 54 55 49 89 d4 53
[  224.740186] RSP: 002b:00007fffd1f4fec8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  224.740189] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f18f400a281
[  224.740190] RDX: 0000000000000004 RSI: 00007f18f448069a RDI: 0000000000000006
[  224.740192] RBP: 00007fffd1f4fef0 R08: 0000000000000000 R09: 0000000000000000
[  224.740194] R10: 0000000000000000 R11: 0000000000000246 R12: 000055e795d03400
[  224.740195] R13: 00007fffd1f50500 R14: 0000000000000000 R15: 0000000000000000
[  224.740205] irq event stamp: 1582591
[  224.740207] hardirqs last  enabled at (1582590): [<ffffffff810f9f9c>] vprintk_emit+0x4bc/0x4d0
[  224.740210] hardirqs last disabled at (1582591): [<ffffffff81a0111c>] error_entry+0x7c/0x100
[  224.740212] softirqs last  enabled at (1582568): [<ffffffff81c0034f>] __do_softirq+0x34f/0x505
[  224.740215] softirqs last disabled at (1582571): [<ffffffff8108c959>] irq_exit+0xa9/0xc0
[  224.740218] WARNING: CPU: 3 PID: 2982 at net/sched/sch_generic.c:461 dev_watchdog+0x1fd/0x210
[  224.740219] ---[ end trace 6e41d690e611c338 ]---

References: https://gitlab.freedesktop.org/drm/intel/-/issues/8037
References: https://bugzilla.kernel.org/show_bug.cgi?id=196399
Acked-by: Martin Peres <martin.peres@linux.intel.com>
Cc: Martin Peres <martin.peres@linux.intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20170718082110.12524-1-daniel.vetter@ffwll.ch
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Since the kernel now uses hashed pointers for raw addresses, it is very
hard to gauge the relative placement within a section, and since the
hash value will never match up with any contents, using it provides no
information relevant for slab debugging. Show the relative offset into
each section, so that some reference for the hexdump is provided.

References: https://gitlab.freedesktop.org/drm/intel/-/issues/8038
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
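
To make the idea above concrete, here is a small user-space sketch of a hexdump row labelled with an offset relative to the start of its section instead of an address. It is not the slab debugging code; `format_row` and its output layout are hypothetical, and the section name is just a label passed in by the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format one hexdump row as "<section> <offset>: <bytes>".
 * The offset is relative to the start of the section, so rows can be
 * correlated with each other without printing a raw or hashed address.
 * Returns the number of characters needed (excluding the NUL).
 */
static int format_row(char *buf, size_t buflen, const char *section,
		      size_t offset, const unsigned char *data, size_t len)
{
	int n = snprintf(buf, buflen, "%s %08zx:", section, offset);

	/* Stop appending as soon as the buffer is full or an error occurs. */
	for (size_t i = 0; i < len && n > 0 && (size_t)n < buflen; i++)
		n += snprintf(buf + n, buflen - n, " %02x", data[i]);
	return n;
}
```

Because every row carries its own offset, two dumps of the same object can be compared line by line even though the addresses behind them differ.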
If the MSI is already enabled, trying to enable it again results in an
-EINVAL and on the first attempt a WARN. That WARN causes our CI to
abort the run [on each first attempt to suspend]:

<4> [463.142025] WARNING: CPU: 0 PID: 2225 at drivers/pci/msi.c:1074 __pci_enable_msi_range+0x3cb/0x420
<4> [463.142026] Modules linked in: snd_hda_intel i915 snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic mei_hdcp x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul snd_intel_dspcfg ghash_clmulni_intel snd_hda_codec btusb btrtl btbcm btintel e1000e bluetooth snd_hwdep snd_hda_core ptp ecdh_generic snd_pcm ecc pps_core mei_me mei prime_numbers [last unloaded: i915]
<4> [463.142045] CPU: 0 PID: 2225 Comm: kworker/u8:14 Tainted: G     U            5.7.0-rc2-CI-CI_DRM_8350+ #1
<4> [463.142046] Hardware name: Intel Corporation NUC7i5BNH/NUC7i5BNB, BIOS BNKBL357.86A.0060.2017.1214.2013 12/14/2017
<4> [463.142049] Workqueue: events_unbound async_run_entry_fn
<4> [463.142051] RIP: 0010:__pci_enable_msi_range+0x3cb/0x420
<4> [463.142053] Code: 76 58 49 8d 56 48 48 89 df e8 31 73 fd ff e9 20 fe ff ff 31 f6 48 89 df e8 c2 e9 fd ff e9 d6 fe ff ff 45 89 fc e9 1a ff ff ff <0f> 0b 41 bc ea ff ff ff e9 0d ff ff ff 41 bc ea ff ff ff e9 02 ff
<4> [463.142054] RSP: 0018:ffffc90000593cd0 EFLAGS: 00010202
<4> [463.142056] RAX: 0000000000000010 RBX: ffff888274051000 RCX: 0000000000000000
<4> [463.142057] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff888274051000
<4> [463.142058] RBP: ffff888238aa1018 R08: 0000000000000001 R09: 0000000000000001
<4> [463.142060] R10: ffffc90000593d90 R11: 00000000c79cdfd5 R12: ffff8882740510b0
<4> [463.142061] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001
<4> [463.142062] FS:  0000000000000000(0000) GS:ffff888276c00000(0000) knlGS:0000000000000000
<4> [463.142064] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [463.142065] CR2: 000055706f347d80 CR3: 0000000005610003 CR4: 00000000003606f0
<4> [463.142066] Call Trace:
<4> [463.142073]  pci_enable_msi+0x11/0x20
<4> [463.142077]  azx_resume+0x1ab/0x200 [snd_hda_intel]
<4> [463.142080]  ? pci_pm_thaw+0x80/0x80
<4> [463.142084]  dpm_run_callback+0x64/0x280
<4> [463.142089]  device_resume+0xd4/0x1c0
<4> [463.142093]  ? dpm_watchdog_set+0x60/0

While this would appear to be a bug in snd-hda, it does appear
inconsequential, at least for gfx-ci.

Downgrade the warning to an info, like the other already-enabled error
for MSI-X.

References: https://gitlab.freedesktop.org/drm/intel/-/issues/8041
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1687
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20200423082753.3899018-1-chris@chris-wilson.co.uk
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
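
The behaviour described above, where a second enable attempt still fails with -EINVAL but is reported quietly, can be sketched as follows. This is a user-space model with hypothetical names (`fake_pci_dev`, `fake_enable_msi`); the message text and the counter exist only for illustration and are not taken from the PCI core.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a PCI device. */
struct fake_pci_dev {
	bool msi_enabled;
	int info_messages;	/* counts informational log lines */
};

/*
 * Enable MSI once. A repeated attempt is rejected with -EINVAL and
 * logged at informational level instead of raising a warning.
 */
static int fake_enable_msi(struct fake_pci_dev *dev)
{
	if (dev->msi_enabled) {
		fprintf(stderr, "info: MSI already enabled\n");
		dev->info_messages++;
		return -EINVAL;
	}

	dev->msi_enabled = true;
	return 0;
}
```

The caller still sees the error code, so nothing about the failure is hidden; only the severity of the log line changes.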
References: https://gitlab.freedesktop.org/drm/intel/-/issues/8047
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2874
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
In typical cases PCIe tunneling is needed to make the devices fully
usable for the host system. However, it poses a security issue because
they can also use DMA to access the host memory. We already have two
ways of preventing this, one an IOMMU that is enabled on recent systems
by default and the second is the "authorized" attribute under each
connected device that needs to be written by userspace before a PCIe
tunnel is created. This option adds one more by adding a Kconfig option,
which is enabled by default, that can be used to make kernel binaries
where PCIe tunneling is completely disabled.

Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
References: https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_134314v1/bat-mtlp-9/boot0.txt
References: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/11261
Signed-off-by: Imre Deak <imre.deak@intel.com>
Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240604161618.1958674-1-imre.deak@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
This reverts commit 560af5d.

Locking in i915_pmu.c interacting with perf is completely wrong. It's
using spinlock_t everywhere when it should actually use raw_spinlock_t
since perf is already holding raw_spinlock in the caller. This started
to be checked with commit 560af5d ("lockdep: Enable
PROVE_RAW_LOCK_NESTING with PROVE_LOCKING."), but should only be a real
issue when PREEMPT_RT is enabled: in that config, the spinlock_t can
sleep, which creates issues.

Reworking the locks in i915_pmu.c is not very simple as changing locks
to raw_spinlock_t cascades to too many locks, which is both a) not
desired from an RT perspective and b) hard to get right as it calls into
other parts of the driver that have other requirements.

Example backtrace:

<4> [141.043897] =============================
<4> [141.043922] [ BUG: Invalid wait context ]
<4> [141.043940] 6.13.0-rc2-CI_DRM_15820-g78bd7a249aa0+ #1 Not tainted
<4> [141.043964] -----------------------------
<4> [141.043981] swapper/0/0 is trying to lock:
<4> [141.044000] ffff88810861b910 (&pmu->lock){....}-{3:3}, at: i915_pmu_enable+0x48/0x3a0 [i915]
<4> [141.044194] other info that might help us debug this:
<4> [141.044217] context-{5:5}
<4> [141.044229] 1 lock held by swapper/0/0:
<4> [141.044248]  #0: ffff88885f432038 (&cpuctx_lock){....}-{2:2}, at: __perf_install_in_context+0x3f/0x360
<4> [141.044297] stack backtrace:
<4> [141.044312] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.13.0-rc2-CI_DRM_15820-g78bd7a249aa0+ #1
<4> [141.044353] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P LP5x T3 RVP, BIOS MTLPFWI1.R00.3471.D91.2401310918 01/31/2024
<4> [141.044405] Call Trace:
<4> [141.044419]  <TASK>
<4> [141.044431]  dump_stack_lvl+0x91/0xf0
<4> [141.044454]  dump_stack+0x10/0x20
<4> [141.044472]  __lock_acquire+0x990/0x2820
<4> [141.044498]  lock_acquire+0xc9/0x300
<4> [141.044518]  ? i915_pmu_enable+0x48/0x3a0 [i915]
<4> [141.044689]  _raw_spin_lock_irqsave+0x49/0x80
<4> [141.044713]  ? i915_pmu_enable+0x48/0x3a0 [i915]
<4> [141.044903]  i915_pmu_enable+0x48/0x3a0 [i915]
<4> [141.045112]  ? __lock_acquire+0x455/0x2820
<4> [141.045142]  i915_pmu_event_add+0x71/0x90 [i915]

More time is needed to get this fixed properly, but let's not pile
regressions on top.

Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241211121703.2890150-1-luciano.coelho@intel.com
[ Reword commit message, giving more detail on what the issue is ]
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
References: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/13311
Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
This patch re-enables D3Cold by default on BMG.

If issues on runtime_pm resume are seen and the D3cold->D0 transition
is suspected to block the device or cause memory corruptions, D3cold
can be disabled for confirmation with either:

1. at runtime:
   echo 0 > /sys/bus/pci/devices/<addr>/vram_d3cold_threshold

2. at boot:
   pcie_port_pm=off

Upon confirmation of a D3Cold-related bug, please file a bug at the
link below.

Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250308005636.1475420-2-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
# Conflicts:
#	include/drm/drm_kunit_helpers.h
# Conflicts:
#	drivers/gpu/drm/i915/display/intel_bw.c
#	drivers/gpu/drm/i915/display/intel_vblank.c
#	drivers/gpu/drm/i915/i915_drv.h
#	drivers/gpu/drm/xe/xe_device_types.h
# Conflicts:
#	drivers/gpu/drm/xe/xe_device.c
#	drivers/gpu/drm/xe/xe_guc_pc.c
#	drivers/gpu/drm/xe/xe_pci.c
#	drivers/gpu/drm/xe/xe_survivability_mode.c
#	drivers/gpu/drm/xe/xe_survivability_mode.h
#	drivers/gpu/drm/xe/xe_wa_oob.rules
It is possible, however unlikely, for the CPU to access memory that
currently resides on the GPU, triggering a fault while no PM reference
is held. Ensure a PM ref is held when doing an SVM copy to SRAM.

Fixes the below splat found in local testing:
[ 1269.500163] ------------[ cut here ]------------
[ 1269.500167] xe 0000:03:00.0: [drm] Missing outer runtime PM protection
[ 1269.500184] WARNING: CPU: 8 PID: 38648 at drivers/gpu/drm/xe/xe_pm.c:664 xe_pm_runtime_get_noresume+0x86/0xb0 [xe]
[ 1269.500226] Modules linked in: xe drm_gpusvm drm_gpuvm drm_ttm_helper ttm drm_exec gpu_sched drm_suballoc_helper drm_buddy drm_kms_helper snd_hda_codec_realtek snd_hda_codec_generic snd_hda_scodec_component snd_hda_intel snd_intel_dspcfg snd_hda_codec x86_pkg_temp_thermal snd_hwdep coretemp snd_hda_core i2c_i801 i2c_mux snd_pcm wmi_bmof i2c_smbus mei_pxp mei_hdcp video wmi mei_me mei fuse igb e1000e i2c_algo_bit ptp ghash_clmulni_intel pps_core intel_lpss_pci
[ 1269.500257] CPU: 8 UID: 0 PID: 38648 Comm: xe_exec_system_ Tainted: G        W           6.15.0-rc2-xe+ #158 PREEMPT(undef)
[ 1269.500260] Tainted: [W]=WARN
[ 1269.500261] Hardware name: Intel Corporation Raptor Lake Client Platform/RPL-S ADP-S DDR5 UDIMM CRB, BIOS RPLSFWI1.R00.3492.A00.2211291114 11/29/2022
[ 1269.500262] RIP: 0010:xe_pm_runtime_get_noresume+0x86/0xb0 [xe]
[ 1269.500293] Code: ee 31 c0 48 85 db 48 0f 44 f8 4c 8b 67 50 4d 85 e4 74 2e e8 6c 0b 9a e1 4c 89 e2 48 c7 c7 80 d5 4e a0 48 89 c6 e8 aa 51 11 e1 <0f> 0b eb c1 48 8b 47 08 f0 ff 80 f8 02 00 00 5b 41 5c c3 cc cc cc
[ 1269.500294] RSP: 0000:ffffc9000ed439c0 EFLAGS: 00010282
[ 1269.500297] RAX: 0000000000000000 RBX: ffff888113568000 RCX: 0000000000000000
[ 1269.500298] RDX: 0000000000000002 RSI: 0000000000000001 RDI: 00000000ffffffff
[ 1269.500299] RBP: ffff888111bdf600 R08: ffff88888d5fffe8 R09: 00000000fffdffff
[ 1269.500300] R10: ffff88888c800000 R11: ffff88888d300000 R12: ffff888103b3dd10
[ 1269.500301] R13: ffffc9000ed43a70 R14: ffff88813e5a52c0 R15: ffff88813e5a52c0
[ 1269.500302] FS:  00007f1e596a3940(0000) GS:ffff88890ac15000(0000) knlGS:0000000000000000
[ 1269.500304] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1269.500305] CR2: 0000563cf48c3ee8 CR3: 000000031055e001 CR4: 0000000000f70ef0
[ 1269.500306] PKRU: 55555554
[ 1269.500307] Call Trace:
[ 1269.500308]  <TASK>
[ 1269.500310]  xe_sched_job_create+0x159/0x330 [xe]
[ 1269.500342]  xe_bb_create_migration_job+0x7c/0x510 [xe]
[ 1269.500359]  ? rcu_is_watching+0x11/0x50
[ 1269.500363]  ? __kmalloc_cache_noprof+0x255/0x330
[ 1269.500366]  ? xelp_pte_encode_addr+0x34/0x1d0 [xe]
[ 1269.500394]  xe_migrate_vram+0x2c5/0x620 [xe]
[ 1269.500558]  ? __iommu_dma_map+0x99/0x170
[ 1269.500569]  xe_svm_copy+0x486/0x620 [xe]
[ 1269.500613]  drm_gpusvm_migrate_to_ram+0x290/0x330 [drm_gpusvm]
[ 1269.500624]  do_swap_page+0xff7/0x2440
[ 1269.500633]  ? __pfx_default_wake_function+0x10/0x10
[ 1269.500640]  ? rcu_is_watching+0x11/0x50
[ 1269.500646]  __handle_mm_fault+0x617/0x950
[ 1269.500658]  handle_mm_fault+0xbf/0x250
[ 1269.500664]  do_user_addr_fault+0x177/0x6a0
[ 1269.500672]  exc_page_fault+0x63/0x1c0
[ 1269.500678]  asm_exc_page_fault+0x26/0x30
[ 1269.500681] RIP: 0033:0x7f1e5b8b1b0f
[ 1269.500684] Code: 15 00 49 8d 0c 1a 49 39 d4 49 89 4c 24 60 0f 95 c2 48 29 d8 0f b6 d2 48 83 c8 01 48 c1 e2 02 48 09 da 48 83 ca 01 49 89 52 08 <48> 89 41 08 49 8d 4a 10 eb af 48 8d 0d 78 ea 12 00 ba 64 10 00 00
[ 1269.500687] RSP: 002b:00007fff7ccb74c0 EFLAGS: 00010206
[ 1269.500690] RAX: 0000000000521121 RBX: 0000000000001010 RCX: 0000563cf48c3ee0
[ 1269.500692] RDX: 0000000000001011 RSI: ffffffffffffff20 RDI: 0000000000000000
[ 1269.500695] RBP: 00007fff7ccb7540 R08: 0000000000000000 R09: 0000000000000001
[ 1269.500697] R10: 0000563cf48c2ed0 R11: 0000000000000206 R12: 00007f1e5ba11ac0
[ 1269.500699] R13: 0000000000001000 R14: 0000000000000000 R15: 00007f1e5ba11b20
[ 1269.500708]  </TASK>
[ 1269.500710] irq event stamp: 176580299
[ 1269.500712] hardirqs last  enabled at (176580305): [<ffffffff813447f6>] __up_console_sem+0x66/0x70
[ 1269.500716] hardirqs last disabled at (176580310): [<ffffffff813447db>] __up_console_sem+0x4b/0x70
[ 1269.500719] softirqs last  enabled at (176580168): [<ffffffff812a637e>] __irq_exit_rcu+0xbe/0x110
[ 1269.500723] softirqs last disabled at (176579351): [<ffffffff812a637e>] __irq_exit_rcu+0xbe/0x110
[ 1269.500726] ---[ end trace 0000000000000000 ]---

Fixes: c5b3eb5 ("drm/xe: Add GPUSVM device memory copy vfunc functions")
Cc: stable@vger.kernel.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
@sys-i915-oscijenkins sys-i915-oscijenkins force-pushed the drm-tip branch 6 times, most recently from ae8b802 to d161139 Compare April 18, 2025 16:35
@sys-i915-oscijenkins sys-i915-oscijenkins force-pushed the drm-tip branch 30 times, most recently from 13b0995 to cdf42c8 Compare May 2, 2025 04:34