Skip to content

Conversation

shiqingd
Copy link

@shiqingd shiqingd commented Feb 23, 2024

This PR include the 9 patches which are used for EDAC/igen6 Intel Meteor Lake-P SoCs support.
Five patches have already been upstreamed to Linux kernel community.

Upstream:
commit 7f5b45e
6807434 ("EDAC/igen6: Add Intel Meteor Lake-P SoCs support")

commit 4681a10
3c77090 ("EDAC/igen6: Add Intel Meteor Lake-PS SoCs support").

commit 72597e7
a264f71 ("EDAC/igen6: Make get_mchbar() helper function").

commit cfa3e9a
d23627a ("EDAC/igen6: Add Intel Raptor Lake-P SoCs support").

commit 8f077b8
c4a5398 ("EDAC/igen6: Add Intel Alder Lake-N SoCs support").

Four Patches have already been uploaded to github/linux-intel-lts public repo.

commit be12908
intel/linux-intel-lts@da1536bdff36 ("EDAC/igen6: Add registration APIs for In-Band ECC error notification")

commit 1673f29
intel/linux-intel-lts@39b1cbd8dc70 ("EDAC/ieh: Add I/O device EDAC support for Intel Tiger Lake-H SoC")

commit 15ecf04
intel/linux-intel-lts@9f721ed88580 ("EDAC/ieh: Add I/O device EDAC driver for Intel CPUs with IEH")

commit 22dbb4d
intel/linux-intel-lts@1afcf245df31 ("x86/mce: Add MCACOD code for generic I/O error")

Summary by Sourcery

Sync and integrate upstream EDAC driver improvements by adding support for new Intel SoCs (Meteor Lake-P/PS), introducing the IEH I/O device EDAC driver for Tiger Lake, and enhancing IBECC error handling with a notifier API and broader error address support

New Features:

  • Add EDAC/igen6 support for Intel Meteor Lake-P and Meteor Lake-PS SoCs
  • Introduce I/O device EDAC driver for Intel Integrated Error Handler on Tiger Lake platforms
  • Expose IBECC error notification API via a blocking notifier chain

Enhancements:

  • Refactor MCHBAR access with a dedicated helper and extend ECC error address extraction to support wider bitfields
  • Add MCACOD generic I/O error code and PCI segment/RID extraction macros in x86 MCE headers

Build:

  • Include ieh_edac driver in the EDAC build configuration

Errors of some I/O devices can be signaled by MCE and logged in
IOMCA bank. Add MCACOD code of generic I/O error and related macros
for MCi_MISC to support IOMCA logging.

See Intel Software Developers' Manual, version 071, volume 3B,
section "IOMCA".

Intel-SIG: x86/mce: Add MCACOD code for generic I/O error.
intel/linux-intel-lts@1afcf245df31

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
[ Qingdong Shi: amend commit log ]
Signed-off-by: Qingdong Shi <qingdong.shi@intel.com>
Integrated Error Handlers (IEHs) are PCIe devices which aggregate and
report error events of different severities (correctable, non-fatal
uncorrectable, and fatal uncorrectable) from various I/O devices, e.g.,
PCIe devices, legacy PCI devices. Each error severity is notified by
one of {SMI, NMI, MCE} which is configured by BIOS/platform firmware.

The first IEH-supported platform is Intel Tiger Lake-U CPU. The driver
reads/prints the error severity and error source (bus/device/function)
logged in the IEH(s) and restarts the system on fatal I/O device error.

Intel-SIG: EDAC/ieh: Add I/O device EDAC driver for Intel CPUs with IEH.
intel/linux-intel-lts@9f721ed88580

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
[ Qingdong Shi: amend commit log ]
Signed-off-by: Qingdong Shi <qingdong.shi@intel.com>
Tiger Lake-H SoC shares the same Integrated Error Handler(IEH) architecture
with Tiger Lake-U, so can use the same ieh_edac driver.

Add Tiger Lake-H IEH device ID for I/O device EDAC support.

Intel-SIG: EDAC/ieh: Add I/O device EDAC support for Intel Tiger Lake-H SoC.
intel/linux-intel-lts@39b1cbd8dc70

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
[ Qingdong Shi: amend commit log ]
Signed-off-by: Qingdong Shi <qingdong.shi@intel.com>
The igen6_edac driver is the root to capture the In-Band ECC error
event. There are some external modules which want to be notified about
the In-Band ECC errors for specific error handling. So add the
registration APIs for those external modules for the In-Band ECC errors.

Intel-SIG: EDAC/igen6: Add registration APIs for In-Band ECC error notification.
intel/linux-intel-lts@da1536bdff36

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
[ Qingdong Shi: amend commit log ]
Signed-off-by: Qingdong Shi <qingdong.shi@intel.com>
Mainline: commit c4a5398("EDAC/igen6: Add Intel Alder Lake-N SoCs support")
from: v6.8-rc1

Add several Intel Alder Lake-N SoC compute die IDs for EDAC support.
Alder Lake-N (one memory controller) is a cut-down derivative of
Alder Lake-P (two memory controllers).

Intel-SIG: Upstream commit c4a5398 ("EDAC/igen6: Add Intel Alder Lake-N SoCs support").

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
[ Qingdong Shi: amend commit log ]
Signed-off-by: Qingdong Shi <qingdong.shi@intel.com>
Mainline: commit d23627a("EDAC/igen6: Add Intel Raptor Lake-P SoCs support")
from: v6.8-rc1

Add several Intel Raptor Lake-P SoC compute die IDs for EDAC support.
These Raptor Lake-P SoCs use similar memory controller and IBECC as
Alder Lake-P SoC but extend the most significant bit of error address
logged in IBECC from bit 38 to bit 45.

Intel-SIG: Upstream commit d23627a ("EDAC/igen6: Add Intel Raptor Lake-P SoCs support").
upstream link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d23627a7688f&dt=2

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
[ Qingdong Shi: amend commit log ]
Signed-off-by: Qingdong Shi <qingdong.shi@intel.com>
Mainline: commit a264f71 ("EDAC/igen6: Make get_mchbar() helper function")
from: v6.8-rc1

Make get_mchbar() helper function to retrieve the BAR address of
the memory controller. No function changes.

Intel-SIG: Upstream commit a264f71 ("EDAC/igen6: Make get_mchbar() helper function").

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
[ Qingdong Shi: amend commit log ]
Signed-off-by: Qingdong Shi <qingdong.shi@intel.com>
Mainline: commit 3c77090 ("EDAC/igen6: Add Intel Meteor Lake-PS SoCs support")
from: v6.8-rc1

Add several Intel Meteor Lake-PS SoC compute die IDs for EDAC support.
These Meteor Lake-PS SoCs use similar memory controller and IBECC as
Alder Lake-P SoC.

Intel-SIG: Upstream commit 3c77090 ("EDAC/igen6: Add Intel Meteor Lake-PS SoCs support").

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
[ Qingdong Shi: amend commit log ]
Signed-off-by: Qingdong Shi <qingdong.shi@intel.com>
Mainline: commit 6807434 ("EDAC/igen6: Add Intel Meteor Lake-P SoCs support")
from: v6.8-rc1

Add several Intel Meteor Lake-P SoC compute die IDs for EDAC support.
These Meteor Lake-P SoCs use similar memory controller and IBECC as
Alder Lake-P SoC.

Intel-SIG: Upstream commit 6807434 ("EDAC/igen6: Add Intel Meteor Lake-P SoCs support").

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
[ Qingdong Shi: amend commit log ]
Signed-off-by: Qingdong Shi <qingdong.shi@intel.com>
@matrix-wsk matrix-wsk merged commit f4f7870 into deepin-community:linux-6.6.y Feb 23, 2024
opsiff pushed a commit to opsiff/UOS-kernel that referenced this pull request Jul 30, 2024
commit 667574e upstream.

When tries to demote 1G hugetlb folios, a lockdep warning is observed:

============================================
WARNING: possible recursive locking detected
6.10.0-rc6-00452-ga4d0275fa660-dirty deepin-community#79 Not tainted
--------------------------------------------
bash/710 is trying to acquire lock:
ffffffff8f0a7850 (&h->resize_lock){+.+.}-{3:3}, at: demote_store+0x244/0x460

but task is already holding lock:
ffffffff8f0a6f48 (&h->resize_lock){+.+.}-{3:3}, at: demote_store+0xae/0x460

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&h->resize_lock);
  lock(&h->resize_lock);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

4 locks held by bash/710:
 #0: ffff8f118439c3f0 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x64/0xe0
 #1: ffff8f11893b9e88 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xf8/0x1d0
 #2: ffff8f1183dc4428 (kn->active#98){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x100/0x1d0
 #3: ffffffff8f0a6f48 (&h->resize_lock){+.+.}-{3:3}, at: demote_store+0xae/0x460

stack backtrace:
CPU: 3 PID: 710 Comm: bash Not tainted 6.10.0-rc6-00452-ga4d0275fa660-dirty deepin-community#79
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x68/0xa0
 __lock_acquire+0x10f2/0x1ca0
 lock_acquire+0xbe/0x2d0
 __mutex_lock+0x6d/0x400
 demote_store+0x244/0x460
 kernfs_fop_write_iter+0x12c/0x1d0
 vfs_write+0x380/0x540
 ksys_write+0x64/0xe0
 do_syscall_64+0xb9/0x1d0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fa61db14887
RSP: 002b:00007ffc56c48358 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fa61db14887
RDX: 0000000000000002 RSI: 000055a030050220 RDI: 0000000000000001
RBP: 000055a030050220 R08: 00007fa61dbd1460 R09: 000000007fffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
R13: 00007fa61dc1b780 R14: 00007fa61dc17600 R15: 00007fa61dc16a00
 </TASK>

Lockdep considers this an AA deadlock because the different resize_lock
mutexes reside in the same lockdep class, but this is a false positive.
Place them in distinct classes to avoid these warnings.

Link: https://lkml.kernel.org/r/20240712031314.2570452-1-linmiaohe@huawei.com
Fixes: 8531fc6 ("hugetlb: add hugetlb demote page support")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Muchun Song <muchun.song@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit b10c359e644235bc1c1899981d7203cbf5abf443)
opsiff pushed a commit to opsiff/UOS-kernel that referenced this pull request Aug 1, 2024
commit 667574e upstream.

When tries to demote 1G hugetlb folios, a lockdep warning is observed:

============================================
WARNING: possible recursive locking detected
6.10.0-rc6-00452-ga4d0275fa660-dirty deepin-community#79 Not tainted
--------------------------------------------
bash/710 is trying to acquire lock:
ffffffff8f0a7850 (&h->resize_lock){+.+.}-{3:3}, at: demote_store+0x244/0x460

but task is already holding lock:
ffffffff8f0a6f48 (&h->resize_lock){+.+.}-{3:3}, at: demote_store+0xae/0x460

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&h->resize_lock);
  lock(&h->resize_lock);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

4 locks held by bash/710:
 #0: ffff8f118439c3f0 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x64/0xe0
 #1: ffff8f11893b9e88 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xf8/0x1d0
 #2: ffff8f1183dc4428 (kn->active#98){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x100/0x1d0
 #3: ffffffff8f0a6f48 (&h->resize_lock){+.+.}-{3:3}, at: demote_store+0xae/0x460

stack backtrace:
CPU: 3 PID: 710 Comm: bash Not tainted 6.10.0-rc6-00452-ga4d0275fa660-dirty deepin-community#79
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x68/0xa0
 __lock_acquire+0x10f2/0x1ca0
 lock_acquire+0xbe/0x2d0
 __mutex_lock+0x6d/0x400
 demote_store+0x244/0x460
 kernfs_fop_write_iter+0x12c/0x1d0
 vfs_write+0x380/0x540
 ksys_write+0x64/0xe0
 do_syscall_64+0xb9/0x1d0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fa61db14887
RSP: 002b:00007ffc56c48358 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fa61db14887
RDX: 0000000000000002 RSI: 000055a030050220 RDI: 0000000000000001
RBP: 000055a030050220 R08: 00007fa61dbd1460 R09: 000000007fffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
R13: 00007fa61dc1b780 R14: 00007fa61dc17600 R15: 00007fa61dc16a00
 </TASK>

Lockdep considers this an AA deadlock because the different resize_lock
mutexes reside in the same lockdep class, but this is a false positive.
Place them in distinct classes to avoid these warnings.

Link: https://lkml.kernel.org/r/20240712031314.2570452-1-linmiaohe@huawei.com
Fixes: 8531fc6 ("hugetlb: add hugetlb demote page support")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Muchun Song <muchun.song@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 60c223ed1f6f47ad36da6b4032e3b45cad8fb41f)
opsiff pushed a commit to opsiff/UOS-kernel that referenced this pull request Aug 4, 2024
commit 667574e upstream.

When tries to demote 1G hugetlb folios, a lockdep warning is observed:

============================================
WARNING: possible recursive locking detected
6.10.0-rc6-00452-ga4d0275fa660-dirty deepin-community#79 Not tainted
--------------------------------------------
bash/710 is trying to acquire lock:
ffffffff8f0a7850 (&h->resize_lock){+.+.}-{3:3}, at: demote_store+0x244/0x460

but task is already holding lock:
ffffffff8f0a6f48 (&h->resize_lock){+.+.}-{3:3}, at: demote_store+0xae/0x460

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&h->resize_lock);
  lock(&h->resize_lock);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

4 locks held by bash/710:
 #0: ffff8f118439c3f0 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x64/0xe0
 #1: ffff8f11893b9e88 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xf8/0x1d0
 #2: ffff8f1183dc4428 (kn->active#98){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x100/0x1d0
 #3: ffffffff8f0a6f48 (&h->resize_lock){+.+.}-{3:3}, at: demote_store+0xae/0x460

stack backtrace:
CPU: 3 PID: 710 Comm: bash Not tainted 6.10.0-rc6-00452-ga4d0275fa660-dirty deepin-community#79
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x68/0xa0
 __lock_acquire+0x10f2/0x1ca0
 lock_acquire+0xbe/0x2d0
 __mutex_lock+0x6d/0x400
 demote_store+0x244/0x460
 kernfs_fop_write_iter+0x12c/0x1d0
 vfs_write+0x380/0x540
 ksys_write+0x64/0xe0
 do_syscall_64+0xb9/0x1d0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fa61db14887
RSP: 002b:00007ffc56c48358 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fa61db14887
RDX: 0000000000000002 RSI: 000055a030050220 RDI: 0000000000000001
RBP: 000055a030050220 R08: 00007fa61dbd1460 R09: 000000007fffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
R13: 00007fa61dc1b780 R14: 00007fa61dc17600 R15: 00007fa61dc16a00
 </TASK>

Lockdep considers this an AA deadlock because the different resize_lock
mutexes reside in the same lockdep class, but this is a false positive.
Place them in distinct classes to avoid these warnings.

Link: https://lkml.kernel.org/r/20240712031314.2570452-1-linmiaohe@huawei.com
Fixes: 8531fc6 ("hugetlb: add hugetlb demote page support")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Muchun Song <muchun.song@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 99a49b6)
Avenger-285714 pushed a commit that referenced this pull request Aug 12, 2024
commit 667574e upstream.

When tries to demote 1G hugetlb folios, a lockdep warning is observed:

============================================
WARNING: possible recursive locking detected
6.10.0-rc6-00452-ga4d0275fa660-dirty #79 Not tainted
--------------------------------------------
bash/710 is trying to acquire lock:
ffffffff8f0a7850 (&h->resize_lock){+.+.}-{3:3}, at: demote_store+0x244/0x460

but task is already holding lock:
ffffffff8f0a6f48 (&h->resize_lock){+.+.}-{3:3}, at: demote_store+0xae/0x460

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&h->resize_lock);
  lock(&h->resize_lock);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

4 locks held by bash/710:
 #0: ffff8f118439c3f0 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x64/0xe0
 #1: ffff8f11893b9e88 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xf8/0x1d0
 #2: ffff8f1183dc4428 (kn->active#98){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x100/0x1d0
 #3: ffffffff8f0a6f48 (&h->resize_lock){+.+.}-{3:3}, at: demote_store+0xae/0x460

stack backtrace:
CPU: 3 PID: 710 Comm: bash Not tainted 6.10.0-rc6-00452-ga4d0275fa660-dirty #79
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x68/0xa0
 __lock_acquire+0x10f2/0x1ca0
 lock_acquire+0xbe/0x2d0
 __mutex_lock+0x6d/0x400
 demote_store+0x244/0x460
 kernfs_fop_write_iter+0x12c/0x1d0
 vfs_write+0x380/0x540
 ksys_write+0x64/0xe0
 do_syscall_64+0xb9/0x1d0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fa61db14887
RSP: 002b:00007ffc56c48358 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fa61db14887
RDX: 0000000000000002 RSI: 000055a030050220 RDI: 0000000000000001
RBP: 000055a030050220 R08: 00007fa61dbd1460 R09: 000000007fffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
R13: 00007fa61dc1b780 R14: 00007fa61dc17600 R15: 00007fa61dc16a00
 </TASK>

Lockdep considers this an AA deadlock because the different resize_lock
mutexes reside in the same lockdep class, but this is a false positive.
Place them in distinct classes to avoid these warnings.

Link: https://lkml.kernel.org/r/20240712031314.2570452-1-linmiaohe@huawei.com
Fixes: 8531fc6 ("hugetlb: add hugetlb demote page support")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Muchun Song <muchun.song@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 99a49b6)
@Avenger-285714
Copy link
Member

@sourcery-ai review

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR synchronizes EDAC driver updates for Intel Meteor Lake and related SoCs, adding in-band ECC error notification support (IBECC), new SKU configurations, and a new IEH EDAC driver.

  • Introduce notifier API for IBECC error handling
  • Extend igen6 driver with Meteor Lake-P/S SKU support and error address extraction
  • Add a new EDAC driver (ieh_edac) for Intel Integrated Error Handler devices, plus build and Kconfig entries
  • Update MCE definitions to include generic I/O error codes

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
drivers/edac/igen6_edac.h Declare IBECC notifier API and ibecc_err_info struct
drivers/edac/igen6_edac.c Implement IBECC notifier chain, new SKU configs, and helpers
drivers/edac/ieh_edac.c New IEH EDAC driver implementation
drivers/edac/Makefile Add ieh_edac.o to EDAC build targets
drivers/edac/Kconfig Add EDAC_IEH configuration entry
arch/x86/include/asm/mce.h Add MCACOD_IOERR and PCI misc bitfield macros
Comments suppressed due to low confidence (2)

drivers/edac/igen6_edac.c:85

  • [nitpick] The macro name 'ECC_ERROR_LOG_ADDR45' is unclear about its purpose; consider renaming to a descriptive name like 'ECC_ERROR_LOG_ADDR_WIDE' to indicate it extracts a wider address field.
#define ECC_ERROR_LOG_ADDR45(v)		GET_BITFIELD(v, 5, 45)

drivers/edac/igen6_edac.h:1

  • [nitpick] It would be helpful to add brief function comments for ibecc_err_register_notifier and ibecc_err_unregister_notifier to explain their usage and the notifier chain semantics.
/* SPDX-License-Identifier: GPL-2.0 */

Comment on lines +19 to +20
int ibecc_err_register_notifer(struct notifier_block *nb);
int ibecc_err_unregister_notifer(struct notifier_block *nb);
Copy link

Copilot AI Jul 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Function name 'ibecc_err_register_notifer' contains a typo: 'notifer' should be 'notifier' for consistency and clarity.

Suggested change
int ibecc_err_register_notifer(struct notifier_block *nb);
int ibecc_err_unregister_notifer(struct notifier_block *nb);
int ibecc_err_register_notifier(struct notifier_block *nb);
int ibecc_err_unregister_notifier(struct notifier_block *nb);

Copilot uses AI. Check for mistakes.

eaddr = ECC_ERROR_LOG_ADDR(node->ecclog) <<
ECC_ERROR_LOG_ADDR_SHIFT;
if (res_cfg->err_addr)
eaddr = res_cfg->err_addr(node->ecclog);
Copy link

Copilot AI Jul 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When using the err_addr callback you must apply the same left shift (<< ECC_ERROR_LOG_ADDR_SHIFT) after extracting the bitfield, or the computed physical address will be incorrect.

Suggested change
eaddr = res_cfg->err_addr(node->ecclog);
eaddr = res_cfg->err_addr(node->ecclog) <<
ECC_ERROR_LOG_ADDR_SHIFT;

Copilot uses AI. Check for mistakes.

#define DID_ADL_SKU4 0x4641

/* Compute die IDs for Alder Lake-N with IBECC */
#define DID_ADL_N_SKU1 0x4614
Copy link

Copilot AI Jul 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The many DID_* SKU definitions result in a long, repetitive block; consider grouping them in an array or using helper macros to simplify future updates.

Copilot uses AI. Check for mistakes.

Copy link

sourcery-ai bot commented Jul 15, 2025

Reviewer's Guide

Introduces IBECC error extraction and notifier support in the igen6_edac driver for new Intel SoCs (Meteor Lake-P/PS, Raptor Lake-P, Alder Lake-N), refactors MCHBAR access into a helper, extends the x86 MCE interface for generic I/O error decoding, and adds a dedicated IEH EDAC driver with build and Kconfig integration.

Sequence diagram for IBECC error notification via notifier chain

sequenceDiagram
    participant igen6_edac as igen6_edac driver
    participant ibecc_err_handler_chain as ibecc_err_handler_chain
    participant handler as External Handler (notifier_block)

    igen6_edac->>ibecc_err_handler_chain: blocking_notifier_call_chain(&ibecc_err_handler_chain, 0, &e)
    ibecc_err_handler_chain-->>handler: handler->notifier_call(&e)
    handler-->>ibecc_err_handler_chain: return NOTIFY_OK/NOTIFY_DONE
    ibecc_err_handler_chain-->>igen6_edac: return
Loading

Class diagram for new IBECC error notification and IEH EDAC driver integration

classDiagram
    class res_config {
        +u32 ibecc_base
        +u32 ibecc_error_log_offset
        +bool (*ibecc_available)(struct pci_dev *pdev)
        +u64 (*err_addr)(u64 ecclog)
        +u64 (*err_addr_to_sys_addr)(u64 eaddr, int mc)
        +u64 (*err_addr_to_imc_addr)(u64 eaddr, int mc)
    }
    class ibecc_err_info {
        +enum hw_event_mc_err_type type
        +u64 sys_addr
        +u64 ecc_log
    }
    class notifier_block {
        <<external>>
    }
    class igen6_edac {
        +int ibecc_err_register_notifer(struct notifier_block *nb)
        +int ibecc_err_unregister_notifer(struct notifier_block *nb)
        +static BLOCKING_NOTIFIER_HEAD(ibecc_err_handler_chain)
    }
    class ieh_dev {
        +struct list_head list
        +struct pci_dev *pdev
        +struct pci_sbdf sbdf
        +enum ieh_type type
        +u8 ver
        +enum evt_map corr_map
        +enum evt_map nonfatal_map
        +enum evt_map fatal_map
    }
    class ieh_config {
        +u16 did
        +enum action_on_fatal_err action
    }
    class decoded_res {
        +enum severity_level sev
        +struct pci_sbdf sbdf
    }
    igen6_edac --> ibecc_err_info : notifies
    igen6_edac --> notifier_block : registers
    res_config o-- igen6_edac : used by
    ieh_dev o-- ieh_config : uses
    ieh_dev o-- decoded_res : outputs error info
Loading

Class diagram for new IEH EDAC driver core structures

classDiagram
    class ieh_dev {
        +struct list_head list
        +struct pci_dev *pdev
        +struct pci_sbdf sbdf
        +enum ieh_type type
        +u8 ver
        +enum evt_map corr_map
        +enum evt_map nonfatal_map
        +enum evt_map fatal_map
    }
    class pci_sbdf {
        +u32 seg : 16
        +u32 bus : 8
        +u32 dev : 5
        +u32 fun : 3
    }
    class decoded_res {
        +enum severity_level sev
        +struct pci_sbdf sbdf
    }
    class ieh_config {
        +u16 did
        +enum action_on_fatal_err action
    }
    ieh_dev o-- pci_sbdf : has
    decoded_res o-- pci_sbdf : has
    ieh_dev o-- ieh_config : uses
Loading

Class diagram for new and updated PCI device ID configuration

classDiagram
    class res_config {
        +bool machine_check
        +int num_imc
        +u32 imc_base
        +u32 ibecc_base
        +u32 ibecc_error_log_offset
        +bool (*ibecc_available)(struct pci_dev *pdev)
        +u64 (*err_addr)(u64 ecclog)
        +u64 (*err_addr_to_sys_addr)(u64 eaddr, int mc)
        +u64 (*err_addr_to_imc_addr)(u64 eaddr, int mc)
    }
    class pci_device_id {
        +PCI_VDEVICE(INTEL, DID_*)
        +kernel_ulong_t driver_data (points to res_config)
    }
    pci_device_id o-- res_config : references
Loading

Class diagram for new MCE error code and helpers in x86 mce.h

classDiagram
    class mce {
        +u64 status
        +u64 misc
        +u64 addr
        +u64 mcgstatus
        +u64 mcgcap
        +u64 ip
        +u32 bank
        +u64 kflags
    }
    class mce_helpers {
        +#define MCACOD_IOERR 0x0e0b
        +#define MCI_MISC_PCISEG_MASK
        +#define MCI_MISC_PCISEG(m)
        +#define MCI_MISC_PCIRID_MASK
        +#define MCI_MISC_PCIRID(m)
    }
    mce_helpers <.. mce : used for decoding
Loading

File-Level Changes

Change Details Files
Enhance the igen6_edac driver with IBECC support and notification
  • Define new IBECC bitfields (CAPID_E_IBECC_BIT18, ECC_ERROR_LOG_ADDR45)
  • Add get_mchbar helper and SoC-specific ibecc_available/err_addr functions
  • Extend res_config with err_addr callback and create configs for ADL-N, RPL-P, MTL-PS, MTL-P SKUs
  • Update PCI ID table with new SKU entries
  • Integrate a blocking_notifier_chain and call it in igen6_output_error
drivers/edac/igen6_edac.c
drivers/edac/igen6_edac.h
Extend x86 MCE interface for generic I/O error handling
  • Introduce MCACOD_IOERR code constant
  • Add MCI_MISC_PCISEG and MCI_MISC_PCIRID macros to extract PCI IDs
arch/x86/include/asm/mce.h
Add a new IEH EDAC driver and integrate in build system
  • Implement drivers/edac/ieh_edac.c with PCI scanning, NMI/MCE handlers, and error dispatch logic
  • Add Makefile and Kconfig entries to build the IEH_EDAC module
drivers/edac/ieh_edac.c
drivers/edac/Makefile
drivers/edac/Kconfig

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @shiqingd - I've reviewed your changes - here's some feedback:

  • A corresponding Kconfig entry for CONFIG_EDAC_IEH is missing, please add it so that ieh_edac.o can be properly configured.
  • Duplicate MCHBAR reading code remains in igen6_pci_setup after introducing get_mchbar; consolidate them to avoid code duplication.
  • Blocking notifier calls are used in igen6_output_error which may run in atomic context; consider a non-blocking notifier or deferring to workqueue to prevent deadlocks.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- A corresponding Kconfig entry for CONFIG_EDAC_IEH is missing, please add it so that ieh_edac.o can be properly configured.
- Duplicate MCHBAR reading code remains in igen6_pci_setup after introducing get_mchbar; consolidate them to avoid code duplication.
- Blocking notifier calls are used in igen6_output_error which may run in atomic context; consider a non-blocking notifier or deferring to workqueue to prevent deadlocks.

## Individual Comments

### Comment 1
<location> `drivers/edac/igen6_edac.c:604` </location>
<code_context>

+static BLOCKING_NOTIFIER_HEAD(ibecc_err_handler_chain);
+
+int ibecc_err_register_notifer(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_register(&ibecc_err_handler_chain, nb);
+}
+EXPORT_SYMBOL_GPL(ibecc_err_register_notifer);
+
+int ibecc_err_unregister_notifer(struct notifier_block *nb)
</code_context>

<issue_to_address>
Typo in function name 'notifer' should be 'notifier'.

The typo appears in both the function names, the header file, and the EXPORT_SYMBOL_GPL macro. Please update all instances to 'notifier' for consistency.

Suggested implementation:

```c
int ibecc_err_register_notifier(struct notifier_block *nb)
{
	return blocking_notifier_chain_register(&ibecc_err_handler_chain, nb);
}
EXPORT_SYMBOL_GPL(ibecc_err_register_notifier);

int ibecc_err_unregister_notifier(struct notifier_block *nb)
{
	return blocking_notifier_chain_unregister(&ibecc_err_handler_chain, nb);
}
EXPORT_SYMBOL_GPL(ibecc_err_unregister_notifier);

```

If there is a header file (e.g., `drivers/edac/igen6_edac.h` or similar) that declares these functions, you should also update the function declarations there to use `notifier` instead of `notifer`.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment on lines +604 to +608
int ibecc_err_register_notifer(struct notifier_block *nb)
{
return blocking_notifier_chain_register(&ibecc_err_handler_chain, nb);
}
EXPORT_SYMBOL_GPL(ibecc_err_register_notifer);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (typo): Typo in function name 'notifer' should be 'notifier'.

The typo appears in both the function names, the header file, and the EXPORT_SYMBOL_GPL macro. Please update all instances to 'notifier' for consistency.

Suggested implementation:

int ibecc_err_register_notifier(struct notifier_block *nb)
{
	return blocking_notifier_chain_register(&ibecc_err_handler_chain, nb);
}
EXPORT_SYMBOL_GPL(ibecc_err_register_notifier);

int ibecc_err_unregister_notifier(struct notifier_block *nb)
{
	return blocking_notifier_chain_unregister(&ibecc_err_handler_chain, nb);
}
EXPORT_SYMBOL_GPL(ibecc_err_unregister_notifier);

If there is a header file (e.g., drivers/edac/igen6_edac.h or similar) that declares these functions, you should also update the function declarations there to use notifier instead of notifer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants