feat: allow for config refresh and npd reload #3

Open: wants to merge 9 commits into base dy/working

daveoy (Owner) commented Jul 22, 2025

This PR lets NPD reload when a log monitor's config is refreshed, by:

  • propagating a shared context via problemdaemon
  • setting exporters up to respect ctx.Done()
  • starting a refresh loop if configured to do so by the plugin's config
  • enabling live reload with --reload-on-config-change

daveoy (Owner, Author) commented Jul 22, 2025

sample logs from dev

I0722 15:00:04.032957      13 custom_plugin_monitor.go:80] Finish parsing custom plugin monitor config file /config/network-monitor.json: {Plugin:custom PluginGlobalConfig:{InvokeIntervalString:0xc00005f150 TimeoutString:0xc00005f160 InvokeInterval:1m0s Timeout:50s MaxOutputLength:0xc000015a98 Concurrency:0xc000015aa0 EnableMessageChangeBasedConditionUpdate:0x2fec7f1 SkipInitialStatus:0x2fec7f2} Source:network-monitor DefaultConditions:[{Type:DNSFailure Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:DNSIsOk Message:DNS lookups are working} {Type:ConnectivityFailure Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:ConnectionIsOk Message:Internal connectivity is working} {Type:APIFailure Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:ConnectionIsOk Message:Internal connectivity to K8S APIServer is working} {Type:PubFailure Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:ConnectionIsOk Message:Public Internet connectivity is working}] Rules:[0xc0005c15e0 0xc0005c1650 0xc0005c16c0 0xc0005c1730] EnableMetricsReporting:0xc000015aa8}
I0722 15:00:04.033195      13 custom_plugin_monitor.go:80] Finish parsing custom plugin monitor config file /config/pci-monitor.json: {Plugin:custom PluginGlobalConfig:{InvokeIntervalString:0xc00005f3d0 TimeoutString:0xc00005f3e0 InvokeInterval:20s Timeout:50s MaxOutputLength:0xc000015db8 Concurrency:0xc000015dc0 EnableMessageChangeBasedConditionUpdate:0xc000015dc8 SkipInitialStatus:0x2fec7f2} Source:pci-monitor DefaultConditions:[{Type:GPUPCIFault Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:PCIIsOk Message:VFIO PCI Connectivity is OK} {Type:InfiniBandLinkFault Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:LinkIsOk Message:InfiniBand Interfaces OK}] Rules:[0xc0005c1960 0xc0005c19d0 0xc0005c1a40] EnableMetricsReporting:0xc000015dc9}
I0722 15:00:04.034300      13 log_monitor.go:82] Finish parsing log monitor config file /config/kernel-kmsg.json: {WatcherConfig:{Plugin:kmsg PluginConfig:map[refresh:true refreshDurationSeconds:10 revive:true] SkipList:[] LogPath:/dev/kmsg Lookback:5m Delay:} BufferSize:1000 Source: DefaultConditions:[{Type:KernelDeadlock Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:Kernel has no deadlock} {Type:KernelHardlock Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoCPUHardLockup Message:Kernel has no CPU Hard Lockup} {Type:ReadonlyFilesystem Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:FilesystemIsNotReadOnly Message:Filesystem is not read-only} {Type:LocalDiskErrors Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoDiskErrors Message:Local NVMe is healthy} {Type:GPUWantsReset Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:GPUs are not reporting non-fatal errors} {Type:GPUChannelRetirement Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:GPUs are not reporting any channel retirement errors} {Type:GPUChannelRetirementFailure Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:No channels have failed retirement} {Type:GPURowRemapFailure Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:RowRemapOk Message:No rows have failed remapping} {Type:GPUECCUncorrectableError Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoECCError Message:No GPUs have triggered an ECC uncorrectable error} {Type:GPUInvalidPushBuffer Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:GPUs are not reporting invalid push buffer} {Type:GPUContextSwitchFault Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:GPUs are not reporting context switch fault} {Type:GPUFault Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:GPUs are not reporting errors} {Type:IBPCIUnavailable 
Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:IBPCIAvailable Message:InfiniBand adapters are not reporting PCI slot unavailable} {Type:GPUFallenOffBus Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No Xid 79 detected} {Type:GPUGSPTimeoutXid119 Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No Xid 119 detected} {Type:GPUContextSwitchTimeoutXid109 Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No Xid 109 detected} {Type:GPUGSPPanicXid120 Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No Xid 120 detected} {Type:PersistentStorageFault Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoStorageErrors Message:Storage subsystem is not reporting any errors} {Type:HardwareErrorFatal Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoHardwareErrorFatal Message:Platform is not reporting any fatal hardware errors} {Type:HardwareErrorInterruptCPU Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoInterruptsDetected Message:Platform is reporting no hardware errors via APEI for CPU} {Type:HardwareErrorInterruptMemory Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoInterruptsDetected Message:Platform is reporting no hardware errors via APEI for Memory} {Type:NVLinkXIDFatal Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No XID 144-150, 154-157 detected} {Type:NVLinkXIDFatalSwitch Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No XID 144-150, 154-157 detected from switch side} {Type:NVLinkMaskError Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMaskError Message:No GPUs are reporting link mask errors} {Type:Sector0LocalDiskErrors Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoSector0NVMEErrors Message:No nvme i/o errors detected on sector 0} {Type:SuspectedLocalDiskErrors Status: Transition:0001-01-01 00:00:00 +0000 UTC 
Reason:NoSuspectedNVMEErrors Message:No nvme i/o errors detected}] Rules:[{Type:temporary Condition: Reason:SystemOOMKilling Pattern:Out of memory: Killed process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:CGroupOOMKilling Pattern:Memory cgroup out of memory: Killed process \d+ (.*) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more than \w+ seconds\. PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:UnregisterNetDevice Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+ PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:KernelOops Pattern:BUG: unable to handle kernel NULL pointer dereference at .* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:KernelOops Pattern:divide error: 0000 \[#\d+\] SMP PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:HardwareErrorRecoverable Pattern:\[Hardware Error\]: event severity: recoverable PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:HardwareErrorCorrected Pattern:\[Hardware Error\]: event severity: corrected PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:HardwareErrorInfo Pattern:\[Hardware Error\]: event severity: info PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:PCIAER Pattern:AER: aer_status: .* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:NVSXidNonFatal Pattern:nvidia-nvswitch\d: SXid .* Non-fatal, .* PatternGeneratedMessageSuffix:} {Type:temporary Condition:NVLinkXIDNonfatal Reason: Pattern:NVRM: Xid \(PCI.+\): (14[4-9]|150|15[4-7]),.*Nonfatal.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:NVLinkXIDFatal Reason:NVSwitch XID indicates fatal error Pattern:NVRM: Xid \(PCI.+\): (14[4-9]|150|15[4-7]),.*Fatal.* \(0x0[0-1|3-9].* 
PatternGeneratedMessageSuffix:} {Type:permanent Condition:NVLinkXIDFatalSwitch Reason:NVSwitch XID indicates fatal error from switch side Pattern:NVRM: Xid \(PCI.+\): (14[4-9]|150|15[4-7]),.*Fatal.* \(0x02.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:NVLinkMaskError Reason:GPUs are reporting Link mask errors Pattern:NVRM: (knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask|NVRM: knvlinkDiscoverPostRxDetLinks.*: Getting peer..s postRxDetLinkMask failed).* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUWantsReset Reason:NVSwitch XID indicates GPU reset required Pattern:NVRM: Xid \(PCI.+\): (14[4-9]|150|15[4-7]),.*GPU Reset Required.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUChannelRetirement Reason:GPU is reporting memory channel retirement due to repeat uncorrectable errors Pattern:NVRM: Xid \(PCI.+\): (160),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUChannelRetirementFailure Reason:GPU is reporting memory channel retirement failure Pattern:NVRM: Xid \(PCI.+\): (161),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:KernelDeadlock Reason:CPUSoftLockup Pattern:watchdog: BUG: soft lockup .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:KernelHardlock Reason:CPUHardLockup Pattern:NMI watchdog: Watchdog detected hard LOCKUP .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Pattern:task umount\.aufs:\w+ blocked for more than \w+ seconds\. PatternGeneratedMessageSuffix:} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ seconds\. 
PatternGeneratedMessageSuffix:} {Type:permanent Condition:ReadonlyFilesystem Reason:FilesystemIsReadOnly Pattern:Remounting filesystem read-only PatternGeneratedMessageSuffix:} {Type:permanent Condition:LocalDiskErrors Reason:XFSCorruption Pattern:XFS \(((.{0,4})|(.+loop.*)|(loo[^p].*)|(lo[^o].*)|(l[^o].*))\).*Corruption detected.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:SuspectedLocalDiskErrors Reason:IOError Pattern:I/O error, dev nvme.*, sector (?:[^0].+) .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:Sector0LocalDiskErrors Reason:IOError Pattern:I/O error, dev nvme.*, sector (?:[0].+) .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:LocalDiskErrors Reason:IOError Pattern:nvme.+ I/O .+ timeout.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:LocalDiskErrors Reason:CriticalMediumError Pattern:critical medium error, dev nvme.+ PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFault Reason:CUDASegFault Pattern:cuda-.+ segfault at .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFault Reason:GPUIsReportingErrors Pattern:NVRM: Xid \(PCI.+\): (1|2|3|4|5|6|7|8|9|10|11|12|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|33|34|35|36|37|38|39|40|41|42|46|47|49|50|51|52|53|54|55|56|57|58|59|60|61|62|65|66|67|69|70|71|72|73|74|75|76|77|78|80|81|82|83|84|85|86|87|88|89|90|91|92|93|96|97|98|99|100|101|102|103|104|105|106|107|108|110|111|112|113|114|115|116|117|118),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFault Reason:GPUIsReportingErrors Pattern:NVRM: Rate limiting GSP RPC error prints for GPU at PCI:.+ \(printing .+ of every .+\).  
The GPU likely needs to be reset.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUInvalidPushBuffer Reason:GPUIsReportingInvalidPushBuffer Pattern:NVRM: Xid \(PCI.+\): 32,.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUContextSwitchFault Reason:GPUIsReportingContextSwitchFault Pattern:NVRM: Xid \(PCI.+\): 44,.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUWantsReset Reason:GPU Has a pending row remap Pattern:NVRM: Xid \(PCI.+\): (63),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPURowRemapFailure Reason:GPU Failed a row remap Pattern:NVRM: Xid \(PCI.+\): (64),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUECCUncorrectableError Reason:GPU has encountered an uncorrectable ECC error Pattern:NVRM: Xid \(PCI.+\): (48|94|95),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFallenOffBus Reason:A GPU has fallen off the bus Pattern:NVRM: Xid \(PCI.+\): (79),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUGSPTimeoutXid119 Reason:GPU System Processor is failing to respond, likely crashed or deadlocked Pattern:NVRM: Xid \(PCI.+\): (119),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUContextSwitchTimeoutXid109 Reason:GPU is reporting a Xid 109 Pattern:NVRM: Xid \(PCI.+\): (109),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUGSPPanicXid120 Reason:GPU is reporting a GSP task panic Pattern:NVRM: Xid \(PCI.+\): (120),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:NVSwitchFailure Reason:A NVSwitch has failed Pattern:nvidia-nvswitch.: SXid \(PCI.+\): (1900[4-6]|1901[3-7]|1904[6-8]|1905(3|4|6|8)|1906(0|1|3|4|6|7|9)|19070|20034|220(03|12)|2300[1-9]|2301[1-7]|2400[4-6]|2600[1-7]|2900(2|4)|3000(2|4)),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFault Reason:GPUSmiError Pattern:CW: GPU .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:IBPCIUnavailable Reason:PCILost Pattern:mlx5_core.*PCI slot is 
unavailable.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:PersistentStorageFault Reason:CephFSQuotaError Pattern:ceph: get_quota_realm: ino .+ PatternGeneratedMessageSuffix:} {Type:temporary Condition:NFSStorageFault Reason:NFSNotResponding Pattern:nfs: server .+ not responding.+ PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFault Reason:ROMError Pattern:.*Invalid PCI ROM header signature.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:HardwareErrorFatal Reason:HardwareErrorFatal Pattern:\[Hardware Error\]: event severity: fatal.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:HardwareErrorInterruptPCIe Pattern:\[Hardware Error\]:   section_type: PCIe error PatternGeneratedMessageSuffix:} {Type:permanent Condition:HardwareErrorInterruptCPU Reason:HardwareErrorFromAPEI Pattern:\[Hardware Error\]:   section_type: .* processor error PatternGeneratedMessageSuffix:} {Type:permanent Condition:HardwareErrorInterruptMemory Reason:HardwareErrorFromAPEI Pattern:\[Hardware Error\]:   section_type: memory error PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:HardwareErrorInterruptUnknown Pattern:\[Hardware Error\]:   section_type: unknown.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:KmsgWatchLoopStarted Pattern:\[npd-internal\] Entering watch loop for kernel log PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:KmsgParserRevived Pattern:\[npd-internal\] Reviving.*parser.* PatternGeneratedMessageSuffix:}] EnableMetricsReporting:0x2f38e62}
I0722 15:00:04.034773      13 log_watchers.go:40] Use log watcher of plugin "kmsg"
I0722 15:00:04.036723      13 log_monitor.go:82] Finish parsing log monitor config file /config/kernel-journald.json: {WatcherConfig:{Plugin:journald PluginConfig:map[source:kernel] SkipList:[] LogPath:/var/log/journal Lookback:5m Delay:} BufferSize:1000 Source: DefaultConditions:[{Type:KernelDeadlock Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:Kernel has no deadlock} {Type:KernelHardlock Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoCPUHardLockup Message:Kernel has no CPU Hard Lockup} {Type:ReadonlyFilesystem Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:FilesystemIsNotReadOnly Message:Filesystem is not read-only} {Type:LocalDiskErrors Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoDiskErrors Message:Local NVMe is healthy} {Type:GPUWantsReset Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:GPUs are not reporting non-fatal errors} {Type:GPUChannelRetirement Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:GPUs are not reporting any channel retirement errors} {Type:GPUChannelRetirementFailure Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:No channels have failed retirement} {Type:GPURowRemapFailure Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:RowRemapOk Message:No rows have failed remapping} {Type:GPUECCUncorrectableError Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoECCError Message:No GPUs have triggered an ECC uncorrectable error} {Type:GPUInvalidPushBuffer Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:GPUs are not reporting invalid push buffer} {Type:GPUContextSwitchFault Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:GPUs are not reporting context switch fault} {Type:GPUFault Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:GPUs are not reporting errors} {Type:IBPCIUnavailable Status: 
Transition:0001-01-01 00:00:00 +0000 UTC Reason:IBPCIAvailable Message:InfiniBand adapters are not reporting PCI slot unavailable} {Type:GPUFallenOffBus Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No Xid 79 detected} {Type:GPUGSPTimeoutXid119 Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No Xid 119 detected} {Type:GPUContextSwitchTimeoutXid109 Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No Xid 109 detected} {Type:GPUGSPPanicXid120 Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No Xid 120 detected} {Type:PersistentStorageFault Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoStorageErrors Message:Storage subsystem is not reporting any errors} {Type:HardwareErrorFatal Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoHardwareErrorFatal Message:Platform is not reporting any fatal hardware errors} {Type:HardwareErrorInterruptCPU Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoInterruptsDetected Message:Platform is reporting no hardware errors via APEI for CPU} {Type:HardwareErrorInterruptMemory Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoInterruptsDetected Message:Platform is reporting no hardware errors via APEI for Memory} {Type:NVLinkXIDFatal Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No XID 144-150, 154-157 detected} {Type:NVLinkXIDFatalSwitch Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No XID 144-150, 154-157 detected from switch side} {Type:NVLinkMaskError Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMaskError Message:No GPUs are reporting link mask errors} {Type:Sector0LocalDiskErrors Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoSector0NVMEErrors Message:No nvme i/o errors detected on sector 0} {Type:SuspectedLocalDiskErrors Status: Transition:0001-01-01 00:00:00 +0000 UTC 
Reason:NoSuspectedNVMEErrors Message:No nvme i/o errors detected}] Rules:[{Type:temporary Condition: Reason:SystemOOMKilling Pattern:Out of memory: Killed process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:CGroupOOMKilling Pattern:Memory cgroup out of memory: Killed process \d+ (.*) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more than \w+ seconds\. PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:UnregisterNetDevice Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+ PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:KernelOops Pattern:BUG: unable to handle kernel NULL pointer dereference at .* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:KernelOops Pattern:divide error: 0000 \[#\d+\] SMP PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:HardwareErrorRecoverable Pattern:\[Hardware Error\]: event severity: recoverable PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:HardwareErrorCorrected Pattern:\[Hardware Error\]: event severity: corrected PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:HardwareErrorInfo Pattern:\[Hardware Error\]: event severity: info PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:PCIAER Pattern:AER: aer_status: .* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:NVSXidNonFatal Pattern:nvidia-nvswitch\d: SXid .* Non-fatal, .* PatternGeneratedMessageSuffix:} {Type:temporary Condition:NVLinkXIDNonfatal Reason: Pattern:NVRM: Xid \(PCI.+\): (14[4-9]|150|15[4-7]),.*Nonfatal.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:NVLinkXIDFatal Reason:NVSwitch XID indicates fatal error Pattern:NVRM: Xid \(PCI.+\): (14[4-9]|150|15[4-7]),.*Fatal.* \(0x0[0-1|3-9].* 
PatternGeneratedMessageSuffix:} {Type:permanent Condition:NVLinkXIDFatalSwitch Reason:NVSwitch XID indicates fatal error from switch side Pattern:NVRM: Xid \(PCI.+\): (14[4-9]|150|15[4-7]),.*Fatal.* \(0x02.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:NVLinkMaskError Reason:GPUs are reporting Link mask errors Pattern:NVRM: (knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask|NVRM: knvlinkDiscoverPostRxDetLinks.*: Getting peer..s postRxDetLinkMask failed).* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUWantsReset Reason:NVSwitch XID indicates GPU reset required Pattern:NVRM: Xid \(PCI.+\): (14[4-9]|150|15[4-7]),.*GPU Reset Required.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUChannelRetirement Reason:GPU is reporting memory channel retirement due to repeat uncorrectable errors Pattern:NVRM: Xid \(PCI.+\): (160),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUChannelRetirementFailure Reason:GPU is reporting memory channel retirement failure Pattern:NVRM: Xid \(PCI.+\): (161),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:KernelDeadlock Reason:CPUSoftLockup Pattern:watchdog: BUG: soft lockup .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:KernelHardlock Reason:CPUHardLockup Pattern:NMI watchdog: Watchdog detected hard LOCKUP .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Pattern:task umount\.aufs:\w+ blocked for more than \w+ seconds\. PatternGeneratedMessageSuffix:} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ seconds\. 
PatternGeneratedMessageSuffix:} {Type:permanent Condition:ReadonlyFilesystem Reason:FilesystemIsReadOnly Pattern:Remounting filesystem read-only PatternGeneratedMessageSuffix:} {Type:permanent Condition:LocalDiskErrors Reason:XFSCorruption Pattern:XFS \(((.{0,4})|(.+loop.*)|(loo[^p].*)|(lo[^o].*)|(l[^o].*))\).*Corruption detected.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:SuspectedLocalDiskErrors Reason:IOError Pattern:I/O error, dev nvme.*, sector (?:[^0].+) .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:Sector0LocalDiskErrors Reason:IOError Pattern:I/O error, dev nvme.*, sector (?:[0].+) .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:LocalDiskErrors Reason:IOError Pattern:nvme.+ I/O .+ timeout.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:LocalDiskErrors Reason:CriticalMediumError Pattern:critical medium error, dev nvme.+ PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFault Reason:CUDASegFault Pattern:cuda-.+ segfault at .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFault Reason:GPUIsReportingErrors Pattern:NVRM: Xid \(PCI.+\): (1|2|3|4|5|6|7|8|9|10|11|12|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|33|34|35|36|37|38|39|40|41|42|46|47|49|50|51|52|53|54|55|56|57|58|59|60|61|62|65|66|67|69|70|71|72|73|74|75|76|77|78|80|81|82|83|84|85|86|87|88|89|90|91|92|93|96|97|98|99|100|101|102|103|104|105|106|107|108|110|111|112|113|114|115|116|117|118),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFault Reason:GPUIsReportingErrors Pattern:NVRM: Rate limiting GSP RPC error prints for GPU at PCI:.+ \(printing .+ of every .+\).  
The GPU likely needs to be reset.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUInvalidPushBuffer Reason:GPUIsReportingInvalidPushBuffer Pattern:NVRM: Xid \(PCI.+\): 32,.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUContextSwitchFault Reason:GPUIsReportingContextSwitchFault Pattern:NVRM: Xid \(PCI.+\): 44,.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUWantsReset Reason:GPU Has a pending row remap Pattern:NVRM: Xid \(PCI.+\): (63),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPURowRemapFailure Reason:GPU Failed a row remap Pattern:NVRM: Xid \(PCI.+\): (64),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUECCUncorrectableError Reason:GPU has encountered an uncorrectable ECC error Pattern:NVRM: Xid \(PCI.+\): (48|94|95),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFallenOffBus Reason:A GPU has fallen off the bus Pattern:NVRM: Xid \(PCI.+\): (79),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUGSPTimeoutXid119 Reason:GPU System Processor is failing to respond, likely crashed or deadlocked Pattern:NVRM: Xid \(PCI.+\): (119),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUContextSwitchTimeoutXid109 Reason:GPU is reporting a Xid 109 Pattern:NVRM: Xid \(PCI.+\): (109),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUGSPPanicXid120 Reason:GPU is reporting a GSP task panic Pattern:NVRM: Xid \(PCI.+\): (120),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:NVSwitchFailure Reason:A NVSwitch has failed Pattern:nvidia-nvswitch.: SXid \(PCI.+\): (1900[4-6]|1901[3-7]|1904[6-8]|1905(3|4|6|8)|1906(0|1|3|4|6|7|9)|19070|20034|220(03|12)|2300[1-9]|2301[1-7]|2400[4-6]|2600[1-7]|2900(2|4)|3000(2|4)),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFault Reason:GPUSmiError Pattern:CW: GPU .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:IBPCIUnavailable Reason:PCILost Pattern:mlx5_core.*PCI slot is 
unavailable.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:PersistentStorageFault Reason:CephFSQuotaError Pattern:ceph: get_quota_realm: ino .+ PatternGeneratedMessageSuffix:} {Type:temporary Condition:NFSStorageFault Reason:NFSNotResponding Pattern:nfs: server .+ not responding.+ PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFault Reason:ROMError Pattern:.*Invalid PCI ROM header signature.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:HardwareErrorFatal Reason:HardwareErrorFatal Pattern:\[Hardware Error\]: event severity: fatal.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:HardwareErrorInterruptPCIe Pattern:\[Hardware Error\]:   section_type: PCIe error PatternGeneratedMessageSuffix:} {Type:permanent Condition:HardwareErrorInterruptCPU Reason:HardwareErrorFromAPEI Pattern:\[Hardware Error\]:   section_type: .* processor error PatternGeneratedMessageSuffix:} {Type:permanent Condition:HardwareErrorInterruptMemory Reason:HardwareErrorFromAPEI Pattern:\[Hardware Error\]:   section_type: memory error PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:HardwareErrorInterruptUnknown Pattern:\[Hardware Error\]:   section_type: unknown.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:JournaldKernelWatchLoopStarted Pattern:\[npd-internal\] Entering journald watch loop.*kernel.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:JournaldKernelFailedToGetNextEntry Pattern:\[npd-internal\] Failed to get next journald entry.*kernel.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:JournaldKernelFailedToGetEntry Pattern:\[npd-internal\] Failed to get journald entry.*kernel.* PatternGeneratedMessageSuffix:}] EnableMetricsReporting:0x2f38e62}
I0722 15:00:04.036986      13 log_watchers.go:40] Use log watcher of plugin "journald"
I0722 15:00:04.037616      13 log_monitor.go:82] Finish parsing log monitor config file /config/docker-monitor.json: {WatcherConfig:{Plugin:journald PluginConfig:map[source:dockerd] SkipList:[] LogPath:/var/log/journal Lookback:5m Delay:} BufferSize:10 Source:docker-monitor DefaultConditions:[] Rules:[{Type:temporary Condition: Reason:CorruptDockerImage Pattern:Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/(.+) /var/lib/docker/image/(.+): directory not empty.* PatternGeneratedMessageSuffix:}] EnableMetricsReporting:0x2f38e62}
I0722 15:00:04.037641      13 log_watchers.go:40] Use log watcher of plugin "journald"
I0722 15:00:04.037828      13 log_monitor.go:82] Finish parsing log monitor config file /config/kubelet.json: {WatcherConfig:{Plugin:journald PluginConfig:map[source:kubelet] SkipList:[] LogPath:/var/log/journal Lookback:5m Delay:} BufferSize:10 Source: DefaultConditions:[{Type:RunContainerError Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoRunContainerError Message:No RunContainerErrors present} {Type:KillContainerFailed Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoKillContainerFailed Message:No KillContainerFailed Errors present}] Rules:[{Type:temporary Condition: Reason:JournaldKubeletWatchLoopStarted Pattern:\[npd-internal\] Entering journald watch loop.*kubelet.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:JournaldKubeletFailedToGetNextEntry Pattern:\[npd-internal\] Failed to get next journald entry.*kubelet.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:JournaldKubeletFailedToGetEntry Pattern:\[npd-internal\] Failed to get journald entry.*kubelet.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:RunContainerError Reason:ContextDeadlineExceeded Pattern:.*rror syncing pod.*RunContainerError.*context deadline exceeded.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:KillContainerFailed Reason:FailedToKillHPCVerificationContainer Pattern:.*ill container failed.*hpc-verification.* PatternGeneratedMessageSuffix:}] EnableMetricsReporting:0x2f38e62}
I0722 15:00:04.037869      13 log_watchers.go:40] Use log watcher of plugin "journald"
I0722 15:00:04.038384      13 k8s_exporter.go:56] Waiting for kube-apiserver to be ready (timeout 5m0s)...
I0722 15:00:04.044811      13 problem_client.go:128] Deleting deprecated conditions [GPUApplicationError GPUMMUErrorXid31 HardwareErrorInterruptPCIe HardwareErrorInterruptUnknown] (if present)...
I0722 15:00:04.045761      13 problem_client.go:159] No deprecated conditions to delete
I0722 15:00:04.045784      13 node_problem_detector.go:59] K8s exporter started.
I0722 15:00:04.045978      13 node_problem_detector.go:63] Prometheus exporter started.
I0722 15:00:04.045993      13 custom_plugin_monitor.go:111] Start custom plugin monitor /config/network-monitor.json
I0722 15:00:04.046001      13 custom_plugin_monitor.go:111] Start custom plugin monitor /config/pci-monitor.json
I0722 15:00:04.046018      13 log_monitor.go:166] Start log monitor /config/kernel-kmsg.json
I0722 15:00:04.046190      13 custom_plugin_monitor.go:312] Initialized conditions for /config/network-monitor.json: [{Type:DNSFailure Status:False Transition:2025-07-22 15:00:04.046164149 +0000 UTC m=+0.032493067 Reason:DNSIsOk Message:DNS lookups are working} {Type:ConnectivityFailure Status:False Transition:2025-07-22 15:00:04.046164389 +0000 UTC m=+0.032493287 Reason:ConnectionIsOk Message:Internal connectivity is working} {Type:APIFailure Status:False Transition:2025-07-22 15:00:04.046164459 +0000 UTC m=+0.032493357 Reason:ConnectionIsOk Message:Internal connectivity to K8S APIServer is working} {Type:PubFailure Status:False Transition:2025-07-22 15:00:04.046164519 +0000 UTC m=+0.032493417 Reason:ConnectionIsOk Message:Public Internet connectivity is working}]
I0722 15:00:04.046296      13 custom_plugin_monitor.go:301] Sending initial status for network-monitor with conditions: [{Type:DNSFailure Status:False Transition:2025-07-22 15:00:04.046164149 +0000 UTC m=+0.032493067 Reason:DNSIsOk Message:DNS lookups are working} {Type:ConnectivityFailure Status:False Transition:2025-07-22 15:00:04.046164389 +0000 UTC m=+0.032493287 Reason:ConnectionIsOk Message:Internal connectivity is working} {Type:APIFailure Status:False Transition:2025-07-22 15:00:04.046164459 +0000 UTC m=+0.032493357 Reason:ConnectionIsOk Message:Internal connectivity to K8S APIServer is working} {Type:PubFailure Status:False Transition:2025-07-22 15:00:04.046164519 +0000 UTC m=+0.032493417 Reason:ConnectionIsOk Message:Public Internet connectivity is working}]
I0722 15:00:04.046390      13 log_monitor.go:174] Log monitor /config/kernel-kmsg.json is configured to refresh periodically
I0722 15:00:04.046407      13 log_monitor.go:166] Start log monitor /config/kernel-journald.json
E0722 15:00:04.046444      13 problem_detector.go:57] Failed to start monitor &{/config/kernel-journald.json 0xc000052090 0xc0005ea940 {{journald map[source:kernel] [] /var/log/journal 5m } 1000  [{KernelDeadlock  {0 0 <nil>} KernelHasNoDeadlock Kernel has no deadlock} {KernelHardlock  {0 0 <nil>} NoCPUHardLockup Kernel has no CPU Hard Lockup} {ReadonlyFilesystem  {0 0 <nil>} FilesystemIsNotReadOnly Filesystem is not read-only} {LocalDiskErrors  {0 0 <nil>} NoDiskErrors Local NVMe is healthy} {GPUWantsReset  {0 0 <nil>} NoGPUErrors GPUs are not reporting non-fatal errors} {GPUChannelRetirement  {0 0 <nil>} NoGPUErrors GPUs are not reporting any channel retirement errors} {GPUChannelRetirementFailure  {0 0 <nil>} NoGPUErrors No channels have failed retirement} {GPURowRemapFailure  {0 0 <nil>} RowRemapOk No rows have failed remapping} {GPUECCUncorrectableError  {0 0 <nil>} NoECCError No GPUs have triggered an ECC uncorrectable error} {GPUInvalidPushBuffer  {0 0 <nil>} NoGPUErrors GPUs are not reporting invalid push buffer} {GPUContextSwitchFault  {0 0 <nil>} NoGPUErrors GPUs are not reporting context switch fault} {GPUFault  {0 0 <nil>} NoGPUErrors GPUs are not reporting errors} {IBPCIUnavailable  {0 0 <nil>} IBPCIAvailable InfiniBand adapters are not reporting PCI slot unavailable} {GPUFallenOffBus  {0 0 <nil>} NoMatchingXid No Xid 79 detected} {GPUGSPTimeoutXid119  {0 0 <nil>} NoMatchingXid No Xid 119 detected} {GPUContextSwitchTimeoutXid109  {0 0 <nil>} NoMatchingXid No Xid 109 detected} {GPUGSPPanicXid120  {0 0 <nil>} NoMatchingXid No Xid 120 detected} {PersistentStorageFault  {0 0 <nil>} NoStorageErrors Storage subsystem is not reporting any errors} {HardwareErrorFatal  {0 0 <nil>} NoHardwareErrorFatal Platform is not reporting any fatal hardware errors} {HardwareErrorInterruptCPU  {0 0 <nil>} NoInterruptsDetected Platform is reporting no hardware errors via APEI for CPU} {HardwareErrorInterruptMemory  {0 0 <nil>} NoInterruptsDetected Platform is reporting no 
hardware errors via APEI for Memory} {NVLinkXIDFatal  {0 0 <nil>} NoMatchingXid No XID 144-150, 154-157 detected} {NVLinkXIDFatalSwitch  {0 0 <nil>} NoMatchingXid No XID 144-150, 154-157 detected from switch side} {NVLinkMaskError  {0 0 <nil>} NoMaskError No GPUs are reporting link mask errors} {Sector0LocalDiskErrors  {0 0 <nil>} NoSector0NVMEErrors No nvme i/o errors detected on sector 0} {SuspectedLocalDiskErrors  {0 0 <nil>} NoSuspectedNVMEErrors No nvme i/o errors detected}] [{temporary  SystemOOMKilling Out of memory: Killed process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB.* } {temporary  CGroupOOMKilling Memory cgroup out of memory: Killed process \d+ (.*) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB.* } {temporary  TaskHung task \S+:\w+ blocked for more than \w+ seconds\. } {temporary  UnregisterNetDevice unregister_netdevice: waiting for \w+ to become free. Usage count = \d+ } {temporary  KernelOops BUG: unable to handle kernel NULL pointer dereference at .* } {temporary  KernelOops divide error: 0000 \[#\d+\] SMP } {temporary  HardwareErrorRecoverable \[Hardware Error\]: event severity: recoverable } {temporary  HardwareErrorCorrected \[Hardware Error\]: event severity: corrected } {temporary  HardwareErrorInfo \[Hardware Error\]: event severity: info } {temporary  PCIAER AER: aer_status: .* } {temporary  NVSXidNonFatal nvidia-nvswitch\d: SXid .* Non-fatal, .* } {temporary NVLinkXIDNonfatal  NVRM: Xid \(PCI.+\): (14[4-9]|150|15[4-7]),.*Nonfatal.* } {permanent NVLinkXIDFatal NVSwitch XID indicates fatal error NVRM: Xid \(PCI.+\): (14[4-9]|150|15[4-7]),.*Fatal.* \(0x0[0-1|3-9].* } {permanent NVLinkXIDFatalSwitch NVSwitch XID indicates fatal error from switch side NVRM: Xid \(PCI.+\): (14[4-9]|150|15[4-7]),.*Fatal.* \(0x02.* } {permanent NVLinkMaskError GPUs are reporting Link mask errors NVRM: (knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask|NVRM: knvlinkDiscoverPostRxDetLinks.*: Getting peer..s 
postRxDetLinkMask failed).* } {permanent GPUWantsReset NVSwitch XID indicates GPU reset required NVRM: Xid \(PCI.+\): (14[4-9]|150|15[4-7]),.*GPU Reset Required.* } {permanent GPUChannelRetirement GPU is reporting memory channel retirement due to repeat uncorrectable errors NVRM: Xid \(PCI.+\): (160),.* } {permanent GPUChannelRetirementFailure GPU is reporting memory channel retirement failure NVRM: Xid \(PCI.+\): (161),.* } {permanent KernelDeadlock CPUSoftLockup watchdog: BUG: soft lockup .* } {permanent KernelHardlock CPUHardLockup NMI watchdog: Watchdog detected hard LOCKUP .* } {permanent KernelDeadlock AUFSUmountHung task umount\.aufs:\w+ blocked for more than \w+ seconds\. } {permanent KernelDeadlock DockerHung task docker:\w+ blocked for more than \w+ seconds\. } {permanent ReadonlyFilesystem FilesystemIsReadOnly Remounting filesystem read-only } {permanent LocalDiskErrors XFSCorruption XFS \(((.{0,4})|(.+loop.*)|(loo[^p].*)|(lo[^o].*)|(l[^o].*))\).*Corruption detected.* } {permanent SuspectedLocalDiskErrors IOError I/O error, dev nvme.*, sector (?:[^0].+) .* } {permanent Sector0LocalDiskErrors IOError I/O error, dev nvme.*, sector (?:[0].+) .* } {permanent LocalDiskErrors IOError nvme.+ I/O .+ timeout.* } {permanent LocalDiskErrors CriticalMediumError critical medium error, dev nvme.+ } {permanent GPUFault CUDASegFault cuda-.+ segfault at .* } {permanent GPUFault GPUIsReportingErrors NVRM: Xid \(PCI.+\): (1|2|3|4|5|6|7|8|9|10|11|12|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|33|34|35|36|37|38|39|40|41|42|46|47|49|50|51|52|53|54|55|56|57|58|59|60|61|62|65|66|67|69|70|71|72|73|74|75|76|77|78|80|81|82|83|84|85|86|87|88|89|90|91|92|93|96|97|98|99|100|101|102|103|104|105|106|107|108|110|111|112|113|114|115|116|117|118),.* } {permanent GPUFault GPUIsReportingErrors NVRM: Rate limiting GSP RPC error prints for GPU at PCI:.+ \(printing .+ of every .+\).  
The GPU likely needs to be reset.* } {permanent GPUInvalidPushBuffer GPUIsReportingInvalidPushBuffer NVRM: Xid \(PCI.+\): 32,.* } {permanent GPUContextSwitchFault GPUIsReportingContextSwitchFault NVRM: Xid \(PCI.+\): 44,.* } {permanent GPUWantsReset GPU Has a pending row remap NVRM: Xid \(PCI.+\): (63),.* } {permanent GPURowRemapFailure GPU Failed a row remap NVRM: Xid \(PCI.+\): (64),.* } {permanent GPUECCUncorrectableError GPU has encountered an uncorrectable ECC error NVRM: Xid \(PCI.+\): (48|94|95),.* } {permanent GPUFallenOffBus A GPU has fallen off the bus NVRM: Xid \(PCI.+\): (79),.* } {permanent GPUGSPTimeoutXid119 GPU System Processor is failing to respond, likely crashed or deadlocked NVRM: Xid \(PCI.+\): (119),.* } {permanent GPUContextSwitchTimeoutXid109 GPU is reporting a Xid 109 NVRM: Xid \(PCI.+\): (109),.* } {permanent GPUGSPPanicXid120 GPU is reporting a GSP task panic NVRM: Xid \(PCI.+\): (120),.* } {permanent NVSwitchFailure A NVSwitch has failed nvidia-nvswitch.: SXid \(PCI.+\): (1900[4-6]|1901[3-7]|1904[6-8]|1905(3|4|6|8)|1906(0|1|3|4|6|7|9)|19070|20034|220(03|12)|2300[1-9]|2301[1-7]|2400[4-6]|2600[1-7]|2900(2|4)|3000(2|4)),.* } {permanent GPUFault GPUSmiError CW: GPU .* } {permanent IBPCIUnavailable PCILost mlx5_core.*PCI slot is unavailable.* } {permanent PersistentStorageFault CephFSQuotaError ceph: get_quota_realm: ino .+ } {temporary NFSStorageFault NFSNotResponding nfs: server .+ not responding.+ } {permanent GPUFault ROMError .*Invalid PCI ROM header signature.* } {permanent HardwareErrorFatal HardwareErrorFatal \[Hardware Error\]: event severity: fatal.* } {temporary  HardwareErrorInterruptPCIe \[Hardware Error\]:   section_type: PCIe error } {permanent HardwareErrorInterruptCPU HardwareErrorFromAPEI \[Hardware Error\]:   section_type: .* processor error } {permanent HardwareErrorInterruptMemory HardwareErrorFromAPEI \[Hardware Error\]:   section_type: memory error } {temporary  HardwareErrorInterruptUnknown \[Hardware Error\]:   
section_type: unknown.* } {temporary  JournaldKernelWatchLoopStarted \[npd-internal\] Entering journald watch loop.*kernel.* } {temporary  JournaldKernelFailedToGetNextEntry \[npd-internal\] Failed to get next journald entry.*kernel.* } {temporary  JournaldKernelFailedToGetEntry \[npd-internal\] Failed to get journald entry.*kernel.* }] 0x2f38e62} [] <nil> 0xc0000ea2a0 0xc000423b50}: failed to stat the log path "/var/log/journal": stat /var/log/journal: no such file or directory
I0722 15:00:04.046568      13 log_monitor.go:166] Start log monitor /config/docker-monitor.json
E0722 15:00:04.046583      13 problem_detector.go:57] Failed to start monitor &{/config/docker-monitor.json 0xc0000522d0 0xc0006bc880 {{journald map[source:dockerd] [] /var/log/journal 5m } 10 docker-monitor [] [{temporary  CorruptDockerImage Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/(.+) /var/lib/docker/image/(.+): directory not empty.* }] 0x2f38e62} [] <nil> 0xc0000eacb0 0xc00012d620}: failed to stat the log path "/var/log/journal": stat /var/log/journal: no such file or directory
I0722 15:00:04.046606      13 log_monitor.go:166] Start log monitor /config/kubelet.json
E0722 15:00:04.046617      13 problem_detector.go:57] Failed to start monitor &{/config/kubelet.json 0xc0000523f0 0xc0006bca40 {{journald map[source:kubelet] [] /var/log/journal 5m } 10  [{RunContainerError  {0 0 <nil>} NoRunContainerError No RunContainerErrors present} {KillContainerFailed  {0 0 <nil>} NoKillContainerFailed No KillContainerFailed Errors present}] [{temporary  JournaldKubeletWatchLoopStarted \[npd-internal\] Entering journald watch loop.*kubelet.* } {temporary  JournaldKubeletFailedToGetNextEntry \[npd-internal\] Failed to get next journald entry.*kubelet.* } {temporary  JournaldKubeletFailedToGetEntry \[npd-internal\] Failed to get journald entry.*kubelet.* } {permanent RunContainerError ContextDeadlineExceeded .*rror syncing pod.*RunContainerError.*context deadline exceeded.* } {permanent KillContainerFailed FailedToKillHPCVerificationContainer .*ill container failed.*hpc-verification.* }] 0x2f38e62} [] <nil> 0xc0000ec460 0xc00012d820}: failed to stat the log path "/var/log/journal": stat /var/log/journal: no such file or directory
I0722 15:00:04.046648      13 problem_detector.go:77] Problem detector started
I0722 15:00:04.046666      13 custom_plugin_monitor.go:312] Initialized conditions for /config/pci-monitor.json: [{Type:GPUPCIFault Status:False Transition:2025-07-22 15:00:04.046657573 +0000 UTC m=+0.032986471 Reason:PCIIsOk Message:VFIO PCI Connectivity is OK} {Type:InfiniBandLinkFault Status:False Transition:2025-07-22 15:00:04.046657653 +0000 UTC m=+0.032986551 Reason:LinkIsOk Message:InfiniBand Interfaces OK}]
I0722 15:00:04.046709      13 custom_plugin_monitor.go:301] Sending initial status for pci-monitor with conditions: [{Type:GPUPCIFault Status:False Transition:2025-07-22 15:00:04.046657573 +0000 UTC m=+0.032986471 Reason:PCIIsOk Message:VFIO PCI Connectivity is OK} {Type:InfiniBandLinkFault Status:False Transition:2025-07-22 15:00:04.046657653 +0000 UTC m=+0.032986551 Reason:LinkIsOk Message:InfiniBand Interfaces OK}]
I0722 15:00:04.057824      13 log_monitor.go:97] Start log monitor config refresh loop for /config/kernel-kmsg.json
I0722 15:00:04.057869      13 log_monitor.go:305] Initialize condition generated: [{Type:KernelDeadlock Status:False Transition:2025-07-22 15:00:04.057859761 +0000 UTC m=+0.044188659 Reason:KernelHasNoDeadlock Message:Kernel has no deadlock} {Type:KernelHardlock Status:False Transition:2025-07-22 15:00:04.057859841 +0000 UTC m=+0.044188749 Reason:NoCPUHardLockup Message:Kernel has no CPU Hard Lockup} {Type:ReadonlyFilesystem Status:False Transition:2025-07-22 15:00:04.057859921 +0000 UTC m=+0.044188819 Reason:FilesystemIsNotReadOnly Message:Filesystem is not read-only} {Type:LocalDiskErrors Status:False Transition:2025-07-22 15:00:04.057859991 +0000 UTC m=+0.044188889 Reason:NoDiskErrors Message:Local NVMe is healthy} {Type:GPUWantsReset Status:False Transition:2025-07-22 15:00:04.057860061 +0000 UTC m=+0.044188959 Reason:NoGPUErrors Message:GPUs are not reporting non-fatal errors} {Type:GPUChannelRetirement Status:False Transition:2025-07-22 15:00:04.057860131 +0000 UTC m=+0.044189039 Reason:NoGPUErrors Message:GPUs are not reporting any channel retirement errors} {Type:GPUChannelRetirementFailure Status:False Transition:2025-07-22 15:00:04.057860211 +0000 UTC m=+0.044189109 Reason:NoGPUErrors Message:No channels have failed retirement} {Type:GPURowRemapFailure Status:False Transition:2025-07-22 15:00:04.057860271 +0000 UTC m=+0.044189179 Reason:RowRemapOk Message:No rows have failed remapping} {Type:GPUECCUncorrectableError Status:False Transition:2025-07-22 15:00:04.057860341 +0000 UTC m=+0.044189239 Reason:NoECCError Message:No GPUs have triggered an ECC uncorrectable error} {Type:GPUInvalidPushBuffer Status:False Transition:2025-07-22 15:00:04.057860411 +0000 UTC m=+0.044189309 Reason:NoGPUErrors Message:GPUs are not reporting invalid push buffer} {Type:GPUContextSwitchFault Status:False Transition:2025-07-22 15:00:04.057860481 +0000 UTC m=+0.044189379 Reason:NoGPUErrors Message:GPUs are not reporting context switch fault} {Type:GPUFault Status:False 
Transition:2025-07-22 15:00:04.057860551 +0000 UTC m=+0.044189449 Reason:NoGPUErrors Message:GPUs are not reporting errors} {Type:IBPCIUnavailable Status:False Transition:2025-07-22 15:00:04.057860621 +0000 UTC m=+0.044189519 Reason:IBPCIAvailable Message:InfiniBand adapters are not reporting PCI slot unavailable} {Type:GPUFallenOffBus Status:False Transition:2025-07-22 15:00:04.057860691 +0000 UTC m=+0.044189589 Reason:NoMatchingXid Message:No Xid 79 detected} {Type:GPUGSPTimeoutXid119 Status:False Transition:2025-07-22 15:00:04.057860761 +0000 UTC m=+0.044189659 Reason:NoMatchingXid Message:No Xid 119 detected} {Type:GPUContextSwitchTimeoutXid109 Status:False Transition:2025-07-22 15:00:04.057860831 +0000 UTC m=+0.044189729 Reason:NoMatchingXid Message:No Xid 109 detected} {Type:GPUGSPPanicXid120 Status:False Transition:2025-07-22 15:00:04.057860901 +0000 UTC m=+0.044189799 Reason:NoMatchingXid Message:No Xid 120 detected} {Type:PersistentStorageFault Status:False Transition:2025-07-22 15:00:04.057860971 +0000 UTC m=+0.044189879 Reason:NoStorageErrors Message:Storage subsystem is not reporting any errors} {Type:HardwareErrorFatal Status:False Transition:2025-07-22 15:00:04.057861051 +0000 UTC m=+0.044189949 Reason:NoHardwareErrorFatal Message:Platform is not reporting any fatal hardware errors} {Type:HardwareErrorInterruptCPU Status:False Transition:2025-07-22 15:00:04.057861121 +0000 UTC m=+0.044190019 Reason:NoInterruptsDetected Message:Platform is reporting no hardware errors via APEI for CPU} {Type:HardwareErrorInterruptMemory Status:False Transition:2025-07-22 15:00:04.057861201 +0000 UTC m=+0.044190099 Reason:NoInterruptsDetected Message:Platform is reporting no hardware errors via APEI for Memory} {Type:NVLinkXIDFatal Status:False Transition:2025-07-22 15:00:04.057861271 +0000 UTC m=+0.044190169 Reason:NoMatchingXid Message:No XID 144-150, 154-157 detected} {Type:NVLinkXIDFatalSwitch Status:False Transition:2025-07-22 15:00:04.057861341 +0000 UTC 
m=+0.044190239 Reason:NoMatchingXid Message:No XID 144-150, 154-157 detected from switch side} {Type:NVLinkMaskError Status:False Transition:2025-07-22 15:00:04.057861401 +0000 UTC m=+0.044190299 Reason:NoMaskError Message:No GPUs are reporting link mask errors} {Type:Sector0LocalDiskErrors Status:False Transition:2025-07-22 15:00:04.057861471 +0000 UTC m=+0.044190369 Reason:NoSector0NVMEErrors Message:No nvme i/o errors detected on sector 0} {Type:SuspectedLocalDiskErrors Status:False Transition:2025-07-22 15:00:04.057861541 +0000 UTC m=+0.044190439 Reason:NoSuspectedNVMEErrors Message:No nvme i/o errors detected}]
I0722 15:00:04.069203      13 log_monitor.go:229] New status generated: &{Source: Events:[{Severity:warn Timestamp:2025-07-22 15:00:04.046723653 +0000 UTC m=+0.033052551 Reason:KmsgWatchLoopStarted Message:[npd-internal] Entering watch loop for kernel log}] Conditions:[{Type:KernelDeadlock Status:False Transition:2025-07-22 15:00:04.057859761 +0000 UTC m=+0.044188659 Reason:KernelHasNoDeadlock Message:Kernel has no deadlock} {Type:KernelHardlock Status:False Transition:2025-07-22 15:00:04.057859841 +0000 UTC m=+0.044188749 Reason:NoCPUHardLockup Message:Kernel has no CPU Hard Lockup} {Type:ReadonlyFilesystem Status:False Transition:2025-07-22 15:00:04.057859921 +0000 UTC m=+0.044188819 Reason:FilesystemIsNotReadOnly Message:Filesystem is not read-only} {Type:LocalDiskErrors Status:False Transition:2025-07-22 15:00:04.057859991 +0000 UTC m=+0.044188889 Reason:NoDiskErrors Message:Local NVMe is healthy} {Type:GPUWantsReset Status:False Transition:2025-07-22 15:00:04.057860061 +0000 UTC m=+0.044188959 Reason:NoGPUErrors Message:GPUs are not reporting non-fatal errors} {Type:GPUChannelRetirement Status:False Transition:2025-07-22 15:00:04.057860131 +0000 UTC m=+0.044189039 Reason:NoGPUErrors Message:GPUs are not reporting any channel retirement errors} {Type:GPUChannelRetirementFailure Status:False Transition:2025-07-22 15:00:04.057860211 +0000 UTC m=+0.044189109 Reason:NoGPUErrors Message:No channels have failed retirement} {Type:GPURowRemapFailure Status:False Transition:2025-07-22 15:00:04.057860271 +0000 UTC m=+0.044189179 Reason:RowRemapOk Message:No rows have failed remapping} {Type:GPUECCUncorrectableError Status:False Transition:2025-07-22 15:00:04.057860341 +0000 UTC m=+0.044189239 Reason:NoECCError Message:No GPUs have triggered an ECC uncorrectable error} {Type:GPUInvalidPushBuffer Status:False Transition:2025-07-22 15:00:04.057860411 +0000 UTC m=+0.044189309 Reason:NoGPUErrors Message:GPUs are not reporting invalid push buffer} {Type:GPUContextSwitchFault 
Status:False Transition:2025-07-22 15:00:04.057860481 +0000 UTC m=+0.044189379 Reason:NoGPUErrors Message:GPUs are not reporting context switch fault} {Type:GPUFault Status:False Transition:2025-07-22 15:00:04.057860551 +0000 UTC m=+0.044189449 Reason:NoGPUErrors Message:GPUs are not reporting errors} {Type:IBPCIUnavailable Status:False Transition:2025-07-22 15:00:04.057860621 +0000 UTC m=+0.044189519 Reason:IBPCIAvailable Message:InfiniBand adapters are not reporting PCI slot unavailable} {Type:GPUFallenOffBus Status:False Transition:2025-07-22 15:00:04.057860691 +0000 UTC m=+0.044189589 Reason:NoMatchingXid Message:No Xid 79 detected} {Type:GPUGSPTimeoutXid119 Status:False Transition:2025-07-22 15:00:04.057860761 +0000 UTC m=+0.044189659 Reason:NoMatchingXid Message:No Xid 119 detected} {Type:GPUContextSwitchTimeoutXid109 Status:False Transition:2025-07-22 15:00:04.057860831 +0000 UTC m=+0.044189729 Reason:NoMatchingXid Message:No Xid 109 detected} {Type:GPUGSPPanicXid120 Status:False Transition:2025-07-22 15:00:04.057860901 +0000 UTC m=+0.044189799 Reason:NoMatchingXid Message:No Xid 120 detected} {Type:PersistentStorageFault Status:False Transition:2025-07-22 15:00:04.057860971 +0000 UTC m=+0.044189879 Reason:NoStorageErrors Message:Storage subsystem is not reporting any errors} {Type:HardwareErrorFatal Status:False Transition:2025-07-22 15:00:04.057861051 +0000 UTC m=+0.044189949 Reason:NoHardwareErrorFatal Message:Platform is not reporting any fatal hardware errors} {Type:HardwareErrorInterruptCPU Status:False Transition:2025-07-22 15:00:04.057861121 +0000 UTC m=+0.044190019 Reason:NoInterruptsDetected Message:Platform is reporting no hardware errors via APEI for CPU} {Type:HardwareErrorInterruptMemory Status:False Transition:2025-07-22 15:00:04.057861201 +0000 UTC m=+0.044190099 Reason:NoInterruptsDetected Message:Platform is reporting no hardware errors via APEI for Memory} {Type:NVLinkXIDFatal Status:False Transition:2025-07-22 15:00:04.057861271 +0000 UTC 
m=+0.044190169 Reason:NoMatchingXid Message:No XID 144-150, 154-157 detected} {Type:NVLinkXIDFatalSwitch Status:False Transition:2025-07-22 15:00:04.057861341 +0000 UTC m=+0.044190239 Reason:NoMatchingXid Message:No XID 144-150, 154-157 detected from switch side} {Type:NVLinkMaskError Status:False Transition:2025-07-22 15:00:04.057861401 +0000 UTC m=+0.044190299 Reason:NoMaskError Message:No GPUs are reporting link mask errors} {Type:Sector0LocalDiskErrors Status:False Transition:2025-07-22 15:00:04.057861471 +0000 UTC m=+0.044190369 Reason:NoSector0NVMEErrors Message:No nvme i/o errors detected on sector 0} {Type:SuspectedLocalDiskErrors Status:False Transition:2025-07-22 15:00:04.057861541 +0000 UTC m=+0.044190439 Reason:NoSuspectedNVMEErrors Message:No nvme i/o errors detected}]}
I0722 15:11:54.061383      13 log_monitor.go:139] Log monitor config /config/kernel-kmsg.json is changed, updating
E0722 15:11:54.061408      13 log_monitor.go:104] Failed to refresh log monitor config /config/kernel-kmsg.json: log monitor config change requires restart
Diff:   systemlogmonitor.MonitorConfig{
  	WatcherConfig: {Plugin: "kmsg", PluginConfig: {"refresh": "true", "refreshDurationSeconds": "10", "revive": "true"}, LogPath: "/dev/kmsg", Lookback: "5m", ...},
  	BufferSize:    1000,
  	Source:        "",
I0722 15:11:54.061426      13 k8s_exporter.go:126] Context cancelled; shutting down HTTP server on 127.0.0.1:20256
I0722 15:11:54.061459      13 custom_plugin_monitor.go:118] Stop custom plugin monitor /config/network-monitor.json
  	DefaultConditions: []types.Condition{
  		... // 2 identical elements
  		{Type: "ReadonlyFilesystem", Reason: "FilesystemIsNotReadOnly", Message: "Filesystem is not read-only"},
  		{Type: "LocalDiskErrors", Reason: "NoDiskErrors", Message: "Local NVMe is healthy"},
I0722 15:11:54.061509      13 plugin.go:63] Stopping plugin execution
I0722 15:11:54.061525      13 plugin.go:276] Stop plugin execution
+ 		{
+ 			Type:    "GPUXID149",
+ 			Reason:  "NoGPUErrors",
I0722 15:11:54.061533      13 custom_plugin_monitor.go:148] Custom plugin monitor stopped: /config/network-monitor.json
I0722 15:11:54.061484      13 prometheus_exporter.go:65] Context cancelled; shutting down HTTP server on 0.0.0.0:20257
+ 			Message: "GPUs are not reporting xid 149",
+ 		},
I0722 15:11:54.061539      13 custom_plugin_monitor.go:118] Stop custom plugin monitor /config/pci-monitor.json
I0722 15:11:54.061606      13 plugin.go:63] Stopping plugin execution
  		{Type: "GPUWantsReset", Reason: "NoGPUErrors", Message: "GPUs are not reporting non-fatal errors"},
  		{Type: "GPUChannelRetirement", Reason: "NoGPUErrors", Message: "GPUs are not reporting any channel retirement errors"},
  		... // 20 identical elements
I0722 15:11:54.061618      13 plugin.go:276] Stop plugin execution
I0722 15:11:54.061624      13 custom_plugin_monitor.go:148] Custom plugin monitor stopped: /config/pci-monitor.json
I0722 15:11:54.061632      13 log_monitor.go:191] Stop log monitor /config/kernel-kmsg.json
E0722 15:11:54.061661      13 logger.go:18] error reading /dev/kmsg: read /dev/kmsg: file already closed
E0722 15:11:54.061678      13 log_watcher_linux.go:120] Kmsg channel closed, reviving parser
  	},
  	Rules: []types.Rule{
  		... // 13 identical elements
  		{Type: "permanent", Condition: "NVLinkXIDFatalSwitch", Reason: "NVSwitch XID indicates fatal error from switch side", Pattern: `NVRM: Xid \(PCI.+\): (14[4-9]|150|15[4-7]),.*Fatal.* \(0x02.*`, ...},
  		{Type: "permanent", Condition: "NVLinkMaskError", Reason: "GPUs are reporting Link mask errors", Pattern: "NVRM: (knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update "..., ...},
  		{
  			Type:      "permanent",
  			Condition: "GPUWantsReset",
  			Reason:    "NVSwitch XID indicates GPU reset required",
  			Pattern: strings.Join({
  				`NVRM: Xid \(PCI.+\): (14[4-`,
- 				"9",
+ 				"8",
  				"]|150|15[4-7]),.*GPU Reset Required.*",
  			}, ""),
  			PatternGeneratedMessageSuffix: "",
  		},
+ 		{
+ 			Type:      "permanent",
+ 			Condition: "GPUXID149",
+ 			Reason:    "NVSwitch XID indicates GPU reset required",
+ 			Pattern:   `NVRM: Xid \(PCI.+\): (149),.*`,
+ 		},
  		{Type: "permanent", Condition: "GPUChannelRetirement", Reason: "GPU is reporting memory channel retirement due to repeat uncorre"..., Pattern: `NVRM: Xid \(PCI.+\): (160),.*`, ...},
  		{Type: "permanent", Condition: "GPUChannelRetirementFailure", Reason: "GPU is reporting memory channel retirement failure", Pattern: `NVRM: Xid \(PCI.+\): (161),.*`, ...},
  		... // 35 identical elements
  	},
  	EnableMetricsReporting: &true,
  }

E0722 15:11:59.064868      13 log_watcher_linux.go:120] Kmsg channel closed, reviving parser
I0722 15:12:04.065230      13 log_watcher_linux.go:108] Stop watching kernel log
I0722 15:12:04.065264      13 log_monitor.go:212] Log monitor stopped: /config/kernel-kmsg.json
I0722 15:12:04.065285      13 node_problem_detector_unix.go:54] Reloading node problem detector due to config change
I0722 15:12:04.065525      13 custom_plugin_monitor.go:80] Finish parsing custom plugin monitor config file /config/network-monitor.json: {Plugin:custom PluginGlobalConfig:{InvokeIntervalString:0xc000a06110 TimeoutString:0xc000a06120 InvokeInterval:1m0s Timeout:50s MaxOutputLength:0xc000a1b410 Concurrency:0xc000a1b418 EnableMessageChangeBasedConditionUpdate:0x2fec7f1 SkipInitialStatus:0x2fec7f2} Source:network-monitor DefaultConditions:[{Type:DNSFailure Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:DNSIsOk Message:DNS lookups are working} {Type:ConnectivityFailure Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:ConnectionIsOk Message:Internal connectivity is working} {Type:APIFailure Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:ConnectionIsOk Message:Internal connectivity to K8S APIServer is working} {Type:PubFailure Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:ConnectionIsOk Message:Public Internet connectivity is working}] Rules:[0xc0007800e0 0xc000780150 0xc0007801c0 0xc000780230] EnableMetricsReporting:0xc000a1b42f}
I0722 15:12:04.065734      13 custom_plugin_monitor.go:80] Finish parsing custom plugin monitor config file /config/pci-monitor.json: {Plugin:custom PluginGlobalConfig:{InvokeIntervalString:0xc000a06380 TimeoutString:0xc000a06390 InvokeInterval:20s Timeout:50s MaxOutputLength:0xc000a1b730 Concurrency:0xc000a1b738 EnableMessageChangeBasedConditionUpdate:0xc000a1b740 SkipInitialStatus:0x2fec7f2} Source:pci-monitor DefaultConditions:[{Type:GPUPCIFault Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:PCIIsOk Message:VFIO PCI Connectivity is OK} {Type:InfiniBandLinkFault Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:LinkIsOk Message:InfiniBand Interfaces OK}] Rules:[0xc000780380 0xc0007803f0 0xc000780460] EnableMetricsReporting:0xc000a1b74c}
I0722 15:12:04.066559      13 log_monitor.go:82] Finish parsing log monitor config file /config/kernel-kmsg.json: {WatcherConfig:{Plugin:kmsg PluginConfig:map[refresh:true refreshDurationSeconds:10 revive:true] SkipList:[] LogPath:/dev/kmsg Lookback:5m Delay:} BufferSize:1000 Source: DefaultConditions:[{Type:KernelDeadlock Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:Kernel has no deadlock} {Type:KernelHardlock Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoCPUHardLockup Message:Kernel has no CPU Hard Lockup} {Type:ReadonlyFilesystem Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:FilesystemIsNotReadOnly Message:Filesystem is not read-only} {Type:LocalDiskErrors Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoDiskErrors Message:Local NVMe is healthy} {Type:GPUXID149 Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:GPUs are not reporting xid 149} {Type:GPUWantsReset Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:GPUs are not reporting non-fatal errors} {Type:GPUChannelRetirement Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:GPUs are not reporting any channel retirement errors} {Type:GPUChannelRetirementFailure Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:No channels have failed retirement} {Type:GPURowRemapFailure Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:RowRemapOk Message:No rows have failed remapping} {Type:GPUECCUncorrectableError Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoECCError Message:No GPUs have triggered an ECC uncorrectable error} {Type:GPUInvalidPushBuffer Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:GPUs are not reporting invalid push buffer} {Type:GPUContextSwitchFault Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:GPUs are not reporting context switch fault} {Type:GPUFault Status: 
Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:GPUs are not reporting errors} {Type:IBPCIUnavailable Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:IBPCIAvailable Message:InfiniBand adapters are not reporting PCI slot unavailable} {Type:GPUFallenOffBus Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No Xid 79 detected} {Type:GPUGSPTimeoutXid119 Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No Xid 119 detected} {Type:GPUContextSwitchTimeoutXid109 Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No Xid 109 detected} {Type:GPUGSPPanicXid120 Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No Xid 120 detected} {Type:PersistentStorageFault Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoStorageErrors Message:Storage subsystem is not reporting any errors} {Type:HardwareErrorFatal Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoHardwareErrorFatal Message:Platform is not reporting any fatal hardware errors} {Type:HardwareErrorInterruptCPU Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoInterruptsDetected Message:Platform is reporting no hardware errors via APEI for CPU} {Type:HardwareErrorInterruptMemory Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoInterruptsDetected Message:Platform is reporting no hardware errors via APEI for Memory} {Type:NVLinkXIDFatal Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No XID 144-150, 154-157 detected} {Type:NVLinkXIDFatalSwitch Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No XID 144-150, 154-157 detected from switch side} {Type:NVLinkMaskError Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMaskError Message:No GPUs are reporting link mask errors} {Type:Sector0LocalDiskErrors Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoSector0NVMEErrors Message:No nvme i/o errors 
detected on sector 0} {Type:SuspectedLocalDiskErrors Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoSuspectedNVMEErrors Message:No nvme i/o errors detected}] Rules:[{Type:temporary Condition: Reason:SystemOOMKilling Pattern:Out of memory: Killed process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:CGroupOOMKilling Pattern:Memory cgroup out of memory: Killed process \d+ (.*) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more than \w+ seconds\. PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:UnregisterNetDevice Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+ PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:KernelOops Pattern:BUG: unable to handle kernel NULL pointer dereference at .* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:KernelOops Pattern:divide error: 0000 \[#\d+\] SMP PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:HardwareErrorRecoverable Pattern:\[Hardware Error\]: event severity: recoverable PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:HardwareErrorCorrected Pattern:\[Hardware Error\]: event severity: corrected PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:HardwareErrorInfo Pattern:\[Hardware Error\]: event severity: info PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:PCIAER Pattern:AER: aer_status: .* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:NVSXidNonFatal Pattern:nvidia-nvswitch\d: SXid .* Non-fatal, .* PatternGeneratedMessageSuffix:} {Type:temporary Condition:NVLinkXIDNonfatal Reason: Pattern:NVRM: Xid \(PCI.+\): (14[4-9]|150|15[4-7]),.*Nonfatal.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:NVLinkXIDFatal Reason:NVSwitch XID indicates fatal error 
Pattern:NVRM: Xid \(PCI.+\): (14[4-9]|150|15[4-7]),.*Fatal.* \(0x0[0-1|3-9].* PatternGeneratedMessageSuffix:} {Type:permanent Condition:NVLinkXIDFatalSwitch Reason:NVSwitch XID indicates fatal error from switch side Pattern:NVRM: Xid \(PCI.+\): (14[4-9]|150|15[4-7]),.*Fatal.* \(0x02.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:NVLinkMaskError Reason:GPUs are reporting Link mask errors Pattern:NVRM: (knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask|NVRM: knvlinkDiscoverPostRxDetLinks.*: Getting peer..s postRxDetLinkMask failed).* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUWantsReset Reason:NVSwitch XID indicates GPU reset required Pattern:NVRM: Xid \(PCI.+\): (14[4-8]|150|15[4-7]),.*GPU Reset Required.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUXID149 Reason:NVSwitch XID indicates GPU reset required Pattern:NVRM: Xid \(PCI.+\): (149),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUChannelRetirement Reason:GPU is reporting memory channel retirement due to repeat uncorrectable errors Pattern:NVRM: Xid \(PCI.+\): (160),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUChannelRetirementFailure Reason:GPU is reporting memory channel retirement failure Pattern:NVRM: Xid \(PCI.+\): (161),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:KernelDeadlock Reason:CPUSoftLockup Pattern:watchdog: BUG: soft lockup .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:KernelHardlock Reason:CPUHardLockup Pattern:NMI watchdog: Watchdog detected hard LOCKUP .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Pattern:task umount\.aufs:\w+ blocked for more than \w+ seconds\. PatternGeneratedMessageSuffix:} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ seconds\. 
PatternGeneratedMessageSuffix:} {Type:permanent Condition:ReadonlyFilesystem Reason:FilesystemIsReadOnly Pattern:Remounting filesystem read-only PatternGeneratedMessageSuffix:} {Type:permanent Condition:LocalDiskErrors Reason:XFSCorruption Pattern:XFS \(((.{0,4})|(.+loop.*)|(loo[^p].*)|(lo[^o].*)|(l[^o].*))\).*Corruption detected.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:SuspectedLocalDiskErrors Reason:IOError Pattern:I/O error, dev nvme.*, sector (?:[^0].+) .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:Sector0LocalDiskErrors Reason:IOError Pattern:I/O error, dev nvme.*, sector (?:[0].+) .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:LocalDiskErrors Reason:IOError Pattern:nvme.+ I/O .+ timeout.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:LocalDiskErrors Reason:CriticalMediumError Pattern:critical medium error, dev nvme.+ PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFault Reason:CUDASegFault Pattern:cuda-.+ segfault at .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFault Reason:GPUIsReportingErrors Pattern:NVRM: Xid \(PCI.+\): (1|2|3|4|5|6|7|8|9|10|11|12|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|33|34|35|36|37|38|39|40|41|42|46|47|49|50|51|52|53|54|55|56|57|58|59|60|61|62|65|66|67|69|70|71|72|73|74|75|76|77|78|80|81|82|83|84|85|86|87|88|89|90|91|92|93|96|97|98|99|100|101|102|103|104|105|106|107|108|110|111|112|113|114|115|116|117|118),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFault Reason:GPUIsReportingErrors Pattern:NVRM: Rate limiting GSP RPC error prints for GPU at PCI:.+ \(printing .+ of every .+\).  
The GPU likely needs to be reset.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUInvalidPushBuffer Reason:GPUIsReportingInvalidPushBuffer Pattern:NVRM: Xid \(PCI.+\): 32,.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUContextSwitchFault Reason:GPUIsReportingContextSwitchFault Pattern:NVRM: Xid \(PCI.+\): 44,.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUWantsReset Reason:GPU Has a pending row remap Pattern:NVRM: Xid \(PCI.+\): (63),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPURowRemapFailure Reason:GPU Failed a row remap Pattern:NVRM: Xid \(PCI.+\): (64),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUECCUncorrectableError Reason:GPU has encountered an uncorrectable ECC error Pattern:NVRM: Xid \(PCI.+\): (48|94|95),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFallenOffBus Reason:A GPU has fallen off the bus Pattern:NVRM: Xid \(PCI.+\): (79),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUGSPTimeoutXid119 Reason:GPU System Processor is failing to respond, likely crashed or deadlocked Pattern:NVRM: Xid \(PCI.+\): (119),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUContextSwitchTimeoutXid109 Reason:GPU is reporting a Xid 109 Pattern:NVRM: Xid \(PCI.+\): (109),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUGSPPanicXid120 Reason:GPU is reporting a GSP task panic Pattern:NVRM: Xid \(PCI.+\): (120),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:NVSwitchFailure Reason:A NVSwitch has failed Pattern:nvidia-nvswitch.: SXid \(PCI.+\): (1900[4-6]|1901[3-7]|1904[6-8]|1905(3|4|6|8)|1906(0|1|3|4|6|7|9)|19070|20034|220(03|12)|2300[1-9]|2301[1-7]|2400[4-6]|2600[1-7]|2900(2|4)|3000(2|4)),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFault Reason:GPUSmiError Pattern:CW: GPU .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:IBPCIUnavailable Reason:PCILost Pattern:mlx5_core.*PCI slot is 
unavailable.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:PersistentStorageFault Reason:CephFSQuotaError Pattern:ceph: get_quota_realm: ino .+ PatternGeneratedMessageSuffix:} {Type:temporary Condition:NFSStorageFault Reason:NFSNotResponding Pattern:nfs: server .+ not responding.+ PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFault Reason:ROMError Pattern:.*Invalid PCI ROM header signature.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:HardwareErrorFatal Reason:HardwareErrorFatal Pattern:\[Hardware Error\]: event severity: fatal.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:HardwareErrorInterruptPCIe Pattern:\[Hardware Error\]:   section_type: PCIe error PatternGeneratedMessageSuffix:} {Type:permanent Condition:HardwareErrorInterruptCPU Reason:HardwareErrorFromAPEI Pattern:\[Hardware Error\]:   section_type: .* processor error PatternGeneratedMessageSuffix:} {Type:permanent Condition:HardwareErrorInterruptMemory Reason:HardwareErrorFromAPEI Pattern:\[Hardware Error\]:   section_type: memory error PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:HardwareErrorInterruptUnknown Pattern:\[Hardware Error\]:   section_type: unknown.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:KmsgWatchLoopStarted Pattern:\[npd-internal\] Entering watch loop for kernel log PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:KmsgParserRevived Pattern:\[npd-internal\] Reviving.*parser.* PatternGeneratedMessageSuffix:}] EnableMetricsReporting:0x2f38e62}
I0722 15:12:04.066781      13 log_watchers.go:40] Use log watcher of plugin "kmsg"
I0722 15:12:04.067945      13 log_monitor.go:82] Finish parsing log monitor config file /config/kernel-journald.json: {WatcherConfig:{Plugin:journald PluginConfig:map[source:kernel] SkipList:[] LogPath:/var/log/journal Lookback:5m Delay:} BufferSize:1000 Source: DefaultConditions:[{Type:KernelDeadlock Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:Kernel has no deadlock} {Type:KernelHardlock Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoCPUHardLockup Message:Kernel has no CPU Hard Lockup} {Type:ReadonlyFilesystem Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:FilesystemIsNotReadOnly Message:Filesystem is not read-only} {Type:LocalDiskErrors Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoDiskErrors Message:Local NVMe is healthy} {Type:GPUXID149 Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:GPUs are not reporting xid 149} {Type:GPUWantsReset Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:GPUs are not reporting non-fatal errors} {Type:GPUChannelRetirement Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:GPUs are not reporting any channel retirement errors} {Type:GPUChannelRetirementFailure Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:No channels have failed retirement} {Type:GPURowRemapFailure Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:RowRemapOk Message:No rows have failed remapping} {Type:GPUECCUncorrectableError Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoECCError Message:No GPUs have triggered an ECC uncorrectable error} {Type:GPUInvalidPushBuffer Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:GPUs are not reporting invalid push buffer} {Type:GPUContextSwitchFault Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:GPUs are not reporting context switch fault} {Type:GPUFault Status: Transition:0001-01-01 
00:00:00 +0000 UTC Reason:NoGPUErrors Message:GPUs are not reporting errors} {Type:IBPCIUnavailable Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:IBPCIAvailable Message:InfiniBand adapters are not reporting PCI slot unavailable} {Type:GPUFallenOffBus Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No Xid 79 detected} {Type:GPUGSPTimeoutXid119 Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No Xid 119 detected} {Type:GPUContextSwitchTimeoutXid109 Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No Xid 109 detected} {Type:GPUGSPPanicXid120 Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No Xid 120 detected} {Type:PersistentStorageFault Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoStorageErrors Message:Storage subsystem is not reporting any errors} {Type:HardwareErrorFatal Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoHardwareErrorFatal Message:Platform is not reporting any fatal hardware errors} {Type:HardwareErrorInterruptCPU Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoInterruptsDetected Message:Platform is reporting no hardware errors via APEI for CPU} {Type:HardwareErrorInterruptMemory Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoInterruptsDetected Message:Platform is reporting no hardware errors via APEI for Memory} {Type:NVLinkXIDFatal Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No XID 144-150, 154-157 detected} {Type:NVLinkXIDFatalSwitch Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No XID 144-150, 154-157 detected from switch side} {Type:NVLinkMaskError Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMaskError Message:No GPUs are reporting link mask errors} {Type:Sector0LocalDiskErrors Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoSector0NVMEErrors Message:No nvme i/o errors detected on sector 0} 
{Type:SuspectedLocalDiskErrors Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoSuspectedNVMEErrors Message:No nvme i/o errors detected}] Rules:[{Type:temporary Condition: Reason:SystemOOMKilling Pattern:Out of memory: Killed process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:CGroupOOMKilling Pattern:Memory cgroup out of memory: Killed process \d+ (.*) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more than \w+ seconds\. PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:UnregisterNetDevice Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+ PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:KernelOops Pattern:BUG: unable to handle kernel NULL pointer dereference at .* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:KernelOops Pattern:divide error: 0000 \[#\d+\] SMP PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:HardwareErrorRecoverable Pattern:\[Hardware Error\]: event severity: recoverable PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:HardwareErrorCorrected Pattern:\[Hardware Error\]: event severity: corrected PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:HardwareErrorInfo Pattern:\[Hardware Error\]: event severity: info PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:PCIAER Pattern:AER: aer_status: .* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:NVSXidNonFatal Pattern:nvidia-nvswitch\d: SXid .* Non-fatal, .* PatternGeneratedMessageSuffix:} {Type:temporary Condition:NVLinkXIDNonfatal Reason: Pattern:NVRM: Xid \(PCI.+\): (14[4-9]|150|15[4-7]),.*Nonfatal.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:NVLinkXIDFatal Reason:NVSwitch XID indicates fatal error Pattern:NVRM: Xid \(PCI.+\): 
(14[4-9]|150|15[4-7]),.*Fatal.* \(0x0[0-1|3-9].* PatternGeneratedMessageSuffix:} {Type:permanent Condition:NVLinkXIDFatalSwitch Reason:NVSwitch XID indicates fatal error from switch side Pattern:NVRM: Xid \(PCI.+\): (14[4-9]|150|15[4-7]),.*Fatal.* \(0x02.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:NVLinkMaskError Reason:GPUs are reporting Link mask errors Pattern:NVRM: (knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask|NVRM: knvlinkDiscoverPostRxDetLinks.*: Getting peer..s postRxDetLinkMask failed).* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUWantsReset Reason:NVSwitch XID indicates GPU reset required Pattern:NVRM: Xid \(PCI.+\): (14[4-8]|150|15[4-7]),.*GPU Reset Required.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUXID149 Reason:NVSwitch XID indicates GPU reset required Pattern:NVRM: Xid \(PCI.+\): (149),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUChannelRetirement Reason:GPU is reporting memory channel retirement due to repeat uncorrectable errors Pattern:NVRM: Xid \(PCI.+\): (160),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUChannelRetirementFailure Reason:GPU is reporting memory channel retirement failure Pattern:NVRM: Xid \(PCI.+\): (161),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:KernelDeadlock Reason:CPUSoftLockup Pattern:watchdog: BUG: soft lockup .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:KernelHardlock Reason:CPUHardLockup Pattern:NMI watchdog: Watchdog detected hard LOCKUP .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Pattern:task umount\.aufs:\w+ blocked for more than \w+ seconds\. PatternGeneratedMessageSuffix:} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ seconds\. 
PatternGeneratedMessageSuffix:} {Type:permanent Condition:ReadonlyFilesystem Reason:FilesystemIsReadOnly Pattern:Remounting filesystem read-only PatternGeneratedMessageSuffix:} {Type:permanent Condition:LocalDiskErrors Reason:XFSCorruption Pattern:XFS \(((.{0,4})|(.+loop.*)|(loo[^p].*)|(lo[^o].*)|(l[^o].*))\).*Corruption detected.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:SuspectedLocalDiskErrors Reason:IOError Pattern:I/O error, dev nvme.*, sector (?:[^0].+) .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:Sector0LocalDiskErrors Reason:IOError Pattern:I/O error, dev nvme.*, sector (?:[0].+) .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:LocalDiskErrors Reason:IOError Pattern:nvme.+ I/O .+ timeout.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:LocalDiskErrors Reason:CriticalMediumError Pattern:critical medium error, dev nvme.+ PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFault Reason:CUDASegFault Pattern:cuda-.+ segfault at .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFault Reason:GPUIsReportingErrors Pattern:NVRM: Xid \(PCI.+\): (1|2|3|4|5|6|7|8|9|10|11|12|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|33|34|35|36|37|38|39|40|41|42|46|47|49|50|51|52|53|54|55|56|57|58|59|60|61|62|65|66|67|69|70|71|72|73|74|75|76|77|78|80|81|82|83|84|85|86|87|88|89|90|91|92|93|96|97|98|99|100|101|102|103|104|105|106|107|108|110|111|112|113|114|115|116|117|118),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFault Reason:GPUIsReportingErrors Pattern:NVRM: Rate limiting GSP RPC error prints for GPU at PCI:.+ \(printing .+ of every .+\).  
The GPU likely needs to be reset.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUInvalidPushBuffer Reason:GPUIsReportingInvalidPushBuffer Pattern:NVRM: Xid \(PCI.+\): 32,.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUContextSwitchFault Reason:GPUIsReportingContextSwitchFault Pattern:NVRM: Xid \(PCI.+\): 44,.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUWantsReset Reason:GPU Has a pending row remap Pattern:NVRM: Xid \(PCI.+\): (63),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPURowRemapFailure Reason:GPU Failed a row remap Pattern:NVRM: Xid \(PCI.+\): (64),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUECCUncorrectableError Reason:GPU has encountered an uncorrectable ECC error Pattern:NVRM: Xid \(PCI.+\): (48|94|95),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFallenOffBus Reason:A GPU has fallen off the bus Pattern:NVRM: Xid \(PCI.+\): (79),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUGSPTimeoutXid119 Reason:GPU System Processor is failing to respond, likely crashed or deadlocked Pattern:NVRM: Xid \(PCI.+\): (119),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUContextSwitchTimeoutXid109 Reason:GPU is reporting a Xid 109 Pattern:NVRM: Xid \(PCI.+\): (109),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUGSPPanicXid120 Reason:GPU is reporting a GSP task panic Pattern:NVRM: Xid \(PCI.+\): (120),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:NVSwitchFailure Reason:A NVSwitch has failed Pattern:nvidia-nvswitch.: SXid \(PCI.+\): (1900[4-6]|1901[3-7]|1904[6-8]|1905(3|4|6|8)|1906(0|1|3|4|6|7|9)|19070|20034|220(03|12)|2300[1-9]|2301[1-7]|2400[4-6]|2600[1-7]|2900(2|4)|3000(2|4)),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFault Reason:GPUSmiError Pattern:CW: GPU .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:IBPCIUnavailable Reason:PCILost Pattern:mlx5_core.*PCI slot is 
unavailable.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:PersistentStorageFault Reason:CephFSQuotaError Pattern:ceph: get_quota_realm: ino .+ PatternGeneratedMessageSuffix:} {Type:temporary Condition:NFSStorageFault Reason:NFSNotResponding Pattern:nfs: server .+ not responding.+ PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFault Reason:ROMError Pattern:.*Invalid PCI ROM header signature.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:HardwareErrorFatal Reason:HardwareErrorFatal Pattern:\[Hardware Error\]: event severity: fatal.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:HardwareErrorInterruptPCIe Pattern:\[Hardware Error\]:   section_type: PCIe error PatternGeneratedMessageSuffix:} {Type:permanent Condition:HardwareErrorInterruptCPU Reason:HardwareErrorFromAPEI Pattern:\[Hardware Error\]:   section_type: .* processor error PatternGeneratedMessageSuffix:} {Type:permanent Condition:HardwareErrorInterruptMemory Reason:HardwareErrorFromAPEI Pattern:\[Hardware Error\]:   section_type: memory error PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:HardwareErrorInterruptUnknown Pattern:\[Hardware Error\]:   section_type: unknown.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:JournaldKernelWatchLoopStarted Pattern:\[npd-internal\] Entering journald watch loop.*kernel.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:JournaldKernelFailedToGetNextEntry Pattern:\[npd-internal\] Failed to get next journald entry.*kernel.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:JournaldKernelFailedToGetEntry Pattern:\[npd-internal\] Failed to get journald entry.*kernel.* PatternGeneratedMessageSuffix:}] EnableMetricsReporting:0x2f38e62}
I0722 15:12:04.068141      13 log_watchers.go:40] Use log watcher of plugin "journald"
I0722 15:12:04.068535      13 log_monitor.go:82] Finish parsing log monitor config file /config/docker-monitor.json: {WatcherConfig:{Plugin:journald PluginConfig:map[source:dockerd] SkipList:[] LogPath:/var/log/journal Lookback:5m Delay:} BufferSize:10 Source:docker-monitor DefaultConditions:[] Rules:[{Type:temporary Condition: Reason:CorruptDockerImage Pattern:Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/(.+) /var/lib/docker/image/(.+): directory not empty.* PatternGeneratedMessageSuffix:}] EnableMetricsReporting:0x2f38e62}
I0722 15:12:04.068556      13 log_watchers.go:40] Use log watcher of plugin "journald"
I0722 15:12:04.068884      13 log_monitor.go:82] Finish parsing log monitor config file /config/kubelet.json: {WatcherConfig:{Plugin:journald PluginConfig:map[source:kubelet] SkipList:[] LogPath:/var/log/journal Lookback:5m Delay:} BufferSize:10 Source: DefaultConditions:[{Type:RunContainerError Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoRunContainerError Message:No RunContainerErrors present} {Type:KillContainerFailed Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoKillContainerFailed Message:No KillContainerFailed Errors present}] Rules:[{Type:temporary Condition: Reason:JournaldKubeletWatchLoopStarted Pattern:\[npd-internal\] Entering journald watch loop.*kubelet.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:JournaldKubeletFailedToGetNextEntry Pattern:\[npd-internal\] Failed to get next journald entry.*kubelet.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:JournaldKubeletFailedToGetEntry Pattern:\[npd-internal\] Failed to get journald entry.*kubelet.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:RunContainerError Reason:ContextDeadlineExceeded Pattern:.*rror syncing pod.*RunContainerError.*context deadline exceeded.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:KillContainerFailed Reason:FailedToKillHPCVerificationContainer Pattern:.*ill container failed.*hpc-verification.* PatternGeneratedMessageSuffix:}] EnableMetricsReporting:0x2f38e62}
I0722 15:12:04.069014      13 log_watchers.go:40] Use log watcher of plugin "journald"
I0722 15:12:04.069547      13 k8s_exporter.go:56] Waiting for kube-apiserver to be ready (timeout 5m0s)...
I0722 15:12:04.075558      13 problem_client.go:128] Deleting deprecated conditions [GPUApplicationError GPUMMUErrorXid31 HardwareErrorInterruptPCIe HardwareErrorInterruptUnknown] (if present)...
I0722 15:12:04.076963      13 problem_client.go:159] No deprecated conditions to delete
I0722 15:12:04.076989      13 node_problem_detector.go:59] K8s exporter started.
I0722 15:12:04.077204      13 node_problem_detector.go:63] Prometheus exporter started.
I0722 15:12:04.077233      13 custom_plugin_monitor.go:111] Start custom plugin monitor /config/network-monitor.json
I0722 15:12:04.077242      13 custom_plugin_monitor.go:111] Start custom plugin monitor /config/pci-monitor.json
I0722 15:12:04.077252      13 log_monitor.go:166] Start log monitor /config/kernel-kmsg.json
I0722 15:12:04.077270      13 custom_plugin_monitor.go:312] Initialized conditions for /config/network-monitor.json: [{Type:DNSFailure Status:False Transition:2025-07-22 15:12:04.077255443 +0000 UTC m=+720.063584351 Reason:DNSIsOk Message:DNS lookups are working} {Type:ConnectivityFailure Status:False Transition:2025-07-22 15:12:04.077255543 +0000 UTC m=+720.063584441 Reason:ConnectionIsOk Message:Internal connectivity is working} {Type:APIFailure Status:False Transition:2025-07-22 15:12:04.077255613 +0000 UTC m=+720.063584511 Reason:ConnectionIsOk Message:Internal connectivity to K8S APIServer is working} {Type:PubFailure Status:False Transition:2025-07-22 15:12:04.077255673 +0000 UTC m=+720.063584571 Reason:ConnectionIsOk Message:Public Internet connectivity is working}]
I0722 15:12:04.077351      13 custom_plugin_monitor.go:301] Sending initial status for network-monitor with conditions: [{Type:DNSFailure Status:False Transition:2025-07-22 15:12:04.077255443 +0000 UTC m=+720.063584351 Reason:DNSIsOk Message:DNS lookups are working} {Type:ConnectivityFailure Status:False Transition:2025-07-22 15:12:04.077255543 +0000 UTC m=+720.063584441 Reason:ConnectionIsOk Message:Internal connectivity is working} {Type:APIFailure Status:False Transition:2025-07-22 15:12:04.077255613 +0000 UTC m=+720.063584511 Reason:ConnectionIsOk Message:Internal connectivity to K8S APIServer is working} {Type:PubFailure Status:False Transition:2025-07-22 15:12:04.077255673 +0000 UTC m=+720.063584571 Reason:ConnectionIsOk Message:Public Internet connectivity is working}]
I0722 15:12:04.077516      13 custom_plugin_monitor.go:312] Initialized conditions for /config/pci-monitor.json: [{Type:GPUPCIFault Status:False Transition:2025-07-22 15:12:04.077503625 +0000 UTC m=+720.063832533 Reason:PCIIsOk Message:VFIO PCI Connectivity is OK} {Type:InfiniBandLinkFault Status:False Transition:2025-07-22 15:12:04.077503715 +0000 UTC m=+720.063832613 Reason:LinkIsOk Message:InfiniBand Interfaces OK}]
I0722 15:12:04.077569      13 custom_plugin_monitor.go:301] Sending initial status for pci-monitor with conditions: [{Type:GPUPCIFault Status:False Transition:2025-07-22 15:12:04.077503625 +0000 UTC m=+720.063832533 Reason:PCIIsOk Message:VFIO PCI Connectivity is OK} {Type:InfiniBandLinkFault Status:False Transition:2025-07-22 15:12:04.077503715 +0000 UTC m=+720.063832613 Reason:LinkIsOk Message:InfiniBand Interfaces OK}]
I0722 15:12:04.080292      13 log_monitor.go:174] Log monitor /config/kernel-kmsg.json is configured to refresh periodically
I0722 15:12:04.080317      13 log_monitor.go:166] Start log monitor /config/kernel-journald.json
E0722 15:12:04.080342      13 problem_detector.go:57] Failed to start monitor &{/config/kernel-journald.json 0xc0007c4630 0xc000953280 {{journald map[source:kernel] [] /var/log/journal 5m } 1000  [{KernelDeadlock  {0 0 <nil>} KernelHasNoDeadlock Kernel has no deadlock} {KernelHardlock  {0 0 <nil>} NoCPUHardLockup Kernel has no CPU Hard Lockup} {ReadonlyFilesystem  {0 0 <nil>} FilesystemIsNotReadOnly Filesystem is not read-only} {LocalDiskErrors  {0 0 <nil>} NoDiskErrors Local NVMe is healthy} {GPUXID149  {0 0 <nil>} NoGPUErrors GPUs are not reporting xid 149} {GPUWantsReset  {0 0 <nil>} NoGPUErrors GPUs are not reporting non-fatal errors} {GPUChannelRetirement  {0 0 <nil>} NoGPUErrors GPUs are not reporting any channel retirement errors} {GPUChannelRetirementFailure  {0 0 <nil>} NoGPUErrors No channels have failed retirement} {GPURowRemapFailure  {0 0 <nil>} RowRemapOk No rows have failed remapping} {GPUECCUncorrectableError  {0 0 <nil>} NoECCError No GPUs have triggered an ECC uncorrectable error} {GPUInvalidPushBuffer  {0 0 <nil>} NoGPUErrors GPUs are not reporting invalid push buffer} {GPUContextSwitchFault  {0 0 <nil>} NoGPUErrors GPUs are not reporting context switch fault} {GPUFault  {0 0 <nil>} NoGPUErrors GPUs are not reporting errors} {IBPCIUnavailable  {0 0 <nil>} IBPCIAvailable InfiniBand adapters are not reporting PCI slot unavailable} {GPUFallenOffBus  {0 0 <nil>} NoMatchingXid No Xid 79 detected} {GPUGSPTimeoutXid119  {0 0 <nil>} NoMatchingXid No Xid 119 detected} {GPUContextSwitchTimeoutXid109  {0 0 <nil>} NoMatchingXid No Xid 109 detected} {GPUGSPPanicXid120  {0 0 <nil>} NoMatchingXid No Xid 120 detected} {PersistentStorageFault  {0 0 <nil>} NoStorageErrors Storage subsystem is not reporting any errors} {HardwareErrorFatal  {0 0 <nil>} NoHardwareErrorFatal Platform is not reporting any fatal hardware errors} {HardwareErrorInterruptCPU  {0 0 <nil>} NoInterruptsDetected Platform is reporting no hardware errors via APEI for CPU} 
{HardwareErrorInterruptMemory  {0 0 <nil>} NoInterruptsDetected Platform is reporting no hardware errors via APEI for Memory} {NVLinkXIDFatal  {0 0 <nil>} NoMatchingXid No XID 144-150, 154-157 detected} {NVLinkXIDFatalSwitch  {0 0 <nil>} NoMatchingXid No XID 144-150, 154-157 detected from switch side} {NVLinkMaskError  {0 0 <nil>} NoMaskError No GPUs are reporting link mask errors} {Sector0LocalDiskErrors  {0 0 <nil>} NoSector0NVMEErrors No nvme i/o errors detected on sector 0} {SuspectedLocalDiskErrors  {0 0 <nil>} NoSuspectedNVMEErrors No nvme i/o errors detected}] [{temporary  SystemOOMKilling Out of memory: Killed process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB.* } {temporary  CGroupOOMKilling Memory cgroup out of memory: Killed process \d+ (.*) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB.* } {temporary  TaskHung task \S+:\w+ blocked for more than \w+ seconds\. } {temporary  UnregisterNetDevice unregister_netdevice: waiting for \w+ to become free. Usage count = \d+ } {temporary  KernelOops BUG: unable to handle kernel NULL pointer dereference at .* } {temporary  KernelOops divide error: 0000 \[#\d+\] SMP } {temporary  HardwareErrorRecoverable \[Hardware Error\]: event severity: recoverable } {temporary  HardwareErrorCorrected \[Hardware Error\]: event severity: corrected } {temporary  HardwareErrorInfo \[Hardware Error\]: event severity: info } {temporary  PCIAER AER: aer_status: .* } {temporary  NVSXidNonFatal nvidia-nvswitch\d: SXid .* Non-fatal, .* } {temporary NVLinkXIDNonfatal  NVRM: Xid \(PCI.+\): (14[4-9]|150|15[4-7]),.*Nonfatal.* } {permanent NVLinkXIDFatal NVSwitch XID indicates fatal error NVRM: Xid \(PCI.+\): (14[4-9]|150|15[4-7]),.*Fatal.* \(0x0[0-1|3-9].* } {permanent NVLinkXIDFatalSwitch NVSwitch XID indicates fatal error from switch side NVRM: Xid \(PCI.+\): (14[4-9]|150|15[4-7]),.*Fatal.* \(0x02.* } {permanent NVLinkMaskError GPUs are reporting Link mask errors NVRM: (knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx 
Detect Link mask|NVRM: knvlinkDiscoverPostRxDetLinks.*: Getting peer..s postRxDetLinkMask failed).* } {permanent GPUWantsReset NVSwitch XID indicates GPU reset required NVRM: Xid \(PCI.+\): (14[4-8]|150|15[4-7]),.*GPU Reset Required.* } {permanent GPUXID149 NVSwitch XID indicates GPU reset required NVRM: Xid \(PCI.+\): (149),.* } {permanent GPUChannelRetirement GPU is reporting memory channel retirement due to repeat uncorrectable errors NVRM: Xid \(PCI.+\): (160),.* } {permanent GPUChannelRetirementFailure GPU is reporting memory channel retirement failure NVRM: Xid \(PCI.+\): (161),.* } {permanent KernelDeadlock CPUSoftLockup watchdog: BUG: soft lockup .* } {permanent KernelHardlock CPUHardLockup NMI watchdog: Watchdog detected hard LOCKUP .* } {permanent KernelDeadlock AUFSUmountHung task umount\.aufs:\w+ blocked for more than \w+ seconds\. } {permanent KernelDeadlock DockerHung task docker:\w+ blocked for more than \w+ seconds\. } {permanent ReadonlyFilesystem FilesystemIsReadOnly Remounting filesystem read-only } {permanent LocalDiskErrors XFSCorruption XFS \(((.{0,4})|(.+loop.*)|(loo[^p].*)|(lo[^o].*)|(l[^o].*))\).*Corruption detected.* } {permanent SuspectedLocalDiskErrors IOError I/O error, dev nvme.*, sector (?:[^0].+) .* } {permanent Sector0LocalDiskErrors IOError I/O error, dev nvme.*, sector (?:[0].+) .* } {permanent LocalDiskErrors IOError nvme.+ I/O .+ timeout.* } {permanent LocalDiskErrors CriticalMediumError critical medium error, dev nvme.+ } {permanent GPUFault CUDASegFault cuda-.+ segfault at .* } {permanent GPUFault GPUIsReportingErrors NVRM: Xid \(PCI.+\): (1|2|3|4|5|6|7|8|9|10|11|12|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|33|34|35|36|37|38|39|40|41|42|46|47|49|50|51|52|53|54|55|56|57|58|59|60|61|62|65|66|67|69|70|71|72|73|74|75|76|77|78|80|81|82|83|84|85|86|87|88|89|90|91|92|93|96|97|98|99|100|101|102|103|104|105|106|107|108|110|111|112|113|114|115|116|117|118),.* } {permanent GPUFault GPUIsReportingErrors NVRM: Rate limiting GSP 
RPC error prints for GPU at PCI:.+ \(printing .+ of every .+\).  The GPU likely needs to be reset.* } {permanent GPUInvalidPushBuffer GPUIsReportingInvalidPushBuffer NVRM: Xid \(PCI.+\): 32,.* } {permanent GPUContextSwitchFault GPUIsReportingContextSwitchFault NVRM: Xid \(PCI.+\): 44,.* } {permanent GPUWantsReset GPU Has a pending row remap NVRM: Xid \(PCI.+\): (63),.* } {permanent GPURowRemapFailure GPU Failed a row remap NVRM: Xid \(PCI.+\): (64),.* } {permanent GPUECCUncorrectableError GPU has encountered an uncorrectable ECC error NVRM: Xid \(PCI.+\): (48|94|95),.* } {permanent GPUFallenOffBus A GPU has fallen off the bus NVRM: Xid \(PCI.+\): (79),.* } {permanent GPUGSPTimeoutXid119 GPU System Processor is failing to respond, likely crashed or deadlocked NVRM: Xid \(PCI.+\): (119),.* } {permanent GPUContextSwitchTimeoutXid109 GPU is reporting a Xid 109 NVRM: Xid \(PCI.+\): (109),.* } {permanent GPUGSPPanicXid120 GPU is reporting a GSP task panic NVRM: Xid \(PCI.+\): (120),.* } {permanent NVSwitchFailure A NVSwitch has failed nvidia-nvswitch.: SXid \(PCI.+\): (1900[4-6]|1901[3-7]|1904[6-8]|1905(3|4|6|8)|1906(0|1|3|4|6|7|9)|19070|20034|220(03|12)|2300[1-9]|2301[1-7]|2400[4-6]|2600[1-7]|2900(2|4)|3000(2|4)),.* } {permanent GPUFault GPUSmiError CW: GPU .* } {permanent IBPCIUnavailable PCILost mlx5_core.*PCI slot is unavailable.* } {permanent PersistentStorageFault CephFSQuotaError ceph: get_quota_realm: ino .+ } {temporary NFSStorageFault NFSNotResponding nfs: server .+ not responding.+ } {permanent GPUFault ROMError .*Invalid PCI ROM header signature.* } {permanent HardwareErrorFatal HardwareErrorFatal \[Hardware Error\]: event severity: fatal.* } {temporary  HardwareErrorInterruptPCIe \[Hardware Error\]:   section_type: PCIe error } {permanent HardwareErrorInterruptCPU HardwareErrorFromAPEI \[Hardware Error\]:   section_type: .* processor error } {permanent HardwareErrorInterruptMemory HardwareErrorFromAPEI \[Hardware Error\]:   section_type: memory error } 
{temporary  HardwareErrorInterruptUnknown \[Hardware Error\]:   section_type: unknown.* } {temporary  JournaldKernelWatchLoopStarted \[npd-internal\] Entering journald watch loop.*kernel.* } {temporary  JournaldKernelFailedToGetNextEntry \[npd-internal\] Failed to get next journald entry.*kernel.* } {temporary  JournaldKernelFailedToGetEntry \[npd-internal\] Failed to get journald entry.*kernel.* }] 0x2f38e62} [] <nil> 0xc0009dfc70 0xc0008fd590}: failed to stat the log path "/var/log/journal": stat /var/log/journal: no such file or directory
I0722 15:12:04.080444      13 log_monitor.go:166] Start log monitor /config/docker-monitor.json
E0722 15:12:04.080462      13 problem_detector.go:57] Failed to start monitor &{/config/docker-monitor.json 0xc0007c4750 0xc000a3e7c0 {{journald map[source:dockerd] [] /var/log/journal 5m } 10 docker-monitor [] [{temporary  CorruptDockerImage Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/(.+) /var/lib/docker/image/(.+): directory not empty.* }] 0x2f38e62} [] <nil> 0xc000a42380 0xc000a22580}: failed to stat the log path "/var/log/journal": stat /var/log/journal: no such file or directory
I0722 15:12:04.080477      13 log_monitor.go:166] Start log monitor /config/kubelet.json
E0722 15:12:04.080489      13 problem_detector.go:57] Failed to start monitor &{/config/kubelet.json 0xc0007c4870 0xc000a3e980 {{journald map[source:kubelet] [] /var/log/journal 5m } 10  [{RunContainerError  {0 0 <nil>} NoRunContainerError No RunContainerErrors present} {KillContainerFailed  {0 0 <nil>} NoKillContainerFailed No KillContainerFailed Errors present}] [{temporary  JournaldKubeletWatchLoopStarted \[npd-internal\] Entering journald watch loop.*kubelet.* } {temporary  JournaldKubeletFailedToGetNextEntry \[npd-internal\] Failed to get next journald entry.*kubelet.* } {temporary  JournaldKubeletFailedToGetEntry \[npd-internal\] Failed to get journald entry.*kubelet.* } {permanent RunContainerError ContextDeadlineExceeded .*rror syncing pod.*RunContainerError.*context deadline exceeded.* } {permanent KillContainerFailed FailedToKillHPCVerificationContainer .*ill container failed.*hpc-verification.* }] 0x2f38e62} [] <nil> 0xc000a43730 0xc000a226a0}: failed to stat the log path "/var/log/journal": stat /var/log/journal: no such file or directory
I0722 15:12:04.080510      13 problem_detector.go:77] Problem detector started
I0722 15:12:04.080718      13 log_monitor.go:97] Start log monitor config refresh loop for /config/kernel-kmsg.json
I0722 15:12:04.080783      13 log_monitor.go:305] Initialize condition generated: [{Type:KernelDeadlock Status:False Transition:2025-07-22 15:12:04.080770391 +0000 UTC m=+720.067099289 Reason:KernelHasNoDeadlock Message:Kernel has no deadlock} {Type:KernelHardlock Status:False Transition:2025-07-22 15:12:04.080770491 +0000 UTC m=+720.067099389 Reason:NoCPUHardLockup Message:Kernel has no CPU Hard Lockup} {Type:ReadonlyFilesystem Status:False Transition:2025-07-22 15:12:04.080770561 +0000 UTC m=+720.067099459 Reason:FilesystemIsNotReadOnly Message:Filesystem is not read-only} {Type:LocalDiskErrors Status:False Transition:2025-07-22 15:12:04.080770631 +0000 UTC m=+720.067099529 Reason:NoDiskErrors Message:Local NVMe is healthy} {Type:GPUXID149 Status:False Transition:2025-07-22 15:12:04.080770701 +0000 UTC m=+720.067099609 Reason:NoGPUErrors Message:GPUs are not reporting xid 149} {Type:GPUWantsReset Status:False Transition:2025-07-22 15:12:04.080770781 +0000 UTC m=+720.067099679 Reason:NoGPUErrors Message:GPUs are not reporting non-fatal errors} {Type:GPUChannelRetirement Status:False Transition:2025-07-22 15:12:04.080770851 +0000 UTC m=+720.067099749 Reason:NoGPUErrors Message:GPUs are not reporting any channel retirement errors} {Type:GPUChannelRetirementFailure Status:False Transition:2025-07-22 15:12:04.080770921 +0000 UTC m=+720.067099819 Reason:NoGPUErrors Message:No channels have failed retirement} {Type:GPURowRemapFailure Status:False Transition:2025-07-22 15:12:04.080770991 +0000 UTC m=+720.067099889 Reason:RowRemapOk Message:No rows have failed remapping} {Type:GPUECCUncorrectableError Status:False Transition:2025-07-22 15:12:04.080771051 +0000 UTC m=+720.067099959 Reason:NoECCError Message:No GPUs have triggered an ECC uncorrectable error} {Type:GPUInvalidPushBuffer Status:False Transition:2025-07-22 15:12:04.080771131 +0000 UTC m=+720.067100029 Reason:NoGPUErrors Message:GPUs are not reporting invalid push buffer} {Type:GPUContextSwitchFault Status:False 
Transition:2025-07-22 15:12:04.080771201 +0000 UTC m=+720.067100099 Reason:NoGPUErrors Message:GPUs are not reporting context switch fault} {Type:GPUFault Status:False Transition:2025-07-22 15:12:04.080771271 +0000 UTC m=+720.067100169 Reason:NoGPUErrors Message:GPUs are not reporting errors} {Type:IBPCIUnavailable Status:False Transition:2025-07-22 15:12:04.080771341 +0000 UTC m=+720.067100239 Reason:IBPCIAvailable Message:InfiniBand adapters are not reporting PCI slot unavailable} {Type:GPUFallenOffBus Status:False Transition:2025-07-22 15:12:04.080771411 +0000 UTC m=+720.067100319 Reason:NoMatchingXid Message:No Xid 79 detected} {Type:GPUGSPTimeoutXid119 Status:False Transition:2025-07-22 15:12:04.080771491 +0000 UTC m=+720.067100389 Reason:NoMatchingXid Message:No Xid 119 detected} {Type:GPUContextSwitchTimeoutXid109 Status:False Transition:2025-07-22 15:12:04.080771561 +0000 UTC m=+720.067100459 Reason:NoMatchingXid Message:No Xid 109 detected} {Type:GPUGSPPanicXid120 Status:False Transition:2025-07-22 15:12:04.080771631 +0000 UTC m=+720.067100529 Reason:NoMatchingXid Message:No Xid 120 detected} {Type:PersistentStorageFault Status:False Transition:2025-07-22 15:12:04.080771701 +0000 UTC m=+720.067100609 Reason:NoStorageErrors Message:Storage subsystem is not reporting any errors} {Type:HardwareErrorFatal Status:False Transition:2025-07-22 15:12:04.080771781 +0000 UTC m=+720.067100679 Reason:NoHardwareErrorFatal Message:Platform is not reporting any fatal hardware errors} {Type:HardwareErrorInterruptCPU Status:False Transition:2025-07-22 15:12:04.080771851 +0000 UTC m=+720.067100749 Reason:NoInterruptsDetected Message:Platform is reporting no hardware errors via APEI for CPU} {Type:HardwareErrorInterruptMemory Status:False Transition:2025-07-22 15:12:04.080771921 +0000 UTC m=+720.067100819 Reason:NoInterruptsDetected Message:Platform is reporting no hardware errors via APEI for Memory} {Type:NVLinkXIDFatal Status:False Transition:2025-07-22 15:12:04.080771991 
+0000 UTC m=+720.067100889 Reason:NoMatchingXid Message:No XID 144-150, 154-157 detected} {Type:NVLinkXIDFatalSwitch Status:False Transition:2025-07-22 15:12:04.080772061 +0000 UTC m=+720.067100959 Reason:NoMatchingXid Message:No XID 144-150, 154-157 detected from switch side} {Type:NVLinkMaskError Status:False Transition:2025-07-22 15:12:04.080772131 +0000 UTC m=+720.067101029 Reason:NoMaskError Message:No GPUs are reporting link mask errors} {Type:Sector0LocalDiskErrors Status:False Transition:2025-07-22 15:12:04.080772201 +0000 UTC m=+720.067101109 Reason:NoSector0NVMEErrors Message:No nvme i/o errors detected on sector 0} {Type:SuspectedLocalDiskErrors Status:False Transition:2025-07-22 15:12:04.080772281 +0000 UTC m=+720.067101179 Reason:NoSuspectedNVMEErrors Message:No nvme i/o errors detected}]
I0722 15:12:04.082940      13 log_monitor.go:229] New status generated: &{Source: Events:[{Severity:warn Timestamp:2025-07-22 15:12:04.080518509 +0000 UTC m=+720.066847457 Reason:KmsgWatchLoopStarted Message:[npd-internal] Entering watch loop for kernel log}] Conditions:[{Type:KernelDeadlock Status:False Transition:2025-07-22 15:12:04.080770391 +0000 UTC m=+720.067099289 Reason:KernelHasNoDeadlock Message:Kernel has no deadlock} {Type:KernelHardlock Status:False Transition:2025-07-22 15:12:04.080770491 +0000 UTC m=+720.067099389 Reason:NoCPUHardLockup Message:Kernel has no CPU Hard Lockup} {Type:ReadonlyFilesystem Status:False Transition:2025-07-22 15:12:04.080770561 +0000 UTC m=+720.067099459 Reason:FilesystemIsNotReadOnly Message:Filesystem is not read-only} {Type:LocalDiskErrors Status:False Transition:2025-07-22 15:12:04.080770631 +0000 UTC m=+720.067099529 Reason:NoDiskErrors Message:Local NVMe is healthy} {Type:GPUXID149 Status:False Transition:2025-07-22 15:12:04.080770701 +0000 UTC m=+720.067099609 Reason:NoGPUErrors Message:GPUs are not reporting xid 149} {Type:GPUWantsReset Status:False Transition:2025-07-22 15:12:04.080770781 +0000 UTC m=+720.067099679 Reason:NoGPUErrors Message:GPUs are not reporting non-fatal errors} {Type:GPUChannelRetirement Status:False Transition:2025-07-22 15:12:04.080770851 +0000 UTC m=+720.067099749 Reason:NoGPUErrors Message:GPUs are not reporting any channel retirement errors} {Type:GPUChannelRetirementFailure Status:False Transition:2025-07-22 15:12:04.080770921 +0000 UTC m=+720.067099819 Reason:NoGPUErrors Message:No channels have failed retirement} {Type:GPURowRemapFailure Status:False Transition:2025-07-22 15:12:04.080770991 +0000 UTC m=+720.067099889 Reason:RowRemapOk Message:No rows have failed remapping} {Type:GPUECCUncorrectableError Status:False Transition:2025-07-22 15:12:04.080771051 +0000 UTC m=+720.067099959 Reason:NoECCError Message:No GPUs have triggered an ECC uncorrectable error} {Type:GPUInvalidPushBuffer 
Status:False Transition:2025-07-22 15:12:04.080771131 +0000 UTC m=+720.067100029 Reason:NoGPUErrors Message:GPUs are not reporting invalid push buffer} {Type:GPUContextSwitchFault Status:False Transition:2025-07-22 15:12:04.080771201 +0000 UTC m=+720.067100099 Reason:NoGPUErrors Message:GPUs are not reporting context switch fault} {Type:GPUFault Status:False Transition:2025-07-22 15:12:04.080771271 +0000 UTC m=+720.067100169 Reason:NoGPUErrors Message:GPUs are not reporting errors} {Type:IBPCIUnavailable Status:False Transition:2025-07-22 15:12:04.080771341 +0000 UTC m=+720.067100239 Reason:IBPCIAvailable Message:InfiniBand adapters are not reporting PCI slot unavailable} {Type:GPUFallenOffBus Status:False Transition:2025-07-22 15:12:04.080771411 +0000 UTC m=+720.067100319 Reason:NoMatchingXid Message:No Xid 79 detected} {Type:GPUGSPTimeoutXid119 Status:False Transition:2025-07-22 15:12:04.080771491 +0000 UTC m=+720.067100389 Reason:NoMatchingXid Message:No Xid 119 detected} {Type:GPUContextSwitchTimeoutXid109 Status:False Transition:2025-07-22 15:12:04.080771561 +0000 UTC m=+720.067100459 Reason:NoMatchingXid Message:No Xid 109 detected} {Type:GPUGSPPanicXid120 Status:False Transition:2025-07-22 15:12:04.080771631 +0000 UTC m=+720.067100529 Reason:NoMatchingXid Message:No Xid 120 detected} {Type:PersistentStorageFault Status:False Transition:2025-07-22 15:12:04.080771701 +0000 UTC m=+720.067100609 Reason:NoStorageErrors Message:Storage subsystem is not reporting any errors} {Type:HardwareErrorFatal Status:False Transition:2025-07-22 15:12:04.080771781 +0000 UTC m=+720.067100679 Reason:NoHardwareErrorFatal Message:Platform is not reporting any fatal hardware errors} {Type:HardwareErrorInterruptCPU Status:False Transition:2025-07-22 15:12:04.080771851 +0000 UTC m=+720.067100749 Reason:NoInterruptsDetected Message:Platform is reporting no hardware errors via APEI for CPU} {Type:HardwareErrorInterruptMemory Status:False Transition:2025-07-22 15:12:04.080771921 +0000 
UTC m=+720.067100819 Reason:NoInterruptsDetected Message:Platform is reporting no hardware errors via APEI for Memory} {Type:NVLinkXIDFatal Status:False Transition:2025-07-22 15:12:04.080771991 +0000 UTC m=+720.067100889 Reason:NoMatchingXid Message:No XID 144-150, 154-157 detected} {Type:NVLinkXIDFatalSwitch Status:False Transition:2025-07-22 15:12:04.080772061 +0000 UTC m=+720.067100959 Reason:NoMatchingXid Message:No XID 144-150, 154-157 detected from switch side} {Type:NVLinkMaskError Status:False Transition:2025-07-22 15:12:04.080772131 +0000 UTC m=+720.067101029 Reason:NoMaskError Message:No GPUs are reporting link mask errors} {Type:Sector0LocalDiskErrors Status:False Transition:2025-07-22 15:12:04.080772201 +0000 UTC m=+720.067101109 Reason:NoSector0NVMEErrors Message:No nvme i/o errors detected on sector 0} {Type:SuspectedLocalDiskErrors Status:False Transition:2025-07-22 15:12:04.080772281 +0000 UTC m=+720.067101179 Reason:NoSuspectedNVMEErrors Message:No nvme i/o errors detected}]}
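
The `Start log monitor config refresh loop` line above suggests a periodic re-read of the config file. What follows is only a minimal sketch of such a loop, not the code in this PR: `monitorConfig`, `refreshLoop`, `errRestartRequired` and `runRefreshExample` are hypothetical names, and `reflect.DeepEqual` stands in for whatever comparison the monitor actually performs.

```go
package main

import (
	"context"
	"errors"
	"reflect"
	"time"
)

// monitorConfig is a hypothetical, trimmed-down stand-in for the log monitor config.
type monitorConfig struct {
	Source string
	Rules  []string
}

// errRestartRequired borrows the wording of the error seen in the logs;
// the variable itself is an assumption made for this sketch.
var errRestartRequired = errors.New("log monitor config change requires restart")

// refreshLoop calls load every interval until ctx is cancelled.
// It returns errRestartRequired as soon as the loaded config differs from current.
func refreshLoop(ctx context.Context, current monitorConfig, interval time.Duration,
	load func() (monitorConfig, error)) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			next, err := load()
			if err != nil {
				continue // keep running with the current config if the read fails
			}
			if !reflect.DeepEqual(current, next) {
				return errRestartRequired
			}
		}
	}
}

// runRefreshExample simulates a config whose only rule changes on the third read.
func runRefreshExample() error {
	reads := 0
	load := func() (monitorConfig, error) {
		reads++
		cfg := monitorConfig{Rules: []string{`\[Hardware Error\]:   section_type: unknown.*`}}
		if reads >= 3 {
			cfg.Rules[0] = `\[DY Error\]:   section_type: unknown.*`
		}
		return cfg, nil
	}
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	initial, _ := load()
	return refreshLoop(ctx, initial, 10*time.Millisecond, load)
}
```

In this sketch the caller decides what to do with `errRestartRequired`, for example cancelling a shared context so that the other monitors and exporters stop before a reload.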

Owner Author

daveoy commented Jul 22, 2025

Same pod after another config change: NPD reloads in place and keeps running.
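
In the log below, the exporters report `Context cancelled; shutting down HTTP server` once the config change is detected. A minimal sketch of that pattern with the standard `net/http` package, under the assumption that the exporter owns a plain `http.Server`; `serveUntilDone` and `runShutdownExample` are hypothetical names, not functions from this PR:

```go
package main

import (
	"context"
	"errors"
	"net"
	"net/http"
	"time"
)

// serveUntilDone serves on ln until ctx is cancelled, then shuts the server down gracefully.
func serveUntilDone(ctx context.Context, ln net.Listener, handler http.Handler) error {
	srv := &http.Server{Handler: handler}
	errCh := make(chan error, 1)
	go func() { errCh <- srv.Serve(ln) }()

	select {
	case err := <-errCh:
		return err // the server stopped on its own before any cancellation
	case <-ctx.Done():
	}

	// Give in-flight requests a bounded amount of time to finish.
	shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := srv.Shutdown(shutdownCtx); err != nil {
		return err
	}
	// Serve returns http.ErrServerClosed after Shutdown; treat that as a clean exit.
	if err := <-errCh; !errors.Is(err, http.ErrServerClosed) {
		return err
	}
	return nil
}

// runShutdownExample starts a server on an ephemeral loopback port and
// cancels its context shortly afterwards.
func runShutdownExample() error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	ctx, cancel := context.WithCancel(context.Background())
	time.AfterFunc(50*time.Millisecond, cancel)
	return serveUntilDone(ctx, ln, http.NotFoundHandler())
}
```

Because the listener is released when the function returns, a reloaded exporter in the same process can bind the same address again.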

I0722 15:16:44.086788      13 log_monitor.go:139] Log monitor config /config/kernel-kmsg.json is changed, updating
Diff:   systemlogmonitor.MonitorConfig{
  	... // 2 identical fields
  	Source:            "",
  	DefaultConditions: {{Type: "KernelDeadlock", Reason: "KernelHasNoDeadlock", Message: "Kernel has no deadlock"}, {Type: "KernelHardlock", Reason: "NoCPUHardLockup", Message: "Kernel has no CPU Hard Lockup"}, {Type: "ReadonlyFilesystem", Reason: "FilesystemIsNotReadOnly", Message: "Filesystem is not read-only"}, {Type: "LocalDiskErrors", Reason: "NoDiskErrors", Message: "Local NVMe is healthy"}, ...},
  	Rules: []types.Rule{
  		... // 49 identical elements
  		{Type: "permanent", Condition: "HardwareErrorInterruptCPU", Reason: "HardwareErrorFromAPEI", Pattern: `\[Hardware Error\]:   section_type: .* processor error`, ...},
  		{Type: "permanent", Condition: "HardwareErrorInterruptMemory", Reason: "HardwareErrorFromAPEI", Pattern: `\[Hardware Error\]:   section_type: memory error`, ...},
  		{
  			Type:      "temporary",
  			Condition: "",
  			Reason:    "HardwareErrorInterruptUnknown",
  			Pattern: strings.Join({
  				`\[`,
- 				"Hardware",
+ 				"DY",
  				` Error\]:   section_type: unknown.*`,
  			}, ""),
  			PatternGeneratedMessageSuffix: "",
  		},
  		{Type: "temporary", Reason: "KmsgWatchLoopStarted", Pattern: `\[npd-internal\] Entering watch loop for kernel log`},
  		{Type: "temporary", Reason: "KmsgParserRevived", Pattern: `\[npd-internal\] Reviving.*parser.*`},
  	},
  	EnableMetricsReporting: &true,
  }

E0722 15:16:44.087244      13 log_monitor.go:104] Failed to refresh log monitor config /config/kernel-kmsg.json: log monitor config change requires restart
I0722 15:16:44.087278      13 custom_plugin_monitor.go:118] Stop custom plugin monitor /config/network-monitor.json
I0722 15:16:44.087292      13 prometheus_exporter.go:65] Context cancelled; shutting down HTTP server on 0.0.0.0:20257
I0722 15:16:44.087326      13 plugin.go:63] Stopping plugin execution
I0722 15:16:44.087300      13 k8s_exporter.go:126] Context cancelled; shutting down HTTP server on 127.0.0.1:20256
I0722 15:16:44.087338      13 plugin.go:276] Stop plugin execution
I0722 15:16:44.087476      13 custom_plugin_monitor.go:148] Custom plugin monitor stopped: /config/network-monitor.json
I0722 15:16:44.087520      13 custom_plugin_monitor.go:118] Stop custom plugin monitor /config/pci-monitor.json
I0722 15:16:44.113144      13 plugin.go:63] Stopping plugin execution
I0722 15:16:44.113175      13 plugin.go:276] Stop plugin execution
I0722 15:16:44.113194      13 custom_plugin_monitor.go:148] Custom plugin monitor stopped: /config/pci-monitor.json
I0722 15:16:44.113212      13 log_monitor.go:191] Stop log monitor /config/kernel-kmsg.json
E0722 15:16:44.113249      13 logger.go:18] error reading /dev/kmsg: read /dev/kmsg: file already closed
E0722 15:16:44.113268      13 log_watcher_linux.go:120] Kmsg channel closed, reviving parser
E0722 15:16:49.116689      13 log_watcher_linux.go:120] Kmsg channel closed, reviving parser
I0722 15:16:54.117097      13 log_watcher_linux.go:108] Stop watching kernel log
I0722 15:16:54.117128      13 log_monitor.go:212] Log monitor stopped: /config/kernel-kmsg.json
I0722 15:16:54.117142      13 node_problem_detector_unix.go:54] Reloading node problem detector due to config change
I0722 15:16:54.117326      13 custom_plugin_monitor.go:80] Finish parsing custom plugin monitor config file /config/network-monitor.json: {Plugin:custom PluginGlobalConfig:{InvokeIntervalString:0xc0008f41a0 TimeoutString:0xc0008f41b0 InvokeInterval:1m0s Timeout:50s MaxOutputLength:0xc0008a29d0 Concurrency:0xc0008a29d8 EnableMessageChangeBasedConditionUpdate:0x2fec7f1 SkipInitialStatus:0x2fec7f2} Source:network-monitor DefaultConditions:[{Type:DNSFailure Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:DNSIsOk Message:DNS lookups are working} {Type:ConnectivityFailure Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:ConnectionIsOk Message:Internal connectivity is working} {Type:APIFailure Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:ConnectionIsOk Message:Internal connectivity to K8S APIServer is working} {Type:PubFailure Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:ConnectionIsOk Message:Public Internet connectivity is working}] Rules:[0xc0000fe230 0xc0000fe2a0 0xc0000fe310 0xc0000fe380] EnableMetricsReporting:0xc0008a29ef}
I0722 15:16:54.117493      13 custom_plugin_monitor.go:80] Finish parsing custom plugin monitor config file /config/pci-monitor.json: {Plugin:custom PluginGlobalConfig:{InvokeIntervalString:0xc0008f4420 TimeoutString:0xc0008f4430 InvokeInterval:20s Timeout:50s MaxOutputLength:0xc0008a2cf0 Concurrency:0xc0008a2cf8 EnableMessageChangeBasedConditionUpdate:0xc0008a2d00 SkipInitialStatus:0x2fec7f2} Source:pci-monitor DefaultConditions:[{Type:GPUPCIFault Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:PCIIsOk Message:VFIO PCI Connectivity is OK} {Type:InfiniBandLinkFault Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:LinkIsOk Message:InfiniBand Interfaces OK}] Rules:[0xc0000fe620 0xc0000fe850 0xc0000fe8c0] EnableMetricsReporting:0xc0008a2d0c}
I0722 15:16:54.118370      13 log_monitor.go:82] Finish parsing log monitor config file /config/kernel-kmsg.json: {WatcherConfig:{Plugin:kmsg PluginConfig:map[refresh:true refreshDurationSeconds:10 revive:true] SkipList:[] LogPath:/dev/kmsg Lookback:5m Delay:} BufferSize:1000 Source: DefaultConditions:[{Type:KernelDeadlock Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:Kernel has no deadlock} {Type:KernelHardlock Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoCPUHardLockup Message:Kernel has no CPU Hard Lockup} {Type:ReadonlyFilesystem Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:FilesystemIsNotReadOnly Message:Filesystem is not read-only} {Type:LocalDiskErrors Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoDiskErrors Message:Local NVMe is healthy} {Type:GPUXID149 Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:GPUs are not reporting xid 149} {Type:GPUWantsReset Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:GPUs are not reporting non-fatal errors} {Type:GPUChannelRetirement Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:GPUs are not reporting any channel retirement errors} {Type:GPUChannelRetirementFailure Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:No channels have failed retirement} {Type:GPURowRemapFailure Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:RowRemapOk Message:No rows have failed remapping} {Type:GPUECCUncorrectableError Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoECCError Message:No GPUs have triggered an ECC uncorrectable error} {Type:GPUInvalidPushBuffer Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:GPUs are not reporting invalid push buffer} {Type:GPUContextSwitchFault Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:GPUs are not reporting context switch fault} {Type:GPUFault Status: 
Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:GPUs are not reporting errors} {Type:IBPCIUnavailable Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:IBPCIAvailable Message:InfiniBand adapters are not reporting PCI slot unavailable} {Type:GPUFallenOffBus Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No Xid 79 detected} {Type:GPUGSPTimeoutXid119 Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No Xid 119 detected} {Type:GPUContextSwitchTimeoutXid109 Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No Xid 109 detected} {Type:GPUGSPPanicXid120 Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No Xid 120 detected} {Type:PersistentStorageFault Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoStorageErrors Message:Storage subsystem is not reporting any errors} {Type:HardwareErrorFatal Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoHardwareErrorFatal Message:Platform is not reporting any fatal hardware errors} {Type:HardwareErrorInterruptCPU Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoInterruptsDetected Message:Platform is reporting no hardware errors via APEI for CPU} {Type:HardwareErrorInterruptMemory Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoInterruptsDetected Message:Platform is reporting no hardware errors via APEI for Memory} {Type:NVLinkXIDFatal Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No XID 144-150, 154-157 detected} {Type:NVLinkXIDFatalSwitch Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No XID 144-150, 154-157 detected from switch side} {Type:NVLinkMaskError Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMaskError Message:No GPUs are reporting link mask errors} {Type:Sector0LocalDiskErrors Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoSector0NVMEErrors Message:No nvme i/o errors 
detected on sector 0} {Type:SuspectedLocalDiskErrors Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoSuspectedNVMEErrors Message:No nvme i/o errors detected}] Rules:[{Type:temporary Condition: Reason:SystemOOMKilling Pattern:Out of memory: Killed process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:CGroupOOMKilling Pattern:Memory cgroup out of memory: Killed process \d+ (.*) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more than \w+ seconds\. PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:UnregisterNetDevice Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+ PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:KernelOops Pattern:BUG: unable to handle kernel NULL pointer dereference at .* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:KernelOops Pattern:divide error: 0000 \[#\d+\] SMP PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:HardwareErrorRecoverable Pattern:\[Hardware Error\]: event severity: recoverable PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:HardwareErrorCorrected Pattern:\[Hardware Error\]: event severity: corrected PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:HardwareErrorInfo Pattern:\[Hardware Error\]: event severity: info PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:PCIAER Pattern:AER: aer_status: .* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:NVSXidNonFatal Pattern:nvidia-nvswitch\d: SXid .* Non-fatal, .* PatternGeneratedMessageSuffix:} {Type:temporary Condition:NVLinkXIDNonfatal Reason: Pattern:NVRM: Xid \(PCI.+\): (14[4-9]|150|15[4-7]),.*Nonfatal.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:NVLinkXIDFatal Reason:NVSwitch XID indicates fatal error 
Pattern:NVRM: Xid \(PCI.+\): (14[4-9]|150|15[4-7]),.*Fatal.* \(0x0[0-1|3-9].* PatternGeneratedMessageSuffix:} {Type:permanent Condition:NVLinkXIDFatalSwitch Reason:NVSwitch XID indicates fatal error from switch side Pattern:NVRM: Xid \(PCI.+\): (14[4-9]|150|15[4-7]),.*Fatal.* \(0x02.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:NVLinkMaskError Reason:GPUs are reporting Link mask errors Pattern:NVRM: (knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask|NVRM: knvlinkDiscoverPostRxDetLinks.*: Getting peer..s postRxDetLinkMask failed).* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUWantsReset Reason:NVSwitch XID indicates GPU reset required Pattern:NVRM: Xid \(PCI.+\): (14[4-8]|150|15[4-7]),.*GPU Reset Required.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUXID149 Reason:NVSwitch XID indicates GPU reset required Pattern:NVRM: Xid \(PCI.+\): (149),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUChannelRetirement Reason:GPU is reporting memory channel retirement due to repeat uncorrectable errors Pattern:NVRM: Xid \(PCI.+\): (160),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUChannelRetirementFailure Reason:GPU is reporting memory channel retirement failure Pattern:NVRM: Xid \(PCI.+\): (161),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:KernelDeadlock Reason:CPUSoftLockup Pattern:watchdog: BUG: soft lockup .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:KernelHardlock Reason:CPUHardLockup Pattern:NMI watchdog: Watchdog detected hard LOCKUP .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Pattern:task umount\.aufs:\w+ blocked for more than \w+ seconds\. PatternGeneratedMessageSuffix:} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ seconds\. 
PatternGeneratedMessageSuffix:} {Type:permanent Condition:ReadonlyFilesystem Reason:FilesystemIsReadOnly Pattern:Remounting filesystem read-only PatternGeneratedMessageSuffix:} {Type:permanent Condition:LocalDiskErrors Reason:XFSCorruption Pattern:XFS \(((.{0,4})|(.+loop.*)|(loo[^p].*)|(lo[^o].*)|(l[^o].*))\).*Corruption detected.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:SuspectedLocalDiskErrors Reason:IOError Pattern:I/O error, dev nvme.*, sector (?:[^0].+) .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:Sector0LocalDiskErrors Reason:IOError Pattern:I/O error, dev nvme.*, sector (?:[0].+) .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:LocalDiskErrors Reason:IOError Pattern:nvme.+ I/O .+ timeout.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:LocalDiskErrors Reason:CriticalMediumError Pattern:critical medium error, dev nvme.+ PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFault Reason:CUDASegFault Pattern:cuda-.+ segfault at .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFault Reason:GPUIsReportingErrors Pattern:NVRM: Xid \(PCI.+\): (1|2|3|4|5|6|7|8|9|10|11|12|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|33|34|35|36|37|38|39|40|41|42|46|47|49|50|51|52|53|54|55|56|57|58|59|60|61|62|65|66|67|69|70|71|72|73|74|75|76|77|78|80|81|82|83|84|85|86|87|88|89|90|91|92|93|96|97|98|99|100|101|102|103|104|105|106|107|108|110|111|112|113|114|115|116|117|118),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFault Reason:GPUIsReportingErrors Pattern:NVRM: Rate limiting GSP RPC error prints for GPU at PCI:.+ \(printing .+ of every .+\).  
The GPU likely needs to be reset.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUInvalidPushBuffer Reason:GPUIsReportingInvalidPushBuffer Pattern:NVRM: Xid \(PCI.+\): 32,.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUContextSwitchFault Reason:GPUIsReportingContextSwitchFault Pattern:NVRM: Xid \(PCI.+\): 44,.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUWantsReset Reason:GPU Has a pending row remap Pattern:NVRM: Xid \(PCI.+\): (63),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPURowRemapFailure Reason:GPU Failed a row remap Pattern:NVRM: Xid \(PCI.+\): (64),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUECCUncorrectableError Reason:GPU has encountered an uncorrectable ECC error Pattern:NVRM: Xid \(PCI.+\): (48|94|95),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFallenOffBus Reason:A GPU has fallen off the bus Pattern:NVRM: Xid \(PCI.+\): (79),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUGSPTimeoutXid119 Reason:GPU System Processor is failing to respond, likely crashed or deadlocked Pattern:NVRM: Xid \(PCI.+\): (119),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUContextSwitchTimeoutXid109 Reason:GPU is reporting a Xid 109 Pattern:NVRM: Xid \(PCI.+\): (109),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUGSPPanicXid120 Reason:GPU is reporting a GSP task panic Pattern:NVRM: Xid \(PCI.+\): (120),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:NVSwitchFailure Reason:A NVSwitch has failed Pattern:nvidia-nvswitch.: SXid \(PCI.+\): (1900[4-6]|1901[3-7]|1904[6-8]|1905(3|4|6|8)|1906(0|1|3|4|6|7|9)|19070|20034|220(03|12)|2300[1-9]|2301[1-7]|2400[4-6]|2600[1-7]|2900(2|4)|3000(2|4)),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFault Reason:GPUSmiError Pattern:CW: GPU .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:IBPCIUnavailable Reason:PCILost Pattern:mlx5_core.*PCI slot is 
unavailable.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:PersistentStorageFault Reason:CephFSQuotaError Pattern:ceph: get_quota_realm: ino .+ PatternGeneratedMessageSuffix:} {Type:temporary Condition:NFSStorageFault Reason:NFSNotResponding Pattern:nfs: server .+ not responding.+ PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFault Reason:ROMError Pattern:.*Invalid PCI ROM header signature.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:HardwareErrorFatal Reason:HardwareErrorFatal Pattern:\[Hardware Error\]: event severity: fatal.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:HardwareErrorInterruptPCIe Pattern:\[Hardware Error\]:   section_type: PCIe error PatternGeneratedMessageSuffix:} {Type:permanent Condition:HardwareErrorInterruptCPU Reason:HardwareErrorFromAPEI Pattern:\[Hardware Error\]:   section_type: .* processor error PatternGeneratedMessageSuffix:} {Type:permanent Condition:HardwareErrorInterruptMemory Reason:HardwareErrorFromAPEI Pattern:\[Hardware Error\]:   section_type: memory error PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:HardwareErrorInterruptUnknown Pattern:\[DY Error\]:   section_type: unknown.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:KmsgWatchLoopStarted Pattern:\[npd-internal\] Entering watch loop for kernel log PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:KmsgParserRevived Pattern:\[npd-internal\] Reviving.*parser.* PatternGeneratedMessageSuffix:}] EnableMetricsReporting:0x2f38e62}
I0722 15:16:54.118563      13 log_watchers.go:40] Use log watcher of plugin "kmsg"
I0722 15:16:54.119612      13 log_monitor.go:82] Finish parsing log monitor config file /config/kernel-journald.json: {WatcherConfig:{Plugin:journald PluginConfig:map[source:kernel] SkipList:[] LogPath:/var/log/journal Lookback:5m Delay:} BufferSize:1000 Source: DefaultConditions:[{Type:KernelDeadlock Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:Kernel has no deadlock} {Type:KernelHardlock Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoCPUHardLockup Message:Kernel has no CPU Hard Lockup} {Type:ReadonlyFilesystem Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:FilesystemIsNotReadOnly Message:Filesystem is not read-only} {Type:LocalDiskErrors Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoDiskErrors Message:Local NVMe is healthy} {Type:GPUXID149 Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:GPUs are not reporting xid 149} {Type:GPUWantsReset Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:GPUs are not reporting non-fatal errors} {Type:GPUChannelRetirement Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:GPUs are not reporting any channel retirement errors} {Type:GPUChannelRetirementFailure Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:No channels have failed retirement} {Type:GPURowRemapFailure Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:RowRemapOk Message:No rows have failed remapping} {Type:GPUECCUncorrectableError Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoECCError Message:No GPUs have triggered an ECC uncorrectable error} {Type:GPUInvalidPushBuffer Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:GPUs are not reporting invalid push buffer} {Type:GPUContextSwitchFault Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoGPUErrors Message:GPUs are not reporting context switch fault} {Type:GPUFault Status: Transition:0001-01-01 
00:00:00 +0000 UTC Reason:NoGPUErrors Message:GPUs are not reporting errors} {Type:IBPCIUnavailable Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:IBPCIAvailable Message:InfiniBand adapters are not reporting PCI slot unavailable} {Type:GPUFallenOffBus Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No Xid 79 detected} {Type:GPUGSPTimeoutXid119 Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No Xid 119 detected} {Type:GPUContextSwitchTimeoutXid109 Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No Xid 109 detected} {Type:GPUGSPPanicXid120 Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No Xid 120 detected} {Type:PersistentStorageFault Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoStorageErrors Message:Storage subsystem is not reporting any errors} {Type:HardwareErrorFatal Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoHardwareErrorFatal Message:Platform is not reporting any fatal hardware errors} {Type:HardwareErrorInterruptCPU Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoInterruptsDetected Message:Platform is reporting no hardware errors via APEI for CPU} {Type:HardwareErrorInterruptMemory Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoInterruptsDetected Message:Platform is reporting no hardware errors via APEI for Memory} {Type:NVLinkXIDFatal Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No XID 144-150, 154-157 detected} {Type:NVLinkXIDFatalSwitch Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMatchingXid Message:No XID 144-150, 154-157 detected from switch side} {Type:NVLinkMaskError Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoMaskError Message:No GPUs are reporting link mask errors} {Type:Sector0LocalDiskErrors Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoSector0NVMEErrors Message:No nvme i/o errors detected on sector 0} 
{Type:SuspectedLocalDiskErrors Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoSuspectedNVMEErrors Message:No nvme i/o errors detected}] Rules:[{Type:temporary Condition: Reason:SystemOOMKilling Pattern:Out of memory: Killed process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:CGroupOOMKilling Pattern:Memory cgroup out of memory: Killed process \d+ (.*) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more than \w+ seconds\. PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:UnregisterNetDevice Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+ PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:KernelOops Pattern:BUG: unable to handle kernel NULL pointer dereference at .* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:KernelOops Pattern:divide error: 0000 \[#\d+\] SMP PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:HardwareErrorRecoverable Pattern:\[Hardware Error\]: event severity: recoverable PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:HardwareErrorCorrected Pattern:\[Hardware Error\]: event severity: corrected PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:HardwareErrorInfo Pattern:\[Hardware Error\]: event severity: info PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:PCIAER Pattern:AER: aer_status: .* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:NVSXidNonFatal Pattern:nvidia-nvswitch\d: SXid .* Non-fatal, .* PatternGeneratedMessageSuffix:} {Type:temporary Condition:NVLinkXIDNonfatal Reason: Pattern:NVRM: Xid \(PCI.+\): (14[4-9]|150|15[4-7]),.*Nonfatal.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:NVLinkXIDFatal Reason:NVSwitch XID indicates fatal error Pattern:NVRM: Xid \(PCI.+\): 
(14[4-9]|150|15[4-7]),.*Fatal.* \(0x0[0-1|3-9].* PatternGeneratedMessageSuffix:} {Type:permanent Condition:NVLinkXIDFatalSwitch Reason:NVSwitch XID indicates fatal error from switch side Pattern:NVRM: Xid \(PCI.+\): (14[4-9]|150|15[4-7]),.*Fatal.* \(0x02.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:NVLinkMaskError Reason:GPUs are reporting Link mask errors Pattern:NVRM: (knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask|NVRM: knvlinkDiscoverPostRxDetLinks.*: Getting peer..s postRxDetLinkMask failed).* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUWantsReset Reason:NVSwitch XID indicates GPU reset required Pattern:NVRM: Xid \(PCI.+\): (14[4-8]|150|15[4-7]),.*GPU Reset Required.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUXID149 Reason:NVSwitch XID indicates GPU reset required Pattern:NVRM: Xid \(PCI.+\): (149),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUChannelRetirement Reason:GPU is reporting memory channel retirement due to repeat uncorrectable errors Pattern:NVRM: Xid \(PCI.+\): (160),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUChannelRetirementFailure Reason:GPU is reporting memory channel retirement failure Pattern:NVRM: Xid \(PCI.+\): (161),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:KernelDeadlock Reason:CPUSoftLockup Pattern:watchdog: BUG: soft lockup .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:KernelHardlock Reason:CPUHardLockup Pattern:NMI watchdog: Watchdog detected hard LOCKUP .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Pattern:task umount\.aufs:\w+ blocked for more than \w+ seconds\. PatternGeneratedMessageSuffix:} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ seconds\. 
PatternGeneratedMessageSuffix:} {Type:permanent Condition:ReadonlyFilesystem Reason:FilesystemIsReadOnly Pattern:Remounting filesystem read-only PatternGeneratedMessageSuffix:} {Type:permanent Condition:LocalDiskErrors Reason:XFSCorruption Pattern:XFS \(((.{0,4})|(.+loop.*)|(loo[^p].*)|(lo[^o].*)|(l[^o].*))\).*Corruption detected.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:SuspectedLocalDiskErrors Reason:IOError Pattern:I/O error, dev nvme.*, sector (?:[^0].+) .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:Sector0LocalDiskErrors Reason:IOError Pattern:I/O error, dev nvme.*, sector (?:[0].+) .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:LocalDiskErrors Reason:IOError Pattern:nvme.+ I/O .+ timeout.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:LocalDiskErrors Reason:CriticalMediumError Pattern:critical medium error, dev nvme.+ PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFault Reason:CUDASegFault Pattern:cuda-.+ segfault at .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFault Reason:GPUIsReportingErrors Pattern:NVRM: Xid \(PCI.+\): (1|2|3|4|5|6|7|8|9|10|11|12|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|33|34|35|36|37|38|39|40|41|42|46|47|49|50|51|52|53|54|55|56|57|58|59|60|61|62|65|66|67|69|70|71|72|73|74|75|76|77|78|80|81|82|83|84|85|86|87|88|89|90|91|92|93|96|97|98|99|100|101|102|103|104|105|106|107|108|110|111|112|113|114|115|116|117|118),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFault Reason:GPUIsReportingErrors Pattern:NVRM: Rate limiting GSP RPC error prints for GPU at PCI:.+ \(printing .+ of every .+\).  
The GPU likely needs to be reset.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUInvalidPushBuffer Reason:GPUIsReportingInvalidPushBuffer Pattern:NVRM: Xid \(PCI.+\): 32,.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUContextSwitchFault Reason:GPUIsReportingContextSwitchFault Pattern:NVRM: Xid \(PCI.+\): 44,.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUWantsReset Reason:GPU Has a pending row remap Pattern:NVRM: Xid \(PCI.+\): (63),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPURowRemapFailure Reason:GPU Failed a row remap Pattern:NVRM: Xid \(PCI.+\): (64),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUECCUncorrectableError Reason:GPU has encountered an uncorrectable ECC error Pattern:NVRM: Xid \(PCI.+\): (48|94|95),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFallenOffBus Reason:A GPU has fallen off the bus Pattern:NVRM: Xid \(PCI.+\): (79),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUGSPTimeoutXid119 Reason:GPU System Processor is failing to respond, likely crashed or deadlocked Pattern:NVRM: Xid \(PCI.+\): (119),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUContextSwitchTimeoutXid109 Reason:GPU is reporting a Xid 109 Pattern:NVRM: Xid \(PCI.+\): (109),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUGSPPanicXid120 Reason:GPU is reporting a GSP task panic Pattern:NVRM: Xid \(PCI.+\): (120),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:NVSwitchFailure Reason:A NVSwitch has failed Pattern:nvidia-nvswitch.: SXid \(PCI.+\): (1900[4-6]|1901[3-7]|1904[6-8]|1905(3|4|6|8)|1906(0|1|3|4|6|7|9)|19070|20034|220(03|12)|2300[1-9]|2301[1-7]|2400[4-6]|2600[1-7]|2900(2|4)|3000(2|4)),.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFault Reason:GPUSmiError Pattern:CW: GPU .* PatternGeneratedMessageSuffix:} {Type:permanent Condition:IBPCIUnavailable Reason:PCILost Pattern:mlx5_core.*PCI slot is 
unavailable.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:PersistentStorageFault Reason:CephFSQuotaError Pattern:ceph: get_quota_realm: ino .+ PatternGeneratedMessageSuffix:} {Type:temporary Condition:NFSStorageFault Reason:NFSNotResponding Pattern:nfs: server .+ not responding.+ PatternGeneratedMessageSuffix:} {Type:permanent Condition:GPUFault Reason:ROMError Pattern:.*Invalid PCI ROM header signature.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:HardwareErrorFatal Reason:HardwareErrorFatal Pattern:\[Hardware Error\]: event severity: fatal.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:HardwareErrorInterruptPCIe Pattern:\[Hardware Error\]:   section_type: PCIe error PatternGeneratedMessageSuffix:} {Type:permanent Condition:HardwareErrorInterruptCPU Reason:HardwareErrorFromAPEI Pattern:\[Hardware Error\]:   section_type: .* processor error PatternGeneratedMessageSuffix:} {Type:permanent Condition:HardwareErrorInterruptMemory Reason:HardwareErrorFromAPEI Pattern:\[Hardware Error\]:   section_type: memory error PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:HardwareErrorInterruptUnknown Pattern:\[DY Error\]:   section_type: unknown.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:JournaldKernelWatchLoopStarted Pattern:\[npd-internal\] Entering journald watch loop.*kernel.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:JournaldKernelFailedToGetNextEntry Pattern:\[npd-internal\] Failed to get next journald entry.*kernel.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:JournaldKernelFailedToGetEntry Pattern:\[npd-internal\] Failed to get journald entry.*kernel.* PatternGeneratedMessageSuffix:}] EnableMetricsReporting:0x2f38e62}
I0722 15:16:54.119822      13 log_watchers.go:40] Use log watcher of plugin "journald"
I0722 15:16:54.120159      13 log_monitor.go:82] Finish parsing log monitor config file /config/docker-monitor.json: {WatcherConfig:{Plugin:journald PluginConfig:map[source:dockerd] SkipList:[] LogPath:/var/log/journal Lookback:5m Delay:} BufferSize:10 Source:docker-monitor DefaultConditions:[] Rules:[{Type:temporary Condition: Reason:CorruptDockerImage Pattern:Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/(.+) /var/lib/docker/image/(.+): directory not empty.* PatternGeneratedMessageSuffix:}] EnableMetricsReporting:0x2f38e62}
I0722 15:16:54.120179      13 log_watchers.go:40] Use log watcher of plugin "journald"
I0722 15:16:54.120306      13 log_monitor.go:82] Finish parsing log monitor config file /config/kubelet.json: {WatcherConfig:{Plugin:journald PluginConfig:map[source:kubelet] SkipList:[] LogPath:/var/log/journal Lookback:5m Delay:} BufferSize:10 Source: DefaultConditions:[{Type:RunContainerError Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoRunContainerError Message:No RunContainerErrors present} {Type:KillContainerFailed Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoKillContainerFailed Message:No KillContainerFailed Errors present}] Rules:[{Type:temporary Condition: Reason:JournaldKubeletWatchLoopStarted Pattern:\[npd-internal\] Entering journald watch loop.*kubelet.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:JournaldKubeletFailedToGetNextEntry Pattern:\[npd-internal\] Failed to get next journald entry.*kubelet.* PatternGeneratedMessageSuffix:} {Type:temporary Condition: Reason:JournaldKubeletFailedToGetEntry Pattern:\[npd-internal\] Failed to get journald entry.*kubelet.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:RunContainerError Reason:ContextDeadlineExceeded Pattern:.*rror syncing pod.*RunContainerError.*context deadline exceeded.* PatternGeneratedMessageSuffix:} {Type:permanent Condition:KillContainerFailed Reason:FailedToKillHPCVerificationContainer Pattern:.*ill container failed.*hpc-verification.* PatternGeneratedMessageSuffix:}] EnableMetricsReporting:0x2f38e62}
I0722 15:16:54.120341      13 log_watchers.go:40] Use log watcher of plugin "journald"
I0722 15:16:54.120715      13 k8s_exporter.go:56] Waiting for kube-apiserver to be ready (timeout 5m0s)...
I0722 15:16:54.126688      13 problem_client.go:128] Deleting deprecated conditions [GPUApplicationError GPUMMUErrorXid31 HardwareErrorInterruptPCIe HardwareErrorInterruptUnknown] (if present)...
I0722 15:16:54.128047      13 problem_client.go:159] No deprecated conditions to delete
I0722 15:16:54.128072      13 node_problem_detector.go:59] K8s exporter started.
I0722 15:16:54.128255      13 node_problem_detector.go:63] Prometheus exporter started.
I0722 15:16:54.128270      13 custom_plugin_monitor.go:111] Start custom plugin monitor /config/network-monitor.json
I0722 15:16:54.128277      13 custom_plugin_monitor.go:111] Start custom plugin monitor /config/pci-monitor.json
I0722 15:16:54.128286      13 log_monitor.go:166] Start log monitor /config/kernel-kmsg.json
I0722 15:16:54.128526      13 custom_plugin_monitor.go:312] Initialized conditions for /config/pci-monitor.json: [{Type:GPUPCIFault Status:False Transition:2025-07-22 15:16:54.128515826 +0000 UTC m=+1010.114844724 Reason:PCIIsOk Message:VFIO PCI Connectivity is OK} {Type:InfiniBandLinkFault Status:False Transition:2025-07-22 15:16:54.128515926 +0000 UTC m=+1010.114844834 Reason:LinkIsOk Message:InfiniBand Interfaces OK}]
I0722 15:16:54.128581      13 custom_plugin_monitor.go:301] Sending initial status for pci-monitor with conditions: [{Type:GPUPCIFault Status:False Transition:2025-07-22 15:16:54.128515826 +0000 UTC m=+1010.114844724 Reason:PCIIsOk Message:VFIO PCI Connectivity is OK} {Type:InfiniBandLinkFault Status:False Transition:2025-07-22 15:16:54.128515926 +0000 UTC m=+1010.114844834 Reason:LinkIsOk Message:InfiniBand Interfaces OK}]
I0722 15:16:54.128605      13 custom_plugin_monitor.go:312] Initialized conditions for /config/network-monitor.json: [{Type:DNSFailure Status:False Transition:2025-07-22 15:16:54.128598426 +0000 UTC m=+1010.114927324 Reason:DNSIsOk Message:DNS lookups are working} {Type:ConnectivityFailure Status:False Transition:2025-07-22 15:16:54.128598516 +0000 UTC m=+1010.114927414 Reason:ConnectionIsOk Message:Internal connectivity is working} {Type:APIFailure Status:False Transition:2025-07-22 15:16:54.128598586 +0000 UTC m=+1010.114927484 Reason:ConnectionIsOk Message:Internal connectivity to K8S APIServer is working} {Type:PubFailure Status:False Transition:2025-07-22 15:16:54.128598656 +0000 UTC m=+1010.114927554 Reason:ConnectionIsOk Message:Public Internet connectivity is working}]
I0722 15:16:54.128671      13 custom_plugin_monitor.go:301] Sending initial status for network-monitor with conditions: [{Type:DNSFailure Status:False Transition:2025-07-22 15:16:54.128598426 +0000 UTC m=+1010.114927324 Reason:DNSIsOk Message:DNS lookups are working} {Type:ConnectivityFailure Status:False Transition:2025-07-22 15:16:54.128598516 +0000 UTC m=+1010.114927414 Reason:ConnectionIsOk Message:Internal connectivity is working} {Type:APIFailure Status:False Transition:2025-07-22 15:16:54.128598586 +0000 UTC m=+1010.114927484 Reason:ConnectionIsOk Message:Internal connectivity to K8S APIServer is working} {Type:PubFailure Status:False Transition:2025-07-22 15:16:54.128598656 +0000 UTC m=+1010.114927554 Reason:ConnectionIsOk Message:Public Internet connectivity is working}]
I0722 15:16:54.131238      13 log_monitor.go:174] Log monitor /config/kernel-kmsg.json is configured to refresh periodically
I0722 15:16:54.131269      13 log_monitor.go:166] Start log monitor /config/kernel-journald.json
E0722 15:16:54.131293      13 problem_detector.go:57] Failed to start monitor &{/config/kernel-journald.json 0xc000052c60 0xc0006bd8c0 {{journald map[source:kernel] [] /var/log/journal 5m } 1000  [{KernelDeadlock  {0 0 <nil>} KernelHasNoDeadlock Kernel has no deadlock} {KernelHardlock  {0 0 <nil>} NoCPUHardLockup Kernel has no CPU Hard Lockup} {ReadonlyFilesystem  {0 0 <nil>} FilesystemIsNotReadOnly Filesystem is not read-only} {LocalDiskErrors  {0 0 <nil>} NoDiskErrors Local NVMe is healthy} {GPUXID149  {0 0 <nil>} NoGPUErrors GPUs are not reporting xid 149} {GPUWantsReset  {0 0 <nil>} NoGPUErrors GPUs are not reporting non-fatal errors} {GPUChannelRetirement  {0 0 <nil>} NoGPUErrors GPUs are not reporting any channel retirement errors} {GPUChannelRetirementFailure  {0 0 <nil>} NoGPUErrors No channels have failed retirement} {GPURowRemapFailure  {0 0 <nil>} RowRemapOk No rows have failed remapping} {GPUECCUncorrectableError  {0 0 <nil>} NoECCError No GPUs have triggered an ECC uncorrectable error} {GPUInvalidPushBuffer  {0 0 <nil>} NoGPUErrors GPUs are not reporting invalid push buffer} {GPUContextSwitchFault  {0 0 <nil>} NoGPUErrors GPUs are not reporting context switch fault} {GPUFault  {0 0 <nil>} NoGPUErrors GPUs are not reporting errors} {IBPCIUnavailable  {0 0 <nil>} IBPCIAvailable InfiniBand adapters are not reporting PCI slot unavailable} {GPUFallenOffBus  {0 0 <nil>} NoMatchingXid No Xid 79 detected} {GPUGSPTimeoutXid119  {0 0 <nil>} NoMatchingXid No Xid 119 detected} {GPUContextSwitchTimeoutXid109  {0 0 <nil>} NoMatchingXid No Xid 109 detected} {GPUGSPPanicXid120  {0 0 <nil>} NoMatchingXid No Xid 120 detected} {PersistentStorageFault  {0 0 <nil>} NoStorageErrors Storage subsystem is not reporting any errors} {HardwareErrorFatal  {0 0 <nil>} NoHardwareErrorFatal Platform is not reporting any fatal hardware errors} {HardwareErrorInterruptCPU  {0 0 <nil>} NoInterruptsDetected Platform is reporting no hardware errors via APEI for CPU} 
{HardwareErrorInterruptMemory  {0 0 <nil>} NoInterruptsDetected Platform is reporting no hardware errors via APEI for Memory} {NVLinkXIDFatal  {0 0 <nil>} NoMatchingXid No XID 144-150, 154-157 detected} {NVLinkXIDFatalSwitch  {0 0 <nil>} NoMatchingXid No XID 144-150, 154-157 detected from switch side} {NVLinkMaskError  {0 0 <nil>} NoMaskError No GPUs are reporting link mask errors} {Sector0LocalDiskErrors  {0 0 <nil>} NoSector0NVMEErrors No nvme i/o errors detected on sector 0} {SuspectedLocalDiskErrors  {0 0 <nil>} NoSuspectedNVMEErrors No nvme i/o errors detected}] [{temporary  SystemOOMKilling Out of memory: Killed process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB.* } {temporary  CGroupOOMKilling Memory cgroup out of memory: Killed process \d+ (.*) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB.* } {temporary  TaskHung task \S+:\w+ blocked for more than \w+ seconds\. } {temporary  UnregisterNetDevice unregister_netdevice: waiting for \w+ to become free. Usage count = \d+ } {temporary  KernelOops BUG: unable to handle kernel NULL pointer dereference at .* } {temporary  KernelOops divide error: 0000 \[#\d+\] SMP } {temporary  HardwareErrorRecoverable \[Hardware Error\]: event severity: recoverable } {temporary  HardwareErrorCorrected \[Hardware Error\]: event severity: corrected } {temporary  HardwareErrorInfo \[Hardware Error\]: event severity: info } {temporary  PCIAER AER: aer_status: .* } {temporary  NVSXidNonFatal nvidia-nvswitch\d: SXid .* Non-fatal, .* } {temporary NVLinkXIDNonfatal  NVRM: Xid \(PCI.+\): (14[4-9]|150|15[4-7]),.*Nonfatal.* } {permanent NVLinkXIDFatal NVSwitch XID indicates fatal error NVRM: Xid \(PCI.+\): (14[4-9]|150|15[4-7]),.*Fatal.* \(0x0[0-1|3-9].* } {permanent NVLinkXIDFatalSwitch NVSwitch XID indicates fatal error from switch side NVRM: Xid \(PCI.+\): (14[4-9]|150|15[4-7]),.*Fatal.* \(0x02.* } {permanent NVLinkMaskError GPUs are reporting Link mask errors NVRM: (knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx 
Detect Link mask|NVRM: knvlinkDiscoverPostRxDetLinks.*: Getting peer..s postRxDetLinkMask failed).* } {permanent GPUWantsReset NVSwitch XID indicates GPU reset required NVRM: Xid \(PCI.+\): (14[4-8]|150|15[4-7]),.*GPU Reset Required.* } {permanent GPUXID149 NVSwitch XID indicates GPU reset required NVRM: Xid \(PCI.+\): (149),.* } {permanent GPUChannelRetirement GPU is reporting memory channel retirement due to repeat uncorrectable errors NVRM: Xid \(PCI.+\): (160),.* } {permanent GPUChannelRetirementFailure GPU is reporting memory channel retirement failure NVRM: Xid \(PCI.+\): (161),.* } {permanent KernelDeadlock CPUSoftLockup watchdog: BUG: soft lockup .* } {permanent KernelHardlock CPUHardLockup NMI watchdog: Watchdog detected hard LOCKUP .* } {permanent KernelDeadlock AUFSUmountHung task umount\.aufs:\w+ blocked for more than \w+ seconds\. } {permanent KernelDeadlock DockerHung task docker:\w+ blocked for more than \w+ seconds\. } {permanent ReadonlyFilesystem FilesystemIsReadOnly Remounting filesystem read-only } {permanent LocalDiskErrors XFSCorruption XFS \(((.{0,4})|(.+loop.*)|(loo[^p].*)|(lo[^o].*)|(l[^o].*))\).*Corruption detected.* } {permanent SuspectedLocalDiskErrors IOError I/O error, dev nvme.*, sector (?:[^0].+) .* } {permanent Sector0LocalDiskErrors IOError I/O error, dev nvme.*, sector (?:[0].+) .* } {permanent LocalDiskErrors IOError nvme.+ I/O .+ timeout.* } {permanent LocalDiskErrors CriticalMediumError critical medium error, dev nvme.+ } {permanent GPUFault CUDASegFault cuda-.+ segfault at .* } {permanent GPUFault GPUIsReportingErrors NVRM: Xid \(PCI.+\): (1|2|3|4|5|6|7|8|9|10|11|12|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|33|34|35|36|37|38|39|40|41|42|46|47|49|50|51|52|53|54|55|56|57|58|59|60|61|62|65|66|67|69|70|71|72|73|74|75|76|77|78|80|81|82|83|84|85|86|87|88|89|90|91|92|93|96|97|98|99|100|101|102|103|104|105|106|107|108|110|111|112|113|114|115|116|117|118),.* } {permanent GPUFault GPUIsReportingErrors NVRM: Rate limiting GSP 
RPC error prints for GPU at PCI:.+ \(printing .+ of every .+\).  The GPU likely needs to be reset.* } {permanent GPUInvalidPushBuffer GPUIsReportingInvalidPushBuffer NVRM: Xid \(PCI.+\): 32,.* } {permanent GPUContextSwitchFault GPUIsReportingContextSwitchFault NVRM: Xid \(PCI.+\): 44,.* } {permanent GPUWantsReset GPU Has a pending row remap NVRM: Xid \(PCI.+\): (63),.* } {permanent GPURowRemapFailure GPU Failed a row remap NVRM: Xid \(PCI.+\): (64),.* } {permanent GPUECCUncorrectableError GPU has encountered an uncorrectable ECC error NVRM: Xid \(PCI.+\): (48|94|95),.* } {permanent GPUFallenOffBus A GPU has fallen off the bus NVRM: Xid \(PCI.+\): (79),.* } {permanent GPUGSPTimeoutXid119 GPU System Processor is failing to respond, likely crashed or deadlocked NVRM: Xid \(PCI.+\): (119),.* } {permanent GPUContextSwitchTimeoutXid109 GPU is reporting a Xid 109 NVRM: Xid \(PCI.+\): (109),.* } {permanent GPUGSPPanicXid120 GPU is reporting a GSP task panic NVRM: Xid \(PCI.+\): (120),.* } {permanent NVSwitchFailure A NVSwitch has failed nvidia-nvswitch.: SXid \(PCI.+\): (1900[4-6]|1901[3-7]|1904[6-8]|1905(3|4|6|8)|1906(0|1|3|4|6|7|9)|19070|20034|220(03|12)|2300[1-9]|2301[1-7]|2400[4-6]|2600[1-7]|2900(2|4)|3000(2|4)),.* } {permanent GPUFault GPUSmiError CW: GPU .* } {permanent IBPCIUnavailable PCILost mlx5_core.*PCI slot is unavailable.* } {permanent PersistentStorageFault CephFSQuotaError ceph: get_quota_realm: ino .+ } {temporary NFSStorageFault NFSNotResponding nfs: server .+ not responding.+ } {permanent GPUFault ROMError .*Invalid PCI ROM header signature.* } {permanent HardwareErrorFatal HardwareErrorFatal \[Hardware Error\]: event severity: fatal.* } {temporary  HardwareErrorInterruptPCIe \[Hardware Error\]:   section_type: PCIe error } {permanent HardwareErrorInterruptCPU HardwareErrorFromAPEI \[Hardware Error\]:   section_type: .* processor error } {permanent HardwareErrorInterruptMemory HardwareErrorFromAPEI \[Hardware Error\]:   section_type: memory error } 
{temporary  HardwareErrorInterruptUnknown \[DY Error\]:   section_type: unknown.* } {temporary  JournaldKernelWatchLoopStarted \[npd-internal\] Entering journald watch loop.*kernel.* } {temporary  JournaldKernelFailedToGetNextEntry \[npd-internal\] Failed to get next journald entry.*kernel.* } {temporary  JournaldKernelFailedToGetEntry \[npd-internal\] Failed to get journald entry.*kernel.* }] 0x2f38e62} [] <nil> 0xc00072d6c0 0xc0005ac360}: failed to stat the log path "/var/log/journal": stat /var/log/journal: no such file or directory
I0722 15:16:54.131392      13 log_monitor.go:166] Start log monitor /config/docker-monitor.json
E0722 15:16:54.131408      13 problem_detector.go:57] Failed to start monitor &{/config/docker-monitor.json 0xc000052d80 0xc000574f40 {{journald map[source:dockerd] [] /var/log/journal 5m } 10 docker-monitor [] [{temporary  CorruptDockerImage Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/(.+) /var/lib/docker/image/(.+): directory not empty.* }] 0x2f38e62} [] <nil> 0xc00072dd50 0xc00046d390}: failed to stat the log path "/var/log/journal": stat /var/log/journal: no such file or directory
I0722 15:16:54.131421      13 log_monitor.go:166] Start log monitor /config/kubelet.json
E0722 15:16:54.131434      13 problem_detector.go:57] Failed to start monitor &{/config/kubelet.json 0xc000052ea0 0xc000575100 {{journald map[source:kubelet] [] /var/log/journal 5m } 10  [{RunContainerError  {0 0 <nil>} NoRunContainerError No RunContainerErrors present} {KillContainerFailed  {0 0 <nil>} NoKillContainerFailed No KillContainerFailed Errors present}] [{temporary  JournaldKubeletWatchLoopStarted \[npd-internal\] Entering journald watch loop.*kubelet.* } {temporary  JournaldKubeletFailedToGetNextEntry \[npd-internal\] Failed to get next journald entry.*kubelet.* } {temporary  JournaldKubeletFailedToGetEntry \[npd-internal\] Failed to get journald entry.*kubelet.* } {permanent RunContainerError ContextDeadlineExceeded .*rror syncing pod.*RunContainerError.*context deadline exceeded.* } {permanent KillContainerFailed FailedToKillHPCVerificationContainer .*ill container failed.*hpc-verification.* }] 0x2f38e62} [] <nil> 0xc000765180 0xc00046d4b0}: failed to stat the log path "/var/log/journal": stat /var/log/journal: no such file or directory
I0722 15:16:54.131456      13 problem_detector.go:77] Problem detector started
I0722 15:16:54.131482      13 log_monitor.go:305] Initialize condition generated: [{Type:KernelDeadlock Status:False Transition:2025-07-22 15:16:54.131469488 +0000 UTC m=+1010.117798396 Reason:KernelHasNoDeadlock Message:Kernel has no deadlock} {Type:KernelHardlock Status:False Transition:2025-07-22 15:16:54.131469568 +0000 UTC m=+1010.117798476 Reason:NoCPUHardLockup Message:Kernel has no CPU Hard Lockup} {Type:ReadonlyFilesystem Status:False Transition:2025-07-22 15:16:54.131469648 +0000 UTC m=+1010.117798556 Reason:FilesystemIsNotReadOnly Message:Filesystem is not read-only} {Type:LocalDiskErrors Status:False Transition:2025-07-22 15:16:54.131469728 +0000 UTC m=+1010.117798626 Reason:NoDiskErrors Message:Local NVMe is healthy} {Type:GPUXID149 Status:False Transition:2025-07-22 15:16:54.131469798 +0000 UTC m=+1010.117798696 Reason:NoGPUErrors Message:GPUs are not reporting xid 149} {Type:GPUWantsReset Status:False Transition:2025-07-22 15:16:54.131469868 +0000 UTC m=+1010.117798766 Reason:NoGPUErrors Message:GPUs are not reporting non-fatal errors} {Type:GPUChannelRetirement Status:False Transition:2025-07-22 15:16:54.131469948 +0000 UTC m=+1010.117798846 Reason:NoGPUErrors Message:GPUs are not reporting any channel retirement errors} {Type:GPUChannelRetirementFailure Status:False Transition:2025-07-22 15:16:54.131470008 +0000 UTC m=+1010.117798916 Reason:NoGPUErrors Message:No channels have failed retirement} {Type:GPURowRemapFailure Status:False Transition:2025-07-22 15:16:54.131470088 +0000 UTC m=+1010.117798986 Reason:RowRemapOk Message:No rows have failed remapping} {Type:GPUECCUncorrectableError Status:False Transition:2025-07-22 15:16:54.131470158 +0000 UTC m=+1010.117799056 Reason:NoECCError Message:No GPUs have triggered an ECC uncorrectable error} {Type:GPUInvalidPushBuffer Status:False Transition:2025-07-22 15:16:54.131470228 +0000 UTC m=+1010.117799126 Reason:NoGPUErrors Message:GPUs are not reporting invalid push buffer} {Type:GPUContextSwitchFault 
Status:False Transition:2025-07-22 15:16:54.131470298 +0000 UTC m=+1010.117799196 Reason:NoGPUErrors Message:GPUs are not reporting context switch fault} {Type:GPUFault Status:False Transition:2025-07-22 15:16:54.131470368 +0000 UTC m=+1010.117799266 Reason:NoGPUErrors Message:GPUs are not reporting errors} {Type:IBPCIUnavailable Status:False Transition:2025-07-22 15:16:54.131470438 +0000 UTC m=+1010.117799336 Reason:IBPCIAvailable Message:InfiniBand adapters are not reporting PCI slot unavailable} {Type:GPUFallenOffBus Status:False Transition:2025-07-22 15:16:54.131470508 +0000 UTC m=+1010.117799406 Reason:NoMatchingXid Message:No Xid 79 detected} {Type:GPUGSPTimeoutXid119 Status:False Transition:2025-07-22 15:16:54.131470578 +0000 UTC m=+1010.117799476 Reason:NoMatchingXid Message:No Xid 119 detected} {Type:GPUContextSwitchTimeoutXid109 Status:False Transition:2025-07-22 15:16:54.131470648 +0000 UTC m=+1010.117799546 Reason:NoMatchingXid Message:No Xid 109 detected} {Type:GPUGSPPanicXid120 Status:False Transition:2025-07-22 15:16:54.131470718 +0000 UTC m=+1010.117799616 Reason:NoMatchingXid Message:No Xid 120 detected} {Type:PersistentStorageFault Status:False Transition:2025-07-22 15:16:54.131470788 +0000 UTC m=+1010.117799686 Reason:NoStorageErrors Message:Storage subsystem is not reporting any errors} {Type:HardwareErrorFatal Status:False Transition:2025-07-22 15:16:54.131470858 +0000 UTC m=+1010.117799756 Reason:NoHardwareErrorFatal Message:Platform is not reporting any fatal hardware errors} {Type:HardwareErrorInterruptCPU Status:False Transition:2025-07-22 15:16:54.131470928 +0000 UTC m=+1010.117799836 Reason:NoInterruptsDetected Message:Platform is reporting no hardware errors via APEI for CPU} {Type:HardwareErrorInterruptMemory Status:False Transition:2025-07-22 15:16:54.131471008 +0000 UTC m=+1010.117799906 Reason:NoInterruptsDetected Message:Platform is reporting no hardware errors via APEI for Memory} {Type:NVLinkXIDFatal Status:False 
Transition:2025-07-22 15:16:54.131471088 +0000 UTC m=+1010.117799986 Reason:NoMatchingXid Message:No XID 144-150, 154-157 detected} {Type:NVLinkXIDFatalSwitch Status:False Transition:2025-07-22 15:16:54.131471148 +0000 UTC m=+1010.117800046 Reason:NoMatchingXid Message:No XID 144-150, 154-157 detected from switch side} {Type:NVLinkMaskError Status:False Transition:2025-07-22 15:16:54.131471218 +0000 UTC m=+1010.117800116 Reason:NoMaskError Message:No GPUs are reporting link mask errors} {Type:Sector0LocalDiskErrors Status:False Transition:2025-07-22 15:16:54.131471288 +0000 UTC m=+1010.117800196 Reason:NoSector0NVMEErrors Message:No nvme i/o errors detected on sector 0} {Type:SuspectedLocalDiskErrors Status:False Transition:2025-07-22 15:16:54.131471358 +0000 UTC m=+1010.117800256 Reason:NoSuspectedNVMEErrors Message:No nvme i/o errors detected}]
I0722 15:16:54.131790      13 log_monitor.go:97] Start log monitor config refresh loop for /config/kernel-kmsg.json
I0722 15:16:54.137289      13 log_monitor.go:229] New status generated: &{Source: Events:[{Severity:warn Timestamp:2025-07-22 15:16:54.131366198 +0000 UTC m=+1010.117695116 Reason:KmsgWatchLoopStarted Message:[npd-internal] Entering watch loop for kernel log}] Conditions:[{Type:KernelDeadlock Status:False Transition:2025-07-22 15:16:54.131469488 +0000 UTC m=+1010.117798396 Reason:KernelHasNoDeadlock Message:Kernel has no deadlock} {Type:KernelHardlock Status:False Transition:2025-07-22 15:16:54.131469568 +0000 UTC m=+1010.117798476 Reason:NoCPUHardLockup Message:Kernel has no CPU Hard Lockup} {Type:ReadonlyFilesystem Status:False Transition:2025-07-22 15:16:54.131469648 +0000 UTC m=+1010.117798556 Reason:FilesystemIsNotReadOnly Message:Filesystem is not read-only} {Type:LocalDiskErrors Status:False Transition:2025-07-22 15:16:54.131469728 +0000 UTC m=+1010.117798626 Reason:NoDiskErrors Message:Local NVMe is healthy} {Type:GPUXID149 Status:False Transition:2025-07-22 15:16:54.131469798 +0000 UTC m=+1010.117798696 Reason:NoGPUErrors Message:GPUs are not reporting xid 149} {Type:GPUWantsReset Status:False Transition:2025-07-22 15:16:54.131469868 +0000 UTC m=+1010.117798766 Reason:NoGPUErrors Message:GPUs are not reporting non-fatal errors} {Type:GPUChannelRetirement Status:False Transition:2025-07-22 15:16:54.131469948 +0000 UTC m=+1010.117798846 Reason:NoGPUErrors Message:GPUs are not reporting any channel retirement errors} {Type:GPUChannelRetirementFailure Status:False Transition:2025-07-22 15:16:54.131470008 +0000 UTC m=+1010.117798916 Reason:NoGPUErrors Message:No channels have failed retirement} {Type:GPURowRemapFailure Status:False Transition:2025-07-22 15:16:54.131470088 +0000 UTC m=+1010.117798986 Reason:RowRemapOk Message:No rows have failed remapping} {Type:GPUECCUncorrectableError Status:False Transition:2025-07-22 15:16:54.131470158 +0000 UTC m=+1010.117799056 Reason:NoECCError Message:No GPUs have triggered an ECC uncorrectable error} 
{Type:GPUInvalidPushBuffer Status:False Transition:2025-07-22 15:16:54.131470228 +0000 UTC m=+1010.117799126 Reason:NoGPUErrors Message:GPUs are not reporting invalid push buffer} {Type:GPUContextSwitchFault Status:False Transition:2025-07-22 15:16:54.131470298 +0000 UTC m=+1010.117799196 Reason:NoGPUErrors Message:GPUs are not reporting context switch fault} {Type:GPUFault Status:False Transition:2025-07-22 15:16:54.131470368 +0000 UTC m=+1010.117799266 Reason:NoGPUErrors Message:GPUs are not reporting errors} {Type:IBPCIUnavailable Status:False Transition:2025-07-22 15:16:54.131470438 +0000 UTC m=+1010.117799336 Reason:IBPCIAvailable Message:InfiniBand adapters are not reporting PCI slot unavailable} {Type:GPUFallenOffBus Status:False Transition:2025-07-22 15:16:54.131470508 +0000 UTC m=+1010.117799406 Reason:NoMatchingXid Message:No Xid 79 detected} {Type:GPUGSPTimeoutXid119 Status:False Transition:2025-07-22 15:16:54.131470578 +0000 UTC m=+1010.117799476 Reason:NoMatchingXid Message:No Xid 119 detected} {Type:GPUContextSwitchTimeoutXid109 Status:False Transition:2025-07-22 15:16:54.131470648 +0000 UTC m=+1010.117799546 Reason:NoMatchingXid Message:No Xid 109 detected} {Type:GPUGSPPanicXid120 Status:False Transition:2025-07-22 15:16:54.131470718 +0000 UTC m=+1010.117799616 Reason:NoMatchingXid Message:No Xid 120 detected} {Type:PersistentStorageFault Status:False Transition:2025-07-22 15:16:54.131470788 +0000 UTC m=+1010.117799686 Reason:NoStorageErrors Message:Storage subsystem is not reporting any errors} {Type:HardwareErrorFatal Status:False Transition:2025-07-22 15:16:54.131470858 +0000 UTC m=+1010.117799756 Reason:NoHardwareErrorFatal Message:Platform is not reporting any fatal hardware errors} {Type:HardwareErrorInterruptCPU Status:False Transition:2025-07-22 15:16:54.131470928 +0000 UTC m=+1010.117799836 Reason:NoInterruptsDetected Message:Platform is reporting no hardware errors via APEI for CPU} {Type:HardwareErrorInterruptMemory Status:False 
Transition:2025-07-22 15:16:54.131471008 +0000 UTC m=+1010.117799906 Reason:NoInterruptsDetected Message:Platform is reporting no hardware errors via APEI for Memory} {Type:NVLinkXIDFatal Status:False Transition:2025-07-22 15:16:54.131471088 +0000 UTC m=+1010.117799986 Reason:NoMatchingXid Message:No XID 144-150, 154-157 detected} {Type:NVLinkXIDFatalSwitch Status:False Transition:2025-07-22 15:16:54.131471148 +0000 UTC m=+1010.117800046 Reason:NoMatchingXid Message:No XID 144-150, 154-157 detected from switch side} {Type:NVLinkMaskError Status:False Transition:2025-07-22 15:16:54.131471218 +0000 UTC m=+1010.117800116 Reason:NoMaskError Message:No GPUs are reporting link mask errors} {Type:Sector0LocalDiskErrors Status:False Transition:2025-07-22 15:16:54.131471288 +0000 UTC m=+1010.117800196 Reason:NoSector0NVMEErrors Message:No nvme i/o errors detected on sector 0} {Type:SuspectedLocalDiskErrors Status:False Transition:2025-07-22 15:16:54.131471358 +0000 UTC m=+1010.117800256 Reason:NoSuspectedNVMEErrors Message:No nvme i/o errors detected}]}
