Skip to content

Conversation

cbiffle
Copy link
Collaborator

@cbiffle cbiffle commented Sep 18, 2025

I've become convinced that the "LostInterrupt" failures we've intermittently seen in the I2C driver are due to timer notification mismanagement. I recently merged a commit (dd76a4c) that reworked timer notification handling to eliminate a case where the driver could crash itself in a narrow race condition window. That also eliminated a way to crashed it from other task, by posting to its timer notification bit (bit 31) while it was working -- both cases were fixed by double checking that the timer had actually gone off before responding.

With that fixed, there's only one way remaining to get the LostInterrupt panic, which is: keep i2c_driver from running for 100ms during any I2C transaction. You can do this pretty easily by triggering a large coredump. As a result, the driver contains a system-load-induced panic that isn't detecting any real error condition, and acts as a load multiplier (by triggering crash response and retries).

From what I can tell from the commit history, the software timer (the thing that explodes after 100ms) was added while attempting to track down some other state machine misbehavior, which I believe has since been fixed. Certainly, none of our investigations into the LostInterrupt failures over the past year have yielded any concrete evidence of a hardware bug. They've mostly seemed random... which is what you'd expect if they were being triggered by unrelated system load spikes.

This commit removes the entire mechanism. The hardware SMBus 25ms timeout remains, but that's separate and ran in parallel to this software timer. The path leading to the LostInterrupt panic no longer exists in the code.

Fixes #2004.

I've become convinced that the "LostInterrupt" failures we've
intermittently seen in the I2C driver are due to timer notification
mismanagement. I recently merged a commit (dd76a4c) that reworked
timer notification handling to eliminate a case where the driver could
crash _itself_ in a narrow race condition window. That also eliminated a
way to crashed it from other task, by posting to its timer notification
bit (bit 31) while it was working -- both cases were fixed by double
checking that the timer had actually gone off before responding.

With that fixed, there's only one way remaining to get the LostInterrupt
panic, which is: keep i2c_driver from running for 100ms during any I2C
transaction. You can do this pretty easily by triggering a large
coredump. As a result, the driver contains a system-load-induced panic
that isn't detecting any real error condition, and acts as a load
multiplier (by triggering crash response and retries).

From what I can tell from the commit history, the software timer (the
thing that explodes after 100ms) was added while attempting to track
down some _other_ state machine misbehavior, which I believe has since
been fixed. Certainly, none of our investigations into the LostInterrupt
failures over the past year have yielded any concrete evidence of a
hardware bug. They've mostly seemed random... which is what you'd expect
if they were being triggered by unrelated system load spikes.

This commit removes the entire mechanism. The hardware SMBus 25ms
timeout remains, but that's separate and ran in parallel to this
software timer. The path leading to the LostInterrupt panic no longer
exists in the code.

Fixes #2004.
@cbiffle cbiffle added service processor Related to the service processor. robustness Fixing this would improve robustness of deployed firmware labels Sep 18, 2025
@cbiffle cbiffle requested review from mkeeter and labbott September 18, 2025 21:34
Comment on lines +559 to 560
self.wfi();
sys_irq_control(notification, true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps better suited for a follow-up PR, but it looks like all the callers of wfi immediately call sys_irq_control with the notification mask that was waited on...what do you think about having wfi just do that?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it.

Though I feel like it might be more obviously correct if wfi enabled the interrupt before sleeping. Currently you have to kind of convince yourself that it's enabled, and the routine won't just sleep forever. Lemme squint at it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.....actually, I would argue this crate is currently wrong -- it never turns on the interrupt.

It has an undocumented precondition that any caller turn on the interrupt, which stm32xx-i2c-server happens to do.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to move this work to a followup so I can test it more thoroughly, if that's alright.

Copy link
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @jclulow would say, "timeouts, timeouts, always wrong / some too short, and some too long" :)

@cbiffle
Copy link
Collaborator Author

cbiffle commented Sep 18, 2025

In case anyone's curious, taking Grapefruit as an example, the effect of this change is:

Flash: 9148 bytes -> 8640 bytes
RAM: almost unchanged
Peak stack: 504 -> 408 bytes

@cbiffle cbiffle merged commit e199470 into master Sep 18, 2025
135 checks passed
@cbiffle cbiffle deleted the cbiffle/i2c-remove-lost-interrupt branch September 18, 2025 23:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
robustness Fixing this would improve robustness of deployed firmware service processor Related to the service processor.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

I2C lost interrupt panics
2 participants