-
Notifications
You must be signed in to change notification settings - Fork 207
stm32xx-i2c: remove soft timeout, LostInterrupt #2237
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I've become convinced that the "LostInterrupt" failures we've intermittently seen in the I2C driver are due to timer notification mismanagement. I recently merged a commit (dd76a4c) that reworked timer notification handling to eliminate a case where the driver could crash _itself_ in a narrow race condition window. That also eliminated a way to crashed it from other task, by posting to its timer notification bit (bit 31) while it was working -- both cases were fixed by double checking that the timer had actually gone off before responding. With that fixed, there's only one way remaining to get the LostInterrupt panic, which is: keep i2c_driver from running for 100ms during any I2C transaction. You can do this pretty easily by triggering a large coredump. As a result, the driver contains a system-load-induced panic that isn't detecting any real error condition, and acts as a load multiplier (by triggering crash response and retries). From what I can tell from the commit history, the software timer (the thing that explodes after 100ms) was added while attempting to track down some _other_ state machine misbehavior, which I believe has since been fixed. Certainly, none of our investigations into the LostInterrupt failures over the past year have yielded any concrete evidence of a hardware bug. They've mostly seemed random... which is what you'd expect if they were being triggered by unrelated system load spikes. This commit removes the entire mechanism. The hardware SMBus 25ms timeout remains, but that's separate and ran in parallel to this software timer. The path leading to the LostInterrupt panic no longer exists in the code. Fixes #2004.
self.wfi(); | ||
sys_irq_control(notification, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps better suited for a follow-up PR, but it looks like all the callers of wfi
immediately call sys_irq_control
with the notification mask that was waited on...what do you think about having wfi
just do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it.
Though I feel like it might be more obviously correct if wfi
enabled the interrupt before sleeping. Currently you have to kind of convince yourself that it's enabled, and the routine won't just sleep forever. Lemme squint at it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.....actually, I would argue this crate is currently wrong -- it never turns on the interrupt.
It has an undocumented precondition that any caller turn on the interrupt, which stm32xx-i2c-server
happens to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to move this work to a followup so I can test it more thoroughly, if that's alright.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @jclulow would say, "timeouts, timeouts, always wrong / some too short, and some too long" :)
In case anyone's curious, taking Grapefruit as an example, the effect of this change is: Flash: 9148 bytes -> 8640 bytes |
I've become convinced that the "LostInterrupt" failures we've intermittently seen in the I2C driver are due to timer notification mismanagement. I recently merged a commit (dd76a4c) that reworked timer notification handling to eliminate a case where the driver could crash itself in a narrow race condition window. That also eliminated a way to crashed it from other task, by posting to its timer notification bit (bit 31) while it was working -- both cases were fixed by double checking that the timer had actually gone off before responding.
With that fixed, there's only one way remaining to get the LostInterrupt panic, which is: keep i2c_driver from running for 100ms during any I2C transaction. You can do this pretty easily by triggering a large coredump. As a result, the driver contains a system-load-induced panic that isn't detecting any real error condition, and acts as a load multiplier (by triggering crash response and retries).
From what I can tell from the commit history, the software timer (the thing that explodes after 100ms) was added while attempting to track down some other state machine misbehavior, which I believe has since been fixed. Certainly, none of our investigations into the LostInterrupt failures over the past year have yielded any concrete evidence of a hardware bug. They've mostly seemed random... which is what you'd expect if they were being triggered by unrelated system load spikes.
This commit removes the entire mechanism. The hardware SMBus 25ms timeout remains, but that's separate and ran in parallel to this software timer. The path leading to the LostInterrupt panic no longer exists in the code.
Fixes #2004.