-
-
Notifications
You must be signed in to change notification settings - Fork 64
Remove very bad loop in Serial transmit driver that causes the CPU #42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
to spend most of its time looping in the interrupt driver, entirely locking out other operations and also causing incoming data on other Serial ports to be lost.
@egilkv that function only gets called on |
Are you kidding me? That function is called by the transmit data register interrupt!
If you try my test program (rename txt to ino) you will see how bad it is. |
There are also some design issues in this code. Since it is also an interrupt handler, it should be optimized for that, and whatever extra code is needed for the flush() should be outside. But we have some other bugs to cover before we get to that... |
It gets called in the data register empty interrupt, so the next byte in the buffer can be written.
I've been testing your patch, which is faster of course, but ALL the bytes being sent are wrong. Of course you won't see it if you just sent 0x55 all the time. |
I was going to argue that ISR is only called when the USART_DREIF bit is set (which should remain set as long as the data register remains empty), but then I noticed that this loop is after writing a byte to the buffer (AFAIU meaning that the CPU is indeed pretty much blocked inside the serial ISR all the time (since it writes a byte to DR, then waits for DR to become empty again and only then returns from the ISR which will trigger immediately again).
Agreed. The original design in the regular AVR core was like that, but it seems the megaavr version has seen some changes intended to fix serial printing from inside ISRs. On regular AVR, this was already handled by checking the global interrupt mask, but it seems on megaavr, interrupts have different levels and thus also different ways of being disabled. These changes were introduced in commit eba45bc, which might need a closer look. Just removing the loop as this PR does will probably reintroduce that serial writing bug (or likely a different bug), but might still be a step forward. That commit also introduces interrupt level elevation for the ISR, but I'm not sure I understand why (AFAICS that means that the interrupt can be called by the system again, but the flush and write methods still block, so the level0 ISR still blocks, so the previous approach of calling the DRE ISR manually (which is still done when interrupts are globally enabled) should still work (and is a lot simpler). @krbekken, perhaps you have some input to this discussion? :-) |
@matthijskooijman thanks for referring to commit eba45bc. Yes, there are fixes needed for supporting serial writing in an interrupt handler, but they are elsewhere. I believe I have also a proper fix for that, but it is a bit further down my list of priorities I must admit. The "fix" in eba45bc just blows up everything, and makes simultaneous use of more than one serial channel utterly unreliable, especially if they are at different baudrates. The original code in avr is significantly cleaner, albeit not without some issues. I also wondered about the elevation thing, but I am not familiar with the mechanism, and have so far not reviewed it further. |
@facchinm you are saying this yourself: The data interrupt service routine must assume the DREIF is set, and assume it can write to the data register. I believe some of the problems here is that this routine is used also in flush() and write(), which confuses the matter. But do note that in both these places where the function is invoked from code, the status of DREIF is explicitly tested before the function is invoked. So it can assume DREIF is set., which it should. Problems you may have had with wrong data must be because my test code writes async data continuously without any pause at all, which can make it hard to get in sync. I do not know how your test setup is, but I would propse try adding a good delay at the end of the setup(). |
@matthijskooijman you said:
This is spot on, and explained much better than I can do. It simply eats all you CPU cycles and blocks everything else. As bad as it can get, really. Removing this line restores sanity :) |
Yes, but...
This is not true, it only checks DREIE. The check for DREIF used to be there (and still is in regular AVR), but was removed by the commit mentioned above. I suspect this means that (with the loop removed), the code (write when the buffer is full and flush) will keep writing values to the DR even when it is not empty, overwriting the previous bytes and corrupting the data. This is probaly what @facchinm is seeing. |
@facchinm if you have problems getting synced, I suggest you change the write loop to this:
This creates a "random" pattern that makes it easier for async data to sync. And yes, it works for me :) |
@matthijskooijman Good points, I will investigate this a bit more and come back. And no, it is not the problems @facchinm is seeing. Everything is sent through regular interrupts in my test. I believe the problems he had must be since he probably just changed my \x55 to some other repeating letter. Which in async communication can be "impossible" to sync in this case, so just a matter of the test setup. Sending a more random pattern as in my example above will avoid that. (There was a good reason I originally used the 55-pattern , and that has to do with measurement of baudrate:) ) |
@matthijskooijman You are right, of course. DREIF should absolutely be tested before invoking the irq "by hand", and as you say, that code is no longer there. The AVR code is fine, and really should serve as a model. As said, I am 100% sure this is not a problem for my test case. For any regular write, the byte is either written in polled fashion in the beginning (where DREIF is tested) or it is written by the interrupt. But if we have a regular write interrupted by an interrupt that invokes a write, I believe there are some problems with the code, also in AVR. But this is a bit special case, we have more obvious and urgent issues to solve first. My conclusions so far are:
Since the invocation from flush() and write() really is the same, it would be better to collect them in one place. Cleaner, and save a few bytes of code. I can make a proposal. |
I think for AVR we took the stance that writing from ISR was not really supported. It should never lock up, but if it very rarely threw away some bytes, that was acceptable. Agreed on your conclusions. |
and also test DREIF befre actually invoking the emulated interrupt.
@matthijskooijman while I understand the sentiment, having 40 years of experience in the field, there is only one way of dealing with code like this and keep ones sanity. and that is to implement things absolutely right. Trying to figure out what really happens in half arsed solutions only adds complexity and makes your head spin. In this code treatment of esp. critical regions is not perfect. I have an old fix for the AVR code that properly implements this, also supporting buffers > 256, which I have used myself for a long time but never came around submitting. Will revisit that, it should be easy to port. But that will have to be on another thread, the dust needs to settle on this one first :) |
Since this bug will cause the CPU to waste more than 99,9% of its capacity needlessly spinning in an interrupt handler while writing on serial, it surely must also cause other problems, and I would assume this is one case like that: Again, I am very surprised such a grave issue has gone undetected for so long, but maybe it is the case that the new Uno Wifi rev2 simply has not been used much in the field yet. I would guess that changes when the Nano Every is properly on the market. I have also reviewed eba45bc by @krbekken a bit more. There are a lot of changes for what it claims to fix, and I believe much of the code is very problematic. So I would actually claim the best thing would be to revert it all. The fix for the noise on Tx on startup is easy enough, and trivial (although the critical region has too much stuff in it). For the other things, I would actually recommend reverting eba45bc, implementing critical regions properly, and then revisit. And maybe also be clearer about what to support. Invoking serial write() in an interrupt routine is one thing, flush() or end() is madness, and really should not be supported IMHO (or does anyone know a good reason it should be supported?). |
Sorry I am a slow thinker, but continuing my review of eba45bc all the changes for elevated interrupt level hack are probably not right (at least there is no guarantee the interrupt level will be reinstated in all cases), but more importantly they are definitely not needed. If one wants to support writing from interrupt handlers of higher priority and the buffer is full, the simple solution is to turn off interrupts and poll the register empty bit - the CPU is spinning regardless. I will explain this in detail later Sorry for this digression, the most important thing is to get this immediate issue resolved first. |
This is not always true - implementing things "right" in this sense adds latency for (other) interrupts, so in some cases that might be "worse" rather than "better". But in general I agree that you should do atomicity right whenever possible :-)
I would want to support
I have not reviewed the commit thoroughly, but from what I've seen I also think this would be a good approach.
Then it would be good to re-apply that in a separate commit after reverting eba45bc.
The latter (not being needed) is also what I gathered from a quick scan of the code. The former (not undoing the elevation properly) is another compelling reason to revert this code. One more downside to the elevation I realized later is that (IIUC) elevating the serial interrupt might temporarily replace another elevated interrupt, which will not run while the serial buffer is full. That other interrupt is probably important (since it was elevated), so that sounds like an issue to me.
That is also my understanding. As for next steps, I think it would be best to go ahead with the revert and then apply more fixes, rather than applying some partial fixes now and messing up the history even more (making it harder to revert as well). I personally have no time to actually do the work myself, but I am interested and willing to provide ideas and feedback as I've done so far. Also, I'm not the one who will decide to merge, so maybe @facchinm could also comment on the proposed route. As for the code you included in this PR now, adding a
As for the fixes to be applies after reverting, I guess the most important ones are:
After that, we can have a closer look at further atomicity, but these seem separate and less urgent. |
Very constructive response, thanks. I will comment a few things
The overhead of implementing an ATOMIC_BLOCK is very often 2+1 instructions, which is really very little. I found implementing critical zones gets rid of some of the previous hacks and extra checks, so less code. Also, as long as the blocks are small, no longer than a typical interrupt handler, effect on latency is also small. Since this processor does not turn off the interrupt enable bit in interrupts, it actually makes a few things rather tricky, but I am trying to get my head around it.
I noticed there were some good things too, for instance how TXCIF is cleared.
I guess I disagree. Cleaning up properly will take some more time, but I think what we have here is a workable interim thing just to get rid of the fully catastrophic problem. (I must repeat my surprise this has not been found before)
That would be great! I am working right now on a proper, clean solution. It is still work in progress, it needs some time yet but there will be quite a few changes (branch SerialCleanup on my fork if you are interested in a peek). Will invite reviews when I believe it is ready. Should also be tested properly, so it will take some time.
Fair enough. The function is simply just collecting the common code in flush() and write() for invoking the interrupt routine "manually". It has helped me understand things better when |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please follow the formatting conventions of the file:
- 4 space indents, not true tabs.
- Break the opening brace of functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This approval is specific to the formatting changes I requested in my previous review, which have now been done. Thanks!
@per1234 thanks for this, now updated |
I've taken a look at eba45bc and completely agree that this for some reason was sent into the wild prematurely. Unfortunately, I don't recall the exact circumstances, but whatever tests and reviews that were done were definitively not thorough enough. With regards to the next steps, I agree most with @egilkv in terms of applying some fixes instead of reverting the whole commit. As suggested, I also think it would be a good idea to first:
then:
There is an additional reason to remove the interrupt priority elevation other than that it being incomplete and/or unnecessary: So the interrupt elevation definitively needs to go. With regards to similarity with the AVR code I would suggest to either add the _poll_tx_data/udr_empty()-function in both the AVR and megaAVR code or keep it "distributed" in both. Thanks for investigating this. Hopefully, it can be fixed and tested in way or another relatively soon. |
@krbekken thanks for taking time to review. I'm happy to also remove the interrupt elevation now, but in that case I want to disable interrupt in the _poll_tx_data_empty() also, as explained earilier. I then really would want to test this all properly. I can do this within a day or two, assuming real life does not come in the way. I must admit I do not entirely understand the sentiment against the poll function - in my world avoiding duplicating functionality is among the main design goals. Ideally, programmers editors should have copy/paste disabled ;) I'd be happy to make a proposal porting those changes that would be relevant back to AVR once the megaAVR issues are dealt with. |
Thanks everyone for tackling this. In fact I spent some time on it too and my first conclusion was plainly wrong, my apologies. The PR code looks fine right now. I'm also super positive in reworking the elevation stuff but I would keep the ability to call Serial.print() from inside an interrupt without hanging the board (in fact, I prefer the board to reset due to the errata than an hang) A good strategy could be merging the PR as it is now, releasing the core for broader testing and in the meantime trying to find a definitive solution. |
@facchinm I am absolutely also fine with your approach. The fix for the elevation I am proposing will definitely also work with write() in interrupts, but I really want to subject it to some more rigorous testing, as mentioned. (I did not quite manage to do so the previous two days) |
I have tested the changes in UART.cpp and UART.h with this code:
It seems it is working fine |
See arduino/ArduinoCore-megaavr#42 for details Thanks to @egilkv for providing a fix
Remove very bad loop in Serial transmit driver that causes the CPU
to spend most of its time looping in the interrupt driver, entirely
locking out other operations and also causing incoming data on other
Serial ports to be lost.