Skip to content

Remove very bad loop in Serial transmit driver that causes the CPU #42

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Aug 19, 2019

Conversation

egilkv
Copy link
Contributor

@egilkv egilkv commented Aug 2, 2019

Remove very bad loop in Serial transmit driver that causes the CPU
to spend most of its time looping in the interrupt driver, entirely
locking out other operations and also causing incoming data on other
Serial ports to be lost.

to spend most of its time looping in the interrupt driver, entirely
locking out other operations and also causing incoming data on other
Serial ports to be lost.
@facchinm
Copy link
Member

facchinm commented Aug 2, 2019

@egilkv that function only gets called on flush() (it means that you absolutely want all bytes to be written) or when the buffer is full (see https://github.com/arduino/ArduinoCore-megaavr/blob/master/cores/arduino/UART.cpp#L298).
Calling a normal write() or print() doesn't stop the CPU unless one of these conditions (which both looks very sane to avoid data loss).
Let me now if I missed something, but I wouldn't apply your patch

@egilkv
Copy link
Contributor Author

egilkv commented Aug 2, 2019

Are you kidding me? That function is called by the transmit data register interrupt!

ISR(HWSERIAL_DRE_VECTOR)
{
  Serial._tx_data_empty_irq();
}

If you try my test program (rename txt to ino) you will see how bad it is.

@egilkv
Copy link
Contributor Author

egilkv commented Aug 2, 2019

There are also some design issues in this code. Since it is also an interrupt handler, it should be optimized for that, and whatever extra code is needed for the flush() should be outside. But we have some other bugs to cover before we get to that...

@facchinm
Copy link
Member

facchinm commented Aug 2, 2019

It gets called in the data register empty interrupt, so the next byte in the buffer can be written.
Quoting the megaAVR datasheet (http://ww1.microchip.com/downloads/en/DeviceDoc/megaAVR0-series-Family-Data-Sheet-DS40002015B.pdf) section 22.3.2.3

TXDATA can only be written when the Data Register Empty Interrupt Flag (the DREIF bit in the
USARTn.STATUS register) is set, indicating that the register is empty and ready for new data.

I've been testing your patch, which is faster of course, but ALL the bytes being sent are wrong. Of course you won't see it if you just sent 0x55 all the time.

@matthijskooijman
Copy link
Collaborator

Are you kidding me? That function is called by the transmit data register interrupt!

I was going to argue that ISR is only called when the USART_DREIF bit is set (which should remain set as long as the data register remains empty), but then I noticed that this loop is after writing a byte to the buffer (AFAIU meaning that the CPU is indeed pretty much blocked inside the serial ISR all the time (since it writes a byte to DR, then waits for DR to become empty again and only then returns from the ISR which will trigger immediately again).

There are also some design issues in this code. Since it is also an interrupt handler, it should be optimized for that, and whatever extra code is needed for the flush() should be outside. But we have some other bugs to cover before we get to that...

Agreed. The original design in the regular AVR core was like that, but it seems the megaavr version has seen some changes intended to fix serial printing from inside ISRs. On regular AVR, this was already handled by checking the global interrupt mask, but it seems on megaavr, interrupts have different levels and thus also different ways of being disabled.

These changes were introduced in commit eba45bc, which might need a closer look. Just removing the loop as this PR does will probably reintroduce that serial writing bug (or likely a different bug), but might still be a step forward.

That commit also introduces interrupt level elevation for the ISR, but I'm not sure I understand why (AFAICS that means that the interrupt can be called by the system again, but the flush and write methods still block, so the level0 ISR still blocks, so the previous approach of calling the DRE ISR manually (which is still done when interrupts are globally enabled) should still work (and is a lot simpler).

@krbekken, perhaps you have some input to this discussion? :-)

@egilkv
Copy link
Contributor Author

egilkv commented Aug 2, 2019

@matthijskooijman thanks for referring to commit eba45bc. Yes, there are fixes needed for supporting serial writing in an interrupt handler, but they are elsewhere. I believe I have also a proper fix for that, but it is a bit further down my list of priorities I must admit. The "fix" in eba45bc just blows up everything, and makes simultaneous use of more than one serial channel utterly unreliable, especially if they are at different baudrates.

The original code in avr is significantly cleaner, albeit not without some issues.

I also wondered about the elevation thing, but I am not familiar with the mechanism, and have so far not reviewed it further.

@egilkv
Copy link
Contributor Author

egilkv commented Aug 2, 2019

@facchinm you are saying this yourself: The data interrupt service routine must assume the DREIF is set, and assume it can write to the data register.
Your quote only applies if you want to write without use of interrupts, aka polling. And finally, the while loop is AFTER the data is written, so none of this is applicable.

I believe some of the problems here is that this routine is used also in flush() and write(), which confuses the matter. But do note that in both these places where the function is invoked from code, the status of DREIF is explicitly tested before the function is invoked. So it can assume DREIF is set., which it should.

Problems you may have had with wrong data must be because my test code writes async data continuously without any pause at all, which can make it hard to get in sync. I do not know how your test setup is, but I would propse try adding a good delay at the end of the setup().

@egilkv
Copy link
Contributor Author

egilkv commented Aug 2, 2019

@matthijskooijman you said:

I noticed that this loop is after writing a byte to the buffer (AFAIU meaning that the CPU is indeed pretty much blocked inside the serial ISR all the time (since it writes a byte to DR, then waits for DR to become empty again and only then returns from the ISR which will trigger immediately again).

This is spot on, and explained much better than I can do. It simply eats all you CPU cycles and blocks everything else. As bad as it can get, really. Removing this line restores sanity :)

@matthijskooijman
Copy link
Collaborator

Removing this line restores sanity :)

Yes, but...

But do note that in both these places where the function is invoked from code, the status of DREIF is explicitly tested before the function is invoked.

This is not true, it only checks DREIE. The check for DREIF used to be there (and still is in regular AVR), but was removed by the commit mentioned above. I suspect this means that (with the loop removed), the code (write when the buffer is full and flush) will keep writing values to the DR even when it is not empty, overwriting the previous bytes and corrupting the data. This is probaly what @facchinm is seeing.

@egilkv
Copy link
Contributor Author

egilkv commented Aug 2, 2019

@facchinm if you have problems getting synced, I suggest you change the write loop to this:

if (Serial3.availableForWrite() > 37+1) {
Serial3.println("The quick fox jumps over a lazy dog");
txchars+=37;
}

This creates a "random" pattern that makes it easier for async data to sync. And yes, it works for me :)

@egilkv
Copy link
Contributor Author

egilkv commented Aug 2, 2019

@matthijskooijman Good points, I will investigate this a bit more and come back.

And no, it is not the problems @facchinm is seeing. Everything is sent through regular interrupts in my test. I believe the problems he had must be since he probably just changed my \x55 to some other repeating letter. Which in async communication can be "impossible" to sync in this case, so just a matter of the test setup. Sending a more random pattern as in my example above will avoid that.

(There was a good reason I originally used the 55-pattern , and that has to do with measurement of baudrate:) )

@egilkv
Copy link
Contributor Author

egilkv commented Aug 2, 2019

@matthijskooijman You are right, of course. DREIF should absolutely be tested before invoking the irq "by hand", and as you say, that code is no longer there. The AVR code is fine, and really should serve as a model. As said, I am 100% sure this is not a problem for my test case. For any regular write, the byte is either written in polled fashion in the beginning (where DREIF is tested) or it is written by the interrupt. But if we have a regular write interrupted by an interrupt that invokes a write, I believe there are some problems with the code, also in AVR. But this is a bit special case, we have more obvious and urgent issues to solve first.

My conclusions so far are:

  1. The while loop in the interrupt is very, very wrong, and must go. Definitely.
  2. The test of DREIF before invoking the interrupt from code must be re-instated.

Since the invocation from flush() and write() really is the same, it would be better to collect them in one place. Cleaner, and save a few bytes of code. I can make a proposal.

@matthijskooijman
Copy link
Collaborator

If we have a regular write interrupted by an interrupt that invokes a write, I believe there are some problems with the code, also in AVR.

I think for AVR we took the stance that writing from ISR was not really supported. It should never lock up, but if it very rarely threw away some bytes, that was acceptable.

Agreed on your conclusions.

and also test DREIF befre actually invoking the emulated interrupt.
@egilkv
Copy link
Contributor Author

egilkv commented Aug 3, 2019

I think for AVR we took the stance that writing from ISR was not really supported. It should never lock up, but if it very rarely threw away some bytes, that was acceptable.

@matthijskooijman while I understand the sentiment, having 40 years of experience in the field, there is only one way of dealing with code like this and keep ones sanity. and that is to implement things absolutely right. Trying to figure out what really happens in half arsed solutions only adds complexity and makes your head spin. In this code treatment of esp. critical regions is not perfect. I have an old fix for the AVR code that properly implements this, also supporting buffers > 256, which I have used myself for a long time but never came around submitting. Will revisit that, it should be easy to port. But that will have to be on another thread, the dust needs to settle on this one first :)

@egilkv
Copy link
Contributor Author

egilkv commented Aug 3, 2019

Since this bug will cause the CPU to waste more than 99,9% of its capacity needlessly spinning in an interrupt handler while writing on serial, it surely must also cause other problems, and I would assume this is one case like that:
#34

Again, I am very surprised such a grave issue has gone undetected for so long, but maybe it is the case that the new Uno Wifi rev2 simply has not been used much in the field yet. I would guess that changes when the Nano Every is properly on the market.

I have also reviewed eba45bc by @krbekken a bit more. There are a lot of changes for what it claims to fix, and I believe much of the code is very problematic. So I would actually claim the best thing would be to revert it all. The fix for the noise on Tx on startup is easy enough, and trivial (although the critical region has too much stuff in it). For the other things, I would actually recommend reverting eba45bc, implementing critical regions properly, and then revisit. And maybe also be clearer about what to support. Invoking serial write() in an interrupt routine is one thing, flush() or end() is madness, and really should not be supported IMHO (or does anyone know a good reason it should be supported?).

@egilkv
Copy link
Contributor Author

egilkv commented Aug 4, 2019

Sorry I am a slow thinker, but continuing my review of eba45bc all the changes for elevated interrupt level hack are probably not right (at least there is no guarantee the interrupt level will be reinstated in all cases), but more importantly they are definitely not needed. If one wants to support writing from interrupt handlers of higher priority and the buffer is full, the simple solution is to turn off interrupts and poll the register empty bit - the CPU is spinning regardless. I will explain this in detail later Sorry for this digression, the most important thing is to get this immediate issue resolved first.

@matthijskooijman
Copy link
Collaborator

there is only one way of dealing with code like this and keep ones sanity. and that is to implement things absolutely right.

This is not always true - implementing things "right" in this sense adds latency for (other) interrupts, so in some cases that might be "worse" rather than "better". But in general I agree that you should do atomicity right whenever possible :-)

Invoking serial write() in an interrupt routine is one thing, flush() or end() is madness, and really should not be supported IMHO (or does anyone know a good reason it should be supported?).

I would want to support flush(), since sometimes you are debugging a lockup and you need flush() to know for sure that some code has executed (e.g. if you get into a situation where interrupts are disabled indefinitely, there might be leftover data in the buffer that you'll never see). Supporting end() in an ISR does not really seem needed (but if you support flush(), end() might not be too hard anyway?).

For the other things, I would actually recommend reverting eba45bc, implementing critical regions properly, and then revisit.

I have not reviewed the commit thoroughly, but from what I've seen I also think this would be a good approach.

The fix for the noise on Tx on startup is easy enough,

Then it would be good to re-apply that in a separate commit after reverting eba45bc.

all the changes for elevated interrupt level hack are probably not right (at least there is no guarantee the interrupt level will be reinstated in all cases), but more importantly they are definitely not needed.

The latter (not being needed) is also what I gathered from a quick scan of the code. The former (not undoing the elevation properly) is another compelling reason to revert this code. One more downside to the elevation I realized later is that (IIUC) elevating the serial interrupt might temporarily replace another elevated interrupt, which will not run while the serial buffer is full. That other interrupt is probably important (since it was elevated), so that sounds like an issue to me.

If one wants to support writing from interrupt handlers of higher priority and the buffer is full, the simple solution is to turn of interrupts and poll the register empty bit - the CPU is spinning regardless.

That is also my understanding.

As for next steps, I think it would be best to go ahead with the revert and then apply more fixes, rather than applying some partial fixes now and messing up the history even more (making it harder to revert as well). I personally have no time to actually do the work myself, but I am interested and willing to provide ideas and feedback as I've done so far. Also, I'm not the one who will decide to merge, so maybe @facchinm could also comment on the proposed route.

As for the code you included in this PR now, adding a _tx_data_empty_soft wrapper function, I'm not overly enthousiastic about this approach because:

  • This deviates from the regular AVR code structure (which could be fixed by also changing that).
  • I'm having a hard time to understand what this function actually does (i.e. what is its goal/contract), which is also reflected in a name that does not really tell me anything.

As for the fixes to be applies after reverting, I guess the most important ones are:

  • Applying this TX startup noise problem (which I haven't looked at at all yet).
  • Applying the "soft tx interrupt polling" when running inside ISRs as well as when the global interrupt flag is disabled (which, AFAIU, is distinct on megaavr unlike normal avr).

After that, we can have a closer look at further atomicity, but these seem separate and less urgent.

@egilkv
Copy link
Contributor Author

egilkv commented Aug 5, 2019

Very constructive response, thanks. I will comment a few things

implementing things "right" in this sense adds latency for (other) interrupts, so in some cases that might be "worse" rather than "better".

The overhead of implementing an ATOMIC_BLOCK is very often 2+1 instructions, which is really very little. I found implementing critical zones gets rid of some of the previous hacks and extra checks, so less code. Also, as long as the blocks are small, no longer than a typical interrupt handler, effect on latency is also small.

Since this processor does not turn off the interrupt enable bit in interrupts, it actually makes a few things rather tricky, but I am trying to get my head around it.

I have not reviewed the commit thoroughly, but from what I've seen I also think this would be a good approach.

I noticed there were some good things too, for instance how TXCIF is cleared.

As for next steps, I think it would be best to go ahead with the revert and then apply more fixes, rather than applying some partial fixes now and messing up the history even more (making it harder to revert as well).

I guess I disagree.
What I would suggest right now is to just get rid of the devastating while loop and add testing of DRE, as in my pull request. I am sure many Wifi Rev2 and Every users right now are struggling with strange issues because of this thing. I found a couple just browsing through the bug list here on gitbub. This should be an easy decision, I hope?

Cleaning up properly will take some more time, but I think what we have here is a workable interim thing just to get rid of the fully catastrophic problem. (I must repeat my surprise this has not been found before)

I am interested and willing to provide ideas and feedback as I've done so far.

That would be great! I am working right now on a proper, clean solution. It is still work in progress, it needs some time yet but there will be quite a few changes (branch SerialCleanup on my fork if you are interested in a peek). Will invite reviews when I believe it is ready. Should also be tested properly, so it will take some time.

As for the code you included in this PR now, adding a _tx_data_empty_soft wrapper function, I'm not overly enthousiastic about this approach

Fair enough. The function is simply just collecting the common code in flush() and write() for invoking the interrupt routine "manually". It has helped me understand things better when
restructuring. The name is stupid. Could always copy-paste back into flush and write(), if you think that is better?

Copy link
Contributor

@per1234 per1234 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please follow the formatting conventions of the file:

  • 4 space indents, not true tabs.
  • Break the opening brace of functions.

Copy link
Contributor

@per1234 per1234 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approval is specific to the formatting changes I requested in my previous review, which have now been done. Thanks!

@egilkv
Copy link
Contributor Author

egilkv commented Aug 6, 2019

@per1234 thanks for this, now updated

@krbekken
Copy link
Contributor

krbekken commented Aug 6, 2019

I've taken a look at eba45bc and completely agree that this for some reason was sent into the wild prematurely. Unfortunately, I don't recall the exact circumstances, but whatever tests and reviews that were done were definitively not thorough enough.

With regards to the next steps, I agree most with @egilkv in terms of applying some fixes instead of reverting the whole commit. As suggested, I also think it would be a good idea to first:

  • Remove the infamous while loop
  • Reinstate the DREIF-check before calling _tx_data_empty_irq() from write() and flush()
    But also:
  • Remove mechanisms regarding interrupt priority elevation.

then:

  • Look at implementing atomicity similar to the AVR code
  • After at is in place, investigate the support for calling write() and flush() from ISRs and propose changes if needed

There is an additional reason to remove the interrupt priority elevation other than that it being incomplete and/or unnecessary:
We have recently discovered a bug related to the level 1 interrupt execution on the mega4809 which is described in the latest Errata-document, where the recommendation is to not use the level 1 interrupt (page 8):
http://ww1.microchip.com/downloads/en/DeviceDoc/megaAVR-0-series-Errata-and-Clarification-80000777C.pdf

So the interrupt elevation definitively needs to go.

With regards to similarity with the AVR code I would suggest to either add the _poll_tx_data/udr_empty()-function in both the AVR and megaAVR code or keep it "distributed" in both.

Thanks for investigating this. Hopefully, it can be fixed and tested in way or another relatively soon.

@egilkv
Copy link
Contributor Author

egilkv commented Aug 7, 2019

@krbekken thanks for taking time to review. I'm happy to also remove the interrupt elevation now, but in that case I want to disable interrupt in the _poll_tx_data_empty() also, as explained earilier. I then really would want to test this all properly. I can do this within a day or two, assuming real life does not come in the way.

I must admit I do not entirely understand the sentiment against the poll function - in my world avoiding duplicating functionality is among the main design goals. Ideally, programmers editors should have copy/paste disabled ;)

I'd be happy to make a proposal porting those changes that would be relevant back to AVR once the megaAVR issues are dealt with.

@facchinm facchinm added enhancement New feature or request and removed waiting for feedback labels Aug 8, 2019
@facchinm
Copy link
Member

facchinm commented Aug 8, 2019

Thanks everyone for tackling this. In fact I spent some time on it too and my first conclusion was plainly wrong, my apologies.

The PR code looks fine right now.

I'm also super positive in reworking the elevation stuff but I would keep the ability to call Serial.print() from inside an interrupt without hanging the board (in fact, I prefer the board to reset due to the errata than an hang)

A good strategy could be merging the PR as it is now, releasing the core for broader testing and in the meantime trying to find a definitive solution.

@egilkv
Copy link
Contributor Author

egilkv commented Aug 10, 2019

@facchinm I am absolutely also fine with your approach.

The fix for the elevation I am proposing will definitely also work with write() in interrupts, but I really want to subject it to some more rigorous testing, as mentioned. (I did not quite manage to do so the previous two days)

@GDJose
Copy link

GDJose commented Aug 19, 2019

I have tested the changes in UART.cpp and UART.h with this code:

/* Sweep
 by BARRAGAN <http://barraganstudio.com>
 This example code is in the public domain.

 modified 8 Nov 2013
 by Scott Fitzgerald
 http://www.arduino.cc/en/Tutorial/Sweep
*/

#include <Servo.h>

Servo myservo;  // create servo object to control a servo
// twelve servo objects can be created on most boards

int pos = 0;    // variable to store the servo position

void setup() {
  myservo.attach(9);  // attaches the servo on pin 9 to the servo object
  Serial.begin(9600);
}

void loop() {
  for (pos = 0; pos <= 180; pos += 1) { // goes from 0 degrees to 180 degrees
    // in steps of 1 degree
    myservo.write(pos);              // tell servo to go to position in variable 'pos'
    delay(100);                       // waits 15ms for the servo to reach the position
    Serial.println(pos);
  }
  for (pos = 180; pos >= 0; pos -= 1) { // goes from 180 degrees to 0 degrees
    myservo.write(pos);              // tell servo to go to position in variable 'pos'
    delay(100);                       // waits 15ms for the servo to reach the position
    Serial.println(pos);
  }
}

It seems it is working fine

@facchinm facchinm merged commit bfeae4c into arduino:master Aug 19, 2019
@MCUdude
Copy link
Contributor

MCUdude commented Aug 19, 2019

@egilkv would you like to create a PR with this for MegaCoreX as well?

MCUdude added a commit to MCUdude/MegaCoreX that referenced this pull request Sep 10, 2019
See arduino/ArduinoCore-megaavr#42 for details
Thanks to @egilkv for providing a fix
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants