Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

drivers, at86rf2xx: cleanup #7276

Merged
merged 2 commits into from
Nov 7, 2017

Conversation

smlng
Copy link
Member

@smlng smlng commented Jun 29, 2017

[originally - but now not anymore -] much simpler fix than #7275 , while still using the redefined at86rf2xx_set_state(...) function of #7115.

[edit] plus avoiding the redundant do {} while ...

@smlng smlng requested review from aabadie and miri64 June 29, 2017 12:36
@smlng smlng added Area: drivers Area: Device drivers Type: enhancement The issue suggests enhanceable parts / The PR enhances parts of the codebase / documentation CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR labels Jun 29, 2017
@kaspar030
Copy link
Contributor

much simpler fix than #7275 , while still using the redefined at86rf2xx_set_state(...) function of #7115.

please state whats the problem and how it is fixed by this PR

@smlng
Copy link
Member Author

smlng commented Jun 29, 2017

the problem is prematurely merged #7275 (IMHO)

@smlng
Copy link
Member Author

smlng commented Jun 29, 2017

add simplification of at86rf2xx_reset_state_machine as suggested by @thomaseichinger in comment

@smlng smlng changed the title drivers,at86rf2xx: fix regression in tx prepare drivers, at86rf2xx: cleanup Jun 29, 2017
@smlng smlng added the Type: cleanup The issue proposes a clean-up / The PR cleans-up parts of the codebase / documentation label Jun 29, 2017
@miri64
Copy link
Member

miri64 commented Jun 29, 2017

the problem is prematurely merged #7275 (IMHO)

I think a bugfix has the priority of fixing the bug ASAP, and #7275 did that. We waste way to much time discussing style in simple bugfixes. So I wouldn't say #7275 was merged prematurely.

@smlng
Copy link
Member Author

smlng commented Jun 29, 2017

then #7115 was

@miri64
Copy link
Member

miri64 commented Jun 29, 2017

then #7115 was

on that I agree :>

@kaspar030
Copy link
Contributor

We waste way to much time discussing style in simple bugfixes.

If the same bugfix can be done with 17 lines at the same time, that's not a style discussion anymore.
We all know that "quick bugixes" tend to have a very low priority for getting cleaned up.

So I wouldn't say #7275 was merged prematurely.

@smlng should've posted that he's onto another solution.

@aabadie
Copy link
Contributor

aabadie commented Jun 29, 2017

then #7115 was

sure, and I'm the one to blame here.

@miri64
Copy link
Member

miri64 commented Jun 29, 2017

If the same bugfix can be done with 17 lines at the same time, that's not a style discussion anymore.
We all know that "quick bugixes" tend to have a very low priority for getting cleaned up.

I wasn't even aware there was an alternative and my priority was to get master working again.

@smlng
Copy link
Member Author

smlng commented Jun 29, 2017

OK, sorry didn't want to be rude ...

I was just looking into #7275 and thought, mhm maybe there is a simpler solution - but looking again it was already merged - no time for a comment left.

@kaspar030
Copy link
Contributor

I wasn't even aware there was an alternative and my priority was to get master working again.

Sorry, that was indeed important.

@aabadie
Copy link
Contributor

aabadie commented Jun 29, 2017

I wasn't even aware there was an alternative and my priority was to get master working again.

That's why the on hardware testrunner is so important :)

@miri64
Copy link
Member

miri64 commented Jun 29, 2017

I'm still not sure how these two functions relate. This seems to chop of functionality, so please explain, what you did exactly and why?

@smlng
Copy link
Member Author

smlng commented Jun 29, 2017

in #7115 @DipSwitch changed the signature of at86rf2xx_set_state to return the previous state (instead of void), though the actual idea of #7115 was to fix a race condition. However, the changed signature made is possible to simplify at86rf2xx_tx_prepare and at86rf2xx_reset_state_machine, the former was already done in #7115 - but introduced the regression which was hastily/suboptimally fixed in #7275.

This PR enhances the fix of #7275 and also simplifies at86rf2xx_reset_state_machine as was suggested by @thomaseichinger in #7115 - see also discussion there.

/* Wait for any state transitions to complete before forcing TRX_OFF */
do {
old_state = at86rf2xx_get_status(dev);
} while (old_state == AT86RF2XX_STATE_IN_PROGRESS);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@miri64 this check was added in #7115 to function at86rf2xx_set_state see here, so we can directly call that function and omit this now obsolete do ... whilehere

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright. Can you remove the newline before this removal, too.

} while (state == AT86RF2XX_STATE_BUSY_RX_AACK ||
state == AT86RF2XX_STATE_BUSY_TX_ARET);

uint8_t state = at86rf2xx_set_state(dev, AT86RF2XX_STATE_TX_ARET_ON);
if (state != AT86RF2XX_STATE_TX_ARET_ON) {
Copy link
Member Author

@smlng smlng Jun 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this check was originally deleted in #7115 and is the root cause of the regression, that #7275 fixed - but but by plain reversion without exploiting the new behaviour of at86rf2xx_set_state.

@miri64
Copy link
Member

miri64 commented Jun 29, 2017

This PR enhances the fix of #7275 and also simplifies at86rf2xx_reset_state_machine() as was suggested by @thomaseichinger in #7115 - see also discussion there.

So the change to at86rf2xx_reset_state_machine() has nothing to do with #7275?

@smlng
Copy link
Member Author

smlng commented Jun 29, 2017

that's why I changed the title of this PR to "cleanup" ...

Copy link
Member

@miri64 miri64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor cosmetic changes.

} while (state == AT86RF2XX_STATE_BUSY_RX_AACK ||
state == AT86RF2XX_STATE_BUSY_TX_ARET);

uint8_t state = at86rf2xx_set_state(dev, AT86RF2XX_STATE_TX_ARET_ON);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you leave the variable definition at the top of the function, please?

/* Wait for any state transitions to complete before forcing TRX_OFF */
do {
old_state = at86rf2xx_get_status(dev);
} while (old_state == AT86RF2XX_STATE_IN_PROGRESS);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright. Can you remove the newline before this removal, too.

@smlng smlng force-pushed the drivers/at86rf2xx/fix_regression branch from 651b50d to 0b940d8 Compare June 29, 2017 18:29
@smlng smlng added CI: needs squashing Commits in this PR need to be squashed; If set, CI systems will mark this PR as unmergable and removed CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR labels Jun 29, 2017
@smlng
Copy link
Member Author

smlng commented Nov 7, 2017

comments addressed.

@miri64 miri64 added CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR and removed CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR labels Nov 7, 2017
Copy link
Member

@miri64 miri64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully last batch of comments

@@ -18,7 +19,7 @@
* @author Thomas Eichinger <thomas.eichinger@fu-berlin.de>
* @author Joakim Nohlgård <joakim.nohlgard@eistec.se>
* @author Hauke Petersen <hauke.petersen@fu-berlin.de>
*
* @author Sebastian Meiling <s@mlng.net>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one as well. Feel free to remove

@@ -103,7 +102,7 @@ static int _send(netdev_t *netdev, const struct iovec *vector, unsigned count)
at86rf2xx_tx_prepare(dev);

/* load packet data into FIFO */
for (unsigned i = 0; i < count; i++, ptr++) {
for (unsigned i = 0; i < count; ++i, ++ptr) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has the same effect, but okay :P

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha it seems faster though ... brains are stupid :P

@@ -19,6 +19,7 @@
* @author Hauke Petersen <hauke.petersen@fu-berlin.de>
* @author Kaspar Schleiser <kaspar@schleiser.de>
* @author Kévin Roussel <Kevin.Roussel@inria.fr>
*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated change

* @{
*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated changes?

@smlng
Copy link
Member Author

smlng commented Nov 7, 2017

@kaspar030 Murdock complains about:

--- FAILED build outputs:
--- build output of app examples/javascript for board nucleo-l152 (raw, runtime=22.6s):
-- running on worker e52 thread 8, build number 4757.

[skip]

main.c: In function 'main':
main.c:72:12: error: 'main_js' undeclared (first use in this function)
     js_run(main_js, main_js_len);
            ^~~~~~~
main.c:72:12: note: each undeclared identifier is reported only once for each function it appears in
main.c:72:21: error: 'main_js_len' undeclared (first use in this function)
     js_run(main_js, main_js_len);

@miri64
Copy link
Member

miri64 commented Nov 7, 2017

Not anymore, see #7953.

@smlng
Copy link
Member Author

smlng commented Nov 7, 2017

I guess I need ☕️ - overlooking so many things here ... btw. last comments addressed

@miri64
Copy link
Member

miri64 commented Nov 7, 2017

Do it like we do. Drink more 🍺 instead of ☕ :P

Copy link
Member

@miri64 miri64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK

@miri64
Copy link
Member

miri64 commented Nov 7, 2017

Please squash

    - update documentation
    - adapt constness
    - fix code alignment and stlye
@smlng smlng force-pushed the drivers/at86rf2xx/fix_regression branch from 64a0f07 to 13639c1 Compare November 7, 2017 20:38
@smlng
Copy link
Member Author

smlng commented Nov 7, 2017

can anybody make some tests, too, before merging this - it seems to work, but would be great to have someone verify this, too

@miri64
Copy link
Member

miri64 commented Nov 7, 2017

Tested on samr21-xpro with:

  • send (both with and without 6LoWPAN/IPv6)
  • receive (both with and without 6LoWPAN/IPv6)
  • changing channels (and send/receive afterwards)
  • changing PAN ID (and send/receive afterwards)
  • changing TX power
  • changing some other random options

all worked as expected => go

@miri64 miri64 merged commit 7c4a2f9 into RIOT-OS:master Nov 7, 2017
@smlng smlng deleted the drivers/at86rf2xx/fix_regression branch November 14, 2017 17:43
miri64 added a commit to miri64/RIOT that referenced this pull request Nov 30, 2017
miri64 added a commit that referenced this pull request Nov 30, 2017
at86rf2xx: revert regression introduced in #7276
@miri64 miri64 mentioned this pull request Jan 30, 2018
@DipSwitch
Copy link
Member

DipSwitch commented Aug 11, 2018

Sorry for the some what late reply... @smlng

in #7115 @DipSwitch changed the signature of at86rf2xx_set_state to return the previous state (instead of void), though the actual idea of #7115 was to fix a race condition. However, the changed signature made is possible to simplify at86rf2xx_tx_prepare and at86rf2xx_reset_state_machine, the former was already done in #7115 - but introduced the regression which was hastily/suboptimally fixed in #7275.

I wanted to clarify that changing the signature was actually part of the fix. There was a race condition between the get_state call and the set state call. The action (state changing) by itself was not atomic. So the driver state was able to change between the get and set call.

But now I'm thinking about it, the windows closed from 20 instructions to 2 or so. Not sure and would need to reread the code completely.

The problem I tried to fix had something to do that multiple threads and interrupts wanting to change the driver state. (This bug is because of that depended on MCU speed, transmission speeds and response times) It can only be tested by... actually testing all possible implementations ever going to be made.

Because the checking logic was splited, if (state == old_state) return didn't account for the possibility that after that check old_state could actually change. By removing the special cases. And make the code run linear without any branches to different path's (for example if check above) there (now) maybe remaines a 2 instruction window exactly where the code falls out of the do { } while(old_state == ... line 458

By triggering the RX interrupt, it can enter the receive state. Where this is canceled because old_state will not be updated before changing the state. (This explains the possible packet loss you were experiencing)

This can probably only be fixed by a mutex around the the main state (sleep, idle, tx, RX) since interrupts cannot be disable around the while loop.. another solution might be: every interrupt queues a message WITOUTH any processing. And all the logic in the IRQ handler is moved to the thread loop of the driver...

Sorry for not finding the time to clarify this when it was requested. There where some other things needing more attention.

@DipSwitch
Copy link
Member

Async / multi-threaded problems are quite hard to explain ... Ever ... Haha

This part is back in the thread previously wanting to change the state.
"Where this is canceled because old_state will not be updated before changing the state."

@DipSwitch
Copy link
Member

DipSwitch commented Aug 11, 2018

It is possibly also the reason that it only affected very few people. My MCU was running at 36MHz and was using the AT86RT212B (sub-gig) when doing full speed/throughput tests where 3 nodes responded to the same multicast message send by the master. Where the master waited for all the responses within a given windows. I had options to change the general wait time and response times.

All 4 devices where connected through our AT command interface. (Self made, not the RIOT-OS based one) The devices where connected to a PC with an USB to serial converter to the same application. The application did the actual control of the master and the slave and checking if everything got from one end to another.
4 threads per node, shell / debug port and another UART interface as used for AT command handling.
So actually only the RIOT network stack was tested in this case (and the AT command interface as well of course, but this was just a weak wrapper around the API)
The test application monitors missed packets on either end (master and slave) and response times in between, transmit and receive times i'm not sure.

The more slave nodes that where added to the test, the faster one dead-locked, the second one less fast, etc. etc. etc. the failure rate looked quadratic.

After my fix I was able to send over maybe 20 million packets where there was first a possibility one of the slave nodes dead-locking between 0 and 4 million packets ...
However, enabling the debug log also fixed the problem. But that wasn't a actual solution ihmo ;)
Changing the master wait time after receiving all 3 answers fixed it (out top of my head) or maybe adding a random response time after receiving the multicast.
Our AT command interface also handled UART signaling and transparent mode, in transparent mode (less wait times) the crash seemed to happen faster than with signaling. (due to Windows not being able to generate events at all threads at the same time of the CTS line state change)

With the other information I've just read you all found. I think in my case the MCU always hanged in a wait lock before allowing the state change. However due to another node trying to run the exact same function (claiming air time) at the exact same time (one backing off by the CSMA/CA) some waired stuff was happening... "Best explanation ever +1"

@DipSwitch
Copy link
Member

When using the queue implementation it may probably only be queued once. Because I thing every status change is an interrupt. However reading it once actually handles all interrupts. (So any queued interrupts still in the queue are false and must be flushed). Or add a pointer to some counter in the data pointer (assuming adding the same message twice to the queue is prohibited by the queue)

Than there is also another possible dead-lock there. Because the read of the status takes time. If during the read the interrupt is triggered. Has it been read or is there a new one?

The only solution would be to read the status while you actually know there is no new status. If the read has finished and there is not a message queued your save to go...

@DipSwitch
Copy link
Member

DipSwitch commented Aug 11, 2018

I hope however I'm wrong here because that's is really not a good solution... :D

Say the EXTI IRQ priority is lower than the SPI priority.. the MCU runs slow. The SPI interrupt is never left during transfer. The IRQ is received after the complete transfer.
The other way around, EXTI IRQ priority high, SPI priority low. The EXTI is received during transfer but when does the latch from status to shift register ocure within the AT86RF?
If SPI runs blocking, is the same thing.

If something else has a higher priority. The data could already be received but the SPI IRQ is pending the EXTI is received assuming it is a high priority.. (here I think it's possible you loss the ability to know which one came first or last)

I think if you play long enough with these terms one will give a possible missed EXTI irq as well.. so the double read is required...

Even though one could technically already have been performed. Due to none real time interrupt receiving. This could be guarded by fixing EXTI as priority 0/1 system timer as priority 0 data irqs lower than 3 or whatever...

@DipSwitch
Copy link
Member

I would strongly suggest to first fix this bug before trying to fix bugs in other code using the driver. Possible problems with the TCP stack could just as easily lay here...

Since TCP for example would ACK right away. If you have a line (bus) topology where one node is 2 hops away and one 3 hops away, and both would receive a packet after each other first the 3th node than the 2nd this bug might dead-lock the 2nd node because it would try to handle 3 to 4 packets at the same time (during routing to the 3th node, receiving it's own packet, and sending it's own ACK and receiving the ACK of the 3th node).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: drivers Area: Device drivers CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR Type: cleanup The issue proposes a clean-up / The PR cleans-up parts of the codebase / documentation Type: enhancement The issue suggests enhanceable parts / The PR enhances parts of the codebase / documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants