frdm_k64f: Sometimes, there may be 1000+ms roundtrip for pings and other Ethernet packets (instead of sub-ms) #3132
Comments
by Paul Sokolovsky: As a bit more info, I saw this before https://gerrit.zephyrproject.org/r/10902 (dynamic PHY config on cable connect/disconnect), and saw it with it too.
by Paul Sokolovsky: A case walkthrough:
by Paul Sokolovsky: Ok, a response always arrives actually, but perhaps after 4-5 s. Quickly typing the next request makes the previous response arrive immediately. I never saw any of that with QEMU.
by Paul Sokolovsky: Ok, in such a state, TCP echo doesn't work at all, so it's definitely another manifestation of ZEP-1724.
by Paul Sokolovsky: Now I've been torturing it for a whole 3 minutes, and it still works.
by Paul Sokolovsky: The same session has now undergone 1000 pings (standard pings), and all 6 comm channels ((ICMP + UDP + TCP) * (IPv4 + IPv6)) work well, without any delays. (I pay special attention to pings, because I usually start testing with them, and in most cases the issue manifests soon with them.) This proves that this is not an inherent systematic issue, but a race condition. And we can't even be sure it's a race in eth_mcux, because the Z IP stack surely has its own share of races and weird things yet. Likely, it's an eth_mcux race attenuated by a Z stack race. That was the "good" news; the bad news is that we can't reproduce it at will (though natural recurrence is definitely quite high).
This has been occurring for me regularly all this time. There is no specific pattern for triggering it, except that it tends to occur with high probability after sufficient network load. Another hypothesis, moving away from a frdm_k64f/eth issue, is that it triggers some bug in the IP stack. The reason it doesn't happen with QEMU would be QEMU's low speed. With Ethernet, everything is "real time", which makes it easier to trigger e.g. some concurrency issue, after which the IP stack is stuck in this weird state.
Adding @therealprof, as he's been testing Ethernet on a Nucleo STM32 board. It might be interesting to compare notes to see whether the bug lies in the Z IP stack or in something specific to the MCUX shims.
I've seen that, too. The last time it happened to me was yesterday, when my DHCP server ran out of IP addresses (courtesy of long lease times and frequent reboots combined with randomised MAC addresses). For some reason it received an IP address after a short while, but all network communication was unbearably slow. Completely reproducible until I shortened the lease time and manually freed all random MAC addresses from the pool.
Thanks for the confirmation. Just in case: I almost never use DHCP in my testing, as that only adds an extra point of disturbance to the testing. Overall, this looks like a generic issue, the root cause of which is well described by the title of #7571.
The way to reproduce this 100% in my setup is based on the approach of #7831. So:
This is probably related to logging/debugging being enabled; we have seen this behaviour, and it was gone when debug was disabled. The new logging system should resolve this. See #8816 (comment).
@pfalcon Is this still an issue?
Absolutely, as nobody has explained why it happened or provided a fix for it ;-). (I'll retest once I've finished with my post-vacation backlog.)
That would be too simple, wouldn't it? ;-) No, it's not related to logging being enabled, at least not in the way you suggest. All testing and reports happen against the pristine mainline, which has logging enabled only for errors. (And if it's related to error logging being enabled, not used (no errors printed), yet still affecting the network stack, then we have quite a big problem. As in, there are two unrelated subsystems affecting each other in a way which nobody can explain. If that can happen at all, it can happen with the new logging subsystem too (more so, because it's more complex than the older system).)
Test process as described in #3132 (comment). Master d003d0e. The issue is fully reproducible (this time it happened on the ~50th ApacheBench request).
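For reference, a flooding run of this kind could look like the sketch below; the board address, port, and request count are placeholders for illustration, not values from the original report:

```sh
# Hammer the board's HTTP sample with sequential requests via ApacheBench.
# 192.0.2.1 is a placeholder board address; the port depends on the sample config.
ab -n 1000 -c 1 http://192.0.2.1:8080/
```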
Is this still relevant?
Yes, this is one of the long-standing and tedious issues which get little attention from the network maintainers, and the background issue for #7831 (expired) and https://github.com//pull/11530 (resubmitted) - which, however, work around the issue rather than solve it.
cc @jukkar
I can reproduce this issue on the frdm_k64f board by sending UDP packets to the echo_server example. It takes at least a few minutes to start getting high-latency (up to a second) responses for all network requests. It makes no difference whether logging features are disabled or enabled. But when limiting CONFIG_ETH_MCUX_RX_BUFFERS to 1 (the default value is 2), it seems to work. (Typical latency was < 1 ms across more than 100000 packets in 14 hours; only a few responses took up to 4 ms.) Limiting the buffers is not a solution, but maybe it helps to find the problem.
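For anyone who wants to try the same workaround, a minimal sketch of the override as a standard Kconfig fragment in the application's prj.conf (the option name comes from the comment above; applying it via prj.conf is just the usual mechanism):

```
# prj.conf fragment: limit the MCUX Ethernet driver to a single RX buffer
# (driver default is 2) to work around the latency buildup described here.
CONFIG_ETH_MCUX_RX_BUFFERS=1
```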
@i8080: Thanks for providing useful information; my test cases usually involve TCP, where, with multiple HTTP requests, the issue manifests itself within seconds or dozens of seconds. @jukkar: Let me kindly remind you of what I said at last week's meeting: this waits for your response on reproducing the issue (again, very detailed reproduction info is available in #7831).
@pfalcon This is certainly in my backlog.
I did a quick test in order to make sure that we do not leak any buffers in the upper IP stack; no leaks were found.
As suggested by @i8080, setting CONFIG_ETH_MCUX_RX_BUFFERS to 1 "fixes" the issue, and connectivity is restored after the flooding has happened. I suggest that the default be set to 1 instead of 2 for the driver until the culprit is found. Nothing in my tests indicates an issue in the upper layers of the IP stack; all symptoms point to an issue in the frdm_k64f eth driver or HAL.
@jukkar: Thanks for the quick testing!
So, I assume this is "err" logging, i.e. enabled by default. In my usual testing (which involves TCP/HTTP, 1000 requests of that, non-flooding), I sometimes see errors like that, but rarely, and they don't cause any visible effect.
Again, in my testing the connectivity persists, but, as described here and in #7831, I get 1000 ms round trips for packets (then 2000 ms, then 3000 ms). No obvious logging happens, hence developments like #8769.
Ok, thanks for these results. Let me clear my backlog first, before reproducing them and getting back to concentrating on this type of issue overall.
Thanks much for this info and testing. And sorry, this got backlogged by net_buf refactors, work on QEMU Ethernet drivers, and holidays. #12560 reminded me that I had this in the queue for testing, and indeed, with CONFIG_ETH_MCUX_RX_BUFFERS=1, I get the dumb_http_server sample to serve 1000 requests without a hitch (even despite #12560). With 10K requests, it still eventually runs into the ground: it stops responding and instead throws errors:
But that's about the same as I've seen with the eth_smsc911x driver in #11680 (comment). So, hopefully, now we will finally be able to separate HW issues from stack issues and pursue the latter. I'll prepare a patch to make CONFIG_ETH_MCUX_RX_BUFFERS=1 the default.
It was reported, and confirmed by multiple parties, that the default CONFIG_ETH_MCUX_RX_BUFFERS=2 under some packet load leads to 1 s (and increasing) packet processing delays and eventual deadlock. There were no reports about CONFIG_ETH_MCUX_TX_BUFFERS=2, but to be on the safe side, set that to the minimal value as the current default too, to give us a good conservative base for testing various networking stack issues. Fixes: #3132 Signed-off-by: Paul Sokolovsky <paul.sokolovsky@linaro.org>
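The driver-side change is presumably a pair of one-line default adjustments in the driver's Kconfig. A sketch of what the entries might look like; only the option names and the new default of 1 come from the commit message above, while the prompts and comments are assumptions:

```
# Sketch of the MCUX Ethernet driver Kconfig entries (prompts assumed).

config ETH_MCUX_RX_BUFFERS
	int "Number of RX buffers for the MCUX Ethernet driver"
	# Was 2; lowered to avoid the growing delays/deadlock from #3132.
	default 1

config ETH_MCUX_TX_BUFFERS
	int "Number of TX buffers for the MCUX Ethernet driver"
	# No reports against 2, but kept minimal as a conservative baseline.
	default 1
```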
Reported by Paul Sokolovsky:
I never saw this issue with QEMU, so I theorize that it's a frdm_k64f Ethernet driver issue.
Sometimes, instead of the usual sub-millisecond pings, you start to get:
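Purely as an illustration of the contrast (the address and timings below are made up; 192.0.2.1 is a documentation placeholder, not from the original report):

```
64 bytes from 192.0.2.1: icmp_seq=1 ttl=64 time=0.548 ms   <- normal
64 bytes from 192.0.2.1: icmp_seq=2 ttl=64 time=0.561 ms
64 bytes from 192.0.2.1: icmp_seq=3 ttl=64 time=1002 ms    <- degraded state
64 bytes from 192.0.2.1: icmp_seq=4 ttl=64 time=2005 ms
```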
So far I haven't been able to find an exact way to reproduce it, but it happens often. Say, I see it in every testing session (5-6 board resets/Ethernet cable reconnects).
When this happens, all other packets seem to be delayed too, not just pings. For example, using netcat UDP with echo_server shows a similar effect of ~1 s delays.
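For reference, a quick way to poke the UDP path of the echo_server sample could be a netcat one-liner along these lines; the board address is a placeholder, and 4242 is assumed to be the sample's listening port:

```sh
# Send one UDP datagram to echo_server and wait briefly for the echo.
echo "test" | nc -u -w2 192.0.2.1 4242
```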
(Imported from Jira ZEP-1678)