
frdm_k64f: Sometimes, there may be 1000+ms roundtrip for pings and other Ethernet packets (instead of sub-ms) #3132

Closed
zephyrbot opened this issue Feb 7, 2017 · 26 comments
Labels: area: Networking, bug, platform: NXP, priority: low

Comments

zephyrbot commented Feb 7, 2017

Reported by Paul Sokolovsky:

I never saw this issue with QEMU, so I theorize that it's an frdm_k64f Ethernet driver issue.

Sometimes, instead of the usual sub-millisecond pings:

64 bytes from 192.0.2.1: icmp_seq=1 ttl=64 time=0.788 ms
64 bytes from 192.0.2.1: icmp_seq=2 ttl=64 time=0.476 ms
64 bytes from 192.0.2.1: icmp_seq=3 ttl=64 time=0.295 ms
64 bytes from 192.0.2.1: icmp_seq=4 ttl=64 time=0.475 ms

you start to get:

64 bytes from 192.0.2.1: icmp_seq=88 ttl=64 time=1008 ms
64 bytes from 192.0.2.1: icmp_seq=89 ttl=64 time=1008 ms
64 bytes from 192.0.2.1: icmp_seq=90 ttl=64 time=493 ms
64 bytes from 192.0.2.1: icmp_seq=91 ttl=64 time=1007 ms

So far, I haven't been able to find an exact way to reproduce it, but it happens often: I see it in every testing session (5-6 board resets/Ethernet cable reconnects).

When this happens, all other packets seem to be delayed too, not just pings. For example, using netcat UDP against echo_server shows a similar ~1s delay.

(Imported from Jira ZEP-1678)

by Paul Sokolovsky:

As a bit more info: I saw this before https://gerrit.zephyrproject.org/r/10902 (dynamic PHY config on cable connect/disconnect), and I still see it with that change applied.

by Paul Sokolovsky:

A case walkthrough:

  1. Flashed freshly built echo_server.
  2. ping 192.0.2.1 - ok
  3. ping6 2001:db8::1 - ok
  4. nc -u6 2001:db8::1 4242 - ok
  5. nc -u 192.0.2.1 4242 - the problem begins. What it looks like: when I type in the 1st packet, no response is generated. Typing a 2nd line and sending it, I get the response to the previous packet, and so on. Sometimes, after a wait, the last expected response does arrive on its own, but subjectively after more than 1s, maybe 2-3s. (A rough sketch of this manual check follows below.)
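
For reference, a rough host-side sketch of that manual UDP check with the timing made explicit (address and port as used above; the nc flags may need adjusting for your netcat variant):

# Send one UDP line to echo_server and time how long the echoed reply
# takes to come back; in the bad state this jumps from <1 ms to 1 s+.
time sh -c 'printf "hello\n" | nc -u -w 2 192.0.2.1 4242'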

by Paul Sokolovsky:

OK, a response actually always arrives, but perhaps after 4-5s. Quickly typing the next one makes the previous response arrive immediately.

I never saw any of this with QEMU.

by Paul Sokolovsky:

OK, in this state TCP echo doesn't work at all, so it's definitely another manifestation of ZEP-1724.

by Paul Sokolovsky:

I've now been torturing it for a whole 3 minutes, and it still works.

by Paul Sokolovsky:

The same session has now undergone 1000 pings (standard pings), and all 6 comm channels ((ICMP + UDP + TCP) * (IPv4 + IPv6)) work well, without any delays. (I pay special attention to pings because I usually start testing with them, and in most cases the issue manifests with them soon.)
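
For concreteness, the six channels mentioned above correspond roughly to the following host-side commands (addresses and port as used elsewhere in this thread; the nc invocations are interactive line-by-line echo checks against echo_server):

ping -c 1000 192.0.2.1       # ICMP over IPv4
ping6 -c 1000 2001:db8::1    # ICMP over IPv6
nc -u 192.0.2.1 4242         # UDP over IPv4
nc -u6 2001:db8::1 4242      # UDP over IPv6
nc 192.0.2.1 4242            # TCP over IPv4
nc -6 2001:db8::1 4242       # TCP over IPv6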

This proves that this is not an inherent, systematic issue but a race condition. And we can't even be sure the race is in eth_mcux, because the Z IP stack surely still has its own share of races and weird behavior. Most likely, it's an eth_mcux race amplified by a Z stack race.

That was "good" news, bad news is that we can't reproduce it in at our will (though natural reoccurrence is quite high definitely).

@zephyrbot zephyrbot added priority: low Low impact/importance bug area: Networking bug The issue is a bug, or the PR is fixing a bug labels Sep 23, 2017
@pfalcon pfalcon changed the title frdm_k64f: Sometimes, there may be 1s delay for pings and other Ethernet packets frdm_k64f: Sometimes, there may be 1000+ms roundtrip for pings and other Ethernet packets (instead of sub-ms) May 18, 2018

pfalcon commented May 18, 2018

This has been occurring for me regularly all this time. There is no specific pattern for triggering it, except that it tends to occur with high probability after sufficient network load.

Another hypothesis, moving away from an frdm_k64f/eth issue, is that it triggers some bug in the IP stack. The reason it doesn't happen with QEMU would then be QEMU's low speed. With real Ethernet, everything is "real time", which makes it easier to trigger e.g. some concurrency issue, after which the IP stack is stuck in this weird state.

mike-scott commented:

Adding @therealprof, since he's been testing Ethernet on a Nucleo STM32 board. It might be interesting to compare notes to see whether the bug lies in the Z IP stack or in something specific to the MCUX shims.

therealprof commented:

I've seen that, too. The last time it happened to me was yesterday, when my DHCP server ran out of IP addresses (courtesy of long lease times and frequent reboots combined with randomised MAC addresses). For some reason the board received an IP address after a short while, but all network communication was unbearably slow. Completely reproducible until I shortened the lease time and manually freed all the random MAC addresses from the pool.

pfalcon commented May 24, 2018

I've seen that, too. The last time it happened to me was yesterday, when my DHCP server

Thanks for the confirmation. Just in case: I almost never use DHCP in my testing, as it only adds an extra source of disturbance. Overall, this looks like a generic issue whose root cause is well described by the title of #7571.

pfalcon commented May 24, 2018

Here is the way to reproduce this 100% in my setup, based on the approach of #7831 (a consolidated host-side sketch of these steps follows the ping output below):

  1. Run dumb_http_server on frdm_k64f without #7831 (Revert "samples: net: Fix sanitycheck for sam_e70_xplained board") applied. Just tested with master at 5b20350.
  2. Run "ab -n1000 http://192.0.2.1:8080/".
  3. When the slowdown from running ab starts (watch the frdm_k64f serial console; ~1 req/s indicates the slowdown), break ab.
  4. ping 192.0.2.1 will have 1s+ responses:
$ ping 192.0.2.1
PING 192.0.2.1 (192.0.2.1) 56(84) bytes of data.
64 bytes from 192.0.2.1: icmp_seq=1 ttl=64 time=1026 ms
64 bytes from 192.0.2.1: icmp_seq=2 ttl=64 time=1024 ms
64 bytes from 192.0.2.1: icmp_seq=3 ttl=64 time=1024 ms
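
For convenience, a condensed sketch of steps 2-4, to be run interactively on the Linux host (ab comes from the Apache HTTP server utilities; the sleep duration is only a guess for when the slowdown appears, so watching the board's serial console is more reliable):

ab -n1000 http://192.0.2.1:8080/ &   # step 2: put HTTP load on the board
sleep 30                             # step 3: wait until ~1 req/s shows on the console
kill %1                              # break ab
ping -c 5 192.0.2.1                  # step 4: round trips now exceed 1000 ms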

nashif commented Jul 25, 2018

This is probably related to logging/debugging being enabled; we have seen this behaviour, and it was gone when debug was disabled. The new logging system should resolve this. See #8816 (comment).

MaureenHelm commented:

@pfalcon Is this still an issue?

pfalcon commented Aug 8, 2018

@pfalcon Is this still an issue?

Absolutely, since nobody has explained why it happens or provided a fix for it ;-). (I'll retest once I've finished with my post-vacation backlog.)

pfalcon commented Aug 8, 2018

@nashif

This is probably related to logging/debugging being enabled

That would be too simple, wouldn't it? ;-) No, it's not related to logging being enabled, at least not in the way you suggest. All testing and reports happen against pristine mainline, which has logging enabled only for errors. (And if logging for errors being merely enabled, not used (no errors printed), still affects the network stack, then we have quite a big problem: two unrelated subsystems affecting each other in a way nobody can explain. If that can happen at all, it can happen with the new logging subsystem too - more so, because it's more complex than the old one.)

pfalcon commented Aug 9, 2018

Test process as described in #3132 (comment).

Master at d003d0e. The issue is fully reproducible (this time it happened around the ~50th ApacheBench request).

pfalcon commented Aug 9, 2018

And with #7831 applied, that particular test doesn't reproduce the issue.

galak commented Dec 6, 2018

Is this still relevant?

pfalcon commented Dec 6, 2018

Yes, this is one of the long-standing and tedious issues that get little attention from the network maintainers, and it is the background issue for #7831 (expired) and https://github.com//pull/11530 (resubmitted) - which, however, work around the issue rather than solve it.

pfalcon commented Dec 6, 2018

cc @jukkar

ghost commented Dec 11, 2018

I can reproduce this issue on the frdm_k64f board by sending UDP packets to the echo_server example. It takes at least a few minutes before all network requests start getting high-latency (up to a second) responses.

It makes no difference if all logging features are disabled or enabled.

But when limiting CONFIG_ETH_MCUX_RX_BUFFERS to 1 (the default value is 2), it seems to work.

(Typical latency was < 1 ms across more than 100000 packets over 14 hours; only a few responses took up to 4 ms.)

Limiting the buffers is not a solution, but maybe it helps to find the problem.
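
For anyone wanting to try the same workaround, a minimal sketch of the application config change (in the sample's prj.conf; CONFIG_ETH_MCUX_RX_BUFFERS is the driver option named above):

# Limit the MCUX Ethernet driver to a single RX buffer (the reported
# workaround; the default at the time of this report was 2).
CONFIG_ETH_MCUX_RX_BUFFERS=1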

pfalcon commented Dec 11, 2018

@i8080: Thanks for providing useful information. My test cases usually involve TCP, where, with multiple HTTP requests, the issue manifests itself within seconds or dozens of seconds.

@jukkar: Let me kindly remind you of what I said at last week's meeting: this is waiting for your response on reproducing the issue (again, very detailed reproduction info is available in #7831).

jukkar commented Dec 11, 2018

@pfalcon This is certainly in my backlog

jukkar commented Dec 11, 2018

I did a quick test to make sure that we do not leak any buffers in the upper IP stack; no leaks were found.
It is quite easy to trigger this:

  • flash echo-server to frdm_k64f
  • on the Linux host, use net-tools/echo-client like this: ./echo-client 2001:db8::1 -e -F. This will flood the link and because of this one can see a lot of errors like eth_mcux.eth_rx: Failed to get fragment buf. The system will not recover from this, so there is no connectivity after this flooding (a host-side sketch follows below).
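
A sketch of that host-side flood test, assuming the Zephyr net-tools repository is checked out and built per its README (the echo-client flags are exactly those given above):

cd net-tools
./echo-client 2001:db8::1 -e -F   # flood the echo-server running on the frdm_k64f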

As suggested by @i8080, setting CONFIG_ETH_MCUX_RX_BUFFERS to 1 "fixes" the issue, and connectivity is restored after the flooding has happened. I suggest that the driver's default be set to 1 instead of 2 until the culprit is found.

Nothing in my tests indicates an issue in the upper layers of the IP stack; all symptoms point to an issue in the frdm_k64f eth driver or HAL.

pfalcon commented Dec 12, 2018

@jukkar: Thanks for quick testing!

This will flood the link and because of this one can see a lot of errors like eth_mcux.eth_rx: Failed to get fragment buf.

So, I assume this is "err"-level logging, i.e. enabled by default. In my usual testing (which involves TCP/HTTP, 1000 requests of that, non-flooding), I sometimes see errors like that, but rarely, and they don't cause any visible effect.

The system will not recover from this so there is no connectivity after this flooding.

Again, in my testing, connectivity persists, but as described here and in #7831, I get 1000 ms round trips for packets (then 2000 ms, then 3000 ms). No obvious logging happens, hence developments like #8769.

Nothing in my tests indicates an issue in the upper layers of the IP stack; all symptoms point to an issue in the frdm_k64f eth driver or HAL.

OK, thanks for these results. Let me clear my backlog first before reproducing them and getting back to concentrating on this type of issue overall.

pfalcon commented Jan 18, 2019

As suggested by @i8080, setting CONFIG_ETH_MCUX_RX_BUFFERS to 1 "fixes" the issue

Thanks much for this info and testing. And sorry, this got backlogged behind net_buf refactors, work on QEMU Ethernet drivers, and the holidays.

#12560 reminded me that I had this in my queue for testing, and indeed, with CONFIG_ETH_MCUX_RX_BUFFERS=1, I get the dumb_http_server sample to serve 1000 requests without a hitch (even despite #12560). With 10K requests it still eventually runs into the ground: it stops responding and instead throws errors:

Connection #8908 from 192.0.2.2
Connection from 192.0.2.2 closed
[00:00:31.189,273] <err> net_arp: Gateway not set for iface 0x20007dc0
[00:00:31.599,136] <err> net_arp: Gateway not set for iface 0x20007dc0

But that's about the same as what I've seen with the eth_smsc911x driver in #11680 (comment).

So, hopefully we will now finally be able to separate HW issues from stack issues and pursue the latter. I'll prepare a patch to make CONFIG_ETH_MCUX_RX_BUFFERS=1 the default.

pfalcon added a commit to pfalcon/zephyr that referenced this issue Jan 18, 2019
It was reported, and confirmed by multiple parties, that the default
CONFIG_ETH_MCUX_RX_BUFFERS=2 under some packet load leads to
a 1s and increasing packet processing delay and eventual deadlock.
There were no reports about CONFIG_ETH_MCUX_TX_BUFFERS=2, but to be on
the safe side, set that to the minimal value as the current default too,
to give us a good conservative base for testing various networking
stack issues.

Fixes: zephyrproject-rtos#3132

Signed-off-by: Paul Sokolovsky <paul.sokolovsky@linaro.org>
jukkar pushed a commit that referenced this issue Jan 21, 2019
(Same commit message as above; Fixes: #3132; Signed-off-by: Paul Sokolovsky <paul.sokolovsky@linaro.org>.)