-
Notifications
You must be signed in to change notification settings - Fork 6.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revert "samples: net: Fix sanitycheck for sam_e70_xplained board" #7831
Conversation
As promised in #6789 (comment) , I submit a patch to revert back unduly increase in network buffers just for the Ethernet L2. The commit message goes into details as to why. Here I can just post complete ApacheBench reports (the commit message has them abbreviated). With a0df4f6 (breaking after ~30s):
Without a0df4f6 (i.e. with this revert patch):
|
I'd like to add that the behavior is in full accordance with the definition of https://en.wikipedia.org/wiki/Bufferbloat - throwing more buffers on the fire seems like solve one issue, but actually pops up different. |
Codecov Report
@@ Coverage Diff @@
## master #7831 +/- ##
==========================================
+ Coverage 52.12% 52.31% +0.18%
==========================================
Files 212 212
Lines 25939 25937 -2
Branches 5590 5589 -1
==========================================
+ Hits 13521 13569 +48
+ Misses 10177 10122 -55
- Partials 2241 2246 +5
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of reverting this fully, I am hoping you provide then a patch that fixes the issue with sam_e70_xplained board which ran out of memory. I am ok if we revert this but then a solution for this atmel board needs to be provided.
Unfortunately, I don't have sam_e70_xplained, so can prepare only a mechanical patch which you'd need to test still. In that case, my choice would be to put the overrides into boards/arm/sam_e70_xplained/ (perhaps, sam_e70_xplained_defconfig), with a clear note that there's no logical explanation why these overrides are needed for this board. But the main point - this is posted for review and confirmation of the issue. Did you have a chance to try the tests which are described in the commit message on frdm_k64f? Because so far only I report that there's a problem, and unless reproduced independently it doesn't cost much. (And results of reproducing may vary, as e.g. #7818 show.) Thanks. |
Also please do that without #7849 , as it may change the behavior. (Not fix the issue, just change the behavior, e.g. make the issue masked for the specific test scenario.) |
I was not able to test with frdm_k64f as the dump_http_server application kept hanging when printing something. Dunno what was wrong with it. Anyway, it worked a bit better with qemu_x86. I managed to get memory allocation report which says this:
It is a bit difficult to say whether this will affect the outcome but we need to fix the leak first. After running apache bench for a while, the application crashed
The net-shell was still responsive after the crash. |
It might be that there is no "leak" here as the net_pkt might be just waiting for ACK that we did not receive yet. Thus this needs more testing. |
This is really weird. I mean, the kind of the outcome you get. The difference in our outcomes per se doesn't surprise me however, I myself may get differences testing few days ago and today (most are due to "false positives" though). Or you can compare investigation in #7818, where for @rlubos something 100% doesn't work, which has always worked 100% (or 99%) for me. We'd need to sync on our testing and get to the bottom of that. For that, we'd need to be as exact as possible to allow for repeatability. Can you please confirm the exact git rev you tested, so I can retest it, and we can contrast the results for the same rev? Any further details on the test process are helpful too to pinpoint any differences in it. I try to describe mine in https://github.com/pfalcon/zephyr/wiki/NetworkingTestPlan |
Well, qemu_x86 has been working OK for me for quite some time lately (IIRC), which I tribute to slowness of qemu and SLIP (so, contingency points aren't hit, as with fast Ethernet).
So, dumb_http_server sample doesn't enable net_shell, which means you already tested something different than I ;-). (And yes, any change in config may have a butterfly effect. I many times tried to debug these issues with logging - the problem behavior goes away or radically changes.)
Didn't see anything like that for quite some time. For reference, I just ran "ab -n1000 http://192.0.2.1:8080/" against qemu_x86 with master c4b0f1c. No issues:
|
@rlubos : Wanna join fun with testing dumb_http_server too? ;-) |
@pfalcon Heh, thanks for the invitation :) Well I don't have sam_e70_xplained nor frdm_k64f so all I can do is to run it on qemu. So I ran it on qemu_x86, and I was just going to report that it works fine for me, when the app froze when I was writing this comment. It just paused after handling ~5K connections, no crash or whatsoever. I was hoping to reproduce the issue to get some more detail, unfortunately wasn't able on qemu_x86... But, then I ran it on qemu_cortex_m3 and it's fairly easy to reproduce, at least on my desk, the application freezes just after few connections are established. Yet it's still not a crash. Not sure why is it happening, perhaps I can mess a little bit with debugging. I ran vanilla version of |
@rlubos :
Great, thanks! And well, I'm talking about 1000 requests ( But anyway, I recommend standardizing on 1000 requests. The number chosen because it's, well, at least reasonable if not venerable, and because test with it completes quickly (under 1min), so can be run regularly (as a smoke test again). When all IP stack developers will be able to see those 1K requests served, we have a good baseline to resolve "more rare" issues. As @jukkar's case shows, we aren't even there :-(.
Thanks for the info, will try to play with that too. |
Friday testing was with c4b0f1c + your "Revert "samples: net: Fix sanitycheck for sam_e70_xplained board" applied. I also enabled net-shell + some other relevant / not so relevant options
|
Thanks. For reference, testing this revision (c4b0f1c) in various ways on my side. Using frdm_k64f connected via Ethernet directly to laptop (e1000e driver on the Linux side just in case).
Slowdown on req. 33. ab failed after 44 requests with:
After this, net shell is NOT responsive. Specifically, I could press Enter 3-4 times and they were echoed (i.e. new lines were fed), but no "shell>" prompt was printed. After that, further presses of Enter weren't processed. Summary based on this: in my testing, the more there pkts/buffers, the worse performance gets. I'd even say that it smells that we have O(n) or maybe even worse (like O(n^2)) algo somewhere which leads to large delays with many packets. |
Ok, I can reproduce this, if "freezes" is defined as "the sample is stuck when processing a particular request; after some time, ApacheBench times out". But restarting |
Now fast forward to 4a693c3 which is HEAD as of now, testing without this PR. qemu_x86: Now, unlike e.g. with a149232, the processing gets stuck on some request, just as described for qemu_cortex_m3. In 3 runs, this happened on ~400th, ~200th, 993th request. qemu_cortex_m3: Similar to a149232, gets stuck on <100th request. |
Now 4a693c3 + revert from this PR: qemu_x86: Works without a hitch, run 5 times * 1000 requests, even without qemu restart. qemu_cortex_m3: Stuck on <100th request, as before. |
Debugging qemu_cortex_m3 being stuck with just net shell doesn't show obvious probs, e.g. there's no shortage of free pkts/bufs. Submitted few cosmetic improvements to net shell instead. |
Not trying to enable debug logging, as I know it'll skew results. Instead posted #8168. |
Ok, good (enough) understanding what happens there: #8187 , #8188 . |
1176af5
to
7ea430b
Compare
As of today's master 58e40cb, the situation described in this PR still holds for me: with pristine master, running ab -n1000 results in slowdown after ~50th request, while with this patch applied, ab -n1000 goes thru. |
Retested with today's master d003d0e, the situation is the same as in previous comment and before. |
7ea430b
to
5d90a7f
Compare
c186dda
to
91996bf
Compare
91996bf
to
923d9fd
Compare
This PR has become very convoluted and difficult to read. What we are trying to achieve here with this one as I do not see much point reverting the original patch and cause various errors in sanitychecker so I suggest we close this one. |
What exactly is difficult to read? The patch is simple (it's no longer a literal revert, had to be updated for later changes; the commit message can be fixed of course). Comments? But that's good there're comments, much worse when there're no comments (a typical situation with networking issue reports).
It's the same as before, as described in the rather detailed commit message (also description of this PR). Summary: a) fix regression introduce to at least frdm_k64f; b) generally, we should no solve issue but increasing number of buffers, we should have defaults which should work across all hardware we support. (Like the older ones, which this patch restores.)
The only reason why there're sanitycheck errors is because one single driver (sam_e70's etehrnet) doesn't fit well with the rest of Zephyr. We have #9015 on that, and until it's fixed properly, we can/should blacklist sam_e70. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reverting a commit (as the PR title says) from 3 months ago is not the right thing to do now, the original commit fixed some issues and I guess you are reverting because it introduced other issues? Can you submit a PR that fixes those new issues instead of the revert which is starting to be confusing?
No, the claim is that it worked around some issues, and while doing so, introduced more.
This is such PR, containing wealth of the information on the matter. The "revert" title will be changed. |
This reverts commit a0df4f6. This commit has the following description: """ Some of the sanitycheck tests were having too small limit for network buffers when compiling for sam_e70_xplained board. Increase the buffer limits when testing this for this board. """ But the actual code changes do not correspond to this description: instead of making changes just for the affected sam_e70_xplained, actually the defaults for all Ethernet boards are changed. This has negative impact on BOARD=frdm_k64f. Specifically, without a0df4f6, using samples/net/sockets/dumb_http_server and ApacheBench's "ab -n1000 http://192.0.2.1:8080/" (i.e. load-test the Zephyr IP stack with serving 1000 consecutive connections), everything works as expected (excerpts from the ab report): Time taken for tests: 8.171 seconds Complete requests: 1000 Percentage of the requests served within a certain time (ms) 50% 8 66% 8 75% 8 80% 8 90% 8 95% 8 98% 8 99% 9 100% 21 (longest request) However, with a0df4f6 in effect, running the same command serves 10-100 requests OK, but then the visible slowdown starts. After breaking ab after 30s (otherwise, it could take hour(s) to finish), the result is: Time taken for tests: 35.449 seconds Complete requests: 51 Percentage of the requests served within a certain time (ms) 50% 8 66% 1216 75% 1312 80% 1312 90% 1472 95% 1984 98% 2560 99% 3776 100% 3776 (longest request) Thus, revert a0df4f6, as generally an Ethernet board works OK with the settings as were before. Suggestions regarding sam_e70_xplained: 1. Try to anylize why its behavior is different from e.g. frdm_k64f. (Perhaps, the matter is not just one board vs another board, but one sample vs another. Different samples should be tested (including samples/net/sockets/), and only affected should have config changed.) 2. If truly needed, sam_e70_xplained-specific settings should go in its specific config(s).
923d9fd
to
838d8cc
Compare
stale PR, you can open an issue if this is still relevant or submit a new PR. The information here is not not going to be lost. |
The PR updated. The issue it fixes is as fresh as it was more than half-year ago. The current status is that there're slow negotiations with @mnkp in #9015 on how to fix it in addition to this patch. And I actually wait for weekly PR reviews promised by you, @nashif (or TSC), to argue that this patch should be merged. |
Interesting, I reopened this PR yesterday... Well, nevermind, resubmitted as #11530 |
This reverts commit a0df4f6.
This commit has the following description:
"""
Some of the sanitycheck tests were having too small limit for
network buffers when compiling for sam_e70_xplained board.
Increase the buffer limits when testing this for this board.
"""
But the actual code changes do not correspond to this description:
instead of making changes just for the affected sam_e70_xplained,
actually the defaults for all Ethernet boards are changed.
This has negative impact on BOARD=frdm_k64f. Specifically, without
a0df4f6, using samples/net/sockets/dumb_http_server and
ApacheBench's "ab -n1000 http://192.0.2.1:8080/" (i.e. load-test
the Zephyr IP stack with serving 1000 consecutive connections),
everything works as expected (excerpts from the ab report):
Time taken for tests: 8.171 seconds
Complete requests: 1000
Percentage of the requests served within a certain time (ms)
50% 8
66% 8
75% 8
80% 8
90% 8
95% 8
98% 8
99% 9
100% 21 (longest request)
However, with a0df4f6 in effect, running the same command serves
10-100 requests OK, but then the visible slowdown starts. After
breaking ab after 30s (otherwise, it could take hour(s) to finish),
the result is:
Time taken for tests: 35.449 seconds
Complete requests: 51
Percentage of the requests served within a certain time (ms)
50% 8
66% 1216
75% 1312
80% 1312
90% 1472
95% 1984
98% 2560
99% 3776
100% 3776 (longest request)
Thus, revert a0df4f6, as generally an Ethernet board works OK with
the settings as were before. Suggestions regarding sam_e70_xplained:
(Perhaps, the matter is not just one board vs another board, but
one sample vs another. Different samples should be tested (including
samples/net/sockets/), and only affected should have config changed.)
its specific config(s).