Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TCP memory leak fixes #23334

Merged
merged 8 commits into from
Mar 17, 2020
Merged

Conversation

jukkar
Copy link
Member

@jukkar jukkar commented Mar 8, 2020

Fixes #23246

@jukkar jukkar added area: Networking DNM This PR should not be merged (Do Not Merge) labels Mar 8, 2020
@jukkar jukkar mentioned this pull request Mar 8, 2020
@jukkar jukkar removed the DNM This PR should not be merged (Do Not Merge) label Mar 10, 2020
@jukkar jukkar changed the title [WIP] TCP memory leak fixes TCP memory leak fixes Mar 10, 2020
@zephyrbot zephyrbot added the area: API Changes to public APIs label Mar 10, 2020
@jukkar jukkar requested a review from rlubos March 11, 2020 13:02
Copy link
Contributor

@rlubos rlubos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TCP over 6lowpan works fine.

subsys/net/ip/tcp.c Outdated Show resolved Hide resolved
@jukkar jukkar force-pushed the tcp-mem-leak-checking branch 2 times, most recently from 5ce60fa to f25c198 Compare March 11, 2020 16:18
@jukkar
Copy link
Member Author

jukkar commented Mar 11, 2020

Found still some corner cases when the packet was already about to be sent in tx thread and it was removed from tcp code.

@jukkar
Copy link
Member Author

jukkar commented Mar 11, 2020

@pfalcon You had some nice stress tests for TCP, are you able to run them against this PR?

@pfalcon
Copy link
Contributor

pfalcon commented Mar 11, 2020

@pfalcon You had some nice stress tests for TCP, are you able to run them against this PR?

In all fairness, I don't have further stress tests beyond those mentioned in #23246, #23302. I'm aware of tools like wrk, I also used boom because it's Python-scriptable, but for Zephyr testing I so far stick with ab, because it's widely known and available tool, and standardizing on it would allow to have apples to apples comparison with other people.

Regarding testing this: ever since we last talked in November, I'm working on setting up automated integration tests for Zephyr networking (and not only networking) - which is of course similar to what many Intel folks do. Except the tool we need to use here, LAVA, is designed for Linux testing, and requires adding more functionality for small baremetal devices like Zephyr. Then deployment to testing lab, then figuring out issues, etc., overall, goes slow. And right now I'm working on preparing "last" round of patches for LAVA monthly release.

Bottom line: I would hope to test this by the end of week. If that's too long and other folks will approve it, I'll test if after merge then. Thanks.

@@ -178,6 +178,13 @@ struct net_pkt {
* AF_UNSPEC.
*/
u8_t ppp_msg : 1; /* This is a PPP message */

u8_t tcp_first_msg : 1; /* Is this the first time this
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd find both this code comment and commit message which introduces it ("net: pkt: Add flag for telling if this is TCP 1st packet") to be unclear and confusing. To start with, we already maintain linked list of packets which were already sent (and may need to be resent) as tcp->sent_list. So, if a packet is not in that list, it's "1st packet" (ahem to the terminology), and if it's in that list, it's "not 1st packet". So, do we need an extra flag for that? If we do, how about naming it more clear? E.g., "resent" (or "resent_pkt", or "pkt_resent"). But note that we already have flags sent_or_eof and pkt_queued. So, as I said, this is getting more and more confusing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The flag tells is this the 1st time this packet has been sent. This is done in order to avoid doing one extra ref when sending it first time. It is indeed confusing, I try to clarify the documentation a bit. The sent_or_eof flag is for another purpose, there it means whether the pkt was sent in the first place. I tried to use the sent flag for detecting whether we need to ref the pkt in relevant places but it became very convoluted so it was much easier to just add a new flag for this. I agree that there are several flags now in the system and their interaction is hairy, but it is not easy to send the packet as there are two threads involved and the interaction is difficult when we have timers that expire when the packet is not even sent yet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that there are several flags now in the system and their interaction is hairy, but it is not easy to send the packet as there are two threads ...

Ok, thanks for clarification.

@pfalcon
Copy link
Contributor

pfalcon commented Mar 13, 2020

Would make sense to squash "net: pkt: Print caller and line when allocation fails" and "net: pkt: Print size of the failed net_buf allocation" together?

Copy link
Contributor

@pfalcon pfalcon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for patchset. I see that the "meat" of the fixes is in the "net: tcp: Fix memory leak when lot of incoming packets" commit, and it's hard to understand them from just speculative review. However, for other commits, I made some comments calling to increase clarity. Because otherwise we may have situation when some things are fixed, but the code gets only more convoluted (potentially obscuring further issues).

I still didn't have a chance to test it thoroughly.

subsys/net/ip/tcp.c Outdated Show resolved Hide resolved
subsys/net/ip/net_if.c Show resolved Hide resolved
subsys/net/ip/net_if.c Show resolved Hide resolved
@pfalcon
Copy link
Contributor

pfalcon commented Mar 16, 2020

Ok, preparing to test this thoroughly, and found that with the latest master 22b9167 I can't run big_http_download for qemu_x86 without #23487 .

Copy link
Contributor

@pfalcon pfalcon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so I did my usual testing matrix, which involves dumb_http_server/big_http_download, qemu_x86/frdm_k64f, and don't see neither improvements nor regressions with this patch. Which I think is a good sign. My results, as usual, are in https://docs.google.com/spreadsheets/d/1_8CsACPEXqrMIbxBKxPAds091tNAwnwdWkMKr3994QY/edit#gid=0 , which provides rather sparse, but still some trajectory over the years.

So, I think this patch is a positive matter, but I don't give +1 right away, as I think there's room to improve clarity/efficiency of the code, per my earlier comments. Thanks.

@jukkar
Copy link
Member Author

jukkar commented Mar 17, 2020

I think there's room to improve clarity/efficiency of the code, per my earlier comments

Sure, I was waiting your testing results before addressing the comments. I will send a new version next.

TCP code needs to know whether the pkt is sent the first time
or is it a resent one. This information is used when deciding
if the pkt ref count needs to be increased or not. The packet
does not need to increase ref count when sent first time, as
the ref count is already 1 when the pkt is created. But for the
2nd time the packet is sent, we will need to increase the ref
count in order to avoid buffer leak.

Signed-off-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
The code was leaking memory in TX side when there was lot of
incoming packets. The reason was that the net_pkt_sent() flag
was manipulated in two threads which caused races. The solution
is to move the sent flag check only to tcp.c.

Fixes zephyrproject-rtos#23246

Signed-off-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
It is useful to know that we received ACK when debugging the
TCP code.

Signed-off-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
The RX fragment debug print does not work (compile error)
if memory allocation debugging is enabled, so disable it
for time being.

Signed-off-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
It is useful to know who called the net_pkt allocator when
we run out of buffers.

Signed-off-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
If we are receiving UDP packet and if there is some error happening
inside zsock_recv_dgram(), then make sure that the net_pkt received
from recv_q is freed.

Signed-off-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
This information is useful when debugging things.

Signed-off-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
It is possible that net_pkt will disappear while we are sending
it, so save link address if we need that information.

Signed-off-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
@jukkar
Copy link
Member Author

jukkar commented Mar 17, 2020

Rebased against upstream and updated according to comments.

@jukkar jukkar merged commit 907dede into zephyrproject-rtos:master Mar 17, 2020
@jukkar jukkar deleted the tcp-mem-leak-checking branch March 17, 2020 11:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: API Changes to public APIs area: Networking
Projects
None yet
Development

Successfully merging this pull request may close these issues.

net: tx_bufs are not freed when NET_TCP_BACKLOG_SIZE is too high
4 participants