Memory leak each loop at lwIP: 2 LM & HB (not at 1.4 HB) #7059
Comments
I suspect lwIP in a similar situation on the ESP32, see here. In this case (using a WiFiClientSecure) I suspect ssl_client->socket = lwip_socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) in ssl_client, because if I add lwip_close(ssl_client->socket) directly after this line, memory has still leaked. These are lwIP calls and so have nothing to do with mbedtls / SSL, and should thus be comparable to your situation. I am not sure, because I do not know whether a lwip_socket() followed by a lwip_close() should release all memory; I am not familiar enough with lwIP, but that seems logical to me. I am also not sure whether the ESP32 uses the same lwIP version. |
I have something similar here. I've also tried to recreate the WiFiClient object again at each reconnect attempt, but that doesn't seem to make any difference. mqtt = WiFiClient(); // workaround see: https://github.com/esp8266/Arduino/issues/4497#issuecomment-373023864 |
It seems to be a different case, because I create only 1 connection at setup and then in each loop cycle send only 1 MQTT message, and even without any MQTT message the memory leak is present. So this is about something inside the lwIP lib, because switching to the older version 1.4 resolves the situation! In your case, to my mind, it's about creating new connections: after a failed attempt to establish one, there is a TIME_WAIT state, which destroys the connection only after a 2-minute delay. |
@civilman2006 I agree it does look like a different set of symptoms. Last night I also tested with the latest Git head of esp8266/Arduino, and that does at least seem to solve this reboot issue due to memory exhaustion. The issue I was talking about needed roughly 18 minutes to run out of memory, so I am not sure those idle connection attempts were destroyed after 2 minutes. |
I just tried the OP sketch and I do not get the same output. Current git master, no debug:
2.6.3, debug enabled
|
@civilman2006 lwip2 version you use is "glue:1.2-16" which is about the one shipped with 2.6.3 while I have used "glue:1.2-31". That may explain that. Unfortunately, #6887 is not merged yet. |
Because I don't see what should have changed about an eventual memory leak, I tried with the same lwip2 version, and unfortunately I can see no leak.
|
Thx for the reply! I spent about 2 hours trying to find out how to update glue from 1.2-16 to 1.2-31+, but I can't find the right way to do it on Windows, because I don't have 'make' and so on... I updated the board from git, which works fine, but I still have the same glue version (1.2-16) and get the same memory leak. If possible, could you tell me how to update glue? And in the next message you wrote that there is no problem with the old glue, but that version is glue:1.2-17 and mine is 1.2-16... |
Thx for the reply! I can't find a way to update with #6887, because I use the Windows Arduino IDE with no compiler or make tool... So if you can point me to some instructions, I can try to test this future merge. Or I can try an alpha release... |
I just updated a script at
https://gist.github.com/Juppit/5e1e61eceb4c9a63136ff4d5411b5ff1, which
- creates a cygwin installation on Windows
- downloads the esp8266 repository
- and builds a board support package for the Arduino IDE.
An HTTP server will then serve the package, at least to the IDE, when you use
'http://localhost:8000/versions/package_esp8266com_index.json'
in the IDE settings.
On 05.02.2020 at 06:35 wrote Dmitriy Khizhinskiy:
… If not I can try to generate an alpha release so you can try with
the arduino board installer.
|
Thx for the script! I think I installed the boards correctly: "C:\Users\Дмитрий\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.7.0-dev-nightly+20200216"\
|
Hello, I'm having this problem with the 2.6.3 core, and I still haven't found anything that solves it. |
@laercionit @civilman2006 Please have a try with beta version 0.0.1. |
Hello, good night. I set up a separate Arduino IDE to install the beta core. The project is working normally, but the heap is still lower than it should be. What I find strange is that the same code produces different heap values on different routers using IPv4. I think the memory loss is related to a low-quality network, with packet loss or high latency. I used WiFiClient to connect to google.com.br on port 80 to check whether the equipment had internet access. I replaced this routine with the ESP8266Ping library, and the equipment stopped rebooting from lack of heap. With either the beta core (lwIP 2.0) or 2.6.3 (lwIP 2.0) I still see significant memory loss with ping, but nothing that causes a reboot. I also use a WiFiClient to post to the MQTT server. |
Is this fixed yet in the core? Am seeing it today in 2.6.3. My example occurs with no WiFi calls at all (save for the automatic hidden "background" reconnect), i.e. the following MVCE triggers it:
void setup() {
  // put your setup code here, to run once:
  Serial.begin(74880);
}
void loop() {
  // put your main code here, to run repeatedly:
  Serial.printf("%u\n",ESP.getFreeHeap());
  delay(1000);
}
Same versions etc as OP: ...as long as a valid reconnect has occurred in the past, of course. If a full erase is done, so that no WiFi is running on boot of the above MVCE (or lwip 1.4 is used, as above), there is no memory leak. For my 2c: it occurs if lwip runs at all, irrespective of code calling WiFiClient / SSL etc. It just has to be there. |
@d-a-v "more testers..." I can offer some insight perhaps, as I have two systems, apparently identical in terms of release levels (both report 20:50:27.460 -> SDK:2.2.2-dev(38a443e)/Core:2.6.3=20603000/lwIP:STABLE-2_1_2_RELEASE/glue:1.2-16-ge23a07e/BearSSL:89454af), one of which does it every time and the other never does. Any use? I'm thinking a strange / special / unusual protocol packet on one of the nets, but not the other, triggers the latent bug? |
@philbowles Do these 2 boards connect to the same network with similar connection speed and stability? I've seen different nodes running the same build on the same network being a lot less responsive. Apart from this strange behavior, it may also be like the buffers in LWIP fill up and are not released. |
@TD-er. One is my own in France, the other is one of my users in the UK, who happens to have A LOT of devices (200+) on it. Mine is a more humble 20-30, but there is a big "air gap" between the two, so no common packets. :) We have spent the best part of 2 days whittling and stripping back to ensure a) identical code and b) the minimum non-WiFi MVCE above. He ALWAYS gets it, even when he rolls back to 2.5.2. I never get it, ever... It took me 2 days of trying before I found this exact match for his symptoms... so again, for my 2c, it's been lurking for some time and only comes out to play when the network is busy or packets are corrupted etc. |
PS his first job tomorrow is to wireshark it. I will report back his findings of anything he sees every 30 sec or just prior |
@TD-er "Apart from this strange behavior, it may also be like the buffers in LWIP fill up and are not released."
void setup() {
  // put your setup code here, to run once:
  Serial.begin(74880);
}
void loop() {
  // put your main code here, to run repeatedly:
  Serial.printf("%u\n",ESP.getFreeHeap());
  delay(1000);
} |
Well if the node does auto-reconnect then it is connected, so it has the IP stack active. |
@TD-er agreed, but short of e.g. MDNS / UPNP broadcasts, nothing is directly targeting or trying to access the non-WiFi sketch... I do believe that some kind of unsolicited broadcast is indeed the cause, hence the Wireshark plan for tomorrow... other than that, we have exhausted pretty much everything else. But whatever it is, it doesn't occur on my net, or at least not in any volume... depends on what the true cause/mix is, but... it doesn't happen here, always happens there. :( |
Maybe the code has not been tested for left-driving WiFi traffic... |
You mean the "wrong" side of the road? I may live in France, but I'm a Brit! :) |
Well no need to state which side is the correct side. Well, at least it used to be, but now with Corona, I hardly drive these days. |
Luckily my user's ESP8266 is not in his engine management system. He'd get 500yards at a time, virus or no virus! |
@philbowles, surely you mean 500 meters. 📏 I think we're all slowly losing it under lockdown! |
@earlephilhower I lost it years ago :) |
FWIW got this from my user today: "Looks like there is something coming in around every 5-6s. Most times the ESP recovers, but not always and that is when the heap loss happens. Capturing with WireShark, the only equipment sending packets out at this interval (that my PC also sees) are my SkyQ boxes. So I'm waiting to shut those down to see what difference that makes and I'll report back at some point." ...followed later by: [turning off the] "SkyQ boxes reduced significantly the heap changes/losses. So not the whole story." From his 4.3MB Wireshark capture, I count 8593 SSDP broadcasts from his SkyQ boxes within a 3-minute window, or 48 per SECOND. My money is on lwip not freeing a UDP asset during periods of (exceedingly) high throughput, hence why the same config on my "quiet" little home net doesn't show the problem (yet). |
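The per-second figure quoted above follows from the capture numbers: 8593 broadcasts over 180 seconds is roughly 47.7 per second. A minimal sketch of that arithmetic in plain C++ follows. It also includes a hypothetical helper, `peakPerSecond` (a name chosen for illustration, not taken from any logger in this thread), that finds the busiest one-second bucket in a list of packet timestamps, since an average can hide short bursts.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

// Average packet rate over a capture window.
double averagePerSecond(uint32_t packetCount, double windowSeconds) {
    return windowSeconds > 0.0 ? packetCount / windowSeconds : 0.0;
}

// Hypothetical helper: highest number of packets seen in any whole-second bucket.
// Timestamps are milliseconds since the start of the capture.
uint32_t peakPerSecond(const std::vector<uint32_t>& timestampsMs) {
    std::map<uint32_t, uint32_t> perSecond;  // second index -> packet count
    uint32_t peak = 0;
    for (uint32_t t : timestampsMs) {
        uint32_t count = ++perSecond[t / 1000];
        peak = std::max(peak, count);
    }
    return peak;
}
```

Calling `averagePerSecond(8593, 180.0)` reproduces the average from the capture; the peak helper would need real per-packet timestamps, which are not given in the thread.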
Very interesting. |
@d-a-v I'm just hacking a udp+heap logger to try to pin it down further and see if it's an "offensive packet" issue or a "rate" issue. Will pass your comments to him, but it may be a while as he's doing "real" work today :) |
FYI I'm sustaining bursts of 20/sec with no heap loss:
|
New & improved version - just can't provoke my system to broadcast more than 20/sec:
|
Do you have a switch or access point that supports IGMP snooping? |
I will suggest that too - at the moment I'm placing all my bets on the high ssdp traffic, but - as ever - I could be wrong |
what happens if you block the traffic for a few minutes after you noticed a steady decline in available heap memory? Can you also plot the heap fragmentation? |
@TD-er good suggestions. I was worried about fragmentation when I saw that last fail @ 14k remaining heap... I may amend my logger to include the largest free block, while trying to avoid a quick-n-dirty single-purpose bug-hunting tool becoming another distracting project! Problem with the other option: he has 200-odd devices, all DHCP'd, so he'd be pulling cables for a week, but I'll suggest it. What's your view on the 115sec UPNP "TTL"? |
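Logging the largest free block is useful because an allocation can fail while plenty of total heap remains. One simple way to express that as a percentage is to compare the largest free block with the total free heap. This metric is an assumption for illustration and is not necessarily the formula the ESP8266 core uses internally. On the device, the two inputs would come from the core's heap statistics (for example `ESP.getFreeHeap()` and, as far as I know, `ESP.getMaxFreeBlockSize()`); here they are plain parameters so the sketch runs anywhere.

```cpp
#include <cstdint>

// Simple fragmentation estimate in percent:
// 0 = all free memory is one contiguous block, values near 100 = badly fragmented.
// Illustrative metric only, not the core's own calculation.
uint8_t fragmentationPercent(uint32_t totalFree, uint32_t largestFreeBlock) {
    if (totalFree == 0 || largestFreeBlock >= totalFree) return 0;
    return static_cast<uint8_t>(100 - (100ULL * largestFreeBlock) / totalFree);
}

// True when a request of `size` bytes cannot be satisfied in one piece
// even though the total free heap looks sufficient.
bool failsDespiteFreeHeap(uint32_t totalFree, uint32_t largestFreeBlock, uint32_t size) {
    return size <= totalFree && size > largestFreeBlock;
}
```

With 14k of heap remaining, as in the failure mentioned above, a malloc could still fail if the largest contiguous block were smaller than the request; the block sizes themselves are not reported in the thread.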
@TD-er Decode trace of the above fail - looks like it is on a malloc, so maybe the fragmentation is also a problem. I stress "also" because most of the fails drain it down to a handful of bytes.
|
Latest: user has throttled his router and the ESP is now receiving max 100k/s of all traffic... and the heap is stable @ 95% of start value, so it's back to looking like a rate thing, with ESP / core / lwip not able to free packets fast enough AND that triggering a memleak. I don't know what else I can do, but I'm happy to try sensible and polite suggestions. |
The idea I had when suggesting it was something like this. Just assume the unprocessed packets remain in memory for N seconds until they are cleared. This all goes well as long as the time needed to process it remains constant and the rate of messages fluctuates so that it gets below the limit the node can handle every now and then. Given this theory, you would see an increase of speed at which the free heap declines. Or at least as long as the average rate of packets remains constant and just about the initial threshold of what the node can handle. Also the rate must fluctuate to see this happening. |
A simple test could be to run the ESP at 160 MHz. If it can keep up longer with the other conditions the same, then my theory is a bit more plausible. |
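The theory above can be pictured with a toy queue model: packets arrive at some rate each second, the node clears at most a fixed number per second, and whatever is left over stays in memory. This is only a sketch of the reasoning, not a model of how lwIP actually buffers packets; `simulateBacklog` and all numbers fed to it are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Toy queue model: every second `arrivals[i]` packets arrive and the node
// processes at most `capacityPerSecond`. Returns the backlog after each second.
std::vector<uint32_t> simulateBacklog(const std::vector<uint32_t>& arrivals,
                                      uint32_t capacityPerSecond) {
    std::vector<uint32_t> backlog;
    backlog.reserve(arrivals.size());
    uint32_t pending = 0;
    for (uint32_t arrived : arrivals) {
        pending += arrived;                               // new packets queue up
        pending -= std::min(pending, capacityPerSecond);  // node drains what it can
        backlog.push_back(pending);
    }
    return backlog;
}
```

In this model a steady rate just above capacity makes the backlog (and so the held memory) grow without bound, while a rate that dips below capacity now and then lets it drain. Running the ESP at 160 MHz would correspond to a larger `capacityPerSecond`, which is why that test could make the theory more or less plausible.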
@TD-er that's our plan for tomorrow: Run 1: unthrottle router, re-run @ 160 MHz; Run 2: throttle router to 100kb/s, then re-run |
@philbowles #6895 was intended to solve a similar UDP issue (just read #6831). |
@d-a-v Sorry still using 2.6.3 trying to nail the beast - if its been fixed we are wasting time, so will try to get him to do latest master tomorrow, merci! (et salutations de L'Orne 61330) :) |
The stack dump above says core 2.5.2, not 2.6.3. |
@d-a-v in case you haven't noticed it, @philbowles is using the AsyncUDP lib, not ours. |
Good news and bad news. "My man" has rerun some tests this morning and... debugger's worst nightmare: it's gone away. Nothing he has tried (2.5.2 reversion, 2.6.3 etc.) will now cause the heap loss. Even more surprising, his SkyQ box is still flooding the network, and my logger shows it is actually peaking at 55 broadcasts/second and the heap is stable as a rock, bouncing up and down between 80% and 95% as the rate fluctuates, but basically "flatlining". Tail of this a.m.'s log after 45 minutes uptime:
His bewildered suggestion is that his boxes got a firmware update overnight. At a total loss for an explanation, I tend to agree with him, but only because I can think of few other realistic explanations. :( The only +ve from this is that rate does not now seem to be the core issue. We think "bad packet by Sky box (now fixed)" is/was the answer. I wish I could tell you something different, but now neither of us can reproduce the problem. I am still happy to try to help if I can, of course. |
Maybe you also switched WiFi channels on the ESP, to one with less disturbances (less retransmits)? |
Hi, I am following this thread closely, since I have had a problem with an ESP8266 resetting for 6 months. I basically lose heap on every WiFi reconnection attempt when my router is switched off and can't be reached... I tried a bunch of different things, no solution yet. Currently running tests with the beta 0.0.2 as indicated above, but the behaviour stays the same. Will share the next stack prints. I am currently trying to make a minimal version to be able to reproduce the behaviour. Library Version 2.6.3 + Beta 0.0.2. Let me know if I can be of any help with this. |
@FinduschkaLi that sounds like a completely different problem. Please don't hijack this thread, which is specific to a reported mem leak on each loop. |
@philbowles it sounds to me like your friend had a corrupted build. I've seen that reported, and a clean build from scratch would make the problem go away. |
@civilman2006 Your original MCVE uses pubsub mqtt. I've seen mem leaks reported when using that 3rd party lib, and failure to reproduce without it using just our core. I suggest working with the authors of pubsub to reach an MCVE that uses only our core. |
Hardware: ESP8266EX
Core Version: SDK:2.2.2-dev(38a443e)/Core:2.6.3=20603000/lwIP:STABLE-2_1_2_RELEASE/glue:1.2-16-ge23a07e/BearSSL:89454af
Development Env: Arduino IDE
Operating System: Windows
Module: LOLIN Wemos D1 mini Pro & Wemos D1 r2 mini
Flash Size: 16MB
lwip Variant: v2 Lower Memory and Higher Bandwidth
Flash Frequency: 40Mhz
CPU Frequency: 80Mhz
Upload Using: SERIAL
Upload Speed: 460800
With v2 Lower Memory and Higher Bandwidth I see a memory leak on each LOOP: 32 bytes or more. I tried many options, with or without debug and so on... After ~26 minutes of running, the ESP goes to :oom and reboots with a dump. Sometimes one, two or three loops pass without a leak, but then the leak continues.
If I switch to lwIP variant 1.4 Higher Bandwidth, the memory leak stops and everything works fine!
(SDK:2.2.2-dev(38a443e)/Core:2.6.3=20603000/lwIP:1.4.0rc2/BearSSL:89454af: that variant works fine.) I can provide a debug log, but it will be the same as below, excluding the memory leak.
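A leak of this kind is easier to compare between builds when it is reduced to a rate and a projected time to exhaustion. The following is a minimal sketch in plain C++ under stated assumptions: `HeapSample`, `leakBytesPerSecond` and `secondsUntilExhausted` are hypothetical names, and on the device the samples would be filled from `millis()` and `ESP.getFreeHeap()`.

```cpp
#include <cstdint>

struct HeapSample {
    uint32_t timeMs;     // e.g. millis() when the reading was taken
    uint32_t freeBytes;  // e.g. ESP.getFreeHeap()
};

// Average leak rate in bytes per second between two samples.
// Returns 0 when the heap did not shrink or no time has passed.
double leakBytesPerSecond(const HeapSample& first, const HeapSample& last) {
    if (last.timeMs <= first.timeMs || last.freeBytes >= first.freeBytes) return 0.0;
    double seconds = (last.timeMs - first.timeMs) / 1000.0;
    return (first.freeBytes - last.freeBytes) / seconds;
}

// Projected seconds until the free heap reaches zero at the measured rate.
// Returns a negative value when no leak is measured.
double secondsUntilExhausted(const HeapSample& first, const HeapSample& last) {
    double rate = leakBytesPerSecond(first, last);
    return rate > 0.0 ? last.freeBytes / rate : -1.0;
}
```

As an illustration only: assuming roughly 50 kB free at start and one 32-byte loss per second, the projection comes out near the 26 minutes reported above. Both inputs are guesses, since the starting heap and loop period are not stated in the report.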
Debug log: