Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory leak each loop at IwIP: 2 LM & HB (not at 1.4 HB) #7059

Closed
civilman2006 opened this issue Feb 3, 2020 · 54 comments
Closed

Memory leak each loop at IwIP: 2 LM & HB (not at 1.4 HB) #7059

civilman2006 opened this issue Feb 3, 2020 · 54 comments
Assignees

Comments

@civilman2006
Copy link

civilman2006 commented Feb 3, 2020

  • Hardware: ESP8266EX

  • Core Version: SDK:2.2.2-dev(38a443e)/Core:2.6.3=20603000/lwIP:STABLE-2_1_2_RELEASE/glue:1.2-16-ge23a07e/BearSSL:89454af

  • Development Env: Arduino IDE

  • Operating System: Windows

  • Module: LOLIN Wemos D1 mini Pro & Wemos D1 r2 mini

  • Flash Size: 16MB

  • lwip Variant: v2 Lower Memory and Higher Bandwidth

  • Flash Frequency: 40Mhz

  • CPU Frequency: 80Mhz

  • Upload Using: SERIAL

  • Upload Speed: 460800

At v2 Lower Memory and Higher Bandwidth I saw memory leak each LOOP - 32 bytes or more. Try many options, with or without debug and so on... After ~26 minutes of run ESP goes to :oom and reboot with dump. Sometimes one or two or three loops go without a leak, but than mem leak continues.

If I switch to IwIP variant 1.4 Higher Bandwidth - memory leak stops and all work fine!
( SDK:2.2.2-dev(38a443e)/Core:2.6.3=20603000/lwIP:1.4.0rc2/BearSSL:89454af - that variant work fine) I can provide debug log but it will be the same as below, exclude memory leak.

#include <PubSubClient.h>
#include <ESP8266WiFi.h>

#define ESP8266

#define DEBUG 1

const char* ssid = "zzzzzz";
const char* password = "xxxxxxx";
const char* mqtt_server = "111.222.222.222";
const char* mqtt_user = "zzzz";
const char* mqtt_pass = "zzzz";
const char* mqtt_clientId = "gost-temp";
const char* mqtt_ping_topic = "apr/home/ping";
const char* mqtt_online_message = "online";
const char* mqtt_last_will = "offline";
const char* mqtt_topic_base = "apr/home/";
const int ping_time = 30;
bool firstRun = true;

WiFiClient espClient;
PubSubClient client(espClient);

uint32_t originalram;

unsigned long lastMeasure = 0;

void setup_wifi() {
  int zz = 0;
  delay(10);
#ifdef DEBUG
  Serial.println();
  Serial.print("Connecting to ");
  Serial.println(ssid);
#endif
  WiFi.mode(WIFI_STA);
  WiFi.begin(ssid, password);
  delay(500);
  while (WiFi.status() != WL_CONNECTED) {
    delay(100);
#ifdef DEBUG
    Serial.print(".");
#endif
    zz=zz+1;
    if (zz >= 100) {
      ESP.restart();
    }
    Serial.print(zz);
  }
#ifdef DEBUG
  Serial.println("");
  Serial.print("WiFi connected - ESP IP address: ");
  Serial.println(WiFi.localIP());
#endif
  delay(1000);
}

void reconnect() {
  delay(500);
  while (!client.connected()) {
#ifdef DEBUG
    Serial.print("Attempting MQTT connection...");
#endif
    String mqtt_clientIdRand = mqtt_clientId;
    mqtt_clientIdRand += String(random(0xffff), HEX);
    if (client.connect(String(mqtt_clientIdRand).c_str(), mqtt_user, mqtt_pass, (String(mqtt_ping_topic)).c_str(), 0, 1, mqtt_last_will)) {
#ifdef DEBUG
      Serial.println("connected");
#endif
    } else {
#ifdef DEBUG
      Serial.print("failed, rc=");
      Serial.print(client.state());
      Serial.println(" try again in 5 seconds");
#endif
      delay(5000);
    }
  }
}

void setup() {
  randomSeed(millis());
  Serial.begin(115200);
  setup_wifi();

  client.setServer(mqtt_server, 1883);

  if (!client.connected()) {
    reconnect();
  }
  client.loop();
  client.publish(mqtt_ping_topic, mqtt_online_message);
  client.loop();

  originalram = ESP.getFreeHeap();
}

void loop() {
  if ((millis() - lastMeasure) > (ping_time * 1000)) {
    lastMeasure = millis();

    client.loop();
    client.publish(mqtt_ping_topic, mqtt_online_message);
    client.loop();

    uint32_t ram = ESP.getFreeHeap();
    Serial.printf("RAM: %d  change %d\n", ram, (ram - originalram ));
  }
  delay(30);
}

Debug log:

07:49:35.414 -> SDK:2.2.2-dev(38a443e)/Core:2.6.3=20603000/lwIP:STABLE-2_1_2_RELEASE/glue:1.2-16-ge23a07e/BearSSL:89454af
07:49:35.414 -> 
07:49:35.414 -> Connecting to XXXXX
07:49:35.414 -> bcn 0
07:49:35.414 -> del if1
07:49:35.414 -> usl
07:49:35.414 -> mode : sta(84:f3:eb:db:5a:3c)
07:49:35.414 -> add if0
07:49:35.586 -> wifi evt: 8
07:49:36.171 -> .1.2wifi evt: 2
07:49:36.378 -> .3.4.5.6.7.8.9.10.11.12.13.14.15.16.17.18.19.20.21.22.23scandone
07:49:39.389 -> state: 0 -> 2 (b0)
07:49:39.389 -> .24state: 2 -> 3 (0)
07:49:39.389 -> state: 3 -> 5 (10)
07:49:39.389 -> add 0
07:49:39.389 -> aid 1
07:49:39.389 -> cnt 
07:49:39.492 -> .25
07:49:39.492 -> connected with XXXXXX, channel 6
07:49:39.561 -> dhcp client start...
07:49:39.561 -> wifi evt: 0
07:49:39.561 -> .26ip:192.168.88.122,mask:255.255.255.0,gw:192.168.88.1
07:49:39.596 -> wifi evt: 3
07:49:39.666 -> .27
07:49:39.666 -> WiFi connected - ESP IP address: 192.168.88.122
07:49:41.182 -> Attempting MQTT connection...[hostByName] Host: 111.111.111.111 is a IP!
07:49:41.182 -> :ref 1
07:49:41.216 -> :wr 70 0
07:49:41.216 -> :wrc 70 70 0
07:49:41.216 -> :ack 70
07:49:41.216 -> :rn 4
07:49:41.216 -> :c0 1, 4
07:49:41.216 -> connected
07:49:41.216 -> :wr 23 0
07:49:41.216 -> :wrc 23 23 0
07:49:41.285 -> :ack 23
07:49:49.372 -> pm open,type:2 0
07:50:03.705 -> :rcl
07:50:03.705 -> :abort
07:50:05.117 -> RAM: 50008  change 1296
07:50:35.151 -> RAM: 49976  change 1264
07:51:05.188 -> RAM: 49944  change 1232
07:51:35.216 -> RAM: 49944  change 1232
07:52:05.217 -> RAM: 49944  change 1232
07:52:35.249 -> RAM: 49880  change 1168
07:53:05.287 -> RAM: 49816  change 1104
07:53:35.307 -> RAM: 49784  change 1072
07:54:05.336 -> RAM: 49784  change 1072
07:54:35.374 -> RAM: 49688  change 976
07:55:05.382 -> RAM: 49560  change 848
07:55:35.429 -> RAM: 49464  change 752
07:56:05.438 -> RAM: 49464  change 752
07:56:35.483 -> RAM: 49336  change 624
07:57:05.512 -> RAM: 49080  change 368
07:57:35.547 -> RAM: 48856  change 144
07:58:05.566 -> RAM: 48648  change -64
07:58:35.574 -> RAM: 48488  change -224
07:59:05.604 -> RAM: 48008  change -704
07:59:35.638 -> RAM: 47528  change -1184
08:00:05.669 -> RAM: 47368  change -1344
08:00:35.683 -> RAM: 47176  change -1536
@d-a-v d-a-v self-assigned this Feb 3, 2020
@Jeroen88
Copy link
Contributor

Jeroen88 commented Feb 3, 2020

I am suspecting LWIP in a similar situation on the ESP32, see here, in this case (using a WiFiClientSecure) I suspect ssl_client->socket = lwip_socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) in ssl_client, because if I add lwip_close(ssl_client->socket) directly following this line, memory has leaked. These are lwip calls and so have nothing to do with mbedtls / sll and should thus be comparable to your situation. I am not sure, because I do not now if a combination of lwip_socket() followed by a lwip_close() should release all memory, I am not familiar enough with LWIP, but that seems logical to me. I am also not sure if the ESP32 uses the same lwip version.

@TD-er
Copy link
Contributor

TD-er commented Feb 3, 2020

I have something similar here.
It looks like the leak is related to making a WiFiClient connection.
I am running a test here in which my node cannot connect to a MQTT broker, so every N seconds (30 I believe) it will re-attempt a connection and after roughly 18 minutes it is out of memory.
Not sure if it is related to unsuccessful reconnects or not.

I've also tried to recreate the WiFiClient object again at each reconnect attempt, but that doesn't seem to make any difference.

mqtt = WiFiClient(); // workaround see: https://github.com/esp8266/Arduino/issues/4497#issuecomment-373023864 

@civilman2006
Copy link
Author

I have something similar here.
It looks like the leak is related to making a WiFiClient connection.
I am running a test here in which my node cannot connect to a MQTT broker, so every N seconds (30 I believe) it will re-attempt a connection and after roughly 18 minutes it is out of memory.
Not sure if it is related to unsuccessful reconnects or not.

I've also tried to recreate the WiFiClient object again at each reconnect attempt, but that doesn't seem to make any difference.

mqtt = WiFiClient(); // workaround see: https://github.com/esp8266/Arduino/issues/4497#issuecomment-373023864 

It seems to be a different case because I create only 1 connection at setup and then at the loop cycle only send 1 mqtt message and even without any mqtt message memory leak will present. So this is about something inside IwIP lib because switching to the oldest version 1.4 helps to resolve the situation!

In yours case, in my mind, it's about creating new connections and after fail attempt to establish it there is time_wait state which destroys connection only after 2 minutes delay.

@TD-er
Copy link
Contributor

TD-er commented Feb 4, 2020

@civilman2006 I agree it does look like a different set of symptoms.

Last night I als tested with the latest Git head of esp8266/Arduino and that does at least seem to solve this reboot issue due to memory exhaustion.
It looks like this commit may have fixed that. So maybe you can test also with the latest merges, just to be sure?

The issue I was talking about needed roughly 18 minutes to run out of memory, so I am not sure those idle connection attempts were destroyed after 2 minutes.

@d-a-v
Copy link
Collaborator

d-a-v commented Feb 4, 2020

I just tried the OP sketch and I have not the same output

Current git master, no debug:

09:37:51.869 -> Connecting to xxxx
09:37:52.565 -> .1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16.17.18.19.20.21.22.23.24.25
09:37:55.879 -> WiFi connected - ESP IP address: 10.0.1.225
09:37:57.370 -> Attempting MQTT connection...connected
09:38:21.797 -> RAM: 49352  change 1368
09:38:51.825 -> RAM: 49352  change 1368
09:39:21.873 -> RAM: 49352  change 1368
09:39:51.856 -> RAM: 49352  change 1368
09:40:21.873 -> RAM: 49352  change 1368
09:40:51.886 -> RAM: 49352  change 1368
09:41:21.874 -> RAM: 49352  change 1368
09:41:51.889 -> RAM: 49352  change 1368
09:42:21.915 -> RAM: 49352  change 1368
09:42:51.932 -> RAM: 49352  change 1368
09:43:21.948 -> RAM: 49352  change 1368
09:43:51.939 -> RAM: 49352  change 1368
09:44:21.956 -> RAM: 49352  change 1368
09:44:51.974 -> RAM: 49352  change 1368
09:45:21.963 -> RAM: 49352  change 1368
09:45:52.014 -> RAM: 49352  change 1368
09:46:21.999 -> RAM: 49352  change 1368
09:46:52.018 -> RAM: 49352  change 1368
09:47:22.035 -> RAM: 49352  change 1368
09:47:52.050 -> RAM: 49352  change 1368
09:48:22.067 -> RAM: 49352  change 1368

2.6.3, debug enabled

09:54:38.670 -> SDK:2.2.2-dev(38a443e)/Core:2.6.3=20603000/lwIP:IPv6+STABLE-2_1_2_RELEASE/glue:1.2-31-gf839746/BearSSL:89454af
09:54:38.670 -> wifi evt: 2
09:54:38.670 -> 
09:54:38.704 -> Connecting to yyyyy
09:54:38.704 -> scandone
09:54:38.803 -> scandone
09:54:38.803 -> state: 0 -> 2 (b0)
09:54:38.836 -> state: 2 -> 3 (0)
09:54:38.836 -> state: 3 -> 5 (10)
09:54:38.836 -> add 0
09:54:38.836 -> aid 1
09:54:38.836 -> cnt 
09:54:38.869 -> 
09:54:38.869 -> connected with yyyyy, channel 1
09:54:38.869 -> dhcp client start...
09:54:38.869 -> wifi evt: 0
09:54:38.869 -> ip:10.0.1.225,mask:255.255.255.0,gw:10.0.1.254
09:54:38.902 -> wifi evt: 3
09:54:39.201 -> 
09:54:39.201 -> WiFi connected - ESP IP address: 10.0.1.225
09:54:39.765 -> ip:10.0.1.225,mask:255.255.255.0,gw:10.0.1.254
09:54:40.660 -> ip:10.0.1.225,mask:255.255.255.0,gw:10.0.1.254
09:54:40.693 -> Attempting MQTT connection...[hostByName] Host: 10.0.1.253 is a IP!
09:54:40.693 -> :ref 1
09:54:40.693 -> :wr 49 0
09:54:40.693 -> :wrc 49 49 0
09:54:40.742 -> :ack 49
09:54:40.742 -> :rn 4
09:54:40.742 -> :c0 1, 4
09:54:40.742 -> connected
09:54:40.742 -> :wr 18 0
09:54:40.742 -> :wrc 18 18 0
09:54:40.742 -> :ack 18
09:54:41.654 -> ip:10.0.1.225,mask:255.255.255.0,gw:10.0.1.254
09:54:48.845 -> pm open,type:2 0
09:55:02.699 -> :rcl
09:55:02.699 -> :abort
09:55:08.632 -> RAM: 48088  change 1392
09:55:38.626 -> RAM: 48088  change 1392
09:56:08.652 -> RAM: 48088  change 1392
09:56:38.662 -> RAM: 48088  change 1392
09:57:08.654 -> RAM: 48088  change 1392
09:57:38.674 -> RAM: 48088  change 1392
09:58:08.699 -> RAM: 48088  change 1392

@d-a-v
Copy link
Collaborator

d-a-v commented Feb 4, 2020

@civilman2006 lwip2 version you use is "glue:1.2-16" which is about the one shipped with 2.6.3 while I have used "glue:1.2-31". That may explain that.

Unfortunately, #6887 is not merged yet.
Are you able to test it ?
If not I can try to generate an alpha release so you can try with the arduino board installer.

@d-a-v
Copy link
Collaborator

d-a-v commented Feb 4, 2020

Because I don't see what should have changed about an eventual memory leak, I tried with the same lwip2 version, and unfortunately I can see no leak.

10:27:35.598 -> SDK:2.2.2-dev(38a443e)/Core:2.6.3=20603000/lwIP:STABLE-2_1_2_RELEASE/glue:1.2-17-g354887a/BearSSL:89454af
10:27:35.598 -> wifi evt: 2
10:27:35.598 -> 
10:27:35.598 -> Connecting to wwwwww
10:27:35.664 -> scandone
10:27:36.294 -> .1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16.17.18.19.20.21.22.23scandone
10:27:39.507 -> state: 0 -> 2 (b0)
10:27:39.507 -> .24state: 2 -> 3 (0)
10:27:39.507 -> state: 3 -> 5 (10)
10:27:39.507 -> add 0
10:27:39.507 -> aid 1
10:27:39.507 -> cnt 
10:27:39.540 -> 
10:27:39.540 -> connected with wwwwwww, channel 1
10:27:39.540 -> dhcp client start...
10:27:39.540 -> wifi evt: 0
10:27:39.573 -> ip:10.0.1.225,mask:255.255.255.0,gw:10.0.1.254
10:27:39.573 -> wifi evt: 3
10:27:39.606 -> .25
10:27:39.606 -> WiFi connected - ESP IP address: 10.0.1.225
10:27:41.097 -> Attempting MQTT connection...[hostByName] Host: 10.0.1.253 is a IP!
10:27:41.097 -> :ref 1
10:27:41.097 -> :wr 49 0
10:27:41.097 -> :wrc 49 49 0
10:27:41.130 -> :ack 49
10:27:41.130 -> :rn 4
10:27:41.130 -> :c0 1, 4
10:27:41.130 -> connected
10:27:41.130 -> :wr 18 0
10:27:41.130 -> :wrc 18 18 0
10:27:41.130 -> :ack 18
10:27:49.511 -> pm open,type:2 0
10:28:02.697 -> :rcl
10:28:02.697 -> :abort
10:28:05.547 -> RAM: 50088  change 408
10:28:35.540 -> RAM: 50088  change 408
10:29:05.557 -> RAM: 50088  change 408
10:29:35.583 -> RAM: 50088  change 408
10:30:05.575 -> RAM: 50088  change 408
10:30:35.592 -> RAM: 50088  change 408
10:31:05.581 -> RAM: 50088  change 408
10:31:35.601 -> RAM: 50088  change 408

@civilman2006
Copy link
Author

@civilman2006 lwip2 version you use is "glue:1.2-16" which is about the one shipped with 2.6.3 while I have used "glue:1.2-31". That may explain that.

Unfortunately, #6887 is not merged yet.
Are you able to test it ?
If not I can try to generate an alpha release so you can try with the arduino board installer.

Thx for the reply!

I try to like 2 hours to find how to update glue from 1.2-16 to 1.2-31+, but I can't find the right way to do it at windows, because I don't have 'make' and so on... I update the board from git, that works fine, but I have the same version of glue = 1.2-16 and got the same error with a memory leak.

If it possible comment me on how to update glue?
My current version of SDK & so on:
SDK:2.2.2-dev(38a443e)/Core:2.6.3-44-g6be56161=20603044/lwIP:STABLE-2_1_2_RELEASE/glue:1.2-16-ge23a07e/BearSSL:0645c68 (from esp8266 debug message)

And in the next message, you wrote that there is no problem at old glue - but the version is different glue:1.2-17 and mine is 1.2-16...

@d-a-v
Copy link
Collaborator

d-a-v commented Feb 4, 2020

1.2-16 and 1.2-17 make no difference.
lwip2 is updated by #6887 which is yet to be merged. You can however try a pull -request with for example this gist.
But as I said, it may not be the issue. More testers are needed, able to reproduce your issue.

@civilman2006
Copy link
Author

If not I can try to generate an alpha release so you can try with the arduino board installer.

Thx for the reply! I can't find way to update with #6887 because windows Arduino IDE & no compiler & make tool... So if you can help me with providing a link to some instruction I can try to test this future merge.. Or I can try alpha release...

@Juppit
Copy link
Contributor

Juppit commented Feb 10, 2020 via email

@civilman2006
Copy link
Author

Thx for the script!
I install and successfully build local version BUT I have same issue with memory because version of Glue is not updated?!
SDK:2.2.2-dev(38a443e)/Core:2.6.3-56-g5efdc776=20603056/lwIP:STABLE-2_1_2_RELEASE/glue:1.2-16-ge23a07e/BearSSL:0645c68

I think that I correctly install boards: "C:\Users\Дмитрий\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.7.0-dev-nightly+20200216"\
only this build is installed... I select Generic 8266 -> 4M (3SPIFF).. Dunno why glue still the same version?

I just updated a script on https://gist.github.com/Juppit/5e1e61eceb4c9a63136ff4d5411b5ff1, which will create - a cygwin installation on windows - downloads the esp8266 repository - and build a board support package for the Arduino IDE. A http server will deliver the stuff at least to the IDE wenn you use 'http://localhost:8000/versions/package_esp8266com_index.json' in the IDE settings for the IDE. On 05.02.2020 at 06:35 wrote Dmitriy Khizhinskiy:

If not I can try to generate an alpha release so you can try with the arduino board installer. Thx for the reply! I can't find way to update with #6887 <#6887> because windows Arduino IDE & no compiler & make tool... So if you can help me with providing a link to some instruction I can try to test this future merge.. Or I can try alpha release... — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#7059?email_source=notifications&email_token=ACSTHJC2RKV5XY6PLCL66DLRBJFZRA5CNFSM4KO66GH2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEK2GN2Q#issuecomment-582248170>, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACSTHJCILD7DAV4HOREH6C3RBJFZRANCNFSM4KO66GHQ.

@laercionit
Copy link
Contributor

HELLO, I'm having this problem using the 2.6.3 kernel, I still haven't found anything to solve.
Has anyone been successful in solving this?

@d-a-v
Copy link
Collaborator

d-a-v commented Mar 18, 2020

@laercionit @civilman2006 Please have a try with beta version 0.0.1.

@laercionit
Copy link
Contributor

Hello .. Good night. I separated an IDE from Arduino to install the beta core.
It compiled without problems and I installed it on ESP8266.

The project is working normally, but still has a HEAP below what it should.

What I find strange is that the same code generates different HEAP on different routers using IPV4. I think the memory loss is related to a low quality network, with packet loss or with high latency.

I used WIFIClient to make a connection to google.com.br on port 80 to check if the equipment had internet access. I removed this routine to use the LIB ESP8266Ping and stopped booting the equipment due to lack of HEAP.

Using CORE BETA lwIP 2.0 or 2.6.3 lwIP 2.0 I still have significant loss of memory with PING, but nothing that generates boot.

I also use a WifiClient to post to the MQTT server.

@philbowles
Copy link

philbowles commented Apr 16, 2020

Is this fixed yet in the core? Am seeing it today in 2.6.3. My example occurs with no wifi calls at all (save for auto hidden "background" reconnect) ie. the following MVCE triggers it:

void setup() {
  // put your setup code here, to run once:
  Serial.begin(74880);
}

void loop() {
  // put your main code here, to run repeatedly:
  Serial.printf("%u\n",ESP.getFreeHeap());
  delay(1000);
}

Same versions etc as OP:
20:50:27.460 -> SDK:2.2.2-dev(38a443e)/Core:2.6.3=20603000/lwIP:STABLE-2_1_2_RELEASE/glue:1.2-16-ge23a07e/BearSSL:89454af

...as long as a valid reconnect has occurred in the past of course. If a full erase is done hence no wifif running on boot of above MVCE (or lwip1.4 used, as above) - no memory leak.

For my 2c - it occurs if lwip runs at all irrespective of code calling wificlient/ ssl etc etc it just has to be there

@philbowles
Copy link

@d-a-v "more testers..." I can offer some insight perhaps as I have two systems (apparently identical in terms of release levels, both 20:50:27.460 -> SDK:2.2.2-dev(38a443e)/Core:2.6.3=20603000/lwIP:STABLE-2_1_2_RELEASE/glue:1.2-16-ge23a07e/BearSSL:89454af) One of which does it every time and the other which never does. ANy use?

I'm thinking strange / special / unusual protocol packet on one of the nets, but not the other, that triggers the latent bug?

@TD-er
Copy link
Contributor

TD-er commented Apr 16, 2020

@philbowles Do these 2 boards connect to the same network with similar connection speed and stability?
Or does one of them need noticeable more time to get a connection?

I've seen different nodes running the same build on the same network being a lot less responsive.
Just like if there are differences in chip revisions or maybe some issues with board designs?
Running builds with different SDK builds (July/November SDK builds) may swap the behavior.

Apart from this strange behavior, it may also be like the buffers in LWIP fill up and are not released.
Maybe you need to check for presence of UDP packets and/or handle web server, or DNS requests (or whatever service you were running)

@philbowles
Copy link

@TD-er. One is my own in France, the other is one of my users in the UK who happens to have A LOT of devices (200+) on it. Mine is a more humble 20-30 - but big "air gap" between the two no common packets. :)

We have spent the best part of 2 days whittling, stripping back to ensure a) identical code b) the minimum non-WiFi MVCE above. He ALWAYS gets it even when he rolls back to 2.5.2 - I nver get it , ever...took me 2 days trying before I found this exact match for his symptoms...so again for my 2c its been lurking for some time and only comes out to play when network busy or packet-corrupted etc.

@philbowles
Copy link

PS his first job tomorrow is to wireshark it. I will report back his findings of anything he sees every 30 sec or just prior

@philbowles
Copy link

@TD-er "Apart from this strange behavior, it may also be like the buffers in LWIP fill up and are not released.
Maybe you need to check for presence of UDP packets and/or handle web server, or DNS requests (or whatever service you were running)" see my MVCE - we run NO WIFI at all. Thats the fun part ...in our code anyway - its there in the background for autoconnect.

void setup() {
  // put your setup code here, to run once:
  Serial.begin(74880);
}

void loop() {
  // put your main code here, to run repeatedly:
  Serial.printf("%u\n",ESP.getFreeHeap());
  delay(1000);
}

@TD-er
Copy link
Contributor

TD-er commented Apr 16, 2020

Thats the fun part ...in our code anyway - its there in the background for autoconnect.

Well if the node does auto-reconnect then it is connected, so it has the IP stack active.
If you try to send UDP packets to the node, then those will remain in the memory of the IP stack until they are fetched, or removed after some timeout. (at least that last part is something I hope is present)

@philbowles
Copy link

@TD-er agreed, but short of e.g MDNS / UPNP broadcasts nothing is directly targeting or trying to access the non-WiFi sketch...

I do believe that some kind of unsolicited broadcast is indeed the cause, hence the wiseshark plan for tomorrow...other than that, we have exhausted pretty much everything else. But whatever it is, they don't occur on my net, or at least not in any volume...depends on what the true cause/mix is, but....it doesn't happen here, always happens there. :(

@TD-er
Copy link
Contributor

TD-er commented Apr 16, 2020

Maybe the code has not been tested for left-driving WiFi traffic...

@philbowles
Copy link

You mean the "wrong" side of the road? I may live in France, but I'm a Brit! :)

@TD-er
Copy link
Contributor

TD-er commented Apr 16, 2020

Well no need to state which side is the correct side.
I know we drive on the right side :)

Well, at least it used to be, but now with Corona, I hardly drive these days.
So in the left side, there's nothing right.
In the right side, there's nothing left.

@philbowles
Copy link

Luckily my user's ESP8266 is not in his engine management system. He'd get 500yards at a time, virus or no virus!

@earlephilhower
Copy link
Collaborator

@philbowles, surely you mean 500 meters. 📏

I think we're all slowly losing it under lockdown!

@philbowles
Copy link

philbowles commented Apr 16, 2020

@earlephilhower I lost it years ago :)
PS you were right on my last issue - NOT locale -based (red face here :( ) but this is a different user! Previous was a 3rd party library updated (+bug) then bugfixed and all with no bump in version number...so while our systems looked the same , his had the ("same version") with the +100k code bug.

@philbowles
Copy link

FWIW got this from my user today "Looks like there is something coming in around every 5-6s. Most times the ESP recovers, but not always and that is when the heap loss happens. Capturing with WireShark, the only equipment sending packets out at this interval (that my PC also sees) are my SkyQ boxes. So I'm waiting to shut those down to see what difference that makes and I'll report back at some point."

...followed later by:

[turning off the] "SkyQ boxes reduced significantly the heap changes/losses. So not the whole story. "

From his 4.3MB wireshark capture, I count 8593 SSDP broadcasts from his skyQ boxes within a 3-minute window, or 48 per SECOND

My money is on lwip not freeing a udp asset during periods of (exceedingly) high throughput - hence why same config on my "quiet" little home net doesn't show the problem (yet)

@d-a-v
Copy link
Collaborator

d-a-v commented Apr 17, 2020

Very interesting.
Latest master includes some kind of protection against UDP flood. 2.6.3 does not have it.
If your user does not use git, but uses platormio, can stage version be tried ?
Otherwise there is the beta version 0.0.1 for arduino.

@philbowles
Copy link

@d-a-v I'm just hacking a udp+heap logger to try to pin it down further and see if its an "offensive packet" issue or a "rate" issue - will pass your comments to him - but my be a while as he's doing "real" work :) today

@philbowles
Copy link

FYI I'm sustaining bursts of 20/sec with no heap loss:

12:44:45.079 -> SDK:2.2.2-dev(38a443e)/Core:2.6.3=20603000/lwIP:STABLE-2_1_2_RELEASE/glue:1.2-16-ge23a07e/BearSSL:89454af
12:44:45.079 -> 
12:44:45.079 -> H4 version 0.5.4
12:44:45.079 -> SPIFFSImpl: allocating 512+240+1400=2152 bytes
12:44:45.079 -> SPIFFSImpl: mounting fs @200000, size=1fa000, block=2000, page=100
12:44:45.113 -> wifi evt: 2
12:44:45.147 -> SPIFFSImpl: mount rc=0
12:44:45.147 -> 
12:44:45.181 -> scandone
12:44:45.181 -> state: 0 -> 2 (b0)
12:44:45.215 -> SPIFFSImpl::open: fd=-10002 path=`/device` openMode=0 accessMode=1 err=-10002
12:44:45.215 -> state: 2 -> 3 (0)
12:44:45.215 -> state: 3 -> 5 (10)
12:44:45.215 -> add 0
12:44:45.215 -> aid 6
12:44:45.215 -> cnt 
12:44:45.283 -> 
12:44:45.283 -> connected with XXXXXXXX, channel 8
12:44:45.283 -> dhcp client start...
12:44:45.283 -> wifi evt: 0
12:44:45.283 -> SPIFFSImpl::open: fd=-10002 path=`/h4sv` openMode=0 accessMode=1 err=-10002
12:44:45.385 -> ip:192.168.1.104,mask:255.255.255.0,gw:192.168.1.1
12:44:45.385 -> wifi evt: 3
12:44:45.385 -> IP=192.168.1.104
12:44:45.385 -> udp logger: start heap 39824
12:44:50.973 -> From 192.168.1.30:51612 5973783 39520 99% rate=1/sec
12:44:51.278 -> From 192.168.1.30:51612 6280909 39520 99% rate=2/sec
12:44:51.584 -> From 192.168.1.30:51612 6590276 39520 99% rate=1/sec
12:44:55.243 -> pm open,type:2 0
12:44:56.499 -> From 192.168.1.21:62082 11503252 38992 97% rate=1/sec
12:45:00.911 -> From 192.168.1.30:50328 15906799 39000 97% rate=1/sec
12:45:01.231 -> From 192.168.1.30:50328 16215991 39000 97% rate=2/sec
12:45:01.511 -> From 192.168.1.30:50328 16521076 39000 97% rate=1/sec
12:45:10.957 -> From 192.168.1.30:54633 25941772 39000 97% rate=1/sec
12:45:11.263 -> From 192.168.1.30:54633 26249532 39000 97% rate=2/sec
12:45:11.501 -> From 192.168.1.30:54633 26482328 39000 97% rate=1/sec
12:45:20.995 -> From 192.168.1.30:49537 35978692 39000 97% rate=1/sec
12:45:21.301 -> From 192.168.1.30:49537 36285887 39000 97% rate=2/sec
12:45:21.607 -> From 192.168.1.30:49537 36591001 39000 97% rate=1/sec
12:45:26.011 -> From 192.168.1.1:43823 40995519 38824 97% rate=1/sec
12:45:26.011 -> From 192.168.1.1:43823 41003004 38840 97% rate=2/sec
12:45:26.011 -> From 192.168.1.1:43823 41010628 38784 97% rate=3/sec
12:45:26.011 -> From 192.168.1.1:43823 41018714 38776 97% rate=4/sec
12:45:26.051 -> From 192.168.1.1:43823 41029413 38768 97% rate=5/sec
12:45:26.051 -> From 192.168.1.1:43823 41037639 38760 97% rate=6/sec
12:45:26.171 -> From 192.168.1.1:43823 41165090 38824 97% rate=7/sec
12:45:26.171 -> From 192.168.1.1:43823 41168810 38840 97% rate=8/sec
12:45:26.171 -> From 192.168.1.1:43823 41172863 38784 97% rate=9/sec
12:45:26.171 -> From 192.168.1.1:43823 41177160 38776 97% rate=10/sec
12:45:26.211 -> From 192.168.1.1:43823 41181539 38768 97% rate=11/sec
12:45:26.211 -> From 192.168.1.1:43823 41185875 38760 97% rate=12/sec
12:45:30.931 -> From 192.168.1.30:38895 45909437 39000 97% rate=1/sec
12:45:31.211 -> From 192.168.1.30:38895 46216459 39000 97% rate=2/sec
12:45:31.531 -> From 192.168.1.30:38895 46525973 39000 97% rate=1/sec
12:45:40.947 -> From 192.168.1.30:33980 55944607 39000 97% rate=1/sec
12:45:41.267 -> From 192.168.1.30:33980 56251956 39000 97% rate=2/sec
12:45:41.587 -> From 192.168.1.30:33980 56561259 39000 97% rate=1/sec
12:45:45.045 -> From 192.168.1.1:36148 60042989 38704 97% rate=1/sec
12:45:45.045 -> From 192.168.1.1:36148 60047773 38696 97% rate=2/sec
12:45:45.045 -> From 192.168.1.1:36148 60053148 38632 97% rate=3/sec
12:45:45.080 -> From 192.168.1.1:53448 60058405 38640 97% rate=4/sec
12:45:45.080 -> From 192.168.1.1:60141 60063162 38696 97% rate=5/sec
12:45:45.080 -> From 192.168.1.1:60141 60070829 38656 97% rate=6/sec
12:45:45.080 -> From 192.168.1.1:53673 60076320 38624 96% rate=7/sec
12:45:45.080 -> From 192.168.1.1:53577 60081288 38696 97% rate=8/sec
12:45:45.080 -> From 192.168.1.1:53577 60086731 38640 97% rate=9/sec
12:45:45.114 -> From 192.168.1.1:34880 60092024 38640 97% rate=10/sec
12:45:45.114 -> From 192.168.1.1:51619 60097112 38704 97% rate=11/sec
12:45:45.114 -> From 192.168.1.1:51619 60102086 38696 97% rate=12/sec
12:45:45.114 -> From 192.168.1.1:51619 60107444 38632 97% rate=13/sec
12:45:45.114 -> From 192.168.1.1:33140 60112743 38640 97% rate=14/sec
12:45:45.114 -> From 192.168.1.1:33925 60117473 38696 97% rate=15/sec
12:45:45.114 -> From 192.168.1.1:33925 60122738 38656 97% rate=16/sec
12:45:45.147 -> From 192.168.1.1:60519 60128183 38624 96% rate=17/sec
12:45:45.147 -> From 192.168.1.1:43878 60133202 38696 97% rate=18/sec
12:45:45.215 -> From 192.168.1.1:43878 60192326 38640 97% rate=19/sec
12:45:45.215 -> From 192.168.1.1:33076 60197738 38640 97% rate=20/sec

@philbowles
Copy link

New & improved version - just can't provoke my system to to broadcast more than 20/sec:

13:29:37.534 -> udp logger: start heap 39712 min Delta=64
13:29:37.694 -> ******************* Heap LOSS 592 bytes ******************
13:29:37.694 -> From 192.168.1.112:1900 564924 39120 98% rate=1/sec (max 1)
13:29:37.735 -> From 192.168.1.112:1900 630918 39136 98% rate=2/sec (max 2)
13:29:38.055 -> From 192.168.1.112:1900 921557 39136 98% rate=3/sec (max 3)
13:29:38.135 -> From 192.168.1.112:1900 998479 39120 98% rate=4/sec (max 4)
13:29:38.455 -> From 192.168.1.112:1900 1330905 39120 98% rate=5/sec (max 5)
13:29:38.455 -> From 192.168.1.112:1900 1339604 39120 98% rate=6/sec (max 6)
13:29:38.455 -> From 192.168.1.112:1900 1348698 39120 98% rate=7/sec (max 7)
13:29:41.615 -> ******************* Heap GAIN 288 bytes ******************
13:29:41.615 -> From 192.168.1.30:59327 4502944 39408 99% rate=1/sec (max 7)
13:29:41.935 -> From 192.168.1.30:59327 4810313 39408 99% rate=2/sec (max 7)
13:29:42.255 -> From 192.168.1.30:59327 5117484 39408 99% rate=3/sec (max 7)
13:29:45.216 -> ******************* Heap LOSS 296 bytes ******************
13:29:45.216 -> From 192.168.1.1:52740 8090005 39112 98% rate=1/sec (max 7)
13:29:45.216 -> From 192.168.1.1:52740 8094388 39104 98% rate=2/sec (max 7)
13:29:45.216 -> ******************* Heap LOSS 72 bytes ******************
13:29:45.216 -> From 192.168.1.1:52740 8099734 39040 98% rate=3/sec (max 7)
13:29:45.256 -> From 192.168.1.1:33064 8105368 39048 98% rate=4/sec (max 7)
13:29:45.256 -> From 192.168.1.1:52744 8112009 39104 98% rate=5/sec (max 7)
13:29:45.256 -> From 192.168.1.1:52744 8117142 39064 98% rate=6/sec (max 7)
13:29:45.256 -> From 192.168.1.1:53846 8122606 39032 98% rate=7/sec (max 7)
13:29:45.256 -> From 192.168.1.1:41105 8127467 39104 98% rate=8/sec (max 8)
13:29:45.256 -> From 192.168.1.1:41105 8132855 39048 98% rate=9/sec (max 9)
13:29:45.256 -> From 192.168.1.1:47277 8138241 39048 98% rate=10/sec (max 10)
13:29:45.256 -> ******************* Heap GAIN 72 bytes ******************
13:29:45.296 -> From 192.168.1.1:55181 8152746 39112 98% rate=11/sec (max 11)
13:29:45.296 -> From 192.168.1.1:55181 8157766 39104 98% rate=12/sec (max 12)
13:29:45.296 -> ******************* Heap LOSS 72 bytes ******************
13:29:45.296 -> From 192.168.1.1:55181 8163191 39040 98% rate=13/sec (max 13)
13:29:45.296 -> From 192.168.1.1:53510 8173378 39048 98% rate=14/sec (max 14)
13:29:45.296 -> From 192.168.1.1:40372 8183241 39104 98% rate=15/sec (max 15)
13:29:45.336 -> From 192.168.1.1:40372 8201513 39064 98% rate=16/sec (max 16)
13:29:45.336 -> From 192.168.1.1:40793 8211597 39032 98% rate=17/sec (max 17)
13:29:45.336 -> From 192.168.1.1:59516 8224100 39104 98% rate=18/sec (max 18)
13:29:45.376 -> From 192.168.1.1:59516 8234653 39048 98% rate=19/sec (max 19)
13:29:45.376 -> From 192.168.1.1:52216 8245047 39048 98% rate=20/sec (max 20)
13:29:47.336 -> pm open,type:2 0
13:29:51.651 -> ******************* Heap LOSS 152 bytes ******************
13:29:51.651 -> From 192.168.1.30:39149 14538628 38888 97% rate=1/sec (max 20)
13:29:51.971 -> From 192.168.1.30:39149 14845414 38888 97% rate=2/sec (max 20)
13:29:52.291 -> From 192.168.1.30:39149 15152675 38888 97% rate=3/sec (max 20)
13:30:01.678 -> From 192.168.1.30:55947 24573123 38888 97% rate=1/sec (max 20)
13:30:01.985 -> From 192.168.1.30:55947 24880546 38888 97% rate=2/sec (max 20)
13:30:02.189 -> From 192.168.1.30:55947 25085252 38888 97% rate=3/sec (max 20)
13:30:09.275 -> From 192.168.1.21:54322 32153764 38840 97% rate=1/sec (max 20)
13:30:10.315 -> From 192.168.1.21:54322 33178264 38840 97% rate=1/sec (max 20)
13:30:11.236 -> ******************* Heap LOSS 176 bytes ******************
13:30:11.236 -> From 192.168.1.1:43823 34098385 38712 97% rate=1/sec (max 20)
13:30:11.236 -> From 192.168.1.1:43823 34105806 38728 97% rate=2/sec (max 20)
13:30:11.236 -> From 192.168.1.1:43823 34113515 38672 97% rate=3/sec (max 20)
13:30:11.236 -> From 192.168.1.1:43823 34124035 38664 97% rate=4/sec (max 20)
13:30:11.236 -> From 192.168.1.1:43823 34132444 38656 97% rate=5/sec (max 20)
13:30:11.276 -> From 192.168.1.1:43823 34140800 38648 97% rate=6/sec (max 20)
13:30:11.276 -> ******************* Heap GAIN 128 bytes ******************

@TD-er
Copy link
Contributor

TD-er commented Apr 17, 2020

Do you have a switch or access point that supports IGMP snooping?
If so, then it would be best to use that on the position where those boxes may interact with the network where the ESP nodes are.
Just to try if this is some kind of interaction with IGMP traffic.

@philbowles
Copy link

Do you have a switch or access point that supports IGMP snooping?
If so, then it would be best to use that on the position where those boxes may interact with the network where the ESP nodes are.
Just to try if this is some kind of interaction with IGMP traffic.

I will suggest that too - at the moment I'm placing all my bets on the high ssdp traffic, but - as ever - I could be wrong

@philbowles
Copy link

Progress. Following is graph of the steady decline of "bad" system running my SSDP logger. IPAddress(239,255,255,250);

Max rate per second peaked @ 42.
lwip

The steady and smooth decline suggest to me that it is not a rate issue since the rate fluctuates wildly . However, if I HAD to pint to the start of the decline...its not long after the 42/s peak

I have 50+mb of Wireshark PCAP that matches this log if anyone knows enough to spot any suspicious packets... there are 100k+ packets, I estimate 80% of them from 192.168.101

Examples of a typical broadcast is:

NOTIFY * HTTP/1.1
HOST: 239.255.255.250:1900
CACHE-CONTROL: max-age=115
LOCATION: http://192.168.1.110:49153/description4.xml
OPT: "http://schemas.upnp.org/upnp/1/0/"; ns=01
01-NLS: b2161680-1dd1-11b2-a99c-89f8d67f4c3f
NT: upnp:rootdevice
NTS: ssdp:alive
SERVER: GW-Device/1.0.0 (Sky, ES140, )
X-User-Agent: redsonic
USN: uuid:55443176-3147-4761-7465-9002184c3e11::upnp:rootdevice

NOTIFY * HTTP/1.1
HOST: 239.255.255.250:1900
CACHE-CONTROL: max-age=115
LOCATION: http://192.168.1.110:49153/description1.xml
OPT: "http://schemas.upnp.org/upnp/1/0/"; ns=01
01-NLS: b2161680-1dd1-11b2-a99c-89f8d67f4c3f
NT: urn:schemas-nds-com:device:GatewaySkyControl:2
NTS: ssdp:alive
SERVER: GW-Device/1.0.0 (Sky, ES140, )
X-User-Agent: redsonic
USN: uuid:444D5276-3247-4761-7465-9002184c3e11::urn:schemas-nds-com:device:GatewaySkyControl:2

NOTIFY * HTTP/1.1
HOST: 239.255.255.250:1900
CACHE-CONTROL: max-age=115
LOCATION: http://192.168.1.110:49153/description4.xml
OPT: "http://schemas.upnp.org/upnp/1/0/"; ns=01
01-NLS: b2161680-1dd1-11b2-a99c-89f8d67f4c3f
NT: urn:schemas-nds-com:service:SkyPairing:1
NTS: ssdp:alive
SERVER: GW-Device/1.0.0 (Sky, ES140, )
X-User-Agent: redsonic
USN: uuid:55443176-3147-4761-7465-9002184c3e11::urn:schemas-nds-com:service:SkyPairing:1

... etc

Which textual look sensible apart from the the max-age which looks way too low to a newbie(ish) like me , but I wouldn't have a clue re the validity of the header.

The header of the last one is:

image

# Log

16:41:56.961 -> udp logger: start heap 39936 min Delta=32
16:41:57.609 -> ******************* Heap LOSS 2304 bytes ******************
16:41:57.609 -> From 192.168.1.110:38503 2415624 37632 94% rate=1/sec (max 1)
16:41:57.609 -> ******************* Heap GAIN 592 bytes ******************
16:41:57.609 -> From 192.168.1.110:38503 2422390 38224 95% rate=2/sec (max 2)
16:41:57.643 -> ******************* Heap GAIN 600 bytes ******************
16:41:57.643 -> From 192.168.1.110:38503 2438526 38824 97% rate=3/sec (max 3)
16:41:57.643 -> ******************* Heap GAIN 656 bytes ******************
16:41:57.643 -> From 192.168.1.110:59621 2454656 39480 98% rate=4/sec (max 4)
16:41:57.711 -> ******************* Heap LOSS 1848 bytes ******************
16:41:57.711 -> From 192.168.1.110:44673 2517903 37632 94% rate=5/sec (max 5)
16:41:57.711 -> ******************* Heap GAIN 592 bytes ******************
16:41:57.711 -> From 192.168.1.110:44673 2524799 38224 95% rate=6/sec (max 6)
16:41:57.745 -> ******************* Heap GAIN 600 bytes ******************
16:41:57.745 -> From 192.168.1.110:44673 2540934 38824 97% rate=7/sec (max 7)

< Big snip >

17:10:09.642 -> ******************* Heap GAIN 600 bytes ******************
17:10:09.642 -> From 192.168.1.110:55689 1694436226 11904 29% rate=3/sec (max 42)
17:10:09.677 -> ******************* Heap GAIN 656 bytes ******************
17:10:09.677 -> From 192.168.1.110:59684 1694457392 12560 31% rate=4/sec (max 42)
17:10:09.677 -> ******************* Heap GAIN 640 bytes ******************
17:10:09.677 -> From 192.168.1.110:40478 1694477674 13200 33% rate=5/sec (max 42)
17:10:09.712 -> ******************* Heap GAIN 592 bytes ******************
17:10:09.712 -> From 192.168.1.110:40478 1694500984 13792 34% rate=6/sec (max 42)
17:10:09.747 -> ******************* Heap GAIN 600 bytes ******************
17:10:09.747 -> From 192.168.1.110:40478 1694521038 14392 36% rate=7/sec (max 42)
17:10:14.501 ->
17:10:14.501 -> Abort called
17:10:14.501 ->
17:10:14.501 -> >>>stack>>>
17:10:14.501 ->
17:10:14.501 -> ctx: sys
17:10:14.501 -> sp: 3fffeb70 end: 3fffffb0 offset: 01b0
17:10:14.501 -> 3fffed20: 3fffafa4 4021cf7e 0000018a 4021d1a4
17:10:14.501 -> 3fffed30: 00000000 00000000 3fffbc6e 4021d96c
17:10:14.501 -> 3fffed40: 000003c6 3fffee08 3fffb78c 40214120
17:10:14.536 -> 3fffed50: 000003c6 3fffee08 3ffeef00 3fff9024
17:10:14.536 -> 3fffed60: 3fffee20 3fffee20 3fffbae4 4021d9be
17:10:14.536 -> 3fffed70: 3fffed90 40208702 00000197 4020eea4
17:10:14.536 -> 3fffed80: 3fff715c 3fffbc6e 3fffbae4 4020871b
17:10:14.536 -> 3fffed90: 3fffa4cc 4021cf7e 40206928 4021d1fa
17:10:14.571 -> 3fffeda0: 3fff905e 00000000 3fff8b24 40214120
17:10:14.571 -> 3fffedb0: 0000936a 3fff905e 3ffeed58 402069f2
17:10:14.571 -> 3fffedc0: 40214e40 00000000 3ffeedb8 3fffee64
... etc

@TD-er
Copy link
Contributor

TD-er commented Apr 17, 2020

what happens if you block the traffic for a few minutes after you noticed a steady decline in available heap memory?
E.g. pulling the ethernet cable from the access point or the Sky box.

Can you also plot the heap fragmentation?

@philbowles
Copy link

TD-er good suggestions - I was worried about fragmentation when I saw that last fail @ 14k remaining heap...may amend my logger to include largest free block - trying to avoid a quick n dirty single-purpose bug-hunting tool becoming another distracting project!

Problem with other option he has 200 -odd devices, all dhcp'd so he'd be pulling cables for a week, but I'll suggest it.

What's your view on the 115sec UPNP "TTL" ?

@philbowles
Copy link

@TD-er Decode trace of the above fail - looks like it is on a malloc, so maybe the fragmentation is also a problem. I stress "also" because most of the fails drain it down to a handful of bytes.

Decoding stack results
0x4021cf7e: operator new(unsigned int) at ../../../../../dl/gcc-xtensa/libstdc++-v3/libsupc++/new_op.cc line 52
0x4021d1a4: std::string::_Rep::_S_create(unsigned int, unsigned int, std::allocator  const&) at /workdir/arena/gcc/xtensa-lx106-elf/libstdc++-v3/include/bits/basic_string.tcc line 621
0x4021d96c: std::string::_S_construct (char const*, char const*, std::allocator  const&, std::forward_iterator_tag) at /workdir/arena/gcc/xtensa-lx106-elf/libstdc++-v3/include/bits/basic_string.tcc line 142
0x40214120: std::_Function_base::~_Function_base() at c:\users\adam\appdata\local\arduino15\packages\esp8266\tools\xtensa-lx106-elf-gcc\2.5.0-3-20ed2b9\xtensa-lx106-elf\include\c++\4.8.2/functional line 2031
0x4021d9be: std::basic_string  , std::allocator  >::basic_string(char const*, std::allocator  const&) at /workdir/arena/gcc/xtensa-lx106-elf/libstdc++-v3/include/bits/basic_string.tcc line 218
0x40208702: stringFromBuff(unsigned char const*, int) at C:\Users\adam\OneDrive\Documents\Arduino\libraries\H4\src\H4Utils.cpp line 76
0x4020eea4: malloc(size_t) at C:\Users\adam\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.5.2\cores\esp8266\umm_malloc\umm_malloc.cpp line 1685
0x4020871b: stringFromBuff(unsigned char const*, int) at C:\Users\adam\OneDrive\Documents\Arduino\libraries\H4\src\H4Utils.cpp line 79
0x4021cf7e: operator new(unsigned int) at ../../../../../dl/gcc-xtensa/libstdc++-v3/libsupc++/new_op.cc line 52
0x40206928: std::_Function_base::_Base_manager    , std::allocator  >, IPAddress, short unsigned int)> >::_M_manager(std::_Any_data &, const std::_Any_data &, std::_Manager_operation) at c:\users\adam\appdata\local\arduino15\packages\esp8266\tools\xtensa-lx106-elf-gcc\2.5.0-3-20ed2b9\xtensa-lx106-elf\include\c++\4.8.2/functional line 1931
0x4021d1fa: std::basic_string  , std::allocator  >::~basic_string() at /workdir/arena/gcc/xtensa-lx106-elf/libstdc++-v3/include/bits/basic_string.h line 541
0x40214120: std::_Function_base::~_Function_base() at c:\users\adam\appdata\local\arduino15\packages\esp8266\tools\xtensa-lx106-elf-gcc\2.5.0-3-20ed2b9\xtensa-lx106-elf\include\c++\4.8.2/functional line 2031
0x402069f2: std::_Function_handler ::_M_invoke(const std::_Any_data &, AsyncUDPPacket &) at C:\Users\adam\OneDrive\Documents\Arduino\libraries\H4Plugins\src\H4P_udpLogger.cpp line 50
0x4020e998: _umm_free(void*) at C:\Users\adam\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.5.2\cores\esp8266\umm_malloc\umm_malloc.cpp line 1304
0x4020f0c0: free(void*) at C:\Users\adam\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.5.2\cores\esp8266\umm_malloc\umm_malloc.cpp line 1764
0x4021faf0: lwip_cyclic_timer at core/timeouts.c line 233
0x40226fd4: mem_free at core/mem.c line 237
0x40221d6a: pbuf_free_LWIP2 at core/pbuf.c line 786
0x4020a6c6: AsyncUDP::_recv(udp_pcb*, pbuf*, ip4_addr*, unsigned short) (C:\Program Files at x86)\Arduino\libraries\ESPAsyncUDP\src\AsyncUDP.cpp line 197
0x40226fb0: mem_malloc at core/mem.c line 210
0x4020a6fc: AsyncUDP::_s_recv(void*, udp_pcb*, pbuf*, ip4_addr const*, unsigned short) (C:\Program Files at x86)\Arduino\libraries\ESPAsyncUDP\src\AsyncUDP.cpp line 210
0x402223dc: udp_input at core/udp.c line 404
0x4020eea4: malloc(size_t) at C:\Users\adam\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.5.2\cores\esp8266\umm_malloc\umm_malloc.cpp line 1685
0x402262bc: ip4_input at core/ipv4/ip4.c line 704
0x4020e998: _umm_free(void*) at C:\Users\adam\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.5.2\cores\esp8266\umm_malloc\umm_malloc.cpp line 1304
0x4021e3c5: ethernet_input_LWIP2 at netif/ethernet.c line 188
0x4021e1fb: esp2glue_ethernet_input at glue-lwip/lwip-git.c line 441
0x4023d086: ethernet_input at glue-esp/lwip-esp.c line 363
0x4023d097: ethernet_input at glue-esp/lwip-esp.c line 371
0x4021b11c: __ssputs_r at /home/earle/src/esp-quick-toolchain/repo/newlib/newlib/libc/stdio/nano-vfprintf.c line 180
0x40216d37: _printf_i at /home/earle/src/esp-quick-toolchain/repo/newlib/newlib/libc/stdio/nano-vfprintf_i.c line 217
0x4021b5ff: _svfprintf_r at /home/earle/src/esp-quick-toolchain/repo/newlib/newlib/libc/stdio/nano-vfprintf.c line 660
0x4021b11c: __ssputs_r at /home/earle/src/esp-quick-toolchain/repo/newlib/newlib/libc/stdio/nano-vfprintf.c line 180
0x4021b1ea: __ssputs_r at /home/earle/src/esp-quick-toolchain/repo/newlib/newlib/libc/stdio/nano-vfprintf.c line 233
0x40216c90: _printf_i at /home/earle/src/esp-quick-toolchain/repo/newlib/newlib/libc/stdio/nano-vfprintf_i.c line 194
0x4021b1ea: __ssputs_r at /home/earle/src/esp-quick-toolchain/repo/newlib/newlib/libc/stdio/nano-vfprintf.c line 233
0x4021b11c: __ssputs_r at /home/earle/src/esp-quick-toolchain/repo/newlib/newlib/libc/stdio/nano-vfprintf.c line 180
0x40216d98: _printf_i at /home/earle/src/esp-quick-toolchain/repo/newlib/newlib/libc/stdio/nano-vfprintf_i.c line 244
0x4021b1ea: __ssputs_r at /home/earle/src/esp-quick-toolchain/repo/newlib/newlib/libc/stdio/nano-vfprintf.c line 233
0x4021b3d4: _svfprintf_r at /home/earle/src/esp-quick-toolchain/repo/newlib/newlib/libc/stdio/nano-vfprintf.c line 531
0x40218945: _vsnprintf_r at /home/earle/src/esp-quick-toolchain/repo/newlib/newlib/libc/stdio/vsnprintf.c line 73
0x4020ec7f: check_poison_block(umm_block*) at C:\Users\adam\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.5.2\cores\esp8266\umm_malloc\umm_malloc.cpp line 871
0x4020ecf3: check_poison_all_blocks() at C:\Users\adam\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.5.2\cores\esp8266\umm_malloc\umm_malloc.cpp line 892
0x4020ece8: check_poison_all_blocks() at C:\Users\adam\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.5.2\cores\esp8266\umm_malloc\umm_malloc.cpp line 889
0x4020ecf3: check_poison_all_blocks() at C:\Users\adam\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.5.2\cores\esp8266\umm_malloc\umm_malloc.cpp line 892
0x4020bd68: String::invalidate() at C:\Users\adam\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.5.2\cores\esp8266/WString.h line 267
0x40207ffc: loop() at C:\Users\adam\OneDrive\Documents\Arduino\libraries\H4\src\H4.cpp line 278
0x4020e156: uart_read(uart_t*, char*, size_t) at C:\Users\adam\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.5.2\cores\esp8266\uart.cpp line 274
0x4020e14c: uart_read(uart_t*, char*, size_t) at C:\Users\adam\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.5.2\cores\esp8266\uart.cpp line 249
0x4020e1bc: uart_read_char(uart_t*) at C:\Users\adam\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.5.2\cores\esp8266\uart.cpp line 236
0x402039e5: H4P_SerialCmd::_run() at C:\Users\adam\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.5.2\cores\esp8266/HardwareSerial.h line 134
0x4020e998: _umm_free(void*) at C:\Users\adam\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.5.2\cores\esp8266\umm_malloc\umm_malloc.cpp line 1304
0x4020f0c0: free(void*) at C:\Users\adam\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.5.2\cores\esp8266\umm_malloc\umm_malloc.cpp line 1764
0x4021cefe: operator delete(void*) at ../../../../../dl/gcc-xtensa/libstdc++-v3/libsupc++/del_op.cc line 48
0x402036f4: std::_Function_base::_Base_manager ::_M_manager(std::_Any_data&, std::_Function_base::_Base_manager  const&, std::_Manager_operation) at c:\users\adam\appdata\local\arduino15\packages\esp8266\tools\xtensa-lx106-elf-gcc\2.5.0-3-20ed2b9\xtensa-lx106-elf\include\c++\4.8.2/functional line 1954
0x40214120: std::_Function_base::~_Function_base() at c:\users\adam\appdata\local\arduino15\packages\esp8266\tools\xtensa-lx106-elf-gcc\2.5.0-3-20ed2b9\xtensa-lx106-elf\include\c++\4.8.2/functional line 2031
0x40207fdf: H4::loop() at c:\users\adam\appdata\local\arduino15\packages\esp8266\tools\xtensa-lx106-elf-gcc\2.5.0-3-20ed2b9\xtensa-lx106-elf\include\c++\4.8.2/functional line 2174
0x402036c0: std::_Function_base::_Base_manager ::_M_manager(std::_Any_data&, std::_Function_base::_Base_manager  const&, std::_Manager_operation) at c:\users\adam\appdata\local\arduino15\packages\esp8266\tools\xtensa-lx106-elf-gcc\2.5.0-3-20ed2b9\xtensa-lx106-elf\include\c++\4.8.2/functional line 1931
0x40203a68: std::_Function_handler ::_M_invoke(std::_Any_data const&) at C:\Users\adam\OneDrive\Documents\Arduino\libraries\H4Plugins\src/H4P_SerialCmd.h line 75
0x4020c5b9: esp_schedule() at C:\Users\adam\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.5.2\cores\esp8266\core_esp8266_main.cpp line 95
0x4020c64e: loop_wrapper() at C:\Users\adam\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.5.2\cores\esp8266\core_esp8266_main.cpp line 127

@philbowles
Copy link

Latest: user has throttled his router and ESP now receiving max 100k/s all traffic ...and - heap stable @ 95% of start value, so its back to looking like a rate thing with ESP / core / lwip not able to free packets fast enough AND that triggering a memleak.

I don't know what else I can do., but happy try try sensible and polite suggestions

@TD-er
Copy link
Contributor

TD-er commented Apr 17, 2020

The idea I had when suggesting it was something like this.

Just assume the unprocessed packets remain in memory for N seconds until they are cleared.
So the amount of heap allocations can remain constant as long as it does have enough time to process all stored packets at less than the rate they come in.
If you exceed that threshold once, you will see it uses a bit more memory, but it should keep up later on.

This all goes well as long as the time needed to process it remains constant and the rate of messages fluctuates so that it gets below the limit the node can handle every now and then.
But the time needed to process is not constant if the memory gets more fragmented.
New allocations do take more time on fragmented memory.
So after each 'burst' of messages, it may use a bit more memory and the heap does get more fragmented and thus increases the processing time.

Given this theory, you would see an increase of speed at which the free heap declines. Or at least as long as the average rate of packets remains constant and just about the initial threshold of what the node can handle. Also the rate must fluctuate to see this happening.

@TD-er
Copy link
Contributor

TD-er commented Apr 17, 2020

A simple test could be to run the ESP at 160 MHz. If it can keep up longer with the other conditions the same, then my theory is a bit more plausible.

@philbowles
Copy link

@TD-er thats our plan for tomorrow: Run 1: unthrottle router re-run @ 160Mhz, Run 2: throttle router to 100kb/s then rerun

@d-a-v
Copy link
Collaborator

d-a-v commented Apr 17, 2020

@philbowles #6895 was intended to solve a similar UDP issue (just read #6831).
Are you using 2.6.3 or did you and your tester tried using latest master ?

@philbowles
Copy link

philbowles commented Apr 17, 2020

@d-a-v Sorry still using 2.6.3 trying to nail the beast - if its been fixed we are wasting time, so will try to get him to do latest master tomorrow, merci! (et salutations de L'Orne 61330) :)

@devyte
Copy link
Collaborator

devyte commented Apr 18, 2020

The stack dump above says core 2.5.2, not 2.6.3.

@devyte
Copy link
Collaborator

devyte commented Apr 18, 2020

@d-a-v in case you haven't noticed it, @philbowles is using the AsyncUDP lib, not ours.

@philbowles
Copy link

Good news and bad news. "My man" has rerun some tests this morning and...debuggers worst nightmare: It's gone away. Nothing he has tried (2.5.2 reversion, 2.6.3 etc) will now cause the heap loss.

Even more surprising his SkyQ box is still flooding the netwrok, and my logger shows it is actually peaking at 55 broadcasts/second and stable as a rock, bouncing up n down between 80% and 95% as the rate fluctuates, but basically, "flatlining"

Tail of ths a.m. log after 45 minutes uptime:


09:42:13.982 -> From 192.168.1.110:35442 1554678487 38064 93% rate=6/sec (max 55)
09:42:13.982 -> ******************* Heap GAIN 656 bytes ******************
09:42:13.982 -> From 192.168.1.110:34212 1554695180 38720 95% rate=7/sec (max 55)
09:42:14.015 -> ******************* Heap LOSS 1840 bytes ******************
09:42:14.015 -> From 192.168.1.110:42525 1554712194 36880 91% rate=8/sec (max 55)
09:42:14.015 -> ******************* Heap GAIN 640 bytes ******************
09:42:14.015 -> From 192.168.1.110:34916 1554728689 37520 92% rate=9/sec (max 55)
09:42:14.049 -> ******************* Heap GAIN 592 bytes ******************
09:42:14.049 -> From 192.168.1.110:34916 1554745372 38112 94% rate=10/sec (max 55)
09:42:14.049 -> ******************* Heap GAIN 600 bytes ******************
09:42:14.049 -> From 192.168.1.110:34007 1554762205 38712 95% rate=11/sec (max 55)
09:42:14.083 -> ******************* Heap GAIN 632 bytes ******************
09:42:14.083 -> From 192.168.1.110:34916 1554779013 39344 97% rate=12/sec (max 55)
09:42:14.797 -> ******************* Heap GAIN 320 bytes ******************
09:42:14.797 -> From 192.168.1.101:63931 1555517278 39664 97% rate=1/sec (max 55)

His bewildered suggestion is that his boxes got firmware uploaded overnight. At a total loss for an explanation, I tend to agree with him, but only because I can think of few other realistic explanations. :(

The only +ve from this is that rate does not now seem to be the core issue. We think "bad packet by sky box (now fixed)" is/was the answer.

I wish I could tell you something different, but now neither of us can reproduce the problem.

I am still happy to try to help if i can , of course.

@TD-er
Copy link
Contributor

TD-er commented Apr 18, 2020

Maybe you also switched WiFi channels on the ESP, to one with less disturbances (less retransmits)?

@FinduschkaLi
Copy link

FinduschkaLi commented Apr 18, 2020

Hi I am following this thread closely, since I have a prob with ESP8266 resetting since 6 months.
I run a websocket client and observe resets several times a day (in particular with one particular wifi hotspot consisting in a wifi repeater, I could not yet reproduce the same behaviour on my mobile phone hotspot)

I basically loose heap in every reconnection to Wifi attempt when my router is switched off and can't be reached...

I tried a bunch of different things, no solution yet. Currently running tests with the beta 0.0.2 as indicated above, but behaviour stays the same. Will share the next stack prints. I am currently trying to make a minimal version to be able to reproduce the behaviour.

Library Version 2.6.3 + Beta 0.0.2
Lwip v2 - lower Memory
CPU 80Mhz

Let me know if I can be of any help to this.

@devyte
Copy link
Collaborator

devyte commented Apr 18, 2020

@FinduschkaLi that sounds like a completely different problem. Please don't hijack this thread, which is specific to a reported mem leak on each loop.

@devyte
Copy link
Collaborator

devyte commented Apr 18, 2020

@philbowles it sounds to me like your friend had a corrupted build. I've seen that reported, and a clean build from scratch would make the problem go away.

@devyte
Copy link
Collaborator

devyte commented Apr 18, 2020

@civilman2006 Your original mcve uses pubsub mqtt. I've seen mem leaks reported when using that 3rd party lib, and failure to reproduce without it using just our core. I suggest working with tbe authors of pubsub to reach a mcve that uses only our core.
I'm closing this. If any of the involved parties can produce a mcve that shows the mem leak and that doesn't use 3rd party libs, please open a new issue and follow the instructions in the issue template.

@devyte devyte closed this as completed Apr 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

10 participants