Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stack overflow with the new NCM driver #2711

Closed
1 task done
showier-drastic opened this issue Jul 14, 2024 · 12 comments · Fixed by #2713
Closed
1 task done

Stack overflow with the new NCM driver #2711

showier-drastic opened this issue Jul 14, 2024 · 12 comments · Fixed by #2713
Labels

Comments

@showier-drastic
Copy link

Operating System

Linux

Board

Custom ESP32 S3

Firmware

My own, but this is not relevant - will explain below

What happened ?

The new USB NCM driver - introduced in #2227 - is causing a stack overflow.

It causes the problem like this:

tud_network_recv_renew will call into recv_transfer_datagram_to_glue_logic. recv_transfer_datagram_to_glue_logic will in turn call tud_network_recv_cb if there's data - and when the cb called, the recv_glue_ntb is not cleared yet.

The problem is that some tud_network_recv_cb, like the one in esp_tinyusb, will unconditionally call tud_network_recv_renew. So the whole thing happens again and again, causing infinite recursion.

So, it's either Espressif will need to change their code ( @ESP-Coco ) and documentation added in TinyUSB to prohibit calling tud_network_recv_renew directly within tud_network_recv_cb, or the logic within recv_transfer_datagram_to_glue_logic need to be changed to prevent this infinite recursion.

@HiFiPhile @rgrr could I ask for your opinions on this?

How to reproduce ?

If you are not using ESP32, call tud_network_recv_renew directly within tud_network_recv_cb like the ESP code mentioned above.

If you are using ESP32 with ESP-IDF, you can try to replace the src/class/net folder of the tinyusb component to the latest master branch of tinyusb.

Debug Log as txt file (LOG/CFG_TUSB_DEBUG=2)

1.txt

Screenshots

No response

I have checked existing issues, dicussion and documentation

  • I confirm I have checked existing issues, dicussion and documentation.
@showier-drastic
Copy link
Author

I think the current situation on reception handling is confusing and probably wrong. tud_network_recv_renew seems to be called twice in a single packet reception, the first time by tud_network_recv_renew_r which is in turn called by netd_xfer_cb, and another time by user logic when it finished processing the current packet. This also means recv_try_to_start_new_reception is called twice in a single reception.

@showier-drastic
Copy link
Author

showier-drastic commented Jul 14, 2024

OK, I understand that we may process multiple NCM "subpackets" in one callback, and that's why we need to repeatedly call tud_network_recv_renew.

So, I propose a solution to this issue to add re-entrance detection logic to tud_network_recv_renew function.

void tud_network_recv_renew(void) {
  TU_LOG_DRV("tud_network_recv_renew()\n");

  static bool processing = false;
  static bool should_process = false;

  should_process = true;
  if (processing) {
    TU_LOG_DRV("Re-entrant into tud_network_recv_renew, will process later\n");
    return;
  }

  while (should_process) {
    should_process = false;
    processing = true;
    // If the current function is called within recv_transfer_datagram_to_glue_logic,
    // should_process will become true, and the loop will run again
    // Otherwise the loop will not run again
    recv_transfer_datagram_to_glue_logic();
    processing = false;
  }
  recv_try_to_start_new_reception(ncm_interface.rhport);
} // tud_network_recv_renew

@rgrr
Copy link
Contributor

rgrr commented Jul 14, 2024

Hmmm... not getting the point. Do you think the problem is in the NCM driver or in the glue code?

Perhaps my own glue code could be of help?: https://github.com/rgrr/yapicoprobe/blob/master/src/net/net_glue.c

@showier-drastic
Copy link
Author

I think it's in the NCM driver, which cause compatibility issues with older codes that calls tud_network_recv_renew directly within tud_network_recv_cb. And this can be fixed by my function above.

@rgrr
Copy link
Contributor

rgrr commented Jul 15, 2024

ok... understood. Good point. Loop above seems to be too complicated for my eyes. Is something like this working:

void tud_network_recv_renew(void) {
  TU_LOG_DRV("tud_network_recv_renew()\n");

  static bool should_process = false;    // should go into ncm_interface_t

  if (should_process) {
    TU_LOG_DRV("Re-entrant into tud_network_recv_renew, will process later\n");
    return;
  }
  should_process = true;

  while (should_process) {
    should_process = false;
    // If the current function is called within recv_transfer_datagram_to_glue_logic,
    // should_process will become true, and the loop will run again
    // Otherwise the loop will not run again
    recv_transfer_datagram_to_glue_logic();
  }
  recv_try_to_start_new_reception(ncm_interface.rhport);
} // tud_network_recv_renew

Or will it end up again in an endless loop?

@showier-drastic
Copy link
Author

No, this will not work, because when you call recv_transfer_datagram_to_glue_logic(), should_process is false, so the if (should_process) block won't be executed. Keep in mind that the second call to tud_network_recv_renew is inside recv_transfer_datagram_to_glue_logic.

@rgrr
Copy link
Contributor

rgrr commented Jul 15, 2024

Hmmm... you are right (perhaps I need some time to rethink it ;-))

Perhaps a simple semaphore could do?:

void tud_network_recv_renew(void) {
  static bool sema = false;

  if ( !sema) {
    sema = true;

    TU_LOG_DRV("tud_network_recv_renew()\n");

    recv_transfer_datagram_to_glue_logic();
    sema = false;
  }
  recv_try_to_start_new_reception(ncm_interface.rhport);
} // tud_network_recv_renew

@showier-drastic
Copy link
Author

showier-drastic commented Jul 15, 2024

This will not work when there are 3 ntbs handled in one call to netd_xfer_cb.

I think either a loop or a recursion is needed in case tud_network_recv_renew is called directly within tud_network_recv_cb. One call to recv_transfer_datagram_to_glue_logic only process on ntb. As we need to process multiple ntbs, it's obvious that multiple calls to recv_transfer_datagram_to_glue_logic is needed. We don't want recursions (because that might cause stack overflow on MCUs), so this means we need loops.

@rgrr
Copy link
Contributor

rgrr commented Jul 15, 2024

you are right (again)... perhaps it could be solved by letting recv_transfer_datagram_to_glue_logic() return an indication that some work has been done (or that there is some work still waiting)?

@showier-drastic
Copy link
Author

Yes, but I think should_process in my code is this indication.

@rgrr
Copy link
Contributor

rgrr commented Jul 15, 2024

got that, but to be honest I'm not happy with two variables for this logic.

@rgrr
Copy link
Contributor

rgrr commented Jul 15, 2024

Could you please check PR #2713

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants