Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

thread: Terminating threads go to zombie state instead of stopped #19197

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

chrysn
Copy link
Member

@chrysn chrysn commented Jan 25, 2023

Contribution description

Let's categorize threads in general:

  1. Looping threads. They just don't quit. (Or maybe they quit in an error situation).
  2. Setup threads. Once they quit, they're not started again.
  3. Threads that perform a task, and when they're done, they close, and expect to be created again.

Category 3 is currently hard to do -- I know of one example (the SUIT worker), and that got it subtly wrong.

I think that it'll be easier to get category 3 right if threads are not completely stopped after termination (something that makes their PID free for reuse, which means that once you hold a PID you'll never know if the thread maybe just terminated and anything you do with that PID now goes to a different process). Instead, this PR makes the thread into a zombie when it dies.

After this PR, it's up to the creator of the process to stop the zombie thread, if its resources are to be reused. Cleanup doesn't need to be done by the same task that created it -- could be anyone. Threads of category 1 and 2 can easily just stay zombies without clean-up: There was enough space in the PID range to accommodate them when they were live, so there's still space after. Category 3 tasks already needed to do some cleanup, now they have better tools for it.

Possible extensions (in this PR or a follow-up)

  • Documentation (really needs to go here)

  • Fixing the SUIT case

  • Storing the thread's return value somewhere in the zombie state (thread_zombify_with_value? Maybe unioned with the SP?)

  • Adding a way to wait for a thread when it's not clear yet that the thread has terminated. (It's generally hard to do that from a higher priority thread: Imagine you used a message or mutex to signal that the dying process is done, right before returning. Then the waiting thread would still be awoken right before the dying thread had a chance to go to zombie mode. One implementation solution would be for the waiting call to lift the awaited process's priority to the caller and yield; still, this can be abused and badly block the system).

    I think that having this is not critical for a MVP version of this PR -- after all, creating the thread anew will "just" fail, and the caller is invited to try again.

Testing procedure

TBD. Right now, this only contains the change itself to see what the fall-out will be.

Issues/PRs references

This is one way to close #19195.

There was previous discussion of this topic in #17542 -- back when I thought the underlying problem could be resolved more easily.

This is similar to how POSIX systems handle terminated threads. On those systems, there's a clear parent process that's responsible for clean-up -- here we don't have that, but we don't need it: Any process can do the clean-up, as long as it's only one process. (Similar to how we unlock mutexes).

@chrysn chrysn added Type: enhancement The issue suggests enhanceable parts / The PR enhances parts of the codebase / documentation Area: core Area: RIOT kernel. Handle PRs marked with this with care! Process: API change Integration Process: PR contains or issue proposes an API change. Should be handled with care. CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR Process: needs >1 ACK Integration Process: This PR requires more than one ACK labels Jan 25, 2023
@github-actions github-actions bot added the Process: missing approvals Integration Process: PR needs more ACKS (handled by action) label Jan 25, 2023
@chrysn chrysn added CI: run tests If set, CI server will run tests on hardware for the labeled PR CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR and removed CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR labels Jan 25, 2023
@riot-ci
Copy link

riot-ci commented Jan 25, 2023

Murdock results

FAILED

833de50 thread: Zombification is NORETURN when called from threads

Success Failures Total Runtime
3219 0 6860 04m:27s

Artifacts

bors bot added a commit that referenced this pull request Jan 27, 2023
19199: sys/suit: Ensure previous thread is stopped before reusing its stack r=benpicco a=chrysn

### Contribution description

Closes: #19195

If the thread has released the mutex but the thread has not terminated (which happens in the situation that would previously have overwritten a still active thread's state), then a warning is shown and the trigger is ignored.

### Testing procedure

This should work before and after:

* `make -C examples/suit_update BOARD=native all term`
* `aiocoap-client coap://'[fe80::3c63:beff:fe85:ca96%tapbr0]'/suit/trigger -m POST --payload 'coap://[2001:db8::]/foo'`
* In parallel, on the RIOT shell, run `suit fetch coap://[2001:db8::]/foo`
* After the first download fails, the second one starts right away ("suit_worker: update failed, hdr invalid" / "suit_worker: started").

Run again with the worker thread on low priority:

```patch
diff --git a/sys/suit/transport/worker.c b/sys/suit/transport/worker.c
index a54022fb28..e26701a64c 100644
--- a/sys/suit/transport/worker.c
+++ b/sys/suit/transport/worker.c
`@@` -70 +70 `@@`
-#define SUIT_COAP_WORKER_PRIO THREAD_PRIORITY_MAIN - 1
+#define SUIT_COAP_WORKER_PRIO THREAD_PRIORITY_MAIN + 1
```

Before, this runs the download once silently (no clue why it doesn't run it twice, but then again, I claim there's concurrent memory access from two thread, so who knows what happens). After, it runs a single download and shows an error message for the second one once the first is complete ("Ignoring SUIT trigger: worker is still busy.").

### Issues/PRs references

This may be made incrementally easier by #19197 -- that PR as it is now would spare us the zombification (because returning would do that), and having a `wait` function would allow us to turn the new error case into a success.

19205: boards/common: add common timer config for GD32VF103 boards r=benpicco a=benpicco



19207: examples/gnrc_border_router: static: use router from advertisements by default r=benpicco a=benpicco




Co-authored-by: chrysn <chrysn@fsfe.org>
Co-authored-by: Benjamin Valentin <benjamin.valentin@bht-berlin.de>
Co-authored-by: Benjamin Valentin <benpicco@beuth-hochschule.de>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: core Area: RIOT kernel. Handle PRs marked with this with care! CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR CI: run tests If set, CI server will run tests on hardware for the labeled PR Process: API change Integration Process: PR contains or issue proposes an API change. Should be handled with care. Process: missing approvals Integration Process: PR needs more ACKS (handled by action) Process: needs >1 ACK Integration Process: This PR requires more than one ACK Type: enhancement The issue suggests enhanceable parts / The PR enhances parts of the codebase / documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Race condition in SUIT around threads terminating
2 participants