-
Notifications
You must be signed in to change notification settings - Fork 152
Improve depthai-core
's responsiveness upon a device crash or a node error
#1404
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve depthai-core
's responsiveness upon a device crash or a node error
#1404
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment bugbot run
to trigger another review on this PR
if(PyErr_CheckSignals() != 0) { | ||
throw py::error_already_set(); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Thread Safety & GIL Issues in Pipeline Cleanup
The __exit__
method introduces a detached thread for pipeline cleanup, leading to several issues:
- Data Race & Use-after-scope: The local
threadRunning
variable is captured by reference by the detached thread. It's accessed concurrently by the main thread (reading) and the detached thread (writing) without synchronization, causing a data race and undefined behavior. If the main thread's__exit__
method exits early (e.g., due to an exception),threadRunning
goes out of scope, leading to a use-after-scope bug when the detached thread attempts to access the destroyed variable, also resulting in undefined behavior. This could cause the main thread to loop indefinitely. - Missing GIL Release: The
py::gil_scoped_release
was removed. The main thread now holds the Python Global Interpreter Lock while polling, which can block other Python threads and degrade performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the data-race is fine here. Either, the thread will reach the end of the lamba function and the __exit__
method will finish as expected or the entire program will exit prematurely in which case worrying about data-races is unnecessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One edge case (which I'm not sure can ever happen) is if the user on Python side wraps everything in try/catch and wants to restart the pipeline or something like that.
Having the thread detached in that scenario is dangerous, but I don't see a good quick solution for such a case, unless we add a way to cleanly interrupt the crashdump collection and just do a fast clean-up.
std::conditional_variable
for the rescue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@moratom Is the simple example bellow something that you had in mind? I am not sure I understand your concerns correctly.
#!/usr/bin/env python3
import cv2
import depthai as dai
# Create pipeline
with dai.Pipeline() as pipeline:
# Define source and output
cam = pipeline.create(dai.node.Camera).build()
videoQueue = cam.requestOutput((640,400)).createOutputQueue()
# Connect to device and start pipeline
try:
pipeline.start()
while pipeline.isRunning():
videoIn = videoQueue.get()
assert isinstance(videoIn, dai.ImgFrame)
cv2.imshow("video", videoIn.getCvFrame())
if cv2.waitKey(1) == ord("q"):
break
except KeyboardInterrupt:
print("Exception caught!")
Having tested the above, if the KeyboardInterrupt
exception is caught by the try-except block, then that exception will not get propagated down to the bindings, thus working as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moreover, having played around with the simplest of python scripts possible, similar in nature to the one above, I've noticed that calling Pipeline's
constructor upon entering the context manager also renders the program unresponsive due to the device routine taking a considerable amount of time to find a device. Injecting in a detached thread, just like in the __exit__
method, seems to have solved that issue and so pressing ctrl+C leads to an almost immediate exit. This very much mimics the behavior of our C++ examples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
depthai-core
's responsiveness upon device crash or node errorsdepthai-core
's responsiveness upon a device crash or a node error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR improves the responsiveness of Python applications when handling device crashes or disconnections by preventing the application from becoming unresponsive during crashdump collection. The changes address timeout unit inconsistencies, refactor blocking cleanup operations, and allow proper interrupt handling during device error scenarios.
- Moves blocking pipeline cleanup operations to a detached thread to maintain application responsiveness
- Fixes crashdump timeout unit from seconds to milliseconds for consistency
- Adds periodic interrupt checking to allow immediate application termination (e.g., Ctrl+C)
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
File | Description |
---|---|
src/device/DeviceBase.cpp | Fixes crashdump timeout calculation by correcting unit from seconds to milliseconds |
bindings/python/src/pipeline/PipelineBindings.cpp | Refactors pipeline cleanup to run in detached thread with interrupt checking |
README.md | Updates documentation to clarify timeout units are milliseconds |
Purpose
Currently, in case of python applications, when a device is disconnected or the firmware crashes, the code in
Pipeline
's__exit__
method tries to retrieve a crashdump from the device. This usually takes several seconds during which the application (pybind11) doesn't check for interrupts or other errors, thus rendering the running program unresponsive (for instance, pressing the ctrl+C key combination has no effect).This PR moves the mentioned long-blocking code into a detached thread and periodically checks for interrupts and other signals in the main thread. Doing this will allow users to interrupt the crashdump collecting procedure and exit their application essentially immediately.
Moreover, this PR introduces a tiny fix where the crashdump collection timeout was set to 9000 seconds (2.5 hours). I presume the intended unit of time in this case was meant to be milliseconds.