Improve `depthai-core`'s responsiveness upon a device crash or a node error #1404

lnotspotl · 2025-07-30T19:11:18Z

Purpose

Currently, in case of python applications, when a device is disconnected or the firmware crashes, the code in Pipeline's __exit__ method tries to retrieve a crashdump from the device. This usually takes several seconds during which the application (pybind11) doesn't check for interrupts or other errors, thus rendering the running program unresponsive (for instance, pressing the ctrl+C key combination has no effect).

This PR moves the mentioned long-blocking code into a detached thread and periodically checks for interrupts and other signals in the main thread. Doing this will allow users to interrupt the crashdump collecting procedure and exit their application essentially immediately.

Moreover, this PR introduces a tiny fix where the crashdump collection timeout was set to 9000 seconds (2.5 hours). I presume the intended unit of time in this case was meant to be milliseconds.

…evice to close

cursor

Comment bugbot run to trigger another review on this PR

cursor · 2025-07-30T19:13:15Z

bindings/python/src/pipeline/PipelineBindings.cpp

+                     if(PyErr_CheckSignals() != 0) {
+                         throw py::error_already_set();
+                     }
+                 }


Bug: Thread Safety & GIL Issues in Pipeline Cleanup

The __exit__ method introduces a detached thread for pipeline cleanup, leading to several issues:

Data Race & Use-after-scope: The local threadRunning variable is captured by reference by the detached thread. It's accessed concurrently by the main thread (reading) and the detached thread (writing) without synchronization, causing a data race and undefined behavior. If the main thread's __exit__ method exits early (e.g., due to an exception), threadRunning goes out of scope, leading to a use-after-scope bug when the detached thread attempts to access the destroyed variable, also resulting in undefined behavior. This could cause the main thread to loop indefinitely.

Missing GIL Release: The py::gil_scoped_release was removed. The main thread now holds the Python Global Interpreter Lock while polling, which can block other Python threads and degrade performance.

Locations (1)

bindings/python/src/pipeline/PipelineBindings.cpp#L113-L132

I think the data-race is fine here. Either, the thread will reach the end of the lamba function and the __exit__ method will finish as expected or the entire program will exit prematurely in which case worrying about data-races is unnecessary.

One edge case (which I'm not sure can ever happen) is if the user on Python side wraps everything in try/catch and wants to restart the pipeline or something like that.

Having the thread detached in that scenario is dangerous, but I don't see a good quick solution for such a case, unless we add a way to cleanly interrupt the crashdump collection and just do a fast clean-up.

std::conditional_variable for the rescue.

@moratom Is the simple example bellow something that you had in mind? I am not sure I understand your concerns correctly.

#!/usr/bin/env python3 import cv2 import depthai as dai # Create pipeline with dai.Pipeline() as pipeline: # Define source and output cam = pipeline.create(dai.node.Camera).build() videoQueue = cam.requestOutput((640,400)).createOutputQueue() # Connect to device and start pipeline try: pipeline.start() while pipeline.isRunning(): videoIn = videoQueue.get() assert isinstance(videoIn, dai.ImgFrame) cv2.imshow("video", videoIn.getCvFrame()) if cv2.waitKey(1) == ord("q"): break except KeyboardInterrupt: print("Exception caught!")

Having tested the above, if the KeyboardInterrupt exception is caught by the try-except block, then that exception will not get propagated down to the bindings, thus working as expected.

Moreover, having played around with the simplest of python scripts possible, similar in nature to the one above, I've noticed that calling Pipeline's constructor upon entering the context manager also renders the program unresponsive due to the device routine taking a considerable amount of time to find a device. Injecting in a detached thread, just like in the __exit__ method, seems to have solved that issue and so pressing ctrl+C leads to an almost immediate exit. This very much mimics the behavior of our C++ examples.

wo_interrupts.webm
w_interrupts.webm

Copilot

Pull Request Overview

This PR improves the responsiveness of Python applications when handling device crashes or disconnections by preventing the application from becoming unresponsive during crashdump collection. The changes address timeout unit inconsistencies, refactor blocking cleanup operations, and allow proper interrupt handling during device error scenarios.

Moves blocking pipeline cleanup operations to a detached thread to maintain application responsiveness
Fixes crashdump timeout unit from seconds to milliseconds for consistency
Adds periodic interrupt checking to allow immediate application termination (e.g., Ctrl+C)

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
src/device/DeviceBase.cpp	Fixes crashdump timeout calculation by correcting unit from seconds to milliseconds
bindings/python/src/pipeline/PipelineBindings.cpp	Refactors pipeline cleanup to run in detached thread with interrupt checking
README.md	Updates documentation to clarify timeout units are milliseconds

bindings/python/src/pipeline/PipelineBindings.cpp

…etached threads

check for interrupts periodically in __exit__ while waiting for the d…

519121f

…evice to close

lnotspotl self-assigned this Jul 30, 2025

cursor bot reviewed Jul 30, 2025

View reviewed changes

lnotspotl changed the title ~~Improve depthai-core's responsiveness upon device crash or node errors~~ Improve depthai-core's responsiveness upon a device crash or a node error Jul 30, 2025

properly release and re-acquire GIL

fa5088c

lnotspotl requested a review from Copilot July 30, 2025 19:29

Copilot AI reviewed Jul 30, 2025

View reviewed changes

bindings/python/src/pipeline/PipelineBindings.cpp Outdated Show resolved Hide resolved

bindings/python/src/pipeline/PipelineBindings.cpp Outdated Show resolved Hide resolved

bindings/python/src/pipeline/PipelineBindings.cpp Show resolved Hide resolved

lnotspotl added 2 commits August 4, 2025 09:54

Make other pipeline's python methods more responsive by introducing d…

c2cfd4a

…etached threads

Handle possible un-authorized memory accesses

a834fd9

lnotspotl mentioned this pull request Aug 25, 2025

Improve depthai-core's responsiveness upon a device crash or a node error, attempt 2 #1424

Draft

lnotspotl closed this Aug 26, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Improve `depthai-core`'s responsiveness upon a device crash or a node error #1404

Improve `depthai-core`'s responsiveness upon a device crash or a node error #1404

Uh oh!

lnotspotl commented Jul 30, 2025 •

edited

Loading

Uh oh!

cursor bot left a comment

Uh oh!

cursor bot Jul 30, 2025

Uh oh!

lnotspotl Jul 30, 2025

Uh oh!

moratom Jul 31, 2025

Uh oh!

lnotspotl Aug 4, 2025

Uh oh!

lnotspotl Aug 4, 2025

Uh oh!

lnotspotl Aug 4, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Improve depthai-core's responsiveness upon a device crash or a node error #1404

Improve depthai-core's responsiveness upon a device crash or a node error #1404

Uh oh!

Conversation

lnotspotl commented Jul 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Uh oh!

cursor bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor bot Jul 30, 2025

Choose a reason for hiding this comment

Bug: Thread Safety & GIL Issues in Pipeline Cleanup

Uh oh!

lnotspotl Jul 30, 2025

Choose a reason for hiding this comment

Uh oh!

moratom Jul 31, 2025

Choose a reason for hiding this comment

Uh oh!

lnotspotl Aug 4, 2025

Choose a reason for hiding this comment

Uh oh!

lnotspotl Aug 4, 2025

Choose a reason for hiding this comment

Uh oh!

lnotspotl Aug 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Improve `depthai-core`'s responsiveness upon a device crash or a node error #1404

Improve `depthai-core`'s responsiveness upon a device crash or a node error #1404

lnotspotl commented Jul 30, 2025 •

edited

Loading