Skip to content

Conversation

lnotspotl
Copy link
Member

@lnotspotl lnotspotl commented Jul 30, 2025

Purpose

Currently, in case of python applications, when a device is disconnected or the firmware crashes, the code in Pipeline's __exit__ method tries to retrieve a crashdump from the device. This usually takes several seconds during which the application (pybind11) doesn't check for interrupts or other errors, thus rendering the running program unresponsive (for instance, pressing the ctrl+C key combination has no effect).

This PR moves the mentioned long-blocking code into a detached thread and periodically checks for interrupts and other signals in the main thread. Doing this will allow users to interrupt the crashdump collecting procedure and exit their application essentially immediately.

Moreover, this PR introduces a tiny fix where the crashdump collection timeout was set to 9000 seconds (2.5 hours). I presume the intended unit of time in this case was meant to be milliseconds.

@lnotspotl lnotspotl self-assigned this Jul 30, 2025
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment bugbot run to trigger another review on this PR

if(PyErr_CheckSignals() != 0) {
throw py::error_already_set();
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Thread Safety & GIL Issues in Pipeline Cleanup

The __exit__ method introduces a detached thread for pipeline cleanup, leading to several issues:

  • Data Race & Use-after-scope: The local threadRunning variable is captured by reference by the detached thread. It's accessed concurrently by the main thread (reading) and the detached thread (writing) without synchronization, causing a data race and undefined behavior. If the main thread's __exit__ method exits early (e.g., due to an exception), threadRunning goes out of scope, leading to a use-after-scope bug when the detached thread attempts to access the destroyed variable, also resulting in undefined behavior. This could cause the main thread to loop indefinitely.
  • Missing GIL Release: The py::gil_scoped_release was removed. The main thread now holds the Python Global Interpreter Lock while polling, which can block other Python threads and degrade performance.
Locations (1)
Fix in Cursor Fix in Web

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the data-race is fine here. Either, the thread will reach the end of the lamba function and the __exit__ method will finish as expected or the entire program will exit prematurely in which case worrying about data-races is unnecessary.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One edge case (which I'm not sure can ever happen) is if the user on Python side wraps everything in try/catch and wants to restart the pipeline or something like that.

Having the thread detached in that scenario is dangerous, but I don't see a good quick solution for such a case, unless we add a way to cleanly interrupt the crashdump collection and just do a fast clean-up.

std::conditional_variable for the rescue.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@moratom Is the simple example bellow something that you had in mind? I am not sure I understand your concerns correctly.

#!/usr/bin/env python3

import cv2
import depthai as dai

# Create pipeline
with dai.Pipeline() as pipeline:
    # Define source and output
    cam = pipeline.create(dai.node.Camera).build()
    videoQueue = cam.requestOutput((640,400)).createOutputQueue()

    # Connect to device and start pipeline
    try:
        pipeline.start()
        while pipeline.isRunning():
            videoIn = videoQueue.get()
            assert isinstance(videoIn, dai.ImgFrame)
            cv2.imshow("video", videoIn.getCvFrame())

            if cv2.waitKey(1) == ord("q"):
                break
    except KeyboardInterrupt:
        print("Exception caught!")

Having tested the above, if the KeyboardInterrupt exception is caught by the try-except block, then that exception will not get propagated down to the bindings, thus working as expected.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moreover, having played around with the simplest of python scripts possible, similar in nature to the one above, I've noticed that calling Pipeline's constructor upon entering the context manager also renders the program unresponsive due to the device routine taking a considerable amount of time to find a device. Injecting in a detached thread, just like in the __exit__ method, seems to have solved that issue and so pressing ctrl+C leads to an almost immediate exit. This very much mimics the behavior of our C++ examples.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lnotspotl lnotspotl changed the title Improve depthai-core's responsiveness upon device crash or node errors Improve depthai-core's responsiveness upon a device crash or a node error Jul 30, 2025
@lnotspotl lnotspotl requested a review from Copilot July 30, 2025 19:29
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR improves the responsiveness of Python applications when handling device crashes or disconnections by preventing the application from becoming unresponsive during crashdump collection. The changes address timeout unit inconsistencies, refactor blocking cleanup operations, and allow proper interrupt handling during device error scenarios.

  • Moves blocking pipeline cleanup operations to a detached thread to maintain application responsiveness
  • Fixes crashdump timeout unit from seconds to milliseconds for consistency
  • Adds periodic interrupt checking to allow immediate application termination (e.g., Ctrl+C)

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
src/device/DeviceBase.cpp Fixes crashdump timeout calculation by correcting unit from seconds to milliseconds
bindings/python/src/pipeline/PipelineBindings.cpp Refactors pipeline cleanup to run in detached thread with interrupt checking
README.md Updates documentation to clarify timeout units are milliseconds

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants