-
-
Notifications
You must be signed in to change notification settings - Fork 31.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reolink prevent ONVIF push being lost due to ConnectionResetError #91070
Conversation
Could this please be added to the 2023.4.2 milestone? |
@starkillerOG I'm not completely sure if the addition of asyncio shielding is the cause of this issue as I have not tried this integration before 2023.4 but the ONVIF POST requests to the home assistant endpoint never completes. I noticed this when I created a nodejs proxy server that forwards the ONVIF POST requests to the integration's webhook (the reason behind the proxy is because I have my home assistant on HTTPS and Reolink cameras can't send ONVIF requests to HTTPS endpoints). If I send an empty request to the webhook, the request returns an immediate 200 as the integration short circuits on the empty request body. When I send a real ONVIF request, the request hangs indefinitely and never completes. The integration does however register the ONVIF request and reacts accordingly. As far as I can tell nothing in the webhook handler should be blocking forever, the only possible culprit might be the asyncio.shield call. I wonder if this has any ramifications for the camera as the connections are being kept open for a long time. As the camera sends more and more requests it might start experiencing issues. |
On second thought I wonder if that's why the camera is dropping the connection so frequently, because it's running out of resources. |
@kevdliu what I think is happening: the received ONVIF data does not include the AI information or has other missing fields (like channel). The camera will be polled for its motion+AI state, this takes too long for the ONVIF Post request, so the connection is closed and the My latest commit to this PR ensures that as soon as the ONVIF data is read, the |
@bdraco a simpler alternative would be using the However I think this requires the |
I'm not sure we want to carry around the complexity of another job scheduler in core unless we find we need to do this multiple places. As for right now I don't think we have many other webhooks that need shielding |
@bdraco I already thought so, do you think the current implementation of this PR is then acceptable? |
Co-authored-by: J. Nick Koston <nick@koston.org>
I think you can write it something like this |
@bdraco thank you very much for the rewrite, I like it! |
If it doesn't work now, it's not us 🤷 |
I did aditional tests with the code of this PR (Current implementation), compared to how it was before (Old implementation) for 300 motion/AI people callbacks: I made a follow up PR #91478 that moves the read to the primary callback and uses try except to protect against cancellation and ConnectionResetError, for 300 callbacks that got: I also tested the follow up PR by adding |
Breaking change
Proposed change
When a ONVIF push message comes in, first the coroutine was shielded using asyncio.shield to prevent cancellation when the Reolink camera disconnects (The reolink camera has the nasty habbit of disconnecting really fast after sending the ONVIF push, it does not wait on the HTTP_OK).
However the await.shield can apperently take up to 182ms (and 68 ms) as seen in the image below (original situation).
T1 error message is logged just before the
await asyncio.shield(self.handle_webhook_shielded(hass, webhook_id, request))
T2 error message is logged at the top of
handle_webhook_shielded
In those cases the Reolink camera already disconnected, the couritine is successfully shielded, but the
data = await request.text()
would throw aConnectionResetError("Lost connection")
which was not handeld and did not apear in the log (probably because the parent task was already cancelled). In the image above I catched this error and logged it during testing.This all would result in ONVIF pushes being lost halfway through processing meaning doorbell press, people detection events etc. would not be detected.
In this PR first the data is read from the webhook before schielding the rest of the processing, in this way there is no delay of up to 182 ms and the connection is still open. In my testing I did not lose any ONVIF pushes anymore.
However there might still be the possiblity the
handle_webhook
could be cancelled if the camera manages to disconnect beforedata = await request.text()
finishes, I don't know enough about the details of aiohttp.web to know if this is realistic or even possible.I would very much like to not have the task be cancelled in the first place, but I don't think that is possible unless the cancellation policy of all HomeAssistant webhooks is changed, which is probably not a good idea. Ideally a more complex
CancelOnDisconnectRequestHandler
could be made which allows a integration to set if the tasks of the specific webhook get cancelled or not when it registers the webhook. Something in this direction was implemented by @bdraco in #88046 to temporarily solve a issue in aiohttp. But that goes way above my capabilities.Type of change
Additional information
Checklist
black --fast homeassistant tests
)If user exposed functionality or configuration variables are added/changed:
If the code communicates with devices, web services, or third-party tools:
Updated and included derived files by running:
python3 -m script.hassfest
.requirements_all.txt
.Updated by running
python3 -m script.gen_requirements_all
..coveragerc
.To help with the load of incoming pull requests: