Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reolink prevent ONVIF push being lost due to ConnectionResetError #91070

Merged
merged 23 commits into from
Apr 15, 2023

Conversation

starkillerOG
Copy link
Contributor

@starkillerOG starkillerOG commented Apr 8, 2023

Breaking change

Proposed change

When a ONVIF push message comes in, first the coroutine was shielded using asyncio.shield to prevent cancellation when the Reolink camera disconnects (The reolink camera has the nasty habbit of disconnecting really fast after sending the ONVIF push, it does not wait on the HTTP_OK).
However the await.shield can apperently take up to 182ms (and 68 ms) as seen in the image below (original situation).
T1 error message is logged just before the await asyncio.shield(self.handle_webhook_shielded(hass, webhook_id, request))
T2 error message is logged at the top of handle_webhook_shielded

afbeelding

In those cases the Reolink camera already disconnected, the couritine is successfully shielded, but the data = await request.text() would throw a ConnectionResetError("Lost connection") which was not handeld and did not apear in the log (probably because the parent task was already cancelled). In the image above I catched this error and logged it during testing.

This all would result in ONVIF pushes being lost halfway through processing meaning doorbell press, people detection events etc. would not be detected.

In this PR first the data is read from the webhook before schielding the rest of the processing, in this way there is no delay of up to 182 ms and the connection is still open. In my testing I did not lose any ONVIF pushes anymore.

However there might still be the possiblity the handle_webhook could be cancelled if the camera manages to disconnect before data = await request.text() finishes, I don't know enough about the details of aiohttp.web to know if this is realistic or even possible.

I would very much like to not have the task be cancelled in the first place, but I don't think that is possible unless the cancellation policy of all HomeAssistant webhooks is changed, which is probably not a good idea. Ideally a more complex CancelOnDisconnectRequestHandler could be made which allows a integration to set if the tasks of the specific webhook get cancelled or not when it registers the webhook. Something in this direction was implemented by @bdraco in #88046 to temporarily solve a issue in aiohttp. But that goes way above my capabilities.

Type of change

  • Dependency upgrade
  • Bugfix (non-breaking change which fixes an issue)
  • New integration (thank you!)
  • New feature (which adds functionality to an existing integration)
  • Deprecation (breaking change to happen in the future)
  • Breaking change (fix/feature causing existing functionality to break)
  • Code quality improvements to existing code or addition of tests

Additional information

Checklist

  • The code change is tested and works locally.
  • Local tests pass. Your PR cannot be merged unless tests pass
  • There is no commented out code in this PR.
  • I have followed the development checklist
  • I have followed the perfect PR recommendations
  • The code has been formatted using Black (black --fast homeassistant tests)
  • Tests have been added to verify that the new code works.

If user exposed functionality or configuration variables are added/changed:

If the code communicates with devices, web services, or third-party tools:

  • The manifest file has all fields filled out correctly.
    Updated and included derived files by running: python3 -m script.hassfest.
  • New or updated dependencies have been added to requirements_all.txt.
    Updated by running python3 -m script.gen_requirements_all.
  • For the updated dependencies - a link to the changelog, or at minimum a diff between library versions is added to the PR description.
  • Untested files have been added to .coveragerc.

To help with the load of incoming pull requests:

@starkillerOG
Copy link
Contributor Author

Could this please be added to the 2023.4.2 milestone?

@bdraco bdraco added this to the 2023.4.2 milestone Apr 8, 2023
@kevdliu
Copy link
Contributor

kevdliu commented Apr 10, 2023

@starkillerOG I'm not completely sure if the addition of asyncio shielding is the cause of this issue as I have not tried this integration before 2023.4 but the ONVIF POST requests to the home assistant endpoint never completes. I noticed this when I created a nodejs proxy server that forwards the ONVIF POST requests to the integration's webhook (the reason behind the proxy is because I have my home assistant on HTTPS and Reolink cameras can't send ONVIF requests to HTTPS endpoints).

If I send an empty request to the webhook, the request returns an immediate 200 as the integration short circuits on the empty request body. When I send a real ONVIF request, the request hangs indefinitely and never completes. The integration does however register the ONVIF request and reacts accordingly. As far as I can tell nothing in the webhook handler should be blocking forever, the only possible culprit might be the asyncio.shield call.

I wonder if this has any ramifications for the camera as the connections are being kept open for a long time. As the camera sends more and more requests it might start experiencing issues.

@kevdliu
Copy link
Contributor

kevdliu commented Apr 10, 2023

On second thought I wonder if that's why the camera is dropping the connection so frequently, because it's running out of resources.

@starkillerOG
Copy link
Contributor Author

@kevdliu what I think is happening: the received ONVIF data does not include the AI information or has other missing fields (like channel). The camera will be polled for its motion+AI state, this takes too long for the ONVIF Post request, so the connection is closed and the handle_webhook coroutine is cancelled so the HTTP_OK (200) will never be send. The shielding ensures the _handle_webhook keeps running untill the polling is completed and the entities are signalled.

My latest commit to this PR ensures that as soon as the ONVIF data is read, the handle_webhook coroutine is signalled to finish such that the HTTP_OK (200) is send. If polling needs to happen, it will be handled after sending the HTTP_OK (200).

@starkillerOG
Copy link
Contributor Author

@bdraco a simpler alternative would be using the aiojobs spawn method or atomic decorator as also suggested by the aiohttp docs: https://docs.aiohttp.org/en/latest/web_advanced.html#web-handler-cancellation

However I think this requires the from aiojobs.aiohttp import setup which I think schould be done somewhere in core which goes way above my head.

@bdraco
Copy link
Member

bdraco commented Apr 12, 2023

I'm not sure we want to carry around the complexity of another job scheduler in core unless we find we need to do this multiple places. As for right now I don't think we have many other webhooks that need shielding

@starkillerOG
Copy link
Contributor Author

@bdraco I already thought so, do you think the current implementation of this PR is then acceptable?
I have a lot of user issues related to this PR (see description) so would really like to get this into HA 2023.4.3

@balloob balloob modified the milestones: 2023.4.3, 2023.4.4 Apr 13, 2023
Co-authored-by: J. Nick Koston <nick@koston.org>
@frenck frenck modified the milestones: 2023.4.4, 2023.4.5 Apr 13, 2023
@bdraco
Copy link
Member

bdraco commented Apr 13, 2023

I think you can write it something like this
bdraco@7a939bf

@starkillerOG
Copy link
Contributor Author

@bdraco thank you very much for the rewrite, I like it!
It is a lot cleaner indeed.
I applied your suggestions.

@bdraco bdraco changed the title Reolink prevent ONVIF push beeing lost due to ConnectionResetError Reolink prevent ONVIF push being lost due to ConnectionResetError Apr 14, 2023
@balloob
Copy link
Member

balloob commented Apr 15, 2023

If it doesn't work now, it's not us 🤷

@balloob balloob merged commit adc8a13 into home-assistant:dev Apr 15, 2023
@starkillerOG
Copy link
Contributor Author

@balloob @bdraco thank you both very much for your help during this PR, I learned a lot from you guys!

@starkillerOG
Copy link
Contributor Author

I did aditional tests with the code of this PR (Current implementation), compared to how it was before (Old implementation) for 300 motion/AI people callbacks:
Old implementation: 69% push 0% polling 31% lost
Current implementation: 64% push 36% polling 0% lost

I made a follow up PR #91478 that moves the read to the primary callback and uses try except to protect against cancellation and ConnectionResetError, for 300 callbacks that got:
Follow up PR: 100% push 0% polling 0% lost

I also tested the follow up PR by adding await asyncio.sleep(5.0) in various places to trigger the cancellations and mallformed the ONVIF message, and then got polling callbacks, so the fallback for older camera's (or slow once) is still working.

@github-actions github-actions bot locked and limited conversation to collaborators Apr 16, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reolink person/pet/vehicle sensors repeatedly turn on/off and do not reflect camera events
5 participants