Automations issue/ ZHA Network busy errors after migrating to Skyconnect dongle #86411

Open
jason1980p opened this issue Jan 23, 2023 · 122 comments

@jason1980p

The problem

After migrating to the Home Assistant SkyConnect USB dongle, I've been running into network busy errors.
The dongle is currently connected through a USB extension cable to a Raspberry Pi 4.

What version of Home Assistant Core has the issue?

Home Assistant 2023.1.6

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant OS

Integration causing the issue

Automation

Link to integration documentation on our website

https://www.home-assistant.io/docs/automation/

Diagnostics information

No response

Example YAML snippet

alias: "Pico: Master Bathroom remote"
description: ""
use_blueprint:
  path: stephack/core-pico.yaml
  input:
    pico_remote: a58ddd4ab05559d05de8267f82dd7c49
    top_on:
      - service: light.turn_on
        data:
          brightness_step_pct: 100
        target:
          entity_id: light.light_unknown_master_bathroom_lights_zha_group_0x0006
    bottom_off_release:
      - service: light.turn_off
        data: {}
        target:
          entity_id: light.light_unknown_master_bathroom_lights_zha_group_0x0006
    up_raise:
      - service: light.turn_on
        data:
          brightness_step_pct: 20
        target:
          entity_id:
            - light.light_unknown_master_bathroom_lights_zha_group_0x0006
    down_lower:
      - service: light.turn_on
        data:
          brightness_step_pct: -20
        target:
          entity_id: light.light_unknown_master_bathroom_lights_zha_group_0x0006

Anything in the logs that might be useful for us?

Logger: homeassistant.components.automation.pico_master_bedroom_remote
Source: components/zha/light.py:292
Integration: Automation (documentation, issues)
First occurred: January 21, 2023 at 8:37:18 PM (7 occurrences)
Last logged: 7:21:23 PM

Pico: Master Bathroom remote: Choose at step 1: choice 1: Choose at step 1: choice 1: Error executing script. Unexpected error for call_service at pos 1: Failed to enqueue message after 3 attempts: <EmberStatus.NETWORK_BUSY: 161>
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/helpers/script.py", line 451, in _async_step
    await getattr(self, handler)()
  File "/usr/src/homeassistant/homeassistant/helpers/script.py", line 684, in _async_call_service_step
    await service_task
  File "/usr/src/homeassistant/homeassistant/core.py", line 1755, in async_call
    task.result()
  File "/usr/src/homeassistant/homeassistant/core.py", line 1792, in _execute_service
    await cast(Callable[[ServiceCall], Awaitable[None]], handler.job.target)(
  File "/usr/src/homeassistant/homeassistant/helpers/entity_component.py", line 213, in handle_service
    await service.entity_service_call(
  File "/usr/src/homeassistant/homeassistant/helpers/service.py", line 678, in entity_service_call
    future.result()  # pop exception if have
  File "/usr/src/homeassistant/homeassistant/helpers/entity.py", line 958, in async_request_call
    await coro
  File "/usr/src/homeassistant/homeassistant/helpers/service.py", line 715, in _handle_entity_call
    await result
  File "/usr/src/homeassistant/homeassistant/components/light/__init__.py", line 570, in async_handle_light_on_service
    await light.async_turn_on(**filter_turn_on_params(light, params))
  File "/usr/src/homeassistant/homeassistant/components/zha/light.py", line 978, in async_turn_on
    await super().async_turn_on(**kwargs)
  File "/usr/src/homeassistant/homeassistant/components/zha/light.py", line 292, in async_turn_on
    result = await self._level_channel.move_to_level_with_on_off(
  File "/usr/local/lib/python3.10/site-packages/zigpy/zcl/__init__.py", line 324, in request
    return await self._endpoint.request(
  File "/usr/local/lib/python3.10/site-packages/zigpy/group.py", line 57, in request
    await self.application.send_packet(
  File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 782, in send_packet
    raise zigpy.exceptions.DeliveryError(
zigpy.exceptions.DeliveryError: Failed to enqueue message after 3 attempts: <EmberStatus.NETWORK_BUSY: 161>

Additional information

No response

@home-assistant

Hey there @dmulcahey, @Adminiuga, @puddly, mind taking a look at this issue as it has been labeled with an integration (zha) you are listed as a code owner for? Thanks!


@puddly
Contributor

puddly commented Jan 23, 2023

The error means:

A message cannot be sent because the network is currently overloaded.

Your automation is referencing ZHA group entities. Are you rapidly sending messages to ZHA groups with it? If so, that's the cause of the error message. How fast are you sending them?

@jason1980p
Author

Yes, I use ZHA groups. For example, I have 4 lamps in my living room with a Zigbee bulb in each, and they are all in one ZHA group.
I control the group's lights via an automation using the Lutron Pico remote.
When I press the on button, it sends an on request to the group instead of to each individual light bulb.
The same happens when turning the lights off.

@fakethinkpad85

fakethinkpad85 commented Jan 27, 2023

Having the same issue, also using ZHA groups to control 2-4 lights at once depending on the group. And it's not getting called often: in the example below I call the ZHA groups once (turn off all lights). It's maybe 10 groups with 2-4 lights in each group, but the automation is only calling each light group once.

The automation actually turns off all lights (light service) by area, meaning the area could potentially contain both the ZHA group and the individual lights in that ZHA group. I usually hide all the individual light entities, but they are still in the same area as the group.

image

image

Watch - Away: Choose at step 1: choice 1: Error executing script. Unexpected error for call_service at pos 1: Failed to enqueue message after 3 attempts: <EmberStatus.NETWORK_BUSY: 161>
Watch - Away: Error executing script. Unexpected error for choose at pos 1: Failed to enqueue message after 3 attempts: <EmberStatus.NETWORK_BUSY: 161>
While executing automation automation.watch_away
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/helpers/script.py", line 451, in _async_step
    await getattr(self, handler)()
  File "/usr/src/homeassistant/homeassistant/helpers/script.py", line 684, in _async_call_service_step
    await service_task
  File "/usr/src/homeassistant/homeassistant/core.py", line 1755, in async_call
    task.result()
  File "/usr/src/homeassistant/homeassistant/core.py", line 1792, in _execute_service
    await cast(Callable[[ServiceCall], Awaitable[None]], handler.job.target)(
  File "/usr/src/homeassistant/homeassistant/helpers/entity_component.py", line 213, in handle_service
    await service.entity_service_call(
  File "/usr/src/homeassistant/homeassistant/helpers/service.py", line 678, in entity_service_call
    future.result()  # pop exception if have
  File "/usr/src/homeassistant/homeassistant/helpers/entity.py", line 958, in async_request_call
    await coro
  File "/usr/src/homeassistant/homeassistant/helpers/service.py", line 715, in _handle_entity_call
    await result
  File "/usr/src/homeassistant/homeassistant/components/light/__init__.py", line 581, in async_handle_light_off_service
    await light.async_turn_off(**filter_turn_off_params(light, params))
  File "/usr/src/homeassistant/homeassistant/components/zha/light.py", line 986, in async_turn_off
    await super().async_turn_off(**kwargs)
  File "/usr/src/homeassistant/homeassistant/components/zha/light.py", line 417, in async_turn_off
    result = await self._on_off_channel.off()
  File "/usr/local/lib/python3.10/site-packages/zigpy/zcl/__init__.py", line 324, in request
    return await self._endpoint.request(
  File "/usr/local/lib/python3.10/site-packages/zigpy/group.py", line 57, in request
    await self.application.send_packet(
  File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 782, in send_packet
    raise zigpy.exceptions.DeliveryError(
zigpy.exceptions.DeliveryError: Failed to enqueue message after 3 attempts: <EmberStatus.NETWORK_BUSY: 161>


Another set of errors that might be related,

Logger: homeassistant.components.zha.core.channels.base
Source: components/zha/core/channels/base.py:486
Integration: Zigbee Home Automation (documentation, issues)
First occurred: January 26, 2023 at 16:49:07 (32 occurrences)
Last logged: January 26, 2023 at 16:49:10

[0xF4E3:1:0x0300]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>')]
[0xB7E9:1:0x0300]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>')]
[0xC402:1:0x0006]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>')]
[0x7588:1:0x0006]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>')]
[0xC402:1:0x0008]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>')]

Edit: actually forgot to include the logs..

@SneakieGargamel

SneakieGargamel commented Jan 28, 2023

I had the same issue and found this post. I am now using helpers to create groups instead of ZHA groups and it seems to work a lot better.

@fakethinkpad85

Thanks for the reply @SneakieGargamel I personally might give that a try, have heard from multiple people that this might solve the problem. But still, if it's a generally known issue it should be solved as it does make sense to try and keep the ZHA id requests at a minimum and they obviously created ZHA groups for a reason.

@dmulcahey
Contributor

If you are consistently trying to control several groups at once you should create an additional group with the members of the other groups.

Multicast / broadcast traffic has limits on EZSP. See the broadcast section here: https://community.silabs.com/s/article/guidelines-for-large-dense-networks-with-emberznet-pro?language=en_US

(The most relevant statement is the limit of 8 broadcasts in a 9-second window.)

Keep in mind this means all broadcasts, including ones initiated by the stack itself (not just your group messages).
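
To make the "8 in a 9s window" figure concrete, here is a minimal Python sketch of a sliding-window budget. It is only an illustration: the class name `BroadcastBudget` is made up, and the assumption that the stack counts broadcasts in exactly this sliding-window fashion is mine, not something the linked article or the firmware source confirms.

```python
from collections import deque


class BroadcastBudget:
    """Sliding-window counter: at most `limit` broadcasts per `window_s` seconds.

    Hypothetical model for illustration; the real stack may count differently.
    """

    def __init__(self, limit: int = 8, window_s: float = 9.0) -> None:
        self.limit = limit
        self.window_s = window_s
        self._sent = deque()  # timestamps (seconds) of recent broadcasts

    def try_send(self, now: float) -> bool:
        """Record a broadcast at time `now` if the budget allows it."""
        # Forget broadcasts that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_s:
            self._sent.popleft()
        if len(self._sent) >= self.limit:
            return False  # in this model, the point where NETWORK_BUSY would appear
        self._sent.append(now)
        return True


budget = BroadcastBudget()
# Ten group commands, one every half second.
results = [budget.try_send(now=i * 0.5) for i in range(10)]
```

In this model the first eight commands fit in the budget and the last two are refused until older entries age out. Broadcasts started by the stack itself would eat into the same budget, which is why the usable number for group commands can be lower.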

@fakethinkpad85

@dmulcahey that is kind of interesting. So do I understand correctly that using the individual lights' IDs will only make things worse, as it will probably mean more than 8 requests (~50 in my case)? And shouldn't creating a helper group containing the same IDs cause the same issue?

@dmulcahey
Contributor

Helper groups send individual device messages, so they are no different from just addressing lights individually. You can prove this by enabling debug logs and watching the zigpy / bellows logs. Zigpy groups are meant to cut down Zigbee traffic, and they send a single message to the group ID. Again, this can be seen in the network traffic. What you should not do is attempt to address several Zigbee groups at the same time. This will flood the network. If that is something you do consistently, you should create an additional Zigbee group with the members of the individual groups and use that.

That article / the section I mentioned is specifically talking about broadcast messaging not about messaging individual devices.
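
A rough Python sketch of the combined-group suggestion, counting how many messages the coordinator has to originate in each case. The group names and device identifiers are hypothetical, and this only does the bookkeeping; it does not talk to ZHA or zigpy.

```python
def combine_groups(groups: dict[str, set[str]]) -> set[str]:
    """Return the union of the members of several Zigbee groups."""
    combined: set[str] = set()
    for members in groups.values():
        combined |= members
    return combined


# Hypothetical group names and device identifiers, for illustration only.
room_groups = {
    "kitchen": {"bulb_1", "bulb_2"},
    "hallway": {"bulb_3", "bulb_4", "bulb_5"},
    "living_room": {"bulb_6", "bulb_7", "bulb_8", "bulb_9"},
}

all_lights = combine_groups(room_groups)

# Messages originated by the coordinator for one "everything off" action:
multicasts_per_room_group = len(room_groups)  # one multicast per group addressed
multicasts_combined_group = 1                 # one multicast to the combined group
unicasts_helper_group = len(all_lights)       # one unicast per device
```

With these made-up numbers, addressing every room group costs three broadcasts from the budget, the combined group costs one, and a helper group sends nine unicasts that do not count against the broadcast limit.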

@fakethinkpad85

@dmulcahey Ah I see, appreciate the explanation. Sorry to bother you, but as a bit of a nerd I guess I need to know how some things function. You mentioned that messages to a zigpy group are meant to reduce traffic, but doesn't the controller need to relay that message to each device ID in that group, causing the same amount of end-to-end traffic as sending it to each device to begin with?

Or are the group IDs somehow stored on the end device? Surely they can't be?

@dmulcahey
Contributor

The coordinator sends a multicast message to the group ID, so only one message is sent by the coordinator. Enable debug logs and you can follow along.

@dmulcahey
Contributor

Also, you can read the group cluster stuff in here if you want: https://zigbeealliance.org/wp-content/uploads/2019/12/07-5123-06-zigbee-cluster-library-specification.pdf#page126

@MattWestb
Contributor

The limit of 8 broadcasts per 9 seconds is not an EZSP limit. It is a limit in the underlying 802.15.4 network, and it should be the same for every Zigbee 3 coordinator that forms a Zigbee 3 network.
A TI CC2531 with HA 1.x firmware is a different matter.
Some TI coordinator Zigbee 3 firmware has been patched with a very high broadcast limit, but that is useless because all Zigbee-certified routers silently drop the 9th packet in a 9-second window.
The limit exists to block broadcast storms that can choke the network, and it is the same for other protocols, such as Thread, that use 802.15.4 underneath their own stack.

@TheAlphaLaw

I have the same issues with ZHA groups, with SkyConnect. Deconz with the Conbee II works 100% better, regretting the migration of 100+ devices without testing.

@smartqasa

After trying virtually everything, I finally gave up on ZHA and migrated to Zigbee2MQTT (Z2M). Migrating is painful and Z2M is not perfect but it performs much better than ZHA in environments that include 40+ lighting devices such as mine.

@atr00

atr00 commented Mar 8, 2023

Same issue with ZHA and SkyConnect... my group of 2 lights triggers this problem, but neither a helper group nor the individual lights will provoke it.
Z2M groups (using a Sonoff Dongle-E, so also EZSP) work flawlessly as well.

@dmulcahey
Contributor

@TheAlphaLaw @atr00 can you enable debug mode, run the actions that cause the issue and then attach the full logs please?

@atr00

atr00 commented Mar 8, 2023

@dmulcahey
There you go: zha_logs.txt
There are both successful and failed actions in the log.

Note that, even when the action is successful, it is very slow (it takes 2 to 3 seconds for the lights to change their color, whereas individually or in a helper group it is almost instantaneous).

@dmulcahey
Contributor

Thanks. This is really helpful.

@atr00

atr00 commented Mar 8, 2023

Happy to help! If you need more tests, just ask me.

@dmulcahey
Contributor

dmulcahey commented Mar 8, 2023

With debug mode on you should also get periodic counter dumps in the logs. Can you attach that too?

Your log cuts off right when the command to read them is sent.

@atr00

atr00 commented Mar 8, 2023

I am not sure what they are... let me know if what you need is in here:
home-assistant_2023-03-08T12-53-49.370Z.log

@puddly
Contributor

puddly commented Mar 8, 2023

@atr00 I suspect you have an automation that needs debouncing, as you're sending a lot of group requests at once:

# To turn on a single light, you'd need to send at most one `on()` command, and maybe one `move_to_color()`
2023-03-08 20:52:47.617 DEBUG (MainThread) [zigpy.zcl] [Bedroom Desk Bulbs:None:0x0006] Sending request: on()
2023-03-08 20:52:48.460 DEBUG (MainThread) [zigpy.zcl] [Bedroom Desk Bulbs:None:0x0006] Sending request: on()
2023-03-08 20:52:48.598 DEBUG (MainThread) [zigpy.zcl] [Bedroom Desk Bulbs:None:0x0300] Sending request: move_to_color(color_x=8912, color_y=2621, transition_time=0, options_mask=None, options_override=None)
2023-03-08 20:52:49.176 DEBUG (MainThread) [zigpy.zcl] [Bedroom Desk Bulbs:None:0x0006] Sending request: on()
2023-03-08 20:52:49.433 DEBUG (MainThread) [zigpy.zcl] [Bedroom Desk Bulbs:None:0x0300] Sending request: move_to_color(color_x=11272, color_y=48954, transition_time=0, options_mask=None, options_override=None)
2023-03-08 20:52:50.017 DEBUG (MainThread) [zigpy.zcl] [Bedroom Desk Bulbs:None:0x0006] Sending request: on()
2023-03-08 20:52:50.194 DEBUG (MainThread) [zigpy.zcl] [Bedroom Desk Bulbs:None:0x0300] Sending request: move_to_color(color_x=29097, color_y=33881, transition_time=0, options_mask=None, options_override=None)
2023-03-08 20:52:50.948 DEBUG (MainThread) [zigpy.zcl] [Bedroom Desk Bulbs:None:0x0300] Sending request: move_to_color(color_x=16318, color_y=6029, transition_time=0, options_mask=None, options_override=None)
2023-03-08 20:52:51.623 DEBUG (MainThread) [zigpy.zcl] [Bedroom Desk Bulbs:None:0x0006] Sending request: on()

# `NETWORK_BUSY` immediately after sending the 10th request in 5 seconds
2023-03-08 20:52:52.559 DEBUG (MainThread) [zigpy.zcl] [Bedroom Desk Bulbs:None:0x0300] Sending request: move_to_color(color_x=25230, color_y=10157, transition_time=0, options_mask=None, options_override=None)
2023-03-08 20:52:52.574 DEBUG (MainThread) [bellows.ezsp.protocol] Application frame received sendMulticast: [<EmberStatus.NETWORK_BUSY: 161>, 118]

2023-03-08 20:52:53.023 DEBUG (MainThread) [zigpy.zcl] [Bedroom Desk Bulbs:None:0x0006] Sending request: on()
2023-03-08 20:52:53.034 DEBUG (MainThread) [bellows.ezsp.protocol] Application frame received sendMulticast: [<EmberStatus.NETWORK_BUSY: 161>, 125]

2023-03-08 20:52:55.077 DEBUG (MainThread) [zigpy.zcl] [Bedroom Desk Bulbs:None:0x0006] Sending request: on()
2023-03-08 20:52:53.088 DEBUG (MainThread) [bellows.ezsp.protocol] Application frame received sendMulticast: [<EmberStatus.NETWORK_BUSY: 161>, 126]

...

2023-03-08 20:53:04.598 DEBUG (MainThread) [zigpy.zcl] [Bedroom Desk Bulbs:None:0x0006] Sending request: on()
2023-03-08 20:53:05.604 DEBUG (MainThread) [zigpy.zcl] [Bedroom Desk Bulbs:None:0x0300] Sending request: move_to_color(color_x=11272, color_y=48954, transition_time=0, options_mask=None, options_override=None)
2023-03-08 20:53:06.809 DEBUG (MainThread) [zigpy.zcl] [Bedroom Desk Bulbs:None:0x0006] Sending request: on()
2023-03-08 20:53:07.841 DEBUG (MainThread) [zigpy.zcl] [Bedroom Desk Bulbs:None:0x0300] Sending request: move_to_color(color_x=8912, color_y=2621, transition_time=0, options_mask=None, options_override=None)
2023-03-08 20:53:07.884 DEBUG (MainThread) [zigpy.zcl] [Bedroom Desk Bulbs:None:0x0006] Sending request: on()
2023-03-08 20:53:08.664 DEBUG (MainThread) [zigpy.zcl] [Bedroom Desk Bulbs:None:0x0006] Sending request: on()
2023-03-08 20:53:08.836 DEBUG (MainThread) [zigpy.zcl] [Bedroom Desk Bulbs:None:0x0300] Sending request: move_to_color(color_x=16318, color_y=6029, transition_time=0, options_mask=None, options_override=None)
2023-03-08 20:53:09.631 DEBUG (MainThread) [zigpy.zcl] [Bedroom Desk Bulbs:None:0x0300] Sending request: move_to_color(color_x=25230, color_y=10157, transition_time=0, options_mask=None, options_override=None)
2023-03-08 20:53:09.968 DEBUG (MainThread) [zigpy.zcl] [Bedroom Desk Bulbs:None:0x0006] Sending request: on()
2023-03-08 20:53:10.904 DEBUG (MainThread) [zigpy.zcl] [Bedroom Desk Bulbs:None:0x0300] Sending request: move_to_color(color_x=29097, color_y=33881, transition_time=0, options_mask=None, options_override=None)
2023-03-08 20:53:11.292 DEBUG (MainThread) [zigpy.zcl] [Bedroom Desk Bulbs:None:0x0006] Sending request: on()
2023-03-08 20:53:11.963 DEBUG (MainThread) [zigpy.zcl] [Bedroom Desk Bulbs:None:0x0006] Sending request: on()
2023-03-08 20:53:13.916 DEBUG (MainThread) [zigpy.zcl] [Bedroom Desk Bulbs:None:0x0006] Sending request: on()

Group requests are network-wide broadcasts that are bounced back and forth between all devices to make sure every possible group member has a chance to "hear" the message. If you send this many at once, the network will become too busy, as the error indicates. The limit on the number of group requests/broadcasts is hard-coded in the firmware and can't be raised.
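
If you want to check a burst like this in your own debug log, a small Python sketch along these lines can count how many group requests land in any one window. The timestamps are the first ten requests copied from the excerpt above; the function name `max_requests_in_window` and the 9-second window are illustrative choices, not something the firmware exposes.

```python
from datetime import datetime

# Timestamps of the first ten group requests in the log excerpt above.
LOG_TIMES = [
    "2023-03-08 20:52:47.617",
    "2023-03-08 20:52:48.460",
    "2023-03-08 20:52:48.598",
    "2023-03-08 20:52:49.176",
    "2023-03-08 20:52:49.433",
    "2023-03-08 20:52:50.017",
    "2023-03-08 20:52:50.194",
    "2023-03-08 20:52:50.948",
    "2023-03-08 20:52:51.623",
    "2023-03-08 20:52:52.559",
]


def max_requests_in_window(stamps, window_s):
    """Largest number of requests that fall inside any span of `window_s` seconds."""
    times = sorted(datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f") for s in stamps)
    best = 0
    start = 0
    for end, current in enumerate(times):
        # Move the left edge forward until the span fits inside the window.
        while (current - times[start]).total_seconds() > window_s:
            start += 1
        best = max(best, end - start + 1)
    return best


burst = max_requests_in_window(LOG_TIMES, window_s=9.0)
```

All ten of these requests fall inside a single 9-second window, which matches the point in the log where `NETWORK_BUSY` first shows up.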

You may find it better to switch to a normal light group where each bulb is individually contacted concurrently, as this won't have the same limitation.
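
As for the debouncing mentioned above, the idea can be sketched in plain Python with `asyncio`. This is a generic illustration, not how Home Assistant automations or ZHA implement it: the `Debouncer` class and the `send_color` stand-in are hypothetical, and the delays are arbitrary.

```python
import asyncio


class Debouncer:
    """Run only the most recent call once `delay_s` passes without a newer one."""

    def __init__(self, delay_s: float, action) -> None:
        self.delay_s = delay_s
        self.action = action  # async callable that sends the real command
        self._pending = None

    def call(self, *args) -> None:
        # A newer request replaces whatever was still waiting.
        if self._pending is not None:
            self._pending.cancel()
        self._pending = asyncio.create_task(self._fire(args))

    async def _fire(self, args) -> None:
        await asyncio.sleep(self.delay_s)
        await self.action(*args)


async def demo() -> list:
    sent = []

    async def send_color(x, y):  # hypothetical stand-in for a group command
        sent.append((x, y))

    debounced = Debouncer(0.05, send_color)
    # Simulate dragging a color picker: several updates in quick succession.
    for xy in [(8912, 2621), (11272, 48954), (29097, 33881), (16318, 6029)]:
        debounced.call(*xy)
        await asyncio.sleep(0.005)
    await asyncio.sleep(0.2)  # give the last pending call time to fire
    return sent


sent_commands = asyncio.run(demo())
```

Only the final color is sent, so a drag across the picker costs one group request instead of one per intermediate value.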

@MattWestb
Contributor

The underlying 802.15.4 network layer has broadcast storm protection, so all routers handle only 8 broadcasts in 9 seconds and ignore any beyond that.
Zigbee groups can be used for individual lights, but they are best used with medium and large groups of lights, and they cause fewer problems if there are routing problems in the network.

@atr00

atr00 commented Mar 9, 2023

@puddly I'm actually not even using any automation here. I'm going to the coordinator device, where I can see my 2-bulb Zigbee light group, and I change the color from there. I'm not sure the issue is only related to automations. And when I don't purposely jam the network by sending consecutive fast requests, the response is very slow (it takes about 3 seconds between the click and the actual color change). When I change individual lights or a helper group I am not facing any issue. When I use Z2M with a dongle that has the same chip (EFR32), it works flawlessly, whether on individual lights or on a Z2M Zigbee group.

@MattWestb if you see my answer here, this is not happening with Z2M, it's specific with ZHA and SkyConnect. My group has 2 bulbs in it. There's slowness as well on a single broadcast command.

@TheJulianJES
Member

TheJulianJES commented Mar 9, 2023

the response is very slow (it takes about 3 seconds between the click and the actual color change)

So this is only when using ZHA groups, right?
I also remember a "lag" when changing colors on ZHA groups (using the UI picker) and EZSP coordinators in my testing.
There are two packets sent: OnOff -> on and Color -> set_xy_color even when just changing a light color that is already on (this is somewhat because ZHA can't be 100% sure that the light is actually on and the service is called light.turn_on, so there's always either some on message or move_to_level_with_on_off being sent with "just a color change". This also follows the behavior of other integrations).
However, TI coordinators do not have this lag. They basically send both packets at almost the exact same time.
EZSP coordinators (when used with bellows) have a delay of about a second or so for me (until the color change shows up).

There were some changes to this with zigpy/bellows#402, but there's still a difference in regards to how fast ZHA light groups react with TI coordinators (zigpy-znp) vs with EZSP coordinators (bellows).

@th3cube

th3cube commented Jun 18, 2024

I can see big improvements in calling zha zigbee groups. For example, if i would move the color picker over a group it would give me network_busy almost immediately. With this FW this behavior has improved drastically. Also, my HUE remotes would "freeze" when I did quadruple presses to trigger automations. That improved as well.

@evelant

evelant commented Jun 18, 2024

@puddly I have noticed that some of my bulbs are not responding at random to group commands. The strange thing is some of the bulbs in a group will turn on but not all of them. Usually like 5 of 8 turn on, 3 stay off. It seems random which bulbs it is. Not sure if this is an issue with the new firmware or something else.

@jclendineng

The most I have in a group is 6 but no issues here, do the bulbs control if you do them individually? Double checked the zigbee group to make sure nothing dropped? Sometimes the group will drop bulbs if the bulb goes offline.

@puddly
Contributor

puddly commented Jun 18, 2024

I have noticed that some of my bulbs are not responding at random to group commands.

Once a group command is sent out it's sort of at the mercy of the network. It'll be relayed and rebroadcast by most devices on your network, the coordinator is no longer involved beyond sending it out.

I would make sure your network channel is free of noise.

@jtbandes

Thanks for these updated builds — I just gave it a try and had good results! My setup is:

  • ZHA + Sonoff Zigbee 3.0 USB Dongle Plus-E
  • About 32 Philips Hue bulbs and 11 Inovelli switches in the house. While most or all of these are actually located within range of the zigbee radio, since they are all mains-powered I suspect this can result in a lot of unnecessary broadcast traffic (but haven't actually measured this in any way). My Zigbee network looks like a total mess: image

In the past I had repeated NETWORK_BUSY errors when using the Zigbee groups in Home Assistant for toggling lights and changing colors and such, so I had kinda stopped using them from Home Assistant (I only use them for binding the Inovelli switches directly to bulbs). With the 7.4.2.0 build it looks to be a lot more reliable, at least in my initial testing!

I recorded a before/after video to demonstrate (note that all the bulbs used in this demo are located <10 feet from the zigbee radio):

Before: zigbee-before.mp4
After: zigbee-after.mp4

@johnlento

@puddly First of all, this firmware is pretty awesome, thank you! I was running a ZBDongle-P, but no matter what firmware I used or what settings I set, I would get Watchdog failures frequently, sometimes every 10 minutes, and they would kill the stick and cause it to reinitialize. I elected to move to the ZBDongle-E just to give the other chipset a chance. When I migrated, I couldn't do anything due to Network Busy, so this firmware is the only thing that even remotely allows my coordinator to start and work.

The annoying thing which @evelant pointed out is that groups no longer behave like ZHA groups. With the ZBDongle-P a group command would broadcast and the group would all honor it at the same time, even if it took 60 seconds to propagate the network and HA would not mark the group on until it was. In the ZBDongle-E with this firmware a broadcast is sent, the switch goes to on then to off immediately and generally a few members of the group will turn on while the rest will later or not at all. Multiple off/on commands are required to get the full group of lights to turn on or off and HA reporting always lags. On the other hand the coordinator doesn't crash every 10 minutes so I think its a net win.

My question is whether there are any settings I can do to make that more ZBDongle-P like? Subjectively it does feel like individual devices respond hella fast but ZHA groups are all over the place so I suspect there is something with spamming group broadcasts.

I have 20 ZHA groups, varying from 2-9 end devices in each. They are essentially light zones that I then apply adaptive lighting to. So my groups are getting spammed hard and heavy with broadcasts. Most of my routers are Inovelli Blues, which did and may still have issues with Zigbee group broadcasts.

Has anyone solved this issue with this firmware and setup? I am contemplating trying to do light groups but its a lot of work... Also happy to take some PCAP's if I can get some quick guidance as its been a hot minute since I did traffic analysis.

@evelant

evelant commented Jul 2, 2024

@johnlento That's a very good description of the problem. Exactly what I'm seeing as well.

@puddly I'm not sure how this group problem is happening but I am certain that it was introduced with the new build you provided. This never happened on the previous build. Groups always responded in unison or not at all due to network_busy error. Looking at the config changes you made I have no idea how they could have caused such behavior. Maybe a bug introduced upstream by silabs in the newer sdk? Any thoughts on other possible causes since it seems pretty certain that this was introduced with the new build?

@evelant

evelant commented Jul 4, 2024

Another possible clue about the group issue @puddly -- it only happens when the command is coming from the coordinator. If I address a group via a binding on an inovelli blue switch all members always respond. Only group commands from the coordinator seem to trigger this partial group response behavior.

@puddly
Contributor

puddly commented Jul 4, 2024

If you have a second Silicon Labs coordinator (e.g. a HUSBZB-1 or another SkyConnect), you can use it as a packet capture tool. If you are indeed seeing a difference and can reliably replicate group commands working worse with the tweaked firmware, perform a packet capture on your Zigbee channel for about five minutes and include the group command in there, one with the old firmware and one with the new:

pip install bellows
bellows --baudrate 115200 -d /dev/serial/by-id/your-other-zigbee-stick dump -c 20 -w capture.pcap

Change 20 to your ZigBee network's channel and make sure to include your network key. Both can be found in the ZHA configuration page and in the backup JSON. I can try to take a look at the difference.

@TheJulianJES
Member

@evelant EMBER_BROADCAST_TABLE_SIZE is likely set to 15 on all your (EZSP) router devices. This value cannot be changed.
If there are too many broadcasts in a short amount of time, your routers will not rebroadcast them, basically "voiding" those broadcasts.

However, Z-Stack firmware was modified a long time ago to lift the broadcast limit, like done with this EZSP firmware now.
I'm running this without any issues and most Z-Stack users (unknowingly) use an even higher limit, I think.
I don't have any negative impacts, only improvements.

I'd set up a network sniffer to see how much broadcast traffic there is on your network. Using adaptive lighting or constantly changing colors of Zigbee group lights just doesn't work and will cause issues.
The underlying behavior between Z-Stack and EmberZNet seems to be different, but I doubt there's much we can do about this.

One thing you can try is to position your coordinator more centrally (relative to your group lights).
I'm not sure this is actually the case, but the routers might be able to hear/honor broadcasts coming from the coordinator (in a better way), even if their "broadcast slots" are already/mostly filled up.
Also, make sure the coordinator is on an extension cable, away from interference like 2.4 GHz WiFi APs, USB 3 SSDs, ...

@evelant

evelant commented Jul 4, 2024

@puddly Unfortunately I just gave my extra silabs coordinator to a friend. I still have a ti zstack dev board somewhere, maybe I can capture with that.

@TheJulianJES I'm pretty certain this group issue arose with the new firmware build and not due to any configuration in my network since it's the only thing that changed. Before the update groups always responded in unison. After the update random group members don't respond. Nothing else changed -- same channel, same devices, same location, same extension cable, same coordinator hardware, same automations. I know it's puzzling since from my understanding of zigbee groups they should either all respond or none respond. I don't know how the new build could possibly be causing this but as far as I can tell all signs point to it being something with the new build and not with my setup/configuration.

@TheJulianJES
Member

TheJulianJES commented Jul 4, 2024

I'm pretty certain this group issue arose with the new firmware build and not due to any configuration in my network since it's the only thing that changed.

It's a combination of both. Increasing EMBER_BROADCAST_TABLE_SIZE only on the coordinator (which is what the new build does) can have an impact on timing and how many broadcasts can be sent. Your whole network configuration seems to have an issue with the increased broadcast table size, but mine does not.

Your routers need to relay the broadcasts, but are seemingly "overwhelmed" by the increased amount of broadcasts that your coordinator can send now (or the tighter timing), because of the EMBER_BROADCAST_TABLE_SIZE change.

@evelant

evelant commented Jul 4, 2024

Makes sense, thanks. I'm not sure how I could be sending an excessive amount of broadcasts however. I don't have any sort of automation that continually issues commands or otherwise seems like it could flood the network. The most chatty automation is adaptive lighting which only updates every 90 seconds and only if the lights are already on. Most of the time they're off because they're turned on when radar presence sensors in a room detect occupancy. That's why this issue is particularly annoying -- people keep walking into rooms and having only 5 out of 10 bulbs turning on. I'll have to see if I can get a packet capture for more debug information.

@dmulcahey
Contributor

The packet capture should show us exactly what’s going on.

Are the radar devices Tuya devices? I’m wondering if the entire network is flooded and it’s not just broadcast issues. I understand the issue wasn’t like this before the new firmware but maybe it’s the straw that broke the camel’s back so to say… that, and we have seen many instances of Tuya devices either completely spamming a network to death or introducing lots of routing issues.

It could be possible that we are overloading some of the routers causing them to crash as well.

@MattWestb
Contributor

As said many times before, playing with the broadcast table in the coordinator only causes problems, and Silabs has locked it in the stack, so it cannot be changed in the SS GUI.
One interesting old code snippet:
https://github.com/yqyunjie/Zigbee-Project/blob/2bb294718ac2652fc98c5de08fb4bbd417680e1a/firmware/EmberZNet/EM35x-EZSP/stack/config/ember-configuration-defaults.h#L394-L401

Also, if controller devices (certified ones, not Tuya or Aqara) are working OK in the network, then the network is handling broadcasts fine.
If you tweak the coordinator and then get problems, the network is blocking a broadcast storm, as it should, to avoid killing itself.
Spamming broadcasts also causes problems with unicasts, because route discovery does not work if some routers are blocked => complete network breakdown.

If sniffing, look for address 0xffff or another in the 0xfffX range. Those are broadcasts to different types of network devices, and only 9 in 8 seconds can be handled; the rest are silently ignored. Also test with and without source routing and see how it looks.
Some discovery broadcasts are only one hop, so they do not spam much, but they do not work if the network is blocking all broadcasts, and the routers then lose their knowledge of the topology.
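
If you export timestamps and NWK destination addresses from a capture, a rough way to check the broadcast rate mentioned above is a sliding 8-second window count. This is a minimal sketch: the `frames` list format, the function names, and the sample data are hypothetical, and the window length is just the "8 seconds" figure from this comment. The address meanings are the standard Zigbee NWK broadcast addresses.

```python
# Standard Zigbee NWK broadcast addresses (0xFFF8-0xFFFF is the broadcast range;
# the values not listed here are reserved).
BROADCAST_ADDRESSES = {
    0xFFFF: "all devices",
    0xFFFD: "devices with receiver on when idle",
    0xFFFC: "routers and coordinator",
    0xFFFB: "low-power routers",
}


def is_broadcast(nwk_dst):
    """True if the 16-bit NWK destination is in the broadcast range."""
    return 0xFFF8 <= nwk_dst <= 0xFFFF


def max_broadcasts_in_window(frames, window_s=8.0):
    """Largest number of broadcasts seen in any window shorter than window_s.

    frames: iterable of (timestamp_seconds, nwk_destination) tuples,
    a hypothetical export format, e.g. built from a sniffer capture.
    """
    times = sorted(t for t, dst in frames if is_broadcast(dst))
    best = 0
    start = 0
    for end, t in enumerate(times):
        # Drop frames that are window_s or more older than the current one.
        while t - times[start] >= window_s:
            start += 1
        best = max(best, end - start + 1)
    return best


# Made-up sample data: four broadcasts and one unicast (0x1234).
sample = [(0.0, 0xFFFF), (1.0, 0xFFFD), (2.0, 0x1234), (7.9, 0xFFFC), (20.0, 0xFFFF)]
peak = max_broadcasts_in_window(sample)
```

If the peak on a real capture is regularly near or above the limit quoted above, that would support the broadcast storm explanation; a low peak would point somewhere else.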

@evelant

evelant commented Jul 4, 2024

I don't have any tuya or other uncertified devices. Routers in my network are Inovelli Blue series switches, Sonoff SNZB-06P radars, Innr AE-270T bulbs, and a few ikea bulbs, all of which should be well behaved.

@goncalossilva

I have also noticed a massive improvement with the new firmware. I have 81 zigbee devices overall, and one particular automation (closing all blinds at sunset) always failed. I had been struggling with it for months, and none of the typical troubleshooting helped. Some blinds would randomly not close, every single time, and now they all do. I've only noticed a hiccup or two in the past couple of weeks, so a major improvement.

Interestingly, HA still reports the automation as having failed due to EmberStatus.DELIVERY_FAILED: 102, and some blinds appear as “last seen XX hours ago” where XX is a large number (say, 12 hours). But it does work in practice.

Interestingly per the discussion above, my blinds switches are tuya. I wonder if this is related to the flakiness I've experienced. Could it make sense to force them to act as end devices? I certainly have enough routers. Is that possible?

@evelant

evelant commented Aug 26, 2024

I just ordered a skyconnect so I can use it as a sniffer. I'm wondering if NabuCasa/silabs-firmware-builder#57 might help with the group command issues?

@evelant

evelant commented Aug 29, 2024

@puddly I switched my network to a zbdongle-p and zigbee2mqtt and the "not all lights in group turn on" problem is still happening. That rules out zha and ember controllers as sources of the problem. I must have something else in my hass install messing up the command or a device on my network causing problems.

@stp-ip

stp-ip commented Sep 3, 2024

For what it's worth.

I ran with a Conbee II on a Raspi 3B and around 230 devices. A few delays here and there, but overall worked.
Migrated to a Home Assistant Yellow and the migration failed, so I re-added all devices.
This failed, so I started a new network from scratch instead. Couldn't reliably add new devices. Initializations kept coming up for devices multiple times, sometimes even looping through leaving and joining the network.
Sensors and even routers kept dropping off and almost no actual actions worked reliably. I got 3 new Sonoff Dongle E devices as I thought it might be a router issue. Lo and behold, all Extenders (3 Dongle E and 4 Aeotec Range Extender Zi) have RSSI of under 60. This is a good improvement from before, where it would periodically get worse reception etc.

Still the network was not even barely functioning. Tried forcing a reorg by leaving HA off for a few hours, but that didn't improve anything.

End result is I plugged in the old Conbee II, migrated without issues and everything is working. Thanks to the new Dongle Es no request delays anymore and everything feels a lot snappier (could also be the move to the CM4 instead of the Raspi 3B). Either way. Nothing in the network changed. Same spot, same network settings, same channel, same extenders, same devices. Only difference is moving to the Conbee II. It's a difference of night and day in a matter of a few minutes.
Due to devices falling off again and again with the on board chip the Conbee II is handling a lot more devices than the chip ever did.

Not sure logs help much. But I got a small one and a 650MB one, which I can't upload, but happy to provide, if helpful.
home-assistant_zha_2024-09-02T19-49-50.444Z.log

@evelant

evelant commented Sep 3, 2024

I think my problem must be these Innr AE-270T bulbs. I guess they don't respond properly to group commands since it only seems to be happening to those bulbs and it happens with two entirely different coordinators and software stacks. @johnlento any chance you're also having the group issues with Innr bulbs? Or do you get it with different devices?

@johnlento

johnlento commented Sep 22, 2024 via email

@puddly
Contributor

puddly commented Sep 22, 2024

Sometimes I do get upwards of 20k messages in the queue and have to powercycle everything so that it can start responding again.

That's very unusual, you should not have 20k enqueued packets.

Can you post a ZHA debug log?

@johnlento

johnlento commented Sep 22, 2024 via email

@puddly
Contributor

puddly commented Sep 22, 2024

Open a separate issue once you do. Thanks!

@evelant

evelant commented Sep 23, 2024

@johnlento I have a lot of sengled bulbs as well and have run into similar issues with them. A couple of things that might be helpful:

  1. IIRC sengled bulbs by default set up reporting power usage to the coordinator at a very high rate. This can clog the network easily. I'm not sure if there's a way to turn it off in ZHA but at least with z2m turning off reporting improved things a lot.
  2. Sengled bulbs do not act as routers. Make sure you've got strong router devices to support them. (I use inovelli switches)
  3. Sengled bulbs seem to have wonky firmware and unfortunately AFAIK nobody has had success even contacting sengled to get them to fix firmware issues. IIRC a lot of mine would try to connect directly to the coordinator even if the signal was terrible rather than use a nearby strong router. Also IIRC this behavior led to problems with the max directly connected children setting in the coordinator firmware.

Probably not helpful to your situation but I have completely resolved all of my zigbee issues and have a fast, stable network. What it took was a firmware update for my Innr AE270-T bulbs. After a lot of back and forth with Innr they released a new firmware and after updating my ~40 Innr bulbs my network problems disappeared. This IMO shows that zigbee problems can be the fault of manufacturer firmware totally out of our control. I had similar issues to yours when I had primarily sengled bulbs. Now I only use a few sengled bulbs and mostly Innr and have no problems after their new firmware. I suspect your problems may stem from bad Sengled firmware -- maybe if you're persistent (and lucky) you could prod them into releasing an update to fix them?

@johnlento

@puddly So I think my network is in that state again: max concurrent requests reached and the queue slowly climbing. It was at 524 queued and is now at 3,000+. It occurred sometime after 10 sengled bulbs went offline for some reason. I can open a separate issue, but the debug log is like 2GB. Is there something I should trim out and submit? Looking for guidance on how to submit the issue.
