Skip to content

Add logging to help diagnose missing to-device messages#19801

Open
richvdh wants to merge 12 commits into
developfrom
rav/19795_logging
Open

Add logging to help diagnose missing to-device messages#19801
richvdh wants to merge 12 commits into
developfrom
rav/19795_logging

Conversation

@richvdh
Copy link
Copy Markdown
Member

@richvdh richvdh commented May 22, 2026

Some attempts to debug #19795.

@richvdh richvdh requested a review from a team as a code owner May 22, 2026 10:59
@richvdh richvdh force-pushed the rav/19795_logging branch from fd2b20c to d55d564 Compare May 22, 2026 11:28
@MadLittleMods MadLittleMods added A-to-device-messages EDU messages sent exactly once to a specific set of devices. Related to E2EE A-Workers A-Logging labels May 22, 2026
Comment thread synapse/replication/tcp/resource.py Outdated
Comment thread synapse/storage/databases/main/deviceinbox.py Outdated
Comment thread synapse/replication/tcp/resource.py Outdated
Comment thread synapse/storage/databases/main/deviceinbox.py
Comment thread synapse/storage/util/id_generators.py Outdated
Comment thread synapse/storage/util/id_generators.py Outdated
Comment thread changelog.d/19801.misc Outdated
richvdh and others added 4 commits May 28, 2026 10:54
Co-authored-by: Eric Eastwood <madlittlemods@gmail.com>
Both branches of our `if` do a db query for all the user's devices, though the
spell it slightly differently. Let's pull it out, as it makes the code shorter
and clearer.
@richvdh richvdh force-pushed the rav/19795_logging branch from fbe061e to e922532 Compare May 28, 2026 10:02
@richvdh richvdh requested a review from MadLittleMods May 28, 2026 10:08
@richvdh
Copy link
Copy Markdown
Member Author

richvdh commented May 28, 2026

@MadLittleMods thanks for the review. I've made the changes you requested, but in looking at it I realised I wanted to add another couple of commits. Would you mind taking another look?

Comment thread synapse/storage/databases/main/deviceinbox.py Outdated
msg["content"].get(EventContentFields.TO_DEVICE_MSGID),
)
message_json = json_encoder.encode(msg)
msg = messages_by_device[device_id]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell, in this else-branch, we're working with to-device messages going to specific devices defined by devices.

messages_by_device[device_id] can probably be None (if we're going to loop over known_devices here) and we should do a safer lookup.

And to confirm this suspicion, it looks like that's what the Python tests are failing on as well:

https://github.com/element-hq/synapse/actions/runs/26568079004/job/78268176493?pr=19801

  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/site-packages/synapse/storage/databases/main/deviceinbox.py", line 945, in _add_messages_to_local_device_inbox_txn
    msg = messages_by_device[device_id]
builtins.KeyError: 'XYDQYXDPTF'

tests.rest.client.test_devices.DehydratedDeviceTestCase.test_dehydrate_msc3814

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ugh, right; I now understand why we had two separate queries before: I'd failed to realise that, in the non-wildcard case, we were only looking up device_ids for which we had a to-device message.

I'm going to revert this commit.

messages_json_for_user = {}
devices = list(messages_by_device.keys())
if not devices:
continue
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Log about dropping to-device messages?

How does this case even happen? A to-device message should be expected to go somewhere. Should this be an assert?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just moving this logic from below.

If devices is empty, then it means there are no to-device messages for this user (we just have an empty dict for them). Looks like one reason it can happen if somebody hits /sendToDevice with a request body like:

{
  "messages": {
    "@alice:example.com": {}
  }
}

I don't think it's worth logging, and it certainly shouldn't be an assert (at least, unless we change the higher-level methods.

I'll add a quick comment.

# A map from user id, to device id, to a pair of (serialized message, msgid).
local_by_user_then_device: dict[str, dict[str, tuple[str, str]]] = {}
for user_id, messages_by_device in messages_by_user_then_device.items():
messages_json_for_user = {}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment and type to document what this map holds

if issue9533_logger.isEnabledFor(logging.DEBUG):
# Log any messags we are dropping
unmapped_devices = (
messages_by_device.keys() - messages_json_for_user.keys()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this logic will mess up for * and will say we're always dropping that one when we're actually handling it.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't hit this branch for *.

richvdh and others added 4 commits May 29, 2026 11:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-Logging A-to-device-messages EDU messages sent exactly once to a specific set of devices. Related to E2EE A-Workers

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants