orphaned m.space.child state remains after room cleanup #19699

@ekimminau

Description

This draft is sanitized for public submission. Domain names and environment-specific identifiers have been redacted.

We are running a self-hosted Synapse deployment and create and remove a large number of child rooms under parent spaces.

This was first discovered while using etkecc/ketesa and inspecting a parent space in the hierarchy view. The UI showed thousands of stale Suggested entries under the parent space, including entries referring to rooms and test-created users that had already been deleted during repeated testing cycles.

We observed that once many m.space.child links have been created, later room shutdown, unlink, or purge operations do not reliably remove the corresponding state rows. Over time, this leaves thousands of lingering parent-to-child entries in current_state_events and event_json, even after repeated cleanup attempts and even after new room-link creation was changed to stop using suggested links.

At minimum, it would be helpful if this behavior were documented with a very clear warning. Ideally, suggested space-child behavior should default to off, or Synapse should provide a safer built-in cleanup or garbage-collection path for stale m.space.child state.

Steps to reproduce

  1. Run a self-hosted Synapse 1.151.0 deployment with PostgreSQL.
  2. Create one or more parent spaces and many child rooms via the Matrix Client-Server API.
  3. Add parent-child links using m.space.child state events. In our earlier flow, some links used content similar to:

```json
{
  "via": ["redacted.example.invalid"],
  "order": "some-ordering-value",
  "suggested": true
}
```

  4. Later remove, unlink, shut down, or purge many of those child rooms using normal admin cleanup flows.
  5. Also try explicitly overwriting the same m.space.child state with empty content for the same state_key, and then re-run cleanup.
  6. Query the Synapse database tables tracking current room state and event JSON.
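For clarity, step 3 above amounts to a PUT of an m.space.child state event via the Client-Server API. The sketch below builds that request; the homeserver URL, access token, and room IDs are placeholders, not values from this deployment.

```python
# Sketch of step 3: building the PUT request that links a child room to a
# parent space. HOMESERVER, TOKEN, and the room IDs below are placeholders.
import json
from urllib.parse import quote

HOMESERVER = "https://redacted.example.invalid"  # placeholder
TOKEN = "redacted-access-token"                  # placeholder

def space_child_request(parent_room_id: str, child_room_id: str,
                        via: list, suggested: bool = False):
    """Build the Client-Server API request for
    PUT /_matrix/client/v3/rooms/{parent}/state/m.space.child/{child}."""
    path = (f"{HOMESERVER}/_matrix/client/v3/rooms/"
            f"{quote(parent_room_id)}/state/m.space.child/"
            f"{quote(child_room_id)}")
    body = {"via": via, "suggested": suggested}
    headers = {"Authorization": f"Bearer {TOKEN}"}
    return "PUT", path, headers, json.dumps(body)

method, url, headers, body = space_child_request(
    "!parent:redacted.example.invalid",
    "!child:redacted.example.invalid",
    via=["redacted.example.invalid"],
    suggested=True,
)
```

Each such PUT produces a new state event; it is the current-state rows for these events that later fail to disappear after room shutdown or purge.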

Actual result:
Thousands of m.space.child entries can remain after cleanup.

Expected result:
Stale room links should be cleaned up reliably when rooms are deleted or purged, or the feature should be clearly documented as potentially leaving persistent orphaned state.

Homeserver

Self-hosted Matrix Synapse deployment. Public domain intentionally redacted in this draft.

Synapse Version

1.151.0, from the fedora-minimal v43 container build, pinned as: matrix-synapse[postgres]==1.151.0

Installation Method

Docker (matrixdotorg/synapse)

Database

PostgreSQL, running on a single server used only for Synapse data.

Workers

Single process

Platform

Windows 11 host, Docker Desktop (current), Synapse running in fedora-minimal v43 Linux containers.

Configuration

  • Self-hosted HTTPS deployment.
  • Parent-child room hierarchy is managed through the Matrix Client-Server API.
  • Cleanup and purge have been attempted with a full Synapse admin MXID, redacted in this draft.
  • New room-link creation has since been changed to set suggested: false for newly created m.space.child links.
  • That change prevents new suggested links from being created by our current path, but it does not clean up the large number of existing residual rows.
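To make the mitigation concrete, the two payloads involved look roughly like this; both are PUT bodies for the same m.space.child state_key, and the values are illustrative, not taken from the deployment.

```python
# The two m.space.child payloads discussed above (illustrative values).
import json

# New links now carry suggested: false instead of true:
new_link_content = {"via": ["redacted.example.invalid"], "suggested": False}

# An explicit unset replaces the event content with an empty object, which
# is the spec's way of removing an m.space.child link:
unset_content = {}

new_link_json = json.dumps(new_link_content)
unset_json = json.dumps(unset_content)
```

The first payload stops new suggested links at the source; only the second is supposed to remove a link, and in our environment even that did not reduce the residual row counts.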

Relevant log output

Sanitized excerpts from the latest cleanup run and generated artifact:


```text
[SQL][matrix:all_space_child_links] ... output truncated, 2963 additional line(s) not shown
[SQL][matrix:space_child_suggested_flags] ... output truncated, 2963 additional line(s) not shown
[MATRIX-HYGIENE] Counts remaining -> allSpaceChildLinks=3109, orphanRooms=0

all_space_child_links.outputCount: 3113
space_child_suggested_flags.outputCount: 3113
postCleanupAllSpaceChildCount: 3109
postCleanupOrphanCount: 0
```


Additional aggregate evidence from the same artifact:


```text
space_child_counts_by_parent:
- one parent room had 2124 child pointers by itself
- several others were in the 90+ range
```


The raw SQL output was large enough that the surrounding test log had to truncate many lines, which is part of why this became operationally difficult to diagnose.

## SQL used to detect the stale state

These are the main queries used to inspect the residual m.space.child state.

List all current parent-child links:


```sql
SELECT room_id AS parent_room, state_key AS child_room_id, event_id
FROM current_state_events
WHERE type = 'm.space.child'
  AND state_key LIKE '!%'
ORDER BY room_id, state_key;
```


Check whether the stored content still reports suggested:


```sql
SELECT c.room_id AS parent_room,
       c.state_key AS child_room_id,
       CASE
         WHEN ej.json::jsonb -> 'content' ->> 'suggested' = 'true' THEN true
         ELSE false
       END AS suggested
FROM current_state_events c
JOIN event_json ej ON ej.event_id = c.event_id
WHERE c.type = 'm.space.child'
  AND c.state_key LIKE '!%'
ORDER BY c.room_id, c.state_key;
```


Count how many child pointers each parent still holds:


```sql
SELECT room_id AS parent_room, COUNT(*) AS child_pointer_count
FROM current_state_events
WHERE type = 'm.space.child'
  AND state_key LIKE '!%'
GROUP BY room_id
ORDER BY child_pointer_count DESC, room_id;
```


Look for rooms that appear empty or orphaned:


```sql
SELECT r.room_id,
       COALESCE(rsc.joined_members, 0) AS joined_members,
       COALESCE(rsc.local_users_in_room, 0) AS local_users_in_room
FROM rooms r
LEFT JOIN room_stats_current rsc ON rsc.room_id = r.room_id
WHERE COALESCE(rsc.joined_members, 0) = 0
  AND COALESCE(rsc.local_users_in_room, 0) = 0
ORDER BY r.room_id;
```
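The cleanup pass we attempted can be sketched as follows: for each (parent, child) pair returned by the first query, overwrite the m.space.child state with empty content. The rows and the send_state helper below are illustrative stand-ins for the real HTTP PUT.

```python
# Hedged sketch of the cleanup pass attempted against rows from the first
# query above. send_state() is a stand-in for an HTTP PUT to
# /_matrix/client/v3/rooms/{parent}/state/m.space.child/{child};
# the room IDs are illustrative.
from urllib.parse import quote

stale_links = [
    ("!parent:redacted.example.invalid", "!deadchild1:redacted.example.invalid"),
    ("!parent:redacted.example.invalid", "!deadchild2:redacted.example.invalid"),
]

def send_state(parent: str, child: str) -> str:
    """Stand-in for the real HTTP call; returns the path it would PUT
    an empty {} body to."""
    return (f"/_matrix/client/v3/rooms/{quote(parent)}"
            f"/state/m.space.child/{quote(child)}")

cleaned = [send_state(parent, child) for parent, child in stale_links]
```

This is the flow behind the "explicit state unsets" mentioned below; in our environment the post-cleanup counts remained high even after running it.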

Anything else that would be useful to know?

  • The residue is large enough to become operationally painful. In the latest cleanup artifact, 3113 m.space.child rows were still present, with 3109 remaining even after cleanup.
  • The largest single parent room still had 2124 child pointers.
  • The cleanup tool reported no obvious stale pointers or purge candidates, which suggests these rows are difficult to classify or remove through ordinary cleanup once they exist.
  • We also attempted explicit state unsets and room shutdown or purge flows, but the counts remained high.
  • If this is expected behavior, the documentation should warn operators very clearly before they use suggested space-child links heavily.
  • If this is not expected behavior, a built-in admin cleanup path or safer default would be very helpful.
  • This was discovered by using etkecc/ketesa (https://github.com/etkecc/ketesa): selecting a parent space room and opening the hierarchy view showed thousands of "Suggested" state entries under the parent, referencing rooms and test-created users that had already been deleted during repeated testing cycles.

Metadata

Labels: A-purge-room (Deleting and purging a room)