[WIP] MSC3898: Native Matrix VoIP signalling for cascaded foci (SFUs, MCUs...) #3898
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,364 @@ | ||
# MSC3898: Native Matrix VoIP signalling for cascaded SFUs | ||
|
||
[MSC3401](https://github.com/matrix-org/matrix-spec-proposals/pull/3401) | ||
specifies how full-mesh group calls work in Matrix. While that MSC works well | ||
for small group calls, it does not work so well for large conferences due to | ||
bandwidth (and other) issues. | ||
|
||
Selective Forwarding Units (SFUs) are servers which forward WebRTC streams
between peers (which could be clients or SFUs or both). To make use of them
effectively, peers need to be able to tell the SFU which streams they want to
receive and at what resolutions.
|
||
To solve the issue of centralization, the SFUs are also allowed to connect to
each other ("cascade"), and therefore the peers also need a way to tell an SFU
which other SFUs to connect to.
|
||
## Proposal | ||
|
||
- **TODO: spell out how this works with active speaker detection & associated | ||
signalling** | ||
|
||
### Diagrams | ||
|
||
The diagrams of how this all looks can be found in | ||
[MSC3401](https://github.com/matrix-org/matrix-spec-proposals/pull/3401). | ||
|
||
### Additions to the `m.call.member` state event | ||
|
||
This MSC proposes adding two _optional_ fields to the `m.call.member` state event: | ||
`m.foci.preferred` and `m.foci.active`. | ||
|
||
For instance: | ||
|
||
```json | ||
{ | ||
"type": "m.call.member", | ||
"state_key": "@matthew:matrix.org", | ||
"content": { | ||
"m.calls": [ | ||
{ | ||
"m.call_id": "cvsiu2893", | ||
"m.devices": [{ | ||
"device_id": "U738KDF9WJ", | ||
"m.foci.active": [ | ||
{ "user_id": "@sfu-lon:matrix.org", "device_id": "FS5F589EF" } | ||
], | ||
"m.foci.preferred": [ | ||
{ "user_id": "@sfu-bon:matrix.org", "device_id": "3FSF589EF" }, | ||
{ "user_id": "@sfu-mon:matrix.org", "device_id": "GFSDH93EF" }
] | ||
}] | ||
} | ||
], | ||
"m.expires_ts": 1654616071686 | ||
} | ||
} | ||
``` | ||
|
||
#### `m.foci.active` | ||
|
||
This field is a list of foci the user's device is publishing to. Usually, this | ||
list will have a length of 1, yet a client might publish to multiple foci if | ||
they are on different networks, for instance, or to simultaneously fan-out in | ||
different directions from the client if there is no nearby focus. If the client | ||
is participating full-mesh, it should either omit this field from the state | ||
event or leave the list empty. | ||
|
||
#### `m.foci.preferred` | ||
|
||
This field is a list of foci the client would prefer to switch to from the | ||
current active focus, if any other client also starts using the given focus. If | ||
the client is already using one of its preferred foci, it should either omit | ||
this field from the state event or leave the list empty. | ||
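
As a rough illustration of how a client might read these two fields, the following sketch pulls the active and preferred foci for one device out of an `m.call.member` event's content. The `Focus` type and the `foci_for_device` helper are hypothetical names used only for this example; the event layout is the one shown above.

```python
from typing import NamedTuple


class Focus(NamedTuple):
    """A focus, identified by user_id and device_id (hypothetical helper type)."""
    user_id: str
    device_id: str


def foci_for_device(content: dict, call_id: str, device_id: str) -> tuple[list[Focus], list[Focus]]:
    """Return (active, preferred) foci for one device in one call.

    A missing field and an empty list are treated the same way, since the
    proposal allows either when there is nothing to advertise.
    """
    for call in content.get("m.calls", []):
        if call.get("m.call_id") != call_id:
            continue
        for device in call.get("m.devices", []):
            if device.get("device_id") != device_id:
                continue
            active = [Focus(f["user_id"], f["device_id"])
                      for f in device.get("m.foci.active", [])]
            preferred = [Focus(f["user_id"], f["device_id"])
                         for f in device.get("m.foci.preferred", [])]
            return active, preferred
    return [], []


# The example event content from this section.
content = {
    "m.calls": [{
        "m.call_id": "cvsiu2893",
        "m.devices": [{
            "device_id": "U738KDF9WJ",
            "m.foci.active": [
                {"user_id": "@sfu-lon:matrix.org", "device_id": "FS5F589EF"},
            ],
            "m.foci.preferred": [
                {"user_id": "@sfu-bon:matrix.org", "device_id": "3FSF589EF"},
                {"user_id": "@sfu-mon:matrix.org", "device_id": "GFSDH93EF"},
            ],
        }],
    }],
    "m.expires_ts": 1654616071686,
}

active, preferred = foci_for_device(content, "cvsiu2893", "U738KDF9WJ")
```

A device that is participating full-mesh would simply yield two empty lists here.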
|
||
|
||
### Choosing a focus | ||
|
||
#### Discovering foci | ||
|
||
- **TODO: How does a client discover foci? We could use well-known or a custom endpoint** | ||
Review comment: Thoughts: how we load balance between SFUs and manage availability will have a bearing on this, i.e. would we expect SFUs to become unavailable when they get restarted / updated, and therefore how often should a client expect its SFU (or list of SFUs?) to change?
||
|
||
Foci are identified by a tuple of `user_id` and `device_id`. | ||
|
||
#### Determining the best focus | ||
|
||
There are many ways to determine the best focus; this MSC recommends preferring
the focus which:
|
||
- Is the quickest to respond to `m.call.invite` with `m.call.answer`. | ||
- Is the quickest to rapidly reject a spurious HTTPS request to a high-numbered | ||
port on the SFU's IP address, if the SFU exposes its IP somewhere - similar to | ||
the [apenwarr/blip](https://github.com/apenwarr/blip) trick, in order to | ||
measure media-path latency rather than signalling path latency. | ||
- Has the best latency of data-channel traffic flows. | ||
- Has the best latency and bandwidth determined by sending a small splurge of | ||
media down the pipe to probe. | ||
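
Whichever probe is used, the result is a set of latency measurements per focus that has to be turned into an ordering. A minimal sketch, assuming the client has already collected round-trip times in milliseconds for each focus (how they were measured is out of scope here) and assuming the median is a reasonable summary, which this MSC does not mandate:

```python
import statistics

Focus = tuple[str, str]  # (user_id, device_id)


def order_foci(rtt_samples_ms: dict[Focus, list[float]]) -> list[Focus]:
    """Order foci from best to worst by median round-trip time.

    Foci with no successful samples (e.g. the probe timed out) sort last.
    """
    def score(focus: Focus) -> float:
        samples = rtt_samples_ms[focus]
        return statistics.median(samples) if samples else float("inf")

    return sorted(rtt_samples_ms, key=score)


# Hypothetical measurements for the foci used elsewhere in this document.
samples = {
    ("@sfu-lon:matrix.org", "FS5F589EF"): [42.0, 40.0, 55.0],
    ("@sfu-bon:matrix.org", "3FSF589EF"): [18.0, 21.0],
    ("@sfu-mon:matrix.org", "GFSDH93EF"): [],
}
ordered = order_foci(samples)
```

Using the median rather than the mean keeps a single slow sample from dominating the ranking; other summaries would work equally well in this sketch.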
|
||
#### Joining a call | ||
|
||
The following diagram explains how a client chooses a focus when joining a call. | ||
|
||
```mermaid | ||
flowchart TD; | ||
wantsToJoin[Wants to join a call]; | ||
hasPreferred(Has preferred focus?); | ||
callPreferred[Calls preferred foci without media to grab a slot]; | ||
publishPreferred[Publishes `m.foci.preferred`]; | ||
checkMembers(Call has more than 2 members including the client itself?); | ||
callFullMesh[Calls other member full-mesh]; | ||
callMembersFoci[Tries calling foci from `m.call.member` events]; | ||
orderFoci[Orders foci from best to worst]; | ||
findFocusPreferredByOtherMember(Goes through ordered foci to find one which is preferred by at least one other member); | ||
callBestPreferred[Calls the focus]; | ||
callBestActive[Calls the best active focus in room]; | ||
publishActive[Publishes `m.foci.active`]; | ||
|
||
wantsToJoin-->hasPreferred; | ||
hasPreferred--->|Yes|callPreferred; | ||
hasPreferred--->|No|checkMembers; | ||
callPreferred--->publishPreferred; | ||
publishPreferred--->checkMembers; | ||
checkMembers--->|Yes|callMembersFoci; | ||
checkMembers--->|No|callFullMesh; | ||
callMembersFoci--->orderFoci; | ||
orderFoci--->findFocusPreferredByOtherMember; | ||
findFocusPreferredByOtherMember--->|Found|callBestPreferred; | ||
callBestPreferred--->publishActive; | ||
findFocusPreferredByOtherMember--->|Not found|callBestActive; | ||
callBestActive--->publishActive; | ||
``` | ||
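
The flowchart above can also be read as a small decision function. The sketch below is one possible reading, not a normative algorithm: `JoinPlan`, `plan_join` and all parameter names are hypothetical, and the case where no focus can be reached at all is not covered by the flowchart, so the sketch simply leaves `active_focus` unset.

```python
from dataclasses import dataclass, field
from typing import Optional

Focus = tuple[str, str]  # (user_id, device_id)


@dataclass
class JoinPlan:
    """What the joining client should do (hypothetical structure)."""
    publish_preferred: list[Focus] = field(default_factory=list)
    full_mesh: bool = False
    active_focus: Optional[Focus] = None


def plan_join(
    own_preferred: list[Focus],
    member_count: int,
    ordered_foci: list[Focus],
    preferred_by_others: set[Focus],
    active_in_room: set[Focus],
) -> JoinPlan:
    """Mirror the flowchart; `ordered_foci` is sorted best to worst.

    `member_count` includes the joining client itself.
    """
    plan = JoinPlan()
    # "Has preferred focus?": call them without media and publish the list.
    if own_preferred:
        plan.publish_preferred = list(own_preferred)
    # "Call has more than 2 members including the client itself?"
    if member_count <= 2:
        plan.full_mesh = True
        return plan
    # Look for the best focus preferred by at least one other member.
    for focus in ordered_foci:
        if focus in preferred_by_others:
            plan.active_focus = focus
            return plan
    # Not found: fall back to the best focus already active in the room.
    for focus in ordered_foci:
        if focus in active_in_room:
            plan.active_focus = focus
            return plan
    return plan  # no usable focus; not covered by the flowchart


LON = ("@sfu-lon:matrix.org", "FS5F589EF")
BON = ("@sfu-bon:matrix.org", "3FSF589EF")

small_call = plan_join([BON], 2, [], set(), set())
large_call = plan_join([], 5, [BON, LON], {LON}, {BON, LON})
fallback = plan_join([], 5, [BON, LON], set(), {LON})
```

Whenever `active_focus` is set, the client would then publish it in `m.foci.active`, as in the last step of the flowchart.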
|
||
#### Mid-call changes | ||
|
||
Once in a call, the client listens for changes to `m.call.member` state events | ||
and if another member starts using one of the client's preferred foci, the client | ||
switches to that focus. | ||
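
A minimal sketch of that check, run each time an `m.call.member` event changes. The function name is hypothetical, and treating `m.foci.preferred` as ordered by preference is an assumption of this example rather than something the MSC states:

```python
from typing import Optional

Focus = tuple[str, str]  # (user_id, device_id)


def focus_to_switch_to(
    own_preferred: list[Focus],
    own_active: list[Focus],
    others_active: set[Focus],
) -> Optional[Focus]:
    """Return the first preferred focus another member now uses, if any."""
    for focus in own_preferred:
        if focus in others_active and focus not in own_active:
            return focus
    return None


LON = ("@sfu-lon:matrix.org", "FS5F589EF")
BON = ("@sfu-bon:matrix.org", "3FSF589EF")
MON = ("@sfu-mon:matrix.org", "GFSDH93EF")

# Nobody else uses a preferred focus yet, so the client stays where it is.
no_switch = focus_to_switch_to([BON, MON], [LON], {LON})
# Another member has started using MON, so the client switches to it.
switch = focus_to_switch_to([BON, MON], [LON], {LON, MON})
```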
|
||
### Initial offer/answer dance | ||
|
||
During the initial offer/answer dance, the client establishes a data-channel | ||
between itself and the SFU to use later for rapid signalling. | ||
|
||
|
||
### Simulcast | ||
|
||
#### RTP munging | ||
|
||
#### vp8 munging | ||
|
||
### RTCP re-transmission | ||
|
||
### Data-channel messaging | ||
|
||
The client uses the established data channel connection to the SFU to perform | ||
low-latency signalling to rapidly (un)subscribe/(un)publish streams, send | ||
keep-alive messages, metadata, cascade and perform re-negotiation. | ||
|
||
- **TODO: It feels like these ought to be `m.` namespaced** | ||
- **TODO: Why `op` instead of `type`?** | ||
- **TODO: It feels like these ought to have `content` rather than being on the | ||
same layer** | ||
- **TODO: Spell out how the DC traffic interacts with application-layer | ||
traffic** | ||
|
||
#### SDP Stream Metadata extension | ||
|
||
The client will be receiving multiple streams from the SFU and it will need to
be able to distinguish them. This MSC therefore builds on
[MSC3077](https://github.com/matrix-org/matrix-spec-proposals/pull/3077) and
[MSC3291](https://github.com/matrix-org/matrix-spec-proposals/pull/3291) to
provide the client with the necessary metadata. Some of the data-channel events
include a `metadata` field containing a description of the stream being sent
either from the SFU to the client or from the client to the SFU.
|
||
Other than mute information and stream purpose, the metadata includes video | ||
track resolution. The SFU may not be able to determine the resolution of the | ||
track itself but it does need to know for simulcast; therefore, we include this | ||
in the metadata. | ||
|
||
```json | ||
{ | ||
"streamId1": { | ||
"purpose": "m.usermedia", | ||
"audio_muted": false, | ||
"video_muted": true, | ||
"tracks": { | ||
"trackId1": { | ||
"width": 1920, | ||
"height": 1080 | ||
}, | ||
"trackId2": {} | ||
} | ||
} | ||
} | ||
``` | ||
|
||
#### Event types | ||
|
||
##### Subscribe | ||
|
||
This event is sent by the client to request a set of tracks. In the case of
video tracks, the client can also request a specific resolution of a given
track; this is the resolution the client wishes to receive, but the SFU may
send a lower one due to bandwidth constraints etc.
|
||
|
||
If the user, for example, switches from "spotlight" (one large tile) to "grid"
(multiple small tiles) view, the client should also send this request to let
the SFU know of the resolution change.
|
||
- **TODO: how do we prove to the SFU that we have the right to subscribe to a
track?**
|
||
```json
{
    "op": "subscribe",
    "start": [
        {
            "stream_id": "streamId1",
            "track_id": "trackId1",
            "width": 1920,
            "height": 1080
        }
    ]
}
```

Review comment: Btw, do we really need both track ID and stream ID for the SFU use case? The track IDs that browsers generate seem to be GUIDs that are unique enough (i.e. it's unlikely that there would be 2 tracks with the same GUID). Does this mean that instead of using two values, we could just send the

Reply: To explain the reasoning here. When browsing via Pion docs, I've noticed that Also, using only the
|
||
##### Unsubscribe | ||
|
||
```json
{
    "op": "unsubscribe",
    "stop": [
        {
            "stream_id": "streamId1",
            "track_id": "trackId1"
        }
    ]
}
```
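
To make the two messages concrete, here is a small sketch of how a client might build them before sending them over the data channel. The helper names are hypothetical, and the sketch assumes that each entry in `start` and `stop` is a JSON object describing one track; the draft's examples are ambiguous on that point.

```python
import json
from typing import Iterable, Optional


def subscribe_message(
    tracks: Iterable[tuple[str, str, Optional[int], Optional[int]]],
) -> str:
    """Build a `subscribe` message from (stream_id, track_id, width, height).

    Width and height are only included for video tracks; pass None for audio.
    """
    start = []
    for stream_id, track_id, width, height in tracks:
        entry = {"stream_id": stream_id, "track_id": track_id}
        if width is not None and height is not None:
            entry["width"] = width
            entry["height"] = height
        start.append(entry)
    return json.dumps({"op": "subscribe", "start": start})


def unsubscribe_message(tracks: Iterable[tuple[str, str]]) -> str:
    """Build an `unsubscribe` message from (stream_id, track_id) pairs."""
    stop = [{"stream_id": s, "track_id": t} for s, t in tracks]
    return json.dumps({"op": "unsubscribe", "stop": stop})


# Spotlight view: ask for the full-resolution track.
spotlight = subscribe_message([("streamId1", "trackId1", 1920, 1080)])
# Grid view: re-send the request with a smaller (hypothetical) tile size.
grid = subscribe_message([("streamId1", "trackId1", 320, 180)])
# Stop receiving the track altogether.
stop = unsubscribe_message([("streamId1", "trackId1")])
```

Re-sending `subscribe` for an already subscribed track, as in the grid example, is how this sketch models the layout change described above.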
|
||
##### Publish | ||
|
||
##### Unpublish | ||
|
||
##### Offer | ||
|
||
##### Answer | ||
|
||
##### Metadata | ||
|
||
```json | ||
{ | ||
"op": "metadata", | ||
"metadata": {...} // As specified in the Metadata section | ||
} | ||
``` | ||
|
||
##### Keep-alive | ||
|
||
```json | ||
{ | ||
"op": "alive" | ||
} | ||
``` | ||
|
||
##### Connect | ||
|
||
If a user is using their own SFU in a call, that SFU will need to know how to
connect to other SFUs present in order to participate in the full-mesh of SFU
traffic (if any). The client is responsible for doing this using the `connect` op.
Review comment: Do we want to specify the cascading specific logic in this MSC or would it be better to make a separate MSC for cascading? Rationale: if we have a dedicated MSC for the SFU, we'll be able to finalize and merge it faster to master. Iterating with small MSCs might be a better idea given the amount of time it normally takes until the MSC is merged? (just a gut feeling)

Reply: The issue is that the event fields used in a single focus case are quite different from the cascading case. I wonder if there is a way to avoid that issue

Reply: Sorry, I did not fully get what you mean. I think the reason why I initially commented is that it seems like we're not going to have the cascading implemented in the very nearest future (currently we don't really support it), so I thought maybe it would be faster to limit this MSC to the SFU and then create a cascading MSC after that (once we have a stable SFU). I was just afraid that otherwise the MSC would stay open (or in a draft state) for too long.

Reply: I agree with that but I am not sure how to technically handle this - the MSC currently specifies an SFU selection algorithm and the fields it uses, if we wanted to split the MSC into two, we would need two completely different ways to specify the SFU, I think...
||
|
||
```json | ||
{ | ||
"op": "connect" | ||
// TODO: How should this look? | ||
} | ||
``` | ||
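
On the receiving side, every data-channel message is a JSON object whose `op` field selects the handler. The following sketch shows one way to validate and split such a message. Only `subscribe`, `unsubscribe`, `metadata`, `alive` and `connect` appear in the examples in this section; the remaining op names are assumptions derived from the headings above, and `parse_message` is a hypothetical helper.

```python
import json

# Ops shown in this section, plus assumed names for the headings that do not
# have an example yet (publish, unpublish, offer, answer).
KNOWN_OPS = {
    "subscribe", "unsubscribe", "metadata", "alive", "connect",
    "publish", "unpublish", "offer", "answer",
}


def parse_message(raw: str) -> tuple[str, dict]:
    """Split a raw data-channel message into (op, remaining fields).

    Raises ValueError for malformed JSON, non-object payloads or unknown ops.
    """
    message = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(message, dict):
        raise ValueError("data-channel message must be a JSON object")
    op = message.get("op")
    if not isinstance(op, str) or op not in KNOWN_OPS:
        raise ValueError(f"unknown op: {op!r}")
    fields = {key: value for key, value in message.items() if key != "op"}
    return op, fields


op, fields = parse_message('{"op": "alive"}')
```

Because the other fields currently sit on the same layer as `op`, the sketch returns everything except `op` as the payload; that would change if the events gained a `content` field as one of the TODOs above suggests.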
|
||
### Encryption | ||
|
||
When SFUs are on the media path, they will necessarily terminate the SRTP
traffic from the peer, breaking E2EE. To address this, we apply an additional
end-to-end layer of encryption to the media using [WebRTC Encoded
Transform](https://github.com/w3c/webrtc-encoded-transform/blob/main/explainer.md)
(formerly Insertable Streams) via
[SFrame](https://datatracker.ietf.org/doc/draft-omara-sframe/).
|
||
In order to provide PFS, the symmetric key used for these streams from a given
participating device is a megolm key. Unlike a normal megolm key, this is shared | ||
via `m.room_key` over Olm to the devices participating in the conference | ||
including an `m.call_id` and `m.room_id` field on the key to correlate it to the | ||
conference traffic, rather than using the `session_id` event field to correlate | ||
(given the encrypted traffic is SRTP rather than events, and we don't want to | ||
have to send fake events from all senders every time the megolm session is | ||
replaced). | ||
|
||
The megolm key is ratcheted forward for every SFrame, and shared with new
participants at the current index via `m.room_key` over Olm as per above. When
participants leave, a new megolm session is created and shared with all
participants over Olm. The new session is only used once all participants have
received it.
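
The rotation rule in the last paragraph can be sketched as a small piece of sender-side bookkeeping. This is an illustration only: the MSC does not yet define how a sender learns that a participant has received the new session, so the `key_received` callback below is an assumption, as are the class and method names.

```python
class SenderKeyRotation:
    """Tracks which megolm session a sending device uses for its media."""

    def __init__(self, session_id: str, participants: set[str]):
        self.current = session_id              # session used for outgoing media
        self.pending = None                    # new session not yet in use
        self.participants = set(participants)  # other devices in the call
        self._awaiting: set[str] = set()

    def participant_left(self, device: str, new_session_id: str) -> None:
        """Start rotating to a new session shared with everyone who remains."""
        self.participants.discard(device)
        self.pending = new_session_id
        self._awaiting = set(self.participants)
        self._maybe_switch()

    def key_received(self, device: str, session_id: str) -> None:
        """Record that `device` has the pending session (assumed signal)."""
        if session_id == self.pending:
            self._awaiting.discard(device)
            self._maybe_switch()

    def _maybe_switch(self) -> None:
        # Only start using the new session once every participant has it.
        if self.pending is not None and not self._awaiting:
            self.current, self.pending = self.pending, None


rotation = SenderKeyRotation("session1", {"A", "B", "C"})
rotation.participant_left("C", "session2")
rotation.key_received("A", "session2")
still_old = rotation.current   # B has not confirmed yet
rotation.key_received("B", "session2")
now_new = rotation.current
```

Until the switch happens the old session stays in use, which is exactly the window discussed under security considerations.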
|
||
### Notes | ||
|
||
#### Hiding behind foci | ||
|
||
We do not recommend that users utilise a focus to hide behind for privacy, but | ||
instead use a TURN server, only providing relay candidates, rather than | ||
consuming focus resources and unnecessarily mandating the presence of a focus. | ||
|
||
## Potential issues | ||
|
||
The SFUs participating in a conference end up in a full mesh. Rather than | ||
inventing our own spanning-tree system for SFUs however, we should fix it for | ||
Matrix as a whole (as is happening in the LB work) and use a Pinecone tree or | ||
similar to decide what better-than-full-mesh topology to use. In practice, full | ||
mesh cascade between SFUs is probably not that bad (especially if SFUs only | ||
request the streams over the trunk their clients care about) - and on aggregate | ||
will be less obnoxious than all the clients hitting a single SFU. | ||
|
||
Too many foci will chew bandwidth due to the full mesh between them. In the
worst case, if every user is on their own HS and picks a different focus, it
degenerates to a full-mesh call (just server-side rather than client-side).
Hopefully this shouldn't happen as you will converge on using a single SFU with
the most clients, but we need to check how this works in practice.
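
To put rough numbers on the worst case, the count of trunks in a full mesh grows quadratically with the number of foci. A tiny sketch (the function name is just for this example):

```python
def full_mesh_links(nodes: int) -> int:
    """Number of bidirectional links in a full mesh of `nodes` nodes."""
    return nodes * (nodes - 1) // 2


for foci in (1, 2, 5, 10):
    print(foci, "foci ->", full_mesh_links(foci), "trunks")
```

Ten foci already need 45 trunks between them, which is the same link count a ten-party client-side full-mesh call would have.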
|
||
SFrame currently mandates its own ratchet, which is almost the same as megolm
but not quite. Switching it out for megolm seems reasonable right now (at least
until MLS comes along).
|
||
## Alternatives | ||
|
||
An option would be to treat 1:1 (and full mesh) entirely differently to SFU | ||
based calling rather than trying to unify them. Also, it's debatable whether | ||
supporting full mesh is useful at all. In the end, it feels like unifying 1:1 | ||
and SFU calling is for the best though, as it then gives you the ability to | ||
trivially upgrade 1:1 calls to group calls and vice versa, and avoids | ||
maintaining two separate hunks of spec. It also forces 1:1 calls to take | ||
multi-stream calls seriously, which is useful for more exotic capture devices | ||
(stereo cameras; 3D cameras; surround sound; audio fields etc). | ||
|
||
### Cascading | ||
|
||
One option here is for SFUs to act as an AS and sniff the `m.call.member` | ||
traffic of their associated server, and automatically call any other `m.foci` | ||
which appear. (They don't need to make outbound calls to clients, as clients | ||
always dial in). | ||
|
||
## Security considerations | ||
|
||
Malicious users could try to DoS SFUs by specifying them as their foci. | ||
Review comment: (by @HarHarLinks from #3401 (comment))

Reply: As I learn more about this topic, foci seem to not be authenticated.
||
|
||
SFrame E2EE may go horribly wrong if we can't send the new megolm session fast
enough to all the participants when a participant leaves (and meanwhile if we
keep using the old session, we're technically leaking call media to the parted
participant until we manage to rotate).
|
||
Need to ensure there's no scope for media forwarding loops through SFUs. | ||
|
||
In order to authenticate that only legitimate users are allowed to subscribe to | ||
a given `conf_id` on an SFU, it would make sense for the SFU to act as an AS and | ||
sniff the `m.call` events on their associated server, and only act on to-device | ||
`m.call.*` events which come from a user who is confirmed to be in the room for | ||
that `m.call`. (In practice, if the conf is E2EE then it's of limited use to | ||
connect to the SFU without having the keys to decrypt the traffic, but this | ||
feature is desirable for non-E2EE confs and to stop bandwidth DoS.)
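
A minimal sketch of that check, assuming the SFU has already learned (for example as an AS) which room each conference belongs to and who is joined to that room. The function name, the two lookup tables and the placement of `conf_id` inside the event content are assumptions for illustration:

```python
def is_authorised(
    event: dict,
    room_for_conf: dict[str, str],
    joined_members: dict[str, set[str]],
) -> bool:
    """Decide whether the SFU should act on a to-device `m.call.*` event.

    room_for_conf maps conf_id -> room_id; joined_members maps
    room_id -> set of joined user IDs.
    """
    event_type = event.get("type")
    if not isinstance(event_type, str) or not event_type.startswith("m.call."):
        return False
    conf_id = event.get("content", {}).get("conf_id")
    if not isinstance(conf_id, str):
        return False
    room_id = room_for_conf.get(conf_id)
    if room_id is None:
        return False  # unknown conference
    return event.get("sender") in joined_members.get(room_id, set())


room_for_conf = {"conf1": "!room:example.org"}
joined_members = {"!room:example.org": {"@alice:example.org"}}

invite = {
    "type": "m.call.invite",
    "sender": "@alice:example.org",
    "content": {"conf_id": "conf1"},
}
allowed = is_authorised(invite, room_for_conf, joined_members)
denied = is_authorised({**invite, "sender": "@mallory:example.org"},
                       room_for_conf, joined_members)
```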
|
||
## Unstable prefixes | ||
|
||
We probably don't care for this for the data-channel? |
Review comment: Note that the `call_id` does not seem to be necessary. When the SFU sends To-Device messages to the clients, the `conf_id` is specified and given that the `conf_id` is a unique identifier of a conference/call, there seems to be no need to have a `call_id` in addition to that. Recently I've run into an issue where I realized that `call_id` and `conf_id` are not the same (despite MSC3401 giving me an impression that they are identical). The `conf_id` was the ID of a conference (as expected), but the `call_id` was another random string that was different for each single participant which forced us to use both `call_id` and `conf_id` when sending messages back to the clients (otherwise they would be rejected). It looks like `call_id` should either be removed or (if we want to keep it for the backward compatibility with the older MSC?) it must be equal to the `conf_id`.

Reply: I think this comment belongs on MSC3401 as this line is specified in other MSC, although I think the conclusion is just that there's confusion between call_id and conf_id and we should rename this to conf_id (there's no other conf ID in this event so it is necessary, not just for backwards compat).

Reply: Yeah, I've also written a comment about it in MSC3401 😛 Basically, the problem is not only that they are called differently, but also that the value of `call_id` and `conf_id` is different, so they are different for some reason (and on the SFU we are obligated to take both into account: `conf_id` for a conference ID and the `call_id` to set a value in outgoing To-Device messages without which the client would discard the messages).

Reply: Do you agree that the correct resolution is to change this to `m.conf_id`?

Reply: Yes, that would be great! Though I wonder what the consequence of that would be (i.e. what is that value that the current `call_id` has? - It's not a conference ID, it's something different, or maybe it's a leftover from a previous implementation for 1:1s where `call_id` meant something?)

Reply: Yes 🙂 That's something that I discovered a week ago when deploying the first iteration of refactored SFU. I've just tried to join the SFU and the `conf_id` field is equal to `1668002318158qFQmBWgVHHXTZsPA`, while the `call_id` is `1670443502134bOWVqa3btIfDQMjJ`.

Reply: conf_id and call_id from where though? There will also be call_id in the individual calls which will definitely be different. Otherwise we need to work out what's going on here.

Reply: From To-Device messages that the participants of the conference send to the SFU. We then reply with To-Device messages back (e.g. when we generate an answer), in which case we also set both `conf_id` and `call_id`.

Reply: Do you mean https://github.com/matrix-org/matrix-js-sdk/blob/develop/src/webrtc/call.ts#L2252? `conf_id` is the ID of the conference call (state key of the m.call event), `call_id` is the ID of the 1:1 call between the individual group call participants.

Reply: Yeah, seems like this. But the thing is that, from the SFU's standpoint, the `call_id` does not have any semantics, but currently we're obligated to store both `conf_id` and `call_id` (which have different values), where the `call_id` is only used in order to send To-Device messages to the clients, i.e. when I e.g. want to send messages from the SFU to the client, I have to set both the `conf_id` (the ID of a conference) and the `call_id` (the ID of the 1:1 call between individual group call participants). So my point is that we probably want to get rid of mandating `call_id` for the SFU calls since it doesn't seem to have any semantic value for this use case. And only use the `conf_id` instead?