[WIP] MSC3898: Native Matrix VoIP signalling for cascaded foci (SFUs, MCUs...) #3898

Draft
wants to merge 30 commits into
base: main
Commits
750087f
Native Matrix VoIP signalling for cascaded SFUs
SimonBrandner Sep 25, 2022
aa53398
Update MSC number
SimonBrandner Sep 25, 2022
de302cb
Link to diagrams from MSC3401
SimonBrandner Oct 2, 2022
7474782
Use correct number for file
SimonBrandner Oct 2, 2022
5cad46d
Update sub and unsub ops
SimonBrandner Nov 11, 2022
2cbc2d6
Merge remote-tracking branch 'upstream/main' into SimonBrandner/msc/sfu
SimonBrandner Nov 11, 2022
f542fcb
Give a reason for specifying res in metadata
SimonBrandner Nov 11, 2022
6f01a94
Specify foci by `device_id` too
SimonBrandner Nov 12, 2022
575e16c
Fixup some json
SimonBrandner Nov 12, 2022
33b1880
Typo
SimonBrandner Nov 12, 2022
65faee4
Specify how to handle foci better
SimonBrandner Nov 13, 2022
9882c97
Amend TODOs
SimonBrandner Nov 13, 2022
c66bbe4
Add rationale behind usage of data channels
daniel-abramov Nov 15, 2022
1b2d740
Add TODO
SimonBrandner Dec 2, 2022
feb064b
Update event types
SimonBrandner Dec 2, 2022
d96d101
Add unstable prefixes
SimonBrandner Dec 2, 2022
d538e1e
Use `subscribe` instead of `select`
SimonBrandner Dec 6, 2022
91470a2
`op` -> `event`
SimonBrandner Dec 6, 2022
2ef7425
Fixup formatting
SimonBrandner Dec 6, 2022
5a186e4
Use `content`
SimonBrandner Dec 6, 2022
b461525
Namespace things
SimonBrandner Dec 6, 2022
e49e80d
Further namespacing
SimonBrandner Dec 6, 2022
6b3fd47
Update the events to match current Matrix
SimonBrandner Dec 6, 2022
bf52e02
Fix typo
SimonBrandner Dec 7, 2022
f81dd9d
Use `subscribe`/`unsbuscribe`
SimonBrandner Dec 7, 2022
9c32b96
Add informational section on active/preferred foci.
dbkr Dec 8, 2022
6f8c9d1
Change keepalives to ping/pong
dbkr Dec 8, 2022
ecf2425
Add empty line
SimonBrandner Dec 8, 2022
bf04b17
Fix event name
SimonBrandner Dec 9, 2022
1896fc7
Remove encryption section as it's glossing over details
SimonBrandner Dec 12, 2022
364 changes: 364 additions & 0 deletions proposals/3898-sfu.md
@@ -0,0 +1,364 @@
# MSC3898: Native Matrix VoIP signalling for cascaded SFUs

[MSC3401](https://github.com/matrix-org/matrix-spec-proposals/pull/3401)
specifies how full-mesh group calls work in Matrix. While that MSC works well
for small group calls, it does not work so well for large conferences due to
bandwidth (and other) issues.

Selective Forwarding Units (SFUs) are servers which forward WebRTC streams
between peers (which could be clients or SFUs or both). To make use of them
effectively, peers need to be able to tell the SFU which streams they want to
receive and at what resolutions.

To solve the issue of centralization, the SFUs are also allowed to connect to
each other ("cascade"), and therefore the peers also need a way to tell an SFU
which other SFUs to connect to.

## Proposal

- **TODO: spell out how this works with active speaker detection & associated
signalling**

### Diagrams

The diagrams of how this all looks can be found in
[MSC3401](https://github.com/matrix-org/matrix-spec-proposals/pull/3401).

### Additions to the `m.call.member` state event

This MSC proposes adding two _optional_ fields to the `m.call.member` state event:
`m.foci.preferred` and `m.foci.active`.

For instance:

```json
{
    "type": "m.call.member",
    "state_key": "@matthew:matrix.org",
    "content": {
        "m.calls": [
            {
                "m.call_id": "cvsiu2893",
                "m.devices": [{
                    "device_id": "U738KDF9WJ",
                    "m.foci.active": [
                        { "user_id": "@sfu-lon:matrix.org", "device_id": "FS5F589EF" }
                    ],
                    "m.foci.preferred": [
                        { "user_id": "@sfu-bon:matrix.org", "device_id": "3FSF589EF" },
                        { "user_id": "@sfu-mon:matrix.org", "device_id": "GFSDH93EF" }
                    ]
                }]
            }
        ],
        "m.expires_ts": 1654616071686
    }
}
```

Review comments on the `m.call_id` line:

Note that the call_id does not seem to be necessary.

When the SFU sends To-Device messages to the clients, the conf_id is specified, and given that the conf_id is a unique identifier of a conference/call, there seems to be no need to have a call_id in addition to that.

Recently I ran into an issue where I realized that call_id and conf_id are not the same (despite MSC3401 giving me the impression that they are identical). The conf_id was the ID of a conference (as expected), but the call_id was another random string that was different for each participant, which forced us to use both call_id and conf_id when sending messages back to the clients (otherwise they would be rejected).

It looks like call_id should either be removed or (if we want to keep it for backward compatibility with the older MSC?) it must be equal to the conf_id.

Member

I think this comment belongs on MSC3401 as this line is specified in the other MSC, although I think the conclusion is just that there's confusion between call_id and conf_id and we should rename this to conf_id (there's no other conf ID in this event so it is necessary, not just for backwards compat).

---

Yeah, I've also written a comment about it in MSC3401 😛

Basically, the problem is not only that they are called differently, but also that the value of call_id and conf_id is different, so they are different for some reason (and on the SFU we are obligated to take both into account: conf_id for a conference ID and the call_id to set a value in outgoing To-Device messages without which the client would discard the messages).

Member

Do you agree that the correct resolution is to change this to m.conf_id?

---

Yes, that would be great! Though I wonder what the consequence of that would be (i.e. what is that value that the current call_id has? - It's not a conference ID, it's something different, or maybe it's a leftover from a previous implementation for 1:1s where call_id meant something?)

@daniel-abramov daniel-abramov Dec 7, 2022

> Really?

Yes 🙂 That's something that I discovered a week ago when deploying the first iteration of refactored SFU. I've just tried to join the SFU and the conf_id field is equal to 1668002318158qFQmBWgVHHXTZsPA, while the call_id is 1670443502134bOWVqa3btIfDQMjJ.

Member

conf_id and call_id from where though? There will also be call_id in the individual calls which will definitely be different. Otherwise we need to work out what's going on here.

---

> conf_id and call_id from where though?

From To-Device messages that the participants of the conference send to the SFU. We then reply with To-Device messages back (e.g. when we generate an answer), in which case we also set both conf_id and call_id.

Member

Do you mean https://github.com/matrix-org/matrix-js-sdk/blob/develop/src/webrtc/call.ts#L2252? conf_id is the ID of the conference call (state key of the m.call event), call_id is the ID of the 1:1 call between the individual group call participants.

---

Yeah, seems like this. But the thing is that, from the SFU's standpoint, the call_id does not have any semantics, but currently we're obligated to store both conf_id and call_id (which have different values), where the call_id is only used in order to send To-Device messages to the clients, i.e. when I e.g. want to send messages from the SFU to the client, I have to set both the conf_id (the ID of a conference) and the call_id (the ID of the 1:1 call between individual group call participants).

So my point is that we probably want to get rid of mandating call_id for the SFU calls since it doesn't seem to have any semantic value for this use case. And only use the conf_id instead?

#### `m.foci.active`

This field is a list of foci the user's device is publishing to. Usually, this
list will have a length of 1, yet a client might publish to multiple foci if
they are on different networks, for instance, or to simultaneously fan-out in
different directions from the client if there is no nearby focus. If the client
is participating full-mesh, it should either omit this field from the state
event or leave the list empty.

#### `m.foci.preferred`

This field is a list of foci the client would prefer to switch to from the
current active focus, if any other client also starts using the given focus. If
the client is already using one of its preferred foci, it should either omit
this field from the state event or leave the list empty.
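
As a rough illustration of the two optional fields, the following Python sketch builds one `m.devices` entry and leaves out a foci field whenever its list would be empty. The helper names `focus` and `build_device_entry` are hypothetical and not part of this MSC, and omitting the field (rather than sending an empty list) is only one of the two options the text above allows.

```python
from typing import Optional


def focus(user_id: str, device_id: str) -> dict:
    """Describe a focus by its user_id and device_id (hypothetical helper)."""
    return {"user_id": user_id, "device_id": device_id}


def build_device_entry(
    device_id: str,
    active: Optional[list] = None,
    preferred: Optional[list] = None,
) -> dict:
    """Build one `m.devices` entry, omitting foci fields that would be empty."""
    entry = {"device_id": device_id}
    if active:
        entry["m.foci.active"] = list(active)
    if preferred:
        entry["m.foci.preferred"] = list(preferred)
    return entry


# A full-mesh participant publishes neither field.
full_mesh = build_device_entry("U738KDF9WJ")

# A participant publishing to one focus while preferring another.
via_sfu = build_device_entry(
    "U738KDF9WJ",
    active=[focus("@sfu-lon:matrix.org", "FS5F589EF")],
    preferred=[focus("@sfu-bon:matrix.org", "3FSF589EF")],
)
```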

### Choosing a focus

#### Discovering foci

- **TODO: How does a client discover foci? We could use well-known or a custom endpoint**

Member

Thoughts: how we load balance between SFUs and manage availability will have a bearing on this, i.e. would we expect SFUs to become unavailable when they get restarted / updated, and therefore how often would a client expect its SFU (or list of SFUs?) to change?


Foci are identified by a tuple of `user_id` and `device_id`.

#### Determining the best focus

There are many ways to determine the best focus; this MSC recommends the
following criteria. The best focus:

- Is the quickest to respond to `m.call.invite` with `m.call.answer`.
- Is the quickest to rapidly reject a spurious HTTPS request to a high-numbered
port on the SFU's IP address, if the SFU exposes its IP somewhere - similar to
the [apenwarr/blip](https://github.com/apenwarr/blip) trick, in order to
measure media-path latency rather than signalling path latency.
- Has the best latency of data-channel traffic flows.
- Has the best latency and bandwidth determined by sending a small splurge of
media down the pipe to probe.
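
The criteria above all reduce to comparing some measurement per focus. A minimal Python sketch of such an ordering follows, assuming the client has already collected latency samples (in milliseconds) for each focus by one of the methods listed. Using the median, and placing foci with no successful measurement last, are assumptions of this sketch; the MSC does not say how to combine the criteria.

```python
from statistics import median


def order_foci(latencies_ms: dict) -> list:
    """Order foci from best to worst by median latency.

    `latencies_ms` maps a (user_id, device_id) tuple to a list of samples.
    Foci without any sample sort last.
    """
    def sort_key(focus):
        samples = latencies_ms[focus]
        return median(samples) if samples else float("inf")

    return sorted(latencies_ms, key=sort_key)


measurements = {
    ("@sfu-lon:matrix.org", "FS5F589EF"): [42.0, 38.5, 40.1],
    ("@sfu-bon:matrix.org", "3FSF589EF"): [21.3, 25.0, 19.8],
    ("@sfu-mon:matrix.org", "GFSDH93EF"): [],  # never answered
}
ordered = order_foci(measurements)
```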

#### Joining a call

The following diagram explains how a client chooses a focus when joining a call.

```mermaid
flowchart TD;
wantsToJoin[Wants to join a call];
hasPreferred(Has preferred focus?);
callPreferred[Calls preferred foci without media to grab a slot];
publishPreferred[Publishes `m.foci.preferred`];
checkMembers(Call has more than 2 members including the client itself?);
callFullMesh[Calls other member full-mesh];
callMembersFoci[Tries calling foci from `m.call.member` events];
orderFoci[Orders foci from best to worst];
findFocusPreferredByOtherMember(Goes through ordered foci to find one which is preferred by at least one other member);
callBestPreferred[Calls the focus];
callBestActive[Calls the best active focus in room];
publishActive[Publishes `m.foci.active`];

wantsToJoin-->hasPreferred;
hasPreferred--->|Yes|callPreferred;
hasPreferred--->|No|checkMembers;
callPreferred--->publishPreferred;
publishPreferred--->checkMembers;
checkMembers--->|Yes|callMembersFoci;
checkMembers--->|No|callFullMesh;
callMembersFoci--->orderFoci;
orderFoci--->findFocusPreferredByOtherMember;
findFocusPreferredByOtherMember--->|Found|callBestPreferred;
callBestPreferred--->publishActive;
findFocusPreferredByOtherMember--->|Not found|callBestActive;
callBestActive--->publishActive;
```
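
The flowchart can also be read as a small decision function. The Python sketch below returns the list of actions a client would take, under these assumptions: foci are `(user_id, device_id)` tuples, `ordered_foci` has already been sorted from best to worst, and the action names are hypothetical labels for the flowchart boxes. The case where no suitable focus exists at all is not covered by the diagram, so the sketch merely reports it.

```python
def join_call_actions(own_preferred, member_count, ordered_foci,
                      preferred_by_others, active_in_room):
    """Return the actions for joining a call, following the flowchart.

    own_preferred:       foci this client prefers
    member_count:        call members, including this client
    ordered_foci:        foci from `m.call.member` events, best first
    preferred_by_others: foci preferred by at least one other member
    active_in_room:      foci other members list as active
    """
    actions = []
    if own_preferred:
        # Call preferred foci without media to grab a slot, then publish them.
        actions.append(("call_without_media", tuple(own_preferred)))
        actions.append(("publish", "m.foci.preferred"))
    if member_count <= 2:
        actions.append(("call_full_mesh", None))
        return actions
    # Best focus which some other member prefers, else best active focus.
    chosen = next((f for f in ordered_foci if f in preferred_by_others), None)
    if chosen is None:
        chosen = next((f for f in ordered_foci if f in active_in_room), None)
    if chosen is None:
        # Not specified by the flowchart; surfaced here for the caller.
        actions.append(("no_focus_found", None))
        return actions
    actions.append(("call_focus", chosen))
    actions.append(("publish", "m.foci.active"))
    return actions


LON = ("@sfu-lon:matrix.org", "FS5F589EF")
BON = ("@sfu-bon:matrix.org", "3FSF589EF")
example = join_call_actions([BON], 3, [LON, BON], {BON}, {LON})
```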

#### Mid-call changes

Once in a call, the client listens for changes to `m.call.member` state events
and if another member starts using one of the client's preferred foci, the client
switches to that focus.
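
A minimal Python sketch of this check, assuming the client keeps the other members' `m.devices` entries from their latest `m.call.member` events. The function name `focus_to_switch_to` is hypothetical, and taking the first match in the client's own preference order is an assumption of this sketch.

```python
def focus_to_switch_to(own_preferred, own_active, other_member_devices):
    """Return the first preferred focus another member is using, else None.

    own_preferred / own_active: lists of {"user_id", "device_id"} dicts
    other_member_devices:       `m.devices` entries of the other members
    """
    in_use = set()
    for device in other_member_devices:
        for f in device.get("m.foci.active", []):
            in_use.add((f["user_id"], f["device_id"]))
    for f in own_preferred:
        already_active = f in own_active
        if (f["user_id"], f["device_id"]) in in_use and not already_active:
            return f
    return None


lon = {"user_id": "@sfu-lon:matrix.org", "device_id": "FS5F589EF"}
bon = {"user_id": "@sfu-bon:matrix.org", "device_id": "3FSF589EF"}
others = [{"device_id": "ABCDEFGHIJ", "m.foci.active": [bon]}]
target = focus_to_switch_to([bon], [lon], others)
```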

### Initial offer/answer dance

During the initial offer/answer dance, the client establishes a data-channel
between itself and the SFU to use later for rapid signalling.

### Simulcast

#### RTP munging

#### vp8 munging

### RTCP re-transmission

### Data-channel messaging

The client uses the established data channel connection to the SFU to perform
low-latency signalling: to rapidly (un)subscribe to and (un)publish streams,
send keep-alive messages and metadata, cascade, and perform re-negotiation.

- **TODO: It feels like these ought to be `m.` namespaced**
- **TODO: Why `op` instead of `type`?**
- **TODO: It feels like these ought to have `content` rather than being on the
same layer**
- **TODO: Spell out how the DC traffic interacts with application-layer
traffic**

#### SDP Stream Metadata extension

The client will be receiving multiple streams from the SFU and will need to be
able to distinguish them. This MSC therefore builds on
[MSC3077](https://github.com/matrix-org/matrix-spec-proposals/pull/3077) and
[MSC3291](https://github.com/matrix-org/matrix-spec-proposals/pull/3291) to
provide the client with the necessary metadata. Some of the data-channel events
include a `metadata` field containing a description of the stream being sent
either from the SFU to the client or from the client to the SFU.

Other than mute information and stream purpose, the metadata includes video
track resolution. The SFU may not be able to determine the resolution of the
track itself but it does need to know for simulcast; therefore, we include this
in the metadata.

```json
{
    "streamId1": {
        "purpose": "m.usermedia",
        "audio_muted": false,
        "video_muted": true,
        "tracks": {
            "trackId1": {
                "width": 1920,
                "height": 1080
            },
            "trackId2": {}
        }
    }
}
```
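
To show how a receiver might consume this structure, here is a small Python sketch that extracts the declared resolution of each track. It assumes, as in the example above, that a track without `width` and `height` (such as an audio track) is represented by an empty object. The function name `video_resolutions` is hypothetical.

```python
def video_resolutions(metadata: dict) -> dict:
    """Map (stream_id, track_id) to (width, height) where declared."""
    result = {}
    for stream_id, stream in metadata.items():
        for track_id, track in stream.get("tracks", {}).items():
            if "width" in track and "height" in track:
                result[(stream_id, track_id)] = (track["width"], track["height"])
    return result


metadata = {
    "streamId1": {
        "purpose": "m.usermedia",
        "audio_muted": False,
        "video_muted": True,
        "tracks": {
            "trackId1": {"width": 1920, "height": 1080},
            "trackId2": {},
        },
    }
}
resolutions = video_resolutions(metadata)
```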

#### Event types

##### Subscribe

This event is sent by the client to request a set of tracks. In the case of
video tracks, the client can also request a specific resolution for a given
track; this is the resolution the client wishes to receive, but the SFU may send
a lower one, for example due to bandwidth constraints.

If the user switches, for example, from "spotlight" (one large tile) to "grid"
(multiple small tiles) view, the client should also send this request to let the
SFU know of the resolution change.

- **TODO: how do we prove to the SFU that we have the right to subscribe to
track?**

```json
{
    "op": "subscribe",
    "start": [
        {
            "stream_id": "streamId1",
            "track_id": "trackId1",
            "width": 1920,
            "height": 1080
        }
    ]
}
```

Review comments on the `track_id` line:

Btw, do we really need both track ID and stream ID for the SFU use case?

The track IDs that browsers generate seem to be GUIDs that are unique enough (i.e. it's unlikely that there would be 2 tracks with the same GUID). Does this mean that instead of using two values, we could just send the track_id? (the server anyway always knows the stream ID of each track).

Contributor Author

I don't have any strong arguments here. @ara4n and @dbkr, do you have any thoughts?

---

To explain the reasoning here. When browsing the Pion docs, I noticed that StreamID is said to be unique only within a single peer connection (but not globally), while TrackID is meant to be unique within a stream, but not globally. This means that in the case of the SFU, the combination of StreamID and TrackID as per Pion would not be enough to uniquely identify a track, so initially, I was worried that our implementation is not correct. However, when checking what the browsers actually send as TrackID and StreamID, I noticed that both are randomly generated, with TrackID being a GUID (it's also not that far off from the official spec). If the TrackID is a GUID, then it would be enough to use the GUID as an identifier for tracks when subscribing to or unsubscribing from tracks. The rest (StreamID) would anyway be known to the client once the subscription is completed, since they will get the remote track along with its stream ID.

Also, using only the TrackID would allow us to support streamless tracks that may potentially exist in the MatrixRTC use case.

##### Unsubscribe

```json
{
    "op": "unsubscribe",
    "stop": [
        {
            "stream_id": "streamId1",
            "track_id": "trackId1"
        }
    ]
}
```
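
The subscribe and unsubscribe messages can be assembled with a few lines of Python. This sketch assumes that each entry of `start` and `stop` is a JSON object holding `stream_id` and `track_id` (plus `width` and `height` for video), which is an interpretation of the examples rather than something the MSC spells out. The helper names are hypothetical, and actually sending the string over the data channel is left out.

```python
import json


def track_ref(stream_id, track_id, width=None, height=None) -> dict:
    """Reference one track, optionally with the resolution wanted."""
    ref = {"stream_id": stream_id, "track_id": track_id}
    if width is not None and height is not None:
        ref["width"] = width
        ref["height"] = height
    return ref


def subscribe_message(tracks: list) -> str:
    """Serialise a `subscribe` op for the given track references."""
    return json.dumps({"op": "subscribe", "start": tracks})


def unsubscribe_message(tracks: list) -> str:
    """Serialise an `unsubscribe` op for the given track references."""
    return json.dumps({"op": "unsubscribe", "stop": tracks})


# Switching from spotlight to grid: ask for the same track at a lower
# resolution (the 640x360 figure is only an example value).
grid_request = subscribe_message([track_ref("streamId1", "trackId1", 640, 360)])
stop_request = unsubscribe_message([track_ref("streamId1", "trackId1")])
```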

##### Publish

##### Unpublish

##### Offer

##### Answer

##### Metadata

```json
{
    "op": "metadata",
    "metadata": {...} // As specified in the Metadata section
}
```

##### Keep-alive

```json
{
    "op": "alive"
}
```

##### Connect

If a user is using their SFU in a call, that SFU will need to know how to
connect to other SFUs present in order to participate in the full-mesh of SFU
traffic (if any). The client is responsible for providing this information using
the `connect` op.

@daniel-abramov daniel-abramov Dec 2, 2022

Do we want to specify the cascading specific logic in this MSC or would it be better to make a separate MSC for cascading?

Rationale: if we have a dedicated MSC for the SFU, we'll be able to finalize and merge it faster to master. Iterating with small MSCs might be a better idea given the amount of time it normally takes until the MSC is merged? (just a gut feeling)

Contributor Author

The issue is that the event fields used in a single focus case are quite different from the cascading case. I wonder if there is a way to avoid that issue

---

Sorry, I did not fully get what you mean.

I think the reason why I initially commented is that it seems like we're not going to have the cascading implemented in the very nearest future (currently we don't really support it), so I thought maybe it would be faster to limit this MSC to the SFU and then create a cascading MSC after that (once we have a stable SFU). I was just afraid that otherwise the MSC would stay open (or in a draft state) for too long.

Contributor Author

I agree with that but I am not sure how to technically handle this - the MSC currently specifies an SFU selection algorithm and the fields it uses; if we wanted to split the MSC into two, we would need two completely different ways to specify the SFU, I think...


```json
{
    "op": "connect"
    // TODO: How should this look?
}
```

### Encryption

When SFUs are on the media path, they will necessarily terminate the SRTP
traffic from the peer, breaking E2EE. To address this, we apply an additional
end-to-end layer of encryption to the media using [WebRTC Encoded
Transform](https://github.com/w3c/webrtc-encoded-transform/blob/main/explainer.md)
(formerly Insertable Streams) via
[SFrame](https://datatracker.ietf.org/doc/draft-omara-sframe/).

In order to provide PFS, the symmetric key used for these streams from a given
participating device is a megolm key. Unlike a normal megolm key, this is shared
via `m.room_key` over Olm to the devices participating in the conference
including an `m.call_id` and `m.room_id` field on the key to correlate it to the
conference traffic, rather than using the `session_id` event field to correlate
(given the encrypted traffic is SRTP rather than events, and we don't want to
have to send fake events from all senders every time the megolm session is
replaced).

The megolm key is ratcheted forward for every SFrame, and shared with new
participants at the current index via `m.room_key` over Olm as per above. When
participants leave, a new megolm session is created and shared with all
participants over Olm. The new session is only used once all participants have
received it.

### Notes

#### Hiding behind foci

We do not recommend that users utilise a focus to hide behind for privacy, but
instead use a TURN server, only providing relay candidates, rather than
consuming focus resources and unnecessarily mandating the presence of a focus.

## Potential issues

The SFUs participating in a conference end up in a full mesh. Rather than
inventing our own spanning-tree system for SFUs however, we should fix it for
Matrix as a whole (as is happening in the LB work) and use a Pinecone tree or
similar to decide what better-than-full-mesh topology to use. In practice, full
mesh cascade between SFUs is probably not that bad (especially if SFUs only
request the streams over the trunk their clients care about) - and on aggregate
will be less obnoxious than all the clients hitting a single SFU.

Too many foci will chew bandwidth due to full-mesh between them. In the worst
case, if every user is on their own HS and picks a different focus, it
degenerates to a full-mesh call (just server-side rather than client-side).
Hopefully this shouldn't happen as you will converge on using a single SFU with
the most clients, but we need to check how this works in practice.

SFrame currently mandates its own ratchet, which is almost the same as megolm
but not quite. Switching it out for megolm seems reasonable right now (at least
until MLS comes along).

## Alternatives

An option would be to treat 1:1 (and full mesh) entirely differently to SFU
based calling rather than trying to unify them. Also, it's debatable whether
supporting full mesh is useful at all. In the end, it feels like unifying 1:1
and SFU calling is for the best though, as it then gives you the ability to
trivially upgrade 1:1 calls to group calls and vice versa, and avoids
maintaining two separate hunks of spec. It also forces 1:1 calls to take
multi-stream calls seriously, which is useful for more exotic capture devices
(stereo cameras; 3D cameras; surround sound; audio fields etc).

### Cascading

One option here is for SFUs to act as an AS and sniff the `m.call.member`
traffic of their associated server, and automatically call any other `m.foci`
which appear. (They don't need to make outbound calls to clients, as clients
always dial in).

## Security considerations

Malicious users could try to DoS SFUs by specifying them as their foci.

Contributor Author

Are SFUs not (by default, with an option to the admin/operator to open it up) authenticated using one's matrix account? Shouldn't they be?
The cascaded decentralized SFU concept appears to be that there is one focus associated with each homeserver. Hence I would expect that I can ever only access my hs's SFU(s).

(by @HarHarLinks from #3401 (comment))

Contributor

As I learn more about this topic, foci seem not to be authenticated.
As a server admin, I would like it if not just anyone could use the focus I host. It would appear logical to allow only users of one or more associated homeservers and, at most, also temporarily their remote call members if the algorithm deems the focus favourable.


SFrame E2EE may go horribly wrong if we can't send the new megolm session fast
enough to all the participants when a participant leaves (and meanwhile if we
keep using the old session, we're technically leaking call media to the parted
participant until we manage to rotate).

Need to ensure there's no scope for media forwarding loops through SFUs.

In order to authenticate that only legitimate users are allowed to subscribe to
a given `conf_id` on an SFU, it would make sense for the SFU to act as an AS and
sniff the `m.call` events on their associated server, and only act on to-device
`m.call.*` events which come from a user who is confirmed to be in the room for
that `m.call`. (In practice, if the conf is E2EE then it's of limited use to
connect to the SFU without having the keys to decrypt the traffic, but this
feature is desirable for non-E2EE confs and to stop bandwidth DoS)
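
The membership check described above could look roughly like the following Python sketch. It assumes the SFU has already built, from the `m.call` events it sees as an AS, a mapping from `conf_id` to the set of users confirmed to be in the corresponding room. The event layout (`type`, `sender` and `content.conf_id`) and the function name `should_act_on` are assumptions made for illustration only.

```python
def should_act_on(event: dict, members_by_conf: dict) -> bool:
    """Accept only to-device `m.call.*` events from confirmed room members.

    members_by_conf maps a conf_id to the set of user IDs known to be in
    the room of that conference.
    """
    if not event.get("type", "").startswith("m.call."):
        return False
    conf_id = event.get("content", {}).get("conf_id")
    members = members_by_conf.get(conf_id)
    return members is not None and event.get("sender") in members


members_by_conf = {"conf1": {"@alice:example.org", "@bob:example.org"}}
invite = {
    "type": "m.call.invite",
    "sender": "@alice:example.org",
    "content": {"conf_id": "conf1"},
}
allowed = should_act_on(invite, members_by_conf)
```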

## Unstable prefixes

We probably don't care for this for the data-channel?