Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MSC2326: Label based filtering #2326

Open
wants to merge 22 commits into
base: old_master
Choose a base branch
from
Open

MSC2326: Label based filtering #2326

wants to merge 22 commits into from

Conversation

ara4n
Copy link
Member

@ara4n ara4n commented Oct 22, 2019

@ara4n ara4n added the proposal A matrix spec change proposal label Oct 22, 2019
proposals/2326-label-based-filtering.md Outdated Show resolved Hide resolved
proposals/2326-label-based-filtering.md Outdated Show resolved Hide resolved
proposals/2326-label-based-filtering.md Outdated Show resolved Hide resolved
proposals/2326-label-based-filtering.md Outdated Show resolved Hide resolved
proposals/2326-label-based-filtering.md Outdated Show resolved Hide resolved
@richvdh
Copy link
Member

richvdh commented Oct 22, 2019

Seems generally sane to me.

Copy link
Contributor

@Half-Shot Half-Shot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love this idea and it will really help with things like Zulip bridging. My concerns are mostly around privacy/metadata leaks but I think we have a sane solution in the comments.

@turt2live turt2live self-requested a review October 28, 2019 19:36
@babolivier babolivier dismissed Half-Shot’s stale review October 28, 2019 22:03

Concerns should have been addressed.

@dkasak
Copy link
Member

dkasak commented Oct 29, 2019

Some more potential problems:

  • No way to list all threads in a room without crawling the entire room
    history.

    This could be mitigated by having the client crawl the room and
    remembering/indexing the labels. Since this is also necessary for E2EE
    search, it seems like a feasible solution.

  • The described scheme makes it difficult to build rainbow tables, but since
    # labels will commonly be natural language (not random data), it will still
    be quite simple to leak the labels from label hashes in most cases. For
    instance, an attacker could just take a list of the 10000 most common English
    words and hash them on the fly, probably uncovering most labels. Even
    producing simple variations (word_word, wordword, WordWord,
    Word_Word, etc) and checking them all is feasible on modern hardware.

    What if the label hash was peppered and the pepper stored in the encrypted
    part of the labelled event? The hash would then be calculated as Hash(label
    || roomid || pepper) instead of Hash(label || roomid), e.g.
    #fun!someroom:example.com!PEPPER.

@babolivier
Copy link
Contributor

Hey @dkasak, thanks for your feedback!

Some more potential problems:

  • No way to list all threads in a room without crawling the entire room
    history.

    This could be mitigated by having the client crawl the room and
    remembering/indexing the labels. Since this is also necessary for E2EE
    search, it seems like a feasible solution.

FWIW crawling the entire room history isn't always a realistic expectation to have from clients, especially in big and old rooms with thousands of messages and hundreds of servers.
Another solution for this would have to spec a federation endpoint to ask each server in the room for the list of labels (or hashes) it knows about for this room, but this is also bound to create issues in big and old rooms.
I'd rather have this feature of listing every label in a room as a non-spec'd one which existence and behaviour are left as an implementation detail, in order to keep this MSC simple. If there is a need for it to be in the spec, it can also be done in a separate MSC.

  • The described scheme makes it difficult to build rainbow tables, but since
    # labels will commonly be natural language (not random data), it will still
    be quite simple to leak the labels from label hashes in most cases. For
    instance, an attacker could just take a list of the 10000 most common English
    words and hash them on the fly, probably uncovering most labels. Even
    producing simple variations (word_word, wordword, WordWord,
    Word_Word, etc) and checking them all is feasible on modern hardware.

    What if the label hash was peppered and the pepper stored in the encrypted
    part of the labelled event? The hash would then be calculated as Hash(label
    || roomid || pepper) instead of Hash(label || roomid), e.g.
    #fun!someroom:example.com!PEPPER.

Although I agree with your point, and like your solution, the whole point of using hashes instead of e.g. random opaque strings is to ensure that two clients can only use the same identifier for the same label text. Bearing in mind that we need to be able to use the label identifiers to filter the history of the room server-side (because we're not expecting clients to know about the whole history of the room, see my first point above), opaque strings had the following downsides, all originating from the fact that nothing would prevent 1000 clients from using each a different identifier:

  • filtering would have serious performances issues in E2EE rooms, as the server would need to return all events it knows about which label identifier is any of the 1000 identifiers provided by the client, which is quite expensive to do (even when using PostgreSQL with the right indexing strategy).
  • it would be impossible for a filtered /message (or /sync) request to include every event matching the desired label because we can't expect a client to know about every identifier that has been used in the whole history of the room, or about the fact that another client might suddenly decide to use another identifier for the same label text, and include those identifiers in its filtered request.

Although what your proposing isn't using a random opaque string, it would still end up having different identifiers for a same label, which would end up causing the issues with filtering I've already mentioned. This hash(label + room_id) solution is the best compromise we've come up with between enabling filtering and hiding the actual label as much as possible from servers.

I realise that this explanation belongs to an "Alternative solutions" section of the MSC which I've actually forgotten to add, I'll update the proposal with it.

Copy link
Contributor

@Half-Shot Half-Shot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notes about the hash

proposals/2326-label-based-filtering.md Outdated Show resolved Hide resolved
proposals/2326-label-based-filtering.md Show resolved Hide resolved
@dkasak
Copy link
Member

dkasak commented Oct 29, 2019

@Half-Shot:

It's important to note here that the ID + text label is used to mainain
consistency of the key without having access to history. E.g. a random new
joiner without history could compute the same hash for the same label as
another client.

This was indeed the part I was missing, but I'm less clear on why this ability
is useful.

I suppose the intent is to enable users wanting to talk about (e.g.) cats to
simply start sending messages labelled with #cats. However, the room might
have an existing thread for talking about cats -- yet it is called #Cats or
#cats![*] but the user is unaware of them. Furthermore, the new #cats
thread might also be entirely contextually unrelated to the old one (except
sharing a name coincidentally), but a user has no way of knowing this without
the ability to inspect the old thread's content.

In essence, continuing an old thread seems primarily useful when you have the
ability to access the old content. But if you can access the old content, then
you also have access to the corresponding pepper to continue that particular
thread.

I suppose the desired scenario might be to make labelling a message with
a particular label (or, to put it differently, putting the message in
a particular "bin") cheap, while making only the retrieval potentially costly
(due to having to walk the room's history). Is this the design goal you had in
mind? In that case, I can see how the unpeppered system provides an advantage.
However, I still consider this ability to be of limited value if one is unable
to inspect either the set of all labels or the content behind them.

My main concern is that just hashing the labels has a rather large downside of
simply obscuring them from casual glances but doing nothing to truly hide them.
They are effectively public information, but this is made unobvious by the fact
that they are hashed.

[*]: Which makes me think of another problem: is ! disallowed in labels due
to it also being used as a separator? More generally, which characters are
allowed?

@dkasak
Copy link
Member

dkasak commented Oct 29, 2019

[*]: Which makes me think of another problem: is ! disallowed in labels due
to it also being used as a separator? More generally, which characters are
allowed?

On second thought, there's probably no problem with any character here since there's no parsing involved.

@babolivier
Copy link
Contributor

I suppose the intent is to enable users wanting to talk about (e.g.) cats to
simply start sending messages labelled with #cats. However, the room might
have an existing thread for talking about cats -- yet it is called #Cats or
#cats![*] but the user is unaware of them. Furthermore, the new #cats
thread might also be entirely contextually unrelated to the old one (except
sharing a name coincidentally), but a user has no way of knowing this without
the ability to inspect the old thread's content.

All of this is a result of the fact that we can't realistically expect a client to know about the entire history of every room it's in (which is why we're doing the filtering on the server side btw), rather than the solution described in this MSC.

In essence, continuing an old thread seems primarily useful when you have the
ability to access the old content. But if you can access the old content, then
you also have access to the corresponding pepper to continue that particular
thread.

I suppose the desired scenario might be to make labelling a message with
a particular label (or, to put it differently, putting the message in
a particular "bin") cheap, while making only the retrieval potentially costly
(due to having to walk the room's history). Is this the design goal you had in
mind? In that case, I can see how the unpeppered system provides an advantage.

The MSC aims at allowing both scenarios rather than one to the detriment of the other. We want both to allow users to continue a thread à la Slack if their client can access one of the messages in the thread, and to be able to show conversation topics à la Zulip alongside messages in the timeline. To our (that would be mine and @Half-Shot's) knowledge, the solution that's currently described is the only one that allows both to coexist, because we can't rely on the client to know about every event, thus every possible label in a room.

If anyone has an idea for a better solution that allows for both scenarios, I'd love to hear about it.

However, I still consider this ability to be of limited value if one is unable
to inspect either the set of all labels or the content behind them.

Again, that is not an issue with this MSC, and this MSC doesn't aim to solve it.

My main concern is that just hashing the labels has a rather large downside of
simply obscuring them from casual glances but doing nothing to truly hide them.

My take on this is that the current solution makes it as hard as possible for rogue servers to figure out the labels on an encrypted event while allowing the features described to work. I agree on the fact that it's not a perfectly secure solution, and again, if anyone knows about a better one I'd be more than happy to hear about it.

They are effectively public information, but this is made unobvious by the fact
that they are hashed.

The labels are not public information in E2EE rooms, which is the only scenario in which hashing happens (and is also the reason hashing happens).

@turt2live
Copy link
Member

(please move to threads so the conversation can actually be followed)

@dkasak
Copy link
Member

dkasak commented Oct 29, 2019

Again, that is not an issue with this MSC, and this MSC doesn't aim to solve it.

I realized I haven't said so earlier, so just to be sure: I'm not advocating for this MSC to gain the ability to list all labels.

The labels are not public information in E2EE rooms, which is the only scenario in which hashing happens (and is also the reason hashing happens).

Yes, sorry, my wording was not the best. What I meant is that the server will be able to reverse the hashes (and hence read the labels) easily in a very large number of cases, if it wants to. I wouldn't had considered this to be the case for E2EE rooms intuitively, had I not seen this MSC.

That is, unless people start naming their threads with peppers embedded in them. ;)

(also, sorry, I didn't understand Travis' comment; is there a Github threading feature I'm not aware of or was that tongue-in-cheek?)

@turt2live
Copy link
Member

Line comments on the diff rather than using the comment box here are threads. Trying to track down what people are talking about here is way too difficult, which is why we require people to use threads.


Therefore, this is a simpler proposal to allow messages in a room to be filtered
based on a given label in order to give basic one-layer-deep threading
functionality.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still, the 'reply to an existing message' is an important use case.

A message cannot be replied to unless it is already labelled. Regular users cannot add labels to messages they did not author. Should a (unique) label to otherwise unlabelled messages be required?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that like replying to the message based on message ID?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#1849 might be better suited for this kind of thing since a client could directly request the message that is referenced... Not sure though.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ptman I was thinking subsequent messages could reuse the existing label. Basically avoid nested threads.

AFAICT #1849 proposes a strictly one-way relation (i.e. child points to parent). Wouldn't that make it expensive to list the thread starting from a particular message? Each subsequent message would have to be determined by reverse look-up.

Copy link
Contributor

@ptman ptman May 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally prefer hierarchical threads, but either way, regardless of the how the relationships are recorded, they can be shown flat, just like apple mail and gmail do.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, but I was under the impression that this proposal specifically aims for flat threads for the sake of simplicity.

This is done by new `labels` and `not_labels` fields to the `EventFilter`
object, which specifies a list of labels to include or exclude in the given
filter.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be something for another MSC down the road, but would it make sense to have power level restrictions for labels? For example, I may want to have an announcements label in my room that only bots and I can post in. For example, a GitHub status label could be useful to filter out announcement spam from the main room conversation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an interesting thought, but is probably out of the scope of this MSC imho


When a user wants to filter a room to given label(s), it defines a filter for
use with `/sync`, `/context`, `/search` or `/messages` to limit appropriately.
This is done by new `labels` and `not_labels` fields to the `EventFilter`
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, this doesn't seem to make it possible to expose to the user which labels are present in a room (since the client would have to scan all of the events in a room for labels). IMO, it would be poor UX to not have available labels visible. I suppose there could be two ways of doing this:

  1. Have the server scan each event for labels and keep track of them
  2. Add a state event (ex, m.labels) where the state key is set to a label and the content could contain attributes like description or label color (if desired for UX). IMO this would be the cleanest option.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a proposal about an endpoint to get the list of labels available in a given room here: #2326 (comment), but haven't gotten around to add it to the proposal yet.

maintaining one rainbow table for each encrypted room it's in, which would
probably make it not worth the trouble.

## Alternative solutions
Copy link
Member Author

@ara4n ara4n Sep 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just rereading this proposal, I think @dkasak's points on the main thread are legitimate: that hashing the labels give a very false sense of security here. Given how strong our e2ee is, folks will assume opaque labels are actually encrypted, rather than just obfuscated by a hash which can be easily rainbow-tabled.

Personally, I think it'd be fine to add a pepper to the hashed events, and require at the application level that for labels to work in encrypted rooms, the new user must be brought up to speed on the pepper (e.g. by the inviter sharing the pepper in an encrypted message, possibly to-device, after having invited them).

This is simpler than using opaque IDs for the unencrypted event headers, as there's only one pepper that needs to be shared to new users, rather than the whole set of opaque->real label mappings.

@ara4n
Copy link
Member Author

ara4n commented Sep 27, 2020

@mscbot concern hashing the labels in encrypted rooms provides a very false sense of security. i think we should add a pepper to the hash, which can be sent to new users in an encrypted msg when they're invited to the room (if the new users need to know about the existing labels).

@mscbot mscbot added the unresolved-concerns This proposal has at least one outstanding concern label Mar 9, 2021
@richvdh
Copy link
Member

richvdh commented Apr 6, 2021

This seems to have lots of outstanding concerns, and is generally looking a bit unloved, so I don't think it's ready for FCP generally.

@mscbot fcp cancel

@mscbot mscbot added proposal-in-review and removed disposition-merge proposed-final-comment-period Currently awaiting signoff of a majority of team members in order to enter the final comment period. labels Apr 6, 2021
@turt2live turt2live added the needs-implementation This MSC does not have a qualifying implementation for the SCT to review. The MSC cannot enter FCP. label Jun 8, 2021
@erlend-sh erlend-sh mentioned this pull request Aug 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:core MSC which is critical to the protocol's success needs-implementation This MSC does not have a qualifying implementation for the SCT to review. The MSC cannot enter FCP. proposal A matrix spec change proposal unresolved-concerns This proposal has at least one outstanding concern
Projects
None yet
Development

Successfully merging this pull request may close these issues.