WIP: Peer Discovery DEP #7

Open
wants to merge 7 commits into
base: master
162 changes: 162 additions & 0 deletions proposals/0000-peer-discovery.md
@@ -0,0 +1,162 @@

Title: **DEP-0000: Peer Discovery**

Short Name: `0000-peer-discovery`

Type: Informative

Status: Undefined (as of 2018-02-06)

Github PR: (add HTTPS link here after PR is opened)

Authors: [Paul Frazee](https://github.com/pfrazee)


# Summary
[summary]: #summary

An important aspect of Dat's networking is peer discovery: the techniques that peers use to find each other. Peer discovery means finding the IP address and port of online data sources that have a copy of the data you are looking for. You can then connect to them and begin exchanging data. By using peer discovery techniques, Dat is able to create a network where data can be discovered even if the original data source disappears.

Peer discovery can happen over many kinds of networks, as long as you can model the following actions:
Contributor

I'm not sure how I feel about encoding these function signatures here. I did something similar in the hyperdb DEP, but here it feels really specific to the existing dat implementation. Maybe just describe what semantics are necessary for lookup, and what additional semantics are necessary to announce/cancel membership or subscribe to a feed of peer updates?


- `join(key, [port])` - Begin performing regular lookups on an interval for `key`. Specify `port` if you want to announce that you share `key` as well.
- `leave(key, [port])` - Stop looking for `key`. Specify `port` to stop announcing that you share `key` as well.
- `foundpeer(key, ip, port)` - Called when a peer is found by a lookup.
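
As a rough illustration of the three actions listed above, here is a minimal Python sketch. The class names `DiscoveryNetwork` and `InMemoryNetwork`, the callback type, and the loopback address are all hypothetical and are not part of any Dat implementation. The toy network does a single lookup on `join` where a real one would repeat it on an interval.

```python
from abc import ABC, abstractmethod
from typing import Callable, Optional

# Callback invoked as on_peer(key, ip, port) whenever a lookup finds a peer.
PeerCallback = Callable[[bytes, str, int], None]


class DiscoveryNetwork(ABC):
    """Hypothetical interface; names mirror the list above, not a real API."""

    def __init__(self, on_peer: PeerCallback) -> None:
        self.on_peer = on_peer

    @abstractmethod
    def join(self, key: bytes, port: Optional[int] = None) -> None:
        """Start looking up key; also announce it if port is given."""

    @abstractmethod
    def leave(self, key: bytes, port: Optional[int] = None) -> None:
        """Stop looking up key; also stop announcing it if port is given."""


class InMemoryNetwork(DiscoveryNetwork):
    """Toy network for illustration: announcements live in a local dict."""

    def __init__(self, on_peer: PeerCallback) -> None:
        super().__init__(on_peer)
        self.announced = {}  # key -> set of (ip, port)
        self.joined = set()

    def join(self, key, port=None):
        self.joined.add(key)
        if port is not None:
            self.announced.setdefault(key, set()).add(("127.0.0.1", port))
        # A real network would repeat this lookup on an interval.
        for ip, peer_port in sorted(self.announced.get(key, ())):
            self.on_peer(key, ip, peer_port)

    def leave(self, key, port=None):
        self.joined.discard(key)
        if port is not None:
            self.announced.get(key, set()).discard(("127.0.0.1", port))
```

Keeping `foundpeer` as a callback rather than a return value matches the description of lookups that keep running after `join` returns.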

In the Dat implementation we implement the above actions on top of three types of discovery networks:

- Multicast DNS - Useful for discovering peers on local networks
- DNS name servers - An Internet standard mechanism for resolving keys to addresses
- Kademlia Mainline Distributed Hash Table - Fewer central points of failure; increases the probability of Dat working even if DNS servers are unreachable

Additional discovery networks can be implemented as needed. We chose the above three as a starting point to have a complementary mix of strategies to increase the probability of source discovery.


# Peer discovery methods
[peer-discovery-methods]: #peer-discovery-methods

Dat uses multiple discovery networks, to provide redundancy and to suit differing network needs. There is no restriction on which discovery solutions are allowed, but at time of writing there are three in active use.


## Multicast DNS
[multicast-dns]: #multicast-dns

Multicast DNS (mDNS) resolves host names to IP addresses within small networks without a local name server. It is a zero-configuration service, using essentially the same interfaces, packet formats and operating semantics as unicast DNS. The mDNS protocol is published as [RFC 6762](https://tools.ietf.org/html/rfc6762) and is built on multicast UDP.

Dat treats Hypercore public keys as domain names on the mDNS protocol. Therefore, peer discovery is an IP lookup for a given public key name. Currently the public key is encoded to hex and truncated to 40 bytes. The domain name format used is:


It is not the public key, but the hypercore discovery key. This is an important detail. We never expose the public key over the network, as it is used as a capability


It's truncated because dns subdomains can at most be 63 chars also

Contributor

I'm pretty sure we use discovery keys (not public keys) for all discovery mechanisms.

The discovery key is a BLAKE2b "keyed hash" of the string "hypercore" using the public key (32 bytes), described in the wire protocol DEP (WIP).


```
{PUBKEY}.dat.local
```

Dat uses the `TXT` record type. A query is submitted as a simple `TXT` query for `{PUBKEY}.dat.local`. The response provides a peer-listing which will only include the local node, if it is actively hosting the requested Hypercore.


{DISCOVERY_KEY}.dat.local
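
Taking the review comments above at face value, the name could be derived roughly as follows. This is a sketch under assumptions: the 32-byte digest size, the helper names `discovery_key` and `mdns_name`, and reading "truncated to 40" as 40 hex characters are not confirmed by this draft.

```python
import hashlib


def discovery_key(public_key: bytes) -> bytes:
    """BLAKE2b keyed hash of b"hypercore", keyed with the 32-byte public key.

    The 32-byte digest size is an assumption.
    """
    if len(public_key) != 32:
        raise ValueError("expected a 32-byte public key")
    return hashlib.blake2b(b"hypercore", digest_size=32, key=public_key).digest()


def mdns_name(public_key: bytes, hex_chars: int = 40) -> str:
    """Hex-encode the discovery key, truncate it, and append .dat.local."""
    return discovery_key(public_key).hex()[:hex_chars] + ".dat.local"
```

Because only the hash appears in the name, someone observing mDNS traffic learns the discovery key but not the public key itself.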



### TXT data encoding
[dns-txt-data-encoding]: #dns-txt-data-encoding

TXT record data is encoded as key/values using [RFC 6763](https://tools.ietf.org/html/rfc6763#section-6) DNS-SD encoding.

Peer listings are a base64-encoded buffer of 6-byte peer items. Each peer item is packed as follows:

```
{4 bytes: IPv4 address}{2 bytes: port}
```

I think it's big endian port
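
The 6-byte packing can be sketched in Python as below. The big-endian port follows the reviewer's tentative note and is an assumption, as are the function names `encode_peers` and `decode_peers`.

```python
import base64
import socket
import struct


def encode_peers(peers):
    """Pack (ipv4, port) pairs into 6-byte items and base64-encode the result."""
    packed = b"".join(
        socket.inet_aton(ip) + struct.pack(">H", port)  # ">H": big-endian (assumed)
        for ip, port in peers
    )
    return base64.b64encode(packed).decode("ascii")


def decode_peers(text):
    """Inverse of encode_peers; rejects buffers that are not a multiple of 6 bytes."""
    raw = base64.b64decode(text)
    if len(raw) % 6 != 0:
        raise ValueError("peer buffer length must be a multiple of 6")
    return [
        (socket.inet_ntoa(raw[i:i + 4]), struct.unpack(">H", raw[i + 4:i + 6])[0])
        for i in range(0, len(raw), 6)
    ]
```

A listing with two peers decodes to 12 raw bytes, so the item count can be recovered from the buffer length alone.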


## DNS name servers
[dns-name-servers]: #dns-name-servers

While mDNS is effective for tracking and discovering peers on the Local Area Network, it does not work for the global Internet. For that, Dat's solution is to use DNS name servers with custom behaviors. These servers are maintained by the Dat protocol Working Group members, but may be reconfigured to use other servers.

The DNS protocol queries serve to look up peers, announce swarm membership, and subscribe to push-updates. To interact with a DNS name server, a client must first "probe" the server for a session token. This is described in the "Session token exchange" section.

Many of the details of DNS discovery are shared with mDNS, including the TXT data encoding (see above). At time of writing, the DNS and mDNS discovery tools are implemented in one codebase in the active Dat implementation.


### Session token exchange
[dns-name-server-session-token-exchange]: #dns-name-server-session-token-exchange

DNS is built on UDP, a connectionless protocol with no notion of a session. Because the DNS peer discovery protocol involves registration for future messages, it's important that the DNS server verifies the IP of a registrant. Otherwise, a malicious peer could spoof its IP in order to register other devices for receiving messages, leading to potential DoS attacks.

To verify the addresses of clients, the DNS discovery protocol uses a session token exchange. All clients must first request a token before sending protocol messages. The server will generate the token using the following algorithm:

```
sha256(secret + client-address)
```

personal note that we should use blake2b here instead

The secret should be generated from a cryptographically secure random source.
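
A minimal Python sketch of the token formula above. The draft does not say how `client-address` is serialized or how the digest is encoded, so the ASCII address string and hex output here are assumptions, as are the function names.

```python
import hashlib
import hmac
import secrets


def new_secret() -> bytes:
    """Generate a fresh server secret from the OS random source."""
    return secrets.token_bytes(32)


def make_token(secret: bytes, client_address: str) -> str:
    """sha256(secret + client-address), hex-encoded (encoding is assumed)."""
    return hashlib.sha256(secret + client_address.encode("ascii")).hexdigest()


def token_is_valid(secret: bytes, client_address: str, token: str) -> bool:
    """Recompute the token for this address and compare in constant time."""
    return hmac.compare_digest(make_token(secret, client_address), token)
```

Since the token is a pure function of the secret and the address, the server does not need to store per-client state to verify it.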

This token must be included in queries which include mutation fields in the "additional" section. (Simple lookups do not require the token.) By requiring the token, we prove that the sender's IP is not spoofed, as it *must* provide a valid address in order to receive the token during the session token exchange.

The token is requested by sending a `TXT` record to the DNS server with a target name of `"dat.local"`. The server will respond with the token, plus the port and address of the sending device (which are useful as a "whoami").

Over time, the server will rotate the secret it uses to generate tokens. In order to update clients' tokens, every response includes the latest token. The client should update its token with every response it receives. (It's advised that the server keeps the most recently expired secre so that old tokens can be accepted and replaced smoothly.)


secre -> secret
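
The rotation behavior described above (accept tokens from the most recently expired secret, always hand back the latest token) might look like this in Python. `TokenIssuer` and its methods are hypothetical names, and the address serialization is assumed.

```python
import hashlib
import secrets


class TokenIssuer:
    """Hypothetical server-side helper; keeps the current and previous secret."""

    def __init__(self) -> None:
        self.current = secrets.token_bytes(32)
        self.previous = None

    def rotate(self) -> None:
        """Retire the current secret but remember it for one more period."""
        self.previous, self.current = self.current, secrets.token_bytes(32)

    @staticmethod
    def _token(secret: bytes, address: str) -> str:
        return hashlib.sha256(secret + address.encode("ascii")).hexdigest()

    def latest(self, address: str) -> str:
        """Token to include in every response so clients can refresh."""
        return self._token(self.current, address)

    def accepts(self, address: str, token: str) -> bool:
        """Accept tokens made with the current or the previous secret."""
        candidates = [self.current]
        if self.previous is not None:
            candidates.append(self.previous)
        return any(
            secrets.compare_digest(self._token(s, address), token)
            for s in candidates
        )
```

A token survives exactly one rotation in this sketch; after a second rotation the client must have picked up a newer token from some response.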



### Lookup query
Contributor

SRV requests are also possible. Eg, on the command line:

dig @discovery1.publicbits.org 905fd1b6504698425e8bec3dbb77d757e281d505.dat.local SRV

returns something like:

0 0 44113 172.19.0.4.

Which, IIRC, is port 44113 on host 172.19.0.4 (note the trailing period, which is not a typo).
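
For reference, the four fields in that answer are the standard SRV priority, weight, port, and target. A small Python sketch of splitting the presentation form shown above (the `SrvAnswer` and `parse_srv` names are illustrative only):

```python
from typing import NamedTuple


class SrvAnswer(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str


def parse_srv(rdata: str) -> SrvAnswer:
    """Split 'priority weight port target' and drop the trailing root dot."""
    priority, weight, port, target = rdata.split()
    return SrvAnswer(int(priority), int(weight), int(port), target.rstrip("."))
```

Applied to the answer above, `parse_srv("0 0 44113 172.19.0.4.")` yields port 44113 and target `172.19.0.4`.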

[dns-name-server-lookup-query]: #dns-name-server-lookup-query

To request the current list of known peers for a pubkey, send a `TXT` question query with `{PUBKEY}.dat.local` as the name. Currently the public key is encoded to hex and truncated to 40 bytes. You will receive a response that includes a full peer listing and the latest token. See "TXT data encoding" above for information about encoding.


{DISCOVERY_KEY}.dat.local
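
To make the lookup concrete, here is a sketch that builds the bytes of a plain DNS `TXT` question using only the standard library. It shows the generic DNS wire format; whether the Dat name servers expect particular header flags is not stated in this draft, so the all-zero flags are an assumption, and the function names are hypothetical.

```python
import struct

TXT = 16  # DNS record type TXT
IN = 1    # DNS class IN


def encode_name(name: str) -> bytes:
    """Encode a dotted name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        if not 0 < len(raw) <= 63:
            raise ValueError("DNS labels must be 1 to 63 bytes long")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_txt_query(name: str, query_id: int = 0) -> bytes:
    """Build a standard query with one TXT question and no other sections."""
    # Header: id, flags (0 = standard query, assumed), QDCOUNT=1, then three zero counts.
    header = struct.pack(">HHHHHH", query_id, 0, 1, 0, 0, 0)
    return header + encode_name(name) + struct.pack(">HH", TXT, IN)
```

The 63-byte label limit enforced in `encode_name` is the same limit that motivates truncating the hex-encoded key.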

@tristanls tristanls Feb 10, 2018

> You will receive a response that includes a full peer listing

If I go through a valid "probe" step, acquire a session token, and then announce multiple ports, that would seem to increase the full peer listing arbitrarily.

Since simple lookups do not require a token, then it should be possible for me to spoof the IP address in simple lookups and use (previously constructed) arbitrarily large full peer listing to execute DoS on a target.

Contributor Author

Good find, this is a weakness cc @mafintosh


Every query may include a `TXT` "additional" section which includes the session token and any behavior fields (described below).


#### Subscribe flag
[dns-name-server-subscribe-flag]: #dns-name-server-subscribe-flag

The `subscribe` flag instructs the DNS name server to add the device to the list of active listeners for the given Hypercore. Any time a new peer is announced, the server will "push" a notification to the device.

The push is sent as the "additional" section of an `SRV` query. It contains as its data the `target` (address) and `port` of the new peer.

If a `TXT` lookup query is sent with an "additional" section that does not have the `subscribe` flag, that is treated as an "unsubscribe" message and the device is removed from the active listeners.
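
The subscribe and unsubscribe rule can be sketched as server-side bookkeeping in Python. `SubscriptionRegistry` is a hypothetical name, the "additional" section is modeled as an already-parsed dict, and leaving subscriptions untouched when no "additional" section is present is an assumption, since the text only covers the case where the section exists.

```python
class SubscriptionRegistry:
    """Hypothetical server-side listener table, keyed by discovery name."""

    def __init__(self) -> None:
        self.listeners = {}  # key -> set of (ip, port) listeners

    def handle_lookup(self, key, client, additional=None):
        """Update subscriptions from the parsed "additional" section, if any."""
        if additional is None:
            return  # plain lookup: subscriptions unchanged (assumed)
        if additional.get("subscribe"):
            self.listeners.setdefault(key, set()).add(client)
        else:
            # An "additional" section without the flag acts as an unsubscribe.
            self.listeners.get(key, set()).discard(client)

    def notify_targets(self, key):
        """Clients that should receive a push when a new peer announces key."""
        return sorted(self.listeners.get(key, ()))
```

One consequence of this rule is that a client sending an announce without repeating `subscribe` would drop its own subscription.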

TODO- what's the TTL?


Good question, I think it's 1-2 min



#### Announce field
[dns-name-server-announce-field]: #dns-name-server-announce-field

The `announce` field instructs the DNS name server to add the device to the list of active hosting peers for the given Hypercore. Its value should be the port from which the device is listening. Multiple ports may be announced using separate queries. Upon announce, the new peer is pushed to any subscribed devices using an `SRV` query.

TODO- how long till announce records expire? Should the client reannounce periodically?


Yes they should. They are GC'ed every 5-10 mins

Contributor

I think 10 minutes, and clients should re-announce, but I don't have a reference.



#### Unannounce field
[dns-name-server-unannounce-field]: #dns-name-server-unannounce-field

The `unannounce` field instructs the DNS name server to remove the device from the list of active hosting peers. Its value should be the port from which the device was previously listening.
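
A sketch of how a server might track the `announce` and `unannounce` fields, with expiry. The 600-second lifetime is only a placeholder based on the reviewers' estimates above, since the draft leaves expiry as a TODO. `AnnounceRegistry` and the injectable clock are hypothetical.

```python
import time


class AnnounceRegistry:
    """Hypothetical table of hosting peers with time-based expiry."""

    def __init__(self, ttl_seconds=600.0, clock=time.monotonic):
        self.ttl = ttl_seconds  # placeholder lifetime, not from the draft
        self.clock = clock
        self.peers = {}  # key -> {(ip, port): expiry time}

    def announce(self, key, ip, port):
        """Add or refresh a peer; re-announcing extends its lifetime."""
        self.peers.setdefault(key, {})[(ip, port)] = self.clock() + self.ttl

    def unannounce(self, key, ip, port):
        """Remove a previously announced port, if present."""
        self.peers.get(key, {}).pop((ip, port), None)

    def lookup(self, key):
        """Return live peers for key, dropping any that have expired."""
        now = self.clock()
        live = {p: exp for p, exp in self.peers.get(key, {}).items() if exp > now}
        self.peers[key] = live
        return sorted(live)
```

Each announced port is a separate entry here, which matches "multiple ports may be announced using separate queries" and also shows how one client can grow a listing.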


## Kademlia Mainline DHT
[kademlia-mainline-dht]: #kademlia-mainline-dht

Mainline DHT is the name given to the Kademlia-based Distributed Hash Table (DHT) used by BitTorrent clients to find peers. Dat has adopted it temporarily to track peers in its own network. You can find the specification at [BEP 0005](http://www.bittorrent.org/beps/bep_0005.html).

There are some issues with Dat's use of Mainline which limit the usefulness of its function. BitTorrent uses a 20 byte sha1 hash to identify torrents, while Dat uses a 32 byte public key to identify Hypercore registers. As a result, Dat has to truncate its keys to the first 20 bytes, leading to false positives when connecting to peers.


discovery key
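
The truncation issue is easy to see in a few lines of Python. The helper name `mainline_id` and the sample keys are made up for illustration.

```python
def mainline_id(key: bytes) -> bytes:
    """Truncate a 32-byte Dat key to the 20 bytes Mainline DHT expects."""
    if len(key) != 32:
        raise ValueError("expected a 32-byte key")
    return key[:20]


# Two different 32-byte keys that happen to share their first 20 bytes.
key_a = bytes(range(32))
key_b = key_a[:20] + bytes(12)

# Both map to the same DHT identifier, so a lookup for one can return
# peers that only hold the other.
same_bucket = mainline_id(key_a) == mainline_id(key_b)
```

For hash outputs such a collision is unlikely by accident, but the DHT identifier alone cannot rule it out, so a peer found this way still has to be checked against the full key after connecting.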



# Privacy concerns
Contributor

Here are two other privacy concerns off the top of my head:

  • the concern of being able to discover (and potentially download) all content "discovered" via these mechanisms. This is mitigated by using discovery keys (instead of public keys) for download
  • the ability to discover who has what content on the network (if you know a priori what content is associated with which discovery keys). Eg, imagine somebody sharing leaked documents; if the documents (dat archive, and thus discovery key) become public, somebody can make a list of all peers who have exchanged (or "knew of") that archive. I don't know any mitigation for this right now. This is also a concern with the wire protocol; in that case it could be mitigated by encrypting the entire transaction (including the discovery key verification), but not with the current encryption scheme.

[privacy-concerns]: #privacy-concerns

Peer discovery networks reveal the participants in a Dat swarm to any device which can access the network. This presents a privacy risk for users who may not want to have their activity broadcast.


main thing leaked is who is talking to who (which is of course important). we never leak the capability (public key) so passive listeners cannot access data / decrypt data - they can also see Alice and Bob are talking to each other probably

@anarcat anarcat Aug 25, 2018

it's important to indicate who that information is leaked to. elsewhere in the documentation (e.g. in the security FAQ) we are led to believe that information is leaked only to the members of the swarm, which is not really accurate. sure, the contents are visible only to the members of the swarm, but metadata like public (and private?) IP addresses and relationships between people are spread out much more widely than I first believed when reviewing the protocol.

in particular, if i understand this DEP correctly, it implies that discoverN.datproject.org know precisely:

  1. when a peer comes online (when Alice runs dat share)
  2. when a peer looks for content (when Bob runs dat clone $ALICEHASH)
  3. that Alice and Bob are related

This raises all sorts of privacy concerns which should be answered by the dat project. For example:

  • does the discovery server keep logs?
  • what is the retention policy?
  • who has access to those logs?

I think the current section about Privacy concerns is great, but should be expanded to cover for this peculiar property of the protocol. The security FAQ should also be updated to mention this, but that's a separate issue: I've documented my concerns with that in dat-ecosystem-archive/docs#127


There are several possible solutions to explore:

- Private discovery networks. This will reduce the number of possible data sources, which reduces the success rate of discovery, but also limits the exposure of the user's activity.
- Proxy services. This will increase the latency of traffic and will expose all activity to the proxy, but it will mask the user's activity among the activity of all proxy users.


# Unresolved questions
[unresolved]: #unresolved-questions

- Does the DNS network *need* to truncate the public key to 40 bytes? Could we fit the full 64 bytes by using another level of subdomain?


yea good idea. 32-chars.32-chars or use an encoding other than hex that is still dns friendly. Thoughts?


You could hash the discovery key to something like 63 bytes?


Smart

Contributor Author

If we hash to 63 bytes (not change encoding) we're basically just losing a byte of specificity. Why not just do 32.32 and stick with hex? We could also switch to base32.


32.32 seems reasonable.
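
A quick Python sketch of the "32.32" idea discussed above: hex-encode the full 32-byte key and split it across two subdomain labels. The label order and the helper name `split_hex_name` are assumptions; nothing here is an agreed format.

```python
MAX_LABEL = 63  # maximum length of a single DNS label


def split_hex_name(key: bytes, suffix: str = "dat.local") -> str:
    """Return '<first 32 hex chars>.<last 32 hex chars>.<suffix>'."""
    hexed = key.hex()
    if len(hexed) != 64:
        raise ValueError("expected a 32-byte key")
    labels = [hexed[:32], hexed[32:]]
    if any(len(label) > MAX_LABEL for label in labels):
        raise ValueError("label exceeds the DNS limit")
    return ".".join(labels) + "." + suffix
```

Both labels stay well under the 63-character limit, and the full key can be recovered by concatenating them, so no bits are lost to truncation.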



# Changelog
[changelog]: #changelog

A brief statement about current status can go here, followed by a list of dates
when the status line of this DEP changed (in most-recent-last order).

- YYYY-MM-DD: First complete draft submitted for review