
dht fixes #192

Merged
merged 13 commits into master from dhtfixes on Oct 23, 2014

Conversation

jbenet
Member

@jbenet jbenet commented Oct 21, 2014

Fixing dht issues.

@jbenet jbenet added the status/in-progress In progress label Oct 21, 2014
@@ -272,7 +271,7 @@ func (dht *IpfsDHT) getValueOrPeers(ctx context.Context, p peer.Peer,
 	// Perhaps we were given closer peers
 	var peers []peer.Peer
 	for _, pb := range pmes.GetCloserPeers() {
-		pr, err := dht.addPeer(pb)
+		pr, err := dht.peerFromInfo(pb)
Member Author

@whyrusleeping turns out that we don't want ensureConnectedToPeer here, because we won't connect to all of these peers. That's why there was addPeer. But addPeer is now equivalent to peerFromInfo, so i got rid of it as suggested :) 👍

Member Author

(You were right though, that we do need to connect to these peers. see note below in query)

@jbenet
Member Author

jbenet commented Oct 21, 2014

@whyrusleeping I found the addresses issue. When listening, we don't learn the dialer's proper address, because the address we're getting is the one they dialed from, not the one they listen on.

Options:

  • peers send their listen addrs in the version handshake (pre secure channel 👎 )
  • peers send their listen addrs after the secure channel is established: a sort of post-handshake. This is probably where all the service negotiation belongs as well.
  • peers run a "node / peer info" service that just returns basic public information about themselves:

peer info:

  • peer.ID
  • peer.PubKey (already should be retrieved in secure handshake, but might be good to have too)
  • peer Listen Address
  • peer API Address (if exposed)

Oh yeah, that reminds me. The peer.Addresses really should be qualified. Something like:

// in peer/

type AddrType string

const (
  APIAddr    AddrType = "api"
  ListenAddr AddrType = "listen" // rename config.Addresses.Swarm -> config.Addresses.Listen ?
  DialAddr   AddrType = "dial"   // may not even need to track these.
)

type Addrs map[AddrType]ma.Multiaddr

func (p *peer) Addresses() *Addrs { ... }

Thoughts on both (a) peer info service, and (b) qualified addrs?
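
For concreteness, a rough sketch of what such a peer info record could look like; the type and field names (and the PubKey type) are illustrative assumptions, not anything in the codebase:

// hypothetical record returned by a "peer info" service
type PeerInfo struct {
  ID         ID           // peer.ID
  PubKey     ci.PubKey    // assumed crypto type; should already come from the secure handshake
  ListenAddr ma.Multiaddr // the address the peer actually listens on
  APIAddr    ma.Multiaddr // only if the API is exposed
}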

@whyrusleeping
Member

Huh, so this is one of the same problems we've been fighting for a while now... I'm a fan of the option of just sending the addresses over during the handshake; it's one less step and reduces complexity IMO. If you want to do the 'peer info' service though, I feel we should fold it into diagnostics: send a diagnostics message to the peer and ensure it's not recursive. Although some people might not be okay with serving diagnostics information... so that becomes weird too.

As for qualified addresses, I'm not so sure we need them... I don't believe we should care about other peers' API ports, that just seems weird to me, and the address they dialed from shouldn't really matter because it means so many different things depending on the context (UDP vs TCP vs anything else).

whyrusleeping and others added 6 commits October 22, 2014 03:24
The query-- once it's actually attempting to connect to a peer--
should be the one connecting.
dht doesn't need the whole network interface, only needs a Dialer.
(much reduced surface of possible errors)
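
For illustration, the narrowed dependency could look roughly like this; the interface name and method signature are assumptions for the sketch, not the actual codebase API:

// hypothetical: the DHT only needs the ability to dial peers,
// not the whole network service.
type Dialer interface {
  DialPeer(ctx context.Context, p peer.Peer) error
}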
@jbenet
Member Author

jbenet commented Oct 22, 2014

@whyrusleeping i think i fixed it. i no longer get "no addresses on peer being sent!".

Though this dhtHell test takes forever and lags my computer, invariably making me exit it.

100
[4-99]->[0-3]
--
17 store key value
17 provide key
80 store key2 value2
80 provide key
[11-99] get key
[11-79] get key2
==

I do see some:

2014-10-22 05:37:39.541895 ERROR service service.go:236:    no request key [18 32 230 111 131 198 39 197 4 114 214 189 191 65 198 187 120 133 43 68 22 76 110 13 25 250 59 47 148 200 93 68 107 252 94 109 136 164] (timeout?)

@jbenet
Member Author

jbenet commented Oct 22, 2014

@whyrusleeping quick CR + test it out pls

@whyrusleeping
Member

The "no request key" error happens when a node replies to a request that we don't care about anymore. That seems pretty normal to me, especially if we ask two people for a value and one returns first. We might want to handle that a little better, just to avoid logging confusion, but it's not harming us any.
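
A small sketch of what "handling it better" could mean, purely illustrative (the replyRouter type, its fields, and the logger are hypothetical, not the actual service code):

// hypothetical: deliver a reply to its waiting request, or quietly drop it
// at debug level if the request is no longer tracked (e.g. another peer answered first).
type replyRouter struct {
  mu       sync.Mutex
  requests map[string]chan []byte // pending requests, keyed by request key
}

func (r *replyRouter) deliver(key string, reply []byte) {
  r.mu.Lock()
  ch, ok := r.requests[key]
  r.mu.Unlock()
  if !ok {
    log.Debugf("late or duplicate reply for unknown request key %x", key)
    return
  }
  select {
  case ch <- reply:
  default: // receiver already gave up; drop the reply
  }
}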

I ran a bunch of large concurrent "get" tests in dhthell with expect, and they all passed. Gonna try for ~500 nodes later when I get back to my desktop.

What specs does the machine you're running this on have? (the 100 node test you said was slow)

@whyrusleeping
Member

(The reason I ask about specs is because you're probably running out of RAM and swapping hard, thus causing the slowness.)

@whyrusleeping
Member

Creating node with id:  'QmUJwAW9DxxfSq9Bj1yJV29LCmiGLoJBzbqiwx9j8sepCQ'
Creating node with id: 'QmWzWuqmja92BXRHsUk2WDiQSzGcQqD3Z3HFTYgpNsQgh5'
Creating node with id: 'QmRhifxubZsjoxmfAqR4wDKQYdkYgqAWgfaBG5tNeFbER8'
runtime: memory allocated by OS (0xa8fde000) not in usable range [0x10800000,0x90800000)
runtime: out of memory: cannot allocate 1048576-byte block (1879048192 in use)
runtime: memory allocated by OS (0xa8fde000) not in usable range [0x10800000,0x90800000)
runtime: out of memory: cannot allocate 1048576-byte block (1879048192 in use)
runtime: memory allocated by OS (0xa8fde000) not in usable range [0x10800000,0x90800000)
runtime: out of memory: cannot allocate 1048576-byte block (1879048192 in use)
fatal error: out of memory

goroutine 16114 [running]:
runtime.throw(0x5d3bed)
    /home/whyrusleeping/go/src/pkg/runtime/panic.c:520 +0x5c fp=0xa92e9e64 sp=0xa92e9e58
largealloc(0x1, 0xa92e9ec4)
    /home/whyrusleeping/go/src/pkg/runtime/malloc.goc:226 +0xb0 fp=0xa92e9e88 sp=0xa92e9e64
runtime.mallocgc(0x100000, 0x3300f1, 0x1)
    /home/whyrusleeping/go/src/pkg/runtime/malloc.goc:169 +0x94 fp=0xa92e9ec0 sp=0xa92e9e88
cnew(0x3300f0, 0x100000, 0x1)
    /home/whyrusleeping/go/src/pkg/runtime/malloc.goc:836 +0xc8 fp=0xa92e9ed0 sp=0xa92e9ec0
runtime.cnewarray(0x3300f0, 0x100000)
    /home/whyrusleeping/go/src/pkg/runtime/malloc.goc:849 +0x3c fp=0xa92e9ee0 sp=0xa92e9ed0
makeslice1(0x327230, 0x100000, 0x100000, 0xa92e9f24)
    /home/whyrusleeping/go/src/pkg/runtime/slice.goc:55 +0x50 fp=0xa92e9eec sp=0xa92e9ee0
runtime.makeslice(0x327230, 0x100000, 0x0, 0x100000, 0x0, 0x0, 0x100000, 0x100000)
    /home/whyrusleeping/go/src/pkg/runtime/slice.goc:36 +0x1c8 fp=0xa92e9f0c sp=0xa92e9eec
github.com/jbenet/go-ipfs/Godeps/_workspace/src/github.com/jbenet/go-msgio.(*Chan).ReadFrom(0x181c0e60, 0xb6d4b460, 0x181c0e20, 0x100000)
    /home/whyrusleeping/gopkg/src/github.com/jbenet/go-ipfs/Godeps/_workspace/src/github.com/jbenet/go-msgio/chan.go:28 +0x154 fp=0xa92e9f9c sp=0xa92e9f0c
github.com/jbenet/go-ipfs/net/conn.func·002()
    /home/whyrusleeping/gopkg/src/github.com/jbenet/go-ipfs/net/conn/conn.go:76 +0xd8 fp=0xa92e9fcc sp=0xa92e9f9c
runtime.goexit()
    /home/whyrusleeping/go/src/pkg/runtime/proc.c:1445 fp=0xa92e9fcc sp=0xa92e9fcc
created by github.com/jbenet/go-ipfs/net/conn.newSingleConn
        /home/whyrusleeping/gopkg/src/github.com/jbenet/go-ipfs/net/conn/conn.go:78 +0x2d0

FWIW, my Chromebook can't run 100 nodes either.

}

// ErrVersionMismatch is returned when two clients don't share a protocol version
var ErrVersionMismatch = errors.New("protocol missmatch")

-// Compatible checks wether two versions are compatible
+// Handshake1Compatible checks wether two versions are compatible
Member

nit: whether

Member Author

addressed in 88cc1e2

@whyrusleeping
Member

A couple comments, and I think we should have a README.md in the handshake package to clearly document the protocol, just to make things even clearer and easier for anyone wanting to audit it.

@jbenet
Member Author

jbenet commented Oct 22, 2014

Yeah, probably memory. What's the issue though? Are we leaking all the things? We should be able to run 100 nodes no problem just sending messages. Msgs should be garbage collected, etc.

@jbenet
Member Author

jbenet commented Oct 22, 2014

A couple comments, and I think we should have a README.md in the handshake package to clearly document the protocol, just to make things even clearer and easier for anyone wanting to audit it.

+💯

@jbenet
Member Author

jbenet commented Oct 22, 2014

@whyrusleeping wat: https://travis-ci.org/jbenet/go-ipfs/jobs/38748767 never seen that before

@whyrusleeping
Member

Yeah, that's a bug I need to fix, see #141.
I left it out because it was complicated code, and was (supposed to be) very rare. But now that I've seen it twice in the wild, I'll have to go fix it.

@whyrusleeping
Member

Essentially, what's happening is this:
We fill up one bucket of the routing table, so we go to split it. When the split happens, all the peers in that bucket move to the new bucket, putting us in a situation where we need to split again. This can happen up to (KEYLEN - 1) times in a row... so I need a loop, or recursion...
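
A minimal sketch of the looping version; the bucket and table helpers here (bucketFor, full, depth, split, add) are made-up names, not the actual routing table API:

// hypothetical: keep splitting the target bucket until the new peer fits,
// or we run out of prefix bits to split on (at most KEYLEN - 1 splits).
func (rt *RoutingTable) insert(p peer.Peer) {
  b := rt.bucketFor(p)
  for b.full() && b.depth() < KEYLEN-1 {
    rt.split(b)
    b = rt.bucketFor(p) // the peer may now belong to the newly created bucket
  }
  if !b.full() {
    b.add(p)
  }
}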

jbenet added a commit that referenced this pull request Oct 23, 2014
@jbenet jbenet merged commit bd5a1c0 into master Oct 23, 2014
@jbenet jbenet removed the status/in-progress In progress label Oct 23, 2014
@jbenet jbenet deleted the dhtfixes branch October 23, 2014 07:15
@aschmahmann aschmahmann mentioned this pull request Aug 23, 2021