Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

binance: Retry keep alive. #2958

Merged
merged 5 commits into from
Oct 10, 2024
Merged

Conversation

JoeGruffins
Copy link
Member

@JoeGruffins JoeGruffins commented Sep 6, 2024

Untested. Attempting to solve a problem with stuck books.

@JoeGruffins JoeGruffins force-pushed the binancewatchlistenkey branch 3 times, most recently from f436428 to 7b64763 Compare September 6, 2024 08:11
Copy link
Member

@buck54321 buck54321 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do something like this so that we can test.

Also, please look into the LIST_SUBSCRIPTIONS websocket message to see if we can leverage that in a periodic check to catch discrepancies.

@JoeGruffins
Copy link
Member Author

Looking into querying the subscriptions.

@JoeGruffins
Copy link
Member Author

Sorry I have not tested this on testnet/mainnet yet, but I should be able to use on testnet correct?

@buck54321
Copy link
Member

Sorry I have not tested this on testnet/mainnet yet, but I should be able to use on testnet correct?

Maybe? I haven't found testing mm on testnet very useful.

Copy link
Contributor

@martonp martonp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a better technique would be if an orderbook has not received an update within a certain period, maybe 30 seconds, then resync the orderbook. Currently we will only know that an orderbook is stale every time checkSubs is run, so it could remain stale for 30 mins.

@buck54321
Copy link
Member

Maybe a better technique would be if an orderbook has not received an update within a certain period, maybe 30 seconds, then resync the orderbook. Currently we will only know that an orderbook is stale every time checkSubs is run, so it could remain stale for 30 mins.

This would result in tons of unnecessary reloads on low-volume markets. checkSubs interval can be much shorter though.

Copy link
Contributor

@martonp martonp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been using this diff to test on mainnet. I've turned off my internet connection for 20 mins, then turned it back on, and the orderbooks immediately resync without hitting the new code here. The code looks good otherwise and works fine with the simnet test. Would be nice to be able to reproduce the issue though.

client/mm/libxc/binance.go Outdated Show resolved Hide resolved
client/mm/libxc/binance.go Outdated Show resolved Hide resolved
@JoeGruffins JoeGruffins force-pushed the binancewatchlistenkey branch 3 times, most recently from 56ed9ba to e3cf2a8 Compare September 23, 2024 06:54
@JoeGruffins
Copy link
Member Author

Copy link
Contributor

@martonp martonp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I disconnect for > 3 minutes, then reconnect, there seems to be a deadlock when calling VWAP. I'm looking into it.

@@ -288,6 +291,11 @@ func (b *binanceOrderBook) Connect(ctx context.Context) (*sync.WaitGroup, error
if retry != nil { // don't hammer
continue
}
case <-b.disconnectedChan:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like if for some reason the Websocket has not reconnected, but the snapshot request successfully completes, the orderbook will appear synced, but no messages will be coming through.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the order book stuff doesn't happen over websocket? It's a separate secure http request? So we should stop the books from looking synced until the websocket looks up again?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about the last change https://github.com/decred/dcrdex/compare/7833026d42ae23acc577fd07548c97c2158e580f..adef014da6fd9c56e84197ce13a1a5e3a24d0147

So, not allowing sync again until the websocket tells us it is up. If the connect messages came out of order or we missed one that would be bad though, unsure if that can happen.

@JoeGruffins JoeGruffins force-pushed the binancewatchlistenkey branch 2 times, most recently from 7833026 to adef014 Compare September 25, 2024 08:05
@JoeGruffins
Copy link
Member Author

I don't think this test fail is due to my changes....

// will not place new orders.
connected := cs == comms.Connected
bnc.booksMtx.RLock()
defer bnc.booksMtx.RLock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
defer bnc.booksMtx.RLock()
defer bnc.booksMtx.RUnlock()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoops

Comment on lines 1780 to 1791
select {
case reconnectC <- struct{}{}:
default:
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. I think this is a batch of changes I got from @buck54321

A natural reconnect shouldn't need a new connection I guess.

@martonp
Copy link
Contributor

martonp commented Sep 25, 2024

There were a few issues I fixed.. consider these changes: 68008b3

You can test using the TestVWAP in the diff I posted above: #2958 (review)

I think another good change would be to trigger a reconnect if the list subscriptions request fails, because this probably means that the connection is broken.

@JoeGruffins JoeGruffins force-pushed the binancewatchlistenkey branch 2 times, most recently from d2dc7fb to b97d591 Compare September 26, 2024 06:17
Copy link
Member Author

@JoeGruffins JoeGruffins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martonp I added your changes.

The changes in wsconn.go, why not just add the reconnect timer to keepAlive? As it is now we have to wait for a read or a write for the scheduled reconnect to happen. So, it will happen after we want it to, or not at all if we never have another read or write.

PingWait: time.Minute * 4,
EchoPingData: true,
ReconnectSync: func() {
bnc.log.Debugf("Binance reconnected")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK not to trigger a full reconnect here, but maybe a call to checkSubs would be prudent.

Comment on lines 113 to 115
type WsCfg struct {
// URL is the websocket endpoint URL.
URL string

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this need to be a separate argument to the constructor?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to URL to only be stored in one place, be able to be updated, and I didn't want to add a mutex for the config or a field in the config. If we don't update the URL, when there is a read error, and it reconnects, it will only reconnect to the first market that was subscribed to.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to URL to only be stored in one place

I don't really see this as a compelling reason to change the function signature for every consumer. You can still have an unexported field that you update. When we have structs for config settings, we generally try to keep everything contained to the struct.

@martonp
Copy link
Contributor

martonp commented Sep 28, 2024

I made some updated to that commit: a15e541

The changes in wsconn.go, why not just add the reconnect timer to keepAlive? As it is now we have to wait for a read or a write for the scheduled reconnect to happen. So, it will happen after we want it to, or not at all if we never have another read or write.

You're right, but I don't think the timer should be in keepAlive, because we need the read loop to end. I've updated it so that we don't have to wait for a read.

@JoeGruffins
Copy link
Member Author

JoeGruffins commented Sep 30, 2024

Thanks those changes look good to me. Added them. They replaced the last commit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're making a lot of changes to shoehorn the auto-reconnect into the read/readRaw loops. Can you tell me what functionally is the difference between the proposed changes for this file (+112/-63) and this alternative take (+51/-6)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that change, won't there be an additional read loop running after each reconnect? When we reconnect from handleReadError, the read/readRaw loops will return, but that won't happen during a reconnect. It would work if there was a new context created each time read/readRaw is called, and it is cancelled before sending a struct to reconnectChan.

Copy link
Member

@buck54321 buck54321 Oct 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. Instead of using conn.reconnectCh <- struct{}{}, we could

conn.wsMtx.Lock()
if conn.ws != nil {
	conn.ws.Close()
}
conn.wsMtx.Unlock()

I think, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sending to reconnectCh already calls connect which closes the old connection so I guess the read loop would error out already.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This latest change doesn't work.. after the first AutoReconnect it just starts reconnecting over and over every second.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about this one: martonp@bd18aaa

There is a generic read function to avoid duplication.

With this commit you can use TestVWAP to test: martonp@f6029d7

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This latest change doesn't work.. after the first AutoReconnect it just starts reconnecting over and over every second.

Oh yeah it does, hmm

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added marton changes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was nothing wrong with the read loops and we don't need to spawn a new goroutine for every message. Why are we making that change?

Copy link
Contributor

@martonp martonp Oct 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for the read loop change is to be able to end the read loop whenever we do an auto reconnect. The reason for the new goroutine is to be able to end the read loop without having to wait for the ReadMessage or ReadJSON call to return.

If we just send a message on reconnectCh, the old read loop will still be running, and it will attempt to reconnect again whenever it encounters an error. We could solve this by creating new contexts for each call to read, but I think it's the most simple to only have one read loop running at one time, and only send a message on reconnectCh whenever the read loop is being ended either due to an error or an auto reconnect.

@JoeGruffins JoeGruffins force-pushed the binancewatchlistenkey branch 3 times, most recently from cd816d9 to d089412 Compare October 10, 2024 05:58
@JoeGruffins
Copy link
Member Author

testbinance panic:

2024-10-10 17:10:45.032 [ERR] TB: Error getting deposit confirmations for ETH -> 0x41e71205021391160dc30bd5840f8a93e9d2bdcaf2fbe5c60f50c75d2fc7cc7a: TransactionReceipt error: not found
2024-10-10 17:25:17.298 [ERR] TB: Error getting deposit confirmations for BTC -> f7e8cb5a3ba91149970f97d3d8c9b9b645442514849a55a139d01b3d8495f129: gettransaction error with output = "error code: -5\nerror message:\nInvalid or non-wallet transaction id\n", err = exit status 5
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x8016f8]

goroutine 70 [running]:
main.(*fakeBinance).updateOrderBalances(0xc0001243c0, {0xc000144df6?, 0xd659c0?}, 0x0, 0x3fb999999999999a, 0x3fa43731c574e9b7, 0x2)
        /home/joe/git/dcrdex/client/cmd/testbinance/main.go:968 +0xf8
main.(*fakeBinance).run.func2()
        /home/joe/git/dcrdex/client/cmd/testbinance/main.go:408 +0x4b3
created by main.(*fakeBinance).run in goroutine 1
        /home/joe/git/dcrdex/client/cmd/testbinance/main.go:369 +0xd1

@JoeGruffins
Copy link
Member Author

Removed the wsconn changes for now. They should maybe be a separate pr. Attempting to fix the other issues pointed out by @martonp https://github.com/decred/dcrdex/compare/d089412d7cd1f74acae443b4728d581dad35efb3..1ea349eed8446916cd75718efdf265400db9925d

@buck54321 buck54321 merged commit 5f4a258 into decred:master Oct 10, 2024
5 checks passed
buck54321 added a commit to buck54321/dcrdex that referenced this pull request Oct 17, 2024
* handle ws reconnect signal

* binance: Retry keep alive.

* testbinance: Add flappy websocket.

* binance: Check market depth subs.

* binance: Desync books on disconnect.

---------

Co-authored-by: Brian Stafford <buck54321@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants