Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Periodically resubscribe to chain event logs #922

Merged
merged 4 commits into from
Oct 6, 2022
Merged

Periodically resubscribe to chain event logs #922

merged 4 commits into from
Oct 6, 2022

Conversation

lalexgap
Copy link
Contributor

@lalexgap lalexgap commented Oct 4, 2022

Fixes statechannels/go-nitro-testground#116

It looks like geth has an open issue around SubscribeFilterLogs failing to report events after it's been running for some time. This is similar to the behaviour we see in our long running test failure.

Based on some code snippets on the issue the eth chain service now resubscribes to the event logs every 60 seconds. This avoids the problem by ensuring that we don't have long running subscription. I chose the duration of 60 seconds based on 1) that's what they used and 2) and it seemed reasonable.

Here is a 5 minute testground run with this change that succeeds! Here is a 10 min run🚀

@lalexgap lalexgap changed the title Periodically resubscribe to events Periodically resubscribe to chain event logs Oct 4, 2022
@lalexgap lalexgap requested a review from kerzhner October 4, 2022 20:16
@kerzhner
Copy link
Contributor

kerzhner commented Oct 4, 2022

Do you think our issue and ethereum/go-ethereum#23845 are related? The geth issue description talks about missing events after days of running a script, not minutes...

In general, having to resubscribe to events every minute seems a bit wild. Also, is it possible to miss events while a subscription is inactive (after cancellation but before resubscription)?

client/engine/chainservice/eth_chainservice.go Outdated Show resolved Hide resolved
return nil
}

func (ecs *EthChainService) listenForLogEvents() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am probably missing something, but why is the below logic moved to listenForLogEvents?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think our issue and ethereum/go-ethereum#23845 are related? The geth issue description talks about missing events after days of running a script, not minutes...

Based on our test run behaviour I'm pretty sure that our issue is being caused by the geth issue as:

  • Without this change (with a long running test) we see the withdraw transaction submitted but no chain events in the logs
  • Without this change tests will fail if they run over ~3 minutes

In general, having to resubscribe to events every minute seems a bit wild.

Agreed 😔.

Also, is it possible to miss events while a subscription is inactive (after cancellation but before resubscription)?

That's a good point! I haven't run into that problem when running testground runs. I think we may be ok but I'll investigate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on our test run behaviour I'm pretty sure that our issue is being caused by the geth issue as:

  • Without this change (with a long running test) we see the withdraw transaction submitted but no chain events in the logs
  • Without this change tests will fail if they run over ~3 minutes

For a test with a long runtime, does the test open/close more on-chain channels than a short running test? Or do longer running tests open/use/close just more virtual channels?

We use a websocket for connecting to an Ethereum JSON rpc endpoint. I wonder if there is some timeout issues with that websocket...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a test with a long runtime, does the test open/close more on-chain channels than a short running test? Or do longer running tests open/use/close just more virtual channels?

Nope, the long running test would create the same amount of ledger channels, so we should have the same amount of on chain channels (every participant creates a ledger channel with every hub).

This comment was marked as off-topic.

Copy link
Contributor Author

@lalexgap lalexgap Oct 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a little spelunking through go-ethereum and found this:

// Config represents the configuration of the filter system.
type Config struct {
	LogCacheSize int           // maximum number of cached blocks (default: 32)
	Timeout      time.Duration // how long filters stay active (default: 5min)
}

5 minutes seems roughly in line with the failures we've seen, so that might explain it.

Edit: I've posted my hunch to the issue

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it makes sense to change our resubscription period to something like 4 minutes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it makes sense to change our resubscription period to something like 4 minutes?

Yup, now that we know it's a 5 minute timeout it makes sense to use a larger value.

@@ -111,25 +116,41 @@ func (ecs *EthChainService) SendTransaction(tx protocols.ChainTransaction) error
}

func (ecs *EthChainService) subcribeToEvents() error {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

Copy link
Contributor

@kerzhner kerzhner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving this PR, but I am worried about:

  • Missing events between unsubscribe and resubscribe.
  • Duplicate events if resubscribe has a lookback window for events.

@lalexgap
Copy link
Contributor Author

lalexgap commented Oct 6, 2022

Approving this PR, but I am worried about:

  • Missing events between unsubscribe and resubscribe.
  • Duplicate events if resubscribe has a lookback window for events.

As discussed in slack the plan is to merge this in and keep an eye out for any duplicate or missing events

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Long running test seem to fail closing ledger channels
2 participants