Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory leak on txpool priced urgent list #23690

Closed
ferranbt opened this issue Oct 7, 2021 · 5 comments
Closed

Memory leak on txpool priced urgent list #23690

ferranbt opened this issue Oct 7, 2021 · 5 comments
Assignees
Labels

Comments

@ferranbt
Copy link
Contributor

ferranbt commented Oct 7, 2021

System information

This behaviour was found on Polygon PoS (upstream v1.10.8) but I think it also affects Go-ethereum, London fork is not enabled yet in our chain.

Expected behaviour

Some nodes on the Polygon PoS network have been having OOM crashes lately since the London changes were upstreamed. On pprof we found out that much of the allocated memory was on transactions.

image

We though that the issue was either on:

  • Transaction pool.
  • Broadcast announcer.

This error was not reproducible and it would take at least a day to show up on the metrics. Some nodes even reported more than 100Gb of RAM because of this.

On the txpool we could not find any out of place metrics till we added (#23691) and we found this:

image

At some point, the size of the urgent list in the transaction pool starts to grow non-stop at the same time as the memory, this happens until the node runs out of memory (note that there are millions of transactions in the urgent list!). Besides, in the chart we can see that there are no more "reheap" metrics being emitted, which makes sense since that indicates that the urgent list is being cleaned.

In this node the max txpool size was on the 100k and all the transactions were inserted on the pool with the normal peer to peer gossip protocol.

IMHO, the problem is that under certain unknown circumstances, the txpool gets stuck and keeps adding new transactions into the urgent list without cleaning the old ones, this is, the Reheap method is not called anymore (no more reheap metrics).

However, if the London fork is enabled as in the Ethereum mainnet network, SetBaseFee is called periodically here and the urgent list is cleaned automatically, that is why the Ethereum mainnet has not seen this issue so far. But, if London fork is not enabled as in Polygon PoS, this does not happen. We found a solution here by calling Reheap if London is not enabled and we have not seen the problem so far.

Though I do not think there is a need for immediate action since we found a workaround (we can PR it if you guys want it) and the Ethereum mainnet will not likely find it on production I think it is good to keep in mind that there might be a memory leak on the priced list if not used correctly.

@bassoy
Copy link

bassoy commented Oct 7, 2021

Seems that this issue is related or equal to this one #23195 and specifically #23195 (comment) and #23195 (comment).

This would mean that it is also relevant for the Ethereum mainnet. Would be good to clarify.

@ferranbt
Copy link
Contributor Author

ferranbt commented Oct 7, 2021

Hi! I just found out that this might affect the Ethereum mainnet too. If the node gets stuck syncing in the same block (if there is a bad block for example), the reorg code is never called and neither the Reheap. We are seeing that at least in Polygon PoS after our fix too.

@rjl493456442 rjl493456442 self-assigned this Oct 8, 2021
@bassoy
Copy link

bassoy commented Oct 18, 2021

Is there any update on this issue?

@MariusVanDerWijden MariusVanDerWijden linked a pull request Oct 21, 2021 that will close this issue
@MariusVanDerWijden
Copy link
Member

MariusVanDerWijden commented Oct 21, 2021

It might have already been fixed by this: 067084f
I can not replicate it whatsoever

The adding to urgent and updating stales are in line at all times for me. Even when I modify some code to make the stales go negative for a while, they will eventually catch up and perform the reheap

@holiman
Copy link
Contributor

holiman commented Nov 3, 2021

Since we've failed to repro, and believe it's already fixed, I'm closing this. Feel free to reopen (or open a new one) if the issue persists.

hbandura added a commit to celo-org/celo-blockchain that referenced this issue Jan 25, 2022
This PR fixes a memory leak found while running v1.5.0 nodes before the Espresso fork.

txPricedList keeps both a list of all txs in the pool, and also a heap for fast access to the cheapest txs. These heaps are not always synchronized with the txs list: they increment at the same time as the list, but they do not decrement at the same time. To decrement, the txpool marks how many txs are 'stale', and after a critical point, a Reheap() is called, where the whole memory of the heaps is released and they are recreated.

The issue in the code was that the critical point was never reached, and the heaps grew indefinitely. Strangely, the behavior is different after Expresso HF activation, since then Reheap() is called periodically on changes for the base fee. But before the HF nothing forces this so the heaps grow forever. One of the reasons for the critical point never being reached is the tx pool not respecting the contract of calling Removed every time it should.

The issue seems to be similar to ethereum/go-ethereum#23690, which was workarounded by reheaping after every pool reorg. Here we also add a couple of safeguards based on the size of the tx pool to ensure that, even if the node is stuck (due to a bad block or any other reason), proper reheaps are triggered if necessary.

A follow-up investigation should be made to have an exhaustive list of all the reasons why the de-synchronization occurs. An example of those is the method demoteUnexecutables, where many txs are removed from the pool but not from the pricedlist.
gastonponti pushed a commit to celo-org/celo-blockchain that referenced this issue Jan 26, 2022
This PR fixes a memory leak found while running v1.5.0 nodes before the Espresso fork.

txPricedList keeps both a list of all txs in the pool, and also a heap for fast access to the cheapest txs. These heaps are not always synchronized with the txs list: they increment at the same time as the list, but they do not decrement at the same time. To decrement, the txpool marks how many txs are 'stale', and after a critical point, a Reheap() is called, where the whole memory of the heaps is released and they are recreated.

The issue in the code was that the critical point was never reached, and the heaps grew indefinitely. Strangely, the behavior is different after Expresso HF activation, since then Reheap() is called periodically on changes for the base fee. But before the HF nothing forces this so the heaps grow forever. One of the reasons for the critical point never being reached is the tx pool not respecting the contract of calling Removed every time it should.

The issue seems to be similar to ethereum/go-ethereum#23690, which was workarounded by reheaping after every pool reorg. Here we also add a couple of safeguards based on the size of the tx pool to ensure that, even if the node is stuck (due to a bad block or any other reason), proper reheaps are triggered if necessary.

A follow-up investigation should be made to have an exhaustive list of all the reasons why the de-synchronization occurs. An example of those is the method demoteUnexecutables, where many txs are removed from the pool but not from the pricedlist.
gastonponti pushed a commit to celo-org/celo-blockchain that referenced this issue Jan 26, 2022
This PR fixes a memory leak found while running v1.5.0 nodes before the Espresso fork.

txPricedList keeps both a list of all txs in the pool, and also a heap for fast access to the cheapest txs. These heaps are not always synchronized with the txs list: they increment at the same time as the list, but they do not decrement at the same time. To decrement, the txpool marks how many txs are 'stale', and after a critical point, a Reheap() is called, where the whole memory of the heaps is released and they are recreated.

The issue in the code was that the critical point was never reached, and the heaps grew indefinitely. Strangely, the behavior is different after Expresso HF activation, since then Reheap() is called periodically on changes for the base fee. But before the HF nothing forces this so the heaps grow forever. One of the reasons for the critical point never being reached is the tx pool not respecting the contract of calling Removed every time it should.

The issue seems to be similar to ethereum/go-ethereum#23690, which was workarounded by reheaping after every pool reorg. Here we also add a couple of safeguards based on the size of the tx pool to ensure that, even if the node is stuck (due to a bad block or any other reason), proper reheaps are triggered if necessary.

A follow-up investigation should be made to have an exhaustive list of all the reasons why the de-synchronization occurs. An example of those is the method demoteUnexecutables, where many txs are removed from the pool but not from the pricedlist.
@dindinw dindinw mentioned this issue Jun 16, 2023
37 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants