-
Notifications
You must be signed in to change notification settings - Fork 11.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[TransactionManager] refactor and fix memory usage issues #10829
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎ |
6c1555c
to
3882662
Compare
return ready_certificates; | ||
} | ||
|
||
let input_count = self.input_objects.get_mut(&input_key.0).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to confirm its safe to unwrap here because callers of the method must ensure input key is available in store and storage here refers to self.input_objects
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, because we know there are non-zero # of transactions that are waiting on the input object. Adding a panic message.
return ready_certificates; | ||
}; | ||
|
||
// Waiters can acquire lock in eitehr readonly or default mode. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/eitehr/either
} | ||
|
||
/// After reaching 3/4 load in hashmaps, increase capacity to decrease load to 1/2. | ||
fn maybe_reserve_capacity(&mut self) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a MAX_HASHMAP_CAPACITY to keep this bounded? Or is it not an issue because we have MAX_PER_OBJECT_QUEUE_LENGTH & MAX_TM_QUEUE_LENGTH for protection
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about this, and decided to rely on the existing limits for protection, since the actual usage at most doubles these values, which should still be tolerable in a spike.
// TransactionManager can receive the notifications for objects that it did not find | ||
// in the objects table. | ||
// | ||
// REQUIRED: this must be called before tx_guard.commit_tx() (below), to ensure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ooc why is this no longer true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tx_guard.commit_tx() is currently a no-op FYI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, and the existence of executed effects are checked when transaction manager recovers from pending_execution
table now, so even if a txn did not call tx_guard.commit_tx() due to a crash gets retried, transaction manager would handle the case correctly by avoiding re-execution and removing the transaction from the pending_execution
table.
self.executing_certificates.shrink_to(max( | ||
self.executing_certificates.capacity() / 2, | ||
MIN_HASHMAP_CAPACITY, | ||
)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain more about what memory problem this is addressing? It seems to me like the main problem would be peak usage, which this doesn't seem to help with.
Also, this code is quite repetive and could be factored with a generic function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This tries to reduce memory usage after a spike of pending transactions, which can be much larger than the usual number of pending transactions. Will refactor this.
@@ -103,15 +103,10 @@ pub async fn execution_process( | |||
break; | |||
} | |||
} | |||
|
|||
// Remove the certificate that finished execution from the pending_certificates table. | |||
authority.certificate_executed(&digest, &epoch_store); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why was this removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I combined certificate_executed()
and objects_available()
into a single notify_commit()
call from authority state, to simplify reasoning, so only 1 notification is needed after a transaction commits instead of 2. Downside is that we can no longer tell if a committed transaction should have been tracked by transaction manager, because the commit notification will be triggered for system transactions too. I kept objects_available()
though to be called from enqueue()
and tests.
## Description 1. Fix a leak where lock_waiters map inserts an entry with empty LockQueue, that may not get removed. 2. Resize HashMaps in TM as load changes. 3. Only notify TM once after a transaction commits, with both tx digest and output keys. 4. Extract common logic to update lock queue and ready transactions after a transaction and its output objects commit. ## Test Plan unit tests --- If your changes are not user-facing and not a breaking change, you can skip the following section. Otherwise, please indicate what changed, and then add to the Release Notes section as highlighted during the release process. ### Type of Change (Check all that apply) - [ ] user-visible impact - [ ] breaking change for a client SDKs - [ ] breaking change for FNs (FN binary must upgrade) - [ ] breaking change for validators or node operators (must upgrade binaries) - [ ] breaking change for on-chain data layout - [ ] necessitate either a data wipe or data migration ### Release notes
Description
Test Plan
unit tests
If your changes are not user-facing and not a breaking change, you can skip the following section. Otherwise, please indicate what changed, and then add to the Release Notes section as highlighted during the release process.
Type of Change (Check all that apply)
Release notes