Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TransactionManager] refactor and fix memory usage issues #10829

Merged
merged 4 commits into from
Apr 13, 2023
Merged

Conversation

mwtian
Copy link
Contributor

@mwtian mwtian commented Apr 12, 2023

Description

  1. Fix a leak where lock_waiters map inserts an entry with empty LockQueue, that may not get removed.
  2. Resize HashMaps in TM as load changes.
  3. Only notify TM once after a transaction commits, with both tx digest and output keys.
  4. Extract common logic to update lock queue and ready transactions after a transaction and its output objects commit.

Test Plan

unit tests


If your changes are not user-facing and not a breaking change, you can skip the following section. Otherwise, please indicate what changed, and then add to the Release Notes section as highlighted during the release process.

Type of Change (Check all that apply)

  • user-visible impact
  • breaking change for a client SDKs
  • breaking change for FNs (FN binary must upgrade)
  • breaking change for validators or node operators (must upgrade binaries)
  • breaking change for on-chain data layout
  • necessitate either a data wipe or data migration

Release notes

@vercel
Copy link

vercel bot commented Apr 12, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

4 Ignored Deployments
Name Status Preview Comments Updated (UTC)
explorer ⬜️ Ignored (Inspect) Apr 13, 2023 5:01pm
explorer-storybook ⬜️ Ignored (Inspect) Apr 13, 2023 5:01pm
sui-wallet-kit ⬜️ Ignored (Inspect) Apr 13, 2023 5:01pm
wallet-adapter ⬜️ Ignored (Inspect) Apr 13, 2023 5:01pm

@mwtian mwtian force-pushed the txn-mgr-hashmap branch 2 times, most recently from 6c1555c to 3882662 Compare April 12, 2023 20:43
@mwtian mwtian changed the title [TransactionManager] improve memory management [TransactionManager] fix memory issues Apr 12, 2023
@mwtian mwtian changed the title [TransactionManager] fix memory issues [TransactionManager] fix memory usage issues Apr 12, 2023
@mwtian mwtian marked this pull request as ready for review April 12, 2023 20:52
@mwtian mwtian changed the title [TransactionManager] fix memory usage issues [TransactionManager] refactor and fix memory usage issues Apr 12, 2023
return ready_certificates;
}

let input_count = self.input_objects.get_mut(&input_key.0).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to confirm its safe to unwrap here because callers of the method must ensure input key is available in store and storage here refers to self.input_objects?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, because we know there are non-zero # of transactions that are waiting on the input object. Adding a panic message.

return ready_certificates;
};

// Waiters can acquire lock in eitehr readonly or default mode.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/eitehr/either

}

/// After reaching 3/4 load in hashmaps, increase capacity to decrease load to 1/2.
fn maybe_reserve_capacity(&mut self) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a MAX_HASHMAP_CAPACITY to keep this bounded? Or is it not an issue because we have MAX_PER_OBJECT_QUEUE_LENGTH & MAX_TM_QUEUE_LENGTH for protection

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this, and decided to rely on the existing limits for protection, since the actual usage at most doubles these values, which should still be tolerable in a spike.

// TransactionManager can receive the notifications for objects that it did not find
// in the objects table.
//
// REQUIRED: this must be called before tx_guard.commit_tx() (below), to ensure
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ooc why is this no longer true?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tx_guard.commit_tx() is currently a no-op FYI

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, and the existence of executed effects are checked when transaction manager recovers from pending_execution table now, so even if a txn did not call tx_guard.commit_tx() due to a crash gets retried, transaction manager would handle the case correctly by avoiding re-execution and removing the transaction from the pending_execution table.

self.executing_certificates.shrink_to(max(
self.executing_certificates.capacity() / 2,
MIN_HASHMAP_CAPACITY,
))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain more about what memory problem this is addressing? It seems to me like the main problem would be peak usage, which this doesn't seem to help with.

Also, this code is quite repetive and could be factored with a generic function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tries to reduce memory usage after a spike of pending transactions, which can be much larger than the usual number of pending transactions. Will refactor this.

@@ -103,15 +103,10 @@ pub async fn execution_process(
break;
}
}

// Remove the certificate that finished execution from the pending_certificates table.
authority.certificate_executed(&digest, &epoch_store);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why was this removed?

Copy link
Contributor Author

@mwtian mwtian Apr 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I combined certificate_executed() and objects_available() into a single notify_commit() call from authority state, to simplify reasoning, so only 1 notification is needed after a transaction commits instead of 2. Downside is that we can no longer tell if a committed transaction should have been tracked by transaction manager, because the commit notification will be triggered for system transactions too. I kept objects_available() though to be called from enqueue() and tests.

@mwtian mwtian enabled auto-merge (squash) April 13, 2023 20:13
@mwtian mwtian merged commit bf757cc into main Apr 13, 2023
@mwtian mwtian deleted the txn-mgr-hashmap branch April 13, 2023 20:43
mwtian added a commit that referenced this pull request Apr 13, 2023
## Description 

1. Fix a leak where lock_waiters map inserts an entry with empty
LockQueue, that may not get removed.
2. Resize HashMaps in TM as load changes.
3. Only notify TM once after a transaction commits, with both tx digest
and output keys.
4. Extract common logic to update lock queue and ready transactions
after a transaction and its output objects commit.

## Test Plan 

unit tests

---
If your changes are not user-facing and not a breaking change, you can
skip the following section. Otherwise, please indicate what changed, and
then add to the Release Notes section as highlighted during the release
process.

### Type of Change (Check all that apply)

- [ ] user-visible impact
- [ ] breaking change for a client SDKs
- [ ] breaking change for FNs (FN binary must upgrade)
- [ ] breaking change for validators or node operators (must upgrade
binaries)
- [ ] breaking change for on-chain data layout
- [ ] necessitate either a data wipe or data migration

### Release notes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants