Consider adding LRU cache to RocksDB impl of Database #5203

Open
MaksymZavershynskyi opened this issue Nov 11, 2021 · 8 comments · Fixed by #5212
Labels: A-storage (Area: storage and databases), Node (Node team), T-node (Team: issues relevant to the node experience team), T-public-interfaces (Team: issues relevant to the public interfaces team)

Comments

@MaksymZavershynskyi (Contributor) commented Nov 11, 2021

We haven't had an opportunity to optimize the performance of the RPC server and the ViewClient inside nearcore. As a result, individual RPC nodes cannot sustain the needed throughput, which in turn leads to high cloud expenses for running multiple nodes and the need to use throttling.

Anecdotally, some people were able to optimize the RPC server to achieve higher QPS by using Redis in nearcore: https://twitter.com/vgrichina/status/1458289037525934084 . We might be able to achieve a QPS increase by simpler means as a temporary measure until we revisit the RPC server and ViewClient implementations. I suggest we consider adding a simple LRU cache from https://docs.rs/cached/0.26.2/cached/ directly into the get and write methods of impl Database for RocksDB here:

impl Database for RocksDB {
    fn get(&self, col: DBCol, key: &[u8]) -> Result<Option<Vec<u8>>, DBError> {
        let read_options = rocksdb_read_options();
        let result = self.db.get_cf_opt(unsafe { &*self.cfs[col as usize] }, key, &read_options)?;
        Ok(RocksDB::get_with_rc_logic(col, result))
    }

    fn iter_without_rc_logic<'a>(
        &'a self,
        col: DBCol,
    ) -> Box<dyn Iterator<Item = (Box<[u8]>, Box<[u8]>)> + 'a> {
        let read_options = rocksdb_read_options();
        unsafe {
            let cf_handle = &*self.cfs[col as usize];
            let iterator = self.db.iterator_cf_opt(cf_handle, read_options, IteratorMode::Start);
            Box::new(iterator)
        }
    }

    fn iter<'a>(&'a self, col: DBCol) -> Box<dyn Iterator<Item = (Box<[u8]>, Box<[u8]>)> + 'a> {
        let read_options = rocksdb_read_options();
        unsafe {
            let cf_handle = &*self.cfs[col as usize];
            let iterator = self.db.iterator_cf_opt(cf_handle, read_options, IteratorMode::Start);
            RocksDB::iter_with_rc_logic(col, iterator)
        }
    }

    fn iter_prefix<'a>(
        &'a self,
        col: DBCol,
        key_prefix: &'a [u8],
    ) -> Box<dyn Iterator<Item = (Box<[u8]>, Box<[u8]>)> + 'a> {
        // NOTE: There is no Clone implementation for ReadOptions, so we cannot really reuse
        // `self.read_options` here.
        let mut read_options = rocksdb_read_options();
        read_options.set_prefix_same_as_start(true);
        unsafe {
            let cf_handle = &*self.cfs[col as usize];
            // This implementation is copied from the RocksDB implementation of `prefix_iterator_cf`
            // since there is no `prefix_iterator_cf_opt` method.
            let iterator = self
                .db
                .iterator_cf_opt(
                    cf_handle,
                    read_options,
                    IteratorMode::From(key_prefix, Direction::Forward),
                )
                .take_while(move |(key, _value)| key.starts_with(key_prefix));
            RocksDB::iter_with_rc_logic(col, iterator)
        }
    }

    fn write(&self, transaction: DBTransaction) -> Result<(), DBError> {
        if let Err(check) = self.pre_write_check() {
            if check.is_io() {
                warn!("unable to verify remaining disk space: {}, continuing write without verifying (this may result in unrecoverable data loss if disk space is exceeded)", check)
            } else {
                panic!("{}", check)
            }
        }
        let mut batch = WriteBatch::default();
        for op in transaction.ops {
            match op {
                DBOp::Insert { col, key, value } => unsafe {
                    batch.put_cf(&*self.cfs[col as usize], key, value);
                },
                DBOp::UpdateRefcount { col, key, value } => unsafe {
                    assert!(col.is_rc());
                    batch.merge_cf(&*self.cfs[col as usize], key, value);
                },
                DBOp::Delete { col, key } => unsafe {
                    batch.delete_cf(&*self.cfs[col as usize], key);
                },
                DBOp::DeleteAll { col } => {
                    let cf_handle = unsafe { &*self.cfs[col as usize] };
                    let opt_first = self.db.iterator_cf(cf_handle, IteratorMode::Start).next();
                    let opt_last = self.db.iterator_cf(cf_handle, IteratorMode::End).next();
                    assert_eq!(opt_first.is_some(), opt_last.is_some());
                    if let (Some((min_key, _)), Some((max_key, _))) = (opt_first, opt_last) {
                        batch.delete_range_cf(cf_handle, &min_key, &max_key);
                        // delete_range_cf deletes ["begin_key", "end_key"), so need one more delete
                        batch.delete_cf(cf_handle, max_key)
                    }
                }
            }
        }
        Ok(self.db.write(batch)?)
    }

    fn as_rocksdb(&self) -> Option<&RocksDB> {
        Some(self)
    }
}

The write method would be responsible for invalidating LRU cache entries.
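A minimal sketch of what this read-through cache could look like, assuming the `cached` crate suggested above (`SizedCache` plus the `Cached` trait). The `CachedDb` wrapper, the `lookup` closure standing in for the actual RocksDB read, and the 10_000-entry capacity are all hypothetical; they only illustrate the get/invalidate shape, not the actual nearcore change:

use std::sync::Mutex;

use cached::{Cached, SizedCache};

// Cache key: (column index, raw key bytes). Values are Option<Vec<u8>> so
// that "key not present" results can be cached as well.
type CacheKey = (u8, Vec<u8>);
type CacheValue = Option<Vec<u8>>;

struct CachedDb {
    // SizedCache is a fixed-capacity LRU; the Mutex is needed because
    // Database::get only takes &self while the cache needs &mut access.
    cache: Mutex<SizedCache<CacheKey, CacheValue>>,
}

impl CachedDb {
    fn new(capacity: usize) -> Self {
        Self { cache: Mutex::new(SizedCache::with_size(capacity)) }
    }

    // Read-through get: serve from the LRU if possible, otherwise call the
    // backing store (the closure stands in for the RocksDB read) and cache it.
    fn get(&self, col: u8, key: &[u8], lookup: impl FnOnce() -> CacheValue) -> CacheValue {
        let cache_key = (col, key.to_vec());
        if let Some(hit) = self.cache.lock().unwrap().cache_get(&cache_key) {
            return hit.clone();
        }
        let value = lookup();
        self.cache.lock().unwrap().cache_set(cache_key, value.clone());
        value
    }

    // The write path must evict (or overwrite) affected entries so that
    // readers never observe stale data, as noted above.
    fn invalidate(&self, col: u8, key: &[u8]) {
        self.cache.lock().unwrap().cache_remove(&(col, key.to_vec()));
    }
}

fn main() {
    let db = CachedDb::new(10_000);
    // First read misses and consults the backing store.
    assert_eq!(db.get(0, b"alice", || Some(b"v1".to_vec())), Some(b"v1".to_vec()));
    // Second read is served from the cache; the closure is never invoked.
    assert_eq!(db.get(0, b"alice", || -> CacheValue { unreachable!() }), Some(b"v1".to_vec()));
    // A write to the same key invalidates the entry.
    db.invalidate(0, b"alice");
    assert_eq!(db.get(0, b"alice", || Some(b"v2".to_vec())), Some(b"v2".to_vec()));
}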

Unfortunately, AFAIR there is a suboptimality in near-api-js that forces it to re-request all access keys for an account before signing a transaction. This will likely touch the iter_* methods (most likely iter_prefix) of impl Database for RocksDB, which we might also want to wrap in the cache. CC @MaximusHaximus , @frol

We would most likely want to enable this cache only for RPC nodes, and we should let the Contract Runtime team know about its existence, since it might affect the measurements of the param estimator.

@MaksymZavershynskyi added the A-storage, T-public-interfaces, and T-node labels Nov 11, 2021
@bowenwang1996 (Collaborator) commented:

RocksDB has an in-memory cache internally, and we have it enabled.

@MaksymZavershynskyi (Contributor, Author) commented:

> RocksDB has an in-memory cache internally, and we have it enabled.

Given the complexity of RocksDB and the fact that we have revisited its configuration only once, I don't think we can be sure that we configured it correctly or that it is optimized for view calls.

@bowenwang1996 (Collaborator) commented:

Indeed, we currently set the cache size for all columns to 32 MB:

// We create block_cache for each of 47 columns, so the total cache size is 32 * 47 = 1504mb

but not all columns are equal. I think we can and should allocate more for ColState, for example 256 MB or even 512 MB. @mina86 @pmnoxx, given that you have worked with RocksDB before, I wonder whether this is something you could help with.
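For a sense of the memory budget, a quick back-of-the-envelope sketch using the numbers from the comment above (47 columns at 32 MB each) and a hypothetical larger allocation only for ColState:

// Rough total block-cache footprint if every column keeps the 32 MB default
// except ColState, which gets a larger dedicated cache.
fn total_cache_mb(num_columns: u64, default_mb: u64, col_state_mb: u64) -> u64 {
    (num_columns - 1) * default_mb + col_state_mb
}

fn main() {
    println!("current (32 MB everywhere): {} MB", total_cache_mb(47, 32, 32));  // 1504 MB
    println!("ColState at 256 MB:         {} MB", total_cache_mb(47, 32, 256)); // 1728 MB
    println!("ColState at 512 MB:         {} MB", total_cache_mb(47, 32, 512)); // 1984 MB
}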

@pmnoxx (Contributor) commented Nov 12, 2021

@bowenwang1996 I can write a PR like this: #5212

near-bulldozer bot pushed a commit that referenced this issue Nov 13, 2021
We recently found out that our rocksdb is not efficient enough.
Therefore we would like to increase the cache size of specific columns.
In this PR we increase the cache size for `DBCol::ColState` to 128mb.

We can adjust that cache size later, but we are trying to be conservative with memory usage.
Some testing should be done to determine optimal cache sizes.


Closes #5203
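A hypothetical sketch of the shape of such a change, not the actual nearcore code: the `DBCol` enum is stubbed with a few variants and the helper `block_cache_size` is made up for illustration, keeping the 32mb default for every column and bumping only `DBCol::ColState` to 128mb.

const MB: usize = 1024 * 1024;

// Stub of a few columns; the real DBCol enum has around 47 variants.
#[allow(dead_code)]
enum DBCol {
    ColBlock,
    ColState,
    ColTransactions,
}

// Per-column block-cache size: conservative default everywhere, with a
// larger cache only for the hot State column.
fn block_cache_size(col: &DBCol) -> usize {
    match col {
        DBCol::ColState => 128 * MB,
        _ => 32 * MB,
    }
}

fn main() {
    assert_eq!(block_cache_size(&DBCol::ColState), 128 * MB);
    assert_eq!(block_cache_size(&DBCol::ColBlock), 32 * MB);
}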
@pmnoxx reopened this Nov 13, 2021
@ilblackdragon (Member) commented:

I would suggest making this parameter a node configuration option.

This way, RPC nodes with more RAM can configure a larger cache than validator nodes or other nodes that don't need to constantly respond to view calls.

@pmnoxx (Contributor) commented Nov 14, 2021

> I would suggest making this parameter a node configuration option.
>
> This way, RPC nodes with more RAM can configure a larger cache than validator nodes or other nodes that don't need to constantly respond to view calls.

@ilblackdragon I can add it to config.json, is that what you meant?
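A hypothetical sketch of what such a config.json knob could look like, assuming a serde-deserialized store section; the field name `col_state_cache_size_mb` and the 128mb default are made up for illustration and are not an existing nearcore option:

use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct StoreConfig {
    // Block-cache size for the State column, in megabytes. RPC operators with
    // spare RAM can raise this in config.json; validators can keep the default.
    #[serde(default = "default_col_state_cache_size_mb")]
    col_state_cache_size_mb: u64,
}

fn default_col_state_cache_size_mb() -> u64 {
    128
}

fn main() {
    // e.g. an RPC node overriding the default in its config.json:
    let json = r#"{ "col_state_cache_size_mb": 512 }"#;
    let config: StoreConfig = serde_json::from_str(json).unwrap();
    println!("{:?}", config);
}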

@stale bot commented Feb 12, 2022

This issue has been automatically marked as stale because it has not had recent activity in the last 2 months.
It will be closed in 7 days if no further activity occurs.
Thank you for your contributions.

@andrei-near (Contributor) commented May 30, 2024

The State column cache size parameter has already been implemented in #6584.
