Description
First of all, thank you for such a great tutorial on LSM-Tree storage engine.
After finishing 3 weeks project, I have got a lot of to share, including some questions and suggestions.
1. StorageIterator specification
Is there an implicit StorageIterator specification?
In week 1 project, there are several iterators that implement StorageIterator.
As I implemented these iterators, I often asked myself, "Have I implemented these iterators in a coherent way?"
It turns out that I wrote the same assertions again and again, so I feel there is some implicit specification that I should apply.
The spec I summarized is listed below:
fn key(&self) -> KeyType<'_>
- This function can be called iff the iterator is valid, if it is called from an invalid iterator, we could panic.
- The return value must be an valid key, which means it is an non-empty key.
fn value(&self) -> &[u8]
- This function can be called iff the iterator is valid, if it is called from an invalid iterator, we could panic.
fn is_valid(&self) -> bool
- This function return
false
iff its underlying storage cannot produce a new element. - For
FuseIterator
, it also returnfalse
afternext
return an error.
- This function return
fn num_active_iterators(&self) -> usize
- For invalid iterator, it should return 0
- For valid iterator, it should not return 0.
- For
FuseIterator
, when the iterator is valid, it should not return 0. when the iterator is invalid, the return value is unspecified.
fn next(&mut self) -> Result<()>
- For invalid iterator, it is no-op.
- For valid iterator, if return
Ok(())
- If underlying storage cannot produce a new element, turn into invalid iterator. return
Ok(())
- Otherwise, store the new element. return
Ok(())
- The key from new element should strict greater than old key.
- If underlying storage cannot produce a new element, turn into invalid iterator. return
- For valid iterator, if return
Err(e)
- The
key
,value
,is_valid
call should return same value as before, that means act like no-op. - For compound of iterators like
MergeIterator
,SstConcatIterator
andTwoMergeIterator
, it should remove the underlying iterator which cause the error. (I'm not sure whether is sound forSstConcatIterator
andTwoMergeIterator
) - For
FuseIterator
, it should turn into invalid iterator.
- The
I'm not sure if it covers all situations.
2. Inconsistent function signature
As mentioned in #72. There is an inconsistent function signature in decode_block_meta
.
mini-lsm/mini-lsm-starter/src/table.rs
Lines 45 to 46 in dd333ca
mini-lsm/mini-lsm/src/table.rs
Lines 63 to 64 in dd333ca
Until week 2 day 7, the signature is fine, but it turns out that this is not sufficient for checksum functionality.
- The original signature implied that this function is infallible. But checksum will cause an error, so the return type should be wrapped in
Result
, &[u8]
is more convenient thanimpl Buf
when implementing checksum functionality.
3. key always non-empty
It seems that an empty key is considered invalid. So we can enforce it by adding ParsedKey
type, we could use nutype
crate to achieve this goal.
But it may not be necessary for an educational project, since it is convenient to use an empty key instead of Option
wrapper.
4. apply_result
does not point out what it should return
mini-lsm/mini-lsm-starter/src/compact/simple_leveled.rs
Lines 41 to 48 in dd333ca
As the document does point out that LsmStorageState
should be the new state, but it does not point out what Vec<usize>
means.
Only after I exam solution code, I figure out that should be sst_ids which need be removed.
5. Seedable level compaction simulator
The level compaction simulator uses rand to generate the key range, so each run will produce different output, making it difficult to verify that the output is identical to the reference solution by using command line tools such as diff
cmp
.
One solution is to add the seed flag to the simulator, and use rand::SeedableRng
to generate random numbers.
6. println!()
to log
In the reference solution, the implementation of apply_result
contains some of println!()
to log information, making it difficult to verify that the output is identical to the reference solution by using command line tools such as diff
, cmp
.
I think it should be replaced with eprintln!()
, which will output to stderr
and does not lose logging functionality.
7. Encourage to add test
In rust project, adding test is pretty easy! and adding tests is an easiest way to contribute to repository.
For example, when implementing manifest functionality, it is good to test serde_json
library first.
So I can quickly write relative test within manifest.rs
.
#[cfg(test)]
#[test]
fn test_record_serde() {
let record1 = ManifestRecord::Flush(1);
let json = serde_json::to_vec(&record1).unwrap();
let record2 = serde_json::from_slice(&json).unwrap();
assert_eq!(record1, record2);
}
Most of IDEs have the ability to run tests directly in Rust, which is different from C/C++ or Java, where you need some sort of test framework or use the main function to run a single simple test.
8. extract apply result logic to bind specific impl
When implementing apply_result
function, I use a lot of related method of Vec
in my solution,
e.g. Vec::splice
, Vec::drain
. But the levels Vec<(usize, Vec<usize>)>
seems inefficient for doing compaction.
We could generalize levels impl like levels: Impl LevelsImpl
and create a type alias type Levels = Vec<(usize, Vec<usize>)>;
, and impl LevelsImpl
trait for Vec<(usize, Vec<usize>)>
, and we have the ability to easily switch to another implementation.
But it is of low priority, since this project is just for educational purposes.
9. immutable memtable is not immutated
Due to the interior mutability of MemTable
, there is no way to forbid programmers from modifying immutable memtables in Vec<Arc<MemTable>>
.
Which may (low probability) cause a nasty bug. But could be avoided with a little care.
The solution is quite simple: Use the Newtype
pattern and expose only the read-only interface.
10. Migration to mvcc version is considered painful
It is kind of dirty work, and frustrating, to refactor code to use the key+ts representation.
The code that needs to be changed is scattered across many files.
Asking students to write unsafe code in Rust is awkward.
Learning on code that does not complied is quite inefficient.
11. use Arc_cycle
to store weak pointer into LsmMvccInner
pub fn new_txn(&self, inner: Arc<LsmStorageInner>, serializable: bool) -> Arc<Transaction>
Strictly speaking, LsmMvccInner::new_txn
can only be called with the LsmStorageInner
that created it, but its semantics do not restrict this.
For example, if there are two different LsmStorageInner
named storageA
and storageB
with associated LsmMvccInner
named mvccA
and mvccB
,
nobody forbids us to call mvccB.new_txn(storageA)
or mvccA.new_txn(storageB)
, which is obviously a logical error.
However, as project developers, we know that there is only one LsmStorageInner
instance, so this won't be a critical issue.
But ideally, we can do better by using Arc::Arc_cycle
to pass a weak pointer directly into LsmMvccInner::new
.
12. TxnIterator, LsmIterator is not compatible on non-mvcc version
Currently TxnIterator
uses TwoMergeIterator<TxnLocalIterator, LsmIterator>
to iterate over the underlying storage.
When switching to the non-mvcc version, creating an empty transaction without mvcc can be tricky.
It might be better to use Option<A>
in TwoMergeIterator
and Option<Arc<Transaction>>
in TxnIterator
to handle this situation.
Again, thank you for such a great tutorial on LSM-Tree storage engine.