Feedback after complete mini-lsm.

First of all, thank you for such a great tutorial on LSM-Tree storage engine.

After finishing 3 weeks project, I have got a lot of to share, including some questions and suggestions.

### 1. StorageIterator specification

Is there an implicit StorageIterator specification?

In week 1 project, there are several iterators that implement StorageIterator.

As I implemented these iterators, I often asked myself, "Have I implemented these iterators in a coherent way?"

It turns out that I wrote the same assertions again and again, so I feel there is some implicit specification that I should apply.

The spec I summarized is listed below:

* `fn key(&self) -> KeyType<'_>`
    * This function can be called iff the iterator is valid, if it is called from an invalid iterator, we could panic.
    * The return value must be an valid key, which means it is an non-empty key.
* `fn value(&self) -> &[u8]`
    * This function can be called iff the iterator is valid, if it is called from an invalid iterator, we could panic.
* `fn is_valid(&self) -> bool`
    * This function return `false` iff its underlying storage cannot produce a new element.
    * For `FuseIterator`, it also return `false` after `next` return an error.
* `fn num_active_iterators(&self) -> usize`
    * For invalid iterator, it should return 0
    * For valid iterator, it should not return 0.
    * For `FuseIterator`, when the iterator is valid, it should not return 0. when the iterator is invalid, the return value is unspecified.
* `fn next(&mut self) -> Result<()>`
    * For invalid iterator, it is no-op.
    * For valid iterator, if return `Ok(())`
        * If underlying storage cannot produce a new element, turn into invalid iterator. return `Ok(())`
        * Otherwise, store the new element. return `Ok(())`
        * The key from new element should strict greater than old key.
    * For valid iterator, if return `Err(e)`
        * The `key`, `value`, `is_valid` call should return same value as before, that means act like no-op.
        * For compound of iterators like `MergeIterator`, `SstConcatIterator` and `TwoMergeIterator`, it should remove the underlying iterator which cause the error. (I'm not sure whether is sound for `SstConcatIterator` and `TwoMergeIterator`)
        * For `FuseIterator`, it should turn into invalid iterator.

I'm not sure if it covers all situations.

### 2. Inconsistent function signature

As mentioned in #72. There is an inconsistent function signature in `decode_block_meta`.

https://github.com/skyzh/mini-lsm/blob/dd333ca5a7c1684357f335c304f885b287fbf8e0/mini-lsm-starter/src/table.rs#L45-L46
https://github.com/skyzh/mini-lsm/blob/dd333ca5a7c1684357f335c304f885b287fbf8e0/mini-lsm/src/table.rs#L63-L64

Until week 2 day 7, the signature is fine, but it turns out that this is not sufficient for checksum functionality.
* The original signature implied that this function is infallible. But checksum will cause an error, so the return type should be wrapped in `Result`,
* `&[u8]` is more convenient than `impl Buf` when implementing checksum functionality.

### 3. key always non-empty

It seems that an empty key is considered invalid. So we can enforce it by adding `ParsedKey` type, we could use [`nutype`](https://docs.rs/nutype/latest/nutype/) crate to achieve this goal.

But it may not be necessary for an educational project, since it is convenient to use an empty key instead of `Option` wrapper.

### 4. `apply_result` does not point out what it should return

https://github.com/skyzh/mini-lsm/blob/dd333ca5a7c1684357f335c304f885b287fbf8e0/mini-lsm-starter/src/compact/simple_leveled.rs#L41-L48

As the document does point out that `LsmStorageState` should be the new state, but it does not point out what `Vec<usize>` means.
Only after I exam solution code, I figure out that should be sst_ids which need be removed.

### 5. Seedable level compaction simulator

The level compaction simulator uses rand to generate the key range, so each run will produce different output, making it difficult to verify that the output is identical to the reference solution by using command line tools such as `diff` `cmp`.

One solution is to add the seed flag to the simulator, and use `rand::SeedableRng` to generate random numbers.

### 6. `println!()` to log

In the reference solution, the implementation of `apply_result` contains some of `println!()` to log information, making it difficult to verify that the output is identical to the reference solution by using command line tools such as `diff`, `cmp`.

I think it should be replaced with `eprintln!()`, which will output to `stderr` and does not lose logging functionality.

### 7. Encourage to add test

In rust project, adding test is pretty easy! and adding tests is an easiest way to contribute to repository.

For example, when implementing manifest functionality, it is good to test `serde_json` library first.
So I can quickly write relative test within `manifest.rs`.

```rust
#[cfg(test)]
#[test]
fn test_record_serde() {
    let record1 = ManifestRecord::Flush(1);
    let json = serde_json::to_vec(&record1).unwrap();
    let record2 = serde_json::from_slice(&json).unwrap();
    assert_eq!(record1, record2);
}
```

Most of IDEs have the ability to run tests directly in Rust, which is different from C/C++ or Java, where you need some sort of test framework or use the main function to run a single simple test.

### 8. extract apply result logic to bind specific impl

When implementing `apply_result` function, I use a lot of related method of `Vec` in my [solution](https://github.com/Foreverhighness/mini-lsm/blob/solution/mini-lsm-starter/src/compact/leveled.rs#L201),
e.g. [`Vec::splice`](https://doc.rust-lang.org/std/vec/struct.Vec.html#method.splice), [`Vec::drain`](https://doc.rust-lang.org/std/vec/struct.Vec.html#method.drain). But the levels `Vec<(usize, Vec<usize>)>` seems inefficient for doing compaction.

We could generalize levels impl like `levels: Impl LevelsImpl` and create a type alias `type Levels = Vec<(usize, Vec<usize>)>;`, and impl `LevelsImpl` trait for `Vec<(usize, Vec<usize>)>`, and we have the ability to easily switch to another implementation.

But it is of low priority, since this project is just for educational purposes.

### 9. immutable memtable is not immutated

Due to the interior mutability of `MemTable`, there is no way to forbid programmers from modifying immutable memtables in `Vec<Arc<MemTable>>`.

Which may (low probability) cause a nasty bug. But could be avoided with a little care.

The solution is quite simple: Use the [`Newtype`](https://rust-unofficial.github.io/patterns/patterns/behavioural/newtype.html) pattern and expose only the read-only interface.

### 10. Migration to mvcc version is considered painful

It is kind of dirty work, and frustrating, to refactor code to use the key+ts representation.

The code that needs to be changed is scattered across many files.

Asking students to write unsafe code in Rust is awkward.

Learning on code that does not complied is quite inefficient.

### 11. use `Arc_cycle` to store weak pointer into LsmMvccInner

`pub fn new_txn(&self, inner: Arc<LsmStorageInner>, serializable: bool) -> Arc<Transaction>`

Strictly speaking, `LsmMvccInner::new_txn` can only be called with the `LsmStorageInner` that created it, but its semantics do not restrict this.

For example, if there are two different `LsmStorageInner` named `storageA` and `storageB` with associated `LsmMvccInner` named `mvccA` and `mvccB`,
nobody forbids us to call `mvccB.new_txn(storageA)` or `mvccA.new_txn(storageB)`, which is obviously a logical error.

However, as project developers, we know that there is only one `LsmStorageInner` instance, so this won't be a critical issue.

But ideally, we can do better by using `Arc::Arc_cycle` to pass a weak pointer directly into `LsmMvccInner::new`.

### 12. TxnIterator, LsmIterator is not compatible on non-mvcc version

Currently `TxnIterator` uses `TwoMergeIterator<TxnLocalIterator, LsmIterator>` to iterate over the underlying storage.

When switching to the non-mvcc version, creating an empty transaction without mvcc can be tricky.

It might be better to use `Option<A>` in `TwoMergeIterator` and `Option<Arc<Transaction>>` in `TxnIterator` to handle this situation.

---

Again, thank you for such a great tutorial on LSM-Tree storage engine.

	/// Apply the compaction result.
	///
	/// The compactor will call this function with the compaction task and the list of SST ids generated. This function applies the
	/// result and generates a new LSM state. The functions should only change `l0_sstables` and `levels` without changing memtables
	/// and `sstables` hash map. Though there should only be one thread running compaction jobs, you should think about the case
	/// where an L0 SST gets flushed while the compactor generates new SSTs, and with that in mind, you should do some sanity checks
	/// in your implementation.
	pub fn apply_compaction_result(

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feedback after complete mini-lsm. #88

1. StorageIterator specification

2. Inconsistent function signature

3. key always non-empty

4. `apply_result` does not point out what it should return

5. Seedable level compaction simulator

6. `println!()` to log

7. Encourage to add test

8. extract apply result logic to bind specific impl

9. immutable memtable is not immutated

10. Migration to mvcc version is considered painful

11. use `Arc_cycle` to store weak pointer into LsmMvccInner

12. TxnIterator, LsmIterator is not compatible on non-mvcc version

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

	/// Decode block meta from a buffer.
	pub fn decode_block_meta(buf: impl Buf) -> Vec<BlockMeta> {

	/// Decode block meta from a buffer.
	pub fn decode_block_meta(mut buf: &[u8]) -> Result<Vec<BlockMeta>> {