v3-alpha: Rbyds and B-trees and gcksums, oh my! #1111
geky wants to merge 1,782 commits into master from v3-alpha (+194,814 −28,600)
Globs in CLI attrs (-L'*=bs=%(bs)s' for example) have been remarkably
useful. It makes sense to extend this to the other flags that match
against CSV fields, though this does add complexity to a large number of
smaller scripts.
- -D/--define can now use globs when filtering:
$ ./scripts/code.py lfs.o -Dfunction='lfsr_file_*'
-D/--define already accepted a comma-separated list of options, so
extending this to globs makes sense.
Note this differs from test.py/bench.py's -D/--define. Globbing in
test.py/bench.py wouldn't really work since -D/--define is generative,
not matching. But there are already other differences such as integer
parsing, range, etc. It's not worth making these perfectly consistent
as they are really two different tools that just happen to look the
same.
- -c/--compare now matches with globs when finding the compare entry:
$ ./scripts/code.py lfs.o -c'lfs*_file_sync'
This is quite a bit less useful than -D/--define, but makes sense for
consistency.
Note -c/--compare just chooses the first match. It doesn't really make
sense to compare against multiple entries.
This raised the question of globs in the field specifiers themselves
(-f'bench_*' for example), but I'm rejecting this for now as I need to
draw the complexity/scope _somewhere_, and I'm worried it's already way
over on the too-complex side.
So, for now, field names must always be specified explicitly. Globbing
field names would add too much complexity, especially considering how
many flags accept field names in these scripts.
So now the hidden variants of field specifiers can be used to manipulate
by fields and field fields without implying a complete field set:
$ ./scripts/csv.py lfs.code.csv \
-Bsubsystem=lfsr_file -Dfunction='lfsr_file_*' \
-fcode_size
Is the same as:
$ ./scripts/csv.py lfs.code.csv \
-bfile -bsubsystem=lfsr_file -Dfunction='lfsr_file_*' \
-fcode_size
Attempting to use -b/--by here would delete/merge the file field, as
csv.py assumes -b/-f specify all of the relevant field types.
Note that fields can also be explicitly deleted with -D/--define's new
glob support:
$ ./scripts/csv.py lfs.code.csv -Dfile='*' -fcode_size
---
This solves an annoying problem specific to csv.py, where manipulating
by fields and field fields would often force you to specify all relevant
-b/-f fields. With how benchmarks are parameterized, this list ends up
_looong_.
It's a bit of a hack/abuse of the hidden flags, but the alternative
would be field globbing, which 1. would be a real pain-in-the-ass to
implement, and 2. affect almost all of the scripts. Reusing the hidden
flags for this keeps the complexity limited to csv.py.
This adds __csv__ methods to all Csv* classes to indicate how to write
csv/json output, and adopts Python's default float repr. As a plus, this
also lets us use "inf" for infinity in csv/json files, avoiding potential
unicode issues.
Before this we were reusing __str__ for both table rendering and csv/json
writing, which rounded to a single decimal digit! This made float output
pretty much useless outside of trivial cases.
---
Note Python apparently does some of its own rounding (1/10 -> 0.1?), so the
result may still not be round-trippable, but this is probably fine for our
somewhat hack-infested csv scripts.
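As a rough illustration of the split (a hedged sketch; the actual Csv*
classes are more involved and these method bodies are assumptions):

    class CsvFloat:
        def __init__(self, x):
            self.x = float(x)

        def __str__(self):
            # table rendering, rounded for readability
            return '%.1f' % self.x

        def __csv__(self):
            # csv/json output, Python's default repr round-trips and
            # renders infinity as plain "inf"
            return repr(self.x)

    print(str(CsvFloat(1/3)))       # -> 0.3
    print(CsvFloat(1/3).__csv__())  # -> 0.3333333333333333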
Whoops! A missing splat repetition here meant we only ever accepted floats with a single digit of precision and no e/E exponents. Humorously this went unnoticed because our scripts were only _outputting_ single digit floats, but now that that's fixed, float parsing also needs a fix. Fixed by allowing >1 digit of precision in our CsvFloat regex.
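To illustrate the class of bug (a sketch, the scripts' actual pattern
differs): a fractional part written without a repetition only ever accepts
one digit:

    import re

    # before: at most one digit of precision, no exponent
    broken = re.compile(r'[+-]?[0-9]+(\.[0-9])?')
    # after: any number of fractional digits, optional e/E exponent
    fixed = re.compile(r'[+-]?[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?')

    assert not broken.fullmatch('3.14159')
    assert fixed.fullmatch('3.14159')
    assert fixed.fullmatch('1e-9')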
Before this, the only option for ordering the legend was by specifying
explicit -L/--add-label labels. This works for the most part, but
doesn't cover the case where you don't know the parameterization of the
input data.
And we already have -s/-S flags in other csv scripts, so it makes sense
to adopt them in plot.py/plotmpl.py to allow sorting by one or more
explicit fields.
Note that -s/-S can be combined with explicit -L/--add-labels to order
datasets with the same sort field:
$ ./scripts/plot.py bench.csv \
-bBLOCK_SIZE \
-xn \
-ybench_readed \
-ybench_proged \
-ybench_erased \
--legend \
-sBLOCK_SIZE \
-L'*,bench_readed=bs=%(BLOCK_SIZE)s' \
-L'*,bench_proged=' \
-L'*,bench_erased='
---
Unfortunately this conflicted with -s/--sleep, which is a common flag in
the ascii-art scripts. This was bound to conflict with -s/--sort
eventually, so I came up with some alternatives:
- -s/--sleep -> -~/--sleep
- -S/--coalesce -> -+/--coalesce
But I'll admit I'm not the happiest about these...
This was a simple typo. Unfortunately it went unnoticed because the lingering dataset assigned in the above for loop made the results look mostly correct. Yay.
This should be floor (rounds towards -inf), not int (rounds towards zero),
otherwise sub-integer results get funky:
- floor si(0.00001)  => 10u
- int   si(0.00001)  => 0.01m
- floor si(0.000001) => 1u
- int   si(0.000001) => m (???)
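Roughly what's going on (a sketch with assumed logic, not the script's
actual code): the SI exponent comes from a log that needs to round towards
-inf, and truncating towards zero picks the wrong prefix for values below 1:

    import math

    SI = {-2: 'u', -1: 'm', 0: '', 1: 'K'}

    def si(x, use_floor=True):
        e = math.log(abs(x), 1000)
        # int() truncates towards zero, floor() rounds towards -inf
        e = math.floor(e) if use_floor else int(e)
        return '%g%s' % (x / 1000**e, SI.get(e, '?'))

    print(si(0.00001))                   # -> 10u
    print(si(0.00001, use_floor=False))  # -> 0.01m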
Whoops, looks like cumulative results were overlooked when multiple bench measurements per bench were added. We were just adding all cumulative results together! This led to some very confusing bench results. The solution here is to keep track of per-measurement cumulative results via a Python dict. Which adds some memory usage, but definitely not enough to be noticeable in the context of the bench-runner.
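The fix boils down to keying the running sums by measurement (a minimal
sketch with assumed names, not the bench-runner's actual code):

    # before: one running sum shared by all measurements
    # total += value
    # after: one running sum per measurement
    cumulative = {}
    for meas, value in [('bench_readed', 4), ('bench_proged', 2), ('bench_readed', 8)]:
        cumulative[meas] = cumulative.get(meas, 0) + value
    print(cumulative)  # {'bench_readed': 12, 'bench_proged': 2}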
This prevents runaway O(n^2) behavior on devices with extremely large block sizes (NAND, bs=~128KiB - ~1MiB). The whole point of shrubs is to avoid this O(n^2) runaway when inline files become necessarily large. Setting FRAGMENT_SIZE to a factor of the BLOCK_SIZE humorously defeats this.
The 512 byte cutoff is somewhat arbitrary; it's the natural BLOCK_SIZE/8 FRAGMENT_SIZE on most NOR flash (bs=4096), but it's probably worth tuning based on actual device performance.
This adds mattr_estimate, which is basically the same as rattr_estimate,
but assumes weight <= 1:
rattr tag:
.---+---+---+- -+- -+- -+- -+---+- -+- -+- -. worst case: <=11 bytes
| tag | weight | size | rattr est: <=3t + 4
'---+---+---+- -+- -+- -+- -+---+- -+- -+- -' <=37 bytes
mattr tag:
.---+---+---+---+- -+- -+- -. worst case: <=7 bytes
| tag | w | size | mattr est: <=3t + 4
'---+---+---+---+- -+- -+- -' <=25 bytes
This may seem like only a minor improvement, but with 3 tags for every
attr, this really adds up. And with our compaction estimate overheads we
need every byte of savings we can get.
---
This ended up necessary to get littlefs running with 512 byte blocks
again. Now that our compaction overheads are so high, littlefs is having
a hard time fitting even just the filesystem config in a single block:
mroot estimate 512B before: 246/256
mroot estimate 512B after: 162/256 (-34.1%)
Whether or not it makes sense to run littlefs with 512 byte blocks is
still an open question, even after this tweak.
Note that even if 512 byte blocks end up intractable, this doesn't mean
littlefs won't be able to run on SD/eMMC! The configured block_size can
always be a multiple (>=) of the physical block_size, and choosing a
larger block_size completely side-steps this problem.
The new design of littlefs is primarily focused on devices with very
large block sizes, so you may want to use larger block sizes on SD/eMMC
for performance reasons anyways.
---
Code changes were pretty minimal. This does add an additional field to
lfs_t, but it's just a byte and fits into padding with the other small
precomputed constants:
code stack ctx
before: 35824 2368 636
after: 35836 (+0.0%) 2368 (+0.0%) 636 (+0.0%)
So:
$(filter-out %.t.c %.b.c %.a.c,$(wildcard bd/*.c))
Instead of:
$(filter-out $(wildcard bd/*.t.* bd/*.b.*),$(wildcard bd/*.c))
The main benefit is we no longer need to explicitly specify all
subdirectories, though the single wildcard is a bit less flexible if
test.py/bench.py ever end up with other non-C artifacts.
Unfortunately only a single wildcard is supported in filter-out.
This adds --xlim-stddev and --ylim-stddev as alternatives to -X/--xlim and -Y/--ylim that define the plot limits in terms of standard deviations from the mean, instead of in absolute values.
So want to only plot data within +-1 standard deviation? Use:
$ ./scripts/plot.py --ylim-stddev=-1,+1
Want to ignore outliers >3 standard deviations? Use:
$ ./scripts/plot.py --ylim-stddev=3
This is very useful for plotting the amortized/per-byte benchmarks, which have a tendency to run off towards infinity near zero.
Before, we could truncate data explicitly with -Y/--ylim, but this was getting very tedious and doesn't work well when you don't know what the data is going to look like beforehand.
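The mapping to absolute limits is roughly this (a sketch under assumed
semantics, not plotmpl.py's actual implementation):

    import statistics

    def stddev_lim(ys, lo, hi):
        mean = statistics.mean(ys)
        std = statistics.stdev(ys)
        return (mean + lo*std, mean + hi*std)

    # --ylim-stddev=-1,+1 -> only plot data within +-1 standard deviation
    ys = [1, 2, 2, 3, 3, 3, 4, 1000]
    print(stddev_lim(ys, -1, +1))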
Mainly to avoid confusion with littlefs's attrs, uattrs, rattrs, etc. This risked things getting _really_ confusing as the scripts evolve.
- codemapd3.py -> codemapsvg.py
- dbgbmapd3.py -> dbgbmapsvg.py
- treemapd3.py -> treemapsvg.py
Originally these were named this way to match plotmpl.py, but these names were misleading. These scripts don't actually use the d3 library, they're just piles of Python, SVG, and Javascript, modelled after the excellent d3 treemap examples.
Keeping the *d3.py names around also felt a bit unfair to brendangregg's flamegraph SVGs, which were the inspiration for the interactive component. With d3 you would normally expect a rich HTML page, which is how you even include the d3 library.
plotmpl.py is also an outlier in that it supports both .svg and .png output. So having a different naming convention in this case makes sense to me.
So, renaming *d3.py -> *svg.py. The inspiration from d3 is still mentioned in the top-level comments in the relevant files.
This adds an alternative sync path for small in-cache files, where we
combine the shrub commit with the file sync commit, potentially writing
everything out in a single prog.
This is reminiscent of bmoss (old inlined) files, but notably avoids the
additional on-disk data-structure and extra code necessary to manage it.
---
The motivation for this comes from ongoing benchmarking, where we're
seeing a fairly significant regression in small-file performance on NAND
flash. Especially curious since the whole goal of this work was to make
NAND flash tractable.
But it makes sense: 2 commits are more than 1.
While the separate shrub + sync commits are barely noticeable on NOR
flash, on NAND flash, with its huge >512B prog sizes, the extra commit
is hard to miss.
In theory, the most performant solution would be to merge all bshrub
commits with sync commits whenever possible. This is technically doable,
and may make sense for a more performance-focused littlefs driver, but
it would 1. require an invasive code rewrite, 2. entangle lfsr_file_sync
-> lfsr_file_flush -> lfsr_file_carve, and 3. add even more code.
If we only merge shrub + sync commits when the file fits in the cache,
we can skip lfsr_file_flush, craft a simple shrubcommit by hand, and
avoid all of this mess. While still speeding up the most common write
path for small files.
And sure enough, our bench-many benchmark, which creates ~1000 4 byte
files, shows a ~2x speed improvement on bs=128KiB NAND (basically just
because we compact/split ~5 times instead of ~10 times).
---
Unfortunately the shrub commit requires quite a bit of state to set up,
and in the middle of lfsr_file_sync, one of the more critical functions
on our stack hot-path. So this does have a big cost:
code stack ctx
before: 35836 2368 636
after: 35992 (+0.4%) 2408 (+1.7%) 636 (+0.0%)
Though this is also a perfect contender to be compile-time ifdefed. It
may be worth adding something like LFS_NO_MERGESHRUBCOMMITS (better
name?) to claw back some of the cost if you don't care about
performance as much.
This could also probably be a bit cheaper if our file write configs were
organized differently... At the moment we need to check inline_size,
fragment_size, _and_ crystal_thresh since these can sometimes overlap.
But this is waiting on the future config rework.
---
Actually... Looking at this closer, I'm not sure the added commit logic
should really be included in the hot-path cost...
lfsr_file_flush is the hot path, and flush -> sync are sequential
operations that don't really share stack (with the shrub commit we
humorously _never_ call flush). The commit logic is only being dragged
in because our stack measurements are pessimistic about shrinkwrapping,
which is a bit frustrating.
I've explored shrinkwrapping in stack.py before, but the idea pretty
much failed. Unfortunately GCC simply doesn't make this info available
short of parsing the per-arch disassembly.
This adds LFS_NOINLINE, and forces lfsr_file_sync_ (the commit logic in
lfsr_file_sync) off the stack hot-path.
This adds a bit of code (function calls are surprisingly expensive), but
saves a nice big chunk of stack:
code stack ctx
before: 35992 2408 636
after: 36016 (+0.1%) 2296 (-4.7%) 636 (+0.0%)
Well, maybe not _real_ stack. The fact that this worked suggests the
real stack usage is less than our measured value.
The reason is that our stack.py script is relatively simple. It just
adds together stack frames based on the callgraph at compile time, which
misses shrinkwrapping and similar optimizations. Unfortunately that sort
of information is simply not available via GCC short of parsing the
disassembly.
But this is the number that will be used for statically allocated stacks,
and of course the number that will probably end up associated with
littlefs, so it still seems like a worthwhile number to "optimize" for.
Maybe in the future this will be different as tooling around stack
measurements improves.
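For context, a toy model of what stack.py effectively computes (simplified,
and the frame sizes below are made up, not real measurements): each
function's worst-case stack is its own frame plus the max over its callees
in the static callgraph, so the caller's full frame is always charged even
if, thanks to shrinkwrapping, much of it isn't live at the call site:

    def worst_stack(func, frames, callgraph):
        callees = callgraph.get(func, [])
        return frames[func] + max(
            (worst_stack(c, frames, callgraph) for c in callees),
            default=0)

    # hypothetical frame sizes, not real measurements
    frames = {'lfsr_file_sync': 64, 'lfsr_file_sync_': 96, 'lfsr_file_flush': 128}
    callgraph = {'lfsr_file_sync': ['lfsr_file_sync_', 'lfsr_file_flush']}
    print(worst_stack('lfsr_file_sync', frames, callgraph))  # 64 + 128 = 192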
---
The other benefit of moving lfsr_file_sync_ off the hot-path is that we
now no longer incorrectly include the sync commit context in the
hot-path. This tells a much different story for the cost of 1-commit
shrubs:
code stack ctx
before 1c-shrubs: 35848 2296 636
after 1c-shrubs: 36016 (+0.5%) 2296 (+0.0%) 636 (+0.0%)
Maybe it's just habit, but the trailing underscores_ felt far more useful serving only as an out-pointer/new/byproduct hint. Having trailing underscores_ serve dual purposes as both a new/byproduct hint and optional hint just muddies things and makes the hint much less useful.
No code changes.
TLDR: Added file->leaf, which can track file fragments (read only) and
blocks independently from file->b.shrub. This speeds up linear
read/write performance at a heavy code/stack cost.
The jury is still out on if this ends up reverted.
---
This is another change motivated by benchmarking, specifically the
significant regression in linear reads.
The problem is that CTZ skip-lists are actually _really_ good at
appending blocks! (but only appending blocks) The entire state of the
file is contained in the last block, so file writes can resume without
any reads. With B-trees, we need at least 1 B-tree lookup to resume
appending, and this really adds up when writing extremely large files.
To try to mitigate this, I added file->leaf, a single in-RAM bptr for
tracking the most recent leaf we've operated on. This avoids B-tree
lookups during linear reads, and allowing the leaf to fall out-of-sync
with the B-tree avoids both B-tree lookups and commits during writes.
Unfortunately this isn't a complete win for writes. If we write
fragments, i.e. cache_size < prog_size, we still need to incrementally
commit to the B-tree. Fragments are a bit annoying for caching as any
B-tree commit can discard the block they reside on.
For reading, however, this brings read performance back to roughly the
same as CTZ skip-lists.
---
This also turned into more-or-less a full rewrite of the lfsr_file_flush
-> lfsr_file_crystallize code path, which is probably a good thing. This
code needed some TLC.
file->leaf also replaces the previous eblock/eoff mechanism for
erased-state tracking via the new LFSR_BPTR_ISERASED flag. This should
be useful when exploring more erased-state tracking mechanisms (ddtree).
Unfortunately, all of this additional in-RAM state is very costly. I
think there's some cleanup that can be done (the current impl is a bit
of a mess/proof-of-concept), but this does add a significant chunk of
both code and stack:
code stack ctx
before: 36016 2296 636
after: 37228 (+3.4%) 2328 (+1.4%) 636 (+0.0%)
file->leaf also increases the size of lfsr_file_t, but this doesn't show
up in ctx because struct lfs_info dominates:
lfsr_file_t before: 116
lfsr_file_t after: 136 (+17.2%)
Hm... Maybe ctx measurements should use a lower LFS_NAME_MAX?
Mostly adding convenience functions to deduplicate code:
- Adopted lfsr_bptr_claim
- Renamed lfsr_file_graft -> lfsr_file_graft_
- Adopted lfsr_file_graft
- Didn't bother with lfsr_file_discardleaf
This saves a bit of code, though not that much in the context of the
file->leaf code cost:
code stack ctx
before cleanup: 37228 2328 636
after: 37180 (-0.1%) 2360 (+1.4%) 636 (+0.0%)
code stack ctx
before file->leaf: 36016 2296 636
after: 37180 (+3.2%) 2360 (+2.8%) 636 (+0.0%)
This is just a bit simpler/more flexible of an API. Taking flags
directly has worked well for similar functions.
This also drops lfsr_*_mkdirty. I think we should keep the mk* names
reserved for heavy-weight filesystem operations.
That being said, this does add a surprising bit of code. I assume it's because the
flags end up in literal pools? Doesn't thumb have a bunch of fancy
single-bit immediate encodings?
code stack ctx
before: 37180 2360 636
after: 37192 (+0.0%) 2360 (+0.0%) 636 (+0.0%)
These mostly just help with the mess that is:
    file->leaf.bptr.data.u.disk.block
No code changes.
Except for the unknown flag checks. I don't know why but they really
mess with readability there for me. Maybe because the logic matches
English grammar ("is not any of these" vs "is any not of these")?
No code changes.
This sort of abuses the bptr/data type overlap again, taking an explicit
delta along with a list of datas where:
- data_count=-1 => single bptr
- data_count>=0 => list of concatenated fragments
It's a bit of a hack, but the previous rattr argument it replaces was
an arguably worse hack. I figured if we're going to interrogate the
rattr to figure out what type it is, we might as well just make the type
explicit.
Saved a surprising amount of stack! So that's nice:
code stack ctx
before: 37192 2360 636
after: 37080 (-0.3%) 2304 (-2.4%) 636 (+0.0%)
With the new crystallization logic, we have two routes for resuming
crystallization:
1. before finding our crystal heuristic, if buffer is in-block and
enough for prog alignment
2. after finding our crystal heuristic, if crystal heuristic is in-block
and enough for prog alignment
But thinking about the second case, when would this happen that isn't
caught by the first case? When there are fragments trailing our buffer?
Are you writing to the file backwards?
This corner case doesn't seem worth the extra logic.
Benchmarking didn't find a noticeable difference in performance, so
removing.
Saves a bit of code:
code stack ctx
before: 37080 2304 636
after: 37056 (-0.1%) 2304 (+0.0%) 636 (+0.0%)
In lfsr_mdir_compact__, we rely on shrub_.block != mdir.block to avoid
compacting shrubs multiple times. This works for the most part because
we set shrub_.block = shrub.block (the old mdir block) at the beginning
of lfsr_mdir_commit. We don't actually reset shrub_.block on a bad prog,
but in theory that was ok because we never try to compact into the same
block twice.
But this falls apart if we overrecycle the mdir!
With overrecycling, if we encounter a bad prog during a compaction and
there are no more blocks to relocate to, we try one last time to compact
into the same block (this logic is mainly for recycle overflows, where
it makes a bit more sense).
Of course, compacting into the same block breaks the above shrub_.block
!= mdir.block invariant, which causes the shrub compaction to be
skipped, uses the old shrub_.trunk (which now points to garbage), and
breaks everything.
Fortunately the solution is relatively simple: Just discard any staged
shrubs that have been committed when we relocate/overrecycle.
---
While fixing this I went ahead and renamed overcompaction ->
overrecycling. To me, overcompaction implies something _very_ different,
and I think this better describes the relationship between overrecycling
and block_recycles.
Also added test_ck_ckprogs_overrecycling to nail this down and prevent a
regression in the future. This bug _was_ caught by
test_ck_spam_fwrite_fuzz, but only after unrelated fs changes.
Adds a bit of code, but a smaller + dysfunctional filesystem is not very
useful:
code stack ctx
before: 37056 2304 636
after: 37088 (+0.1%) 2304 (+0.0%) 636 (+0.0%)
This tweaks lfsr_mdir_commit_ to avoid overrecycling if we encounter a
bad prog (LFS_ERR_CORRUPT). This avoids compacting to the same block
twice, which risks an undetected prog error and breaks internal
invariants.
Note we still overrecycle if the relocation reason is a recycle
overflow.
---
This is an alternative solution to the previous overrecycling + shrub +
ckprog bug: Just make sure we don't compact to the same block twice!
After all, if we just got a bad prog, why are we trying to prog again?
(There are actually some arguments for multiple prog attempts, bus
errors for example, but I don't think that's a great excuse for littlefs
attempting multiple progs without user input.)
Even though this adds logic to lfsr_mdir_commit_, it ends up saving
code since we can drop the shrub discard pass:
code stack ctx
before: 37088 2304 636
after: 37056 (-0.1%) 2304 (+0.0%) 636 (+0.0%)
Not that we _really_ care about this quantity of code. The real
motivation is 1. lowering the risk of a missed prog error, and
2. maintaining the never-compact-same-block invariant in case there
are other invariant-dependent bugs lurking around.
This should better match other relocation loops in the codebase, and is
hopefully a bit more readable.
---
Note we generally have two patterns for relocation loops:
Loops where we unconditionally allocate/relocate:
    relocate:;
        alloc();
        compact();
        if (err) goto relocate;
        commit();
        if (err) goto relocate;
        return;
And loops where we fall back to allocation/relocation:
    while (true) {
        commit();
        if (err) goto relocate;
        return;

    relocate:;
        alloc();
        compact();
        if (err) goto relocate;
    }
lfsr_mdir_commit_ falls into the latter.
No code changes.
- lfsr_file_discardcache
- lfsr_file_discardleaf
- lfsr_file_discardbshrub
The code deduplication saves a bit of code:
code stack ctx
before: 37056 2304 636
after: 37012 (-0.1%) 2304 (+0.0%) 636 (+0.0%)
This adopts lazy crystallization in _addition_ to lazy grafting, managed
by separate LFS_o_UNCRYST and LFS_o_UNGRAFT flags:
LFS_o_UNCRYST 0x00400000 File's leaf not fully crystallized
LFS_o_UNGRAFT 0x00800000 File's leaf does not match bshrub/btree
This lets us graft not-fully-crystallized blocks into the tree without
needing to fully crystallize, avoiding repeated recrystallizations when
linearly rewriting a file.
Long story short, this gives file rewrites roughly the same performance
as linear file writes.
---
In theory you could also have fully crystallized but ungrafted blocks
(UNGRAFT + ~UNCRYST), but this doesn't happen with the current logic.
lfsr_file_crystallize eagerly grafts blocks once they're crystallized.
Internally, lfsr_file_crystallize replaces lfsr_file_graft for the
"don't care, gimme file->leaf" operation. This is analogous to
lfsr_file_flush for file->cache.
Note we do _not_ use LFS_o_UNCRYST to track erased-state! If we did,
erased-state wouldn't survive lfsr_file_flush!
---
Of course, this adds even more code. Fortunately not _that_ much
considering how many lines of code changed:
code stack ctx
before: 37012 2304 636
after:  37084 (+0.2%) 2304 (+0.0%) 636 (+0.0%)
There is another downside however, and that's that our benchmarked disk
usage is slightly worse during random writes.
I haven't fully investigated this, but I think it's due to more
temporary fragments/blocks in the B-tree before flushing. This can cause
B-tree inner nodes to split earlier than when eagerly recrystallizing.
This also leads to higher disk usage pre-flush since we keep both the
old and new blocks around while uncrystallized, but since most rewrites
are probably going to be CoW on top of committed files, I don't think
this will be a big deal.
Note the disk usage ends up the same after lfsr_file_flush.
This reverts most of the lazy-grafting/crystallization logic, but keeps
the general crystallization algorithm rewrite and file->leaf for caching
read operations and erased-state.
Unfortunately lazy-grafting/crystallization is both a code and stack
heavy feature for a relatively specific write pattern. It doesn't even
help if we're forced to write fragments due to prog alignment.
Dropping lazy-grafting/crystallization trades off linear write/rewrite
performance for code and stack savings:
code stack ctx
before: 37084 2304 636
after: 36428 (-1.8%) 2248 (-2.4%) 636 (+0.0%)
But with file->leaf we still keep the improvements to linear read
performance!
Compared to pre-file->leaf:
code stack ctx
before file->leaf: 36016 2296 636
after lazy file->leaf: 37084 (+3.0%) 2304 (+0.3%) 636 (+0.0%)
after eager file->leaf: 36428 (+1.1%) 2248 (-2.1%) 636 (+0.0%)
I'm still on the fence about this, but lazy-grafting/crystallization is
just a lot of code... And the first 6 letters of littlefs don't spell
"speedy" last time I checked...
At the very least we can always add lazy-grafting/crystallization as an
opt-in write strategy later.
And:
- Tweaked the behavior of gbmap.window/known to _not_ match disk.
gbmap.known matching disk is what required a separate
lookahead.bmapped in the first place, but we never use both fields.
- _Don't_ revert gbmap on failed mdir commits!
This was broken! If we reverted we risked inheriting outdated
in-flight block information.
This could be fixed by also zeroing lookahead.bmapped, but would force
a gbmap rebuild. And why? The only interaction between mdir commit and
the gbmap is block allocation, which is intentionally allowed to go
out-of-sync to relax issues like this.
Note we still revert in lfs3_fs_grow, as the new gbmap we create there is
incompatible with the previous disk size.
As a part of these changes, gbmap.window now behaves roughly the same as
gbmap.known and updates eagerly on block allocation.
This makes lookahead.window and gbmap.window somewhat redundant, but
simplifies the relevant logic (especially due to how lookahead.window
lags behind lookahead.off).
---
A bunch of bugs fell out of this, the interactions with lfs3_fs_mkgbmap
and lfs3_fs_grow being especially tricky, but fortunately our testing is
doing a good job.
At least the code changes were minimal, saves a bit of RAM:
code stack ctx
no-gbmap before: 37168 2352 684
no-gbmap after: 37168 (+0.0%) 2352 (+0.0%) 684 (+0.0%)
code stack ctx
maybe-gbmap before: 39688 2392 852
maybe-gbmap after: 39720 (+0.1%) 2376 (-0.7%) 848 (-0.5%)
code stack ctx
yes-gbmap before: 39156 2392 852
yes-gbmap after: 39208 (+0.1%) 2376 (-0.7%) 848 (-0.5%)
lfs3_fs_mkconsistent is already limited to call sites where
lfs3_alloc_ckpoint is valid (lfs3_fs_mkconsistent internally relies on
lfs3_mdir_commit), so might as well include an unconditional
lfs3_alloc_ckpoint to populate allocators and save some code:
code stack ctx
no-gbmap before: 37168 2352 684
no-gbmap after: 37164 (-0.0%) 2352 (+0.0%) 684 (+0.0%)
code stack ctx
maybe-gbmap before: 39720 2376 848
maybe-gbmap after: 39708 (-0.0%) 2376 (+0.0%) 848 (+0.0%)
code stack ctx
yes-gbmap before: 39208 2376 848
yes-gbmap after: 39204 (-0.0%) 2376 (+0.0%) 848 (+0.0%)
This adds LFS3_T_REBUILDGBMAP and friends, and enables incremental gbmap
rebuilds as a part of gc/traversal work:
LFS3_M_REBUILDGBMAP 0x00000400 Rebuild the gbmap
LFS3_GC_REBUILDGBMAP 0x00000400 Rebuild the gbmap
LFS3_I_REBUILDGBMAP 0x00000400 The gbmap is not full
LFS3_T_REBUILDGBMAP 0x00000400 Rebuild the gbmap
On paper, this is more or less identical to repopulating the lookahead
buffer -- traverse the filesystem, mark blocks as in-use, adopt the new
gbmap/lookahead buffer on success -- but a couple nuances make
rebuilding the gbmap a bit trickier:
- Unlike the lookahead buffer, which eagerly zeros in allocation, we
need an explicit zeroing pass before we start marking blocks as
in-use. This means multiple traversals can potentially conflict with
each other, risking the adoption of a clobbered gbmap.
- The gbmap, which stores information on disk, relies on block
allocation and the temporary "in-flight window" defined by allocator
ckpoints to avoid circular block states during gbmap rebuilds. This
makes gbmap rebuilds sensitive to allocator ckpoints, which we
consider more-or-less a noop in other parts of the system.
Though now that I'm writing this, it might have been possible to
instead include gbmap rebuild snapshots in fs traversals... but that
would probably have been much more complicated.
- Rebuilding the gbmap requires writing to disk and is generally much
more expensive/destructive. We want to avoid trying to rebuild the
gbmap when it's not possible to actually make progress.
On top of this, the current trv-clobber system is a delicate,
error-prone mess.
---
To simplify everything related to gbmap rebuilds, I added a new
internal traversal flag: LFS3_t_CKPOINTED:
LFS3_t_CKPOINTED 0x04000000 Filesystem ckpointed during traversal
LFS3_t_CKPOINTED is set, unconditionally, on all open traversals in
lfs3_alloc_ckpoint, and provides a simple, robust mechanism for checking
if _any_ allocator checkpoints have occurred since a traversal was
started. Since lfs3_alloc_ckpoint is required before any block
allocation, this provides a strong guarantee that nothing funny happened
to any allocator state during a traversal.
This makes lfs3_alloc_ckpoint a bit less cheap, but the strong
guarantees that allocator state is unmodified during traversal are well
worth it.
This makes both lookahead and gbmap passes simpler, safer, and easier to
reason about.
I'd like to adopt something similar+stronger for LFS3_t_MUTATED, and
reduce this back to two flags, but that can be a future commit.
---
Unfortunately due to the potential for recursion, this ended up reusing
less logic between lfs3_alloc_rebuildgbmap and lfs3_mtree_gc than I had
hoped, but at least the main chunks (lfs3_alloc_remap,
lfs3_gbmap_setbptr, lfs3_alloc_adoptgbmap) could be split out into
common functions.
The result is a decent chunk of code and stack, but the value is high as
incremental gbmap rebuilds are the only option to reduce the latency
spikes introduced by the gbmap allocator (it's not significantly worse
than the lookahead buffer, but both do require traversing the entire
filesystem):
code stack ctx
before: 37164 2352 684
after: 37208 (+0.1%) 2360 (+0.3%) 684 (+0.0%)
code stack ctx
gbmap before: 39708 2376 848
gbmap after: 40100 (+1.0%) 2432 (+2.4%) 848 (+0.0%)
Note the gbmap build is now measured with LFS3_GBMAP=1, instead of
LFS3_YES_GBMAP=1 (maybe-gbmap) as before. This includes the cost of
mkgbmap, lfs3_f_isgbmap, etc.
- lfs3_gbmap_set* -> lfs3_gbmap_mark*
- lfs3_alloc_markfree -> lfs3_alloc_adopt
- lfs3_alloc_mark* -> lfs3_alloc_markinuse*
Mainly for consistency, since the gbmap and lookahead buffer are more or less the same algorithm, ignoring nuances (lookahead only ors inuse bits, gbmap rebuilding can result in multiple snapshots, etc).
The rename lfs3_gbmap_set* -> lfs3_gbmap_mark* also makes space for lfs3_gbmap_set* to be used for range assignments with a payload, which may be useful for erased ranges (gbmap tracked ecksums?)
A bit less simplified than I hoped: we don't _strictly_ need both
LFS3_t_DIRTY + LFS3_t_MUTATED if we're ok with either (1) making
multiple passes to confirm fixorphans succeeded or (2) clearing the COMPACT
flag after one pass (which may introduce new uncompacted metadata). But
both of these have downsides, and we're not _that_ stressed for flag
space yet...
So keeping all three of:
LFS3_t_DIRTY 0x04000000 Filesystem modified outside traversal
LFS3_t_MUTATED 0x02000000 Filesystem modified during traversal
LFS3_t_CKPOINTED 0x01000000 Filesystem ckpointed during traversal
But I did manage to get rid of the bit swapping by tweaking LFS3_t_DIRTY
to imply LFS3_t_MUTATED instead of being exclusive. This removes the
"failed" gotos in lfs3_mtree_gc and makes things a bit more readable.
---
I also split lfs3_fs/handle_clobber into separate lfs3_fs/handle_clobber
and lfs3_fs/handle_mutate functions. This added a bit of code, but I
think it's worth it for a simpler internal API. A confusing internal API
is no good.
In total these simplifications saved a bit of code:
code stack ctx
before: 37208 2360 684
after: 37176 (-0.1%) 2360 (+0.0%) 684 (+0.0%)
code stack ctx
gbmap before: 40100 2432 848
gbmap after: 40060 (-0.1%) 2432 (+0.0%) 848 (+0.0%)
A big downside of LFS3_T_REBUILDGBMAP is the addition of an lfs3_btree_t
struct to _every_ traversal object.
Unfortunately, I don't see a way around this. We need to track the new
gbmap snapshot _somewhere_, and other options (such as a global gbmap.b_
snapshot) just move the RAM around without actually saving anything.
To at least mitigate this internally, this splits lfs3_trv_t into
distinct lfs3_trv_t, lfs3_mgc_t, and lfs3_mtrv_t structs that capture
only the relevant state for internal traversal layers:
- lfs3_mtree_traverse <- lfs3_mtrv_t
- lfs3_mtree_gc <- lfs3_mgc_t (contains lfs3_mtrv_t)
- lfs3_trv_read <- lfs3_trv_t (contains lfs3_mgc_t)
This minimizes the impact of the gbmap rebuild snapshots, and saves a
big chunk of RAM. As a plus it also saves RAM in the default build by
limiting the 2-block block queue to the high-level lfs3_trv_read API:
code stack ctx
before: 37176 2360 684
after: 37176 (+0.0%) 2352 (-0.3%) 684 (+0.0%)
code stack ctx
gbmap before: 40060 2432 848
gbmap after: 40024 (-0.1%) 2368 (-2.6%) 848 (+0.0%)
The main downside? Our field names are continuing in their
ridiculousness:
lfs3.gc.gc.t.b.h.flags // where else would the global gc flags be?
And tweaked a few related comments.
I'm still on the fence with this name, I don't think it's great, but it at least better describes the "repopulation" operation than "rebuilding". The important distinction is that we don't throw away information. Bad/erased block info (future) is still carried over into the new gbmap snapshot, and persists unless you explicitly call rmgbmap + mkgbmap.
So, adopting gbmap_repop_thresh for now to see if it's just a habit thing, but may adopt a different name in the future. As a plus, gbmap_repop_thresh is two characters shorter.
This really didn't match the use of "flush" elsewhere in the system.
There's a strong argument for naming this inline_size as that's more likely what users expect, but shrub_size is just the more correct name and avoids confusion around having multiple names for the same thing. It also highlights that shrubs in littlefs3 are a bit different than inline files in littlefs2, and that this config also affects large files with a shrubbed root. May rerevert this in the future, but probably only if there is significant user confusion.
And friends:
LFS3_M_REPOPLOOKAHEAD  0x00000200  Repopulate lookahead buffer
LFS3_GC_REPOPLOOKAHEAD 0x00000200  Repopulate lookahead buffer
LFS3_I_REPOPLOOKAHEAD  0x00000200  Lookahead buffer is not full
LFS3_T_REPOPLOOKAHEAD  0x00000200  Repopulate lookahead buffer
To match LFS3_T_REPOPGBMAP, which is more-or-less the same operation. Though this does turn into quite the mouthful...
- LFS3_T_COMPACT -> LFS3_T_COMPACTMETA
- gc_compact_thresh -> gc_compactmeta_thresh
And friends:
LFS3_M_COMPACTMETA  0x00000800  Compact metadata logs
LFS3_GC_COMPACTMETA 0x00000800  Compact metadata logs
LFS3_I_COMPACTMETA  0x00000800  Filesystem may have uncompacted metadata
LFS3_T_COMPACTMETA  0x00000800  Compact metadata logs
---
This does two things:
1. Highlights that LFS3_T_COMPACTMETA only interacts with metadata logs, and has no effect on data blocks.
2. Better matches the verb+noun names used for other gc/traversal flags (REPOPGBMAP, CKMETA, etc).
It is a bit more of a mouthful, but I'm not sure that's entirely a bad thing. These are pretty low-level flags.
This is an alias for all possible gc work, which is a bit more complicated than you might think due to compile-time features (example: LFS3_GC_REPOPGBMAP). The intention is to make loops like the following easy to write:

    struct lfs3_fsinfo fsinfo;
    lfs3_fs_stat(&lfs3, &fsinfo) => 0;

    lfs3_trv_t trv;
    lfs3_trv_open(&lfs3, &trv, fsinfo.flags & LFS3_GC_ALL) => 0;
    ...

It's possible to do this by explicitly setting all gc flags, but that requires quite a bit of knowledge from the user. Another option is allowing -1 for gc/traversal flags, but that loses assert protection against unknown/misplaced flags.
---
This raises more questions about the prefix naming: it feels a bit weird to take LFS3_I_* flags, mask with LFS3_GC_* flags, and pass them as LFS3_T_* flags, but it gets the job done.
Limiting LFS3_GC_ALL to the LFS3_GC_* namespace avoids issues with opt-out/mode flags such as LFS3_T_RDONLY, LFS3_T_MTREEONLY, etc. For this reason it probably doesn't make sense to add something similar to the other namespaces.
To allow relaxing when LFS3_I_REPOPLOOKAHEAD and LFS3_I_REPOPGBMAP will
be set, potentially reducing gc workload after allocating only a couple
blocks.
The relevant cfg comments have quite a bit more info.
Note -1 (not the default, 0, maybe we should explicitly flip this?)
restores the previous functionality of setting these flags on the first
block allocation.
---
Also tweaked gbmap repops during gc/traversals to _not_ try to repop
unless LFS3_I_REPOPGBMAP is set. We probably should have done this from
the beginning since repopulating the gbmap writes to disk and is
potentially destructive.
Adds code, though hopefully we can claw this back with future config
rework:
code stack ctx
before: 37176 2352 684
after: 37208 (+0.1%) 2352 (+0.0%) 688 (+0.6%)
code stack ctx
gbmap before: 40024 2368 848
gbmap after: 40120 (+0.2%) 2368 (+0.0%) 856 (+0.9%)
Unfortunately this doesn't work and will need to be ripped-out/reverted.
---
The goal was to limit in-use -> free zeroing to the unknown window, which would allow the gbmap to be updated in-place, saving the extra RAM we need to maintain the extra gbmap snapshot during traversals and lfs3_alloc_zerogbmap.
Unfortunately this doesn't seem to work. If we limit zeroing to the unknown window, blocks can get stuck in the in-use state as long as they stay in the known window. Since the gbmap's known window encompasses most of the disk, this can cause the allocators to lock up and be unable to make progress.
So will revert, but committing the current implementation in case we revisit the idea. As a plus, reverting avoids needing to maintain this unknown window logic, which is tricky and error-prone.
See previous commit for motivation
These are more-or-less equivalent, but:
- Making lfs3_alloc_zerogbmap a non-gbmap function avoids awkward
conversations about why it's not atomic.
- Making lfs3_alloc_zerogbmap alloc-specific makes room for pererased-
specific zeroing operations that we might need when adopting bmerased
ranges (future).
No code changes, which means const-propagation works as expected:
code stack ctx
before: 37208 2352 688
after: 37208 (+0.0%) 2352 (+0.0%) 688 (+0.0%)
code stack ctx
gbmap before: 40120 2368 856
gbmap after: 40120 (+0.0%) 2368 (+0.0%) 856 (+0.0%)
This relaxes errors encountered during lfs3_mtree_gc to _not_ propagate,
but instead just log a warning and prevent the relevant work from being
checked off during EOT.
The idea is this allows other work to make progress in low-space
conditions.
I originally meant to limit this to gbmap repopulations, to match the
behavior of lfs3_alloc_repopgbmap, but I think extending the idea to all
filesystem mutating operations makes sense (LFS3_T_MKCONSISTENT +
LFS3_T_REPOPGBMAP + LFS3_T_COMPACTMETA).
---
To avoid incorrectly marking traversal work as completed, we need to
track if we hit any ENOSPC errors, thus the new LFS3_t_NOSPC flag:
LFS3_t_NOSPC 0x00800000 Optional gc work ran out of space
Not the happiest just throwing flags at problems, but I can't think of a
better solution at the moment.
This doesn't differentiate between ENOSPC errors during the different
types of work, but in theory if we're hitting ENOSPC errors whatever
work returns the error is a toss-up anyways.
---
Adds a bit of code:
code stack ctx
before: 37208 2352 688
after: 37248 (+0.1%) 2352 (+0.0%) 688 (+0.0%)
code stack ctx
gbmap before: 40120 2368 856
gbmap after: 40204 (+0.2%) 2368 (+0.0%) 856 (+0.0%)
This adds test_gc_nospc with more aggressive testing of gc/traversal
operations in low-space conditions. The original intention was to test
the new soft-ENOSPC traversal behavior, but instead it found a couple
unrelated bugs.
In my defense these involve some rather subtle filesystem interactions
and went unnoticed because we don't usually check data checksums:
1. lfs3_bd_flush had a rare chance where it could corrupt our
prog-aligned pcksum when (1) we bypass the pcache, allowing any
previous contents to stay there until flush/pcksum, and (2) some
other failed prog, in this case failing repopgbmaps due to the
low-space condition, leaves garbage in the pcache. When we flush
we corrupt the pcksum even though the old data belongs to an
unrelated block.
This resulted in CKDATA failing, though the failed check is a false
positive.
As a workaround, lfs3_bd_prog and lfs3_bd_prognext now discard _any_
unrelated pcache, even if bypassing the pcache. This should ensure
consistent behavior in all cases. Note we do something similar
with the file cache in lfs3_file_write.
This means progs may not complete unless lfs3_bd_flush is called, but
I think we need to call lfs3_bd_flush in all cases anyways to ensure
power-loss safe behavior.
The end result should be a more reliable internal bd prog API.
2. On a successful traversal with LFS3_T_REPOPLOOKAHEAD and
LFS3_T_REPOPGBMAP we adopt both the new gbmap and lookahead buffer.
This is wrong! The lookahead buffer is not aware of the gbmap during
the traversal, and _can't_ be aware as the gbmap changes during
repopulation work. This is the whole reason we have the alloc
ckpoints and the in-flight window.
To fix, adopting the lookahead buffer is now conditional on _not_
adopting a new gbmap.
It makes the code a bit more messy, but this is the correct behavior.
Populating both the gbmap and lookahead buffer requires at least two
passes.
Code changes minimal:
code stack ctx
before: 37248 2352 688
after: 37260 (+0.0%) 2352 (+0.0%) 688 (+0.0%)
code stack ctx
gbmap before: 40204 2368 856
gbmap after: 40220 (+0.0%) 2368 (+0.0%) 856 (+0.0%)
Note: This affects the blocking lfs3_alloc_repopgbmap as well as
incremental gc/traversal repopulations. Now all repop attempts return
LFS3_ERR_NOSPC when we don't have space for the gbmap, motivation below.
This reverts the previous LFS3_t_NOSPC soft error, in which traversals
were allowed to continue some gc/traversal work when encountering
LFS3_ERR_NOSPC. This results in a simpler implementation and fewer error
cases to worry about.
Observation/motivation:
- The main motivation is noticing that when we're in low-space
conditions, we just start spamming gbmap repops even if they all fail.
That's really not great! We might as well just mark the flash as dead
if we're going to start spamming erases!
At least with an error the user can call rmgbmap to try to make
progress.
- If we're in a low-space condition, something else will probably return
LFS3_ERR_NOSPC anyways. Might as well report this early and simplify
our system.
- It's a simpler model, and littlefs3 is already much more complicated
than littlefs2. Maybe we should lean more towards a simpler system
at the cost of some niche optimizations.
---
This had the side-effect of causing more lfs3_alloc_ckpoints to return
errors during testing, which revealed a bug in our uz/uzd_fuzz tests:
- We weren't flushing after writes to the opened RDWR files, which could
cause delayed errors to occur during the later read checks in the
test.
Fortunately LFS3_O_FLUSH provides a quick and easy fix!
Note we _don't_ adopt this in all uz/uzd_fuzz tests, only those that
error. It's good to test both with and without LFS3_O_FLUSH to test
that read-flushing also works under stress.
Saves a bit of code:
code stack ctx
before: 37260 2352 688
after: 37220 (-0.1%) 2352 (+0.0%) 688 (+0.0%)
code stack ctx
gbmap before: 40220 2368 856
gbmap after: 40184 (-0.1%) 2368 (+0.0%) 856 (+0.0%)
This drops LFS3_t_MUTATED in favor of just using LFS3_t_CKPOINTED
everywhere:
1. These meant roughly the same thing, with LFS3_t_MUTATED being a bit
tighter at the cost of needing to be explicitly set.
2. The implicit setting of LFS3_t_CKPOINTED by lfs3_alloc_ckpoint -- a
function that already needs to be called before mutation -- means we
have one less thing to worry about.
Implicit properties like LFS3_t_CKPOINTED are great for building a
reliable system. Manual flags like LFS3_t_MUTATED, not so much.
3. Why use two flags when we can get away with one?
The only downside is we may unnecessarily clobber gc/traversal work when
we don't actually mutate the filesystem. Failed file open calls are a
good example.
However this tradeoff seems well worth it for an overall simpler +
more reliable system.
---
Saves a bit of code:
code stack ctx
before: 37220 2352 688
after: 37160 (-0.2%) 2352 (+0.0%) 688 (+0.0%)
code stack ctx
gbmap before: 40184 2368 856
gbmap after: 40132 (-0.1%) 2368 (+0.0%) 856 (+0.0%)
This has just proven much easier to tweak in dbgtag.py, so adopting the same self-parsing pattern in dbgflags.py/dbgerr.py. This makes editing easier by (1) not needing to worry about parens/quotes/commas, and (2) allowing for non-python expressions, such as the mode flags in dbgflags.py. The only concern is script startup may be slightly slower, but we really don't care.
This required a bit of a hack: LFS3_seek_MODE, which is marked internal to try to minimize confusion, but really doesn't exist in the code at all. But a hack is probably good enough for now.
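The pattern itself is roughly this (a hedged sketch, not the actual
scripts; the real flag tables, names, and formatting differ): keep the
definitions as an easy-to-edit text blob and parse it at script startup:

    import re

    # flags live in a plain text blob, easy to edit, parsed at startup
    # (illustrative subset only)
    FLAGS = '''
    LFS3_M_REBUILDGBMAP  0x00000400  Rebuild the gbmap
    LFS3_M_COMPACTMETA   0x00000800  Compact metadata logs
    '''

    def parse_flags(text):
        for line in text.splitlines():
            m = re.match(r'\s*(\w+)\s+(0x[0-9a-fA-F]+)\s+(.*)', line)
            if m:
                yield m.group(1), int(m.group(2), 16), m.group(3).strip()

    for name, value, help in parse_flags(FLAGS):
        print('%-20s 0x%08x  %s' % (name, value, help))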
This should make tag editing less tedious/error-prone. We already used
self-parsing to generate -l/--list in dbgtag.py, but this extends the
idea to tagrepr (now Tag.repr), which is used in quite a few more
scripts.
To make this work the little tag encoding spec had to become a bit more
rigorous, fortunately the only real change was the addition of '+'
characters to mark reserved-but-expected-zero bits.
Example:
TAG_CKSUM = 0x3000 ## v-11 ---- ++++ +pqq
^--^----^----^--^-^-- valid bit, unmatched
'----|----|--|-|-- matches 1
'----|--|-|-- matches 0
'--|-|-- reserved 0, unmatched
'-|-- perturb bit, unmatched
'-- phase bits, unmatched
dbgtag.py 0x3000 => cksumq0
dbgtag.py 0x3007 => cksumq3p
dbgtag.py 0x3017 => cksumq3p 0x10
dbgtag.py 0x3417 => 0x3417
Though Tag.repr still does a bit of manual formatting for the
differences between shrub/normal/null/alt tags.
Still, this should reduce the number of things that need to be changed
from 2 -> 1 when adding/editing most new tags.
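For a concrete picture, here's a hedged sketch of how such a spec could be
compiled into match/mask bits (dbgtag.py's actual parser differs, and also
reports leftover reserved/unmatched bits separately):

    def spec_to_mask(spec):
        mask, bits = 0, 0
        for c in spec.replace(' ', ''):
            mask <<= 1
            bits <<= 1
            if c == '1':    # must match 1
                mask |= 1
                bits |= 1
            elif c == '-':  # must match 0
                mask |= 1
            # 'v', '+', 'p', 'q', etc are unmatched
        return mask, bits

    mask, bits = spec_to_mask('v-11 ---- ++++ +pqq')
    assert (0x3000 & mask) == bits  # cksumq0
    assert (0x3017 & mask) == bits  # cksumq3p, with leftover 0x10
    assert (0x3417 & mask) != bits  # unknown tag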
So:
- cfg.gc_repoplookahead_thresh -> cfg.gc_relookahead_thresh
- cfg.gc_repopgbmap_thresh -> cfg.gc_regbmap_thresh
- cfg.gbmap_repop_thresh -> cfg.gbmap_re_thresh
- LFS3_*_REPOPLOOKAHEAD -> LFS3_*_RELOOKAHEAD
- LFS3_*_REPOPGBMAP -> LFS3_*_REGBMAP
Mainly trying to reduce the mouthful that is REPOPLOOKAHEAD and REPOPGBMAP. As a plus this also avoids potential confusion of "repop" as a push/pop related operation.
Just to avoid the awkward escaped newlines when possible. Note this has no effect on the output of dbgflags.py.
This walks back some of the attempt at strict object namespacing in struct lfs3_cfg:
- cfg.file_cache_size -> cfg.fcache_size
- filecfg.cache_size -> filecfg.fcache_size
- filecfg.cache_buffer -> filecfg.fcache_buffer
- cfg.gbmap_re_thresh -> cfg.regbmap_thresh
Motivation:
- cfg.regbmap_thresh now matches cfg.gc_regbmap_thresh, instead of using awkwardly different namespacing patterns.
- Giving fcache a more unique name is useful for discussion. Having pcache, rcache, and then file_cache was a bit awkward. Hopefully it's also more clear that cfg.fcache_size and filecfg.fcache_size are related.
- Config in struct lfs3_cfg is named a bit more consistently, well, if you ignore gc_*_* options.
- Less typing.
Though this gets into pretty subjective naming territory. May revert this if the new terms are uncomfortable after use.
This just fell out-of-sync a bit during the gbmap work. Note we _do_
support LFS3_RDONLY + LFS3_GBMAP, as fetching the gbmap is necessary for
CKMETA to check all metadata. Fortunately this is relatively cheap:
code stack ctx
rdonly: 10716 896 532
rdonly+gbmap: 10988 (+2.5%) 896 (+0.0%) 680 (+27.8%)
Though this does highlight that a sort of LFS3_NO_TRV mode could remove
quite a bit of code.
I think these are good ideas to bring back when littlefs3 is more mature, but at the moment the number of different builds is creating too much friction. LFS3_KVONLY and LFS3_2BONLY in particular _add_ significant chunks of code (lfs3_file_readget_, lfs3_file_flushset_, and various extra logic sprinkled throughout the codebase), and the current state of testing means I have no idea if any of it still works. These are also low-risk for introducing any disk related changes. So, ripping out for now to keep the current experimental development tractable. May reintroduce in the future (probably after littlefs3 is stabilized) if there is sufficient user interest. But doing so will probably also need to come with actual testing in CI.
Labels: needs major version (breaking functionality only allowed in major versions), next major (on-disk major WEEWOOWEEWOO), v3
Note: v3-alpha discussion (#1114)
Unfortunately GitHub made a complete mess of the PR discussion. To try to salvage things, please use #1114 for new comments. Feedback/criticism are welcome and immensely important at this stage.
Table of contents ^
Hello! ^
Hello everyone! As some of you may have already picked up on, there's been a large body of work fermenting in the background for the past couple of years. Originally started as an experiment to try to solve littlefs's $O(n^2)$ metadata compaction, this branch eventually snowballed into more-or-less a full rewrite of the filesystem from the ground up.
There's still several chunks of planned work left, but now that this branch has reached on-disk feature parity with v2, there's nothing really stopping it from being merged eventually.
So I figured it's a good time to start calling this v3, and put together a public roadmap.
NOTE: THIS WORK IS INCOMPLETE AND UNSTABLE
Here's a quick TODO list of planned work before stabilization. More details below:
This work may continue to break the on-disk format.
That being said, I highly encourage others to experiment with v3 where possible. Feedback is welcome, and immensely important at this stage. Once it's stabilized, it's stabilized.
To help with this, the current branch uses v0.0 as its on-disk version to indicate that it is experimental. When it is eventually released, v3 will reject this version and fail to mount.
Unfortunately, the API will be under heavy flux during this period.
A note on benchmarking: The on-disk block-map is key for scalable allocator performance, so benchmarks at this stage need to be taken with a grain of salt when many blocks are involved. Please refer to this version as "v3 (no bmap)" or something similar in any published benchmarks until this work is completed.
Wait, a disk breaking change? ^
Yes. v3 breaks disk compatibility from v2.
I think this is a necessary evil. Attempting to maintain backwards compatibility has a heavy cost:
Development time - The littlefs team is ~1 guy, and v3 has already taken ~2.5 years. The extra work to make everything compatible would stretch this out much longer and likely be unsustainable.
Code cost - The goal of littlefs is to be, well, little. This is unfortunately in conflict with backwards compatibility.
Take the new B-tree data-structure, for example. It would be easy to support both B-tree and CTZ skip-list files, but now you need ~2x the code. This cost gets worse for the more enmeshed features, and potentially exceeds the cost of just including both v3 and v2 in the codebase.
So I think it's best for both littlefs as a project and long-term users to break things here.
Note v2 isn't going anywhere! I'm happy to continue maintaining the v2 branch, merge bug fixes when necessary, etc. But the economic reality is my focus will be shifting to v3.
What's new ^
Ok, with that out of the way, what does breaking everything actually get us?
Implemented: ^
Efficient metadata compaction: $O(n^2) \rightarrow O(n \log n)$ ^
v3 adopts a new metadata data-structure: Red-black-yellow Dhara trees (rbyds). Based on the data-structure invented by Daniel Beer for the Dhara FTL, rbyds extend log-encoded Dhara trees with self-balancing and self-counting (also called order-statistic) properties.
This speeds up most metadata operations, including metadata lookup ($O(n) \rightarrow O(\log n)$), and, critically, metadata compaction ($O(n^2) \rightarrow O(n \log n)$).
This improvement may sound minor on paper, but it's a difference measured in seconds, sometimes even minutes, on devices with extremely large blocks.
Efficient random writes: $O(n) \rightarrow O(\log_b^2 n)$ ^
A much requested feature, v3 adopts B-trees, replacing the CTZ skip-list that previously backed files.
This avoids needing to rewrite the entire file on random writes, bringing the performance back down into tractability.
For extra cool points, littlefs's B-trees use rbyds for the inner nodes, which makes CoW updates much cheaper than traditional array-packed B-tree nodes when large blocks are involved ($O(n) \rightarrow O(\log n)$).
Better logging: No more sync-padding issues ^
v3's B-trees support inlining data directly in the B-tree nodes. This gives us a place to store data during sync, without needing to pad things for prog alignment.
In v2 this padding would force the rewriting of blocks after sync, which had a tendency to wreck logging performance.
Efficient inline files, no more RAM constraints: $O(n^2) \rightarrow O(n \log n)$ ^
In v3, B-trees can have their root inlined in the file's mdir, giving us what I've been calling a "B-shrub". This, combined with the above inlined leaves, gives us a much more efficient inlined file representation, with better code reuse to boot.
Oh, and B-shrubs also make small B-trees more efficient by avoiding the extra block needed for the root.
Independent file caches ^
littlefs's pcache, rcache, and file caches can be configured independently now. This should allow for better RAM utilization when tuning the filesystem.

Easier logging APIs: lfs3_file_fruncate ^

Thanks to the new self-counting/order-statistic properties, littlefs can now truncate from both the end and front of files via the new lfs3_file_fruncate API.

Before, the best option for logging was renaming log files when they filled up. Now, maintaining a log/FIFO is as easy as:
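Something like the following sketch, for example. Note the lfs3_* signatures, flags, and the fruncate argument convention here are assumptions loosely mirroring the v2 API, not the final v3 API:

```c
#include "lfs3.h" // assumed v3 header name

// append an entry, then drop old entries from the front if the log has
// grown too large; signatures/flags are assumed to mirror v2's lfs_* API
int log_append(lfs3_t *lfs3, const void *entry, size_t size) {
    lfs3_file_t file;
    int err = lfs3_file_open(lfs3, &file, "log",
            LFS3_O_WRONLY | LFS3_O_CREAT | LFS3_O_APPEND);
    if (err) {
        return err;
    }

    lfs3_ssize_t d = lfs3_file_write(lfs3, &file, entry, size);
    if (d < 0) {
        lfs3_file_close(lfs3, &file);
        return d;
    }

    // keep at most 1 MiB, truncating from the front of the file
    // (assuming fruncate takes the number of bytes to keep)
    if (lfs3_file_size(lfs3, &file) > 1024*1024) {
        err = lfs3_file_fruncate(lfs3, &file, 1024*1024);
        if (err) {
            lfs3_file_close(lfs3, &file);
            return err;
        }
    }

    // closing an in-sync file atomically updates the on-disk state
    return lfs3_file_close(lfs3, &file);
}
```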
Sparse files ^
Another advantage of adopting B-trees, littlefs can now cheaply represent file holes, where contiguous runs of zeros can be implied without actually taking up any disk space.
Currently this is limited to a couple operations:
- lfs3_file_truncate
- lfs3_file_fruncate
- lfs3_file_seek + lfs3_file_write past the end of the file

But more advanced hole operations may be added in the future.
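For example, a hole can be created implicitly by seeking past the end of a file before writing. Again, the lfs3_* signatures and flags here are assumptions based on the v2 API:

```c
#include "lfs3.h" // assumed v3 header name

// create a ~1 MiB hole by seeking past the end of an empty file and
// writing a small amount of data; the hole takes up no disk space
int make_sparse(lfs3_t *lfs3) {
    lfs3_file_t file;
    int err = lfs3_file_open(lfs3, &file, "sparse",
            LFS3_O_WRONLY | LFS3_O_CREAT);
    if (err) {
        return err;
    }

    // seek past the end, the skipped-over range becomes a hole
    lfs3_soff_t off = lfs3_file_seek(lfs3, &file, 1024*1024, LFS3_SEEK_SET);
    if (off < 0) {
        lfs3_file_close(lfs3, &file);
        return off;
    }

    lfs3_ssize_t d = lfs3_file_write(lfs3, &file, "end", 3);
    if (d < 0) {
        lfs3_file_close(lfs3, &file);
        return d;
    }

    return lfs3_file_close(lfs3, &file);
}
```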
Efficient file name lookup: $O(n) \rightarrow O(\log_b n)$ ^
littlefs now uses a B-tree (yay code reuse) to organize files by file name. This allows for much faster file name lookup than the previous linked-list of metadata blocks.
A simpler/more robust metadata tree ^
As a part of adopting B-trees for metadata, the previous threaded file tree has been completely ripped out and replaced with one big metadata tree: the M-tree.
I'm not sure how much users are aware of it, but the previous threaded file tree was a real pain-in-the-ass with the amount of bugs it caused. Turns out having a fully-connected graph in a CoBW filesystem is a really bad idea.
In addition to removing an entire category of possible bugs, adopting the M-tree allows for multiple directories in a single metadata block, removing the 1-dir = 1-block minimum requirement.
A well-defined sync model ^
One interesting thing about littlefs: it doesn't have a strictly POSIX API. This puts us in a relatively unique position, where we can explore tweaks to the POSIX API that may make it easier to write powerloss-safe applications.
To leverage this (and because the previous sync model had some real problems), v3 includes a new, well-defined sync model.
I think this discussion captures most of the idea, but for a high-level overview:
Open file handles are strictly snapshots of the on-disk state. Writes to a file are copy-on-write (CoW), with no immediate effect on the on-disk state or any other file handles.
Syncing or closing an in-sync file atomically updates the on-disk state and any other in-sync file handles.
Files can be desynced, either explicitly via lfs3_file_desync, or because of an error. Desynced files do not receive sync broadcasts, and closing a desynced file has no effect on the on-disk state.

Calling lfs3_file_sync on a desynced file will atomically update the on-disk state and any other in-sync file handles, and mark the file as in-sync again.

Calling lfs3_file_resync on a file will discard its current contents and mark the file as in-sync. This is equivalent to closing and reopening the file.
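A rough sketch of how this plays out in code, with the lfs3_* signatures again assumed to mirror v2's:

```c
#include "lfs3.h" // assumed v3 header name

// two handles to the same file: writes to one are invisible to the
// other until a successful sync
int sync_example(lfs3_t *lfs3) {
    lfs3_file_t a, b;
    int err = lfs3_file_open(lfs3, &a, "data", LFS3_O_RDWR | LFS3_O_CREAT);
    if (err) {
        return err;
    }
    err = lfs3_file_open(lfs3, &b, "data", LFS3_O_RDONLY);
    if (err) {
        lfs3_file_close(lfs3, &a);
        return err;
    }

    // CoW write, no effect on disk or on b yet
    lfs3_file_write(lfs3, &a, "hello", 5);

    // sync atomically updates the on-disk state and in-sync handles (b)
    err = lfs3_file_sync(lfs3, &a);
    if (err) {
        // a is now desynced; resync discards the pending changes and
        // brings the handle back in-sync with whatever is on disk
        lfs3_file_resync(lfs3, &a);
    }

    lfs3_file_close(lfs3, &b);
    return lfs3_file_close(lfs3, &a);
}
```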
Stickynotes, no more 0-sized files ^
As an extension of littlefs's new sync model, v3 introduces a new file type: LFS3_TYPE_STICKYNOTE.

A stickynote represents a file that's in the awkward state of having been created, but not yet synced. If you lose power, stickynotes are hidden from the user and automatically cleaned up on the next mount.
This avoids the 0-sized file issue, while still allowing most of the POSIX interactions users expect.
A new and improved compat flag system ^
v2.1 was a bit of a mess, but it was a learning experience. v3 still includes a global version field, but also includes a set of compat flags that allow non-linear addition/removal of future features.
These are probably familiar to users of Linux filesystems, though I've given them slightly different names:
- rcompat flags - Must understand to read the filesystem (incompat_flags)
- wcompat flags - Must understand to write to the filesystem (ro_compat_flags)
- ocompat flags - No understanding necessary (compat_flags)

This also provides an easy route for marking a filesystem as read-only, non-standard, etc, on-disk.
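To illustrate the intended semantics (this is a made-up sketch, not littlefs's actual mount logic; the flag masks are hypothetical):

```c
#include <stdbool.h>
#include <stdint.h>

// made-up sketch of compat-flag handling at mount time
#define KNOWN_RCOMPAT 0x0000000f // hypothetical supported-feature masks
#define KNOWN_WCOMPAT 0x00000003

int check_compat(uint32_t rcompat, uint32_t wcompat, uint32_t ocompat,
        bool rdonly) {
    // unknown rcompat flag? we can't safely read the filesystem
    if (rcompat & ~KNOWN_RCOMPAT) {
        return -1;
    }

    // unknown wcompat flag? reads are fine, but refuse to write
    if (!rdonly && (wcompat & ~KNOWN_WCOMPAT)) {
        return -1;
    }

    // ocompat flags require no understanding and are simply ignored
    (void)ocompat;
    return 0;
}
```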
Error detection! - Global-checksums ^
v3 now supports filesystem-wide error-detection. This is actually quite tricky in a CoBW filesystem, and required the invention of global-checksums (gcksums) to prevent rollback issues caused by naive checksumming.
With gcksums, and a traditional Merkle-tree-esque B-tree construction, v3 now provides a filesystem-wide self-validating checksum via lfs3_fs_cksum. This checksum can be stored external to the filesystem to provide protection against last-commit rollback issues, metastability, or just for that extra peace of mind.

Funny thing about checksums: it's incredibly cheap to calculate checksums when writing, as we're already processing that data anyways. The hard part is, when do you check the checksums?
This is a problem that mostly ends up on the user, but to help, v3 adds a large number of checksum-checking APIs (probably too many if I'm honest):
- LFS3_M_CKMETA/CKDATA - Check checksums during mount
- LFS3_O_CKMETA/CKDATA - Check checksums during file open
- lfs3_fs_ckmeta/ckdata - Explicitly check all checksums in the filesystem
- lfs3_file_ckmeta/ckdata - Explicitly check a file's checksums
- LFS3_T_CKMETA/CKDATA - Check checksums incrementally during a traversal
- LFS3_GC_CKMETA/CKDATA - Check checksums during GC operations
- LFS3_M_CKPROGS - Closed checking of data during progs
- LFS3_M_CKFETCHES - Optimistic (not closed) checking of data during fetches
- LFS3_M_CKREADS (planned) - Closed checking of data during reads
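For example, a paranoid setup might check metadata at mount and data on demand. How the mount flags are actually passed is an assumption here, as are the exact signatures:

```c
#include "lfs3.h" // assumed v3 header name

// check metadata checksums while mounting, then explicitly verify all
// data checksums; flag plumbing and signatures are assumptions
int mount_paranoid(lfs3_t *lfs3, const struct lfs3_config *cfg) {
    int err = lfs3_mount(lfs3, LFS3_M_CKMETA, cfg);
    if (err) {
        return err;
    }

    // heavier, reads every data block in the filesystem
    return lfs3_fs_ckdata(lfs3);
}
```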
Better traversal APIs ^

The traversal API has been completely reworked to be easier to use (both externally and internally).
No more callback needed, blocks can now be iterated over via the dir-like lfs3_trv_read function.

Traversals can also perform janitorial work and check checksums now, based on the flags provided to lfs3_trv_open.
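A sketch of what the dir-like iteration might look like; the traversal info struct, flags, and the open/read/close signatures here are assumptions:

```c
#include "lfs3.h" // assumed v3 header name

// iterate over blocks with the new traversal API, checking metadata
// checksums as we go; struct lfs3_tinfo is hypothetical
int count_blocks(lfs3_t *lfs3, lfs3_size_t *count) {
    lfs3_trv_t trv;
    int err = lfs3_trv_open(lfs3, &trv, LFS3_T_CKMETA);
    if (err) {
        return err;
    }

    *count = 0;
    struct lfs3_tinfo tinfo;
    // assuming dir-like semantics: >0 per entry, 0 at end, <0 on error
    while ((err = lfs3_trv_read(lfs3, &trv, &tinfo)) > 0) {
        *count += 1;
    }

    lfs3_trv_close(lfs3, &trv);
    return err;
}
```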
Incremental GC ^

GC work can now be accomplished incrementally, instead of requiring one big go. This is managed by lfs3_fs_gc, cfg.gc_flags, and cfg.gc_steps.

Internally, this just shoves one of the new traversal objects into lfs3_t. It's equivalent to managing a traversal object yourself, but hopefully makes it easier to write library code.

However, this does add a significant chunk of RAM to lfs3_t, so GC is now an opt-in feature behind the LFS3_GC ifdef.

Better recovery from runtime errors ^
Since we're already doing a full rewrite, I figured let's actually take the time to make sure things don't break on exceptional errors.
Most in-RAM filesystem state should now revert to the last known-good state on error.
The one exception involves file data (not metadata!). Reverting file data correctly turned out to roughly double the cost of files. And now that you can manually revert with lfs3_file_resync, I figured this cost just isn't worth it. So file data remains undefined after an error.

In total, these changes add a significant amount of code and stack, but I'm of the opinion this is necessary for the maturing of littlefs as a filesystem.
Standard custom attributes ^
Breaking disk gives us a chance to reserve attributes 0x80-0xbf for future standard custom attributes:

- 0x00-0x7f - Free for user-attributes (uattr)
- 0x80-0xbf - Reserved for standard-attributes (sattr)
- 0xc0-0xff - Encouraged for system-attributes (yattr)

In theory, it was technically possible to reserve these attributes without a disk-breaking change, but it's much safer to do so while we're already breaking the disk.
v3 also includes the possibility of extending the custom attribute space from 8-bits to ~25-bits in the future, but I'd hesitate to use this, as it risks a significant increase in stack usage.
More tests! ^
v3 comes with a couple more tests than v2 (+~6812.2%):
You may or may not have seen the test framework rework that went curiously under-utilized. That was actually in preparation for the v3 work.
The goal is not 100% line/branch coverage, but just to have more confidence in littlefs's reliability.
Simple key-value APIs ^
v3 includes a couple easy-to-use key-value APIs:
- lfs3_get - Get the contents of a file
- lfs3_size - Get the size of a file
- lfs3_set - Set the contents of a file
- lfs3_remove - Remove a file (this one already exists)

This API is limited to files that fit in RAM, but if it fits your use case, you can disable the full file API with LFS3_KVONLY to save some code. If your filesystem fits in only 2 blocks, you can also define LFS3_2BONLY to save more code.

These can be useful for creating small key-value stores on systems that already use littlefs for other storage.
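A quick sketch of what using this might look like; the exact signatures and error names are assumptions, loosely based on v2's lfs_getattr/lfs_setattr conventions:

```c
#include <stdint.h>
#include "lfs3.h" // assumed v3 header name

// read, bump, and write back a small value as a key-value pair
int bump_boot_count(lfs3_t *lfs3) {
    uint32_t boot_count = 0;

    // a missing file just means this is the first boot
    lfs3_ssize_t d = lfs3_get(lfs3, "boot_count",
            &boot_count, sizeof(boot_count));
    if (d < 0 && d != LFS3_ERR_NOENT) {
        return d;
    }

    boot_count += 1;

    // atomically replace the file's contents
    return lfs3_set(lfs3, "boot_count", &boot_count, sizeof(boot_count));
}
```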
Planned: ^
Efficient block allocation, via optional on-disk block-map (bmap) ^
The one remaining bottleneck in v3 is block allocation. This is a tricky problem for littlefs (and any CoW/CoBW filesystem), because we don't actually know when a block becomes free.
This is in-progress work, but the solution I'm currently looking at involves 1. adding an optional on-disk block map (bmap) stored in gstate, and 2. updating it via tree diffing on sync. In theory this will drop huge file writes: $O(n^2 \log n) \rightarrow O(n \log_b^2 n)$
There is also the option of using the bmap as a simple cache, which doesn't avoid the filesystem-wide scan but at least eliminates the RAM constraint of the lookahead buffer.
As a plus, we should be able to leverage the self-counting property of B-trees to make the on-disk bmap compressible.
Bad block tracking ^
This is a much requested feature, and adding the optional on-disk bmap finally gives us a place to track bad blocks.
Pre-erased block tracking ^
Just like bad-blocks, the optional on-disk bmap gives us a place to track pre-erased blocks. Well, at least in theory.
In practice it's a bit more of a nightmare. To avoid multiple progs, we need to mark erased blocks as unerased before progging. This introduces an unbounded number of catch-22s when trying to update the bmap itself.
Fortunately, if instead we store a simple counter in the bmap's gstate, we can resolve things at the mroot anchor in the worst case.
Error correction! - Metadata redundancy ^
Note it's already possible to do error-correction at the block-device level outside of littlefs, see ramcrc32cbd and ramrsbd for examples. Because of this, integrating in-block error correction is low priority.
But I think there's potential for cross-block error-correction in addition to the in-block error-correction.
The plan for cross-block error-correction/block redundancy is a bit different for metadata vs data. In littlefs, all metadata is logs, which is a bit of a problem for parity schemes. I think the best we can do is store metadata redundancy as naive copies.
But we already need two blocks for every mdir, and one usually just sits unused when not compacting. This, combined with metadata usually being much smaller than data, makes the naive scheme less costly than one might expect.
Error correction! - Data redundancy ^
For raw data blocks, we can be a bit more clever. If we add an optional dedup tree for block -> parity group mapping, and an optional parity tree for parity blocks, we can implement a RAID-esque parity scheme for up to 3 blocks of data redundancy relatively cheaply.
Transparent block deduplication ^
This one is a bit funny. Originally block deduplication was intentionally out-of-scope, but it turns out you need something that looks a lot like a dedup tree for error-correction to work in a system that allows multiple block references.
If we already need a virtual -> physical block mapping for error correction, why not make the key the block checksum and get block deduplication for free?
Though if this turns out to not be as free as I think it is, block deduplication will fall out-of-scope.
Stretch goals: ^
These may or may not be included in v3, depending on time and funding:
lfs3_migrate for v2->v3 migration ^

16-bit and 64-bit variants ^
Config API rework ^
Block device API rework ^
Custom attr API rework ^
Alternative (cheaper) write-strategies (write-once, global-aligned, eager-crystallization) ^
Advanced file tree operations (lfs3_file_punchhole, lfs3_file_insertrange, lfs3_file_collapserange, LFS3_SEEK_DATA, LFS3_SEEK_HOLE) ^

Advanced file copy-on-write operations (shallow lfs3_cowcopy + opportunistic lfs3_copy) ^

Reserved blocks to prevent CoW lockups ^
Metadata checks to prevent metadata lockups ^
Integrated block-level ECC (ramcrc32cbd, ramrsbd) ^
Disk-level RAID (this is just data redund + a disk aware block allocator) ^
Out-of-scope (for now): ^
If we don't stop somewhere, v3 will never be released. But these may be added in the future:
Alternative checksums (crc16, crc64, sha256, etc) ^
Feature-limited configurations for smaller code/stack sizes (LFS3_NO_DIRS, LFS3_KV, LFS3_2BLOCK, etc) ^

lfs3_file_openat for dir-relative APIs ^

lfs3_file_openn for non-null-terminated-string APIs ^

Transparent compression ^
Filesystem shrinking ^
High-level caches (block cache, mdir cache, btree leaf cache, etc) ^
Symbolic links ^
100% line/branch coverage ^
Code/stack size ^
littlefs v1, v2, and v3, 1 pixel ~= 1 byte of code, click for a larger interactive codemap (commit)
littlefs v2 and v3 rdonly, 1 pixel ~= 1 byte of code, click for a larger interactive codemap (commit)
Unfortunately, v3 is a little less little than v2:
On one hand, yes, more features generally means more code.
And it's true there's an opportunity here to carve out more feature-limited builds to save code/stack in the future.
But I think it's worth discussing some of the other reasons for the code/stack increase:
Runtime error recovery ^
Recovering from runtime errors isn't cheap. We need to track both the before and after state of things during fallible operations, and this adds both stack and code.
But I think this is necessary for the maturing of littlefs as a filesystem.
Maybe it will make sense to add a sort of LFS3_GLASS mode in the future, but this is out-of-scope for now.

B-tree flexibility ^
The bad news: The new B-tree files are extremely flexible. Unfortunately, this is a double-edged sword.
B-trees, on their own, don't add that much code. They are a relatively poetic data-structure. But deciding how to write to a B-tree, efficiently, with an unknown write pattern, is surprisingly tricky.
The current implementation, what I've taken to calling the "lazy-crystallization algorithm", leans on the more complicated side to see what is possible performance-wise.
The good news: The new B-tree files are extremely flexible.
There's no reason you need the full crystallization algorithm if you have a simple write pattern, or don't care as much about performance. This will either be a future or stretch goal, but it would be interesting to explore alternative write-strategies that could save code in these cases.
Traversal inversion ^
Inverting the traversal, i.e. moving from a callback to an incremental state machine, adds both code and stack as 1. all of the previous on-stack state needs to be tracked explicitly, and 2. we now need to worry about what happens if the filesystem is modified mid-traversal.
In theory, this could be reverted if you don't need incremental traversals, but extricating incremental traversals from the current codebase would be an absolute nightmare, so this is out-of-scope for now.
Benchmarks ^
A note on benchmarking: The on-disk block-map is key for scalable allocator performance, so benchmarks at this stage need to be taken with a grain of salt when many blocks are involved. Please refer to this version as "v3 (no bmap)" or something similar in any published benchmarks until this work is completed.
First off, I would highly encourage others to do their own benchmarking with v3/v2. Filesystem performance is tricky to measure because it depends heavily on your application's write pattern and hardware nuances. If you do, please share in this thread! Others may find the results useful, and now is the critical time for finding potential disk-related performance issues.
Simulated benchmarks ^
To test the math behind v3, I've put together some preliminary simulated benchmarks.
Note these are simulated and optimistic. They do not take caching or hardware buffers into account, which can have a big impact on performance. Still, I think they provide at least a good first impression of v3 vs v2.
To find an estimate of runtime, I first measured the amount of bytes read, progged, and erased, and then scaled based on values found in relevant datasheets. The options here were a bit limited, but Winbond fortunately provides runtime estimates in the datasheets on their website:
NOR flash - w25q64jv
NAND flash - w25n01gv
SD/eMMC - Also w25n01gv, assuming a perfect FTL
I said optimistic, didn't I? I couldn't find useful estimates for SD/eMMC, so I'm just assuming a perfect FTL here.
These also assume an optimal bus configuration, which, as any embedded engineer knows, is often not the case.
Full benchmarks here: https://benchmarks.littlefs.org (repo, commit)
And here are the ones I think are the most interesting:
Note that SD/eMMC is heavily penalized by the lack of on-disk block-map! SD/eMMC breaks flash down into many small blocks, which tends to make block allocator performance dominate.
Linear writes, where we write a 1 MiB file and don't call sync until closing the file. ^
This one is the most frustrating to compare against v2. CTZ skip-lists are really fast at appending! The problem is they are only fast at appending:
Random writes, note we start with a 1 MiB file. ^
As expected, v2 is comically bad at random writes. v3 is indistinguishable from zero in the NOR case:
Logging, write 4 MiB, but limit the file to 1 MiB. ^
In v2 this is accomplished by renaming the file; in v3 we can leverage lfs3_file_fruncate.

v3 performs significantly better with large blocks thanks to avoiding the sync-padding problem:
Funding ^
If you think this work is worthwhile, consider sponsoring littlefs. Current benefits include:
I joke, but I truly appreciate those who have contributed to littlefs so far. littlefs, in its current form, is a mostly self-funded project, so every little bit helps.
If you would like to contribute in a different way, or have other requests, feel free to reach me at geky at geky.net.
As stabilization gets closer, I will also be open to contract work to help port/integrate/adopt v3. If this is interesting to anyone, let me know.
Thank you @micropython, @fusedFET for sponsoring littlefs, and thank you @Eclo, @kmetabg, and @nedap for your past sponsorships!
Next steps ^
For me, I think it's time to finally put together a website/wiki/discussions/blog. I'm not sure on the frequency quite yet, but I plan to write/publish the new DESIGN.md in chapters in tandem with the remaining work.
EDIT: Pinned codemap/plot links to specific commits via benchmarks.littlefs.org/tree.html
EDIT: Updated with rdonly code/stack sizes
EDIT: Added link to #1114
EDIT: Implemented simple key-value APIs
EDIT: Added lfs3_migrate stretch goal with link to #1120
EDIT: Adopted lfs3_traversal_t -> lfs3_trv_t rename
EDIT: Added link to #1125 to clarify "feature parity"