Fix several wear-leveling issues found in lfs_alloc_reset #487
          
     Merged
      
      
    Conversation
  
    
    
  
  
    
Modulus of the offset by block_size was clearly a typo, and should be block_count. Interestingly, the later modulo operations during alloc calculations prevent this from breaking anything, but as gtaska notes it could skew the wear-leveling distribution. Found by guiserle and gtaska

On first read, randomizing the allocator's offset may seem appropriate for lfs_alloc_reset. However, it ends up using the filesystem-fed pseudorandom seed in situations it wasn't designed for. As noted by gtaska, the combination of using xors for feeding the seed and multiple traverses of the same CRCs can cause the seed to flip to zeros with concerning frequency. Removed the randomization from lfs_alloc_reset, leaving it only in lfs_mount. Found by gtaska

As noted by gtaska, we are sitting on a better hash-combining function than xor: CRC. Previous issues with xor were solvable, but relying on xor for this isn't really worth the risk when we already have a CRC function readily available. To quote a study found by gtaska: https://michiel.buddingh.eu/distribution-of-hash-values

> CRC32 seems to score really well, but its graph is skewed by the results
> of Dataset 5 (binary numbers), which may or may not be too synthetic to
> be considered a fair benchmark. But even if you subtract the results
> from that test, it does not fare significantly worse than other,
> cryptographic hash functions.
Force-pushed from 8348d7d to f215027
  
Not sure how this went unnoticed, I guess this is the first bug that needed in-depth inspection after a last-minute argument cleanup in the debug scripts.
This bug was exposed by the bad-block tests due to changes to block allocation, but could have been hit before these changes. In flash, when blocks fail, they don't fail in a predictable manner. To account for this, the bad-block tests check a number of failure behaviors. The interesting one here is LFS_TESTBD_BADBLOCK_ERASENOOP, in which bad blocks cannot be erased or programmed, and are stuck with the data written at the time the blocks go bad. This is actually a pretty realistic failure behavior, since flash needs a large voltage to force the electrons off the floating gates. Though realistically, such a failure would likely corrupt the data a bit, not leave the underlying data perfectly intact. LFS_TESTBD_BADBLOCK_ERASENOOP is rather interesting to test for because it means bad blocks can end up with perfectly valid CRCs after a failed write, confusing littlefs.

---

In this case, we had the perfect series of operations: a test was repeatedly writing the same sequence of metadata commits to the same block, which eventually goes bad, leaving the block stuck with metadata that occurs later in the sequence. What this means is that after the first commit, the metadata block contained both the first and second commits, even though the loop in the test hadn't reached that point yet.

      expected        actual
    .----------.    .----------.
    | commit 1 |    | commit 1 |
    |  crc 1   |    |  crc 1   |
    |          |    | commit 2 | <-- (from previous iteration)
    |          |    |  crc 2   |
    '----------'    '----------'

To protect against this, littlefs normally compares the written CRC against the expected CRC, but because this was the exact same data that it was going to write, the CRCs end up the same.

Ah! But doesn't littlefs also encode the state of the next page to keep track of whether the next page has been erased or not? Wouldn't that change between iterations? It does! In a single bit in the CRC tag. But thanks to some incorrect logic attempting to avoid an extra condition in the loop for writing out padding commits, the CRC that littlefs checked against was the CRC immediately before the "is-next-page-erased" bit is included. Changing the verification check to use the same CRC as what is used to verify commits on fetch solves this problem.
> The new allocation pattern from changing to CRC as a hash-combination function exposed an interesting bad-block-erase-is-noop bug if we happen to write metadata to a block that is bad, stuck, and contains the exact data we were about to write. Added a fix in 0aba71d.
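The interaction is easier to see in a toy model. The layout, the "erase-state" byte, and the checksum below are stand-ins invented for this sketch, not littlefs's on-disk format or API; the point is only which bytes the verification covers. When the stuck block already holds the exact bytes about to be written, a check that stops before the erase-state bit sees a match, while a check with the same coverage used when fetching commits does not.

```c
// Toy model of the stuck-block scenario (layout and checksum are
// illustrative stand-ins, not littlefs's metadata format).
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

// Stand-in for the real CRC; only the covered range matters here.
static uint32_t checksum(const uint8_t *data, size_t size) {
    uint32_t sum = 0;
    for (size_t i = 0; i < size; i++) {
        sum = sum * 31 + data[i];
    }
    return sum;
}

int main(void) {
    // What the stuck bad block already contains from a previous run:
    // commit 1 with erase-state = 0 (a second commit follows), then commit 2.
    uint8_t stuck_block[] = {
        'C', '1', 0, /* next region not erased */
        'C', '2', 1, /* next region erased */
    };

    // What this iteration intends to write: commit 1 with erase-state = 1.
    uint8_t intended[] = { 'C', '1', 1 /* next region erased */ };

    // Erase and program are no-ops on the bad block, so a read-back
    // returns the stuck contents unchanged.
    const uint8_t *readback = stuck_block;

    // Buggy verification: checksum stops before the erase-state byte,
    // so the stale commit 1 matches the intended commit 1.
    int buggy_match = checksum(readback, 2) == checksum(intended, 2);

    // Fixed verification: same coverage a fetch would verify, which
    // includes the erase-state byte that differs between iterations.
    int fixed_match = checksum(readback, 3) == checksum(intended, 3);

    printf("buggy check (pre-padding CRC) sees a match: %d\n", buggy_match);  // 1
    printf("fixed check (fetch-style CRC) sees a match: %d\n", fixed_match);  // 0
    return 0;
}
```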
      
  
      
  
    
  
    
Fixed incorrect modulus in lfs_alloc_reset
Modulus of the offset by block_size was clearly a typo, and should be block_count. Interestingly, the later modulo operations during alloc calculations prevent this from breaking anything, but as gtaska notes it could skew the wear-leveling distribution.
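A quick way to see the skew is a standalone toy program. The geometry and seed range below are made up for illustration (they are not littlefs defaults or source): the later modulo keeps the starting offset valid, but when block_size isn't a multiple of block_count, some blocks get picked as the allocator's starting point more often than others.

```c
// Toy demonstration of the distribution skew from the double modulo.
#include <stdio.h>
#include <stdint.h>

#define BLOCK_SIZE  4096u
#define BLOCK_COUNT 100u   // deliberately not a divisor of BLOCK_SIZE

int main(void) {
    unsigned buggy[BLOCK_COUNT] = {0};
    unsigned fixed[BLOCK_COUNT] = {0};

    // Treat every 16-bit seed value as equally likely.
    for (uint32_t seed = 0; seed < 65536; seed++) {
        buggy[(seed % BLOCK_SIZE) % BLOCK_COUNT]++;  // wrap by block_size, then block_count
        fixed[seed % BLOCK_COUNT]++;                 // wrap by block_count directly
    }

    // A wider min/max spread means some blocks are favored as starting points.
    unsigned bmin = ~0u, bmax = 0, fmin = ~0u, fmax = 0;
    for (unsigned i = 0; i < BLOCK_COUNT; i++) {
        if (buggy[i] < bmin) bmin = buggy[i];
        if (buggy[i] > bmax) bmax = buggy[i];
        if (fixed[i] < fmin) fmin = fixed[i];
        if (fixed[i] > fmax) fmax = fixed[i];
    }
    printf("double modulo:       min %u, max %u starts per block\n", bmin, bmax);
    printf("single %% block_count: min %u, max %u starts per block\n", fmin, fmax);
    return 0;
}
```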
Removed unnecessary randomization of offsets in lfs_alloc_reset
On first read, randomizing the allocator's offset may seem appropriate for lfs_alloc_reset. However, it ends up using the filesystem-fed pseudorandom seed in situations it wasn't designed for.
As noted by gtaska, the combination of using xors for feeding the seed and multiple traverses of the same CRCs can cause the seed to flip to zeros with concerning frequency.
Removed the randomization from lfs_alloc_reset, leaving it only in lfs_mount.
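A toy model of the failure mode (the CRC values are invented; this isn't littlefs code): xor-folding the same traversal of CRCs into the seed twice cancels every contribution, so repeated traversals of unchanged metadata can drive the seed back to zero.

```c
// Illustration only: xor-ing the same set of CRCs into the seed an even
// number of times cancels out, leaving the seed at its starting value (0).
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void) {
    // Pretend these are the commit CRCs seen while traversing the filesystem.
    const uint32_t crcs[] = {0xdeadbeef, 0x12345678, 0x0badcafe};
    uint32_t seed = 0;

    // Two traversals of the same metadata (e.g. once at mount, once more later).
    for (int pass = 0; pass < 2; pass++) {
        for (size_t i = 0; i < sizeof(crcs)/sizeof(crcs[0]); i++) {
            seed ^= crcs[i];
        }
        printf("seed after pass %d: 0x%08" PRIx32 "\n", pass + 1, seed);
    }
    // After the second pass every contribution has cancelled: seed == 0.
    return 0;
}
```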
Switched to CRC as the seed-collection function instead of xor
As noted by gtaska, we are sitting on a better hash-combining function than xor: CRC. Previous issues with xor were solvable, but relying on xor for this isn't really worth the risk when we already have a CRC function readily available.
To quote a study found by gtaska:
https://michiel.buddingh.eu/distribution-of-hash-values
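A sketch of the difference (the crc32_update helper below is a generic CRC-32 written for this example, not littlefs's own CRC routine): folding values into the seed through a CRC is position-sensitive, so the repeated inputs that cancel to zero under xor do not cancel under CRC.

```c
// Illustration only: combining values into a seed via xor vs. via CRC.
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>
#include <inttypes.h>

// Bitwise CRC-32 (reflected, poly 0xedb88320), chainable; illustrative only.
static uint32_t crc32_update(uint32_t crc, const void *buffer, size_t size) {
    const uint8_t *data = buffer;
    crc = ~crc;
    for (size_t i = 0; i < size; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++) {
            crc = (crc >> 1) ^ ((crc & 1) ? 0xedb88320 : 0);
        }
    }
    return ~crc;
}

int main(void) {
    // Repeated inputs, as seen when traversing the same metadata twice.
    const uint32_t crcs[] = {0xdeadbeef, 0x12345678, 0xdeadbeef, 0x12345678};
    uint32_t xor_seed = 0;
    uint32_t crc_seed = 0;

    for (size_t i = 0; i < sizeof(crcs)/sizeof(crcs[0]); i++) {
        xor_seed ^= crcs[i];                                            // old scheme
        crc_seed = crc32_update(crc_seed, &crcs[i], sizeof(crcs[i]));   // new scheme
    }

    printf("xor seed: 0x%08" PRIx32 "\n", xor_seed);  // cancels to 0
    printf("crc seed: 0x%08" PRIx32 "\n", crc_seed);  // does not cancel
    return 0;
}
```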
Should fix #484
Related discussion #488
Found by @guiserle and @gtaska