Skip to content

rwf encoded #1

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 11 commits into
base: master
Choose a base branch
from
Open

rwf encoded #1

wants to merge 11 commits into from

Conversation

boryas
Copy link

@boryas boryas commented Aug 7, 2020

No description provided.

An encoded extent can be up to 128K in length, which exceeds the largest
value expressible by the current send stream format's 16 bit tlv_len
field. Since encoded writes cannot be split into multiple writes by
btrfs send, the send stream format must change to accommodate encoded
writes.

Supporting this changed format requires retooling how we store the
commands we have processed. Since we can no longer use btrfs_tlv_header
to describe every attribute, we define a new struct btrfs_send_attribute
which has a 32 bit length field, and use that to store the attribute
information needed for receive processing. This is transparent to users
of the various TLV_GET macros.

Signed-off-by: Boris Burkov <boris@bur.io>
@boryas
Copy link
Author

boryas commented Aug 13, 2020

I made your suggested changes and force pushed the branch, it looks alright to me in the UI, still re-testing stuff besides "it builds", so apologies if something dumb snuck through.

@boryas
Copy link
Author

boryas commented Aug 13, 2020

btrfs-progs tests are broken on the truncated thing, debugging.

@boryas
Copy link
Author

boryas commented Aug 13, 2020

dumb mistake, checked against sizeof le16*, not sizeof le16. btrfs-progs tests passing now

Copy link
Owner

@osandov osandov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly style nits this time around.

Copy link
Owner

@osandov osandov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I didn't notice this stuff last time, but there's a few inconsistencies with error return values, and some trivial style nits. Otherwise it looks great. I'm going to give this and the kernel patches a heavier round of testing now, thanks!

boryas and others added 10 commits August 17, 2020 10:14
In send stream v2, write commands can now be an arbitrary size. For that
reason, we can no longer allocate a fixed array in sctx for read_cmd.
Instead, read_cmd dynamically allocates sctx->read_buf. To avoid
needless reallocations, we reuse read_buf between read_cmd calls by also
keeping track of the size of the allocated buffer in sctx->read_buf_sz.

We do the first allocation of the old default size at the start of
processing the stream, and we only reallocate if we encounter a command
that needs a larger buffer.

Signed-off-by: Boris Burkov <boris@bur.io>
The new format privileges the BTRFS_SEND_A_DATA attribute by
guaranteeing it will always be the last attribute in any command that
needs it, and by implicitly encoding the data length as the difference
between the total command length in the command header and the sizes of
the rest of the attributes (and of course the tlv_type identifying the
DATA attribute). To parse the new stream, we must read the tlv_type and
if it is not DATA, we proceed normally, but if it is DATA, we don't
parse a tlv_len but simply compute the length.

In addition, we add some bounds checking when parsing each chunk of
data, as well as for the tlv_len itself.

Signed-off-by: Boris Burkov <boris@bur.io>
Send stream v2 adds three commands and several attributes associated to
those commands. Before we implement processing them, add all the
commands and attributes. This avoids leaving the enums in an
intermediate state that doesn't correspond to any version of send
stream.

Signed-off-by: Boris Burkov <boris@bur.io>
Encoded writes in receive will use pwritev2. It is possible that the
system libc does not export this function, so we stub it out and detect
whether to build the stub code with autoconf.

This syscall has special semantics in x32 (no hi lo, just takes loff_t)
so we have to detect that case and use the appropriate arguments.

Signed-off-by: Boris Burkov <boris@bur.io>
Add a new btrfs_send_op and support for both dumping and proper receive
processing which does actual encoded writes.

Encoded writes are only allowed on a file descriptor opened with an extra
flag that allows encoded writes, so we also add support for this flag
when opening or reusing a file for writing.

Signed-off-by: Boris Burkov <boris@bur.io>
…rite

An encoded_write can fail if the file system it is being applied to does
not support encoded writes or if it can't find enough contiguous space
to accommodate the encoded extent. In those cases, we can likely still
process an encoded_write by explicitly decoding the data and doing a
normal write.

Add the necessary fallback path for decoding data compressed with zlib,
lzo, or zstd. zlib and zstd have reusable decoding context data
structures which we cache in the receive context so that we don't have
to recreate them on every encoded_write.

Finally, add a command line flag for force-decompress which causes
receive to always use the fallback path rather than first attempting the
encoded write.

Signed-off-by: Boris Burkov <boris@bur.io>
Send stream v2 can emit fallocate commands, so receive must support them
as well. The implementation simply passes along the arguments to the
syscall. Note that mode is encoded as a u32 in send stream but fallocate
takes an int, so there is a unsigned->signed conversion there.

Signed-off-by: Boris Burkov <boris@bur.io>
In send stream v2, send can emit a command for setting inode flags via
the setflags ioctl. Pass the flags attribute through to the ioctl call
in receive.

Signed-off-by: Boris Burkov <boris@bur.io>
To make the btrfs send ioctl use the stream v2 format requires passing
BTRFS_SEND_FLAG_STREAM_V2 in flags. Further, to cause the ioctl to emit
encoded_write commands for encoded extents, we must set that flag as
well as BTRFS_SEND_FLAG_COMPRESSED. Finally, we bump up the version in
send.h as well, since we are now fully compatible with v2.

Add two command line arguments to btrfs send: --stream-version and
--compressed. --stream-version requires an argument which it parses as
an integer and sets STREAM_V2 if the argument is 2. --compressed does
not require an argument and automatically implies STREAM_V2 as well
(COMPRESSED alone causes the ioctl to error out).

Some examples to illustrate edge cases:

// v1, old format and no encoded_writes
btrfs send subvol
btrfs send --stream-version 1 subvol

// v2 and compressed, we will see encoded_writes
btrfs send --compressed subvol
btrfs send --compressed --stream-version 2 subvol

// v2 only, new format but no encoded_writes
btrfs send --stream-version 2 subvol

// error: compressed needs version >= 2
btrfs send --compressed --stream-version 1 subvol

// error: invalid version (not 1 or 2)
btrfs send --stream-version 3 subvol
btrfs send --compressed --stream-version 0 subvol
btrfs send --compressed --stream-version 10 subvol

Signed-off-by: Boris Burkov <boris@bur.io>
Adapt the existing send/receive tests by passing '-o --force-compress'
to the mount commands in a new test. After writing a few files in the
various compression formats, send/receive them with and without
--force-decompress to test both the encoded_write path and the
fallback to decode+write.

Signed-off-by: Boris Burkov <boris@bur.io>
osandov pushed a commit that referenced this pull request Nov 16, 2021
…properly handled

[BUG]
When a special image (diverted from fsck/012) has its unused slots (slot
number >= nritems) with garbage, lowmem mode btrfs check can crash:

  (gdb) run check --mode=lowmem ~/downloads/good.img.restored
  Starting program: /home/adam/btrfs/btrfs-progs/btrfs check --mode=lowmem ~/downloads/good.img.restored
  ...
  ERROR: root 5 INODE[5044031582654955520] nlink(257228800) not equal to inode_refs(0)
  ERROR: root 5 INODE[5044031582654955520] nbytes 474624 not equal to extent_size 0

  Program received signal SIGSEGV, Segmentation fault.
  0x0000555555639b11 in btrfs_inode_size (eb=0x5555558a7540, s=0x642e6cd1) at ./kernel-shared/ctree.h:1703
  1703	BTRFS_SETGET_FUNCS(inode_size, struct btrfs_inode_item, size, 64);
  (gdb) bt
  #0  0x0000555555639b11 in btrfs_inode_size (eb=0x5555558a7540, s=0x642e6cd1) at ./kernel-shared/ctree.h:1703
  #1  0x0000555555641544 in check_inode_item (root=0x5555556c2290, path=0x7fffffffd960) at check/mode-lowmem.c:2628

[CAUSE]
At check_inode_item() we have path->slot[0] at 29, while the tree block
only has 26 items.

This happens because two reasons:

- btrfs_next_item() never reverts its slots
  Even if we failed to read next leaf.

- check_inode_item() doesn't inform the caller that a fatal error
  happened
  In check_inode_item(), if btrfs_next_item() failed, it goes to out
  label, which doesn't really set @err properly.

This means, when check_inode_item() fails at btrfs_next_item(), it will
increase path->slots[0], while it's already beyond current tree block
nritems.

When the slot increases furthermore, and if the unused item slots have
some garbage, we will get invalid btrfs_item_ptr() result, and causing
above segfault.

[FIX]
Fix the problems by two ways:

- Make btrfs_next_item() to revert its path->slots[0] on failure

- Properly detect fatal error from check_inode_item()

By this, we will no longer crash on the crafted image.

Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Issue: kdave#412
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
osandov pushed a commit that referenced this pull request Feb 8, 2022
…level

[BUG]
When running lowmem mode with METADATA_ITEM which has invalid level, it
will crash with the following backtrace:

 (gdb) bt
 #0  0x0000555555616b0b in btrfs_header_bytenr (eb=0x4)
     at ./kernel-shared/ctree.h:2134
 #1  0x0000555555620c78 in check_tree_block_backref (root_id=5,
     bytenr=30457856, level=256) at check/mode-lowmem.c:3818
 kdave#2  0x0000555555621f6c in check_extent_item (path=0x7fffffffd9c0)
     at check/mode-lowmem.c:4334
 kdave#3  0x00005555556235a5 in check_leaf_items (root=0x555555688e10,
     path=0x7fffffffd9c0, nrefs=0x7fffffffda30, account_bytes=1)
     at check/mode-lowmem.c:4835
 kdave#4  0x0000555555623c6d in walk_down_tree (root=0x555555688e10,
     path=0x7fffffffd9c0, level=0x7fffffffd984, nrefs=0x7fffffffda30,
     check_all=1) at check/mode-lowmem.c:4967
 kdave#5  0x000055555562494f in check_btrfs_root (root=0x555555688e10, check_all=1)
     at check/mode-lowmem.c:5266
 kdave#6  0x00005555556254ee in check_chunks_and_extents_lowmem ()
     at check/mode-lowmem.c:5556
 kdave#7  0x00005555555f0b82 in do_check_chunks_and_extents () at check/main.c:9114
 kdave#8  0x00005555555f50ea in cmd_check (cmd=0x55555567c640 <cmd_struct_check>,
     argc=3, argv=0x7fffffffdec0) at check/main.c:10892
 kdave#9  0x000055555556b2b1 in cmd_execute (argv=0x7fffffffdec0, argc=3,
     cmd=0x55555567c640 <cmd_struct_check>) at cmds/commands.h:125

[CAUSE]
For function check_extent_item() it will go through inline extent items
and then check their backrefs.

But for METADATA_ITEM, it doesn't really validate key.offset, which is
u64 and can contain value way larger than BTRFS_MAX_LEVEL (mostly caused
by bit flip).

In that case, if we have a larger value like 256 in key.offset, then
later check_tree_block_backref() will use 256 as level, and overflow
path->nodes[level] and crash.

[FIX]
Just verify the level, no matter if it's from btrfs_tree_block_level()
(which is just u8), or it's from key.offset (which is u64).

To do the check properly and detect higher bits corruption, also change
the type of @Level from u8 to u64.

Now lowmem mode can detect the problem properly:

 ...
 [2/7] checking extents
 ERROR: tree block 30457856 has bad backref level, has 256 expect [0, 7]
 ERROR: extent[30457856 16384] level mismatch, wanted: 0, have: 256
 ERROR: errors found in extent allocation tree or chunk allocation
 [3/7] checking free space tree
 ...

Reviewed-by: Su Yue <l@damenly.su>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants