Commit 63d0100

fdmanana authored and gregkh committed
btrfs: fix exhaustion of the system chunk array due to concurrent allocations
[ Upstream commit eafa4fd ]

When we are running out of space for updating the chunk tree, that is,
when we are low on available space in the system space info, if we have
many tasks concurrently allocating block groups, via fallocate for example,
many of them can end up all allocating new system chunks when only one is
needed. In extreme cases this can lead to exhaustion of the system chunk
array, which has a size limit of 2048 bytes, and results in a transaction
abort with errno EFBIG, producing a trace in dmesg like the following,
which was triggered on a PowerPC machine with a node/leaf size of 64K:

  [1359.518899] ------------[ cut here ]------------
  [1359.518980] BTRFS: Transaction aborted (error -27)
  [1359.519135] WARNING: CPU: 3 PID: 16463 at ../fs/btrfs/block-group.c:1968 btrfs_create_pending_block_groups+0x340/0x3c0 [btrfs]
  [1359.519152] Modules linked in: (...)
  [1359.519239] Supported: Yes, External
  [1359.519252] CPU: 3 PID: 16463 Comm: stress-ng Tainted: G X 5.3.18-47-default #1 SLE15-SP3
  [1359.519274] NIP: c008000000e36fe8 LR: c008000000e36fe4 CTR: 00000000006de8e8
  [1359.519293] REGS: c00000056890b700 TRAP: 0700 Tainted: G X (5.3.18-47-default)
  [1359.519317] MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 48008222 XER: 00000007
  [1359.519356] CFAR: c00000000013e170 IRQMASK: 0
  [1359.519356] GPR00: c008000000e36fe4 c00000056890b990 c008000000e83200 0000000000000026
  [1359.519356] GPR04: 0000000000000000 0000000000000000 0000d52a3b027651 0000000000000007
  [1359.519356] GPR08: 0000000000000003 0000000000000001 0000000000000007 0000000000000000
  [1359.519356] GPR12: 0000000000008000 c00000063fe44600 000000001015e028 000000001015dfd0
  [1359.519356] GPR16: 000000000000404f 0000000000000001 0000000000010000 0000dd1e287affff
  [1359.519356] GPR20: 0000000000000001 c000000637c9a000 ffffffffffffffe5 0000000000000000
  [1359.519356] GPR24: 0000000000000004 0000000000000000 0000000000000100 ffffffffffffffc0
  [1359.519356] GPR28: c000000637c9a000 c000000630e09230 c000000630e091d8 c000000562188b08
  [1359.519561] NIP [c008000000e36fe8] btrfs_create_pending_block_groups+0x340/0x3c0 [btrfs]
  [1359.519613] LR [c008000000e36fe4] btrfs_create_pending_block_groups+0x33c/0x3c0 [btrfs]
  [1359.519626] Call Trace:
  [1359.519671] [c00000056890b990] [c008000000e36fe4] btrfs_create_pending_block_groups+0x33c/0x3c0 [btrfs] (unreliable)
  [1359.519729] [c00000056890ba90] [c008000000d68d44] __btrfs_end_transaction+0xbc/0x2f0 [btrfs]
  [1359.519782] [c00000056890bae0] [c008000000e309ac] btrfs_alloc_data_chunk_ondemand+0x154/0x610 [btrfs]
  [1359.519844] [c00000056890bba0] [c008000000d8a0fc] btrfs_fallocate+0xe4/0x10e0 [btrfs]
  [1359.519891] [c00000056890bd00] [c0000000004a23b4] vfs_fallocate+0x174/0x350
  [1359.519929] [c00000056890bd50] [c0000000004a3cf8] ksys_fallocate+0x68/0xf0
  [1359.519957] [c00000056890bda0] [c0000000004a3da8] sys_fallocate+0x28/0x40
  [1359.519988] [c00000056890bdc0] [c000000000038968] system_call_exception+0xe8/0x170
  [1359.520021] [c00000056890be20] [c00000000000cb70] system_call_common+0xf0/0x278
  [1359.520037] Instruction dump:
  [1359.520049] 7d0049ad 40c2fff4 7c0004ac 71490004 40820024 2f83fffb 419e0048 3c620000
  [1359.520082] e863bcb8 7ec4b378 48010d91 e8410018 <0fe00000> 3c820000 e884bcc8 7ec6b378
  [1359.520122] ---[ end trace d6c186e151022e20 ]---

The following steps explain how we can end up in this situation:

1) Task A is at check_system_chunk(), either because it is allocating a
   new data or metadata block group, at btrfs_chunk_alloc(), or because
   it is removing a block group or turning a block group RO. It does not
   matter why;

2) Task A sees that there is not enough free space in the system
   space_info object, that is 'left' is < 'thresh'. And at this point
   the system space_info has a value of 0 for its 'bytes_may_use'
   counter;

3) As a consequence task A calls btrfs_alloc_chunk() in order to allocate
   a new system block group (chunk) and then reserves 'thresh' bytes in
   the chunk block reserve with the call to btrfs_block_rsv_add(). This
   changes the chunk block reserve's 'reserved' and 'size' counters by an
   amount of 'thresh', and changes the 'bytes_may_use' counter of the
   system space_info object from 0 to 'thresh'.

   Also during its call to btrfs_alloc_chunk(), we end up increasing the
   value of the 'total_bytes' counter of the system space_info object by
   8MiB (the size of a system chunk stripe). This happens through the
   call chain:

   btrfs_alloc_chunk()
       create_chunk()
           btrfs_make_block_group()
               btrfs_update_space_info()

4) After it finishes the first phase of the block group allocation, at
   btrfs_chunk_alloc(), task A unlocks the chunk mutex;

5) At this point the new system block group was added to the transaction
   handle's list of new block groups, but its block group item, device
   items and chunk item were not yet inserted in the extent, device and
   chunk trees, respectively. That only happens later when we call
   btrfs_finish_chunk_alloc() through a call to
   btrfs_create_pending_block_groups();

   Note that only when we update the chunk tree, through the call to
   btrfs_finish_chunk_alloc(), we decrement the 'reserved' counter
   of the chunk block reserve as we COW/allocate extent buffers,
   through:

   btrfs_alloc_tree_block()
       btrfs_use_block_rsv()
           btrfs_block_rsv_use_bytes()

   And the system space_info's 'bytes_may_use' is decremented every time
   we allocate an extent buffer for COW operations on the chunk tree,
   through:

   btrfs_alloc_tree_block()
       btrfs_reserve_extent()
           find_free_extent()
               btrfs_add_reserved_bytes()

   If we end up COWing less chunk btree nodes/leaves than expected, which
   is the typical case since the amount of space we reserve is always
   pessimistic to account for the worst possible case, we release the
   unused space through:

   btrfs_create_pending_block_groups()
       btrfs_trans_release_chunk_metadata()
           btrfs_block_rsv_release()
               block_rsv_release_bytes()
                   btrfs_space_info_free_bytes_may_use()

   But before task A gets into btrfs_create_pending_block_groups()...

6) Many other tasks start allocating new block groups through fallocate,
   each one does the first phase of block group allocation in a
   serialized way, since btrfs_chunk_alloc() takes the chunk mutex
   before calling check_system_chunk() and btrfs_alloc_chunk().

   However before everyone enters the final phase of the block group
   allocation, that is, before calling btrfs_create_pending_block_groups(),
   new tasks keep coming to allocate new block groups and while at
   check_system_chunk(), the system space_info's 'bytes_may_use' keeps
   increasing each time a task reserves space in the chunk block reserve.
   This means that eventually some other task can end up not seeing enough
   free space in the system space_info and decide to allocate yet another
   system chunk.

   This may repeat several times if yet more new tasks keep allocating
   new block groups before task A, and all the other tasks, finish the
   creation of the pending block groups, which is when reserved space
   in excess is released. Eventually this can result in exhaustion of
   system chunk array in the superblock, with btrfs_add_system_chunk()
   returning EFBIG, resulting later in a transaction abort.

   Even when we don't reach the extreme case of exhausting the system
   array, most, if not all, unnecessarily created system block groups
   end up being unused since when finishing creation of the first
   pending system block group, the creation of the following ones end
   up not needing to COW nodes/leaves of the chunk tree, so we never
   allocate and deallocate from them, resulting in them never being
   added to the list of unused block groups - as a consequence they
   don't get deleted by the cleaner kthread - the only exceptions are
   if we unmount and mount the filesystem again, which adds any unused
   block groups to the list of unused block groups, if a scrub is
   run, which also adds unused block groups to the unused list, and
   under some circumstances when using a zoned filesystem or async
   discard, which may also add unused block groups to the unused list.

So fix this by:

*) Tracking the number of reserved bytes for the chunk tree per
   transaction, which is the sum of reserved chunk bytes by each
   transaction handle currently being used;

*) When there is not enough free space in the system space_info,
   if there are other transaction handles which reserved chunk space,
   wait for some of them to complete in order to have enough excess
   reserved space released, and then try again. Otherwise proceed with
   the creation of a new system chunk.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
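
As an illustration of steps 1) to 6) in the message above, the following is a minimal userspace sketch of the accounting, not kernel code. The names struct sys_space_model, model_check_system_chunk() and model_chunks_allocated() are hypothetical. The model assumes that every task reserves exactly 'thresh' bytes, that COW consumes nothing, and that the filesystem starts with a single 8MiB system chunk; only the 8MiB stripe size comes from the message, any threshold passed in is an arbitrary illustrative value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily simplified model of the system space_info. */
struct sys_space_model {
	uint64_t total_bytes;   /* grows by one stripe per new system chunk */
	uint64_t bytes_may_use; /* grows by 'thresh' per reservation */
	unsigned int chunks;    /* system chunks allocated by the model */
};

/*
 * Model of the first phase as described in steps 2) and 3): allocate a
 * system chunk when free space is below 'thresh', then reserve 'thresh'.
 */
static void model_check_system_chunk(struct sys_space_model *m,
				     uint64_t thresh, uint64_t stripe)
{
	uint64_t left = m->total_bytes - m->bytes_may_use;

	if (left < thresh) {
		m->total_bytes += stripe;
		m->chunks++;
	}
	m->bytes_may_use += thresh;
}

/*
 * Run the first phase for 'ntasks' tasks back to back. When
 * 'release_between' is true, each task's reservation is released before
 * the next task runs (assumption: all of it was in excess).
 */
static unsigned int model_chunks_allocated(unsigned int ntasks,
					   uint64_t thresh, uint64_t stripe,
					   bool release_between)
{
	struct sys_space_model m = {
		.total_bytes = stripe, /* assumption: one existing chunk */
		.bytes_may_use = 0,
		.chunks = 0,
	};
	unsigned int i;

	for (i = 0; i < ntasks; i++) {
		model_check_system_chunk(&m, thresh, stripe);
		if (release_between)
			m.bytes_may_use -= thresh;
	}
	return m.chunks;
}
```

Under these assumptions, model_chunks_allocated(9, 3ULL << 20, 8ULL << 20, false) counts three new system chunks, while the same call with release_between set to true counts none. The gap between the two results is the over-allocation that the fix aims to avoid by waiting for excess reservations to be released.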
1 parent 7c12458 commit 63d0100

3 files changed, +69 -1 lines changed

fs/btrfs/block-group.c

Lines changed: 57 additions & 1 deletion
@@ -3269,6 +3269,7 @@ static u64 get_profile_num_devs(struct btrfs_fs_info *fs_info, u64 type)
  */
 void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
 {
+	struct btrfs_transaction *cur_trans = trans->transaction;
 	struct btrfs_fs_info *fs_info = trans->fs_info;
 	struct btrfs_space_info *info;
 	u64 left;
@@ -3283,6 +3284,7 @@ void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
 	lockdep_assert_held(&fs_info->chunk_mutex);
 
 	info = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_SYSTEM);
+again:
 	spin_lock(&info->lock);
 	left = info->total_bytes - btrfs_space_info_used(info, true);
 	spin_unlock(&info->lock);
@@ -3301,6 +3303,58 @@ void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
 
 	if (left < thresh) {
 		u64 flags = btrfs_system_alloc_profile(fs_info);
+		u64 reserved = atomic64_read(&cur_trans->chunk_bytes_reserved);
+
+		/*
+		 * If there's not available space for the chunk tree (system
+		 * space) and there are other tasks that reserved space for
+		 * creating a new system block group, wait for them to complete
+		 * the creation of their system block group and release excess
+		 * reserved space. We do this because:
+		 *
+		 * *) We can end up allocating more system chunks than necessary
+		 *    when there are multiple tasks that are concurrently
+		 *    allocating block groups, which can lead to exhaustion of
+		 *    the system array in the superblock;
+		 *
+		 * *) If we allocate extra and unnecessary system block groups,
+		 *    despite being empty for a long time, and possibly forever,
+		 *    they end not being added to the list of unused block groups
+		 *    because that typically happens only when deallocating the
+		 *    last extent from a block group - which never happens since
+		 *    we never allocate from them in the first place. The few
+		 *    exceptions are when mounting a filesystem or running scrub,
+		 *    which add unused block groups to the list of unused block
+		 *    groups, to be deleted by the cleaner kthread.
+		 *    And even when they are added to the list of unused block
+		 *    groups, it can take a long time until they get deleted,
+		 *    since the cleaner kthread might be sleeping or busy with
+		 *    other work (deleting subvolumes, running delayed iputs,
+		 *    defrag scheduling, etc);
+		 *
+		 * This is rare in practice, but can happen when too many tasks
+		 * are allocating blocks groups in parallel (via fallocate())
+		 * and before the one that reserved space for a new system block
+		 * group finishes the block group creation and releases the space
+		 * reserved in excess (at btrfs_create_pending_block_groups()),
+		 * other tasks end up here and see free system space temporarily
+		 * not enough for updating the chunk tree.
+		 *
+		 * We unlock the chunk mutex before waiting for such tasks and
+		 * lock it again after the wait, otherwise we would deadlock.
+		 * It is safe to do so because allocating a system chunk is the
+		 * first thing done while allocating a new block group.
+		 */
+		if (reserved > trans->chunk_bytes_reserved) {
+			const u64 min_needed = reserved - thresh;
+
+			mutex_unlock(&fs_info->chunk_mutex);
+			wait_event(cur_trans->chunk_reserve_wait,
+			   atomic64_read(&cur_trans->chunk_bytes_reserved) <=
+			   min_needed);
+			mutex_lock(&fs_info->chunk_mutex);
+			goto again;
+		}
 
 		/*
 		 * Ignore failure to create system chunk. We might end up not
@@ -3315,8 +3369,10 @@ void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
 		ret = btrfs_block_rsv_add(fs_info->chunk_root,
 					  &fs_info->chunk_block_rsv,
 					  thresh, BTRFS_RESERVE_NO_FLUSH);
-		if (!ret)
+		if (!ret) {
+			atomic64_add(thresh, &cur_trans->chunk_bytes_reserved);
 			trans->chunk_bytes_reserved += thresh;
+		}
 	}
 }
 
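
The arithmetic behind the new wait in check_system_chunk() can be examined in isolation. The sketch below is a userspace illustration, not kernel code: struct chunk_wait_plan, plan_chunk_wait() and chunk_wait_done() are hypothetical names, and uint64_t stands in for the kernel's u64. It mirrors the two expressions visible in the hunk, the `reserved > trans->chunk_bytes_reserved` check and the `reserved - thresh` wait target, and nothing else.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type, not part of the kernel patch. */
struct chunk_wait_plan {
	bool should_wait;    /* other handles hold chunk reservations */
	uint64_t min_needed; /* wait until the shared counter is <= this */
};

/*
 * trans_reserved:  transaction-wide counter (all handles combined)
 * handle_reserved: the calling handle's own reservation
 * thresh:          bytes the caller wants for the chunk tree update
 */
static struct chunk_wait_plan plan_chunk_wait(uint64_t trans_reserved,
					      uint64_t handle_reserved,
					      uint64_t thresh)
{
	struct chunk_wait_plan plan;

	/* Waiting only helps if someone else can release space. */
	plan.should_wait = trans_reserved > handle_reserved;
	/*
	 * Same expression as in the hunk. With unsigned 64-bit operands
	 * this wraps around when thresh is larger than trans_reserved.
	 */
	plan.min_needed = trans_reserved - thresh;
	return plan;
}

/* The wait_event() condition from the hunk, as a plain predicate. */
static bool chunk_wait_done(uint64_t trans_reserved_now, uint64_t min_needed)
{
	return trans_reserved_now <= min_needed;
}
```

For example, with a transaction-wide counter of 300, an own reservation of 100 and a threshold of 100, the sketch decides to wait until the counter has dropped to 200 or below. If the threshold exceeds the counter, the unsigned subtraction wraps and the predicate is immediately true in this model; whether that combination can occur in the kernel is not something the commit message states.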

fs/btrfs/transaction.c

Lines changed: 5 additions & 0 deletions
@@ -260,6 +260,7 @@ static inline int extwriter_counter_read(struct btrfs_transaction *trans)
 void btrfs_trans_release_chunk_metadata(struct btrfs_trans_handle *trans)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_transaction *cur_trans = trans->transaction;
 
 	if (!trans->chunk_bytes_reserved)
 		return;
@@ -268,6 +269,8 @@ void btrfs_trans_release_chunk_metadata(struct btrfs_trans_handle *trans)
 
 	btrfs_block_rsv_release(fs_info, &fs_info->chunk_block_rsv,
 				trans->chunk_bytes_reserved, NULL);
+	atomic64_sub(trans->chunk_bytes_reserved, &cur_trans->chunk_bytes_reserved);
+	cond_wake_up(&cur_trans->chunk_reserve_wait);
 	trans->chunk_bytes_reserved = 0;
 }
 
@@ -383,6 +386,8 @@ static noinline int join_transaction(struct btrfs_fs_info *fs_info,
 	spin_lock_init(&cur_trans->dropped_roots_lock);
 	INIT_LIST_HEAD(&cur_trans->releasing_ebs);
 	spin_lock_init(&cur_trans->releasing_ebs_lock);
+	atomic64_set(&cur_trans->chunk_bytes_reserved, 0);
+	init_waitqueue_head(&cur_trans->chunk_reserve_wait);
 	list_add_tail(&cur_trans->list, &fs_info->trans_list);
 	extent_io_tree_init(fs_info, &cur_trans->dirty_pages,
 			IO_TREE_TRANS_DIRTY_PAGES, fs_info->btree_inode);

fs/btrfs/transaction.h

Lines changed: 7 additions & 0 deletions
@@ -96,6 +96,13 @@ struct btrfs_transaction {
 
 	spinlock_t releasing_ebs_lock;
 	struct list_head releasing_ebs;
+
+	/*
+	 * The number of bytes currently reserved, by all transaction handles
+	 * attached to this transaction, for metadata extents of the chunk tree.
+	 */
+	atomic64_t chunk_bytes_reserved;
+	wait_queue_head_t chunk_reserve_wait;
 };
 
 #define __TRANS_FREEZABLE	(1U << 0)
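
The relationship between the new per-transaction counter and the existing per-handle counter can be sketched in userspace. This is an illustration under stated assumptions, not the kernel implementation: C11 <stdatomic.h> stands in for the kernel's atomic64_t, the model_* names are hypothetical, and the wait queue wakeup performed by the patch is left out.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the counter added to struct btrfs_transaction. */
struct model_transaction {
	_Atomic uint64_t chunk_bytes_reserved; /* sum over attached handles */
};

/* Hypothetical stand-in for the per-handle state. */
struct model_handle {
	struct model_transaction *transaction;
	uint64_t chunk_bytes_reserved; /* this handle's share of the sum */
};

/* Mirrors the reserve side: bump the shared and the private counter. */
static void model_reserve(struct model_handle *h, uint64_t bytes)
{
	atomic_fetch_add(&h->transaction->chunk_bytes_reserved, bytes);
	h->chunk_bytes_reserved += bytes;
}

/* Mirrors the release side: give back the handle's whole share. */
static void model_release(struct model_handle *h)
{
	if (!h->chunk_bytes_reserved)
		return;
	atomic_fetch_sub(&h->transaction->chunk_bytes_reserved,
			 h->chunk_bytes_reserved);
	/* The kernel patch wakes chunk_reserve_wait here; omitted in this model. */
	h->chunk_bytes_reserved = 0;
}

/* Read the transaction-wide total. */
static uint64_t model_reserved_total(struct model_transaction *t)
{
	return atomic_load(&t->chunk_bytes_reserved);
}
```

In this model the transaction-wide value always equals the sum of the handles' private values, which is the property the comment on the new field describes. A second model_release() on the same handle is a no-op, matching the early return in btrfs_trans_release_chunk_metadata().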
