
Commit 97306be

Author: Alexei Starovoitov (committed)
Merge branch 'switch to memcg-based memory accounting'
Roman Gushchin says:

====================

Currently bpf is using the memlock rlimit for memory accounting. This approach has its downsides and over time has created a significant amount of problems:

1) The limit is per-user, but because most bpf operations are performed as root, the limit has little value.

2) It's hard to come up with a specific maximum value, especially because the counter is shared with non-bpf use cases (e.g. memlock()). Any specific value is either too low and creates false failures, or too high and useless.

3) Charging is not connected to the actual memory allocation. Bpf code has to manually calculate the estimated cost, charge the counter, and then take care of uncharging, including on all failure paths. This adds to the code complexity and makes it easy to leak a charge.

4) There is no simple way of getting the current value of the counter. We've used drgn for it, but it's far from convenient.

5) A cryptic -EPERM is returned on exceeding the limit. Libbpf even had a function to "explain" this case for users.

6) rlimits are generally considered (at least partially) obsolete. They do not provide a comprehensive system for the control of physical resources: memory, cpu, io, etc. All resource control development in recent years has been related to cgroups.

In order to overcome these problems, let's switch to memory cgroup-based memory accounting of bpf objects. With the recent addition of percpu memory accounting, it is now possible to provide comprehensive accounting of the memory used by bpf programs and maps.

This approach has the following advantages:

1) The limit is per-cgroup and hierarchical. It is far more flexible and allows better control over memory usage by different workloads.

2) The actual memory consumption is taken into account. Charging happens automatically at allocation time if the __GFP_ACCOUNT flag is passed. Uncharging is also performed automatically when the memory is released. So the code on the bpf side becomes simpler and safer.

3) There is a simple way to get the current value and statistics.

Cgroup-based accounting adds new requirements:

1) The kernel config should have CONFIG_CGROUPS and CONFIG_MEMCG_KMEM enabled. These options are usually enabled, perhaps excluding tiny builds for embedded devices.

2) The system should have a configured cgroup hierarchy, including reasonable memory limits and/or guarantees. Modern systems usually delegate this task to systemd or similar task managers.

Without meeting these requirements there are no limits on how much memory bpf can use, and a non-root user can hurt the system by allocating too much. But because per-user rlimits do not provide a functional system to protect and manage physical resources anyway, anyone who seriously depends on such limits should use cgroups.

When a bpf map is created, the memory cgroup of the process which creates the map is recorded. Subsequently, all memory allocations related to the bpf map are charged to the same cgroup. This includes allocations made from interrupt context and by other processes. Bpf program memory is charged to the memory cgroup of the process which loads the program.
The patchset consists of the following parts:

1) 4 mm patches that are required on the mm side, otherwise vmallocs cannot be mapped to userspace
2) memcg-based accounting for various bpf objects: progs and maps
3) removal of the rlimit-based accounting
4) removal of rlimit adjustments in userspace samples

v9:
- always charge the saved memory cgroup, by Daniel, Toke and Alexei
- added bpf_map_kzalloc()
- rebase and minor fixes

v8:
- extended the cover letter to be clearer on the new requirements, by Daniel
- an approximate value is provided by the map memlock info, by Alexei

v7:
- introduced bpf_map_kmalloc_node() and bpf_map_alloc_percpu(), by Alexei
- switched allocations made from an interrupt context to the new helpers, by Daniel
- rebase and minor fixes

v6:
- rebased to the latest version of the remote charging API
- fixed signatures, added acks

v5:
- rebased to the latest version of the remote charging API
- implemented kmem accounting from an interrupt context, by Shakeel
- rebased to the latest changes in mm that allow mapping vmallocs to userspace
- fixed a build issue in kselftests, by Alexei
- fixed a use-after-free bug in bpf_map_free_deferred()
- added bpf line info coverage, by Shakeel
- split bpf map charging preparations into a separate patch

v4:
- covered allocations made from an interrupt context, by Daniel
- added some clarifications to the cover letter

v3:
- dropped the userspace part for further discussions/refinements, by Andrii and Song

v2:
- fixed a build issue caused by the remaining rlimit-based accounting for sockhash maps

====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
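The charging scheme described in the cover letter hinges on the map remembering its creator's memory cgroup. Below is a minimal sketch of that step, not taken from this diff: the helper names bpf_map_save_memcg()/bpf_map_release_memcg() are assumed for illustration, while get_mem_cgroup_from_mm() and mem_cgroup_put() are existing memcg primitives.

#ifdef CONFIG_MEMCG_KMEM
/* On map creation: take a reference on the memcg of the creating process
 * and store it in the map, so later allocations can be charged to it even
 * after the creating process has exited.
 */
static void bpf_map_save_memcg(struct bpf_map *map)
{
	map->memcg = get_mem_cgroup_from_mm(current->mm);
}

/* On map destruction: drop the reference taken at creation time. */
static void bpf_map_release_memcg(struct bpf_map *map)
{
	mem_cgroup_put(map->memcg);
}
#endif

Every later allocation for the map is then made with __GFP_ACCOUNT while this memcg is the active one, so charging and uncharging track the real allocation and free paths automatically.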
2 parents 9e83f54 + 5b0764b commit 97306be


61 files changed: +533 additions, -762 deletions

fs/buffer.c

Lines changed: 1 addition & 1 deletion
@@ -657,7 +657,7 @@ int __set_page_dirty_buffers(struct page *page)
 		} while (bh != head);
 	}
 	/*
-	 * Lock out page->mem_cgroup migration to keep PageDirty
+	 * Lock out page's memcg migration to keep PageDirty
 	 * synchronized with per-memcg dirty page counters.
 	 */
 	lock_page_memcg(page);

fs/iomap/buffered-io.c

Lines changed: 1 addition & 1 deletion
@@ -650,7 +650,7 @@ iomap_set_page_dirty(struct page *page)
 		return !TestSetPageDirty(page);
 
 	/*
-	 * Lock out page->mem_cgroup migration to keep PageDirty
+	 * Lock out page's memcg migration to keep PageDirty
 	 * synchronized with per-memcg dirty page counters.
 	 */
 	lock_page_memcg(page);

include/linux/bpf.h

Lines changed: 34 additions & 23 deletions
@@ -20,6 +20,8 @@
 #include <linux/module.h>
 #include <linux/kallsyms.h>
 #include <linux/capability.h>
+#include <linux/sched/mm.h>
+#include <linux/slab.h>
 
 struct bpf_verifier_env;
 struct bpf_verifier_log;
@@ -37,6 +39,7 @@ struct bpf_iter_aux_info;
 struct bpf_local_storage;
 struct bpf_local_storage_map;
 struct kobject;
+struct mem_cgroup;
 
 extern struct idr btf_idr;
 extern spinlock_t btf_idr_lock;
@@ -135,11 +138,6 @@ struct bpf_map_ops {
 	const struct bpf_iter_seq_info *iter_seq_info;
 };
 
-struct bpf_map_memory {
-	u32 pages;
-	struct user_struct *user;
-};
-
 struct bpf_map {
 	/* The first two cachelines with read-mostly members of which some
 	 * are also accessed in fast-path (e.g. ops, max_entries).
@@ -160,7 +158,9 @@ struct bpf_map {
 	u32 btf_key_type_id;
 	u32 btf_value_type_id;
 	struct btf *btf;
-	struct bpf_map_memory memory;
+#ifdef CONFIG_MEMCG_KMEM
+	struct mem_cgroup *memcg;
+#endif
 	char name[BPF_OBJ_NAME_LEN];
 	u32 btf_vmlinux_value_type_id;
 	bool bypass_spec_v1;
@@ -1202,8 +1202,6 @@ void bpf_prog_sub(struct bpf_prog *prog, int i);
 void bpf_prog_inc(struct bpf_prog *prog);
 struct bpf_prog * __must_check bpf_prog_inc_not_zero(struct bpf_prog *prog);
 void bpf_prog_put(struct bpf_prog *prog);
-int __bpf_prog_charge(struct user_struct *user, u32 pages);
-void __bpf_prog_uncharge(struct user_struct *user, u32 pages);
 void __bpf_free_used_maps(struct bpf_prog_aux *aux,
 			  struct bpf_map **used_maps, u32 len);
 
@@ -1218,12 +1216,6 @@ void bpf_map_inc_with_uref(struct bpf_map *map);
 struct bpf_map * __must_check bpf_map_inc_not_zero(struct bpf_map *map);
 void bpf_map_put_with_uref(struct bpf_map *map);
 void bpf_map_put(struct bpf_map *map);
-int bpf_map_charge_memlock(struct bpf_map *map, u32 pages);
-void bpf_map_uncharge_memlock(struct bpf_map *map, u32 pages);
-int bpf_map_charge_init(struct bpf_map_memory *mem, u64 size);
-void bpf_map_charge_finish(struct bpf_map_memory *mem);
-void bpf_map_charge_move(struct bpf_map_memory *dst,
-			 struct bpf_map_memory *src);
 void *bpf_map_area_alloc(u64 size, int numa_node);
 void *bpf_map_area_mmapable_alloc(u64 size, int numa_node);
 void bpf_map_area_free(void *base);
@@ -1240,6 +1232,34 @@ int generic_map_delete_batch(struct bpf_map *map,
 struct bpf_map *bpf_map_get_curr_or_next(u32 *id);
 struct bpf_prog *bpf_prog_get_curr_or_next(u32 *id);
 
+#ifdef CONFIG_MEMCG_KMEM
+void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
+			   int node);
+void *bpf_map_kzalloc(const struct bpf_map *map, size_t size, gfp_t flags);
+void __percpu *bpf_map_alloc_percpu(const struct bpf_map *map, size_t size,
+				    size_t align, gfp_t flags);
+#else
+static inline void *
+bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
+		     int node)
+{
+	return kmalloc_node(size, flags, node);
+}
+
+static inline void *
+bpf_map_kzalloc(const struct bpf_map *map, size_t size, gfp_t flags)
+{
+	return kzalloc(size, flags);
+}
+
+static inline void __percpu *
+bpf_map_alloc_percpu(const struct bpf_map *map, size_t size, size_t align,
+		     gfp_t flags)
+{
+	return __alloc_percpu_gfp(size, align, flags);
+}
+#endif
+
 extern int sysctl_unprivileged_bpf_disabled;
 
 static inline bool bpf_allow_ptr_leaks(void)
@@ -1490,15 +1510,6 @@ bpf_prog_inc_not_zero(struct bpf_prog *prog)
 	return ERR_PTR(-EOPNOTSUPP);
 }
 
-static inline int __bpf_prog_charge(struct user_struct *user, u32 pages)
-{
-	return 0;
-}
-
-static inline void __bpf_prog_uncharge(struct user_struct *user, u32 pages)
-{
-}
-
 static inline void bpf_link_init(struct bpf_link *link, enum bpf_link_type type,
 				 const struct bpf_link_ops *ops,
 				 struct bpf_prog *prog)
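The header only carries the !CONFIG_MEMCG_KMEM fallbacks above, which forward straight to kmalloc_node()/kzalloc()/__alloc_percpu_gfp(). The CONFIG_MEMCG_KMEM variants are defined elsewhere in the patchset; the following is a minimal sketch of one of them, assuming the remote charging API's set_active_memcg() and the map->memcg field added to struct bpf_map above, not the exact upstream code.

void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
			   int node)
{
	struct mem_cgroup *old_memcg;
	void *ptr;

	/* Make the map's saved memcg the active one so the allocation below
	 * is charged to it, even when called from interrupt context or from
	 * a task other than the map's creator.
	 */
	old_memcg = set_active_memcg(map->memcg);
	ptr = kmalloc_node(size, flags | __GFP_ACCOUNT, node);
	set_active_memcg(old_memcg);

	return ptr;
}

bpf_map_kzalloc() and bpf_map_alloc_percpu() would follow the same pattern around kzalloc() and __alloc_percpu_gfp().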
