…/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2024-02-29

We've added 119 non-merge commits during the last 32 day(s) which contain
a total of 150 files changed, 3589 insertions(+), 995 deletions(-).

The main changes are:

1) Extend the BPF verifier to enable static subprog calls in spin lock
   critical sections, from Kumar Kartikeya Dwivedi.

2) Fix confusing and incorrect inference of PTR_TO_CTX argument type
   in BPF global subprogs, from Andrii Nakryiko.

3) Larger batch of riscv BPF JIT improvements and enabling inlining
   of the bpf_kptr_xchg() for RV64, from Pu Lehui.

4) Allow skeleton users to change the values of the fields in struct_ops
   maps at runtime, from Kui-Feng Lee.

5) Extend the verifier's capabilities of tracking scalars when they
   are spilled to stack, especially when the spill or fill is narrowing,
   from Maxim Mikityanskiy & Eduard Zingerman.

6) Various BPF selftest improvements to fix errors under the gcc BPF backend,
   from Jose E. Marchesi.

7) Avoid module loading failure when the module trying to register
   a struct_ops has its BTF section stripped, from Geliang Tang.

8) Annotate all kfuncs in the .BTF_ids section, which eventually allows
   for automatic kfunc prototype generation from bpftool, from Daniel Xu.

9) Several updates to the instruction-set.rst IETF standardization
   document, from Dave Thaler.

10) Shrink the size of struct bpf_map and struct bpf_array,
    from Alexei Starovoitov.

11) Initial small subset of BPF verifier prepwork for sleepable bpf_timer,
    from Benjamin Tissoires.

12) Fix bpftool to be more portable to musl libc by using POSIX's
    basename(), from Arnaldo Carvalho de Melo.

13) Add libbpf support for gcc in CORE macro definitions,
    from Cupertino Miranda.

14) Remove a duplicate type check in perf_event_bpf_event,
    from Florian Lehner.

15) Fix the bpf_spin_{un,}lock BPF helpers so they are correctly annotated
    with notrace, from Yonghong Song.

16) Replace the deprecated bpf_lpm_trie_key 0-length array with flexible
    array to fix build warnings, from Kees Cook.

17) Fix resolve_btfids cross-compilation to non host-native endianness,
    from Viktor Malik.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (119 commits)
  selftests/bpf: Test if shadow types work correctly.
  bpftool: Add an example for struct_ops map and shadow type.
  bpftool: Generated shadow variables for struct_ops maps.
  libbpf: Convert st_ops->data to shadow type.
  libbpf: Set btf_value_type_id of struct bpf_map for struct_ops.
  bpf: Replace bpf_lpm_trie_key 0-length array with flexible array
  bpf, arm64: use bpf_prog_pack for memory management
  arm64: patching: implement text_poke API
  bpf, arm64: support exceptions
  arm64: stacktrace: Implement arch_bpf_stack_walk() for the BPF JIT
  bpf: add is_async_callback_calling_insn() helper
  bpf: introduce in_sleepable() helper
  bpf: allow more maps in sleepable bpf programs
  selftests/bpf: Test case for lacking CFI stub functions.
  bpf: Check cfi_stubs before registering a struct_ops type.
  bpf: Clarify batch lookup/lookup_and_delete semantics
  bpf, docs: specify which BPF_ABS and BPF_IND fields were zero
  bpf, docs: Fix typos in instruction-set.rst
  selftests/bpf: update tcp_custom_syncookie to use scalar packet offset
  bpf: Shrink size of struct bpf_map/bpf_array.
  ...
====================

Link: https://lore.kernel.org/r/20240301001625.8800-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
kuba-moo committed Mar 3, 2024
2 parents e960825 + 0270d69 commit 4b2765a
Showing 150 changed files with 3,589 additions and 995 deletions.
8 changes: 4 additions & 4 deletions Documentation/bpf/kfuncs.rst
@@ -177,10 +177,10 @@ In addition to kfuncs' arguments, verifier may need more information about the
type of kfunc(s) being registered with the BPF subsystem. To do so, we define
flags on a set of kfuncs as follows::

BTF_SET8_START(bpf_task_set)
BTF_KFUNCS_START(bpf_task_set)
BTF_ID_FLAGS(func, bpf_get_task_pid, KF_ACQUIRE | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_put_pid, KF_RELEASE)
BTF_SET8_END(bpf_task_set)
BTF_KFUNCS_END(bpf_task_set)

This set encodes the BTF ID of each kfunc listed above, and encodes the flags
along with it. Of course, it is also allowed to specify no flags.
@@ -347,10 +347,10 @@ Once the kfunc is prepared for use, the final step to making it visible is
registering it with the BPF subsystem. Registration is done per BPF program
type. An example is shown below::

BTF_SET8_START(bpf_task_set)
BTF_KFUNCS_START(bpf_task_set)
BTF_ID_FLAGS(func, bpf_get_task_pid, KF_ACQUIRE | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_put_pid, KF_RELEASE)
BTF_SET8_END(bpf_task_set)
BTF_KFUNCS_END(bpf_task_set)

static const struct btf_kfunc_id_set bpf_task_kfunc_set = {
.owner = THIS_MODULE,
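
For reference, the registration this example builds up to typically completes as follows; a minimal sketch, assuming the tracing program type (the init function name and initcall level are illustrative)::

    static const struct btf_kfunc_id_set bpf_task_kfunc_set = {
            .owner = THIS_MODULE,
            .set   = &bpf_task_set,
    };

    static int init_task_kfuncs(void)
    {
            /* Make the set above visible to tracing programs. */
            return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING,
                                             &bpf_task_kfunc_set);
    }
    late_initcall(init_task_kfuncs);
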
2 changes: 1 addition & 1 deletion Documentation/bpf/map_lpm_trie.rst
@@ -17,7 +17,7 @@ significant byte.

LPM tries may be created with a maximum prefix length that is a multiple
of 8, in the range from 8 to 2048. The key used for lookup and update
operations is a ``struct bpf_lpm_trie_key``, extended by
operations is a ``struct bpf_lpm_trie_key_u8``, extended by
``max_prefixlen/8`` bytes.

- For IPv4 addresses the data length is 4 bytes
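
As an illustration of that layout, a user-space key for an IPv4 trie can be declared as below; a minimal sketch, and the wrapper struct name is made up for the example::

    #include <linux/types.h>

    /* Same layout as struct bpf_lpm_trie_key_u8 extended by
     * max_prefixlen/8 = 4 bytes of address data.
     */
    struct ipv4_lpm_key {
            __u32 prefixlen;   /* up to 32 for IPv4 */
            __u8  data[4];     /* address in network byte order */
    };

Such a key is then passed to bpf_map_lookup_elem() and bpf_map_update_elem() as usual.
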
155 changes: 89 additions & 66 deletions Documentation/bpf/standardization/instruction-set.rst
@@ -1,11 +1,11 @@
.. contents::
.. sectnum::

=======================================
BPF Instruction Set Specification, v1.0
=======================================
======================================
BPF Instruction Set Architecture (ISA)
======================================

This document specifies version 1.0 of the BPF instruction set.
This document specifies the BPF instruction set architecture (ISA).

Documentation conventions
=========================
@@ -102,7 +102,7 @@ Conformance groups

An implementation does not need to support all instructions specified in this
document (e.g., deprecated instructions). Instead, a number of conformance
groups are specified. An implementation must support the "basic" conformance
groups are specified. An implementation must support the base32 conformance
group and may support additional conformance groups, where supporting a
conformance group means it must support all instructions in that conformance
group.
@@ -112,12 +112,22 @@ that executes instructions, and tools such as compilers that generate
instructions for the runtime. Thus, capability discovery in terms of
conformance groups might be done manually by users or automatically by tools.

Each conformance group has a short ASCII label (e.g., "basic") that
Each conformance group has a short ASCII label (e.g., "base32") that
corresponds to a set of instructions that are mandatory. That is, each
instruction has one or more conformance groups of which it is a member.

The "basic" conformance group includes all instructions defined in this
specification unless otherwise noted.
This document defines the following conformance groups:

* base32: includes all instructions defined in this
specification unless otherwise noted.
* base64: includes base32, plus instructions explicitly noted
as being in the base64 conformance group.
* atomic32: includes 32-bit atomic operation instructions (see `Atomic operations`_).
* atomic64: includes atomic32, plus 64-bit atomic operation instructions.
* divmul32: includes 32-bit division, multiplication, and modulo instructions.
* divmul64: includes divmul32, plus 64-bit division, multiplication,
and modulo instructions.
* legacy: deprecated packet access instructions.

Instruction encoding
====================
@@ -166,10 +176,10 @@ Note that most instructions do not use all of the fields.
Unused fields shall be cleared to zero.

As discussed below in `64-bit immediate instructions`_, a 64-bit immediate
instruction uses a 64-bit immediate value that is constructed as follows.
instruction uses two 32-bit immediate values that are constructed as follows.
The 64 bits following the basic instruction contain a pseudo instruction
using the same format but with opcode, dst_reg, src_reg, and offset all set to zero,
and imm containing the high 32 bits of the immediate value.
using the same format but with 'opcode', 'dst_reg', 'src_reg', and 'offset' all
set to zero, and imm containing the high 32 bits of the immediate value.

This is depicted in the following figure::

@@ -181,13 +191,8 @@ This is depicted in the following figure::
'--------------'
pseudo instruction

Thus the 64-bit immediate value is constructed as follows:

imm64 = (next_imm << 32) | imm

where 'next_imm' refers to the imm value of the pseudo instruction
following the basic instruction. The unused bytes in the pseudo
instruction are reserved and shall be cleared to zero.
Here, the imm value of the pseudo instruction is called 'next_imm'. The unused
bytes in the pseudo instruction are reserved and shall be cleared to zero.
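
For clarity, the way the two halves combine can be modeled in C as follows; a small non-normative sketch::

    #include <stdint.h>

    /* imm64 = (next_imm << 32) | imm, with both halves treated as
     * unsigned 32-bit values before being combined.
     */
    static inline uint64_t bpf_imm64(int32_t imm, int32_t next_imm)
    {
            return ((uint64_t)(uint32_t)next_imm << 32) | (uint32_t)imm;
    }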

Instruction classes
-------------------
@@ -239,7 +244,8 @@ Arithmetic instructions
-----------------------

``BPF_ALU`` uses 32-bit wide operands while ``BPF_ALU64`` uses 64-bit wide operands for
otherwise identical operations.
otherwise identical operations. ``BPF_ALU64`` instructions belong to the
base64 conformance group unless noted otherwise.
The 'code' field encodes the operation as below, where 'src' and 'dst' refer
to the values of the source and destination registers, respectively.

@@ -284,15 +290,19 @@ where '(u32)' indicates that the upper 32 bits are zeroed.

``BPF_XOR | BPF_K | BPF_ALU`` means::

dst = (u32) dst ^ (u32) imm32
dst = (u32) dst ^ (u32) imm

``BPF_XOR | BPF_K | BPF_ALU64`` means::

dst = dst ^ imm32
dst = dst ^ imm
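
A small C model of the 32-bit flavor, assuming registers are held in 64-bit variables (illustrative only, not part of the specification)::

    #include <stdint.h>

    /* BPF_XOR | BPF_K | BPF_ALU: operate on the low 32 bits and zero
     * the upper 32 bits of the destination, per the '(u32)' note above.
     */
    static inline uint64_t alu32_xor_imm(uint64_t dst, int32_t imm)
    {
            return (uint32_t)dst ^ (uint32_t)imm;
    }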

Note that most instructions have instruction offset of 0. Only three instructions
(``BPF_SDIV``, ``BPF_SMOD``, ``BPF_MOVSX``) have a non-zero offset.

Division, multiplication, and modulo operations for ``BPF_ALU`` are part
of the "divmul32" conformance group, and division, multiplication, and
modulo operations for ``BPF_ALU64`` are part of the "divmul64" conformance
group.
The division and modulo operations support both unsigned and signed flavors.

For unsigned operations (``BPF_DIV`` and ``BPF_MOD``), for ``BPF_ALU``,
@@ -349,7 +359,9 @@ BPF_ALU64 Reserved 0x00 do byte swap unconditionally
========= ========= ===== =================================================

The 'imm' field encodes the width of the swap operations. The following widths
are supported: 16, 32 and 64.
are supported: 16, 32 and 64. Width 64 operations belong to the base64
conformance group and other swap operations belong to the base32
conformance group.

Examples:

@@ -374,31 +386,33 @@ Examples:
Jump instructions
-----------------

``BPF_JMP32`` uses 32-bit wide operands while ``BPF_JMP`` uses 64-bit wide operands for
otherwise identical operations.
``BPF_JMP32`` uses 32-bit wide operands and indicates the base32
conformance group, while ``BPF_JMP`` uses 64-bit wide operands for
otherwise identical operations, and indicates the base64 conformance
group unless otherwise specified.
The 'code' field encodes the operation as below:

======== ===== === =============================== =============================================
code value src description notes
======== ===== === =============================== =============================================
BPF_JA 0x0 0x0 PC += offset BPF_JMP | BPF_K only
BPF_JA 0x0 0x0 PC += imm BPF_JMP32 | BPF_K only
BPF_JEQ 0x1 any PC += offset if dst == src
BPF_JGT 0x2 any PC += offset if dst > src unsigned
BPF_JGE 0x3 any PC += offset if dst >= src unsigned
BPF_JSET 0x4 any PC += offset if dst & src
BPF_JNE 0x5 any PC += offset if dst != src
BPF_JSGT 0x6 any PC += offset if dst > src signed
BPF_JSGE 0x7 any PC += offset if dst >= src signed
BPF_CALL 0x8 0x0 call helper function by address BPF_JMP | BPF_K only, see `Helper functions`_
BPF_CALL 0x8 0x1 call PC += imm BPF_JMP | BPF_K only, see `Program-local functions`_
BPF_CALL 0x8 0x2 call helper function by BTF ID BPF_JMP | BPF_K only, see `Helper functions`_
BPF_EXIT 0x9 0x0 return BPF_JMP | BPF_K only
BPF_JLT 0xa any PC += offset if dst < src unsigned
BPF_JLE 0xb any PC += offset if dst <= src unsigned
BPF_JSLT 0xc any PC += offset if dst < src signed
BPF_JSLE 0xd any PC += offset if dst <= src signed
======== ===== === =============================== =============================================
======== ===== ======= =============================== =============================================
code value src_reg description notes
======== ===== ======= =============================== =============================================
BPF_JA 0x0 0x0 PC += offset BPF_JMP | BPF_K only
BPF_JA 0x0 0x0 PC += imm BPF_JMP32 | BPF_K only
BPF_JEQ 0x1 any PC += offset if dst == src
BPF_JGT 0x2 any PC += offset if dst > src unsigned
BPF_JGE 0x3 any PC += offset if dst >= src unsigned
BPF_JSET 0x4 any PC += offset if dst & src
BPF_JNE 0x5 any PC += offset if dst != src
BPF_JSGT 0x6 any PC += offset if dst > src signed
BPF_JSGE 0x7 any PC += offset if dst >= src signed
BPF_CALL 0x8 0x0 call helper function by address BPF_JMP | BPF_K only, see `Helper functions`_
BPF_CALL 0x8 0x1 call PC += imm BPF_JMP | BPF_K only, see `Program-local functions`_
BPF_CALL 0x8 0x2 call helper function by BTF ID BPF_JMP | BPF_K only, see `Helper functions`_
BPF_EXIT 0x9 0x0 return BPF_JMP | BPF_K only
BPF_JLT 0xa any PC += offset if dst < src unsigned
BPF_JLE 0xb any PC += offset if dst <= src unsigned
BPF_JSLT 0xc any PC += offset if dst < src signed
BPF_JSLE 0xd any PC += offset if dst <= src signed
======== ===== ======= =============================== =============================================

The BPF program needs to store the return value into register R0 before doing a
``BPF_EXIT``.
@@ -424,6 +438,9 @@ specified by the 'imm' field. A > 16-bit conditional jump may be
converted to a < 16-bit conditional jump plus a 32-bit unconditional
jump.

All ``BPF_CALL`` and ``BPF_JA`` instructions belong to the
base32 conformance group.
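
As a non-normative sketch, the unconditional jump target depends on the instruction class: ``BPF_JMP | BPF_JA`` uses the 16-bit 'offset' field while ``BPF_JMP32 | BPF_JA`` uses the 32-bit 'imm' field, per the table above::

    #include <stdint.h>

    #define CLASS_JMP   0x05    /* mirrors the BPF_JMP class value */
    #define CLASS_JMP32 0x06    /* mirrors the BPF_JMP32 class value */

    /* Follows the "PC += offset" / "PC += imm" pseudocode above. */
    static inline int64_t ja_new_pc(int64_t pc, uint8_t insn_class,
                                    int16_t offset, int32_t imm)
    {
            return pc + (insn_class == CLASS_JMP32 ? imm : offset);
    }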

Helper functions
~~~~~~~~~~~~~~~~

@@ -481,6 +498,8 @@ The size modifier is one of:
BPF_DW 0x18 double word (8 bytes)
============= ===== =====================

Instructions using ``BPF_DW`` belong to the base64 conformance group.

Regular load and store operations
---------------------------------

@@ -493,7 +512,7 @@ instructions that transfer data between a register and memory.

``BPF_MEM | <size> | BPF_ST`` means::

*(size *) (dst + offset) = imm32
*(size *) (dst + offset) = imm

``BPF_MEM | <size> | BPF_LDX`` means::

@@ -525,8 +544,10 @@ by other BPF programs or means outside of this specification.
All atomic operations supported by BPF are encoded as store operations
that use the ``BPF_ATOMIC`` mode modifier as follows:

* ``BPF_ATOMIC | BPF_W | BPF_STX`` for 32-bit operations
* ``BPF_ATOMIC | BPF_DW | BPF_STX`` for 64-bit operations
* ``BPF_ATOMIC | BPF_W | BPF_STX`` for 32-bit operations, which are
part of the "atomic32" conformance group.
* ``BPF_ATOMIC | BPF_DW | BPF_STX`` for 64-bit operations, which are
part of the "atomic64" conformance group.
* 8-bit and 16-bit wide atomic operations are not supported.

The 'imm' field is used to encode the actual atomic operation.
Expand All @@ -547,7 +568,7 @@ BPF_XOR 0xa0 atomic xor

*(u32 *)(dst + offset) += src

``BPF_ATOMIC | BPF_DW | BPF_STX`` with 'imm' = BPF ADD means::
``BPF_ATOMIC | BPF_DW | BPF_STX`` with 'imm' = BPF_ADD means::

*(u64 *)(dst + offset) += src

@@ -580,24 +601,24 @@ and loaded back to ``R0``.
-----------------------------

Instructions with the ``BPF_IMM`` 'mode' modifier use the wide instruction
encoding defined in `Instruction encoding`_, and use the 'src' field of the
encoding defined in `Instruction encoding`_, and use the 'src_reg' field of the
basic instruction to hold an opcode subtype.

The following table defines a set of ``BPF_IMM | BPF_DW | BPF_LD`` instructions
with opcode subtypes in the 'src' field, using new terms such as "map"
with opcode subtypes in the 'src_reg' field, using new terms such as "map"
defined further below:

========================= ====== === ========================================= =========== ==============
opcode construction opcode src pseudocode imm type dst type
========================= ====== === ========================================= =========== ==============
BPF_IMM | BPF_DW | BPF_LD 0x18 0x0 dst = imm64 integer integer
BPF_IMM | BPF_DW | BPF_LD 0x18 0x1 dst = map_by_fd(imm) map fd map
BPF_IMM | BPF_DW | BPF_LD 0x18 0x2 dst = map_val(map_by_fd(imm)) + next_imm map fd data pointer
BPF_IMM | BPF_DW | BPF_LD 0x18 0x3 dst = var_addr(imm) variable id data pointer
BPF_IMM | BPF_DW | BPF_LD 0x18 0x4 dst = code_addr(imm) integer code pointer
BPF_IMM | BPF_DW | BPF_LD 0x18 0x5 dst = map_by_idx(imm) map index map
BPF_IMM | BPF_DW | BPF_LD 0x18 0x6 dst = map_val(map_by_idx(imm)) + next_imm map index data pointer
========================= ====== === ========================================= =========== ==============
========================= ====== ======= ========================================= =========== ==============
opcode construction opcode src_reg pseudocode imm type dst type
========================= ====== ======= ========================================= =========== ==============
BPF_IMM | BPF_DW | BPF_LD 0x18 0x0 dst = (next_imm << 32) | imm integer integer
BPF_IMM | BPF_DW | BPF_LD 0x18 0x1 dst = map_by_fd(imm) map fd map
BPF_IMM | BPF_DW | BPF_LD 0x18 0x2 dst = map_val(map_by_fd(imm)) + next_imm map fd data pointer
BPF_IMM | BPF_DW | BPF_LD 0x18 0x3 dst = var_addr(imm) variable id data pointer
BPF_IMM | BPF_DW | BPF_LD 0x18 0x4 dst = code_addr(imm) integer code pointer
BPF_IMM | BPF_DW | BPF_LD 0x18 0x5 dst = map_by_idx(imm) map index map
BPF_IMM | BPF_DW | BPF_LD 0x18 0x6 dst = map_val(map_by_idx(imm)) + next_imm map index data pointer
========================= ====== ======= ========================================= =========== ==============

where

@@ -635,7 +656,9 @@ Legacy BPF Packet access instructions
-------------------------------------

BPF previously introduced special instructions for access to packet data that were
carried over from classic BPF. However, these instructions are
deprecated and should no longer be used. All legacy packet access
instructions belong to the "legacy" conformance group instead of the "basic"
conformance group.
carried over from classic BPF. These instructions used an instruction
class of BPF_LD, a size modifier of BPF_W, BPF_H, or BPF_B, and a
mode modifier of BPF_ABS or BPF_IND. The 'dst_reg' and 'offset' fields were
set to zero, and 'src_reg' was set to zero for BPF_ABS. However, these
instructions are deprecated and should no longer be used. All legacy packet
access instructions belong to the "legacy" conformance group.
33 changes: 19 additions & 14 deletions Documentation/networking/af_xdp.rst
@@ -329,23 +329,24 @@ XDP_SHARED_UMEM option and provide the initial socket's fd in the
sxdp_shared_umem_fd field as you registered the UMEM on that
socket. These two sockets will now share one and the same UMEM.

There is no need to supply an XDP program like the one in the previous
case where sockets were bound to the same queue id and
device. Instead, use the NIC's packet steering capabilities to steer
the packets to the right queue. In the previous example, there is only
one queue shared among sockets, so the NIC cannot do this steering. It
can only steer between queues.

In libbpf, you need to use the xsk_socket__create_shared() API as it
takes a reference to a FILL ring and a COMPLETION ring that will be
created for you and bound to the shared UMEM. You can use this
function for all the sockets you create, or you can use it for the
second and following ones and use xsk_socket__create() for the first
one. Both methods yield the same result.
In this case, it is possible to use the NIC's packet steering
capabilities to steer the packets to the right queue. This is not
possible in the previous example as there is only one queue shared
among sockets, so the NIC cannot do this steering as it can only steer
between queues.

In libxdp (or libbpf prior to version 1.0), you need to use the
xsk_socket__create_shared() API as it takes a reference to a FILL ring
and a COMPLETION ring that will be created for you and bound to the
shared UMEM. You can use this function for all the sockets you create,
or you can use it for the second and following ones and use
xsk_socket__create() for the first one. Both methods yield the same
result.
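
A minimal sketch of this setup (error handling trimmed; the helper struct and function names are made up for the example, and the ring variables must outlive the sockets)::

    #include <xdp/xsk.h>   /* <bpf/xsk.h> with libbpf before 1.0 */

    struct xsk_rings {
            struct xsk_ring_cons rx, comp;
            struct xsk_ring_prod tx, fill;
    };

    /* Create two sockets that share 'umem'. Each socket gets its own
     * FILL and COMPLETION rings; the queues may be the same or differ.
     */
    static int create_two_shared_sockets(struct xsk_umem *umem,
                                         const char *ifname,
                                         __u32 queue1, __u32 queue2,
                                         struct xsk_rings *r1,
                                         struct xsk_rings *r2,
                                         struct xsk_socket **xsk1,
                                         struct xsk_socket **xsk2)
    {
            int err;

            err = xsk_socket__create_shared(xsk1, ifname, queue1, umem,
                                            &r1->rx, &r1->tx, &r1->fill,
                                            &r1->comp, NULL);
            if (err)
                    return err;

            return xsk_socket__create_shared(xsk2, ifname, queue2, umem,
                                             &r2->rx, &r2->tx, &r2->fill,
                                             &r2->comp, NULL);
    }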

Note that a UMEM can be shared between sockets on the same queue id
and device, as well as between queues on the same device and between
devices at the same time.
devices at the same time. It is also possible to redirect to any
socket as long as it is bound to the same umem with XDP_SHARED_UMEM.

XDP_USE_NEED_WAKEUP bind flag
-----------------------------
@@ -822,6 +823,10 @@ A: The short answer is no, that is not supported at the moment. The
switch, or other distribution mechanism, in your NIC to direct
traffic to the correct queue id and socket.

Note that if you are using the XDP_SHARED_UMEM option, it is
possible to switch traffic between any socket bound to the same
umem.

Q: My packets are sometimes corrupted. What is wrong?

A: Care has to be taken not to feed the same buffer in the UMEM into
2 changes: 2 additions & 0 deletions arch/arm64/include/asm/patching.h
@@ -8,6 +8,8 @@ int aarch64_insn_read(void *addr, u32 *insnp);
int aarch64_insn_write(void *addr, u32 insn);

int aarch64_insn_write_literal_u64(void *addr, u64 val);
void *aarch64_insn_set(void *dst, u32 insn, size_t len);
void *aarch64_insn_copy(void *dst, void *src, size_t len);

int aarch64_insn_patch_text_nosync(void *addr, u32 insn);
int aarch64_insn_patch_text(void *addrs[], u32 insns[], int cnt);