Merge tag 'sched_ext-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext support from Tejun Heo:
 "This implements a new scheduler class called ‘ext_sched_class’, or
  sched_ext, which allows scheduling policies to be implemented as BPF
  programs.

  The goals of this are:

   - Ease of experimentation and exploration: Enabling rapid iteration
     of new scheduling policies.

   - Customization: Building application-specific schedulers which
     implement policies that are not applicable to general-purpose
     schedulers.

   - Rapid scheduler deployments: Non-disruptive swap outs of scheduling
     policies in production environments"

See individual commits for more documentation, but also the cover letter
for the latest series:

Link: https://lore.kernel.org/all/20240618212056.2833381-1-tj@kernel.org/

* tag 'sched_ext-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (110 commits)
  sched: Move update_other_load_avgs() to kernel/sched/pelt.c
  sched_ext: Don't trigger ops.quiescent/runnable() on migrations
  sched_ext: Synchronize bypass state changes with rq lock
  scx_qmap: Implement highpri boosting
  sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()
  sched_ext: Compact struct bpf_iter_scx_dsq_kern
  sched_ext: Replace consume_local_task() with move_local_task_to_local_dsq()
  sched_ext: Move consume_local_task() upward
  sched_ext: Move sanity check and dsq_mod_nr() into task_unlink_from_dsq()
  sched_ext: Reorder args for consume_local/remote_task()
  sched_ext: Restructure dispatch_to_local_dsq()
  sched_ext: Fix processs_ddsp_deferred_locals() by unifying DTL_INVALID handling
  sched_ext: Make find_dsq_for_dispatch() handle SCX_DSQ_LOCAL_ON
  sched_ext: Refactor consume_remote_task()
  sched_ext: Rename scx_kfunc_set_sleepable to unlocked and relocate
  sched_ext: Add missing static to scx_dump_data
  sched_ext: Add missing static to scx_has_op[]
  sched_ext: Temporarily work around pick_task_scx() being called without balance_scx()
  sched_ext: Add a cgroup scheduler which uses flattened hierarchy
  sched_ext: Add cgroup support
  ...
torvalds committed Sep 21, 2024
2 parents 440b652 + 902d67a commit 8826498
Showing 99 changed files with 16,029 additions and 126 deletions.
1 change: 1 addition & 0 deletions Documentation/scheduler/index.rst
@@ -21,6 +21,7 @@ Scheduler
sched-nice-design
sched-rt-group
sched-stats
sched-ext
sched-debug

text_files
316 changes: 316 additions & 0 deletions Documentation/scheduler/sched-ext.rst
@@ -0,0 +1,316 @@
==========================
Extensible Scheduler Class
==========================

sched_ext is a scheduler class whose behavior can be defined by a set of BPF
programs - the BPF scheduler.

* sched_ext exports a full scheduling interface so that any scheduling
algorithm can be implemented on top.

* The BPF scheduler can group CPUs however it sees fit and schedule them
together, as tasks aren't tied to specific CPUs at the time of wakeup.

* The BPF scheduler can be turned on and off dynamically anytime.

* The system integrity is maintained no matter what the BPF scheduler does.
The default scheduling behavior is restored anytime an error is detected,
a runnable task stalls, or on invoking the SysRq key sequence
:kbd:`SysRq-S`.

* When the BPF scheduler triggers an error, debug information is dumped to
aid debugging. The debug dump is passed to and printed out by the
scheduler binary. The debug dump can also be accessed through the
`sched_ext_dump` tracepoint. The SysRq key sequence :kbd:`SysRq-D`
triggers a debug dump. This doesn't terminate the BPF scheduler and can
only be read through the tracepoint.
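
For example, assuming a standard tracefs mount and the tracepoint being
exposed as ``sched_ext_dump`` under the ``sched_ext`` event group (an
assumption for illustration, not something stated above), the dump could be
captured roughly as follows:

.. code-block:: none

    # echo 1 > /sys/kernel/tracing/events/sched_ext/sched_ext_dump/enable
    # cat /sys/kernel/tracing/trace_pipe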

Switching to and from sched_ext
===============================

``CONFIG_SCHED_CLASS_EXT`` is the config option to enable sched_ext and
``tools/sched_ext`` contains the example schedulers. The following config
options should be enabled to use sched_ext:

.. code-block:: none

    CONFIG_BPF=y
    CONFIG_SCHED_CLASS_EXT=y
    CONFIG_BPF_SYSCALL=y
    CONFIG_BPF_JIT=y
    CONFIG_DEBUG_INFO_BTF=y
    CONFIG_BPF_JIT_ALWAYS_ON=y
    CONFIG_BPF_JIT_DEFAULT_ON=y
    CONFIG_PAHOLE_HAS_SPLIT_BTF=y
    CONFIG_PAHOLE_HAS_BTF_TAG=y

sched_ext is used only when the BPF scheduler is loaded and running.

If a task explicitly sets its scheduling policy to ``SCHED_EXT``, it will be
treated as ``SCHED_NORMAL`` and scheduled by CFS until the BPF scheduler is
loaded.
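
For illustration only, a task could opt itself into ``SCHED_EXT`` with a
plain ``sched_setscheduler()`` call. The sketch below assumes policy number
7 from ``include/uapi/linux/sched.h``, defined locally in case the libc
headers do not provide ``SCHED_EXT`` yet:

.. code-block:: c

    #include <sched.h>
    #include <stdio.h>

    #ifndef SCHED_EXT
    #define SCHED_EXT 7    /* include/uapi/linux/sched.h */
    #endif

    int main(void)
    {
            struct sched_param param = { .sched_priority = 0 };

            /* Behaves like SCHED_NORMAL until a BPF scheduler is loaded */
            if (sched_setscheduler(0, SCHED_EXT, &param))
                    perror("sched_setscheduler");
            return 0;
    }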

When the BPF scheduler is loaded and ``SCX_OPS_SWITCH_PARTIAL`` is not set
in ``ops->flags``, all ``SCHED_NORMAL``, ``SCHED_BATCH``, ``SCHED_IDLE``, and
``SCHED_EXT`` tasks are scheduled by sched_ext.

However, when the BPF scheduler is loaded and ``SCX_OPS_SWITCH_PARTIAL`` is
set in ``ops->flags``, only tasks with the ``SCHED_EXT`` policy are scheduled
by sched_ext, while tasks with ``SCHED_NORMAL``, ``SCHED_BATCH`` and
``SCHED_IDLE`` policies are scheduled by CFS.

Terminating the sched_ext scheduler program, triggering :kbd:`SysRq-S`, or
detection of any internal error including stalled runnable tasks aborts the
BPF scheduler and reverts all tasks back to CFS.

.. code-block:: none

    # make -j16 -C tools/sched_ext
    # tools/sched_ext/scx_simple
    local=0 global=3
    local=5 global=24
    local=9 global=44
    local=13 global=56
    local=17 global=72
    ^CEXIT: BPF scheduler unregistered

The current status of the BPF scheduler can be determined as follows:

.. code-block:: none

    # cat /sys/kernel/sched_ext/state
    enabled
    # cat /sys/kernel/sched_ext/root/ops
    simple

``tools/sched_ext/scx_show_state.py`` is a drgn script which shows more
detailed information:

.. code-block:: none

    # tools/sched_ext/scx_show_state.py
    ops           : simple
    enabled       : 1
    switching_all : 1
    switched_all  : 1
    enable_state  : enabled (2)
    bypass_depth  : 0
    nr_rejected   : 0

If ``CONFIG_SCHED_DEBUG`` is set, whether a given task is on sched_ext can
be determined as follows:

.. code-block:: none

    # grep ext /proc/self/sched
    ext.enabled : 1

The Basics
==========

Userspace can implement an arbitrary BPF scheduler by loading a set of BPF
programs that implement ``struct sched_ext_ops``. The only mandatory field
is ``ops.name`` which must be a valid BPF object name. All operations are
optional. The following modified excerpt from
``tools/sched_ext/scx_simple.bpf.c`` shows a minimal global FIFO scheduler.

.. code-block:: c

    /*
     * Decide which CPU a task should be migrated to before being
     * enqueued (either at wakeup, fork time, or exec time). If an
     * idle core is found by the default ops.select_cpu() implementation,
     * then dispatch the task directly to SCX_DSQ_LOCAL and skip the
     * ops.enqueue() callback.
     *
     * Note that this implementation has exactly the same behavior as the
     * default ops.select_cpu implementation. The behavior of the scheduler
     * would be exactly the same if the implementation just didn't define
     * the simple_select_cpu() struct_ops prog.
     */
    s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p,
                       s32 prev_cpu, u64 wake_flags)
    {
            s32 cpu;
            /* Need to initialize or the BPF verifier will reject the program */
            bool direct = false;

            cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &direct);
            if (direct)
                    scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);

            return cpu;
    }

    /*
     * Do a direct dispatch of a task to the global DSQ. This ops.enqueue()
     * callback will only be invoked if we failed to find a core to dispatch
     * to in ops.select_cpu() above.
     *
     * Note that this implementation has exactly the same behavior as the
     * default ops.enqueue implementation, which just dispatches the task
     * to SCX_DSQ_GLOBAL. The behavior of the scheduler would be exactly
     * the same if the implementation just didn't define the simple_enqueue
     * struct_ops prog.
     */
    void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
    {
            scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
    }

    s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
    {
            /*
             * By default, all SCHED_EXT, SCHED_OTHER, SCHED_IDLE, and
             * SCHED_BATCH tasks should use sched_ext.
             */
            return 0;
    }

    void BPF_STRUCT_OPS(simple_exit, struct scx_exit_info *ei)
    {
            exit_type = ei->type;
    }

    SEC(".struct_ops")
    struct sched_ext_ops simple_ops = {
            .select_cpu             = (void *)simple_select_cpu,
            .enqueue                = (void *)simple_enqueue,
            .init                   = (void *)simple_init,
            .exit                   = (void *)simple_exit,
            .name                   = "simple",
    };

Dispatch Queues
---------------

To match the impedance between the scheduler core and the BPF scheduler,
sched_ext uses DSQs (dispatch queues) which can operate as both a FIFO and a
priority queue. By default, there is one global FIFO (``SCX_DSQ_GLOBAL``),
and one local DSQ per CPU (``SCX_DSQ_LOCAL``). The BPF scheduler can manage
an arbitrary number of DSQs using ``scx_bpf_create_dsq()`` and
``scx_bpf_destroy_dsq()``.

A CPU always executes a task from its local DSQ. A task is "dispatched" to a
DSQ. A non-local DSQ is "consumed" to transfer a task to the consuming CPU's
local DSQ.

When a CPU is looking for the next task to run, if the local DSQ is not
empty, the first task is picked. Otherwise, the CPU tries to consume the
global DSQ. If that doesn't yield a runnable task either, ``ops.dispatch()``
is invoked.
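
As an illustration (a minimal sketch rather than an excerpt from an example
scheduler; ``MY_DSQ_ID`` is an arbitrary custom DSQ ID chosen here), a
scheduler could create one shared DSQ at init time, enqueue everything into
it, and consume from it when a CPU needs work:

.. code-block:: c

    #define MY_DSQ_ID 0    /* illustrative custom DSQ ID, below 2^63 */

    s32 BPF_STRUCT_OPS_SLEEPABLE(sketch_init)
    {
            /* -1 == any NUMA node for this example */
            return scx_bpf_create_dsq(MY_DSQ_ID, -1);
    }

    void BPF_STRUCT_OPS(sketch_enqueue, struct task_struct *p, u64 enq_flags)
    {
            /* FIFO-dispatch into the custom DSQ instead of SCX_DSQ_GLOBAL */
            scx_bpf_dispatch(p, MY_DSQ_ID, SCX_SLICE_DFL, enq_flags);
    }

    void BPF_STRUCT_OPS(sketch_dispatch, s32 cpu, struct task_struct *prev)
    {
            /* Move the first queued task to this CPU's local DSQ */
            scx_bpf_consume(MY_DSQ_ID);
    }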

Scheduling Cycle
----------------

The following briefly shows how a waking task is scheduled and executed.

1. When a task is waking up, ``ops.select_cpu()`` is the first operation
invoked. This serves two purposes: it provides a CPU selection optimization
hint, and it wakes up the selected CPU if it is idle.

The CPU selected by ``ops.select_cpu()`` is an optimization hint and not
binding. The actual decision is made at the last step of scheduling.
However, there is a small performance gain if the CPU returned by
``ops.select_cpu()`` matches the CPU the task eventually runs on.

A side-effect of selecting a CPU is waking it up from idle. While a BPF
scheduler can wake up any CPU using the ``scx_bpf_kick_cpu()`` helper,
using ``ops.select_cpu()`` judiciously can be simpler and more efficient.

A task can be immediately dispatched to a DSQ from ``ops.select_cpu()`` by
calling ``scx_bpf_dispatch()``. If the task is dispatched to
``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``, it will be dispatched to the
local DSQ of whichever CPU is returned from ``ops.select_cpu()``.
Additionally, dispatching directly from ``ops.select_cpu()`` will cause the
``ops.enqueue()`` callback to be skipped.

Note that the scheduler core will ignore an invalid CPU selection, for
example, if it's outside the allowed cpumask of the task.

2. Once the target CPU is selected, ``ops.enqueue()`` is invoked (unless the
task was dispatched directly from ``ops.select_cpu()``). ``ops.enqueue()``
can make one of the following decisions:

* Immediately dispatch the task to either the global or local DSQ by
calling ``scx_bpf_dispatch()`` with ``SCX_DSQ_GLOBAL`` or
``SCX_DSQ_LOCAL``, respectively.

* Immediately dispatch the task to a custom DSQ by calling
``scx_bpf_dispatch()`` with a DSQ ID which is smaller than 2^63.

* Queue the task on the BPF side (a sketch of this approach follows this list).

3. When a CPU is ready to schedule, it first looks at its local DSQ. If
empty, it then looks at the global DSQ. If there still isn't a task to
run, ``ops.dispatch()`` is invoked, which can use the following two
functions to populate the local DSQ.

* ``scx_bpf_dispatch()`` dispatches a task to a DSQ. Any target DSQ can
be used - ``SCX_DSQ_LOCAL``, ``SCX_DSQ_LOCAL_ON | cpu``,
``SCX_DSQ_GLOBAL`` or a custom DSQ. While ``scx_bpf_dispatch()``
currently can't be called with BPF locks held, this is being worked on
and will be supported. ``scx_bpf_dispatch()`` schedules the dispatch
rather than performing it immediately. There can be up to
``ops.dispatch_max_batch`` pending tasks.

* ``scx_bpf_consume()`` transfers a task from the specified non-local DSQ
to the dispatching DSQ. This function cannot be called with any BPF
locks held. ``scx_bpf_consume()`` flushes the pending dispatched tasks
before trying to consume the specified DSQ.

4. After ``ops.dispatch()`` returns, if there are tasks in the local DSQ,
the CPU runs the first one. If empty, the following steps are taken:

* Try to consume the global DSQ. If successful, run the task.

* If ``ops.dispatch()`` has dispatched any tasks, retry #3.

* If the previous task is an SCX task and still runnable, keep executing
it (see ``SCX_OPS_ENQ_LAST``).

* Go idle.
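
The approach referenced in #2 above - queueing tasks on the BPF side and
dispatching them later from ``ops.dispatch()`` - could look roughly as
follows. This is a hedged sketch loosely modeled on ``scx_qmap``, not a
verbatim excerpt; the map name and size are arbitrary:

.. code-block:: c

    struct {
            __uint(type, BPF_MAP_TYPE_QUEUE);
            __uint(max_entries, 4096);
            __type(value, u32);
    } queued SEC(".maps");

    void BPF_STRUCT_OPS(sketch_enqueue, struct task_struct *p, u64 enq_flags)
    {
            u32 pid = p->pid;

            /* If the queue is full, fall back to the global DSQ */
            if (bpf_map_push_elem(&queued, &pid, 0))
                    scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL,
                                     enq_flags);
    }

    void BPF_STRUCT_OPS(sketch_dispatch, s32 cpu, struct task_struct *prev)
    {
            struct task_struct *p;
            u32 pid;

            if (bpf_map_pop_elem(&queued, &pid))
                    return;

            /* The task may have exited in the meantime */
            p = bpf_task_from_pid(pid);
            if (!p)
                    return;

            scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
            bpf_task_release(p);
    }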

Note that the BPF scheduler can always choose to dispatch tasks immediately
in ``ops.enqueue()`` as illustrated in the above simple example. If only the
built-in DSQs are used, there is no need to implement ``ops.dispatch()`` as
a task is never queued on the BPF scheduler and both the local and global
DSQs are consumed automatically.

``scx_bpf_dispatch()`` queues the task on the FIFO of the target DSQ. Use
``scx_bpf_dispatch_vtime()`` for the priority queue. Internal DSQs such as
``SCX_DSQ_LOCAL`` and ``SCX_DSQ_GLOBAL`` do not support priority-queue
dispatching, and must be dispatched to with ``scx_bpf_dispatch()``. See the
function documentation and usage in ``tools/sched_ext/scx_simple.bpf.c`` for
more information.
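
For illustration, a hedged sketch of vtime (priority-queue) dispatch,
loosely modeled on the weighted-vtime mode of ``scx_simple``; ``SHARED_DSQ``
is an illustrative custom DSQ created elsewhere with
``scx_bpf_create_dsq()``, and the maintenance of ``vtime_now`` from
``ops.running()`` is omitted:

.. code-block:: c

    #define SHARED_DSQ 0

    static u64 vtime_now;

    void BPF_STRUCT_OPS(vtime_enqueue, struct task_struct *p, u64 enq_flags)
    {
            u64 vtime = p->scx.dsq_vtime;

            /*
             * Cap how far a long-idle task's vtime may lag behind so it
             * cannot starve everyone else after waking up.
             */
            if ((s64)(vtime - (vtime_now - SCX_SLICE_DFL)) < 0)
                    vtime = vtime_now - SCX_SLICE_DFL;

            scx_bpf_dispatch_vtime(p, SHARED_DSQ, SCX_SLICE_DFL, vtime,
                                   enq_flags);
    }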

Where to Look
=============

* ``include/linux/sched/ext.h`` defines the core data structures, ops table
and constants.

* ``kernel/sched/ext.c`` contains sched_ext core implementation and helpers.
The functions prefixed with ``scx_bpf_`` can be called from the BPF
scheduler.

* ``tools/sched_ext/`` hosts example BPF scheduler implementations.

* ``scx_simple[.bpf].c``: Minimal global FIFO scheduler example using a
custom DSQ.

* ``scx_qmap[.bpf].c``: A multi-level FIFO scheduler supporting five
levels of priority implemented with ``BPF_MAP_TYPE_QUEUE``.

ABI Instability
===============

The APIs provided by sched_ext to BPF scheduler programs have no stability
guarantees. This includes the ops table callbacks and constants defined in
``include/linux/sched/ext.h``, as well as the ``scx_bpf_`` kfuncs defined in
``kernel/sched/ext.c``.

While we will attempt to provide a relatively stable API surface when
possible, these interfaces are subject to change without warning between
kernel versions.
13 changes: 13 additions & 0 deletions MAINTAINERS
@@ -20511,6 +20511,19 @@ F: include/linux/wait.h
F: include/uapi/linux/sched.h
F: kernel/sched/

SCHEDULER - SCHED_EXT
R: Tejun Heo <tj@kernel.org>
R: David Vernet <void@manifault.com>
L: linux-kernel@vger.kernel.org
S: Maintained
W: https://github.com/sched-ext/scx
T: git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git
F: include/linux/sched/ext.h
F: kernel/sched/ext.h
F: kernel/sched/ext.c
F: tools/sched_ext/
F: tools/testing/selftests/sched_ext

SCIOSENSE ENS160 MULTI-GAS SENSOR DRIVER
M: Gustavo Silva <gustavograzs@gmail.com>
S: Maintained
1 change: 1 addition & 0 deletions drivers/tty/sysrq.c
@@ -531,6 +531,7 @@ static const struct sysrq_key_op *sysrq_key_table[62] = {
NULL, /* P */
NULL, /* Q */
&sysrq_replay_logs_op, /* R */
/* S: May be registered by sched_ext for resetting */
NULL, /* S */
NULL, /* T */
NULL, /* U */
1 change: 1 addition & 0 deletions include/asm-generic/vmlinux.lds.h
@@ -133,6 +133,7 @@
*(__dl_sched_class) \
*(__rt_sched_class) \
*(__fair_sched_class) \
*(__ext_sched_class) \
*(__idle_sched_class) \
__sched_class_lowest = .;

4 changes: 2 additions & 2 deletions include/linux/cgroup.h
@@ -29,8 +29,6 @@

struct kernel_clone_args;

#ifdef CONFIG_CGROUPS

/*
* All weight knobs on the default hierarchy should use the following min,
* default and max values. The default value is the logarithmic center of
@@ -40,6 +38,8 @@ struct kernel_clone_args;
#define CGROUP_WEIGHT_DFL 100
#define CGROUP_WEIGHT_MAX 10000

#ifdef CONFIG_CGROUPS

enum {
CSS_TASK_ITER_PROCS = (1U << 0), /* walk only threadgroup leaders */
CSS_TASK_ITER_THREADED = (1U << 1), /* walk all threaded css_sets in the domain */
5 changes: 5 additions & 0 deletions include/linux/sched.h
@@ -82,6 +82,8 @@ struct task_group;
struct task_struct;
struct user_event_mm;

#include <linux/sched/ext.h>

/*
* Task state bitmask. NOTE! These bits are also
* encoded in fs/proc/array.c: get_task_state().
@@ -830,6 +832,9 @@ struct task_struct {
struct sched_rt_entity rt;
struct sched_dl_entity dl;
struct sched_dl_entity *dl_server;
#ifdef CONFIG_SCHED_CLASS_EXT
struct sched_ext_entity scx;
#endif
const struct sched_class *sched_class;

#ifdef CONFIG_SCHED_CORE
