Merge tag 'sched-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Scheduler SMP load-balancer improvements:

   - Avoid unnecessary migrations within SMT domains on hybrid systems.

     Problem:

        On hybrid CPU systems (processors with a mixture of
        higher-frequency SMT cores and lower-frequency non-SMT cores),
        under the old code lower-priority CPUs pulled tasks from the
        higher-priority cores if more than one SMT sibling was busy -
        resulting in many unnecessary task migrations.

     Solution:

        The new code improves the load balancer to recognize SMT cores
        with more than one busy sibling and allows lower-priority CPUs
        to pull tasks, which avoids superfluous migrations and lets
        lower-priority cores inspect all SMT siblings for the busiest
        queue.

   - Implement the 'runnable boosting' feature in the EAS balancer:
     consider CPU contention in frequency, EAS max util & load-balance
     busiest CPU selection.

     This improves CPU utilization for certain workloads, while leaving
     other key workloads unchanged.

  Scheduler infrastructure improvements:

   - Rewrite the scheduler topology setup code by consolidating it into
     the build_sched_topology() helper function and building it
     dynamically on the fly.

   - Resolve the local_clock() vs. noinstr complications by rewriting
     the code: provide separate sched_clock_noinstr() and
     local_clock_noinstr() functions to be used in instrumentation code,
     and make sure it is all instrumentation-safe.

  Fixes:

   - Fix a kthread_park() race with wait_woken()

   - Fix misc wait_task_inactive() bugs unearthed by the -rt merge:
       - Fix UP PREEMPT bug by unifying the SMP and UP implementations
       - Fix task_struct::saved_state handling

   - Fix various rq clock update bugs, unearthed by turning on the rq
     clock debugging code.

   - Fix the PSI WINDOW_MIN_US trigger limit, which was easy to trigger
      by creating enough cgroups, by removing the warning and restricting
     window size triggers to PSI file write-permission or
     CAP_SYS_RESOURCE.

   - Propagate SMT flags in the topology when removing degenerate domain

   - Fix grub_reclaim() calculation bug in the deadline scheduler code

   - Avoid resetting the min update period when it is unnecessary, in
     psi_trigger_destroy().

   - Don't balance a task to its current running CPU in load_balance(),
     which was possible on certain NUMA topologies with overlapping
     groups.

   - Fix the sched-debug printing of rq->nr_uninterruptible

  Cleanups:

   - Address various -Wmissing-prototype warnings, as a preparation to
     (maybe) enable this warning in the future.

   - Remove unused code

   - Mark more functions __init

   - Fix shadow-variable warnings"

* tag 'sched-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
  sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle()
  sched/core: Avoid double calling update_rq_clock() in __balance_push_cpu_stop()
  sched/core: Fixed missing rq clock update before calling set_rq_offline()
  sched/deadline: Update GRUB description in the documentation
  sched/deadline: Fix bandwidth reclaim equation in GRUB
  sched/wait: Fix a kthread_park race with wait_woken()
  sched/topology: Mark set_sched_topology() __init
  sched/fair: Rename variable cpu_util eff_util
  arm64/arch_timer: Fix MMIO byteswap
  sched/fair, cpufreq: Introduce 'runnable boosting'
  sched/fair: Refactor CPU utilization functions
  cpuidle: Use local_clock_noinstr()
  sched/clock: Provide local_clock_noinstr()
  x86/tsc: Provide sched_clock_noinstr()
  clocksource: hyper-v: Provide noinstr sched_clock()
  clocksource: hyper-v: Adjust hv_read_tsc_page_tsc() to avoid special casing U64_MAX
  x86/vdso: Fix gettimeofday masking
  math64: Always inline u128 version of mul_u64_u64_shr()
  s390/time: Provide sched_clock_noinstr()
  loongarch: Provide noinstr sched_clock_read()
  ...
torvalds committed Jun 27, 2023
2 parents e8f75c0 + ebb83d8 commit ed3b792
Showing 43 changed files with 773 additions and 564 deletions.
5 changes: 4 additions & 1 deletion Documentation/scheduler/sched-deadline.rst
@@ -203,12 +203,15 @@ Deadline Task Scheduling
- Total bandwidth (this_bw): this is the sum of all tasks "belonging" to the
runqueue, including the tasks in Inactive state.

- Maximum usable bandwidth (max_bw): This is the maximum bandwidth usable by
deadline tasks and is currently set to the RT capacity.


The algorithm reclaims the bandwidth of the tasks in Inactive state.
It does so by decrementing the runtime of the executing task Ti at a pace equal
to

dq = -max{ Ui / Umax, (1 - Uinact - Uextra) } dt
dq = -(max{ Ui, (Umax - Uinact - Uextra) } / Umax) dt

where:

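A standalone sketch (not part of this patch) of how the corrected equation changes the reclaim rate, using made-up utilization values for Ui, Umax, Uinact and Uextra:

/* Standalone sketch of the GRUB reclaim rate, old vs. corrected form.
 * The utilization values below are made-up examples, not kernel data.
 */
#include <stdio.h>

int main(void)
{
	double Ui = 0.25;      /* running task's bandwidth          */
	double Umax = 0.95;    /* maximum usable (RT) bandwidth     */
	double Uinact = 0.10;  /* bandwidth of Inactive tasks       */
	double Uextra = 0.05;  /* unassigned, reclaimable bandwidth */

	/* Old form: dq = -max{ Ui / Umax, (1 - Uinact - Uextra) } dt */
	double old_rate = Ui / Umax;
	if (1.0 - Uinact - Uextra > old_rate)
		old_rate = 1.0 - Uinact - Uextra;

	/* Corrected form: dq = -(max{ Ui, (Umax - Uinact - Uextra) } / Umax) dt */
	double new_rate = Ui;
	if (Umax - Uinact - Uextra > new_rate)
		new_rate = Umax - Uinact - Uextra;
	new_rate /= Umax;

	printf("old: %.3f  corrected: %.3f (fraction of wall time charged)\n",
	       old_rate, new_rate);
	return 0;
}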
8 changes: 1 addition & 7 deletions arch/arm64/include/asm/arch_timer.h
@@ -88,13 +88,7 @@ static inline notrace u64 arch_timer_read_cntvct_el0(void)

#define arch_timer_reg_read_stable(reg) \
({ \
u64 _val; \
\
preempt_disable_notrace(); \
_val = erratum_handler(read_ ## reg)(); \
preempt_enable_notrace(); \
\
_val; \
erratum_handler(read_ ## reg)(); \
})

/*
12 changes: 6 additions & 6 deletions arch/arm64/include/asm/io.h
@@ -22,13 +22,13 @@
* Generic IO read/write. These perform native-endian accesses.
*/
#define __raw_writeb __raw_writeb
static inline void __raw_writeb(u8 val, volatile void __iomem *addr)
static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr)
{
asm volatile("strb %w0, [%1]" : : "rZ" (val), "r" (addr));
}

#define __raw_writew __raw_writew
static inline void __raw_writew(u16 val, volatile void __iomem *addr)
static __always_inline void __raw_writew(u16 val, volatile void __iomem *addr)
{
asm volatile("strh %w0, [%1]" : : "rZ" (val), "r" (addr));
}
@@ -40,13 +40,13 @@ static __always_inline void __raw_writel(u32 val, volatile void __iomem *addr)
}

#define __raw_writeq __raw_writeq
static inline void __raw_writeq(u64 val, volatile void __iomem *addr)
static __always_inline void __raw_writeq(u64 val, volatile void __iomem *addr)
{
asm volatile("str %x0, [%1]" : : "rZ" (val), "r" (addr));
}

#define __raw_readb __raw_readb
static inline u8 __raw_readb(const volatile void __iomem *addr)
static __always_inline u8 __raw_readb(const volatile void __iomem *addr)
{
u8 val;
asm volatile(ALTERNATIVE("ldrb %w0, [%1]",
@@ -57,7 +57,7 @@ static inline u8 __raw_readb(const volatile void __iomem *addr)
}

#define __raw_readw __raw_readw
static inline u16 __raw_readw(const volatile void __iomem *addr)
static __always_inline u16 __raw_readw(const volatile void __iomem *addr)
{
u16 val;

@@ -80,7 +80,7 @@ static __always_inline u32 __raw_readl(const volatile void __iomem *addr)
}

#define __raw_readq __raw_readq
static inline u64 __raw_readq(const volatile void __iomem *addr)
static __always_inline u64 __raw_readq(const volatile void __iomem *addr)
{
u64 val;
asm volatile(ALTERNATIVE("ldr %0, [%1]",
2 changes: 1 addition & 1 deletion arch/loongarch/include/asm/loongarch.h
@@ -1167,7 +1167,7 @@ static __always_inline void iocsr_write64(u64 val, u32 reg)

#ifndef __ASSEMBLY__

static inline u64 drdtime(void)
static __always_inline u64 drdtime(void)
{
int rID = 0;
u64 val = 0;
6 changes: 3 additions & 3 deletions arch/loongarch/kernel/time.c
@@ -190,9 +190,9 @@ static u64 read_const_counter(struct clocksource *clk)
return drdtime();
}

static u64 native_sched_clock(void)
static noinstr u64 sched_clock_read(void)
{
return read_const_counter(NULL);
return drdtime();
}

static struct clocksource clocksource_const = {
@@ -211,7 +211,7 @@ int __init constant_clocksource_init(void)

res = clocksource_register_hz(&clocksource_const, freq);

sched_clock_register(native_sched_clock, 64, freq);
sched_clock_register(sched_clock_read, 64, freq);

pr_info("Constant clock source device register\n");

13 changes: 9 additions & 4 deletions arch/s390/include/asm/timex.h
@@ -63,7 +63,7 @@ static inline int store_tod_clock_ext_cc(union tod_clock *clk)
return cc;
}

static inline void store_tod_clock_ext(union tod_clock *tod)
static __always_inline void store_tod_clock_ext(union tod_clock *tod)
{
asm volatile("stcke %0" : "=Q" (*tod) : : "cc");
}
@@ -177,7 +177,7 @@ static inline void local_tick_enable(unsigned long comp)

typedef unsigned long cycles_t;

static inline unsigned long get_tod_clock(void)
static __always_inline unsigned long get_tod_clock(void)
{
union tod_clock clk;

@@ -204,6 +204,11 @@ void init_cpu_timer(void);

extern union tod_clock tod_clock_base;

static __always_inline unsigned long __get_tod_clock_monotonic(void)
{
return get_tod_clock() - tod_clock_base.tod;
}

/**
* get_clock_monotonic - returns current time in clock rate units
*
@@ -216,7 +221,7 @@ static inline unsigned long get_tod_clock_monotonic(void)
unsigned long tod;

preempt_disable_notrace();
tod = get_tod_clock() - tod_clock_base.tod;
tod = __get_tod_clock_monotonic();
preempt_enable_notrace();
return tod;
}
@@ -240,7 +245,7 @@ static inline unsigned long get_tod_clock_monotonic(void)
* -> ns = (th * 125) + ((tl * 125) >> 9);
*
*/
static inline unsigned long tod_to_ns(unsigned long todval)
static __always_inline unsigned long tod_to_ns(unsigned long todval)
{
return ((todval >> 9) * 125) + (((todval & 0x1ff) * 125) >> 9);
}
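As a quick standalone check of the conversion described in the comment above (illustrative only; the TOD value below is made up), tod_to_ns() scales a TOD value by 125/512, so 4096 TOD units (one microsecond) map to 1000 ns:

/* Illustrative check of the TOD -> nanoseconds conversion shown above.
 * 1 TOD unit = 1/4096 us, so ns = todval * 1000 / 4096 = todval * 125 / 512.
 */
#include <stdio.h>

static unsigned long tod_to_ns(unsigned long todval)
{
	return ((todval >> 9) * 125) + (((todval & 0x1ff) * 125) >> 9);
}

int main(void)
{
	unsigned long tod = 4096;	/* 4096 TOD units == 1 microsecond */

	printf("%lu TOD units = %lu ns\n", tod, tod_to_ns(tod)); /* prints 1000 */
	return 0;
}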
5 changes: 5 additions & 0 deletions arch/s390/kernel/time.c
@@ -102,6 +102,11 @@ void __init time_early_init(void)
((long) qui.old_leap * 4096000000L);
}

unsigned long long noinstr sched_clock_noinstr(void)
{
return tod_to_ns(__get_tod_clock_monotonic());
}

/*
* Scheduler clock - returns current time in nanosec units.
*/
5 changes: 5 additions & 0 deletions arch/x86/include/asm/mshyperv.h
@@ -257,6 +257,11 @@ void hv_set_register(unsigned int reg, u64 value);
u64 hv_get_non_nested_register(unsigned int reg);
void hv_set_non_nested_register(unsigned int reg, u64 value);

static __always_inline u64 hv_raw_get_register(unsigned int reg)
{
return __rdmsr(reg);
}

#else /* CONFIG_HYPERV */
static inline void hyperv_init(void) {}
static inline void hyperv_setup_mmu_ops(void) {}
41 changes: 30 additions & 11 deletions arch/x86/include/asm/vdso/gettimeofday.h
@@ -231,22 +231,27 @@ static u64 vread_pvclock(void)
ret = __pvclock_read_cycles(pvti, rdtsc_ordered());
} while (pvclock_read_retry(pvti, version));

return ret;
return ret & S64_MAX;
}
#endif

#ifdef CONFIG_HYPERV_TIMER
static u64 vread_hvclock(void)
{
return hv_read_tsc_page(&hvclock_page);
u64 tsc, time;

if (hv_read_tsc_page_tsc(&hvclock_page, &tsc, &time))
return time & S64_MAX;

return U64_MAX;
}
#endif

static inline u64 __arch_get_hw_counter(s32 clock_mode,
const struct vdso_data *vd)
{
if (likely(clock_mode == VDSO_CLOCKMODE_TSC))
return (u64)rdtsc_ordered();
return (u64)rdtsc_ordered() & S64_MAX;
/*
* For any memory-mapped vclock type, we need to make sure that gcc
* doesn't cleverly hoist a load before the mode check. Otherwise we
@@ -284,6 +289,9 @@ static inline bool arch_vdso_clocksource_ok(const struct vdso_data *vd)
* which can be invalidated asynchronously and indicate invalidation by
* returning U64_MAX, which can be effectively tested by checking for a
* negative value after casting it to s64.
*
* This effectively forces a S64_MAX mask on the calculations, unlike the
* U64_MAX mask normally used by x86 clocksources.
*/
static inline bool arch_vdso_cycles_ok(u64 cycles)
{
@@ -303,18 +311,29 @@ static inline bool arch_vdso_cycles_ok(u64 cycles)
* @last. If not then use @last, which is the base time of the current
* conversion period.
*
* This variant also removes the masking of the subtraction because the
* clocksource mask of all VDSO capable clocksources on x86 is U64_MAX
* which would result in a pointless operation. The compiler cannot
* optimize it away as the mask comes from the vdso data and is not compile
* time constant.
* This variant also uses a custom mask because while the clocksource mask of
* all the VDSO capable clocksources on x86 is U64_MAX, the above code uses
* U64_MASK as an exception value, additionally arch_vdso_cycles_ok() above
* declares everything with the MSB/Sign-bit set as invalid. Therefore the
* effective mask is S64_MAX.
*/
static __always_inline
u64 vdso_calc_delta(u64 cycles, u64 last, u64 mask, u32 mult)
{
if (cycles > last)
return (cycles - last) * mult;
return 0;
/*
* Due to the MSB/Sign-bit being used as invald marker (see
* arch_vdso_cycles_valid() above), the effective mask is S64_MAX.
*/
u64 delta = (cycles - last) & S64_MAX;

/*
* Due to the above mentioned TSC wobbles, filter out negative motion.
* Per the above masking, the effective sign bit is now bit 62.
*/
if (unlikely(delta & (1ULL << 62)))
return 0;

return delta * mult;
}
#define vdso_calc_delta vdso_calc_delta

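A userspace sketch (not kernel code; the counter values are made up) of why bit 62 acts as the effective sign bit once the S64_MAX mask is applied in vdso_calc_delta(): a small backwards TSC step produces a delta with bit 62 set and is filtered to zero:

/* Userspace illustration of the S64_MAX masking in vdso_calc_delta().
 * 'last' and 'cycles' are made-up counter values, not real TSC reads.
 */
#include <stdio.h>
#include <stdint.h>

static uint64_t calc_delta(uint64_t cycles, uint64_t last, uint32_t mult)
{
	/* Effective mask is S64_MAX because bit 63 marks invalid values. */
	uint64_t delta = (cycles - last) & INT64_MAX;

	/* Under that mask, a small negative motion sets bit 62: filter it. */
	if (delta & (1ULL << 62))
		return 0;

	return delta * mult;
}

int main(void)
{
	uint64_t last = 1000000;

	printf("%llu\n", (unsigned long long)calc_delta(1000010, last, 3)); /* 30 */
	printf("%llu\n", (unsigned long long)calc_delta( 999990, last, 3)); /* 0: TSC wobble filtered */
	return 0;
}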
23 changes: 5 additions & 18 deletions arch/x86/kernel/itmt.c
@@ -165,32 +165,19 @@ int arch_asym_cpu_priority(int cpu)

/**
* sched_set_itmt_core_prio() - Set CPU priority based on ITMT
* @prio: Priority of cpu core
* @core_cpu: The cpu number associated with the core
* @prio: Priority of @cpu
* @cpu: The CPU number
*
* The pstate driver will find out the max boost frequency
* and call this function to set a priority proportional
* to the max boost frequency. CPU with higher boost
* to the max boost frequency. CPUs with higher boost
* frequency will receive higher priority.
*
* No need to rebuild sched domain after updating
* the CPU priorities. The sched domains have no
* dependency on CPU priorities.
*/
void sched_set_itmt_core_prio(int prio, int core_cpu)
void sched_set_itmt_core_prio(int prio, int cpu)
{
int cpu, i = 1;

for_each_cpu(cpu, topology_sibling_cpumask(core_cpu)) {
int smt_prio;

/*
* Ensure that the siblings are moved to the end
* of the priority chain and only used when
* all other high priority cpus are out of capacity.
*/
smt_prio = prio * smp_num_siblings / (i * i);
per_cpu(sched_core_priority, cpu) = smt_prio;
i++;
}
per_cpu(sched_core_priority, cpu) = prio;
}
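For context on the loop removed above, a standalone sketch (with made-up prio and sibling-count values) of the old smt_prio scaling, which pushed SMT siblings toward the end of the priority chain; the new code simply stores the given priority per CPU:

/* Illustration of the SMT priority scaling removed above.
 * prio and smp_num_siblings are example values, not taken from hardware.
 */
#include <stdio.h>

int main(void)
{
	int prio = 100, smp_num_siblings = 2;

	for (int i = 1; i <= smp_num_siblings; i++)
		printf("sibling %d: smt_prio = %d\n",
		       i - 1, prio * smp_num_siblings / (i * i));
	/* sibling 0: 200, sibling 1: 50 - the second thread of a core was
	 * ranked well below the first and only used as a last resort. */
	return 0;
}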
4 changes: 2 additions & 2 deletions arch/x86/kernel/kvmclock.c
@@ -71,7 +71,7 @@ static int kvm_set_wallclock(const struct timespec64 *now)
return -ENODEV;
}

static noinstr u64 kvm_clock_read(void)
static u64 kvm_clock_read(void)
{
u64 ret;

@@ -88,7 +88,7 @@ static u64 kvm_clock_get_cycles(struct clocksource *cs)

static noinstr u64 kvm_sched_clock_read(void)
{
return kvm_clock_read() - kvm_sched_clock_offset;
return pvclock_clocksource_read_nowd(this_cpu_pvti()) - kvm_sched_clock_offset;
}

static inline void kvm_sched_clock_init(bool stable)