
Commit 8d59d22

yosrym93 authored and akpm00 committed
mm: memcg: make stats flushing threshold per-memcg
A global counter for the magnitude of memcg stats updates is maintained on the memcg side to avoid invoking rstat flushes when the pending updates are not significant. This avoids unnecessary flushes, which are not very cheap even when there aren't many stats to flush. It also avoids unnecessary lock contention on the underlying global rstat lock.

Make this threshold per-memcg. The same scheme is followed: percpu (now also per-memcg) counters are incremented in the update path and only propagated to per-memcg atomics when they exceed a certain threshold (a condensed, annotated sketch of this update path follows the test results below).

This provides two benefits:

(a) On large machines with many memcgs, the global threshold can be reached relatively fast, so guarding the underlying lock becomes less effective. Making the threshold per-memcg avoids this.

(b) Having a global threshold makes it hard to do subtree flushes, as we cannot reset the global counter except for a full flush. Per-memcg counters remove this as a blocker for subtree flushes, which helps avoid unnecessary work when the stats of a small subtree are needed (a hypothetical sketch of such a subtree flush follows the diff at the end of this page).

Nothing is free, of course. This comes at a cost:

(a) A new per-cpu counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4 bytes. The extra memory usage is insignificant.

(b) More work on the update side, although in the common case it will only be percpu counter updates. The amount of work scales with the number of ancestors (i.e. tree depth). This is not a new concept: adding a cgroup to the rstat tree involves a parent loop, as does charging. The testing results below show no significant regressions.

(c) The error margin in the stats for the system as a whole increases from NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH * NR_MEMCGS. This is probably fine because we have a similar per-memcg error in charges coming from percpu stocks, and we have a periodic flusher that makes sure we always flush all the stats every 2s anyway.

This patch was tested to make sure no significant regressions are introduced on the update path as follows. The following benchmarks were run in a cgroup that is 2 levels deep (/sys/fs/cgroup/a/b/):

(1) Running 22 instances of netperf on a 44-cpu machine with hyperthreading disabled. All instances are run in a level-2 cgroup, as is netserver:

  # netserver -6
  # netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Averaging 20 runs, the numbers are as follows:

  Base:    40198.0 mbps
  Patched: 38629.7 mbps (-3.9%)

The regression is minimal, especially for 22 instances in the same cgroup sharing all ancestors (so updating the same atomics).

(2) will-it-scale page_fault tests. These tests (specifically per_process_ops in the page_fault3 test) previously detected a 25.9% regression for a change in the stats update path [1].
These are the numbers from 10 runs (+ is good) on a machine with 256 cpus:

  LABEL                        |     MEAN    |    MEDIAN   |    STDDEV   |
  -----------------------------+-------------+-------------+-------------+
  page_fault1_per_process_ops  |             |             |             |
  (A) base                     |  270249.164 |  265437.000 |   13451.836 |
  (B) patched                  |  261368.709 |  255725.000 |   13394.767 |
                               |      -3.29% |      -3.66% |             |
  page_fault1_per_thread_ops   |             |             |             |
  (A) base                     |  242111.345 |  239737.000 |   10026.031 |
  (B) patched                  |  237057.109 |  235305.000 |    9769.687 |
                               |      -2.09% |      -1.85% |             |
  page_fault1_scalability      |             |             |             |
  (A) base                     |    0.034387 |    0.035168 |   0.0018283 |
  (B) patched                  |    0.033988 |    0.034573 |   0.0018056 |
                               |      -1.16% |      -1.69% |             |
  page_fault2_per_process_ops  |             |             |             |
  (A) base                     |  203561.836 |  203301.000 |    2550.764 |
  (B) patched                  |  197195.945 |  197746.000 |    2264.263 |
                               |      -3.13% |      -2.73% |             |
  page_fault2_per_thread_ops   |             |             |             |
  (A) base                     |  171046.473 |  170776.000 |    1509.679 |
  (B) patched                  |  166626.327 |  166406.000 |     768.753 |
                               |      -2.58% |      -2.56% |             |
  page_fault2_scalability      |             |             |             |
  (A) base                     |    0.054026 |    0.053821 |  0.00062121 |
  (B) patched                  |    0.053329 |     0.05306 |  0.00048394 |
                               |      -1.29% |      -1.41% |             |
  page_fault3_per_process_ops  |             |             |             |
  (A) base                     | 1295807.782 | 1297550.000 |    5907.585 |
  (B) patched                  | 1275579.873 | 1273359.000 |    8759.160 |
                               |      -1.56% |      -1.86% |             |
  page_fault3_per_thread_ops   |             |             |             |
  (A) base                     |  391234.164 |  390860.000 |    1760.720 |
  (B) patched                  |  377231.273 |  376369.000 |    1874.971 |
                               |      -3.58% |      -3.71% |             |
  page_fault3_scalability      |             |             |             |
  (A) base                     |     0.60369 |     0.60072 |   0.0083029 |
  (B) patched                  |     0.61733 |     0.61544 |    0.009855 |
                               |      +2.26% |      +2.45% |             |

All regressions seem to be minimal, and within the normal variance for the benchmark. The fix for [1] assumed that 3% is noise (and there were no further practical complaints), so hopefully this means that such variations in these microbenchmarks do not reflect on practical workloads.

(3) I also ran stress-ng in a nested cgroup and did not observe any obvious regressions.

[1] https://lore.kernel.org/all/20190520063534.GB19312@shao2-debian/

Link: https://lkml.kernel.org/r/20231129032154.3710765-4-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ivan Babrou <ivan@cloudflare.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutny <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Wei Xu <weixugc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
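As a reading aid, the update path described above is condensed below from the patch itself; the comments are annotations added here, not part of the commit, and the authoritative version is in the diff further down:

  /* Condensed from memcg_rstat_updated() in the diff below; comments added here. */
  static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
  {
          int cpu = smp_processor_id();
          unsigned int x;

          if (!val)
                  return;

          cgroup_rstat_updated(memcg->css.cgroup, cpu);

          /* Walk the ancestors, so the update-side cost scales with tree depth. */
          for (; memcg; memcg = parent_mem_cgroup(memcg)) {
                  /* Common case: a cheap per-cpu add, no atomics touched. */
                  x = __this_cpu_add_return(memcg->vmstats_percpu->stats_updates,
                                            abs(val));
                  if (x < MEMCG_CHARGE_BATCH)
                          continue;

                  /*
                   * Rare case: fold the per-cpu batch into the per-memcg
                   * atomic64, unless this memcg is already flush-able.
                   */
                  if (!memcg_should_flush_stats(memcg))
                          atomic64_add(x, &memcg->vmstats->stats_updates);
                  __this_cpu_write(memcg->vmstats_percpu->stats_updates, 0);
          }
  }

For scale: treating MEMCG_CHARGE_BATCH as 64 on a 256-cpu machine (illustrative numbers, not measurements), memcg_should_flush_stats() starts returning true once roughly 256 * 64 = 16384 updates are pending for a memcg; cost (c) above multiplies that bound by the number of memcgs system-wide, and the 2s periodic flusher bounds it in time.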
1 parent e0bf1dc commit 8d59d22

File tree

1 file changed: +34 -16 lines changed


mm/memcontrol.c

Lines changed: 34 additions & 16 deletions
@@ -631,6 +631,9 @@ struct memcg_vmstats_percpu {
 	/* Cgroup1: threshold notifications & softlimit tree updates */
 	unsigned long		nr_page_events;
 	unsigned long		targets[MEM_CGROUP_NTARGETS];
+
+	/* Stats updates since the last flush */
+	unsigned int		stats_updates;
 };
 
 struct memcg_vmstats {
@@ -645,6 +648,9 @@ struct memcg_vmstats {
 	/* Pending child counts during tree propagation */
 	long			state_pending[MEMCG_NR_STAT];
 	unsigned long		events_pending[NR_MEMCG_EVENTS];
+
+	/* Stats updates since the last flush */
+	atomic64_t		stats_updates;
 };
 
 /*
@@ -664,9 +670,7 @@ struct memcg_vmstats {
  */
 static void flush_memcg_stats_dwork(struct work_struct *w);
 static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
-static DEFINE_PER_CPU(unsigned int, stats_updates);
 static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
-static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
 static u64 flush_last_time;
 
 #define FLUSH_TIME (2UL*HZ)
@@ -693,26 +697,37 @@ static void memcg_stats_unlock(void)
 	preempt_enable_nested();
 }
 
+
+static bool memcg_should_flush_stats(struct mem_cgroup *memcg)
+{
+	return atomic64_read(&memcg->vmstats->stats_updates) >
+		MEMCG_CHARGE_BATCH * num_online_cpus();
+}
+
 static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 {
+	int cpu = smp_processor_id();
 	unsigned int x;
 
 	if (!val)
 		return;
 
-	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
+	cgroup_rstat_updated(memcg->css.cgroup, cpu);
+
+	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+		x = __this_cpu_add_return(memcg->vmstats_percpu->stats_updates,
+					  abs(val));
+
+		if (x < MEMCG_CHARGE_BATCH)
+			continue;
 
-	x = __this_cpu_add_return(stats_updates, abs(val));
-	if (x > MEMCG_CHARGE_BATCH) {
 		/*
-		 * If stats_flush_threshold exceeds the threshold
-		 * (>num_online_cpus()), cgroup stats update will be triggered
-		 * in __mem_cgroup_flush_stats(). Increasing this var further
-		 * is redundant and simply adds overhead in atomic update.
+		 * If @memcg is already flush-able, increasing stats_updates is
+		 * redundant. Avoid the overhead of the atomic update.
 		 */
-		if (atomic_read(&stats_flush_threshold) <= num_online_cpus())
-			atomic_add(x / MEMCG_CHARGE_BATCH, &stats_flush_threshold);
-		__this_cpu_write(stats_updates, 0);
+		if (!memcg_should_flush_stats(memcg))
+			atomic64_add(x, &memcg->vmstats->stats_updates);
+		__this_cpu_write(memcg->vmstats_percpu->stats_updates, 0);
 	}
 }
 
@@ -731,13 +746,12 @@ static void do_flush_stats(void)
 
 	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
 
-	atomic_set(&stats_flush_threshold, 0);
 	atomic_set(&stats_flush_ongoing, 0);
 }
 
 void mem_cgroup_flush_stats(void)
 {
-	if (atomic_read(&stats_flush_threshold) > num_online_cpus())
+	if (memcg_should_flush_stats(root_mem_cgroup))
 		do_flush_stats();
 }
 
@@ -751,8 +765,8 @@ void mem_cgroup_flush_stats_ratelimited(void)
 static void flush_memcg_stats_dwork(struct work_struct *w)
 {
 	/*
-	 * Always flush here so that flushing in latency-sensitive paths is
-	 * as cheap as possible.
+	 * Deliberately ignore memcg_should_flush_stats() here so that flushing
+	 * in latency-sensitive paths is as cheap as possible.
 	 */
 	do_flush_stats();
 	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
@@ -5788,6 +5802,10 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 			}
 		}
 	}
+	statc->stats_updates = 0;
+	/* We are in a per-cpu loop here, only do the atomic write once */
+	if (atomic64_read(&memcg->vmstats->stats_updates))
+		atomic64_set(&memcg->vmstats->stats_updates, 0);
 }
 
 #ifdef CONFIG_MMU
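Benefit (b) from the commit message, subtree flushing, is not implemented by this commit; the per-memcg counter merely removes the blocker. Purely for illustration, a hypothetical memcg-scoped flush could consult the new counter as sketched below. The mem_cgroup_flush_stats_subtree() helper and its direct cgroup_rstat_flush() call are assumptions of this sketch, not code from this patch:

  /* Hypothetical sketch, not part of this commit. */
  void mem_cgroup_flush_stats_subtree(struct mem_cgroup *memcg)
  {
          if (!memcg)
                  memcg = root_mem_cgroup;

          /* The per-memcg counter lets the threshold be asked per subtree. */
          if (memcg_should_flush_stats(memcg))
                  cgroup_rstat_flush(memcg->css.cgroup);
  }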
