Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
 "More MM work: a memcg scalability improvememt"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm/lru: revise the comments of lru_lock
  mm/lru: introduce relock_page_lruvec()
  mm/lru: replace pgdat lru_lock with lruvec lock
  mm/swap.c: serialize memcg changes in pagevec_lru_move_fn
  mm/compaction: do page isolation first in compaction
  mm/lru: introduce TestClearPageLRU()
  mm/mlock: remove __munlock_isolate_lru_page()
  mm/mlock: remove lru_lock on TestClearPageMlocked
  mm/vmscan: remove lruvec reget in move_pages_to_lru
  mm/lru: move lock into lru_note_cost
  mm/swap.c: fold vm event PGROTATED into pagevec_move_tail_fn
  mm/memcg: add debug checking in lock_page_memcg
  mm: page_idle_get_page() does not need lru_lock
  mm/rmap: stop store reordering issue on page->mapping
  mm/vmscan: remove unnecessary lruvec adding
  mm/thp: narrow lru locking
  mm/thp: simplify lru_add_page_tail()
  mm/thp: use head for head page in lru_add_page_tail()
  mm/thp: move lru_add_page_tail() to huge_memory.c
torvalds committed Dec 15, 2020
2 parents 3db1a3f + 15b4473 commit 5b200f5
Showing 21 changed files with 536 additions and 372 deletions.
15 changes: 3 additions & 12 deletions Documentation/admin-guide/cgroup-v1/memcg_test.rst
@@ -133,18 +133,9 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.

8. LRU
======
Each memcg has its own private LRU. Now, its handling is under global
VM's control (means that it's handled under global pgdat->lru_lock).
Almost all routines around memcg's LRU is called by global LRU's
list management functions under pgdat->lru_lock.

A special function is mem_cgroup_isolate_pages(). This scans
memcg's private LRU and call __isolate_lru_page() to extract a page
from LRU.

(By __isolate_lru_page(), the page is removed from both of global and
private LRU.)

Each memcg has its own vector of LRUs (inactive anon, active anon,
inactive file, active file, unevictable) of pages from each node,
each LRU handled under a single lru_lock for that memcg and node.
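
As a rough orientation for the structure described above (relevant fields only, not the full definitions; the lruvec change itself appears in the include/linux/mmzone.h hunk further down, and mem_cgroup_per_node is the per-node container used by the new lruvec_holds_page_lru_lock() helper):

struct lruvec {
	struct list_head lists[NR_LRU_LISTS];	/* inactive/active anon, inactive/active file, unevictable */
	spinlock_t lru_lock;			/* per-memcg, per-node lock added by this series */
	/* ... */
};

struct mem_cgroup_per_node {
	struct lruvec lruvec;			/* one lruvec per memcg per node */
	/* ... */
};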

9. Typical Tests.
=================
21 changes: 9 additions & 12 deletions Documentation/admin-guide/cgroup-v1/memory.rst
@@ -287,20 +287,17 @@ When oom event notifier is registered, event will be delivered.
2.6 Locking
-----------

lock_page_cgroup()/unlock_page_cgroup() should not be called under
the i_pages lock.
Lock order is as follows:

Other lock order is following:
Page lock (PG_locked bit of page->flags)
  mm->page_table_lock or split pte_lock
    lock_page_memcg (memcg->move_lock)
      mapping->i_pages lock
        lruvec->lru_lock.

PG_locked.
  mm->page_table_lock
    pgdat->lru_lock
      lock_page_cgroup.

In many cases, just lock_page_cgroup() is called.

per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
pgdat->lru_lock, it has no lock of its own.
Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by
lruvec->lru_lock; PG_lru bit of page->flags is cleared before
isolating a page from its LRU under lruvec->lru_lock.
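
A minimal sketch of the isolation rule stated above, using the TestClearPageLRU() and lock_page_lruvec_irq() helpers added later in this commit (example_isolate_page() is a made-up name; real callers use del_page_from_lru_list() and also update LRU statistics):

int example_isolate_page(struct page *page)
{
	struct lruvec *lruvec;

	if (!TestClearPageLRU(page))		/* PG_lru is cleared first ... */
		return -EBUSY;			/* ... or we lost the race to another isolator */

	get_page(page);
	lruvec = lock_page_lruvec_irq(page);	/* ... then take that memcg+node lru_lock */
	list_del(&page->lru);			/* stand-in for del_page_from_lru_list() */
	unlock_page_lruvec_irq(lruvec);
	return 0;
}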

2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
-----------------------------------------------
2 changes: 1 addition & 1 deletion Documentation/trace/events-kmem.rst
@@ -69,7 +69,7 @@ When pages are freed in batch, the also mm_page_free_batched is triggered.
Broadly speaking, pages are taken off the LRU lock in bulk and
freed in batch with a page list. Significant amounts of activity here could
indicate that the system is under memory pressure and can also indicate
contention on the zone->lru_lock.
contention on the lruvec->lru_lock.

4. Per-CPU Allocator Activity
=============================
22 changes: 8 additions & 14 deletions Documentation/vm/unevictable-lru.rst
@@ -33,7 +33,7 @@ reclaim in Linux. The problems have been observed at customer sites on large
memory x86_64 systems.

To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of
main memory will have over 32 million 4k pages in a single zone. When a large
main memory will have over 32 million 4k pages in a single node. When a large
fraction of these pages are not evictable for any reason [see below], vmscan
will spend a lot of time scanning the LRU lists looking for the small fraction
of pages that are evictable. This can result in a situation where all CPUs are
@@ -55,7 +55,7 @@ unevictable, either by definition or by circumstance, in the future.
The Unevictable Page List
-------------------------

The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list
The Unevictable LRU infrastructure consists of an additional, per-node, LRU list
called the "unevictable" list and an associated page flag, PG_unevictable, to
indicate that the page is being managed on the unevictable list.

@@ -84,15 +84,9 @@ The unevictable list does not differentiate between file-backed and anonymous,
swap-backed pages. This differentiation is only important while the pages are,
in fact, evictable.

The unevictable list benefits from the "arrayification" of the per-zone LRU
The unevictable list benefits from the "arrayification" of the per-node LRU
lists and statistics originally proposed and posted by Christoph Lameter.

The unevictable list does not use the LRU pagevec mechanism. Rather,
unevictable pages are placed directly on the page's zone's unevictable list
under the zone lru_lock. This allows us to prevent the stranding of pages on
the unevictable list when one task has the page isolated from the LRU and other
tasks are changing the "evictability" state of the page.


Memory Control Group Interaction
--------------------------------
@@ -101,8 +95,8 @@ The unevictable LRU facility interacts with the memory control group [aka
memory controller; see Documentation/admin-guide/cgroup-v1/memory.rst] by extending the
lru_list enum.

The memory controller data structure automatically gets a per-zone unevictable
list as a result of the "arrayification" of the per-zone LRU lists (one per
The memory controller data structure automatically gets a per-node unevictable
list as a result of the "arrayification" of the per-node LRU lists (one per
lru_list enum element). The memory controller tracks the movement of pages to
and from the unevictable list.
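
For reference, the lru_list enum whose "arrayification" is meant here (unchanged by this commit; quoted from include/linux/mmzone.h of this era) — the memory controller gets one list head per element, per node:

#define LRU_BASE 0
#define LRU_ACTIVE 1
#define LRU_FILE 2

enum lru_list {
	LRU_INACTIVE_ANON = LRU_BASE,
	LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
	LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
	LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
	LRU_UNEVICTABLE,
	NR_LRU_LISTS
};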

@@ -196,7 +190,7 @@ for the sake of expediency, to leave a unevictable page on one of the regular
active/inactive LRU lists for vmscan to deal with. vmscan checks for such
pages in all of the shrink_{active|inactive|page}_list() functions and will
"cull" such pages that it encounters: that is, it diverts those pages to the
unevictable list for the zone being scanned.
unevictable list for the node being scanned.
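
A rough sketch of the cull described here (illustrative only, with a made-up function name; it assumes, as in shrink_page_list(), that the pages on the list are already isolated with a reference held):

static void example_cull(struct list_head *page_list)
{
	struct page *page, *next;

	list_for_each_entry_safe(page, next, page_list, lru) {
		if (unlikely(!page_evictable(page))) {
			/* divert: putback_lru_page() places the page on the
			 * unevictable list of its node/lruvec */
			list_del(&page->lru);
			putback_lru_page(page);
			continue;
		}
		/* ... try to reclaim @page ... */
	}
}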

There may be situations where a page is mapped into a VM_LOCKED VMA, but the
page is not marked as PG_mlocked. Such pages will make it all the way to
@@ -328,7 +322,7 @@ If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the
page from the LRU, as it is likely on the appropriate active or inactive list
at that time. If the isolate_lru_page() succeeds, mlock_vma_page() will put
back the page - by calling putback_lru_page() - which will notice that the page
is now mlocked and divert the page to the zone's unevictable list. If
is now mlocked and divert the page to the node's unevictable list. If
mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle
it later if and when it attempts to reclaim the page.
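
A condensed sketch of the flow just described (illustrative only, with a made-up function name; the real mlock_vma_page() also handles statistics and the munlock path):

static void example_mlock_vma_page(struct page *page)
{
	if (!TestSetPageMlocked(page)) {
		/* try to pull the page off whichever LRU list it is on */
		if (isolate_lru_page(page) == 0)
			/* putback notices PageMlocked() and diverts the
			 * page to the node's unevictable list */
			putback_lru_page(page);
	}
}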

@@ -603,7 +597,7 @@ Some examples of these unevictable pages on the LRU lists are:
unevictable list in mlock_vma_page().

shrink_inactive_list() also diverts any unevictable pages that it finds on the
inactive lists to the appropriate zone's unevictable list.
inactive lists to the appropriate node's unevictable list.

shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
after shrink_active_list() had moved them to the inactive list, or pages mapped
110 changes: 110 additions & 0 deletions include/linux/memcontrol.h
@@ -654,12 +654,41 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,

struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);

static inline bool lruvec_holds_page_lru_lock(struct page *page,
					      struct lruvec *lruvec)
{
	pg_data_t *pgdat = page_pgdat(page);
	const struct mem_cgroup *memcg;
	struct mem_cgroup_per_node *mz;

	if (mem_cgroup_disabled())
		return lruvec == &pgdat->__lruvec;

	mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
	memcg = page_memcg(page) ? : root_mem_cgroup;

	return lruvec->pgdat == pgdat && mz->memcg == memcg;
}

struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);

struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);

struct mem_cgroup *get_mem_cgroup_from_page(struct page *page);

struct lruvec *lock_page_lruvec(struct page *page);
struct lruvec *lock_page_lruvec_irq(struct page *page);
struct lruvec *lock_page_lruvec_irqsave(struct page *page,
					unsigned long *flags);

#ifdef CONFIG_DEBUG_VM
void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page);
#else
static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
{
}
#endif

static inline
struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
	return css ? container_of(css, struct mem_cgroup, css) : NULL;
@@ -1167,6 +1196,14 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
	return &pgdat->__lruvec;
}

static inline bool lruvec_holds_page_lru_lock(struct page *page,
					      struct lruvec *lruvec)
{
	pg_data_t *pgdat = page_pgdat(page);

	return lruvec == &pgdat->__lruvec;
}

static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
{
	return NULL;
@@ -1192,6 +1229,31 @@ static inline void mem_cgroup_put(struct mem_cgroup *memcg)
{
}

static inline struct lruvec *lock_page_lruvec(struct page *page)
{
	struct pglist_data *pgdat = page_pgdat(page);

	spin_lock(&pgdat->__lruvec.lru_lock);
	return &pgdat->__lruvec;
}

static inline struct lruvec *lock_page_lruvec_irq(struct page *page)
{
	struct pglist_data *pgdat = page_pgdat(page);

	spin_lock_irq(&pgdat->__lruvec.lru_lock);
	return &pgdat->__lruvec;
}

static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page,
		unsigned long *flagsp)
{
	struct pglist_data *pgdat = page_pgdat(page);

	spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp);
	return &pgdat->__lruvec;
}

static inline struct mem_cgroup *
mem_cgroup_iter(struct mem_cgroup *root,
		struct mem_cgroup *prev,
@@ -1411,6 +1473,10 @@ static inline
void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
{
}

static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
{
}
#endif /* CONFIG_MEMCG */

/* idx can be of type enum memcg_stat_item or node_stat_item */
@@ -1492,6 +1558,50 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
	return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec));
}

static inline void unlock_page_lruvec(struct lruvec *lruvec)
{
	spin_unlock(&lruvec->lru_lock);
}

static inline void unlock_page_lruvec_irq(struct lruvec *lruvec)
{
	spin_unlock_irq(&lruvec->lru_lock);
}

static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
		unsigned long flags)
{
	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
}

/* Don't lock again iff page's lruvec locked */
static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
		struct lruvec *locked_lruvec)
{
	if (locked_lruvec) {
		if (lruvec_holds_page_lru_lock(page, locked_lruvec))
			return locked_lruvec;

		unlock_page_lruvec_irq(locked_lruvec);
	}

	return lock_page_lruvec_irq(page);
}

/* Don't lock again iff page's lruvec locked */
static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page,
		struct lruvec *locked_lruvec, unsigned long *flags)
{
	if (locked_lruvec) {
		if (lruvec_holds_page_lru_lock(page, locked_lruvec))
			return locked_lruvec;

		unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
	}

	return lock_page_lruvec_irqsave(page, flags);
}

#ifdef CONFIG_CGROUP_WRITEBACK

struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
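
A hedged sketch of how a batched caller is expected to use the relock helpers above (modelled on what pagevec_lru_move_fn() becomes in this series; example_move_fn() is a made-up name and the actual list manipulation is elided):

static void example_move_fn(struct pagevec *pvec)
{
	struct lruvec *lruvec = NULL;
	unsigned long flags;
	int i;

	for (i = 0; i < pagevec_count(pvec); i++) {
		struct page *page = pvec->pages[i];

		/* only drops/retakes the lock when the page belongs to a
		 * different lruvec than the previous page did */
		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);

		/* ... move @page between LRU lists under lruvec->lru_lock ... */
	}
	if (lruvec)
		unlock_page_lruvec_irqrestore(lruvec, flags);
}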
2 changes: 1 addition & 1 deletion include/linux/mm_types.h
@@ -79,7 +79,7 @@ struct page {
		struct {	/* Page cache and anonymous pages */
			/**
			 * @lru: Pageout list, eg. active_list protected by
			 * pgdat->lru_lock. Sometimes used as a generic list
			 * lruvec->lru_lock. Sometimes used as a generic list
			 * by the page owner.
			 */
			struct list_head lru;
6 changes: 3 additions & 3 deletions include/linux/mmzone.h
@@ -113,8 +113,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
struct pglist_data;

/*
* zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
* So add a wild amount of padding here to ensure that they fall into separate
* Add a wild amount of padding here to ensure datas fall into separate
* cachelines. There are very few zone structures in the machine, so space
* consumption is not a concern here.
*/
@@ -276,6 +275,8 @@

struct lruvec {
	struct list_head lists[NR_LRU_LISTS];
	/* per lruvec lru_lock for memcg */
	spinlock_t lru_lock;
	/*
	 * These track the cost of reclaiming one LRU - file or anon -
	 * over the other. As the observed cost of reclaiming one LRU
@@ -782,7 +783,6 @@ typedef struct pglist_data {

	/* Write-intensive fields used by page reclaim */
	ZONE_PADDING(_pad1_)
	spinlock_t lru_lock;

#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
/*
1 change: 1 addition & 0 deletions include/linux/page-flags.h
@@ -334,6 +334,7 @@ PAGEFLAG(Referenced, referenced, PF_HEAD)
PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD)
__CLEARPAGEFLAG(Dirty, dirty, PF_HEAD)
PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD)
TESTCLEARFLAG(LRU, lru, PF_HEAD)
PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
TESTCLEARFLAG(Active, active, PF_HEAD)
PAGEFLAG(Workingset, workingset, PF_HEAD)
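
The added TESTCLEARFLAG(LRU, lru, PF_HEAD) line generates the TestClearPageLRU() helper that the rest of this series relies on; its approximate expansion (reconstructed from the page-flags macro machinery, not quoted from this diff) is:

static __always_inline int TestClearPageLRU(struct page *page)
{
	return test_and_clear_bit(PG_lru, &PF_HEAD(page, 1)->flags);
}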
4 changes: 1 addition & 3 deletions include/linux/swap.h
@@ -338,8 +338,6 @@ extern void lru_note_cost(struct lruvec *lruvec, bool file,
unsigned int nr_pages);
extern void lru_note_cost_page(struct page *);
extern void lru_cache_add(struct page *);
extern void lru_add_page_tail(struct page *page, struct page *page_tail,
struct lruvec *lruvec, struct list_head *head);
extern void mark_page_accessed(struct page *);
extern void lru_add_drain(void);
extern void lru_add_drain_cpu(int cpu);
@@ -358,7 +356,7 @@ extern void lru_cache_add_inactive_or_unevictable(struct page *page,
extern unsigned long zone_reclaimable_pages(struct zone *zone);
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *mask);
extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
extern int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode);
extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
unsigned long nr_pages,
gfp_t gfp_mask,
(Diffs for the remaining changed files are not shown here.)
