Skip to content

Commit

Permalink
mm, THP, swap: delay splitting THP during swap out
Browse files Browse the repository at this point in the history
Patch series "THP swap: Delay splitting THP during swapping out", v11.

This patchset is to optimize the performance of Transparent Huge Page
(THP) swap.

Recently, the performance of the storage devices improved so fast that
we cannot saturate the disk bandwidth with single logical CPU when do
page swap out even on a high-end server machine.  Because the
performance of the storage device improved faster than that of single
logical CPU.  And it seems that the trend will not change in the near
future.  On the other hand, the THP becomes more and more popular
because of increased memory size.  So it becomes necessary to optimize
THP swap performance.

The advantages of the THP swap support include:

 - Batch the swap operations for the THP to reduce lock
   acquiring/releasing, including allocating/freeing the swap space,
   adding/deleting to/from the swap cache, and writing/reading the swap
   space, etc. This will help improve the performance of the THP swap.

 - The THP swap space read/write will be 2M sequential IO. It is
   particularly helpful for the swap read, which are usually 4k random
   IO. This will improve the performance of the THP swap too.

 - It will help the memory fragmentation, especially when the THP is
   heavily used by the applications. The 2M continuous pages will be
   free up after THP swapping out.

 - It will improve the THP utilization on the system with the swap
   turned on. Because the speed for khugepaged to collapse the normal
   pages into the THP is quite slow. After the THP is split during the
   swapping out, it will take quite long time for the normal pages to
   collapse back into the THP after being swapped in. The high THP
   utilization helps the efficiency of the page based memory management
   too.

There are some concerns regarding THP swap in, mainly because possible
enlarged read/write IO size (for swap in/out) may put more overhead on
the storage device.  To deal with that, the THP swap in should be turned
on only when necessary.  For example, it can be selected via
"always/never/madvise" logic, to be turned on globally, turned off
globally, or turned on only for VMA with MADV_HUGEPAGE, etc.

This patchset is the first step for the THP swap support.  The plan is
to delay splitting THP step by step, finally avoid splitting THP during
the THP swapping out and swap out/in the THP as a whole.

As the first step, in this patchset, the splitting huge page is delayed
from almost the first step of swapping out to after allocating the swap
space for the THP and adding the THP into the swap cache.  This will
reduce lock acquiring/releasing for the locks used for the swap cache
management.

With the patchset, the swap out throughput improves 15.5% (from about
3.73GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case
with 8 processes.  The test is done on a Xeon E5 v3 system.  The swap
device used is a RAM simulated PMEM (persistent memory) device.  To test
the sequential swapping out, the test case creates 8 processes, which
sequentially allocate and write to the anonymous pages until the RAM and
part of the swap device is used up.

This patch (of 5):

In this patch, splitting huge page is delayed from almost the first step
of swapping out to after allocating the swap space for the THP
(Transparent Huge Page) and adding the THP into the swap cache.  This
will batch the corresponding operation, thus improve THP swap out
throughput.

This is the first step for the THP swap optimization.  The plan is to
delay splitting the THP step by step and avoid splitting the THP
finally.

In this patch, one swap cluster is used to hold the contents of each THP
swapped out.  So, the size of the swap cluster is changed to that of the
THP (Transparent Huge Page) on x86_64 architecture (512).  For other
architectures which want such THP swap optimization,
ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file for
the architecture.  In effect, this will enlarge swap cluster size by 2
times on x86_64.  Which may make it harder to find a free cluster when
the swap space becomes fragmented.  So that, this may reduce the
continuous swap space allocation and sequential write in theory.  The
performance test in 0day shows no regressions caused by this.

In the future of THP swap optimization, some information of the swapped
out THP (such as compound map count) will be recorded in the
swap_cluster_info data structure.

The mem cgroup swap accounting functions are enhanced to support charge
or uncharge a swap cluster backing a THP as a whole.

The swap cluster allocate/free functions are added to allocate/free a
swap cluster for a THP.  A fair simple algorithm is used for swap
cluster allocation, that is, only the first swap device in priority list
will be tried to allocate the swap cluster.  The function will fail if
the trying is not successful, and the caller will fallback to allocate a
single swap slot instead.  This works good enough for normal cases.  If
the difference of the number of the free swap clusters among multiple
swap devices is significant, it is possible that some THPs are split
earlier than necessary.  For example, this could be caused by big size
difference among multiple swap devices.

The swap cache functions is enhanced to support add/delete THP to/from
the swap cache as a set of (HPAGE_PMD_NR) sub-pages.  This may be
enhanced in the future with multi-order radix tree.  But because we will
split the THP soon during swapping out, that optimization doesn't make
much sense for this first step.

The THP splitting functions are enhanced to support to split THP in swap
cache during swapping out.  The page lock will be held during allocating
the swap cluster, adding the THP into the swap cache and splitting the
THP.  So in the code path other than swapping out, if the THP need to be
split, the PageSwapCache(THP) will be always false.

The swap cluster is only available for SSD, so the THP swap optimization
in this patchset has no effect for HDD.

[ying.huang@intel.com: fix two issues in THP optimize patch]
  Link: http://lkml.kernel.org/r/87k25ed8zo.fsf@yhuang-dev.intel.com
[hannes@cmpxchg.org: extensive cleanups and simplifications, reduce code size]
Link: http://lkml.kernel.org/r/20170515112522.32457-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Andrew Morton <akpm@linux-foundation.org> [for config option]
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> [for changes in huge_memory.c and huge_mm.h]
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  • Loading branch information
yhuang-intel authored and torvalds committed Jul 6, 2017
1 parent 9d85e15 commit 38d8b4e
Show file tree
Hide file tree
Showing 12 changed files with 373 additions and 164 deletions.
1 change: 1 addition & 0 deletions arch/x86/Kconfig
Original file line number Diff line number Diff line change
Expand Up @@ -72,6 +72,7 @@ config X86
select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
select ARCH_WANT_FRAME_POINTERS
select ARCH_WANTS_DYNAMIC_TASK_STRUCT
select ARCH_WANTS_THP_SWAP if X86_64
select BUILDTIME_EXTABLE_SORT
select CLKEVT_I8253
select CLOCKSOURCE_VALIDATE_LAST_CYCLE
Expand Down
7 changes: 5 additions & 2 deletions include/linux/page-flags.h
Original file line number Diff line number Diff line change
Expand Up @@ -326,11 +326,14 @@ PAGEFLAG_FALSE(HighMem)
#ifdef CONFIG_SWAP
static __always_inline int PageSwapCache(struct page *page)
{
#ifdef CONFIG_THP_SWAP
page = compound_head(page);
#endif
return PageSwapBacked(page) && test_bit(PG_swapcache, &page->flags);

}
SETPAGEFLAG(SwapCache, swapcache, PF_NO_COMPOUND)
CLEARPAGEFLAG(SwapCache, swapcache, PF_NO_COMPOUND)
SETPAGEFLAG(SwapCache, swapcache, PF_NO_TAIL)
CLEARPAGEFLAG(SwapCache, swapcache, PF_NO_TAIL)
#else
PAGEFLAG_FALSE(SwapCache)
#endif
Expand Down
19 changes: 14 additions & 5 deletions include/linux/swap.h
Original file line number Diff line number Diff line change
Expand Up @@ -386,9 +386,9 @@ static inline long get_nr_swap_pages(void)
}

extern void si_swapinfo(struct sysinfo *);
extern swp_entry_t get_swap_page(void);
extern swp_entry_t get_swap_page(struct page *page);
extern swp_entry_t get_swap_page_of_type(int);
extern int get_swap_pages(int n, swp_entry_t swp_entries[]);
extern int get_swap_pages(int n, bool cluster, swp_entry_t swp_entries[]);
extern int add_swap_count_continuation(swp_entry_t, gfp_t);
extern void swap_shmem_alloc(swp_entry_t);
extern int swap_duplicate(swp_entry_t);
Expand Down Expand Up @@ -515,7 +515,7 @@ static inline int try_to_free_swap(struct page *page)
return 0;
}

static inline swp_entry_t get_swap_page(void)
static inline swp_entry_t get_swap_page(struct page *page)
{
swp_entry_t entry;
entry.val = 0;
Expand Down Expand Up @@ -548,7 +548,7 @@ static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
#ifdef CONFIG_MEMCG_SWAP
extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
extern int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry);
extern void mem_cgroup_uncharge_swap(swp_entry_t entry);
extern void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages);
extern long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg);
extern bool mem_cgroup_swap_full(struct page *page);
#else
Expand All @@ -562,7 +562,8 @@ static inline int mem_cgroup_try_charge_swap(struct page *page,
return 0;
}

static inline void mem_cgroup_uncharge_swap(swp_entry_t entry)
static inline void mem_cgroup_uncharge_swap(swp_entry_t entry,
unsigned int nr_pages)
{
}

Expand All @@ -577,5 +578,13 @@ static inline bool mem_cgroup_swap_full(struct page *page)
}
#endif

#ifdef CONFIG_THP_SWAP
extern void swapcache_free_cluster(swp_entry_t entry);
#else
static inline void swapcache_free_cluster(swp_entry_t entry)
{
}
#endif

#endif /* __KERNEL__*/
#endif /* _LINUX_SWAP_H */
6 changes: 4 additions & 2 deletions include/linux/swap_cgroup.h
Original file line number Diff line number Diff line change
Expand Up @@ -7,15 +7,17 @@

extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
unsigned short old, unsigned short new);
extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id);
extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
unsigned int nr_ents);
extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
extern int swap_cgroup_swapon(int type, unsigned long max_pages);
extern void swap_cgroup_swapoff(int type);

#else

static inline
unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id)
unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
unsigned int nr_ents)
{
return 0;
}
Expand Down
12 changes: 12 additions & 0 deletions mm/Kconfig
Original file line number Diff line number Diff line change
Expand Up @@ -446,6 +446,18 @@ choice
benefit.
endchoice

config ARCH_WANTS_THP_SWAP
def_bool n

config THP_SWAP
def_bool y
depends on TRANSPARENT_HUGEPAGE && ARCH_WANTS_THP_SWAP
help
Swap transparent huge pages in one piece, without splitting.
XXX: For now this only does clustered swap space allocation.

For selection by architectures with reasonable THP sizes.

config TRANSPARENT_HUGE_PAGECACHE
def_bool y
depends on TRANSPARENT_HUGEPAGE
Expand Down
11 changes: 8 additions & 3 deletions mm/huge_memory.c
Original file line number Diff line number Diff line change
Expand Up @@ -2203,7 +2203,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
* atomic_set() here would be safe on all archs (and not only on x86),
* it's safer to use atomic_inc()/atomic_add().
*/
if (PageAnon(head)) {
if (PageAnon(head) && !PageSwapCache(head)) {
page_ref_inc(page_tail);
} else {
/* Additional pin to radix tree */
Expand All @@ -2214,6 +2214,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
page_tail->flags |= (head->flags &
((1L << PG_referenced) |
(1L << PG_swapbacked) |
(1L << PG_swapcache) |
(1L << PG_mlocked) |
(1L << PG_uptodate) |
(1L << PG_active) |
Expand Down Expand Up @@ -2276,7 +2277,11 @@ static void __split_huge_page(struct page *page, struct list_head *list,
ClearPageCompound(head);
/* See comment in __split_huge_page_tail() */
if (PageAnon(head)) {
page_ref_inc(head);
/* Additional pin to radix tree of swap cache */
if (PageSwapCache(head))
page_ref_add(head, 2);
else
page_ref_inc(head);
} else {
/* Additional pin to radix tree */
page_ref_add(head, 2);
Expand Down Expand Up @@ -2432,7 +2437,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
ret = -EBUSY;
goto out;
}
extra_pins = 0;
extra_pins = PageSwapCache(page) ? HPAGE_PMD_NR : 0;
mapping = NULL;
anon_vma_lock_write(anon_vma);
} else {
Expand Down
50 changes: 26 additions & 24 deletions mm/memcontrol.c
Original file line number Diff line number Diff line change
Expand Up @@ -2376,10 +2376,9 @@ void mem_cgroup_split_huge_fixup(struct page *head)

#ifdef CONFIG_MEMCG_SWAP
static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
bool charge)
int nr_entries)
{
int val = (charge) ? 1 : -1;
this_cpu_add(memcg->stat->count[MEMCG_SWAP], val);
this_cpu_add(memcg->stat->count[MEMCG_SWAP], nr_entries);
}

/**
Expand All @@ -2405,8 +2404,8 @@ static int mem_cgroup_move_swap_account(swp_entry_t entry,
new_id = mem_cgroup_id(to);

if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
mem_cgroup_swap_statistics(from, false);
mem_cgroup_swap_statistics(to, true);
mem_cgroup_swap_statistics(from, -1);
mem_cgroup_swap_statistics(to, 1);
return 0;
}
return -EINVAL;
Expand Down Expand Up @@ -5445,7 +5444,7 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
* let's not wait for it. The page already received a
* memory+swap charge, drop the swap entry duplicate.
*/
mem_cgroup_uncharge_swap(entry);
mem_cgroup_uncharge_swap(entry, nr_pages);
}
}

Expand Down Expand Up @@ -5873,9 +5872,9 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
* ancestor for the swap instead and transfer the memory+swap charge.
*/
swap_memcg = mem_cgroup_id_get_online(memcg);
oldid = swap_cgroup_record(entry, mem_cgroup_id(swap_memcg));
oldid = swap_cgroup_record(entry, mem_cgroup_id(swap_memcg), 1);
VM_BUG_ON_PAGE(oldid, page);
mem_cgroup_swap_statistics(swap_memcg, true);
mem_cgroup_swap_statistics(swap_memcg, 1);

page->mem_cgroup = NULL;

Expand All @@ -5902,19 +5901,20 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
css_put(&memcg->css);
}

/*
* mem_cgroup_try_charge_swap - try charging a swap entry
/**
* mem_cgroup_try_charge_swap - try charging swap space for a page
* @page: page being added to swap
* @entry: swap entry to charge
*
* Try to charge @entry to the memcg that @page belongs to.
* Try to charge @page's memcg for the swap space at @entry.
*
* Returns 0 on success, -ENOMEM on failure.
*/
int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
{
struct mem_cgroup *memcg;
unsigned int nr_pages = hpage_nr_pages(page);
struct page_counter *counter;
struct mem_cgroup *memcg;
unsigned short oldid;

if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) || !do_swap_account)
Expand All @@ -5929,44 +5929,46 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
memcg = mem_cgroup_id_get_online(memcg);

if (!mem_cgroup_is_root(memcg) &&
!page_counter_try_charge(&memcg->swap, 1, &counter)) {
!page_counter_try_charge(&memcg->swap, nr_pages, &counter)) {
mem_cgroup_id_put(memcg);
return -ENOMEM;
}

oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg));
/* Get references for the tail pages, too */
if (nr_pages > 1)
mem_cgroup_id_get_many(memcg, nr_pages - 1);
oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg), nr_pages);
VM_BUG_ON_PAGE(oldid, page);
mem_cgroup_swap_statistics(memcg, true);
mem_cgroup_swap_statistics(memcg, nr_pages);

return 0;
}

/**
* mem_cgroup_uncharge_swap - uncharge a swap entry
* mem_cgroup_uncharge_swap - uncharge swap space
* @entry: swap entry to uncharge
*
* Drop the swap charge associated with @entry.
* @nr_pages: the amount of swap space to uncharge
*/
void mem_cgroup_uncharge_swap(swp_entry_t entry)
void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
{
struct mem_cgroup *memcg;
unsigned short id;

if (!do_swap_account)
return;

id = swap_cgroup_record(entry, 0);
id = swap_cgroup_record(entry, 0, nr_pages);
rcu_read_lock();
memcg = mem_cgroup_from_id(id);
if (memcg) {
if (!mem_cgroup_is_root(memcg)) {
if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
page_counter_uncharge(&memcg->swap, 1);
page_counter_uncharge(&memcg->swap, nr_pages);
else
page_counter_uncharge(&memcg->memsw, 1);
page_counter_uncharge(&memcg->memsw, nr_pages);
}
mem_cgroup_swap_statistics(memcg, false);
mem_cgroup_id_put(memcg);
mem_cgroup_swap_statistics(memcg, -nr_pages);
mem_cgroup_id_put_many(memcg, nr_pages);
}
rcu_read_unlock();
}
Expand Down
2 changes: 1 addition & 1 deletion mm/shmem.c
Original file line number Diff line number Diff line change
Expand Up @@ -1291,7 +1291,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
SetPageUptodate(page);
}

swap = get_swap_page();
swap = get_swap_page(page);
if (!swap.val)
goto redirty;

Expand Down
40 changes: 30 additions & 10 deletions mm/swap_cgroup.c
Original file line number Diff line number Diff line change
Expand Up @@ -61,21 +61,27 @@ static int swap_cgroup_prepare(int type)
return -ENOMEM;
}

static struct swap_cgroup *__lookup_swap_cgroup(struct swap_cgroup_ctrl *ctrl,
pgoff_t offset)
{
struct page *mappage;
struct swap_cgroup *sc;

mappage = ctrl->map[offset / SC_PER_PAGE];
sc = page_address(mappage);
return sc + offset % SC_PER_PAGE;
}

static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent,
struct swap_cgroup_ctrl **ctrlp)
{
pgoff_t offset = swp_offset(ent);
struct swap_cgroup_ctrl *ctrl;
struct page *mappage;
struct swap_cgroup *sc;

ctrl = &swap_cgroup_ctrl[swp_type(ent)];
if (ctrlp)
*ctrlp = ctrl;

mappage = ctrl->map[offset / SC_PER_PAGE];
sc = page_address(mappage);
return sc + offset % SC_PER_PAGE;
return __lookup_swap_cgroup(ctrl, offset);
}

/**
Expand Down Expand Up @@ -108,25 +114,39 @@ unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
}

/**
* swap_cgroup_record - record mem_cgroup for this swp_entry.
* @ent: swap entry to be recorded into
* swap_cgroup_record - record mem_cgroup for a set of swap entries
* @ent: the first swap entry to be recorded into
* @id: mem_cgroup to be recorded
* @nr_ents: number of swap entries to be recorded
*
* Returns old value at success, 0 at failure.
* (Of course, old value can be 0.)
*/
unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id)
unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
unsigned int nr_ents)
{
struct swap_cgroup_ctrl *ctrl;
struct swap_cgroup *sc;
unsigned short old;
unsigned long flags;
pgoff_t offset = swp_offset(ent);
pgoff_t end = offset + nr_ents;

sc = lookup_swap_cgroup(ent, &ctrl);

spin_lock_irqsave(&ctrl->lock, flags);
old = sc->id;
sc->id = id;
for (;;) {
VM_BUG_ON(sc->id != old);
sc->id = id;
offset++;
if (offset == end)
break;
if (offset % SC_PER_PAGE)
sc++;
else
sc = __lookup_swap_cgroup(ctrl, offset);
}
spin_unlock_irqrestore(&ctrl->lock, flags);

return old;
Expand Down
Loading

0 comments on commit 38d8b4e

Please sign in to comment.