
Commit a08a2ae

osalvadorvilardaga authored and torvalds committed
mm,memory_hotplug: allocate memmap from the added memory range
Physical memory hotadd has to allocate a memmap (struct page array) for the newly added memory section. Currently, alloc_pages_node() is used for those allocations.

This has some disadvantages:

a) existing memory is consumed for that purpose (e.g. ~2MB per 128MB memory section on x86_64). This can even lead to extreme cases where the system goes OOM because the physically hotplugged memory depletes the available memory before it is onlined.
b) if the whole node is movable then we have off-node struct pages, which have performance drawbacks.
c) it might be that there are no PMD_ALIGNED chunks, so the memmap array gets populated with base pages.

This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.

Vmemmap page tables can map arbitrary memory. That means that we can reserve a part of the physically hotadded memory to back vmemmap page tables. This implementation uses the beginning of the hotplugged memory for that purpose.

There are some non-obvious things to consider though.

Vmemmap pages are allocated/freed during the memory hotplug events (add_memory_resource(), try_remove_memory()) when the memory is added/removed. This means that the reserved physical range is not online although it is used. The most obvious side effect is that pfn_to_online_page() returns NULL for those pfns. The current design expects that this should be OK, as the hotplugged memory is considered garbage until it is onlined. For example, hibernation would not save the content of those vmemmaps into the image, so it would not be restored on resume, but this should be OK since there is no real content to recover anyway, while the metadata remains reachable from other data structures (e.g. vmemmap page tables).

The reserved space is therefore (de)initialized during the {on,off}line events (mhp_{de}init_memmap_on_memory). That is done by extracting page allocator independent initialization from the regular onlining path. The primary reason to handle the reserved space outside of {on,off}line_pages is to make each initialization specific to its purpose rather than special-casing them in a single function.

As per the above, the functions that are introduced are:

- mhp_init_memmap_on_memory: initializes vmemmap pages by calling move_pfn_range_to_zone(), calls kasan_add_zero_shadow(), and onlines as many sections as the vmemmap pages fully span.
- mhp_deinit_memmap_on_memory: offlines as many sections as the vmemmap pages fully span, removes the range from the zone by remove_pfn_range_from_zone(), and calls kasan_remove_zero_shadow() for the range.

The new function memory_block_online() calls mhp_init_memmap_on_memory() before doing the actual online_pages(). Should online_pages() fail, we clean up by calling mhp_deinit_memmap_on_memory(). Adjusting of present_pages is done at the end, once we know that online_pages() succeeded.

On offline, memory_block_offline() needs to unaccount vmemmap pages from present_pages before calling offline_pages(). This is necessary because offline_pages() tears down some structures depending on whether the node or the zone become empty. If offline_pages() fails, we account the vmemmap pages back. If it succeeds, we call mhp_deinit_memmap_on_memory().

Hot-remove: we need to be careful when removing memory, as adding and removing memory need to be done with the same granularity. To check that this assumption is not violated, we check the memory range we want to remove, and if a) any memory block has vmemmap pages and b) the range spans more than a single memory block, we scream out loud and refuse to proceed.

If all is good and the range was using memmap on memory (aka vmemmap pages), we construct an altmap structure so free_hugepage_table() does the right thing and calls vmem_altmap_free() instead of free_pagetable().

Link: https://lkml.kernel.org/r/20210421102701.25051-5-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
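The mm/memory_hotplug.c side of the patch is not included in the diff excerpt below, so the following is only a rough sketch, based on the commit message above, of how add_memory_resource() might wire up the self-hosted memmap when a caller passes MHP_MEMMAP_ON_MEMORY. Everything outside the headers shown below (the function wrapper, the exact error handling, the omitted pgprot setup) is an assumption, not the committed code.

        /* Sketch only: self-hosted memmap setup on the hot-add path. */
        static int example_add_memory_resource(int nid, u64 start, u64 size,
                                               mhp_t mhp_flags)
        {
                struct mhp_params params = {};          /* pgprot setup omitted in this sketch */
                struct vmem_altmap mhp_altmap = {};
                int ret;

                if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
                        if (!mhp_supports_memmap_on_memory(size))
                                return -EINVAL;
                        mhp_altmap.base_pfn = PHYS_PFN(start);  /* start of the hot-added range */
                        mhp_altmap.free = PHYS_PFN(size);       /* pages available to back the memmap */
                        params.altmap = &mhp_altmap;
                }

                /* The arch populates the memmap from the altmap, bumping mhp_altmap.alloc. */
                ret = arch_add_memory(nid, start, size, &params);
                if (ret < 0)
                        return ret;

                /* Memory block devices remember how many vmemmap pages sit at their start. */
                return create_memory_block_devices(start, size, mhp_altmap.alloc);
        }

In this scheme, mhp_altmap.alloc ends up holding the number of pages the architecture actually consumed from the range for the memmap, which is what each memory block later carries as nr_vmemmap_pages.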
1 parent f990114 · commit a08a2ae

8 files changed (+250, -22 lines changed)

drivers/base/memory.c

Lines changed: 66 additions & 6 deletions

@@ -173,16 +173,73 @@ static int memory_block_online(struct memory_block *mem)
 {
         unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
         unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
+        unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
+        struct zone *zone;
+        int ret;
+
+        zone = zone_for_pfn_range(mem->online_type, mem->nid, start_pfn, nr_pages);
+
+        /*
+         * Although vmemmap pages have a different lifecycle than the pages
+         * they describe (they remain until the memory is unplugged), doing
+         * their initialization and accounting at memory onlining/offlining
+         * stage helps to keep accounting easier to follow - e.g vmemmaps
+         * belong to the same zone as the memory they backed.
+         */
+        if (nr_vmemmap_pages) {
+                ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, zone);
+                if (ret)
+                        return ret;
+        }
+
+        ret = online_pages(start_pfn + nr_vmemmap_pages,
+                           nr_pages - nr_vmemmap_pages, zone);
+        if (ret) {
+                if (nr_vmemmap_pages)
+                        mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
+                return ret;
+        }
+
+        /*
+         * Account once onlining succeeded. If the zone was unpopulated, it is
+         * now already properly populated.
+         */
+        if (nr_vmemmap_pages)
+                adjust_present_page_count(zone, nr_vmemmap_pages);
 
-        return online_pages(start_pfn, nr_pages, mem->online_type, mem->nid);
+        return ret;
 }
 
 static int memory_block_offline(struct memory_block *mem)
 {
         unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
         unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
+        unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
+        struct zone *zone;
+        int ret;
+
+        zone = page_zone(pfn_to_page(start_pfn));
+
+        /*
+         * Unaccount before offlining, such that unpopulated zone and kthreads
+         * can properly be torn down in offline_pages().
+         */
+        if (nr_vmemmap_pages)
+                adjust_present_page_count(zone, -nr_vmemmap_pages);
 
-        return offline_pages(start_pfn, nr_pages);
+        ret = offline_pages(start_pfn + nr_vmemmap_pages,
+                            nr_pages - nr_vmemmap_pages);
+        if (ret) {
+                /* offline_pages() failed. Account back. */
+                if (nr_vmemmap_pages)
+                        adjust_present_page_count(zone, nr_vmemmap_pages);
+                return ret;
+        }
+
+        if (nr_vmemmap_pages)
+                mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
+
+        return ret;
 }
 
 /*
@@ -576,7 +633,8 @@ int register_memory(struct memory_block *memory)
         return ret;
 }
 
-static int init_memory_block(unsigned long block_id, unsigned long state)
+static int init_memory_block(unsigned long block_id, unsigned long state,
+                             unsigned long nr_vmemmap_pages)
 {
         struct memory_block *mem;
         int ret = 0;
@@ -593,6 +651,7 @@ static int init_memory_block(unsigned long block_id, unsigned long state)
         mem->start_section_nr = block_id * sections_per_block;
         mem->state = state;
         mem->nid = NUMA_NO_NODE;
+        mem->nr_vmemmap_pages = nr_vmemmap_pages;
 
         ret = register_memory(mem);
 
@@ -612,7 +671,7 @@ static int add_memory_block(unsigned long base_section_nr)
         if (section_count == 0)
                 return 0;
         return init_memory_block(memory_block_id(base_section_nr),
-                                 MEM_ONLINE);
+                                 MEM_ONLINE, 0);
 }
 
 static void unregister_memory(struct memory_block *memory)
@@ -634,7 +693,8 @@ static void unregister_memory(struct memory_block *memory)
  *
  * Called under device_hotplug_lock.
  */
-int create_memory_block_devices(unsigned long start, unsigned long size)
+int create_memory_block_devices(unsigned long start, unsigned long size,
+                                unsigned long vmemmap_pages)
 {
         const unsigned long start_block_id = pfn_to_block_id(PFN_DOWN(start));
         unsigned long end_block_id = pfn_to_block_id(PFN_DOWN(start + size));
@@ -647,7 +707,7 @@ int create_memory_block_devices(unsigned long start, unsigned long size)
                 return -EINVAL;
 
         for (block_id = start_block_id; block_id != end_block_id; block_id++) {
-                ret = init_memory_block(block_id, MEM_OFFLINE);
+                ret = init_memory_block(block_id, MEM_OFFLINE, vmemmap_pages);
                 if (ret)
                         break;
         }
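memory_block_online() and memory_block_offline() account the vmemmap pages through adjust_present_page_count(), whose definition lives in mm/memory_hotplug.c (factored out in the parent commit f990114) and is not part of this excerpt. A sketch of what that accounting amounts to, under the assumption that the helper simply adjusts the zone and node present-page counters:

        /* Sketch of the helper referenced above; the real definition may differ in detail. */
        void adjust_present_page_count(struct zone *zone, long nr_pages)
        {
                unsigned long flags;

                zone->present_pages += nr_pages;        /* nr_pages is negative on offline */
                pgdat_resize_lock(zone->zone_pgdat, &flags);
                zone->zone_pgdat->node_present_pages += nr_pages;
                pgdat_resize_unlock(zone->zone_pgdat, &flags);
        }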

include/linux/memory.h

Lines changed: 7 additions & 1 deletion

@@ -29,6 +29,11 @@ struct memory_block {
         int online_type;                /* for passing data to online routine */
         int nid;                        /* NID for this memory block */
         struct device dev;
+        /*
+         * Number of vmemmap pages. These pages
+         * lay at the beginning of the memory block.
+         */
+        unsigned long nr_vmemmap_pages;
 };
 
 int arch_get_memory_phys_device(unsigned long start_pfn);
@@ -80,7 +85,8 @@ static inline int memory_notify(unsigned long val, void *v)
 #else
 extern int register_memory_notifier(struct notifier_block *nb);
 extern void unregister_memory_notifier(struct notifier_block *nb);
-int create_memory_block_devices(unsigned long start, unsigned long size);
+int create_memory_block_devices(unsigned long start, unsigned long size,
+                                unsigned long vmemmap_pages);
 void remove_memory_block_devices(unsigned long start, unsigned long size);
 extern void memory_dev_init(void);
 extern int memory_notify(unsigned long val, void *v);

include/linux/memory_hotplug.h

Lines changed: 14 additions & 1 deletion

@@ -55,6 +55,14 @@ typedef int __bitwise mhp_t;
  */
 #define MHP_MERGE_RESOURCE      ((__force mhp_t)BIT(0))
 
+/*
+ * We want memmap (struct page array) to be self contained.
+ * To do so, we will use the beginning of the hot-added range to build
+ * the page tables for the memmap array that describes the entire range.
+ * Only selected architectures support it with SPARSE_VMEMMAP.
+ */
+#define MHP_MEMMAP_ON_MEMORY   ((__force mhp_t)BIT(1))
+
 /*
  * Extended parameters for memory hotplug:
  * altmap: alternative allocator for memmap array (optional)
@@ -99,9 +107,13 @@ static inline void zone_seqlock_init(struct zone *zone)
 extern int zone_grow_free_lists(struct zone *zone, unsigned long new_nr_pages);
 extern int zone_grow_waitqueues(struct zone *zone, unsigned long nr_pages);
 extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
+extern void adjust_present_page_count(struct zone *zone, long nr_pages);
 /* VM interface that may be used by firmware interface */
+extern int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
+                                     struct zone *zone);
+extern void mhp_deinit_memmap_on_memory(unsigned long pfn, unsigned long nr_pages);
 extern int online_pages(unsigned long pfn, unsigned long nr_pages,
-                        int online_type, int nid);
+                        struct zone *zone);
 extern struct zone *test_pages_in_a_zone(unsigned long start_pfn,
                                          unsigned long end_pfn);
 extern void __offline_isolated_pages(unsigned long start_pfn,
@@ -359,6 +371,7 @@ extern struct zone *zone_for_pfn_range(int online_type, int nid, unsigned start_
 extern int arch_create_linear_mapping(int nid, u64 start, u64 size,
                                       struct mhp_params *params);
 void arch_remove_linear_mapping(u64 start, u64 size);
+extern bool mhp_supports_memmap_on_memory(unsigned long size);
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 #endif /* __LINUX_MEMORY_HOTPLUG_H */
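mhp_supports_memmap_on_memory() is only declared in this hunk; its definition belongs to mm/memory_hotplug.c, which is not part of this excerpt. Based on the constraints spelled out in the commit message (a single memory block per hot-add, a vmemmap that spans whole PMDs, and full pageblocks left for the buddy), a plausible sketch of the checks is the following; the exact conditions in the real function may differ:

        /* Sketch only: the shown diff declares this helper but does not define it. */
        bool mhp_supports_memmap_on_memory(unsigned long size)
        {
                unsigned long nr_vmemmap_pages = size / PAGE_SIZE;
                unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
                unsigned long remaining_size = size - vmemmap_size;

                /* Feature compiled in, single memory block, PMD-aligned vmemmap,
                 * and the remainder still covers whole pageblocks. */
                return IS_ENABLED(CONFIG_MHP_MEMMAP_ON_MEMORY) &&
                       size == memory_block_size_bytes() &&
                       IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
                       IS_ALIGNED(remaining_size, pageblock_nr_pages << PAGE_SHIFT);
        }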

include/linux/memremap.h

Lines changed: 1 addition & 1 deletion

@@ -17,7 +17,7 @@ struct device;
  * @alloc: track pages consumed, private to vmemmap_populate()
  */
 struct vmem_altmap {
-        const unsigned long base_pfn;
+        unsigned long base_pfn;
         const unsigned long end_pfn;
         const unsigned long reserve;
         unsigned long free;

include/linux/mmzone.h

Lines changed: 5 additions & 2 deletions

@@ -436,6 +436,11 @@ enum zone_type {
          * situations where ZERO_PAGE(0) which is allocated differently
          * on different platforms may end up in a movable zone. ZERO_PAGE(0)
          * cannot be migrated.
+         * 7. Memory-hotplug: when using memmap_on_memory and onlining the
+         *    memory to the MOVABLE zone, the vmemmap pages are also placed in
+         *    such zone. Such pages cannot be really moved around as they are
+         *    self-stored in the range, but they are treated as movable when
+         *    the range they describe is about to be offlined.
          *
          * In general, no unmovable allocations that degrade memory offlining
          * should end up in ZONE_MOVABLE. Allocators (like alloc_contig_range())
@@ -1392,10 +1397,8 @@ static inline int online_section_nr(unsigned long nr)
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 void online_mem_sections(unsigned long start_pfn, unsigned long end_pfn);
-#ifdef CONFIG_MEMORY_HOTREMOVE
 void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn);
 #endif
-#endif
 
 static inline struct mem_section *__pfn_to_section(unsigned long pfn)
 {

mm/Kconfig

Lines changed: 5 additions & 0 deletions

@@ -188,6 +188,11 @@ config MEMORY_HOTREMOVE
         depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
         depends on MIGRATION
 
+config MHP_MEMMAP_ON_MEMORY
+        def_bool y
+        depends on MEMORY_HOTPLUG && SPARSEMEM_VMEMMAP
+        depends on ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
+
 # Heavily threaded applications may benefit from splitting the mm-wide
 # page_table_lock, so that faults on different parts of the user address
 # space can be handled with less contention: split it at this NR_CPUS.
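ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE is not defined anywhere in this commit; an architecture is expected to provide it from its own Kconfig once its vmemmap code can honour the altmap on hot-add. A hypothetical x86_64 opt-in could look like the fragment below; the real arch enablement is a separate patch and may differ:

        # Sketch of an arch-side opt-in (not part of this commit)
        config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
                def_bool y
                depends on X86_64 && MEMORY_HOTPLUG && SPARSEMEM_VMEMMAP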
