Wasted memory in early malloc() #270

Closed
nyh opened this issue Apr 13, 2014 · 8 comments

nyh commented Apr 13, 2014

Our malloc() is wasteful in many places, but here is one I just noticed now.

std_malloc() during early initialization (!smp_allocator) allocates a whole page for every small allocation. Printing out what actually happens in an example run, I saw 819 of these early allocations, most of them asking for 24-72 bytes, but we allocate 4096 bytes for each of them. This wastes about 0.7 MB of memory. Not huge, but this kind of wastage adds up :(

It would be nice to make the early allocator less wasteful, or alternatively see if we can use the dynamic allocator less during initialization.
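
To make the mechanism concrete, here is a minimal sketch (a simplification under my assumptions, not the actual OSv code) of the early path described above: with the per-CPU pools unavailable, every small request takes the whole-page branch.

```cpp
#include <cstddef>
#include <cstdlib>

static const size_t page_size = 4096;

// Stand-in for OSv's page allocator, just for this illustration.
static void* alloc_page() {
    return std::aligned_alloc(page_size, page_size);
}

void* early_malloc_sketch(size_t size, size_t alignment) {
    if (size <= page_size && alignment <= page_size) {
        // Even a 24-byte request consumes an entire 4096-byte page,
        // so over 99% of the page is wasted.
        return alloc_page();
    }
    return nullptr;  // larger requests would go to malloc_large() in OSv
}
```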


nyh commented Apr 28, 2014

BTW, another bad thing about the !smp_allocator case in std_malloc() is that it doesn't respect the requested alignment: it always returns addresses 8 bytes into a page.
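
A hypothetical illustration of why that breaks callers (the 8-byte offset is taken from the comment above; the rest is my assumption, not verified against the code): any request for 16-byte or stricter alignment gets an address ending in ...8.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

int main() {
    // Stand-in for a freshly allocated, page-aligned kernel page.
    void* page = std::aligned_alloc(4096, 4096);
    // What the early path reportedly returns: 8 bytes into the page.
    void* ret = static_cast<char*>(page) + 8;
    // 8-byte aligned, but any 16-byte (or stricter) alignment request
    // is silently violated.
    assert(reinterpret_cast<uintptr_t>(ret) % 16 == 8);
    std::free(page);
    return 0;
}
```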


wkozaczuk commented Sep 26, 2018

I have recently spent some time analyzing how memory is allocated and managed in OSv. I added a bunch of metrics and had a chance to look at this particular issue.

Unless I mismeasured, OSv makes on average 1800 (1 thousand 8 hundred) early malloc allocations, which amounts to 7.2 MB. What is worse, the total memory requested (the size argument passed to std_malloc) for all these single-page allocations is ONLY around 120,000 bytes (~117 KB). So the effectiveness ratio is roughly 1.5%. I think the math in the original description from 4 years ago is wrong: 819 page allocations amount to roughly 3.2 MB, not 0.7 MB.

Here is the build command and run command:
./scripts/build -j6 fs=rofs image=native-example
./scripts/run.py -c 1 -m 50M -V

Here are the changes I made to mempool.cc to capture the number of pages allocated and the number of bytes requested:

@@ -1549,11 +1584,28 @@ static inline void* std_malloc(size_t size, size_t alignment)
     if (size <= memory::pool::max_object_size && alignment <= size && smp_allocator) {
         size = std::max(size, memory::pool::min_object_size);
         unsigned n = ilog2_roundup(size);
-        ret = memory::malloc_pools[n].alloc();
+        memory::malloc_pool &_pool = memory::malloc_pools[n];
+        memory::malloc_memory_pool_bytes_allocated.fetch_add(_pool.get_size());
+        memory::malloc_memory_pool_bytes_requested.fetch_add(size);
+        ret = _pool.alloc();
         ret = translate_mem_area(mmu::mem_area::main, mmu::mem_area::mempool,
                                  ret);
         trace_memory_malloc_mempool(ret, size, 1 << n, alignment);
     } else if (size <= mmu::page_size && alignment <= mmu::page_size) {
+        if(smp_allocator) {
+            memory::malloc_smp_full_pages_allocated.fetch_add(1);
+            memory::malloc_smp_full_pages_bytes_requested.fetch_add(size);
+        }
+        else {
+            memory::malloc_non_smp_full_pages_allocated.fetch_add(1);
+            memory::malloc_non_smp_full_pages_bytes_requested.fetch_add(size);
+        }
+
         ret = mmu::translate_mem_area(mmu::mem_area::main, mmu::mem_area::page,
                                        memory::alloc_page());
         trace_memory_malloc_page(ret, size, mmu::page_size, alignment);
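
The diff references counters it does not declare; a plausible set of declarations (assumed here, consistent with the fetch_add() calls above and the "std atomic counter" mentioned later in this thread) would be:

```cpp
#include <atomic>
#include <cstdint>

namespace memory {
// Assumed declarations matching the fetch_add() calls in the diff above.
std::atomic<uint64_t> malloc_memory_pool_bytes_allocated(0);
std::atomic<uint64_t> malloc_memory_pool_bytes_requested(0);
std::atomic<uint64_t> malloc_smp_full_pages_allocated(0);
std::atomic<uint64_t> malloc_smp_full_pages_bytes_requested(0);
std::atomic<uint64_t> malloc_non_smp_full_pages_allocated(0);
std::atomic<uint64_t> malloc_non_smp_full_pages_bytes_requested(0);
}
```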


nyh commented Sep 26, 2018

One thing you should confirm, before being too happy about the possibility of saving 7 MB of memory, is that most of this memory really remains allocated, and isn't just temporary memory which is immediately freed after being allocated. You can easily check that by also instrumenting the free() code to decrement the counter when called during those early times.

I'm curious where these 1800 early allocations come from, and whether we could easily reduce their number significantly. Perhaps we could reuse the code of the leak detector (see include/osv/alloctracker.hh and scripts/loader.py's show_leak) to show a summary of where in the code (in the form of stack backtraces) the early allocations come from. Another option is to use a tracepoint instead of your ad-hoc counting code, and for tracepoints we already have a way to see a list of their occurrences, including backtraces.
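
A minimal sketch of the suggested free()-side instrumentation (the counter name and hook point are assumptions, mirroring the counters above):

```cpp
#include <atomic>
#include <cstdint>

namespace memory {
// Assumed companion counter to malloc_non_smp_full_pages_allocated.
std::atomic<uint64_t> malloc_non_smp_full_pages_deallocated(0);

// Hypothetical hook to be called from the early free path, so the net
// early-boot footprint (pages allocated minus pages freed) can be reported.
inline void count_early_page_free() {
    malloc_non_smp_full_pages_deallocated.fetch_add(1);
}
}
```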


wkozaczuk commented Sep 26, 2018 via email

@wkozaczuk

I added another metric to count early deallocations (in early_free_page):

---------> In non-SMP mode allocated 1810 pages in order to allocate 120060 bytes AND deallocated 229 pages

The math seems to be off, though. This suggests that 1810-229=1581 pages should be left, whereas the difference reported based on memory::stats, 9806-8121=1685, is higher. Is it possible that some early-allocated pages get deallocated later, in SMP mode?

I did play with gdb's osv leak, but I did not cross-check these numbers against it.


wkozaczuk commented Sep 27, 2018

I found at least one culprit: the acpi::early_init() function in drivers/acpi.cc, which by my count made 600 malloc calls. Each call resulted in a full page being allocated, 2.4 MB in total, yet the total data requested was only ~36 KB.

---acpi::early_init---> In non-SMP mode allocated 606 pages in order to allocate 36450 bytes AND deallocated 3 pages (delta 600)

I used osv leak to track the problem but these particular stack traces never showed up:

(gdb) bt
#0  std_malloc (size=size@entry=72, alignment=alignment@entry=16) at core/mempool.cc:1612
#1  0x00000000003e0ae7 in malloc (size=72) at core/mempool.cc:1823
#2  0x00000000004940dc in operator new(unsigned long) ()
#3  0x000000000037b0ad in AcpiOsCreateSemaphore () at drivers/acpi.cc:115
#4  0x00000000003b2979 in AcpiUtCreateMutex (MutexId=<optimized out>) at external/x64/acpica/source/components/utilities/utmutex.c:267
#5  AcpiUtMutexInitialize () at external/x64/acpica/source/components/utilities/utmutex.c:164
#6  0x00000000003b3bdc in AcpiInitializeSubsystem () at external/x64/acpica/source/components/utilities/utxfinit.c:176
#7  0x000000000037b891 in acpi::early_init () at drivers/acpi.cc:553
#8  0x00000000002294a2 in premain () at loader.cc:123
#9  0x0000000000228f3e in start64 () at arch/x64/boot.S:104
(gdb) cont

While debugging over many iterations (at least 10), I would see the identical stack trace as above, and I do not understand why the same malloc from the same place is called so many times, as if it keeps trying to initialize the same semaphore object over and over again. On the other hand, I saw a loop in AcpiUtMutexInitialize creating 8 mutex objects. I do not think these allocated objects get deallocated in this code, so I wonder why osv leak does not capture them.

@wkozaczuk

A couple of updates to this issue:

  1. The reason I did not see the ACPI early allocations (or the other early ones) in the osv leak report is that I was running in release mode, which has many optimizations enabled and makes the backtrace analysis in alloc_tracker not work correctly. Once I switched to debug mode, I could see all 600 ACPI allocations, which is in line with my other std::atomic counter. Here are some stack traces:
Merging 603 allocations by identical call chain... generated 55 records.

Allocations still in memory at this time (seq=606):

Found 10600 bytes in 265 allocations [size 40, birth 58/275.1/453]
from:
        std_malloc(unsigned long, unsigned long) (core/mempool.cc:1653)
        malloc (core/mempool.cc:1833)
        AcpiOsAllocate (drivers/acpi.cc:158)
        AcpiUtAllocate (external/x64/acpica/source/components/utilities/utalloc.c:435)
        AcpiUtAllocateZeroed (external/x64/acpica/source/components/utilities/utalloc.c:478)
        AcpiOsAcquireObject (external/x64/acpica/source/components/utilities/utcache.c:422)
        AcpiNsCreateNode (external/x64/acpica/source/components/namespace/nsalloc.c:153)
        AcpiNsSearchAndEnter (external/x64/acpica/source/components/namespace/nssearch.c:468)
        AcpiNsLookup (external/x64/acpica/source/components/namespace/nsaccess.c:658)
        AcpiDsLoad1BeginOp (external/x64/acpica/source/components/dispatcher/dswload.c:402)
        AcpiPsBuildNamedOp (external/x64/acpica/source/components/parser/psobject.c:301)
        AcpiPsCreateOp (external/x64/acpica/source/components/parser/psobject.c:394)
        AcpiPsParseLoop (external/x64/acpica/source/components/parser/psloop.c:551)
        AcpiPsParseAml (external/x64/acpica/source/components/parser/psparse.c:597)
        AcpiNsOneCompleteParse (external/x64/acpica/source/components/namespace/nsparse.c:227)
        AcpiNsParseTable (external/x64/acpica/source/components/namespace/nsparse.c:270)
        AcpiNsLoadTable (external/x64/acpica/source/components/namespace/nsload.c:198)
        AcpiTbLoadNamespace (external/x64/acpica/source/components/tables/tbxfload.c:246)

Found 8712 bytes in 121 allocations [size 72, birth 456/537.3/605]
from:
        std_malloc(unsigned long, unsigned long) (core/mempool.cc:1653)
        malloc (core/mempool.cc:1833)
        AcpiOsAllocate (drivers/acpi.cc:158)
        AcpiUtAllocate (external/x64/acpica/source/components/utilities/utalloc.c:435)
        AcpiUtAllocateZeroed (external/x64/acpica/source/components/utilities/utalloc.c:478)
        AcpiOsAcquireObject (external/x64/acpica/source/components/utilities/utcache.c:422)
        AcpiUtAllocateObjectDescDbg (external/x64/acpica/source/components/utilities/utobject.c:509)
        AcpiUtCreateInternalObjectDbg (external/x64/acpica/source/components/utilities/utobject.c:184)
        AcpiDsBuildInternalObject (external/x64/acpica/source/components/dispatcher/dsobject.c:298)
        AcpiDsCreateNode (external/x64/acpica/source/components/dispatcher/dsobject.c:682)
        AcpiDsLoad2EndOp (external/x64/acpica/source/components/dispatcher/dswload2.c:730)
        AcpiPsParseLoop (external/x64/acpica/source/components/parser/psloop.c:685)
        AcpiPsParseAml (external/x64/acpica/source/components/parser/psparse.c:597)
Full report attached: osv_memory_dump.log
---acpi::early_init---> In non-SMP mode allocated 606 pages (delta 600) in order to allocate 36450 bytes AND deallocated 3 pages
  2. Secondly, almost 800 other allocations (never released) come from the initialization of around 390 tracepoints; the two backtraces below suggest roughly two heap allocations per tracepoint registration, which lines up with that count:
Breakpoint 3, std_malloc (size=size@entry=24, alignment=alignment@entry=16) at core/mempool.cc:1616
1616	    if (size <= memory::pool::max_object_size && alignment <= size && smp_allocator) {
(gdb) bt
#0  std_malloc (size=size@entry=24, alignment=alignment@entry=16) at core/mempool.cc:1616
#1  0x00000000003e3717 in malloc (size=24) at core/mempool.cc:1827
#2  0x0000000000497778 in operator new(unsigned long) ()
#3  0x00000000003fc95c in tracepoint_base::tracepoint_base (this=0xb2b2a0 <_ZN3mmuL15trace_clear_pteE>, 
    _id=<optimized out>, tp_type=..., _name=<optimized out>, _format=<optimized out>) at core/trace.cc:215
#4  0x0000000000202e85 in (anonymous namespace)::tracepointv<1, std::tuple<void*, bool, long unsigned int>(void*, bool,
 long unsigned int), identity_assign<void*, bool, long unsigned int> >::tracepointv (
    format=0x6b786f "ptep=%p, cow=%d, pte=%x", name=<optimized out>, 
    this=0xb2b2a0 <_ZN3mmuL15trace_clear_pteE>) at include/osv/trace.hh:341
#5  __static_initialization_and_destruction_0 (__initialize_p=1, __priority=65535)
    at include/osv/mmu.hh:177
#6  _GLOBAL__sub_I_elf_header () at loader.cc:684
#7  0x0000000000212ca2 in premain () at loader.cc:123
#8  0x00000000002122c2 in start64 () at arch/x64/boot.S:104

and

#7  std::_Hashtable<std::tuple<std::type_info const*, unsigned long>, std::tuple<std::type_info const*, unsigned long>, std::allocator<std::tuple<std::type_info const*, unsigned long> >, std::__detail::_Identity, std::equal_to<std::tuple<std::type_info const*, unsigned long> >, std::hash<std::tuple<std::type_info const*, unsigned long> >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, true, true> >::_M_insert<std::tuple<std::type_info const*, unsigned long> const&, std::__detail::_AllocNode<std::allocator<std::__detail::_Hash_node<std::tuple<std::type_info const*, unsigned long>, true> > > > (__node_gen=<synthetic pointer>..., __v=std::tuple containing = {...}, this=0xb7ee20 <tracepoint_base::known_ids()::_known_ids>)
    at /usr/include/c++/7/bits/hashtable.h:1818
#8  std::__detail::_Insert_base<std::tuple<std::type_info const*, unsigned long>, std::tuple<std::type_info const*, unsigned long>, std::allocator<std::tuple<std::type_info const*, unsigned long> >, std::__detail::_Identity, std::equal_to<std::tuple<std::type_info const*, unsigned long> >, std::hash<std::tuple<std::type_info const*, unsigned long> >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, true, true> >::insert (__v=std::tuple containing = {...}, this=0xb7ee20 <tracepoint_base::known_ids()::_known_ids>)
    at /usr/include/c++/7/bits/hashtable_policy.h:843
#9  std::unordered_set<std::tuple<std::type_info const*, unsigned long>, std::hash<std::tuple<std::type_info const*, unsigned long> >, std::equal_to<std::tuple<std::type_info const*, unsigned long> >, std::allocator<std::tuple<std::type_info const*, unsigned long> > >::insert (__x=std::tuple containing = {...}, this=0xb7ee20 <tracepoint_base::known_ids()::_known_ids>)
    at /usr/include/c++/7/bits/unordered_set.h:420
#10 tracepoint_base::tracepoint_base (this=0xb2b940 <_ZN3mmuL15trace_clear_pteE>, _id=<optimized out>, 
    tp_type=..., _name=<optimized out>, _format=<optimized out>) at core/trace.cc:210
#11 0x000000000020329d in (anonymous namespace)::tracepointv<0, std::tuple<void*, bool, long unsigned int>(void*, bool,
 long unsigned int), identity_assign<void*, bool, long unsigned int> >::tracepointv (
    format=0x6b786f "ptep=%p, cow=%d, pte=%x", name=<optimized out>, 
    this=0xb2b940 <_ZN3mmuL15trace_clear_pteE>) at include/osv/trace.hh:341
#12 __static_initialization_and_destruction_0 (__initialize_p=1, __priority=65535)
    at include/osv/mmu.hh:177
#13 _GLOBAL__sub_I_taskqueue_thread () at bsd/init.cc:57
#14 0x0000000000212ca2 in premain () at loader.cc:123
#15 0x00000000002122c2 in start64 () at arch/x64/boot.S:104
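
A minimal sketch (my simplification of the pattern the backtraces above show): each statically-defined tracepoint object registers itself into a global hash set during static initialization, which heap-allocates in premain(), before the SMP allocator exists.

```cpp
#include <unordered_set>

struct tracepoint_sketch {
    unsigned id;
    static std::unordered_set<unsigned>& known_ids() {
        static std::unordered_set<unsigned> ids;  // node-based, so inserts allocate
        return ids;
    }
    explicit tracepoint_sketch(unsigned i) : id(i) {
        known_ids().insert(id);  // one early heap allocation per tracepoint
    }
};

// Hypothetical example: OSv has ~390 such globals; each constructor runs
// during static initialization, when every small malloc costs a full page.
static tracepoint_sketch trace_example(1);
```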

nyh closed this as completed in fbc22bd Dec 11, 2018

nyh commented Dec 11, 2018

Committed a patch from @wkozaczuk that implements a simplistic but effective allocator for early boot. It saves much more than 0.7 MB: in my tests (and in @wkozaczuk's) a whopping 6 MB are saved.
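
For readers landing here later, a minimal sketch of the general idea (a bump allocator; my simplification, not the actual fbc22bd patch): carve small early allocations sequentially out of the current page, and grab a fresh page only when the current one is exhausted.

```cpp
#include <cstddef>
#include <cstdlib>

static const size_t page_size = 4096;

// Stand-in for OSv's early page allocator, just for this sketch.
static void* alloc_page() {
    return std::aligned_alloc(page_size, page_size);
}

static char* early_page = nullptr;       // current partially-used page
static size_t early_offset = page_size;  // forces a fresh page on first call

// Bump-pointer allocation; assumes alignment is a power of two and
// size <= page_size. Freeing is a no-op in this simplification.
void* early_alloc(size_t size, size_t alignment) {
    size_t off = (early_offset + alignment - 1) & ~(alignment - 1);
    if (early_page == nullptr || off + size > page_size) {
        early_page = static_cast<char*>(alloc_page());
        off = 0;  // a fresh page satisfies any alignment up to page_size
    }
    early_offset = off + size;
    return early_page + off;
}
```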
