Skip to content

Update emmalloc #10145

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 43 commits into from
Jan 9, 2020
Merged

Update emmalloc #10145

merged 43 commits into from
Jan 9, 2020

Conversation

juj
Copy link
Collaborator

@juj juj commented Jan 4, 2020

This PR updates our emmalloc implementation. The basic idea is the same as original dlmalloc/emmalloc, i.e. memory is grouped into adjacent regions where each region is traversable back and forth, and free regions go into arrays of linked lists for quick lookup.

What is different compared to original emmalloc:

  • instead of grouping free memory regions to power-of-2 buckets, rely heavily on clz/ctz instructions to generate a bucket distribution that favors more buckets to small memory regions.
  • uses 64 buckets instead of 32, and uses 64-bit arithmetic if targeting only wasm to compute the buckets. When doing a dual wasm+js build or a js build, use 32-bit arithmetic only.
  • an allocation takes a constant number of operations in the regular case. Only when memory is getting really scarce, there is a linear behavior (w.r.t # free regions) in allocation speed.
  • multithreading capable (using a global lock - more finegrained locking would be possible to implement, but it is unclear whether that is faster, would need good test cases for both contended and uncontended cases to warrant investigation)
  • adds support for Emscripten tracing builds
  • emmalloc is now aware/compliant that other sources may call to sbrk(). If nobody else does, emmalloc still does the optimal unfragmenting thing.
  • adds a new header <emscripten/emmalloc.h> which provides an expanded memory allocation API:
    • in particular functions emmalloc_realloc_try(), emmalloc_realloc_uninitialized() and emmalloc_aligned_realloc_uninitialized() which enable much more efficient C++ STL container implementation than standard std::vector/std::list etc can do.
    • in addition to implementing mallinfo(), provides functions emmalloc_free_dynamic_memory(), emmalloc_unclaimed_heap_memory(), emmalloc_dynamic_heap_size() and emmalloc_compute_free_dynamic_memory_fragmentation_map() to get more precise allocation statistics.
    • supports memory freeing via malloc_trim()

Comparisons:

  • New emmalloc is a hair bit smaller compared to old emmalloc according to benchmark.test_havlak, dlmalloc was 37022 bytes, old emmalloc was 31394 bytes, and new emmalloc is 31180 bytes.
  • New emmalloc is about ~+30% faster than old emmalloc, and ~+1.5% faster than dlmalloc in benchmark.test_havlak:

dlmalloc:
Node.js: mean: 3.427 (+-0.045) secs median: 3.437 range: 3.362-3.487 (noise: 1.304%) (5 runs)
size: 37022, compressed: 3543

old emmalloc:
Node.js: mean: 4.414 (+-0.143) secs median: 4.348 range: 4.295-4.689 (noise: 3.237%) (5 runs)
size: 31394, compressed: 3560

new emmalloc:
Node.js: mean: 3.376 (+-0.032) secs median: 3.391 range: 3.338-3.419 (noise: 0.939%) (5 runs)
size: 31180, compressed: 3602

  • Changing benchmark.test_havlak to multithreaded build with -s USE_PTHREADS=1 so that locking is in place, the now multithreaded emmalloc is ~+7.1% faster than multithreaded dlmalloc:

mt emmalloc:
Node.js: mean: 3.728 (+-0.049) secs median: 3.728 range: 3.642-3.792 (noise: 1.324%) (5 runs)
size: 62711, compressed: 9664

mt dlmalloc:
Node.js: mean: 3.994 (+-0.067) secs median: 3.966 range: 3.915-4.106 (noise: 1.665%) (5 runs)
size: 70338, compressed: 9602

The multithreaded benchmark.test_havlak still runs in a single thread, so the above stresses a noncontended case.

  • According to tests/malloc_bench.cpp (em++ tests\malloc_bench.cpp -o a.html -O3), new emmalloc has less overhead (12.16 vs 12.20) than dlmalloc:

new emmalloc:
checksum: 316dd259
allocations: 4200847
mean alloc size: 31.51
max allocated: 699744
allocations #max 22209.92
sbrk chng: 969720
sbrk chng/allocs: 43.66
overhead: 12.16
sbrk top now: 0x5ede68

dlmalloc:
checksum: 32788815
allocations: 4200847
mean alloc size: 31.51
max allocated: 699744
allocations #max 22209.92
sbrk chng: 970720
sbrk chng/allocs: 43.71
overhead: 12.20
sbrk top now: 0x5ee000

and considerably less overhead than old emmalloc (12.16 vs 15.23):

old emmalloc:
checksum: 362013f1
allocations: 4200847
mean alloc size: 31.51
max allocated: 699744
allocations #max 22209.92
sbrk chng: 1037968
sbrk chng/allocs: 46.73
overhead: 15.23
sbrk top now: 0x5fe540

In summary, it is faster, smaller and has less overhead than dlmalloc and old emmalloc, when tested against benchmark.test_havlak and tests/malloc_bench.cpp.

There is https://github.com/daanx/mimalloc-bench that contains some benchmarks. Looks like most of them would require a bit of porting, but cfrac built quite easily, and

cfrac: dlmalloc is +17% faster than new emmalloc, but new emmalloc is +16.2% faster than old emmalloc.

In this PR src/settings.js is changed to default MALLOC setting from 'dlmalloc' to 'emmalloc' to have the bots run the whole suite through it, to verify that the new emmalloc is a feature complete replacement to dlmalloc. But to land should change back to default to dlmalloc instead.

@juj juj force-pushed the update_emmalloc branch 3 times, most recently from 595e357 to 56e31e3 Compare January 6, 2020 14:27
@juj
Copy link
Collaborator Author

juj commented Jan 7, 2020

Restored dlmalloc in this PR after checking through the suite for everything to pass.

@@ -6,7 +6,6 @@
var LibraryPThread = {
$PThread__postset: 'if (!ENVIRONMENT_IS_PTHREAD) PThread.initMainThreadBlock(); else PThread.initWorker();',
$PThread__deps: ['$PROCINFO', '_register_pthread_ptr',
'emscripten_main_thread_process_queued_calls',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Split out this change along with src/deps_info.json? Seems unrelated but maybe I'm wrong?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It only started failing in this branch because of the deps structure changing.

// allocated with one of the emmalloc memory allocation functions (malloc, memalign, ...).
// If called with size == 0, the pointer ptr is freed, and a null pointer is returned. If
// called with null ptr, a new pointer is allocated.
void *realloc(void *ptr, size_t size);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to re-declare all the POSIX names here? Would it be better to use the ones from stdlib.h and only declare the specific emmalloc symbols in this header?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's more documentative to re-declare them here. No harm in doing so, in fact, if there ever were a change in signature of one of the names, then this would loudly complain, and it'd trigger to change the signature of the corresponding emmalloc_ name as well.

// Returns 1 on success, 0 otherwise. In order to call this function,
// the page must have been compiled with -s ALLOW_MEMORY_GROWTH=1 linker
// flag.
int emscripten_realloc_buffer(size_t requested_size);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unclear about heap growth vs growing the underlying wasm memory. Are they always the same thing?

Does emscripten_resize_heap above also require ALLOW_MEMORY_GROWTH? I'm a little confused a bout the use of the term "realloc" which normally implies the buffer could be moved. Maybe there could be a better name for this function? Is it basically like emscripten_resize_heap bug taking an absolute size?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this function doesn't seem very well named perhaps we can avoid exposing it and keep it as an implementation detail. I see its only used in test code anyway. Could the test be modified to not depend on the this currently internal function?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intent is to be able to use emscripten_realloc_buffer in public code to resize the heap. In JS code we can use that to actually shrink the heap and release memory back to the browser.

emscripten_resize_heap only grows the heap with geometric/linear overgrowth, suitable for sbrk()-like allocators. emscripten_realloc_buffer is suitable for someone who is implementing their own malloc() that does not use sbrk().

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current emscripten_resize_heap seems to take an absolute number of bytes not an increment. The fact that it can't shrink is a current implementation detail isn't it? With wasm IIUC there it is never possible to shrink memory I doubt that will ever change.

You are renaming the parameter to requested_growth_bytes but I don't see the implementation changing.

I think the terms heap and buffer and realloc are all a little confusing. IIUC what is really being resized here is the webassembly memory which hold static data, stack and heap.

Maybe we can come up with better names for these and some documentation. Do we really need two different memory resize functions? If we do shouldn't they be called something like:

  • emscripten_memory_grow
  • emscripten_memory_resize

?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are renaming the parameter to requested_growth_bytes but I don't see the implementation changing.

Thanks, good catch, when doing a documentation pass I misremembered the function argument behavior. Updated.

Do we really need two different memory resize functions?

Yes - while the automatic overgrowing is useful for some applications, for others it gets in the way if they don't want to do sbrk() or similar. I like the names emscripten_memory_grow() and emscripten_memory_resize, let's go with those. @kripken wdyt?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, on another thought, I don't want to grow this PR and start renaming functions here, that should be for later. Note that the implementation here does not add any new heap resize/grow functionality, but just exposes the function that has already existed for a long time.

@@ -0,0 +1,24 @@
mergeInto(LibraryManager.library, {
emmalloc_unclaimed_heap_memory__deps: ['emscripten_get_sbrk_ptr'],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also looks like it just used in tests. Is it worth added this JS library just for sake of a test case? Maybe there is some way to avoid it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intent is to expose that function to public API for users of emmalloc, hence testing that it works in that test.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But isn't that function internal-only right now? So the fact that it has a bad name is currently not an issue?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this should be split out into a separate PR btw. Although I'm still not convinced about the need to shrink memory (since its not possible with wasm).

@juj juj force-pushed the update_emmalloc branch from 9dfbe09 to da90049 Compare January 8, 2020 20:37
@juj juj force-pushed the update_emmalloc branch from da90049 to 5752116 Compare January 9, 2020 08:13
@juj juj merged commit d4c3592 into emscripten-core:incoming Jan 9, 2020
@sbc100
Copy link
Collaborator

sbc100 commented Jan 9, 2020

I'm seeing local failures and some CI failures in wasm0.test_emmalloc_memory_statistics.

Locally:

$ ./tests/runner.py wasm0.test_emmalloc_memory_statistics
./tests/runner.py:WARNING: use EMTEST_ALL_ENGINES=1 in the env to run against all JS engines, which is slower but provides more coverage
Test suites:
['test_core']
Running test_core: (1 tests)
(checking sanity from test runner)
shared:INFO: (Emscripten: Running sanity checks)
test_emmalloc_memory_statistics (test_core.wasm0) ... (test did not pass in JS engine: ['/usr/bin/nodejs'])
FAIL

======================================================================
FAIL: test_emmalloc_memory_statistics (test_core.wasm0)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/google/home/sbc/dev/wasm/emscripten/tests/runner.py", line 116, in decorated
    func(self, *args, **kwargs)
  File "/usr/local/google/home/sbc/dev/wasm/emscripten/tests/test_core.py", line 194, in decorated
    func(self)
  File "/usr/local/google/home/sbc/dev/wasm/emscripten/tests/runner.py", line 116, in decorated
    func(self, *args, **kwargs)
  File "/usr/local/google/home/sbc/dev/wasm/emscripten/tests/test_core.py", line 931, in test_emmalloc_memory_statistics
    self.do_run_in_out_file_test('tests', 'core', 'test_emmalloc_memory_statistics')
  File "/usr/local/google/home/sbc/dev/wasm/emscripten/tests/test_core.py", line 276, in do_run_in_out_file_test
    self.do_run_from_file(src, output, **kwargs)
  File "/usr/local/google/home/sbc/dev/wasm/emscripten/tests/runner.py", line 1120, in do_run_from_file
    self.do_run(open(src).read(), open(expected_output).read(), *args, **kwargs)
  File "/usr/local/google/home/sbc/dev/wasm/emscripten/tests/runner.py", line 1174, in do_run
    self.assertContained(expected_output, js_output, check_all=assert_all)
  File "/usr/local/google/home/sbc/dev/wasm/emscripten/tests/runner.py", line 856, in assertContained
    additional_info
AssertionError: Expected to find '1
0
106971424
4210892
3
0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 
21997632
' in '1
0
106971424
4210892
3
0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 
21997584
', diff:

--- expected
+++ actual
@@ -4,5 +4,5 @@
 4210892
 3
 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
-21997632
+21997584

CI: https://app.circleci.com/jobs/github/emscripten-core/emscripten/227658

Its strange because the CI for this change seems to have passed ..

@@ -35,6 +35,8 @@
}

void *sbrk(intptr_t increment) {
// Enforce preserving a minimal 4-byte alignment for sbrk.
increment = (increment + 3) & ~3;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This breaks the use case of sbrk(0) being used to find the current break point, I think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh no wait, it's still zero then, nm...

@@ -146,7 +146,7 @@ void realloc() {
stage("realloc0");
emmalloc_blank_slate_from_orbit();
for (int i = 0; i < 2; i++) {
char* ptr = (char*)malloc(10);
char* ptr = (char*)malloc(100);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why was this changed?

@Akaricchi
Copy link
Contributor

Changing MALLOC_ALIGNMENT (and ALLOCATION_UNIT in the test) to 16 makes this assertion fail:

assert(size_t(third) == size_t(first) + ((100 + ALLOCATION_UNIT - 1)&(-ALLOCATION_UNIT)) + ALLOCATION_UNIT); // allocation units are multiples of ALLOCATION_UNIT

third - first is still 112, as if ALLOCATION_UNIT == 8. The pointers seem appropriately aligned, however. I'm not sure I understand the purpose of this test.

@Akaricchi
Copy link
Contributor

Seems like ALLOCATION_UNIT is no longer equal to alignment, but to SMALLEST_ALLOCATION_SIZE (which was equal to alignment in the previous implementation). I've changed the test like this:

diff --git a/system/lib/emmalloc.cpp b/system/lib/emmalloc.cpp
index 417ae3f33..27d35dd39 100644
--- a/system/lib/emmalloc.cpp
+++ b/system/lib/emmalloc.cpp
@@ -40,6 +40,7 @@
  *    malloc.
  */
 
+#include <stddef.h>
 #include <stdint.h>
 #include <unistd.h>
 #include <memory.h>
@@ -61,7 +62,7 @@ extern "C"
 
 // Configuration: specifies the minimum alignment that malloc()ed memory outputs. Allocation requests with smaller alignment
 // than this will yield an allocation with this much alignment.
-#define MALLOC_ALIGNMENT 8
+#define MALLOC_ALIGNMENT __alignof__(max_align_t)
 
 // Configuration: If EMMALLOC_USE_64BIT_OPS is specified, emmalloc uses 64 buckets for free memory regions instead of just 32.
 // When building to target asm.js/wasm2js, 64-bit ops are disabled, but in Wasm builds, 64-bit ops are enabled. (this is
diff --git a/tests/core/test_emmalloc.cpp b/tests/core/test_emmalloc.cpp
index 4f86f2f25..9730bb1a2 100644
--- a/tests/core/test_emmalloc.cpp
+++ b/tests/core/test_emmalloc.cpp
@@ -38,7 +38,7 @@ void stage(const char* name) {
   }, name);
 }
 
-const size_t ALLOCATION_UNIT = 8;
+#define ALLOCATION_UNIT SMALLEST_ALLOCATION_SIZE
 
 void basics() {
   stage("basics");
@@ -58,7 +58,7 @@ void basics() {
   assert(size_t(third) == size_t(first) + ((100 + ALLOCATION_UNIT - 1)&(-ALLOCATION_UNIT)) + ALLOCATION_UNIT); // allocation units are multiples of ALLOCATION_UNIT
   stage("allocate 10 more");
   void* four = malloc(10);
-  assert(size_t(four) == size_t(third) + (2*ALLOCATION_UNIT) + ALLOCATION_UNIT); // payload (10 = 2 allocation units) and metadata
+  assert(size_t(four) == size_t(ALIGN_UP((char*)(third) + (2*ALLOCATION_UNIT) + REGION_HEADER_SIZE, MALLOC_ALIGNMENT))); // payload (10 = 2 allocation units) and metadata
   stage("free the first");
   free(second);
   stage("several temp alloc/frees");

However, the min_alloc test fails now. Overall the test seems to conflate REGION_HEADER_SIZE, SMALLEST_ALLOCATION_SIZE, and MALLOC_ALIGNMENT. They all just happen to be 8 by default, so it "works". I don't understand the details enough to unravel this in a reasonable amount of time. Please fix the test to work with arbitrary valid MALLOC_ALIGNMENTs.

@sbc100
Copy link
Collaborator

sbc100 commented Jan 10, 2020

@juj
Copy link
Collaborator Author

juj commented Jan 10, 2020

#10183 should resolve the builder issues. I am not sure why, but the dynamic top sizes end up being slightly different for different systems. Perhaps static data section sizes end up being different on different runs(?)

@sbc100
Copy link
Collaborator

sbc100 commented Jan 10, 2020

Thanks for the quick fix! The non-determinism there is pretty strange. Did you ever see it locally, or is it just between different machines? Is it worth opening a bug and trying to track it down?

@juj
Copy link
Collaborator Author

juj commented Jan 10, 2020

I only saw it on the CI - my first suspect would be \r\n vs \n line endings, so --output-eol might help diagnose there. Though not completely sure. It might be worth a bug entry.

Btw, do you know what would need to be done to make emmalloc compatible with ASan or LSan? That would be quite interesting I think.

@juj
Copy link
Collaborator Author

juj commented Jan 10, 2020

@Akaricchi : dropped that whole part of the test, it was testing invariants that were specific to previous emmalloc version and no longer hold with the updated emmalloc, so that code was good to go away.

belraquib pushed a commit to belraquib/emscripten that referenced this pull request Dec 23, 2020
* Update emmalloc

* Use 64bit ops in emmalloc.

* Add emmalloc.h header

* Ensure test_setjmp_noleak tests against dlmalloc

* Fix emmalloc test and 64-bit ctz

* Update dlfcn test and aligned_alloc linkage

* Adjust last resort lookup for 64-bit buckets build

* Remove support for non-sbrk emmalloc use.

* Add emmalloc_usable_size().

* Revise docs

* Add validation functions and expand emmalloc api.

* Add use of EMMALLOC_DEBUG and EMMALLOC_DEBUG_LOG. Add testing for emmalloc memory statistics.

* Remove old emmalloc code.

* Restore EMMALLOC_USE_64BIT_OPS

* Be diligent at looking at all available memory if sbrk() fails.

* Add emmalloc_unclaimed_heap_memory().

* Implement malloc_trim() and promote emscripten_realloc_buffer() to a public API.

* Micro-optimize compute_free_list_bucket().

* Fix bad order of 32-bit and 64-bit code when looking for free memory buckets

* Add missing include.

* flake

* Fix tests

* Stop using LLVM attribute alias(xxx), as it does not link properly to JS code

* Remove incorrect JS->C dependency in PThread.

* Use emmalloc_trim()

* Update emmalloc_trim() test

* Update tests.

* Restore dlmalloc as default allocator.

* Fix use of 32-bit emmalloc in wasm backend wasm2js mode.

* Fix emmalloc build

* Fix test_emmalloc back to the original form to avoid malloc() calls getting optimized out in -O2 and higher

* CI fix

* Micro-optimize size

* Move emmalloc tests

* Fix upstream wasm test

* Drop malloc impl if building with USES_DYNAMIC_ALLOC=0

* Make emscripten_realloc_buffer internal again

* Update test_emmalloc_trim

* Update test_minimal_runtime_code_size. Emmalloc regresses fastcomp wasm backend sizes by 90 bytes, but improves wasm backend wasm sizes by 100 bytes, so that's ok.

* Update minimal runtime code size tests

* Run tests only in debug

* Work around bug emscripten-core#10173

* Update test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants