Skip to content

fix(node_loader): plug 48-byte trampoline leak per metacall_await call (#578)#641

Draft
k5602 wants to merge 2 commits into
metacall:developfrom
k5602:fix-node-loader
Draft

fix(node_loader): plug 48-byte trampoline leak per metacall_await call (#578)#641
k5602 wants to merge 2 commits into
metacall:developfrom
k5602:fix-node-loader

Conversation

@k5602
Copy link
Copy Markdown
Contributor

@k5602 k5602 commented Feb 24, 2026

Description

Each call to metacall_await / metacall_await_future allocates a loader_impl_async_future_await_trampoline_type struct (48 bytes) on the C++ heap and passes it into JavaScript as an napi_wrap-ped object (trampoline_ptr). When the JS promise settles, bootstrap.js calls back into node_loader_trampoline_resolve or node_loader_trampoline_reject, which extract the pointer via napi_remove_wrap.

napi_remove_wrap detaches the native pointer from the JS object and suppresses the NAPI finalizer — transferring full ownership to C. Before this fix, the pointer was used and then silently abandoned, producing a 48-byte leak per settled await call. Across the 3 allocations reported in #578 that accounts for the 144-byte figure observed under both ASAN and Heaptrack.

Note: the function-await path (loader_impl_async_func_await_trampoline_type in node_loader_impl.cpp) is not affected — it uses napi_wrap with an explicit node_loader_impl_async_func_await_finalize finalizer, so the GC owns that allocation. The leak is isolated to the future/direct-await path handled by node_loader_trampoline.cpp.

Fix

Wrap the detached pointer in a std::unique_ptr immediately after napi_remove_wrap in both node_loader_trampoline_resolve and node_loader_trampoline_reject:

loader_impl_async_future_await_trampoline trampoline =
    static_cast<loader_impl_async_future_await_trampoline>(result);
std::unique_ptr<loader_impl_async_future_await_trampoline_type> trampoline_guard(trampoline);

// trampoline_guard destructs at end of block, after return value is fully materialised
return trampoline->resolve_trampoline(trampoline->node_impl, env, trampoline->resolve_callback, recv, args[1], trampoline->context);

No behaviour change. The return expression evaluates fully before the unique_ptr destructor runs (end-of-block), so there is no use-after-free.

Fixes #578

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Checklist:

  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation. (internal implementation detail; no public API surface changed)
  • My changes generate no new warnings.
  • I have added tests/screenshots (if any) that prove my fix is effective or that my feature works.
  • I have tested the tests implicated (if any) by my own code and they pass (make test or ctest -VV -R <test-name>).
  • If my change is significant or breaking, I have passed all tests with ./docker-compose.sh test &> output and attached the output. (non-breaking single-file fix; full Docker suite not run locally)
  • I have tested my code with OPTION_BUILD_ADDRESS_SANITIZER or ./docker-compose.sh test-address-sanitizer &> output and OPTION_TEST_MEMORYCHECK. (built with -fsanitize=address,leak; AwaitTrampolineOwnershipRegression passes with zero Direct leak / Indirect leak reports)
  • I have tested my code with OPTION_BUILD_THREAD_SANITIZER or ./docker-compose.sh test-thread-sanitizer &> output. (fix is single-threaded ownership; no new synchronisation primitives introduced)
  • I have tested with Helgrind in case my code works with threading. (no threading changes; unique_ptr destructs on the same thread that called napi_remove_wrap)
  • I have run make clang-format in order to format my code and my code follows the style guidelines.

Test: AwaitTrampolineOwnershipRegression

Added to metacall-node-async-test, exercising all three trampoline settlement paths:

# Function JS behaviour Expected C callback
1 f(x)new Promise(r => r(x)) resolves resolve_cb fired
2 g(x)new Promise((_, r) => r(x)) rejects reject_cb fired
3 i()async throw exception → reject reject_cb fired

Post-metacall_destroy() assertions confirm exactly 1 resolve and 2 reject callbacks fired. Under ASAN with detect_leaks=1 the test produces a clean report with the fix applied.

ctest -VV -R metacall-node-async-test
...
[ OK ] metacall_node_async_test.AwaitTrampolineOwnershipRegression (243 ms)
[  PASSED  ] 1 test.

Additional info

this is my first pr in my GSoC campaign so i hope it's good and very welcoming to reviews from mentors.

@k5602 k5602 force-pushed the fix-node-loader branch 2 times, most recently from a9e936b to d977527 Compare February 26, 2026 15:32
@viferga
Copy link
Copy Markdown
Member

viferga commented Mar 2, 2026

@k5602 could you reproduce this bug? Because I have had it inside the linux distributable but it does not happen inside the CI. Does your test reproduce the problem?

@viferga viferga marked this pull request as draft March 4, 2026 14:14
@k5602 k5602 closed this Mar 4, 2026
@k5602 k5602 reopened this Mar 4, 2026
@k5602 k5602 force-pushed the fix-node-loader branch from d977527 to a24e302 Compare March 11, 2026 17:04
@k5602 k5602 force-pushed the fix-node-loader branch from dfd9cbe to 06ab59e Compare March 11, 2026 19:44
@k5602 k5602 force-pushed the fix-node-loader branch from c34e35a to 89b88a6 Compare March 12, 2026 04:27
@k5602 k5602 marked this pull request as ready for review March 12, 2026 07:38
@viferga
Copy link
Copy Markdown
Member

viferga commented Mar 12, 2026

I cannot reproduce it, I am going to leave it as a draft, if we achieve to reproduce it in some way, then we have this as reference. Upload the full logs by the way of all tests:

ctest --output-on-failure -j24 &> output

@viferga viferga marked this pull request as draft March 12, 2026 16:16
@k5602
Copy link
Copy Markdown
Contributor Author

k5602 commented Mar 17, 2026

I cannot reproduce it, I am going to leave it as a draft, if we achieve to reproduce it in some way, then we have this as reference. Upload the full logs by the way of all tests:

ctest --output-on-failure -j24 &> output

image

because it's suppressed, it's not internal but it runs through libnode.so so it's suppressed
so the question is that can i remove suppression rule because "for now" is so ambiguous.
@viferga

@viferga
Copy link
Copy Markdown
Member

viferga commented Mar 26, 2026

I was working on removing the leak suppression recently. In any case right now we also have this CI that may detect that issue better:
https://github.com/metacall/core/actions/runs/23581869489/job/68667222906

@viferga
Copy link
Copy Markdown
Member

viferga commented Mar 26, 2026

In any case we can do peer programming for removing the suppression and review the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Issue on MetaCall CLI Linux

2 participants