
Conversation

@rutenkolk
Contributor

Hi, in this pull request I propose some performance improvements that mainly target ffi/defcfn-defined functions taking struct arguments.

While developing a test application using the latest version of coffi, I noticed that performance took a big hit when repeatedly calling a function whose arguments are serialized via defstruct-defined serdes. Profiling showed that the majority of time is spent in mem/size-of and mem/align-of:

[JMC profiling screenshot: hot methods dominated by mem/size-of and mem/align-of]

The reason for this is that the serde wrapper for FFI functions called mem/alloc-instance and mem/serialize-into for non-primitive arguments, which results in calls to mem/size-of and mem/align-of, which in turn go through mem/c-layout. mem/c-layout is a fairly expensive function on top of being a multimethod, and here it is called multiple times for every argument.

One optimization proposed here is to memoize calls to mem/size-of and mem/align-of whose argument is not a MemoryLayout.
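
As a sketch of what that memoization could look like (the names and exact caching strategy here are illustrative, not the PR's actual code):

```clojure
;; Illustrative sketch: memoize layout computations for descriptor types
;; (keywords and vectors), but pass raw MemoryLayout objects through
;; uncached, matching the "not a MemoryLayout" condition above.
(def ^:private memoized-size-of (memoize mem/size-of))

(defn cached-size-of
  "Like mem/size-of, but cached for non-MemoryLayout descriptors."
  [type]
  (if (instance? java.lang.foreign.MemoryLayout type)
    (mem/size-of type)
    (memoized-size-of type)))
```

The same treatment applies to mem/align-of.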

This improved performance, but unfortunately not by as much as I had hoped.

Therefore, another improvement proposed here is generating a call to mem/alloc with the size and alignment baked in at macroexpansion time, instead of recomputing them through mem/alloc-instance every time the FFI call is made.
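
To make the difference concrete, assume a hypothetical ::point struct type whose layout is known while the wrapper is being generated (the argument order of mem/alloc shown here is illustrative):

```clojure
;; Per-call path: re-derives size and alignment (through mem/c-layout)
;; on every invocation.
(mem/alloc-instance ::point arena)

;; What the generated wrapper can emit instead: 8 and 4 stand for the
;; results of (mem/size-of ::point) and (mem/align-of ::point), computed
;; once at macroexpansion time.
(mem/alloc 8 4 arena)
```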

The next bottleneck was the call to mem/serialize-into, which had a similar issue to mem/alloc-instance: it needs to dispatch through the multimethod on mem/type-dispatch with the serde descriptor.

Leveraging the serde registry introduced with the defstruct macro, we can emit an inline serde call whenever one exists in the registry. Together with some added type hints for defstruct serdes, this eliminates all calls to mem/size-of, mem/align-of, and mem/type-dispatch, and is my last proposed optimization.
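
Conceptually, the wrapper generator consults the registry during macroexpansion and splices in the registered serializer whenever one exists, falling back to the multimethod otherwise. A hand-waved sketch, with a hypothetical registry name and shape:

```clojure
;; Hypothetical sketch of the expansion-time decision; the real registry
;; populated by defstruct differs in shape and naming.
(defn serialize-form [type value seg arena]
  (if-let [inline-serde (get @serde-registry type)]
    ;; Struct type known at expansion time: emit the registered serializer
    ;; form directly, so the expansion contains no multimethod dispatch.
    (inline-serde value seg arena)
    ;; Otherwise, fall back to runtime dispatch.
    `(mem/serialize-into ~value ~type ~seg ~arena)))
```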

With these changes in place, the actual allocation of the segments in the confined arena becomes the dominant cost of the whole FFI call, suggesting little room for further gains:

[JMC profiling screenshot: arena allocation dominating the FFI call]

I unfortunately don't have a rigorous benchmark for the impact of the improvements, but in my private raylib example I started with an FPS of around 100 and ended somewhere over 4000.

@rutenkolk
Contributor Author

There were a few concerns with the proposals.

  • One was a scaling concern with memoizing size-of and align-of: even something like [::mem/array ::mem/int n] would generate a new cache entry for every n, so the cache could grow without bound. I have reverted the memoization.

  • Another was that it is not necessarily guaranteed that size-of and align-of can be deduced at macroexpansion time from the type given to inline-serde-wrapper. I added safeguards that dynamically check whether this is possible and only then take the static path (see the sketch after this list).
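
The safeguard amounts to attempting the layout computation during expansion and only baking the constants in when it succeeds. In spirit (purely illustrative; `type` and `arena-sym` stand for values the macro already has):

```clojure
;; Purely illustrative: try to resolve the layout while expanding. If the
;; type can't be resolved yet, emit the fully dynamic per-call path instead.
(defn static-layout [type]
  (try
    [(mem/size-of type) (mem/align-of type)]
    (catch Exception _ nil)))

(if-let [[size align] (static-layout type)]
  `(mem/alloc ~size ~align ~arena-sym)
  `(mem/alloc-instance ~type ~arena-sym))
```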

Further optimizations proposed now:

  • Inlining for alloc. It turns out that calls to alloc came with significant overhead because long (RT.longCast) was called on the size argument indiscriminately. Inlining alloc, with the requirement that the inlined version produce a primitive long, eliminates this overhead (see the sketch below this list).

  • defcfn now inlines serde multimethods when they are available. This eliminates virtually all multimethod lookup when calling a defcfn-defined function, which is a big cost saving for functions taking struct arguments.
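
For reference, Clojure's :inline metadata is the mechanism behind the alloc change; a simplified sketch of the shape (not coffi's actual definition):

```clojure
;; Simplified sketch: when the compiler inlines the call, the expansion
;; type-hints the arena and keeps the size coercion on the primitive long
;; overload, instead of boxing the argument through RT.longCast(Object)
;; as a plain function call would.
(defn alloc
  "Allocates size bytes of native memory from arena."
  {:inline (fn [size arena]
             `(.allocate ~(vary-meta arena assoc :tag 'java.lang.foreign.Arena)
                         (long ~size)))}
  [size arena]
  (.allocate ^java.lang.foreign.Arena arena (long size)))
```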

One point of contention with the serde inlining: I'm personally not perfectly happy with how it is implemented. The function objects themselves are inlined into the forms, which is a bit unclean, since functions unfortunately aren't really data in Clojure. It works, but it's not the perfect solution. Ideally we would wrap the whole expression in a let, bind the functions there via get-method, and refer to them via the symbols introduced by the let. I'm not certain whether this is possible, or whether it should be done in inline-serde-wrapper itself or outside of it; it would also require a more major rework of this code, since we would have to know a priori which multimethods we want to call.
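
For illustration, that let-wrapped shape might look roughly like this, assuming a hypothetical ::point struct type (this is the alternative being described, not what the PR currently emits):

```clojure
;; Resolve the multimethod implementation once via get-method and refer
;; to it through a let-bound symbol, instead of embedding the function
;; object itself in the emitted form.
(let [serialize-point (get-method mem/serialize-into ::point)]
  (fn wrapper [point-value arena]
    (let [seg (mem/alloc 8 4 arena)]
      (serialize-point point-value ::point seg arena)
      ;; ... pass seg on to the downcall handle ...
      seg)))
```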

  • At this point, allocating memory itself, creating confined arenas, and doing bounds checks heavily dominate the cost of calling a function. This is already nice, but we can do a bit better.

    Looking into how confined arenas actually work: they don't do any stack allocation, because you can allocate as much as you want from them, so behind the scenes they really just malloc. Once a confined arena is closed, though, the memory is released immediately, only for subsequent FFI calls to malloc again. It would be good to reuse that memory, for FFI calls but also in general. It can't be reused globally, however, since different threads might call the same function.

    Inspired by https://bugs.openjdk.org/browse/JDK-8348189, I propose introducing the thread-local macro and the thread-local-arena function, which returns a thread-local confined arena that reuses its memory. The linked issue mentions a potential scaling problem when many threads make FFI calls, especially virtual or short-lived threads. For that purpose I also introduce terminating-thread-local, which is the same thing except that on thread exit it runs a cleanup action from a different thread, mimicking the behavior of the JDK-internal class jdk.internal.misc.TerminatingThreadLocal. I keep the two separate, since terminating-thread-local is slightly more expensive than a plain thread-local because it needs to do a bit of bookkeeping. With this in place, all thread-local arenas are closed when the thread that initialized them exits, preventing a memory leak.

    The actual Arena class implemented, ThreadLocalConfinedArena, is initialized only once per thread, is returned via thread-local-arena, and basically works via a SegmentAllocator.slicingAllocator over a buffer segment (a sketch of this mechanism follows below). It supports nested uses of thread-local-arena, allocates new buffers when it runs out of memory, and once it has been closed as many times as it was opened, it frees the memory and allocates a new buffer sized according to how much total allocation happened during the last allocation cycle.
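
As a concrete sketch of the mechanism, here is the slicing-allocator trick in plain FFM API terms; this is not ThreadLocalConfinedArena itself, just its core idea of one upfront allocation followed by cheap bump allocation:

```clojure
(import '(java.lang.foreign Arena SegmentAllocator))

;; One backing buffer is malloc'd up front...
(def backing-arena (Arena/ofConfined))
(def buffer (.allocate backing-arena (* 64 1024)))  ; 64 KiB backing segment

;; ...and a slicing allocator hands out consecutive slices of it. Each
;; "allocation" is a bounds check plus a pointer bump (which is where
;; NativeMemorySegmentImpl.dup shows up in the profiles below), no malloc.
(def slicer (SegmentAllocator/slicingAllocator buffer))

(def a (.allocate slicer 8))   ; slice at offset 0
(def b (.allocate slicer 8))   ; slice at offset 8
```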

Before the thread-local-arena there was a lot of actual allocation when calling a native function:
[JMC profiling screenshot: allocation costs before thread-local-arena]

But with the thread-local-arena the allocation cost disappears; now it is the set and get calls on segments that dominate the cost:
[JMC profiling screenshot: segment set/get dominating after thread-local-arena]

Looking at memory profiling, the creation of NativeMemorySegmentImpl objects happened mostly through SegmentFactories.allocateSegment (which is an actual malloc call):
[JMC memory profile: NativeMemorySegmentImpl created via SegmentFactories.allocateSegment]

and with thread-local-arena it happened mostly through NativeMemorySegmentImpl.dup:
[JMC memory profile: NativeMemorySegmentImpl created via NativeMemorySegmentImpl.dup]

NativeMemorySegmentImpl.dup is probably the most efficient way to create a segment, as it simply offsets the pointer of an existing address, as per https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/jdk/internal/foreign/NativeMemorySegmentImpl.java#L72
