Description
Currently std/tasks is implemented the following way:
type
  Task* = object ## `Task` contains the callback and its arguments.
    callback: proc (args: pointer) {.nimcall, gcsafe.}
    args: pointer
    destroy: proc (args: pointer) {.nimcall, gcsafe.}
Any function call with an argument triggers an allocation. This is something that runtimes using std/tasks can't avoid.
Given that runtimes also have to allocate a future/flowvar, scheduling a task always incurs allocation overhead, even if only one or two arguments need to be passed.[1]
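For illustration, here is a minimal use of the public API (the `addInto` proc is just an example); because the call has arguments, `toTask` heap-allocates a scratch object to hold them:

```nim
import std/tasks

proc addInto(res: ptr int; a, b: int) =
  res[] = a + b

var r = 0
# `toTask` packs the callback and its arguments into a `Task`;
# the scratch object holding the arguments is allocated with `c_calloc`.
let t = toTask addInto(addr r, 1, 2)
t.invoke()
doAssert r == 3
```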
GCC and Clang have an optimization where std::function can store up to 8 (?) bytes inline in the data structure, falling back to the heap otherwise.
Implementing a similar optimization would significantly reduce memory overhead and memory fragmentation for long-running tasking runtimes:
- Tasks are most useful in a multithreaded context, where they are moved between threads. Reclaiming memory allocated on another thread is non-trivial for allocators and requires synchronization and advanced techniques to avoid contention, such as the extreme freelist sharding in the mimalloc allocator.
- Tasks may be quickly allocated and deallocated, leading to fragmentation, especially if the args are small.
- As the number of cores grows, fine-grained tasks become more important to exploit multiple cores, leading to more std/tasks overhead.
--
[1]: Runtimes actually don't have to allocate a future/flowvar if it doesn't escape; they can use the caller's stack frame if the compiler helps with heap allocation elision: (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p2477r0.html#design-and-example, http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0981r0.html)
Proposal
For a size N to be decided, change the type to:
type
  Task* = object ## `Task` contains the callback and its arguments.
    callback: proc (args: pointer) {.nimcall, gcsafe.}
    args: pointer
    destroy: proc (args: pointer) {.nimcall, gcsafe.}
    argsBuffer: array[N, byte]
`args` would point either to the buffer or to the heap. The change would be minimal and would mainly require changing this:
  let scratchObjPtrType = quote do:
    cast[ptr `scratchObjType`](c_calloc(csize_t 1, csize_t sizeof(`scratchObjType`)))
to something along these lines:
  let scratchObjPtrType = quote do:
    when sizeof(`scratchObjType`) <= N:
      cast[ptr `scratchObjType`](task.argsBuffer.addr)
    else:
      cast[ptr `scratchObjType`](c_calloc(csize_t 1, csize_t sizeof(`scratchObjType`)))
at https://github.com/nim-lang/Nim/blob/d102b2f54c41f19a0545f28109d0a93550b5d886/lib/std/tasks.nim#L187
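To make the intent concrete, here is a standalone sketch of the small-buffer strategy, independent of the `toTask` macro (the `SboTask` type and `setupArgs` helper are illustrative names, not std/tasks API):

```nim
const N = 40  # candidate buffer size, see the discussion below

type
  SboTask = object
    callback: proc (args: pointer) {.nimcall, gcsafe.}
    args: pointer
    destroy: proc (args: pointer) {.nimcall, gcsafe.}
    argsBuffer: array[N, byte]

proc setupArgs[T](t: var SboTask; payload: T) =
  ## Store `payload` inline when it fits, on the heap otherwise.
  when sizeof(T) <= N:
    copyMem(t.argsBuffer.addr, payload.unsafeAddr, sizeof(T))
    t.args = t.argsBuffer.addr
  else:
    t.args = alloc0(sizeof(T))
    copyMem(t.args, payload.unsafeAddr, sizeof(T))

var t: SboTask
t.setupArgs((1'i64, 2'i64))           # 16 bytes: stored inline
doAssert t.args == t.argsBuffer.addr
t.setupArgs(array[64, byte].default)  # 64 bytes: falls back to the heap
doAssert t.args != t.argsBuffer.addr
dealloc(t.args)
```

One caveat this sketch exposes: when the arguments are stored inline, `args` points into the task itself, so moving or copying the task invalidates the pointer. A real implementation would either recompute `args` after a move or use a sentinel (e.g. `args == nil`) to mean "inline".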
Discussion of the buffer size
The current Task object has a size of 24 bytes (three pointer-sized fields on 64-bit).
- Buffer of size 8: this rounds the total size up to 32, a nice power of 2. However, the majority of functions need more than one argument.
- Buffer of size 16 or 24: many functions have the form proc(a, b): result or proc(r, a, b); these sizes would cover those efficiently.
- Buffer of size 40:
  - This rounds the total size up to 64, a nice power of 2.
  - 64 bytes is the cache-line size on most CPUs (when it's not 128); if a task is aligned, it can be loaded from the L1 cache in a single cache read.
  - This buffer size allows storing up to 5 int/pointer arguments, or 2 openArrays plus 1 pointer/int (for example a callback when doing a parallel map, or a pointer to a result location). Working on large openArrays is often a very compelling reason to divide work into tasks.
  - x86-64 can pass up to 6 arguments through registers and ARM64 up to 8, so there won't be stack spilling with that buffer size.
- Also, due to false sharing, a runtime that stores tasks contiguously, for example in a shared task queue, must make sure that each task starts 64 bytes apart (or even 128 bytes, since CPUs prefetch 2 cache lines at a time). Even if tasks were smaller, that space would be unused.
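The size arithmetic for N = 40 can be checked at compile time on a 64-bit target (`Task40` is a hypothetical layout, not the proposed final name):

```nim
type
  Task40 = object  # hypothetical layout with N = 40
    callback: proc (args: pointer) {.nimcall, gcsafe.}
    args: pointer
    destroy: proc (args: pointer) {.nimcall, gcsafe.}
    argsBuffer: array[40, byte]

# 3 * 8 bytes of proc/pointer fields + 40 bytes of buffer = 64 bytes,
# exactly one cache line on most CPUs.
static: doAssert sizeof(Task40) == 64
```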