
std/tasks: "small functions" optimization #443

Open

mratsim opened this issue Jan 16, 2022 · 5 comments

@mratsim
Collaborator

mratsim commented Jan 16, 2022

Currently std/tasks is implemented the following way:

type
  Task* = object ## `Task` contains the callback and its arguments.
    callback: proc (args: pointer) {.nimcall, gcsafe.}
    args: pointer
    destroy: proc (args: pointer) {.nimcall, gcsafe.}

Any function call with arguments triggers an allocation. This is something that runtimes using std/tasks can't avoid.
Given that runtimes also have to allocate a future/flowvar, scheduling a task always incurs allocation overhead, even if only 1 or 2 arguments need to be passed.[1]

GCC's and Clang's C++ standard libraries implement a small-buffer optimization for std::function: callables up to a couple of pointer-sized words (16 bytes in libstdc++, 24 in libc++) are stored inline in the data structure, and the heap is used otherwise.

It would significantly reduce memory overhead and memory fragmentation for long-running tasking runtimes if this were implemented:

  • Tasks are most useful in a multithreaded context, where they are moved between threads.
    Reclaiming memory allocated on another thread is non-trivial for allocators: it requires synchronization and advanced contention-avoidance techniques, like the extreme freelist sharding in the mimalloc allocator.
  • Tasks may be allocated and deallocated in quick succession, leading to fragmentation, especially if the args are small.
  • As the number of cores grows, fine-grained tasks become more important to exploit all cores, leading to more std/tasks overhead.

--

[1]: Runtimes actually don't have to allocate a future/flowvar if it doesn't escape; they can use the caller's stack frame if the compiler helps with heap allocation elision (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p2477r0.html#design-and-example, http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0981r0.html).

Proposal

For a size N to be decided, change the type to:

type
  Task* = object ## `Task` contains the callback and its arguments.
    callback: proc (args: pointer) {.nimcall, gcsafe.}
    args: pointer
    destroy: proc (args: pointer) {.nimcall, gcsafe.}
    argsBuffer: array[N, byte]

args would point either to the buffer or to the heap. The change would be minimal and would mainly require changing this:

let scratchObjPtrType = quote do:
  cast[ptr `scratchObjType`](c_calloc(csize_t 1, csize_t sizeof(`scratchObjType`)))

to something along these lines:

let scratchObjPtrType = quote do:
  when sizeof(`scratchObjType`) <= N:
    cast[ptr `scratchObjType`](task.buffer.addr)
  else:
    cast[ptr `scratchObjType`](c_calloc(csize_t 1, csize_t sizeof(`scratchObjType`)))

at https://github.com/nim-lang/Nim/blob/d102b2f54c41f19a0545f28109d0a93550b5d886/lib/std/tasks.nim#L187
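
Outside the macro machinery, the intended inline-vs-heap dispatch could look like the following standalone sketch (`storeArgs` is an illustrative helper, not part of std/tasks; N = 40 as in the size discussion below):

type
  Task = object
    callback: proc (args: pointer) {.nimcall, gcsafe.}
    args: pointer
    destroy: proc (args: pointer) {.nimcall, gcsafe.}
    argsBuffer: array[40, byte]

proc storeArgs[T](task: var Task, value: T) =
  ## Copy `value` inline when it fits, fall back to the heap otherwise.
  when sizeof(T) <= task.argsBuffer.len:
    copyMem(task.argsBuffer.addr, value.unsafeAddr, sizeof(T))
    task.args = task.argsBuffer.addr
  else:
    task.args = alloc0(sizeof(T))
    copyMem(task.args, value.unsafeAddr, sizeof(T))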

Discussion of the buffer size

The current task object has a size of 24 bytes.

  • Buffer of size 8: this rounds the total size up to 32, a nice power of 2. However, the majority of functions need more than 1 argument.
  • Buffer of size 16 or 24: many functions have the form proc(a, b): result or proc(r, a, b); this would cover those efficiently.
  • Buffer of size 40:
    • This rounds the total size up to 64, a nice power of 2.
    • 64 bytes is the cache-line size on most CPUs (when it's not 128); if a task is aligned, it can be loaded from L1 cache in a single cache read.
    • This buffer size can store up to 5 int/pointer arguments, or 2 openArrays plus 1 pointer/int (for example a callback when doing a parallel map, or a pointer to a result location).
      Working on large openArrays is often a very compelling reason to divide work into tasks.
    • x86-64 can pass up to 6 arguments through registers and ARM64 up to 8, so there won't be stack spilling with that buffer size.
    • Also, due to false sharing, a runtime that stores tasks contiguously, for example in a shared task queue, must make sure that they each start 64 bytes apart (or even 128 bytes, due to CPUs prefetching 2 cache lines at a time).
      Even if tasks were smaller, that space would be unused anyway.
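
The arithmetic above can be checked directly with a standalone sketch (TaskSBO is an illustrative stand-in for the proposed layout, assuming a 64-bit target):

const N = 40

type
  TaskSBO = object  # illustrative stand-in for the proposed Task layout
    callback: proc (args: pointer) {.nimcall, gcsafe.}
    args: pointer
    destroy: proc (args: pointer) {.nimcall, gcsafe.}
    argsBuffer: array[N, byte]

doAssert sizeof(TaskSBO) == 64       # header (24) + buffer (40) = one cache line
doAssert N div sizeof(pointer) == 5  # up to 5 pointer/int arguments inline
doAssert 2 * (2 * sizeof(pointer)) + sizeof(pointer) == N  # 2 openArrays (ptr + len) + 1 pointer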
@Araq
Member

Araq commented Jan 17, 2022

This will probably only work with ARC/ORC but then so be it.

@mratsim
Collaborator Author

mratsim commented Jan 17, 2022

I think std/tasks already requires either non-thread-local GC (so Boehm would work as well) or value types.

Unless you use tasks in a non-multithreading context, but libraries that might want to do that (asyncdispatch/chronos) use closure iterators.

@ringabout
Member

ringabout commented Jan 19, 2022

In the current implementation, the task is allocated on the stack and cleaned up before toTask returns, so the address of task.argsBuffer is lost.

We should probably pass task.argsBuffer in the invoke procs, which requires changing their signature (a breaking change):

proc invoke*(task: var Task) {.inline, gcsafe.} =
  ## Invokes the `task`.
  assert task.callback != nil
  if task.args == nil and task.destroy != nil:
    # args == nil serves as the marker that the arguments
    # were stored inline in argsBuffer
    task.callback(addr task.argsBuffer)
  else:
    task.callback(task.args)

(though Task is always passed by reference internally)

ringabout added a commit to ringabout/Nim that referenced this issue Jan 19, 2022
mratsim added a commit to status-im/nim-taskpools that referenced this issue Jan 19, 2022
@mratsim
Collaborator Author

mratsim commented Jan 19, 2022

See discussion on Discord https://discord.com/channels/371759389889003530/768367394547957761/933307741462753300

The changes indeed work (status-im/nim-taskpools@76a8593), but only if the spawning scope doesn't exit before the task has run.

AKA, structured parallelism:
spec #347 (comment)

Structured parallelism [OPTIONAL]

Structured parallelism ensures that all tasks and their descendants created within a scope are completed before exiting that scope.

template syncScope(ex: MyExecutor, scope: untyped): untyped
template syncScope(scope: untyped): untyped

References:

  • X10 Parallel Programming Language (2004), the Async-Finish construct
  • Habanero multithreading runtime for Java and Scala, the Async-Finish construct
  • Habanero C, the Async-Finish construct

More recently, concurrency frameworks also converged to similar "structured concurrency" (https://en.wikipedia.org/wiki/Structured_concurrency).

One way to make it work is in `=sink`: if the task being moved points into its own argsBuffer, change args to point to the destination's argsBuffer:

proc `=sink`(dst: var Task, src: Task) =
  `=destroy`(dst)
  dst.callback = src.callback
  dst.destroy = src.destroy
  if src.args == cast[pointer](src.argsBuffer.unsafeAddr):
    # The arguments are stored inline in the task:
    # copy the buffer and repoint args at the destination's buffer
    dst.argsBuffer = src.argsBuffer
    dst.args = cast[pointer](dst.argsBuffer.addr)
  else:
    # The arguments are stored on the heap
    dst.args = src.args

This also implicitly assumes that Tasks are {.byref.} types.

@ringabout
Member

block:
  var num = 0
  proc hello(a: int) = inc num, a

  var b = toTask hello(13)
  b.invoke()
  echo num
  assert num == 13

doesn't work with the POC:

0
Error: execution of an external program failed

It generates C code like this:

[screenshot of the generated C code omitted]

task is zeroed after toTask is called, so task.argsBuffer becomes all zeros,
while b.args still points to the (now zeroed) task.argsBuffer.
That seems to be why echo num prints 0.
