Optimize Channel.select_impl #12756

Open
straight-shoota opened this issue Nov 18, 2022 · 5 comments
@straight-shoota
Member

Channel.select_impl allocates an array for all action locks, even if there is only a single one (#12694 (comment)). This can be avoided.

Typically there are only a handful of actions (< 10 or so), so we could consider using StaticArray for stack allocation.
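For context, a minimal illustration (the channels and the loop are made up here, not taken from the issue): every evaluation of a select expression goes through Channel.select_impl, so even a small two-branch select currently builds a heap Array for the lock list on each iteration.

ch1 = Channel(Int32).new
ch2 = Channel(Int32).new

spawn do
  5.times { |i| ch1.send i }
end

5.times do
  # Each `select` evaluation calls Channel.select_impl with both receive
  # actions; select_impl currently allocates an Array via `ops.to_a`.
  select
  when x = ch1.receive
    puts "ch1: #{x}"
  when y = ch2.receive
    puts "ch2: #{y}"
  end
end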

@straight-shoota
Member Author

straight-shoota commented Nov 18, 2022

Here is a possible patch for this, but I'd like to make it a bit more concise, and it's not sufficiently tested: the spec suite has only one example with duplicates, and removing the uniqueness filter doesn't break it.

--- i/src/channel.cr
+++ w/src/channel.cr
@@ -458,12 +458,30 @@ class Channel(T)
   end

   private def self.select_impl(ops : Indexable(SelectAction), non_blocking)
+    # Avoid heap allocation for typical small ops sizes (< 10 actions)
+    ops_buffer = uninitialized StaticArray(typeof(ops.each{|o| break o}.not_nil!), 10)
+    if ops.size <= ops_buffer.size
+      ops.each_with_index do |op, i|
+        ops_buffer.to_unsafe[i] = op
+      end
+      ops_locks = ops_buffer.to_slice[0, ops.size]
+    else
+      ops_locks = ops.to_a
+    end
+
     # Sort the operations by the channel they contain
     # This is to avoid deadlocks between concurrent `select` calls
-    ops_locks = ops
-      .to_a
-      .uniq!(&.lock_object_id)
-      .sort_by!(&.lock_object_id)
+    ops_locks.unstable_sort_by!(&.lock_object_id)
+    duplicates = 0
+    ops_locks.each_with_index do |lock, i|
+      next if i == 0
+      if lock.lock_object_id == ops_locks[i - duplicates - 1].lock_object_id
+        duplicates += 1
+      else
+        ops_locks[i - duplicates] = lock
+      end
+    end
+    ops_locks = ops_locks[0, ops_locks.size - duplicates]

     ops_locks.each &.lock

We need two separate features to implement this, which I think could be extracted because they are useful on their own (the second one is sketched right after this list):

  • Materialize an Enumerable into a stack-allocated Slice
  • Remove duplicates from a Slice.
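A rough sketch of the second helper, assuming the slice is already sorted by the key; the name dedup_by is made up for illustration, not existing stdlib API.

# Hypothetical helper: remove adjacent duplicates (by a key) from an
# already-sorted slice, in place, and return the deduplicated prefix.
# No heap allocation: it only shifts elements within the slice.
def dedup_by(slice : Slice(T), & : T -> K) : Slice(T) forall T, K
  return slice if slice.size < 2

  write = 1
  (1...slice.size).each do |read|
    if (yield slice[read]) != (yield slice[write - 1])
      slice[write] = slice[read]
      write += 1
    end
  end
  slice[0, write]
end

ids = Slice[1_u64, 1_u64, 2_u64, 3_u64, 3_u64]
dedup_by(ids, &.itself) # => Slice[1, 2, 3]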

@asterite
Member

I don't think this is the way to go. The compiler rewrites a select into something else. Let the compiler allocate the static array with exactly the size needed, get a slice to it and pass it to select_impl. Super easy, and it works for all cases.
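Not the actual compiler lowering, but a sketch of the shape this suggests, with the element type simplified to plain UInt64 lock ids instead of the generic action structs: the caller stack-allocates exactly the needed number of slots and passes a mutable Slice view of them, so the callee can sort in place without any heap allocation.

# Sketch with a placeholder element type: `buffer` lives on the caller's
# stack and `lock_in_order` works on a mutable view of it.
def lock_in_order(lock_ids : Slice(UInt64))
  lock_ids.sort!                 # in place, inside the caller's stack buffer
  lock_ids.each { |id| puts "locking #{id}" }
end

buffer = uninitialized StaticArray(UInt64, 3)
buffer[0] = 30_u64
buffer[1] = 10_u64
buffer[2] = 20_u64

lock_in_order(buffer.to_slice)   # the Slice points into the stack, no Array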

@straight-shoota
Member Author

A bigger refactoring is suggested in #12694 (comment):

Also it seems an Indexable is passed to select_impl. In the actual implementation that's a tuple. But that's not good if we need to sort the values. So I would change the implementation to use a StaticArray and pass a slice to the method. Then the slice is mutable, so the data can be sorted in place. And maybe then the data can also be structs to avoid memory allocations. But I see some pointers being passed around, so it's a tricky refactor. Even though it'll take some time to get right, it's probably the best thing to do.

@straight-shoota
Member Author

It shouldn't take much to switch the compiler to emit StaticArray/Slice instead of Tuple. That should be fully compatible with the existing stdlib code because it receives Indexable. So that's good. I'll submit a patch for that shortly.

On its own, this doesn't change anything, though. In order to improve performance we have to avoid heap allocations in Channel.select_impl. There is definitely an array allocation for ops_locks. Additionally, if there are more than Array::SMALL_ARRAY_SIZE ops, uniq! would also allocate a hash.

This array is basically an ordered list of distinct lock objects. It prevents deadlocks from trying to lock the same lock twice (uniqueness) as well as deadlocks between concurrent select calls (consistent lock ordering).
The implementation looks like this:

crystal/src/channel.cr (lines 423 to 459 at d40b44b):

private def self.select_impl(ops : Indexable(SelectAction), non_blocking)
  # Sort the operations by the channel they contain
  # This is to avoid deadlocks between concurrent `select` calls
  ops_locks = ops
    .to_a
    .uniq!(&.lock_object_id)
    .sort_by!(&.lock_object_id)

  ops_locks.each &.lock

  ops.each_with_index do |op, index|
    state = op.execute

    case state
    in .delivered?
      ops_locks.each &.unlock
      return index, op.result
    in .closed?
      ops_locks.each &.unlock
      return index, op.default_result
    in .none?
      # do nothing
    end
  end

  if non_blocking
    ops_locks.each &.unlock
    return ops.size, NotReady.new
  end

  # Because `channel#close` may clean up a long list, `select_context.try_trigger` may
  # be called after the select return. In order to prevent invalid address access,
  # the state is allocated in the heap.
  shared_state = SelectContextSharedState.new(SelectState::Active)
  contexts = ops.map &.create_context_and_wait(shared_state)

  ops_locks.each &.unlock

The uniqueness filter can easily be avoided once the list is sorted anyway: identical lock ids end up adjacent in the sorted order, so repeats can simply be skipped.

If the sorting were applied to the original slice, it would disrupt the order of the actions, so their index would no longer correlate with the index of the when branch.
We could prevent that by encoding the index into the SelectAction.
Alternatively, select_impl could receive a StaticArray and use another instance of that type as a stack-allocated copy for sorting. This should work fine with the proposed compiler change (which emits a StaticArray).
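A small sketch of that last alternative, with size and element type simplified (Int32 instead of the generic action structs): StaticArray is a value type, so a plain assignment yields a stack copy that can be sorted freely without disturbing the original order that maps to the when branches.

def demo(ops : StaticArray(Int32, 3))
  sorted = ops             # value copy on the stack, no heap allocation
  sorted.to_slice.sort!    # sort the copy in place
  {ops, sorted}            # original order stays intact for branch indices
end

demo(StaticArray[3, 1, 2]) # => {StaticArray[3, 1, 2], StaticArray[1, 2, 3]}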

@straight-shoota
Member Author

straight-shoota commented Aug 19, 2024

We could take inspiration from the select implementation in Golang: it uses a stack-allocated array of indices into the operations array and orders those indices.
The buffer is apparently provided by the runtime. It's a static array capped at 65_535 entries (UInt16::MAX), which should certainly be big enough for any select statement. Each index is a UInt16, so the buffer takes up to 128 KiB. When the number of operations is known at compile time, the static array can of course be smaller.

https://github.com/golang/go/blob/527610763b882b56c52d77acf1b75c2f4233c973/src/runtime/select.go#L188-L222

I believe this could be a good solution. It's relatively trivial to implement (compared to the complexity of choosing the appropriate data type) and doesn't need a compiler change.
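A rough sketch of how that could look in Crystal; the cap, the method name and the yielded values are placeholders, not a proposed API. The idea is to sort a stack-allocated array of UInt16 indices by lock id instead of sorting (or copying) the operations themselves, and to skip adjacent duplicates while locking.

# Hypothetical sketch: order indices into `lock_ids` on the stack and visit
# each distinct lock id in sorted order.
def each_lock_in_order(lock_ids : Indexable(UInt64), & : UInt64 ->)
  # Illustrative cap of 64 operations; Go allows up to UInt16::MAX cases.
  order = uninitialized StaticArray(UInt16, 64)
  raise ArgumentError.new("too many select operations") if lock_ids.size > order.size

  lock_ids.size.times { |i| order[i] = i.to_u16 }

  indices = order.to_slice[0, lock_ids.size]
  indices.sort! { |a, b| lock_ids[a] <=> lock_ids[b] }

  previous = nil
  indices.each do |i|
    id = lock_ids[i]
    next if id == previous # adjacent duplicates share a lock; handle it once
    previous = id
    yield id
  end
end

each_lock_in_order([30_u64, 10_u64, 10_u64, 20_u64]) { |id| puts "locking #{id}" }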
