Should Barrier.wait() return as soon as a task fails? #272

emiltin · 2023-08-24T07:44:22Z

emiltin
Aug 24, 2023

When calling wait() on a barrier, it re-raises exceptions from any of the tasks in the barrier, but apparently not until all tasks added before the failing task have completed.

Here two seconds pass before the barrier reports the error:

require 'async'
require 'async/barrier'
Async do
  barrier = Async::Barrier.new
  barrier.async { sleep 2 }
  barrier.async { raise RuntimeError }
  barrier.wait # => 2 seconds later.... RuntimeError
ensure
  barrier.stop
end

But just by changing the order in which the tasks are created, the barrier now returns the error immediately:

require 'async'
require 'async/barrier'
Async do
  barrier = Async::Barrier.new
  barrier.async { raise RuntimeError }
  barrier.async { sleep 2 }
  barrier.wait # => immediately: RuntimeError
ensure
  barrier.stop
end

I would prefer that it returns exceptions as soon as any task fails.
It also seems odd to me that the order of the tasks matters, even though they run asynchronously.
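The ordering effect described above can be reproduced without Async at all. The following is a minimal sketch that uses plain Ruby threads as a stand-in for Async tasks (an assumption made so the example is self-contained; it is not how Async::Barrier is implemented). The helper names `spawn_workers` and `wait_in_creation_order` are hypothetical. Waiting on workers in creation order only surfaces a failure once every earlier worker has finished:

```ruby
# Analogy only: plain threads stand in for Async tasks.

# Hypothetical helper: start one worker per entry, either slow or failing.
def spawn_workers(order)
  order.map do |kind|
    Thread.new do
      Thread.current.report_on_exception = false # keep stderr quiet
      kind == :slow ? sleep(0.3) : raise("boom")
    end
  end
end

# Hypothetical helper: wait on each worker in creation order.
# Thread#value joins the thread and re-raises the exception that ended it.
# Returns [error message, seconds waited] when a worker failed.
def wait_in_creation_order(workers)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  workers.each(&:value)
  [nil, Process.clock_gettime(Process::CLOCK_MONOTONIC) - started]
rescue StandardError => error
  [error.message, Process.clock_gettime(Process::CLOCK_MONOTONIC) - started]
end

_, slow_first = wait_in_creation_order(spawn_workers([:slow, :fail]))
_, fail_first = wait_in_creation_order(spawn_workers([:fail, :slow]))
puts format("slow first: %.2fs, fail first: %.2fs", slow_first, fail_first)
```

With the slow worker first, the error is only seen after the sleep has elapsed; with the failing worker first, it is seen almost immediately.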

ioquatix · 2023-08-24T08:21:03Z

ioquatix
Aug 24, 2023
Maintainer

A barrier is supposed to be a synchronisation point of multiple tasks.

Probably Barrier#wait should ignore errors, and just wait until all tasks are completed (success or fail).

However, I suppose what I was thinking, was that if a task has failed with an unhandled error, the entire process might be failed. Order can matter, since you might have inter-dependencies.

I don't mind changing this semantic, but I suppose we just need to be mindful of what problems we are trying to solve and the best way to organise the code/solutions.


emiltin · 2023-08-24T08:50:20Z

emiltin
Aug 24, 2023
Author

It's common to start several tasks and wait for them all to complete. This is what barrier handles.

But if one of the tasks fails with an unhandled error, then there is often no need to wait for all tasks to complete, because we know that we're not going to be able to get a complete result once one of the tasks has failed. So we might want to handle this by e.g. stopping all remaining tasks and starting over, failing the parent task, or perhaps aborting entirely.

In general I think handling errors is one of the most difficult aspects of using Async. My goal is to create fault tolerant code. I'm currently inspired by Erlang/Elixir and supervisor trees, and the idea of restarting failed parts from a known good state.


emiltin · 2023-08-24T14:45:03Z

emiltin
Aug 24, 2023
Author

Perhaps you could choose whether the barrier:

- re-raises an uncaught error in a task as soon as it happens
- waits for all tasks to complete (ignoring failures)

I think the current behaviour is somewhere between these, but to me seems a bit hard to use because it depends on the ordering of tasks.

Async do
  barrier = Async::Barrier.new
  
  barrier.async { raise 'ups' }
  barrier.async { sleep 1 }

  barrier.wait # raise as soon as any task fails
  barrier.complete # wait for all tasks, ignoring errors

rescue StandardError => e
  # what task caused the error?
ensure
  barrier.stop
end

Task#wait raises if the task fails, so it seems appropriate that Barrier#wait does the same, but for all tasks.

Is there a way to glean from the error which task originally raised it?


emiltin · 2023-08-30T10:39:04Z

emiltin
Aug 30, 2023
Author

I tried with this implementation of Async::Barrier#wait:

def wait
  condition = Async::Condition.new
  guard = Async do
    until @tasks.empty?
      result = condition.wait
      raise result if result.is_a? StandardError
    end
  end

  @tasks.each do |waiting|
    Async do
      begin
        task = waiting.task
        task.wait
      ensure
        @tasks.remove?(waiting) unless task.alive?
      end
      condition.signal :ok
    rescue StandardError => e
      condition.signal e
    end
  end

  guard.wait
end

Now the barrier will abort and re-raise as soon as any task fails:

require 'async'
require 'async/barrier'
Async do
  barrier = Async::Barrier.new

  barrier.async do |task1|
    task1.annotate(:task1)
    sleep 1000
  end

  barrier.async do |task2|
    task2.annotate(:task2)
    sleep 0.1
    raise RuntimeError, 'boom!'
  end

  barrier.wait
  puts 'All tasks completed'
rescue StandardError => e
  puts "Task error: #{e}"
ensure
  barrier.stop
end

Instead of waiting for each task in turn, we run a separate task that waits for a condition, as long as there are tasks remaining. When a task completes or fails, it signals the condition and removes itself. The guard can then abort the wait if a task failed.

But having to run each task inside a task feels clunky; there's probably a better way to do it?

2 replies

ioquatix Aug 30, 2023
Maintainer

A barrier does not impose any order constraints - only that it's a synchronisation point. I think handling errors should be seen as exceptional. There will always be ambiguity in the order of error handling - as soon as you have non-determinism, you cannot make any guarantees about order, even if we do try to be as "predictable" as possible in Async.

The general model for Async::Barrier is this:

  def Barrier(parent: nil, &block)
    Barrier.new(parent: parent).tap do |barrier|
      yield barrier
      barrier.wait
    ensure
      barrier.stop
    end
  end

As I've said before, tasks that raise exceptions are exceptional and the flow control is also exceptional. Applications that use Barrier#wait should probably not raise exceptions as part of normal code execution.

That being said, I do understand your use case and the ideas about robustness. Your point is, as soon as one of the tasks fails, the entire request is essentially a failure, so why wait?

There are several patterns (with overlap), for a given set of tasks:

1. All tasks failed = total failure.
2. At least one failure = total failure.
3. At least one success = total success.
4. All tasks success = total success.

I think your one is (2), but actually I've also seen (3) and (4) in real code.

I think a barrier implements (2) and (4). It's true that a common pattern might be: fan out and fail fast; or fan out and succeed fast, and cancel the remainder. Or fan out, and wait for at least N successful responses.
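As an illustration of the last pattern (fan out and wait for at least N successful responses), here is a minimal sketch using plain Ruby threads and `Thread::Queue` rather than Async tasks. That substitution is an assumption made to keep the example self-contained, and `first_successes` is a hypothetical helper, not part of Async. It collects results in completion order, gives up as soon as N successes are no longer reachable, and cancels whatever is still running:

```ruby
# Hypothetical helper, not part of Async: return the first `count` successful
# results from `jobs` (an array of callables), in completion order.
def first_successes(count, jobs)
  finished = Thread::Queue.new
  workers = jobs.map do |job|
    Thread.new do
      finished << [:ok, job.call]
    rescue StandardError => error
      finished << [:error, error]
    end
  end

  successes = []
  failures = 0
  while successes.size < count
    # Too many failures: the remaining jobs can no longer reach `count`.
    if failures > jobs.size - count
      raise "only #{successes.size} of #{count} required jobs succeeded"
    end

    status, value = finished.pop # blocks until some worker reports
    if status == :ok
      successes << value
    else
      failures += 1
    end
  end
  successes
ensure
  workers&.each(&:kill) # cancel the remainder
end

jobs = [-> { sleep 1; :slow }, -> { raise "down" }, -> { :fast }, -> { sleep 0.05; :medium }]
p first_successes(2, jobs)
```

Setting `count` to 1 gives "succeed fast", and setting it to `jobs.size` gives "fail fast on the first error".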

I think we can change barrier to fail fast without impacting the interface the user expects. If a task fails, whether it fails now or later is not specified by barrier - just that Barrier#wait will eventually re-raise the exception.

The best way to implement this, is for the task to notify the barrier that it's done - success or failure, and for Barrier#wait to wait on that condition. We can make a few small changes to make this more ergonomic: #276
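The notification idea in the previous paragraph can be sketched as follows. This is not the change made in #276; it is a stand-alone illustration that assumes plain Ruby threads and a `Thread::Queue` in place of Async tasks and the barrier's internal condition. `FailFastGroup` is a hypothetical name. Every worker reports success or failure to a shared queue, and `wait` consumes those reports in completion order:

```ruby
# Hypothetical stand-in for a fail-fast barrier, built on stdlib threads.
class FailFastGroup
  def initialize
    @finished = Thread::Queue.new # workers report here when they are done
    @workers = []
  end

  # Start a worker that notifies the group on success or failure.
  def spawn(&block)
    @workers << Thread.new do
      @finished << [:ok, block.call]
    rescue StandardError => error
      @finished << [:error, error]
    end
  end

  # Collect results in completion order; re-raise the first failure seen.
  # Sketch limitation: intended to be called once per group.
  def wait
    @workers.size.times.map do
      status, value = @finished.pop
      raise value if status == :error
      value
    end
  end

  # Cancel anything still running.
  def stop
    @workers.each(&:kill)
  end
end

group = FailFastGroup.new
group.spawn { sleep 5 }
group.spawn { raise "boom" }
begin
  group.wait
rescue RuntimeError => error
  puts "failed fast: #{error.message}"
ensure
  group.stop
end
```

Because `wait` blocks on the queue rather than on any particular worker, the order in which workers were started no longer affects when a failure is reported.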

emiltin Aug 30, 2023
Author

> as soon as one of the tasks fails, the entire request is essentially a failure, so why wait?

Exactly.

Aren't 1 and 3 the same? A total success requires just one to succeed; if all fail, it's a total failure.
And 2 and 4 the same? A total success requires all to succeed; if just one fails, it's a total failure.

I think the current implementation is close to 2 and 4, but might not fail fast, depending on the ordering of tasks.

emiltin · 2023-08-30T12:23:16Z

emiltin
Aug 30, 2023
Author

> The best way to implement this, is for the task to notify the barrier that it's done - success or failure, and for Barrier#wait to wait on that condition. We can make a few small changes to make this more ergonomic

I think this is what I attempted - use a notification to let tasks inform the barrier whether they succeed or fail, so the barrier can fail fast. But I'm sure you can improve :-)


lwoggardner · 2025-03-28T08:10:42Z

lwoggardner
Mar 28, 2025

Old discussion, but I have recently been down a fail-fast Barrier rabbit hole.

I was starting tasks for io read and write loops, and waiting on both for graceful (or not) disconnect.
If the writer fails and closes the io, then the reader can be left hanging and needs to be explicitly stopped.
I got caught out by the order of starting the tasks in the Barrier affecting whether an IOError would stop the other side.

While investigating how to solve that, I came across this discussion and the suggestion from #276 to use a Queue as the :finished attribute of a set of Tasks.

Which led to EnumerableWaiter, to generalise that approach.

module Async
  class EnumerableWaiter
    include Enumerable
    # @param  [:async | nil] parent the parent task to use for asynchronous operations.
    def initialize(parent: nil)
      @parent = parent
      @queue = Queue.new
      @task_count = 0
    end

    def async(parent: (@parent or Task.current), **options, &block)
      @task_count += 1
      parent.async(**options, finished: @queue, &block)
    end

    # Yield tasks as they complete (including stopped)
    def each_task
      return enum_for(:each_task).lazy unless block_given?

      yield(@queue.dequeue.tap { @task_count -= 1 }) until @task_count.zero?
    end

    # Yield task results as they complete
    def each
      return enum_for(:each).lazy unless block_given?

      each_task { |t| yield t.wait unless t.stopped? }
    end

    # Wait in task completion order, which will raise on first failed task
    def wait
      each(&:itself)
    end

    # eg for parent barrier.stop
    def respond_to_missing?(...)
      @parent.respond_to?(...)
    end

    def method_missing(...)
      @parent.send(...)
    end
  end
end

It can be used as a fail-fast barrier:

Async::EnumerableWaiter.new(parent: Async::Barrier.new).tap do |barrier|
  barrier.async { }
  # ...
  barrier.wait
ensure
  barrier.stop
end

or just as an enumerable over the tasks

  Async::EnumerableWaiter.new.tap do |waiter|
    waiter.async { }
    #...
    waiter.first(5) # or .chunk,  .reduce, .to_a etc..
  end

However, using a shared :finished notification for a set of tasks creates an unexpected constraint: you must not call Task#wait on any individual, unfinished task.

1 - Task#wait will return early if a shared :finished condition is used
https://github.com/socketry/async/blob/main/lib/async/task.rb#L248

# `finish!` will set both of these to nil before signaling the condition:
if @block || @fiber
  @finished ||= Condition.new
  @finished.wait
end
# ... proceed to return result or raise

The task calling #wait will be resumed when any task using the shared Queue is signaled by #finish!
It then does not recheck its own completed state, which means the task being waited on might actually still be alive.
A fix could be to change the if to a while, or to loop until the signaled value is the task itself:

if @block || @fiber
  @finished ||= Condition.new
  until @finished.wait.equal?(self); end
end

2 - Task#wait will discard messages from the Queue.

Async::Queue#wait is an alias for #dequeue, so the above call to @finished.wait will silently discard messages, which will then be skipped by EnumerableWaiter.

A potential (breaking change) solution would be to make Queue#wait just delegate to its notification:

    def wait
      @available.wait
    end
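The message-loss hazard in point 2 is a general property of popping from a shared queue, and can be shown with the stdlib `Thread::Queue` alone. The sketch below is an analogy, not Async code: `drain_until` is a hypothetical helper that mimics a waiter looping until it sees its own notification, as in the `until ... equal?(self)` fix from point 1.

```ruby
# Hypothetical helper: pop from a shared queue until `wanted` shows up,
# returning everything that was consumed along the way.
def drain_until(queue, wanted)
  consumed = []
  loop do
    message = queue.pop
    consumed << message
    break if message.equal?(wanted)
  end
  consumed
end

finished = Thread::Queue.new
%i[a b c].each { |name| finished << name } # three workers report completion

consumed = drain_until(finished, :c) # a waiter that only cares about :c
puts "consumed: #{consumed.inspect}, left for other consumers: #{finished.size}"
```

The notifications for `:a` and `:b` were taken off the queue by the waiter for `:c`, so any other consumer of the same queue never sees them.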

Finally, I have a suggestion for Async::Barrier to encapsulate the common usage pattern

def wait!
  yield self if block_given? # start tasks
  wait
ensure
  stop
end

used as..

Async::Barrier.new.wait! do |b|
  b.async { }
  b.async { }
end


ioquatix · 2025-03-29T11:57:57Z

ioquatix
Mar 29, 2025
Maintainer

After considering these discussions and others, I'm going to revisit how Async::Barrier handles failures.

