
[Streaming] Revisiting Ray Core streaming to perform I/O fully async avoiding syncing gRPC client and Python generator #42260

Merged: 11 commits merged on Jan 25, 2024

Conversation

@alexeykudinkin (Contributor) commented on Jan 9, 2024

Why are these changes needed?

Currently, when streaming responses back from one actor/task to another in Ray Core, the actual I/O is performed synchronously, i.e., every object yielded by the generator is:

  1. Serialized (synchronously)
  2. Dispatched back to the caller (synchronously)

This PR takes the opportunity to overlap serialization and network I/O (which hold no GIL) with the execution of the target generator producing the output.
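To make the idea concrete, here is a minimal sketch of the overlap pattern (illustrative only, not the actual Ray implementation; `serialize_and_report` and `send_over_network` are hypothetical stand-ins for Ray's serialization and gRPC report path):

```python
import asyncio
import pickle
from concurrent.futures import ThreadPoolExecutor

def send_over_network(payload: bytes) -> None:
    # Stand-in for the gRPC report back to the caller.
    pass

def serialize_and_report(item) -> None:
    # Stand-in for Ray's serialization + dispatch of one generator item.
    send_over_network(pickle.dumps(item))

async def stream_generator_async(gen, io_pool: ThreadPoolExecutor) -> None:
    loop = asyncio.get_running_loop()
    pending = []
    for item in gen:
        # Hand serialization + network I/O to the I/O thread pool and keep
        # producing, instead of blocking on each report before the next yield.
        pending.append(loop.run_in_executor(io_pool, serialize_and_report, item))
    # Drain outstanding reports before finishing the task.
    await asyncio.gather(*pending)

# Usage: production of items overlaps with their serialization/dispatch.
asyncio.run(stream_generator_async((i for i in range(1000)), ThreadPoolExecutor(max_workers=1)))
```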

Overlapping network I/O (which doesn't block the GIL) with the streaming generator allows us to considerably reduce P50 latencies:

  • P50: ~700ms -> 485ms (30%)
  • P75: ~720ms -> 600ms (16%)
  • P99: unchanged
----------------------------------------------------------------
Stream responses asynchronously (IO threads: 1; default)
----------------------------------------------------------------

Core Actors streaming throughput (ASYNC) (num_replicas=1, tokens_per_request=1000, batch_size=10): 14199.39 +- 93.58 tokens/s
(CallerActor pid=98043) Individual request quantiles:
(CallerActor pid=98043) 	P50=480.73095849999964
(CallerActor pid=98043) 	P75=603.5131979999991
(CallerActor pid=98043) 	P99=770.7011399399998


----------------------------------------------------------------
Skip back-pressure handler
----------------------------------------------------------------

Core Actors streaming throughput (ASYNC) (num_replicas=1, tokens_per_request=1000, batch_size=10): 14643.21 +- 352.34 tokens/s
(CallerActor pid=13155) Individual request quantiles:
(CallerActor pid=13155) 	P50=685.5565414999995
(CallerActor pid=13155) 	P75=707.7666252500006
(CallerActor pid=13155) 	P99=777.5089473099982

----------------------------------------------------------------
Baseline
----------------------------------------------------------------

Core Actors streaming throughput (ASYNC) (num_replicas=1, tokens_per_request=1000, batch_size=10): 14264.66 +- 162.12 tokens/s
(CallerActor pid=12256) Individual request quantiles:
(CallerActor pid=12256) 	P50=705.1566460000007
(CallerActor pid=12256) 	P75=723.1873227499985
(CallerActor pid=12256) 	P99=771.7197762799998

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@alexeykudinkin alexeykudinkin requested review from rkooo567, edoakes and jjyao and removed request for rkooo567 and edoakes January 9, 2024 17:52
@alexeykudinkin alexeykudinkin changed the title [WIP][Streaming] Revisiting Ray Core streaming to perform I/O fully async avoiding syncing gRPC client and Python generator [Streaming] Revisiting Ray Core streaming to perform I/O fully async avoiding syncing gRPC client and Python generator Jan 11, 2024
@rkooo567 rkooo567 self-assigned this Jan 16, 2024
@rkooo567 (Contributor): I will review it by tomorrow.

@rkooo567 (Contributor) left a review: Generally LGTM!

@@ -343,7 +343,7 @@ steps:

 - label: ":ray: core: cpp worker tests"
   tags: core_cpp
-  instance_type: small
+  instance_type: medium
Contributor: Is this change necessary?

Contributor (Author): Yep, recommended by @can-anyscale, as our C++ builds are periodically timing out.

@@ -1398,20 +1393,32 @@ async def execute_streaming_generator_async(
            raise
        except Exception as e:
            output_or_exception = e
            is_exception = 1
Contributor: Why don't we use a bool here?

Contributor (Author): Leftover from another experiment; let me clean that one up.

        generator_backpressure_num_objects
    )
else:
    self.waiter = shared_ptr[CGeneratorBackpressureWaiter]()
Contributor: This part doesn't seem to be explained in the PR description. Is this because having a waiter implementation slows down performance? Also, do the results you posted in the PR description include this optimization?

Contributor (Author): Yeah, removing the waiter in particular only brought about a ~3% performance improvement, which didn't stand out in any way, but I think it still makes sense to remove it from the surface area given that it's only relevant for Data, not for RPCs.
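For illustration, a Python-style sketch of that change (names are stand-ins, not the actual Cython bindings): the back-pressure waiter is only constructed when back-pressure is requested, and stays unset otherwise.

```python
from typing import Optional

class GeneratorBackpressureWaiter:
    # Illustrative stand-in for the C++ CGeneratorBackpressureWaiter.
    def __init__(self, max_unconsumed_objects: int) -> None:
        self.max_unconsumed_objects = max_unconsumed_objects

def make_waiter(generator_backpressure_num_objects: int) -> Optional[GeneratorBackpressureWaiter]:
    # Mirrors the diff above: a non-positive threshold means "no back-pressure",
    # so no waiter is created (the null shared_ptr case).
    if generator_backpressure_num_objects > 0:
        return GeneratorBackpressureWaiter(generator_backpressure_num_objects)
    return None
```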

# avoid blocking the event loop when serializing
# the output (which has nogil).
loop.run_in_executor(
    worker.core_worker.get_thread_pool_for_async_event_loop(),
Contributor: IIUC, this won't work if we increase the thread pool size to > 1 (because I don't know if report_streaming_generator_output is thread-safe now). Can you add an assert somewhere to make sure it is not updated?

Contributor (Author): That's correct. Those APIs aren't thread-safe, and I had already discovered that in my experiments, hence the cleanup I started in #42443 to alleviate some of that (it's stacked on top of this one).
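For reference, such an assert could look like the sketch below (illustrative only; `_max_workers` is a CPython ThreadPoolExecutor implementation detail, not a Ray API):

```python
from concurrent.futures import ThreadPoolExecutor

def assert_single_io_thread(pool: ThreadPoolExecutor) -> None:
    # report_streaming_generator_output is assumed not to be thread-safe,
    # so the async event loop's I/O pool must keep exactly one worker thread.
    assert pool._max_workers == 1, (
        "streaming generator reports are not thread-safe; "
        "the async I/O thread pool must have exactly one thread"
    )
```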

          return Status::OK();
        }
      });
    } else {
      if (options_.check_signals) {
Contributor: I think for the else case we can just return Status::OK (if you look at the implementation of WaitUntilObjectConsumed, that's the behavior now). check_signals is there to avoid ignoring Python signals while C++ is using the thread (which doesn't happen in this case).

Contributor (Author) (Jan 22, 2024): I have revisited it to avoid duplication, essentially sharing the callback.

Contributor: Actually, we don't need to check signals if there's no waiter. The purpose of checking signals is that when the thread is blocked inside C++ (e.g., due to locks or sleep), it cannot check for a Python interrupt. But if we don't have the waiter, there's no such problem.

Contributor (Author): Got it. I will clean it up to avoid checking for signals.
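To illustrate the reasoning, a small sketch (hypothetical names) of why a blocking wait needs periodic signal checks while an immediate return does not:

```python
import time
from typing import Callable, Optional

def wait_until_consumed(should_resume: Callable[[], bool],
                        check_signals: Optional[Callable[[], None]] = None,
                        poll_interval_s: float = 0.1) -> None:
    # While the producer thread is parked here (the "waiter" case), it must
    # periodically give Python a chance to raise pending signals such as
    # KeyboardInterrupt; otherwise Ctrl-C is ignored until the wait ends.
    # Without a waiter there is no blocking wait, so no signal check is needed.
    while not should_resume():
        if check_signals is not None:
            check_signals()  # may raise if a Python signal is pending
        time.sleep(poll_interval_s)
```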

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 18, 2024
@alexeykudinkin alexeykudinkin removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 19, 2024
alexeykudinkin and others added 3 commits January 18, 2024 23:04
@@ -3016,40 +3016,52 @@ Status CoreWorker::ReportGeneratorItemReturns(
   RAY_LOG(DEBUG) << "Write the object ref stream, index: " << item_index
                  << ", id: " << dynamic_return_object.first;

-  waiter->IncrementObjectGenerated();
+  if (waiter) {
Collaborator: Worth adding a comment on why it's expected that waiter can be null.

// backpressure.
waiter->UpdateTotalObjectConsumed(waiter->TotalObjectGenerated());
RAY_LOG(WARNING) << "Failed to send the object ref.";
if (waiter) {
Collaborator: Should we avoid passing a callback entirely if there's no waiter? Not sure if this is handled any differently internally.

Contributor (Author): We still need to check for signals; the waiter is just a back-pressure hook that lets us hold the thread if we want to slow down the producer.
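To spell out what the hook does, here's a Python sketch of the assumed waiter semantics (not the actual C++ GeneratorBackpressureWaiter): the producer blocks whenever the number of generated-but-unconsumed objects exceeds the configured threshold.

```python
import threading

class BackpressureWaiter:
    # Illustrative sketch: hold the producer when too many objects are unconsumed.
    def __init__(self, max_unconsumed: int) -> None:
        self.max_unconsumed = max_unconsumed
        self.generated = 0
        self.consumed = 0
        self._cond = threading.Condition()

    def increment_object_generated(self) -> None:
        with self._cond:
            self.generated += 1

    def update_total_object_consumed(self, consumed: int) -> None:
        with self._cond:
            self.consumed = max(self.consumed, consumed)
            self._cond.notify_all()

    def wait_until_object_consumed(self) -> None:
        # Park the producing thread until the caller has caught up enough.
        with self._cond:
            self._cond.wait_for(
                lambda: self.generated - self.consumed < self.max_unconsumed)
```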

@rkooo567 (Contributor) left a review: LGTM. One last comment regarding checking signals.


@rkooo567 rkooo567 merged commit c76a5ac into ray-project:master Jan 25, 2024
2 checks passed
@rkooo567 (Contributor): @alexeykudinkin I remember you contributed the benchmark. Is it running daily right now?

@edoakes (Collaborator) commented on Jan 25, 2024: @rkooo567 it's not. The Serve release tests are not in good shape in general. @zcin is working on improving them in the coming months. Adding these should be part of it.

stephanie-wang added a commit that referenced this pull request Mar 26, 2024
… before continuing (#44257)

#42260 updated streaming generator tasks to asynchronously report generator returns, instead of synchronously reporting each generator return before yielding the next one. However, this has a couple of problems:

  1. If the task still has a reference to the yielded value, it may modify the value. The serialized and reported return will then have a different value than expected (illustrated in the sketch below).

  2. As per "[core] Streaming generator task waits for all object report acks before finishing the task" (#44079), we need to track the number of in-flight RPCs that report generator returns, so that we can wait for them all to reply before returning from the end of the task. If we increment the count of in-flight RPCs asynchronously, we can end up returning from the task while there are still in-flight RPCs.

So this PR reverts some of the logic in #42260 to wait for the generator return to be serialized into the protobuf sent back to the caller. Note that we do not wait for the reply (unless under backpressure).

We can later re-introduce asynchronous generator reports, but we will need to evaluate the performance benefit of a new implementation that also addresses both of the above points.
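A tiny self-contained illustration of the first problem (hypothetical code, not Ray's): if reporting is handed to a background thread while the generator keeps mutating the yielded object, the serialized value can differ from the value at yield time.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor

def report(item) -> bytes:
    # Stand-in for "serialize and send the generator return to the caller".
    return pickle.dumps(item)

def gen():
    buf = []
    for i in range(3):
        buf.append(i)
        yield buf  # yields a reference that the task keeps mutating

with ThreadPoolExecutor(max_workers=1) as pool:
    futures = [pool.submit(report, item) for item in gen()]

# Depending on scheduling, every report may serialize the final [0, 1, 2]
# instead of [0], [0, 1], [0, 1, 2] -- the hazard the revert addresses.
print([pickle.loads(f.result()) for f in futures])
```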

---------

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
stephanie-wang added a commit to stephanie-wang/ray that referenced this pull request on Mar 27, 2024: … before continuing (ray-project#44257), with the same commit message as above.