Example for using a separate threadpool for CPU bound work (try 3) #16331

alamb · 2025-06-08T16:01:17Z

Note: This PR contains an example and supporting code. It has no changes to the core.

Which issue does this PR close?

Closes Document DataFusion Threading / tokio runtimes (how to separate IO and CPU bound work) #12393
Note: this is a new version of Example for using a separate threadpool for CPU bound work (try 2) #14286

Rationale for this change

I have heard from multiple people multiple times over multiple years that the specifics of using multiple threadpools for separate CPU and IO work in DataFusion are confusing.

They are not wrong, and it is a key detail for building low-latency, high-performance engines that process data directly from remote storage, which I think is a key capability for DataFusion.

My past attempts in #13424 and #14286 to make this example have been bogged down trying to get consensus on details of how to transfer results across streams, the wisdom of wrapping streams, and other details. Thankfully, due to work by @tustvold and @ion-elgreco, there is now a much better solution in ObjectStore 0.12.1: apache/arrow-rs-object-store#332

What changes are included in this PR?

thread_pools.rs example
Update documentation

Are these changes tested?

Yes, the example is run as part of CI and there are tests.

Are there any user-facing changes?

Not really

Omega359 · 2025-06-09T13:38:25Z

datafusion-examples/examples/thread_pools.rs

+    // systems, including remote catalog access, which is not included in this
+    // example.
+    let cpu_runtime = CpuRuntime::try_new()?;
+    let io_handle = Handle::current();


Question: this seems like the inverse of what I would have expected where DF would run on the current runtime and IO would run on a specialized runtime. Is there a reason why that would not work here? I would think it would simplify the code a fair bit.

I don't think there is any technical reason why that would not work.

The reason I did it this way is that I think most applications / server programs (and the examples from tokio, etc) use the runtime automatically created by tokio for IO and so I wanted to follow the same pattern.

I'll update the documentation to make this clearer.

My guess is that many (most?) systems don't have a way to push IO to a separate runtime, whereas it is often easier to do so with CPU work. With ObjectStore, at least, that isn't the case.

Yeah. Another consideration is that most uses of tokio are not for CPU bound work, so it makes sense to just use the default pool.

I updated it to say

/// This example uses the runtime created by [`tokio::main`] to do I/O and spawn
/// CPU intensive tasks on a separate [`Runtime`], mirroring the common pattern
/// when using Rust libraries such as `tonic`. Using a separate `Runtime` for
/// CPU bound tasks will often be simpler in larger applications, even though it
/// makes this example slightly more complex.
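To make the pattern discussed in this thread concrete, here is a minimal, dependency-free sketch of handing CPU-bound work to a dedicated pool and receiving the result back on the caller's side. It deliberately uses only `std` threads and channels rather than a second tokio `Runtime`, so it is not the code from `thread_pools.rs`. The `CpuPool` type and its `spawn` method are hypothetical names made up for illustration and are not the `CpuRuntime` from the PR.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// A boxed unit of CPU-bound work.
type Job = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical stand-in for a dedicated CPU runtime: a fixed set of worker
/// threads that pull jobs from a shared queue.
pub struct CpuPool {
    sender: Option<mpsc::Sender<Job>>,
    workers: Vec<thread::JoinHandle<()>>,
}

impl CpuPool {
    /// Start `num_threads` workers named `cpu-0`, `cpu-1`, ...
    pub fn new(num_threads: usize) -> Self {
        let (sender, receiver) = mpsc::channel::<Job>();
        let receiver = Arc::new(Mutex::new(receiver));
        let workers = (0..num_threads)
            .map(|i| {
                let receiver = Arc::clone(&receiver);
                thread::Builder::new()
                    .name(format!("cpu-{}", i))
                    .spawn(move || loop {
                        // The lock guard is dropped at the end of this
                        // statement, so the job itself runs without the lock.
                        let job = receiver.lock().unwrap().recv();
                        match job {
                            Ok(job) => job(),
                            // Channel closed: the pool is shutting down.
                            Err(_) => break,
                        }
                    })
                    .expect("failed to spawn worker thread")
            })
            .collect();
        Self { sender: Some(sender), workers }
    }

    /// Run `work` on the pool and return a receiver that yields its result.
    pub fn spawn<T, F>(&self, work: F) -> mpsc::Receiver<T>
    where
        F: FnOnce() -> T + Send + 'static,
        T: Send + 'static,
    {
        let (tx, rx) = mpsc::channel();
        let job: Job = Box::new(move || {
            // Ignore the error if the caller dropped the receiver.
            let _ = tx.send(work());
        });
        self.sender
            .as_ref()
            .expect("pool is running")
            .send(job)
            .expect("worker threads are alive");
        rx
    }
}

impl Drop for CpuPool {
    fn drop(&mut self) {
        // Closing the channel makes every worker's `recv` return `Err`.
        drop(self.sender.take());
        for worker in self.workers.drain(..) {
            let _ = worker.join();
        }
    }
}
```

A caller would create the pool once, for example `let pool = CpuPool::new(4);`, submit work with `pool.spawn(|| expensive_computation())` (where `expensive_computation` is a placeholder), and read the result from the returned receiver. In a tokio application the calling task would await the result instead of blocking on `recv`, which is what keeps the I/O runtime's threads free while the CPU work runs.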

Example for using a separate threadpool for CPU bound work (try 3)

c908a09

Pare back example

github-actions bot added the core Core DataFusion crate label Jun 8, 2025

alamb mentioned this pull request Jun 8, 2025

Example for using a separate threadpool for CPU bound work (try 2) #14286

Closed

alamb added the documentation Improvements or additions to documentation label Jun 8, 2025

Omega359 reviewed Jun 9, 2025

Update datafusion-examples/examples/thread_pools.rs

56b0a3f

Co-authored-by: Bruce Ritchie <bruce.ritchie@veeva.com>

github-actions bot removed the documentation Improvements or additions to documentation label Jun 9, 2025

alamb added 2 commits June 9, 2025 10:42

Merge remote-tracking branch 'apache/main' into alamb/threadpool_exam…

9b2e770

…ple4

Add a note about why the main Runtime is used for IO and not CPU

cc26de1

Omega359 reviewed Jun 9, 2025

remove random thought

1cdf543
