performance[vortex-scan]: improve IO pipelining #4993
base: develop
Conversation
The issue with the current approach of using `buffered(concurrency)` is that all of the `concurrency` tasks currently executing might be awaiting the same coalesced IO operation, which stops other `split_exec` tasks from making progress by issuing and awaiting their own IO requests.

The approach proposed in this commit is to group all `split_exec` tasks into `concurrency` `Futures{Ordered,Unordered}` groups, so that `concurrency` tasks execute in parallel while the other futures in each group can make progress concurrently during IO awaits.

This results in better pipelining of IO operations, with the downside of possibly increased memory use. This patch was motivated because `drive_send` seems to expect to execute `CONCURRENCY` (192 by default) concurrent requests, but in practice far fewer run concurrently because of the `split_exec` scheduling strategy.
Signed-off-by: Alfonso Subiotto Marques <alfonso.subiotto@polarsignals.com>
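A minimal sketch of the grouping idea described above, assuming each group is driven by its own spawned tokio task; `Split`, `split_exec`, and the surrounding plumbing are hypothetical stand-ins, not the actual vortex-scan code:

```rust
use futures::stream::{FuturesUnordered, StreamExt};

// Hypothetical stand-ins for the real scan types.
struct Split;
async fn split_exec(_split: Split) { /* issue and await coalesced reads, then evaluate */ }

async fn exec_grouped(splits: Vec<Split>, concurrency: usize) {
    // Round-robin the splits into `concurrency` groups.
    let mut groups: Vec<Vec<Split>> = (0..concurrency).map(|_| Vec::new()).collect();
    for (i, split) in splits.into_iter().enumerate() {
        groups[i % concurrency].push(split);
    }

    // One spawned task per group gives parallelism across groups; within a
    // group, FuturesUnordered keeps polling the remaining futures while any
    // one of them is parked on an IO await, so new IO keeps being issued.
    // (FuturesOrdered could be used instead where output order matters.)
    let handles: Vec<_> = groups
        .into_iter()
        .map(|group| {
            tokio::spawn(async move {
                let mut futs: FuturesUnordered<_> =
                    group.into_iter().map(split_exec).collect();
                while let Some(()) = futs.next().await {}
            })
        })
        .collect();

    for handle in handles {
        handle.await.expect("group task panicked");
    }
}
```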
    
Benchmarks: FineWeb NVMe
Benchmarks: FineWeb S3
Benchmarks: TPC-H SF=1 on NVMe
Codecov Report
Thank you for starting this discussion; indeed, it's definitely an open area of design/research. I'm hoping to make some unrelated changes that should allow us to collect the futures for a split up-front. Maybe this is helpful: for example, the concurrency of splits should probably be driven by the total size of the in-use buffers, rather than by a fixed concurrency. As you say, all splits may be referencing the same future... but being able to cap memory usage is probably important in any solution here.

It seems like this PR essentially attempts to make progress on all tasks at once in either branch (with different spawning groups)?

The other thing to note is the way stream `buffered` works. I think this is how it works: a new future is polled only if all current futures return Pending. We may want a custom version of `buffered` that always tries to poll new futures.
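One common way to realize "concurrency driven by the total size of the in-use buffers" is a byte-budget semaphore rather than a fixed future count. A rough sketch only, assuming a hypothetical `Split` type, a per-split size estimate, and tokio; this is not something Vortex does today:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

// Hypothetical stand-ins.
struct Split;
fn estimated_read_bytes(_split: &Split) -> u32 { 1 << 20 } // placeholder estimate
async fn split_exec(_split: Split) { /* issue reads, evaluate, drop buffers */ }

async fn run_with_byte_budget(splits: Vec<Split>, budget_bytes: usize) {
    let budget = Arc::new(Semaphore::new(budget_bytes));
    let mut tasks = Vec::new();
    for split in splits {
        let budget = Arc::clone(&budget);
        tasks.push(tokio::spawn(async move {
            // Hold permits equal to the split's estimated buffer footprint for
            // as long as its buffers are alive: total memory is capped, while
            // any number of small reads may be in flight concurrently.
            let permits = estimated_read_bytes(&split);
            let _permit = budget.acquire_many(permits).await.expect("semaphore closed");
            split_exec(split).await;
            // `_permit` is dropped here, returning the bytes to the budget.
        }));
    }
    for task in tasks {
        task.await.expect("task panicked");
    }
}
```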
    
Benchmarks: TPC-H SF=1 on S3
Benchmarks: TPC-H SF=10 on NVMe
Benchmarks: TPC-DS SF=1 on NVMe
Benchmarks: TPC-H SF=10 on S3
Benchmarks: Clickbench on NVMe
Benchmarks: Statistical and Population Genetics
    
          
Yes, this is probably a good idea. I also really like the idea of pre-planning all IO up front. Right now we're in an in-between state where we already know the individual reads we want to plan, but leave the coalescing/pipelining to be decided dynamically during scan execution, which depends on a lot of scheduling details (hence this PR). Ideally we would collect all reads (as we do now) and coalesce them at plan time, eagerly executing the ones that don't have any dependencies (i.e. filters) up to the memory limit, and then executing a new read as soon as there is capacity available and all of its read dependencies have been consumed (i.e. projections after filters); see the sketch below.

Yes, although, like the current behavior, it is only spawning

I think this is wrong and the core issue here.
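A minimal sketch of that plan-time gating idea, where filter reads start eagerly and a split's projection read is released only once its filter result is known. `FilterRead`, `ProjectionRead`, and the exec functions are hypothetical, `projections` is assumed to be indexed by split id, and the memory-limit gating is elided:

```rust
use futures::stream::{FuturesUnordered, StreamExt};

// Hypothetical stand-ins for planned reads; not Vortex types.
struct FilterRead { split_id: usize }
struct ProjectionRead;

async fn exec_filter(read: FilterRead) -> (usize, bool) {
    // Issue the coalesced filter IO and evaluate the predicate; report whether
    // any rows matched so the dependent projection read is only issued if needed.
    (read.split_id, true)
}

async fn exec_projection(_read: ProjectionRead) {
    // Issue the coalesced projection IO for a split whose filter matched.
}

async fn run(filters: Vec<FilterRead>, mut projections: Vec<Option<ProjectionRead>>) {
    // Filter reads have no dependencies, so they all start up front (a real
    // implementation would also gate this on the memory budget).
    let mut filter_futs: FuturesUnordered<_> = filters.into_iter().map(exec_filter).collect();
    let mut proj_futs = FuturesUnordered::new();

    // As each filter result arrives, its projection read is released, so
    // projection IO overlaps with the remaining filter IO.
    loop {
        tokio::select! {
            Some((split_id, matched)) = filter_futs.next() => {
                if matched {
                    if let Some(read) = projections[split_id].take() {
                        proj_futs.push(exec_projection(read));
                    }
                }
            }
            Some(()) = proj_futs.next() => {}
            else => break,
        }
    }
}
```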
    
@gatesn just wanted to close the loop here. When we chatted last week it seemed like the lack of pipelining in the original screenshot is somewhat expected, but I'm still not sure I fully understand. The file itself had zero rows that matched the filter, so there is only an initial filter evaluation against the actual chunks; but since these evaluations are independent, they should be parallelized as much as possible, right? What do you think the next steps would be for better pipelining? OOC, is this something you're actively working on (it seems kind of related to the operator changes)?
    
It's something I'm thinking about, but not something I'm actively working on. If we scope "operator changes" to mean those for evaluating in-memory Vortex arrays, we have a similar set of changes for layouts and async out-of-memory compute too. I can try to write down my current thoughts and see if we can drive this forwards in parallel?
    
Sure! I can't guarantee I'll have time to work on anything large, but I'm happy to discuss and see how we can drive this forward.
    
Note
This PR is a proposal and might not be merged in its current form, or at all. The primary motivation is performance, but I would also like to trigger a discussion about the intended scheduling strategy for scans. The downside is that we reduce or even eliminate the user's ability to apply backpressure to the scan by not polling from the stream. At the same time, there seems to be an assumption/expectation that we highly parallelize IO:
vortex/vortex-io/src/file/object_store.rs
Line 24 in d37444a
The issue with the current approach of using `buffered(concurrency)` is that all of the `concurrency` tasks currently executing might be awaiting the same coalesced IO operation, which stops other `split_exec` tasks from making progress by issuing and awaiting their own IO requests.

The approach proposed in this commit is to group all `split_exec` tasks into `concurrency` `Futures{Ordered,Unordered}` groups, so that `concurrency` tasks execute in parallel while the other futures in each group can make progress concurrently (not in parallel) during IO awaits.

This results in better pipelining of IO operations, with the downside of possibly increased memory use. This patch was motivated because `drive_send` seems to expect to execute `CONCURRENCY` (192 by default) concurrent requests:

vortex/vortex-io/src/file/object_store.rs
Line 24 in d37444a
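For reference, the current strategy described above is roughly the standard `buffered` pattern over the split futures. A simplified sketch, using the same hypothetical `Split`/`split_exec` stand-ins rather than the actual scan code:

```rust
use futures::stream::{self, StreamExt};

// Hypothetical stand-ins for the real scan types.
struct Split;
async fn split_exec(_split: Split) { /* issue and await coalesced reads */ }

const CONCURRENCY: usize = 192;

async fn scan_buffered(splits: Vec<Split>) {
    // Up to CONCURRENCY split_exec futures are in flight at once, but if the
    // futures currently in flight are all awaiting the same coalesced IO
    // request, no further splits are started and no new IO is issued, which is
    // the pipelining gap this PR targets.
    stream::iter(splits)
        .map(split_exec)
        .buffered(CONCURRENCY)
        .for_each(|()| async {})
        .await;
}
```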
Before: (screenshot)

After: (screenshot)