parquet: an arrow::async_writer::AsyncArrowWriter that splits CPU-bound and I/O bound tasks #5269
Comments
One way this could be achieved is to provide an AsyncWrite that spawns the IO work to a different threadpool. Perhaps we could take a step back as to why you want to separate CPU-bound and IO-bound work? Traditionally this is done to prevent stalling out worker threads on blocking IO, but tokio doesn't perform blocking IO, so I wonder what the motivation here is?
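For illustration, a rough sketch of such an adapter, assuming tokio: `OffloadWriter`, its constructor, and the output path are hypothetical (not arrow-rs API), and the unbounded channel means the encoder gets no backpressure from the I/O side.

```rust
// Rough sketch (not arrow-rs API): an AsyncWrite that hands every chunk of bytes to a
// task running on a dedicated I/O runtime. `OffloadWriter` is a hypothetical name, and
// errors in the spawned task are simply panicked for brevity.
use std::pin::Pin;
use std::task::{Context, Poll};

use tokio::io::{AsyncWrite, AsyncWriteExt};
use tokio::sync::mpsc;

struct OffloadWriter {
    tx: mpsc::UnboundedSender<Vec<u8>>,
}

impl OffloadWriter {
    /// Spawn the draining task on `io_runtime`; the returned writer can be used anywhere.
    fn new(io_runtime: &tokio::runtime::Handle, path: &'static str) -> Self {
        let (tx, mut rx) = mpsc::unbounded_channel::<Vec<u8>>();
        io_runtime.spawn(async move {
            let mut file = tokio::fs::File::create(path).await.expect("create file");
            while let Some(chunk) = rx.recv().await {
                file.write_all(&chunk).await.expect("write chunk");
            }
            file.shutdown().await.expect("flush file");
        });
        Self { tx }
    }
}

impl AsyncWrite for OffloadWriter {
    fn poll_write(
        self: Pin<&mut Self>,
        _cx: &mut Context<'_>,
        buf: &[u8],
    ) -> Poll<std::io::Result<usize>> {
        // Queue the bytes for the I/O runtime; the unbounded channel never blocks the
        // caller, which also means there is no backpressure on the encoder.
        match self.tx.send(buf.to_vec()) {
            Ok(()) => Poll::Ready(Ok(buf.len())),
            Err(e) => Poll::Ready(Err(std::io::Error::new(std::io::ErrorKind::BrokenPipe, e))),
        }
    }

    fn poll_flush(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<std::io::Result<()>> {
        Poll::Ready(Ok(()))
    }

    fn poll_shutdown(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<std::io::Result<()>> {
        Poll::Ready(Ok(()))
    }
}
```

An AsyncArrowWriter built over such a writer would only push bytes onto the channel from the caller's runtime while the file writes run on the I/O runtime; the CPU-bound encoding itself would still happen on the caller's thread, which is the point raised below.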
Emm, IIUC, blocking operations don't only include blocking I/O; CPU-bound computations, if they take too much time, would still be considered blocking?
Indeed, and if you multiplex tasks that are tail-latency sensitive on the same threadpool as CPU-bound work, this may cause issues. So you shouldn't stick your web frontend in the same tokio threadpool as the one that handles your query processing, but there is nothing inherently wrong with "blocking" a tokio worker, provided you don't have tasks in that threadpool sensitive to tail latencies. For more context:
Right, this is the case we are facing now; we have a 2-thread runtime that will:
Since
I think I must be missing something; why can't you just treat
Well, this is probably very specific to our use case: we want to treat it as an I/O-bound task and delegate compression to another thread or threadpool
Parquet encoding is not I/O-bound, compression especially so, with many of the encodings very expensive to compute. You would need to move the entire synchronous parquet writer to this separate CPU-bound pool, leaving only the I/O behind, at which point you've really just reimplemented AsyncArrowWriter 😅
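For what it's worth, a minimal sketch of that shape, assuming tokio and the existing synchronous ArrowWriter; the `ChannelWriter` glue type, the output path, and the error handling are made up for illustration, not parquet crate API.

```rust
// Minimal sketch: run the synchronous ArrowWriter (encoding + compression) on the
// blocking pool, and leave only the file writes on the async runtime.
use std::io::Write;

use arrow::datatypes::SchemaRef;
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use tokio::io::AsyncWriteExt;
use tokio::sync::mpsc;

/// Hypothetical adapter: a blocking `std::io::Write` that ships encoded bytes to an async task.
struct ChannelWriter {
    tx: mpsc::UnboundedSender<Vec<u8>>,
}

impl Write for ChannelWriter {
    fn write(&mut self, buf: &[u8]) -> std::io::Result<usize> {
        self.tx
            .send(buf.to_vec())
            .map_err(|e| std::io::Error::new(std::io::ErrorKind::BrokenPipe, e))?;
        Ok(buf.len())
    }

    fn flush(&mut self) -> std::io::Result<()> {
        Ok(())
    }
}

async fn write_parquet(
    schema: SchemaRef,
    batches: Vec<RecordBatch>,
) -> Result<(), Box<dyn std::error::Error>> {
    let (tx, mut rx) = mpsc::unbounded_channel::<Vec<u8>>();

    // I/O side: stays on the async runtime and only performs writes.
    let io_task = tokio::spawn(async move {
        let mut file = tokio::fs::File::create("out.parquet").await?;
        while let Some(chunk) = rx.recv().await {
            file.write_all(&chunk).await?;
        }
        file.shutdown().await?;
        Ok::<_, std::io::Error>(())
    });

    // CPU side: encoding and compression run on tokio's blocking pool
    // (a dedicated thread or rayon pool would work the same way).
    let encode_task = tokio::task::spawn_blocking(move || {
        let sink = ChannelWriter { tx };
        let mut writer = ArrowWriter::try_new(sink, schema, None)?;
        for batch in &batches {
            writer.write(batch)?;
        }
        writer.close()?; // dropping the sender ends the I/O loop above
        Ok::<_, parquet::errors::ParquetError>(())
    });

    encode_task.await??;
    io_task.await??;
    Ok(())
}
```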
That's right
My last statement is probably not that correct: we are aware that encoding/compression is very CPU-bound and writing the parquet files is I/O-bound, and we want to separate these two things, leaving only the I/O tasks in our runtime. I guess this is very specific to our usage and we should implement our own. Thanks for your reply :)
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The AsyncArrowWriter type can write a RecordBatch asynchronously by buffering the contents to write, but it does not split CPU-bound and I/O-bound tasks. Say compression is enabled, which is CPU-bound, and the compressed contents are then written to disk, which is I/O-bound: the current API does both together, without spawn_blocking() or anything similar, and so it blocks the underlying runtime thread while doing the CPU-bound work.
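For reference, a rough sketch of what driving AsyncArrowWriter looks like, assuming a recent `parquet` release (older versions of `try_new` also took a buffer-size argument); both the encoding/compression and the flushing to the file happen from these calls on the caller's runtime thread.

```rust
// Rough sketch of using AsyncArrowWriter as it exists today; file name and schema are
// made up for illustration.
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::async_writer::AsyncArrowWriter;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("x", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef],
    )?;

    let file = tokio::fs::File::create("out.parquet").await?;
    let mut writer = AsyncArrowWriter::try_new(file, schema, None)?;

    // CPU-bound encoding (and compression, if enabled) runs here, inline...
    writer.write(&batch).await?;
    // ...and the buffered bytes are flushed to the file from the same task.
    writer.close().await?;
    Ok(())
}
```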
Describe the solution you'd like
I would like an API that handles CPU-bound and I/O-bound tasks separately, or maybe two APIs?
Describe alternatives you've considered
Additional context
#3957 is the PR that added this AsyncArrowWriter; the discussion of the implementation proposal is in #1269.