Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parquet: an arrow::async_writer::AsyncArrowWriter that splits CPU-bound and I/O bound tasks #5269

Closed
SteveLauC opened this issue Jan 2, 2024 · 8 comments
Labels
development-process Related to development process of arrow-rs enhancement Any new improvement worthy of a entry in the changelog

Comments

@SteveLauC
Copy link
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The AsyncArrowWriter type can write a RecordBatch asynchronously by buffering the contents to write, which does not seem to split CPU-bound and I/O-bound tasks. Let's say that I have enabled the compression, which is CPU-bound, then I will write the compressed contents to disk, which is I/O-bound, the current API does these 2 things together without using spawn_blocking() or something similar to that, and thus will block the underlying runtime thread while doing CPU-bound tasks.

Describe the solution you'd like

I would like to have an API that handles CPU-bound or I/O-bound tasks separately, or maybe 2 APIs?

Describe alternatives you've considered

Additional context

#3957 is the PR that added this AsyncArrowWriter, the discussion of the implementation proposal is in #1269.

@SteveLauC SteveLauC added the enhancement Any new improvement worthy of a entry in the changelog label Jan 2, 2024
@SteveLauC SteveLauC changed the title parquet: an arrow::AsyncArrowWriter that splits CPU-bound and I/O bound tasks parquet: an arrow::async_writer::AsyncArrowWriter that splits CPU-bound and I/O bound tasks Jan 2, 2024
@tustvold
Copy link
Contributor

tustvold commented Jan 2, 2024

One way this could be achieved is to provide an AsyncWrite that spawns the IO work to a different threadpool.

Perhaps we could take a step back as to why you want to separate CPU-bound and IO-bound work? Traditionally this is done to prevent stalling out worker threads on blocking IO, but tokio doesn't perform blocking IO, so I wonder what the motivation here is?

@SteveLauC
Copy link
Contributor Author

Perhaps we could take a step back as to why you want to separate CPU-bound and IO-bound work? Traditionally this is done to prevent stalling out worker threads on blocking IO, but tokio doesn't perform blocking IO, so I wonder what the motivation here is?

Emm, IIUC, blocking operations do not only include blocking I/O, CPU-bound computations (if take too much time), would still be considered blocking?

@tustvold
Copy link
Contributor

tustvold commented Jan 2, 2024

would still be considered blocking

Indeed, and if you multiplex tasks that are tail latency sensitive on the same threadpool as CPU-bound work this may cause issues. So you shouldn't stick your web frontend in the same tokio threadpool as handles your query processing, but there is nothing inherently wrong with "blocking" a tokio worker, provided you don't have tasks in that threadpool sensitive to tail latencies.

For more context:

@SteveLauC
Copy link
Contributor Author

Indeed, and if you multiplex tasks that are tail latency sensitive on the same threadpool as CPU-bound work this may cause issues.

Right, this is the case we are facing now, we have 2 threads runtime that will:

  1. write WAL files
  2. write parquet files

Since AsyncArrowWriter mixes CPU-bound and I/O bound tasks in one API, we cannot distribute the CPU-bound task (compression) to another thread, so I filed this issue

@tustvold
Copy link
Contributor

tustvold commented Jan 2, 2024

I think I must be missing something, why can't you just treat AsyncArrowWriter as a CPU-bound task that occasionally waits on IO, are you trying to use some non-async threadpool or something?

@SteveLauC
Copy link
Contributor Author

SteveLauC commented Jan 2, 2024

I think I must be missing something, why can't you just treat AsyncArrowWriter as a CPU-bound task that occasionally waits on IO, are you trying to use some non-async threadpool or something?

Well, this is probably very specific to our use case, we want to treat it as an I/O bound task, and delegate compression to another thread or threadpool

@tustvold
Copy link
Contributor

tustvold commented Jan 2, 2024

we want to treat it as an I/O bound task

Parquet encoding is not IO-bound, compression especially so, with many of the encodings very expensive to compute. You would need to move the entire synchronous parquet writer to this separate CPU-bound pool, leaving only the IO behind, at which point you've really just reimplemented AsyncArrowWriter 😅

@SteveLauC
Copy link
Contributor Author

Parquet encoding is not IO-bound, compression especially so, with many of the encodings very expensive to compute.

That's right

we want to treat it as an I/O bound task

My last statement is probably not that correct, we are aware that encoding/compression is very CPU-bound, and writing the parquet files is I/O-bound, we want to separate these 2 things, leaving only the I/O tasks in our runtime.

I guess this is very specific to our usage and we should implement our own AsyncArrowWriter.

Thanks for your reply:)

@tustvold tustvold added the development-process Related to development process of arrow-rs label Jan 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
development-process Related to development process of arrow-rs enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

No branches or pull requests

2 participants