Skip to content

fix: allow AsyncFileWriter to be not Send #7613

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jpopesculian
Copy link
Contributor

Which issue does this PR close?

Closes #7612.

What changes are included in this PR?

Changes BoxFuture to LocalBoxFuture. I tried a few implementations like using async in the trait or using associated types for the futures, but I find this one to be the easiest to manage, although it unfortunately does not allow for the future to be send if the user so desires. I think generally, using poll and context would make this more flexible, but I didn't want to overcomplicate this PR as generally I think people implement this trait just for use with AsyncArrowWriter

Are there any user-facing changes?

This is a breaking change as it changes the trait implementation and would be fixed wherever a user implements that trait. It's an easy fix however because LocalBoxFuture is always compatible to BoxFuture so the user would just have to call boxed_local instead of boxed.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 5, 2025
@jpopesculian
Copy link
Contributor Author

Let me know how y'all feel about this. I think ideally actually we would get rid of AsyncFileWriter and just use AsyncWrite directly. And for AsyncFileReader use something like the following

trait AsyncFileReader: AsyncRead + AsyncSeek {
  fn poll_metadata(
    self: Pin<&mut Self>,
    cx: &mut Context<'_>,
    options: Option<&'a ArrowReaderOptions>
  ) -> Poll<Result<Arc<ParquetMetaData>>> {
    // default implementation
  }
}

and then implement the appropriate futures. That way we don't need to worry about the specifics of Send and things

@Xuanwo
Copy link
Member

Xuanwo commented Jun 5, 2025

Hi, does AsyncWrite here mean tokio::AsyncWrite?

@jpopesculian
Copy link
Contributor Author

Hi, does AsyncWrite here mean tokio::AsyncWrite?

yes! same with AsyncRead and AsyncSeek in my previous comment

@Xuanwo
Copy link
Member

Xuanwo commented Jun 5, 2025

yes! same with AsyncRead and AsyncSeek in my previous comment

The motivation behind this trait is to enable zero-copy data writing. Almost all Arrow users generate Bytes, and all storage implementations accept Bytes as input. By doing this, we can simply reuse the users' bytes without making an additional copy. This can make a significant difference when writing large files.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As written this will make the functions within the writer !Send which is not something we can really do. One option might be to switch the trait to using GATs instead of boxed futures, as this will allow the machinery to work at least in theory, although in practice the compiler can get confused...

Another option might be to add a regular associated type, and the implementer can choose to use BoxFuture or LocalBoxFuture

@tustvold tustvold added the api-change Changes to the arrow API label Jun 5, 2025
@jpopesculian jpopesculian force-pushed the fix-async-arrow-writer-no-send branch from b636a82 to e306c7c Compare June 5, 2025 16:10
@jpopesculian
Copy link
Contributor Author

yes! same with AsyncRead and AsyncSeek in my previous comment

The motivation behind this trait is to enable zero-copy data writing. Almost all Arrow users generate Bytes, and all storage implementations accept Bytes as input. By doing this, we can simply reuse the users' bytes without making an additional copy. This can make a significant difference when writing large files.

Ooh gotcha, makes a lot of sense

@jpopesculian jpopesculian requested a review from tustvold June 5, 2025 16:11
@jpopesculian
Copy link
Contributor Author

I've gone with an associated type because GATs do not allow for dyn

@tustvold
Copy link
Contributor

tustvold commented Jun 5, 2025

Yeah, the lifetimes kind of turn into a bit of a mess... I'd need to sit down and check but I'm fairly confident constraining the lifetime on the trait will cause very hard to debug issues down the line.

Ultimately the AsyncArrowWriter isn't really doing all that much, its basically just a shim around the sync writer that checks its buffer after every batch and flushes it if it has gotten large. I wonder if the simplest option is just for you to do this yourself...

@jpopesculian
Copy link
Contributor Author

Yeah, I agree. It's a little ugly. I think alternatively we could implement the trait with a poll function and implement a future on top of it. It's harder to implement but its easiest for the compiler to pick up. I'll implement it and you can tell me what you think

@jpopesculian
Copy link
Contributor Author

I'm going to assume this probably won't be a desirable solution for you @tustvold, but I'll link it for completeness main...jpopesculian:arrow-rs:fix-async-arrow-writer-no-send-2

@jpopesculian
Copy link
Contributor Author

Maybe the solution is to put a feature flag on it @tustvold? Something like async-no-send that just replaces all BoxFuture with LocalBoxFutures?

@jpopesculian jpopesculian force-pushed the fix-async-arrow-writer-no-send branch 3 times, most recently from 497d3ca to c1ea3ff Compare June 6, 2025 18:18
@jpopesculian
Copy link
Contributor Author

I updated with a feature flag. I think its the cleanest solution (:

@jpopesculian jpopesculian force-pushed the fix-async-arrow-writer-no-send branch 5 times, most recently from b3b49cc to fcc91db Compare June 11, 2025 14:41
@alamb alamb marked this pull request as draft June 12, 2025 12:34
@alamb alamb marked this pull request as ready for review June 12, 2025 12:35
@jpopesculian
Copy link
Contributor Author

Any suggestions on this one @tustvold ?

@jpopesculian jpopesculian force-pushed the fix-async-arrow-writer-no-send branch from fcc91db to bbf2a8f Compare July 20, 2025 19:17
@alamb alamb marked this pull request as draft August 18, 2025 18:00
@alamb
Copy link
Contributor

alamb commented Aug 18, 2025

I believe we will support a similar usecase for the reader here:

Basically we will provide a parquet decoder that is not tied to the particular IO

I think you can do something similar already for writing using https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowColumnWriter.html thought that may require buffering in a writer / using channels

Maybe we should look into making the split between IO and CPU clearer in the writer too

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AsyncFileWriter doesn't need to be Send
4 participants