Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RecordBatchWriter trait can't be used in dynamic context due to close(self) #6326

Open
ashtuchkin opened this issue Aug 29, 2024 · 7 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@ashtuchkin
Copy link
Contributor

ashtuchkin commented Aug 29, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I'm trying to use RecordBatchWriter trait to introduce dynamism in the type of output - e.g. depending on the command line options, I'd like to write parquet, csv, json or ipc. I know that corresponding writers all implement this trait, thanks to #4206 and #4228 by @alexandreyc.

Specifically, I was planning to use Box<dyn RecordBatchWriter> and pass it to the writing code, something like this:

let writer: Box<dyn RecordBatchWriter> = match args.format {
    "parquet" => Box::new(parquet::arrow::ArrowWriter::try_new(file, schema, None)),
    "csv" => Box::new(arrow::csv::Writer::new(file)),
    ...
}

writer.write(my_batch);
writer.close();  // this is required to write the formats correctly

Unfortunately, due to close method in that trait consuming self, we can't really use this approach, as now close() needs to know the exact type of the writer. Calling close() on the trait object gives compilation error "the size of dyn RecordBatchWriter cannot be statically determined". See example in Rust playground.

Describe the solution you'd like

Add a finish(&mut self) method to the trait. This is backwards-compatible, as well as an established pattern e.g. see parquet writers.

This will not make the trait object-safe, but will unblock usage as Box<dyn RecordBatchWriter>.

Describe alternatives you've considered

Manually downcasting to specific types and then calling close() on them, but that doesn't seem to work, as RecordBatchWriter needs to explicitly have as_any method or similar.

@ashtuchkin ashtuchkin added the enhancement Any new improvement worthy of a entry in the changelog label Aug 29, 2024
@ashtuchkin ashtuchkin changed the title RecordBatchWriter trait can't really be used due to close(self) RecordBatchWriter trait can't be used in dynamic context due to close(self) Aug 29, 2024
@ashtuchkin
Copy link
Contributor Author

Ok trying to reimplement it myself, looks like the problem is that parquet::arrow::ArrowWriter has close(self) too. Any advice on how to deal with this would be appreciated.

@ashtuchkin
Copy link
Contributor Author

Well good news is that now we have finish(&mut self), added in March in #5471, so maybe we can switch to that?

@tustvold
Copy link
Contributor

tustvold commented Aug 29, 2024

On the other hand, I don't see how this trait can reasonably be used in client code in the current state

It is designed to be used with a generic type constraint. IIRC we switched to taking self to make it easier when working with non-static Write, and to prevent calling close multiple times

So you would instead have something like

match args.format {
    "parquet" => write_batches(parquet::arrow::ArrowWriter::try_new(file, schema, None), ...)?,
    "csv" => write_batches(arrow::csv::Writer::new(file), ...)?,
    ...
}

fn write_batches(w: impl RecordBatchWriter, batches: impl Iterator<Item=RecordBatch>) -> Result<()> {
    batches.try_for_each(|b| w.write(b))?;
    w.close()
}

@ashtuchkin
Copy link
Contributor Author

That works in simple cases, yes. I've tried this in my slightly more complex code and then spent unreasonable amount of time trying to make it work. Specifically, the writer in my case is stateful, so I created a struct with generic W: RecordBatchWriter and needed to specialize it for every format manually, like this

struct MyWriter<W: RecordBatchWriter> {
    writer: W,
    ...
}

let writer: Box<dyn MyWriter> = match args.format {
    "parquet" => Box::new(MyWriter::new(parquet::arrow::ArrowWriter::try_new(file, schema, None), ...)?),
    ...
}

use_my_writer_as_part_of_a_pipeline(writer);

Now we can't really do that ^^ because MyWriter is not a trait, so we need to introduce another higher-level trait.

Here's alternative proposal - can we add a finish(&mut self) to the trait to enable dynamic dispatch? We have a similar semantics in parquet where we have consuming close and non-consuming finish. That should be backwards compatible and would resolve the difficulties.

@ashtuchkin
Copy link
Contributor Author

One other consideration is the fn write_batches above can only write the whole file at a time, which means we have to keep all the file contents in memory. If we want to do any kind of streaming, we're back to the same difficulties.

@tustvold
Copy link
Contributor

tustvold commented Aug 29, 2024

Here's alternative proposal - can we add a finish(&mut self) to the trait to enable dynamic dispatch

I believe the trait would still not be object safe, due to the presence of the close method

If we want to do any kind of streaming, we're back to the same difficulties.

An iterator / stream can compute the next batch on demand

so we need to introduce another higher-level trait.

Or you can just define your own trait over the writers...

@ashtuchkin
Copy link
Contributor Author

Here's alternative proposal - can we add a finish(&mut self) to the trait to enable dynamic dispatch

I believe the trait would still not be object safe, due to the presence of the close method

It won't be object-safe, but we can use both write and finish methods dynamically - see playground.

If we want to do any kind of streaming, we're back to the same difficulties.

An iterator / stream can compute the next batch on demand

Uffh I'd need to restructure the whole program to support that :)

so we need to introduce another higher-level trait.

Or you can just define your own trait over the writers...

That's what I'm doing now and it's sad that my own trait is so close to existing one and I have to reimplement basically the same things.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

No branches or pull requests

2 participants