-
Notifications
You must be signed in to change notification settings - Fork 1k
Reuse zstd compression context when writing IPC #8405
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🤖 |
🤖: Benchmark completed Details
|
That is a very nice result 🚀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @albertlockett and @mbrobbel
This code looks good to me (nice results!) but I wonder if we can avoid the breaking change)? I left some possible ideas
} | ||
|
||
// the reason we allow derivable_impls here is because when zstd feature is not enabled, this | ||
// becomes derivable. however with zstd feature want to be explicit about the compression level. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you for this context
arrow-ipc/src/writer.rs
Outdated
/// let data_gen = IpcDataGenerator::default(); | ||
/// let (encoded_dictionaries, encoded_message) = data_gen | ||
/// .encoded_batch(&batch, &mut dictionary_tracker, &options) | ||
/// .encoded_batch(&batch, &mut dictionary_tracker, &options, &mut compression_context) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an API change, right?
Is there some reason we can't add a CompressionContext
as a field to IpcDataGenerator to avoid the API change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh, that's tricky because the context needs to be &mut
(because the zstd::bulk::Compressor
it contains also needs to be a & mut
to call compress
), so if the context was a member of IpcDataGenerator
, the encode_batch's receiver would also need to be an &mut self
, which is again a breaking change.
Would it be too hacky if we added a encode_batch_with_context
method, and then changed the implementation of encode_batch
to just call that new method with &mut Default::default()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally think that would be nicer for downstream users
I think we should also deprecate the old one as described here https://github.com/apache/arrow-rs?tab=readme-ov-file#deprecation-guidelines (to help guide downstream users how to change their code)
Perhaps we could cal it encode
instead of encode_batch_with_context
to keep it shorter 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb made that change
/// compression. | ||
pub struct CompressionContext { | ||
#[cfg(feature = "zstd")] | ||
compressor: zstd::bulk::Compressor<'static>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This always contains zstd::bulk::Compressor
even when using lz4 compression?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we do the same for lz4_flex::frame::FrameEncoder
, does it help?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This always contains zstd::bulk::Compressor even when using lz4 compression?
If using lz4 compression, I imagine the the zstd
feature wouldn't be enabled and this would just be an empty struct, right?
Can we do the same for lz4_flex::frame::FrameEncoder, does it help?
I imagine that it probably would help, although I didn't investigate as my use case was only focussed on zstd. My motivation behind adding this CompressionContext
was that eventually it would be a good place to put something like this for lz4
. Maybe we could do this in a followup issue/PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A follow on sounds like a good idea to me -- I filed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM -- thank you @albertlockett and @mbrobbel
/// compression. | ||
pub struct CompressionContext { | ||
#[cfg(feature = "zstd")] | ||
compressor: zstd::bulk::Compressor<'static>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A follow on sounds like a good idea to me -- I filed
/// Encodes a batch to a number of [EncodedData] items (dictionary batches + the record batch). | ||
/// The [DictionaryTracker] keeps track of dictionaries with new `dict_id`s (so they are only sent once) | ||
/// Make sure the [DictionaryTracker] is initialized at the start of the stream. | ||
#[deprecated(since = "57.0.0", note = "Use `encode` instead")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Which issue does this PR close?
Rationale for this change
Reusing the zstd context between subsequent calls to compress_zstd in the Arrow IPC writer for performance improvement.
Benchmark results:
What changes are included in this PR?
Adds a
CompressionContext
struct, which when the zstd feature is enabled contains a zstd::bulk::Compressor object. This context object is owned by the ipcStreamWriter
/FileWriter
objects and is passed by mutable reference through theIpcDataGenerator
to theCompressionCodec
where it is used when compressing the ipc bytes.Are these changes tested?
Yes the existing unit tests cover the changed code paths
Are there any user-facing changes?
Yes, the method
IpcDataGenerator::encoded_batch
now takes&mut CompressionContext
as an argument.