Skip to content

Extraction to preallocated buffers#312

Draft
dawidpilarski wants to merge 4 commits into
rikyoz:developfrom
dawidpilarski:extraction-to-preallocated-buffers
Draft

Extraction to preallocated buffers#312
dawidpilarski wants to merge 4 commits into
rikyoz:developfrom
dawidpilarski:extraction-to-preallocated-buffers

Conversation

@dawidpilarski
Copy link
Copy Markdown

@dawidpilarski dawidpilarski commented Sep 18, 2025

This is a sketch of an idea to extract data to preallocated buffers.

The idea stays the same as previously - add API to extract archive to preallocated buffers.

When implementing that functionality with RawDataExtractCallback I realised, that I am missing information
about when the new file starts and when it ends.

Hence idea of yet new API appeared.

Once the idea is approved I will:

  • cleanup the code
  • add more thorough testing
  • add size hint to the FileAwareExtraction interface
  • unify RawDataExtractCallback and RawDataExtractCallbackWid

Description

One solution I could go with, but didn't choose one:

  • in the implementation save user provided callback for fileCallback
  • add own callback for fileCallback that would call user provided fileCallback + that will inform rawCallback, that file with new id is being written to (this also seems simplistic, as fileCallback only has filePath information without id)
  • use RawCallback implementation to perform extraction
  • on extraction finished restore user provided fileCallback

I added another option:

  • add one new user API to extract using FileAwareExtraction interface, that was introduced
struct FileAwareExtraction{
    virtual bool write( const byte_t* dataStart, std::size_t dataSize) = 0;
    virtual void onNewFile(std::uint32_t index, tstring fileName) = 0; // note: in final implementation I also want to add size hint for the file - if size information is available it will be provided to the user - this way user can preallocate buffers as needed.

protected:
    ~FileAwareExtraction() = default;
};
  • the interface consists of write callback to actually store extracted data and onNewFile callback for information when extraction of new file takes place (with id and fileName)
  • it uses slightly modified copy of RawDataExtractCallback (will unify if idea is approved)
  • the cllback calls interface functions when necessary

Motivation and Context

The change allows user to simply define their own extraction strategy:

  • by providing preallocated buffers (e.g.

How Has This Been Tested?

very initial testing using unit tests. More thorough tests will be added when idea is accepted.
unit tests were launched locally on windows 11 machine

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.


struct FileAwareExtraction{
virtual bool write( const byte_t* dataStart, std::size_t dataSize) = 0;
virtual void onNewFile(std::uint32_t index, tstring fileName) = 0;
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo add fileSizeHint

Suggested change
virtual void onNewFile(std::uint32_t index, tstring fileName) = 0;
virtual void onNewFile(std::uint32_t index, tstring fileName, std::size_t* fileSizeHint) = 0;

* @param inArchive the input archive to be extracted.
* @param indicesWithBuffers the map with indices of files and preallocated buffers.
*/
void extract( Input inArchive, const std::map< std::uint32_t, BitView< byte_t > > & indicesWithBuffers ) const {
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: what should I do with it?

This function is fully implementable using the other API with FileAwareExtraction that was introduced, so if this class is meant to be an API, then there is little reason to move to BitInputArchive.

On the other hand, however user may want to use BitInputArchive directly, so I guess it makes sense to implement it there and here only use it - and I think I would lean towards this option (and I imlpemented this very roughly as a sketch - in fact it's not even tested).

so this class responsibility would be only to perform proper overload on the inArchive parameter

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it makes sense to implement it there and here only use it - and I think I would lean towards this option (and I imlpemented this very roughly as a sketch - in fact it's not even tested).

so this class responsibility would be only to perform proper overload on the inArchive parameter

Yeah, as you said, I think it should be implemented in BitInputArchive, and then use the proper overload here.
It is more consistent with how other BitExtractor functions are implemented.

using IndicesArray = std::array< std::uint32_t, N >;

// NOLINTBEGIN(*-explicit-conversions, *-avoid-c-arrays, *-pro-bounds-pointer-arithmetic)
class BitIndicesView final {
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

before we were talking about ByteSpan, but I noticed this class, which is exactly what was needed, except for the generalisation, so I extracted the general part.

I also named it (in line with what is here) BitView

but of course I can name it ByteSpan as we previously spoke

Copy link
Copy Markdown
Owner

@rikyoz rikyoz Sep 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't call it a BitView.
The main reason is that it doesn't align with the C++20 naming convention.
As far as I know, std::span was originally named std::array_view in early versions of the C++20 standard.
One reason for the name change was that the "view" suffix could be confusing. In std::string_view, for example, the "view" suffix means that the object is a read-only "view" of a string. However, std::array_view was meant to allow changes to the array's content. Therefore, it was renamed std::span to distinguish between a "read-only view" and a "span".

In the case of BitIndicesView, it is a read-only view of an array or vector of indices, so calling it "view" makes sense.
I continued using the "Bit" prefix for naming consistency with the other classes in the public API.

Calling it ByteSpan could also be confusing because it is now a generic span class.
It can be a span over any type, not just bytes (although, technically we are always talking about bytes in memory).

Perhaps we could simply call it Span.

In the early days of the project, I chose the "Bit" prefix because I came from the Qt world, where the "Q" prefix is common for every class, and I liked the idea of having a common prefix.
It was also a way to distinguish bit7z's classes from others'.
Over time, however, I started to dislike this decision because there are namespaces for this kind of thing.
Still, I kept using it for backward compatibility and naming consistency.

I think new classes can drop the prefix if doing so results in a less confusing name. BitSpan might be just as confusing as ByteSpan.

Also, we could add a type alias like this to reduce the verbosity when passing spans over buffers:

using ByteSpan = Span< byte_t >;

Let me know what you think about it!

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally agree. I just used the old name from previous PR as I noted simillar class, but I totally agree with Span naming 👍

I will add ByteSpan alias as well 👍


namespace bit7z {

RawDataExtractCallbackWid::RawDataExtractCallbackWid( const BitInputArchive& inputArchive, CallbackType& callback )
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo

once you approve of idea I will unify it with RawDataExtractCallback class

}
}

TEMPLATE_TEST_CASE( "BitInputArchive: Extract to callback",
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me know:

  • should I use multiple_items as a test? I see there are more items and those are nested, so I think it might be a better candidate for the test actually.

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, multiple_items is a better candidate here, as we can check if we handle folders correctly.

#else
const auto testArchive = GENERATE( as< TestInputFormat >(),
TestInputFormat{ "7z", BitFormat::SevenZip },
// TestInputFormat{ "iso", BitFormat::Iso }, iso format has reverse ids of the elements? todo
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't really dug into that, but at first look it seemed to me as if iso had different indices for the files, but I would need to double check that, so you can skip this artifact for now.

I will polish the test once the idea is approved

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem. Yeah, there are little nuances in every format (and also in every version of 7-Zip/p7zip) that make testing a real pain!

@dawidpilarski dawidpilarski marked this pull request as draft September 18, 2025 21:45
@rikyoz rikyoz self-requested a review September 21, 2025 20:10
@rikyoz rikyoz self-assigned this Sep 21, 2025
Copy link
Copy Markdown
Owner

@rikyoz rikyoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still need to evaluate whether the "FileAwareExtraction" interface is the best approach.
In any case, some changes need to be implemented, particularly due to the generalization of the BitIndicesView class into a span.

* @param inArchive the input archive to be extracted.
* @param indicesWithBuffers the map with indices of files and preallocated buffers.
*/
void extract( Input inArchive, const std::map< std::uint32_t, BitView< byte_t > > & indicesWithBuffers ) const {
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it makes sense to implement it there and here only use it - and I think I would lean towards this option (and I imlpemented this very roughly as a sketch - in fact it's not even tested).

so this class responsibility would be only to perform proper overload on the inArchive parameter

Yeah, as you said, I think it should be implemented in BitInputArchive, and then use the proper overload here.
It is more consistent with how other BitExtractor functions are implemented.

using IndicesArray = std::array< std::uint32_t, N >;

// NOLINTBEGIN(*-explicit-conversions, *-avoid-c-arrays, *-pro-bounds-pointer-arithmetic)
class BitIndicesView final {
Copy link
Copy Markdown
Owner

@rikyoz rikyoz Sep 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't call it a BitView.
The main reason is that it doesn't align with the C++20 naming convention.
As far as I know, std::span was originally named std::array_view in early versions of the C++20 standard.
One reason for the name change was that the "view" suffix could be confusing. In std::string_view, for example, the "view" suffix means that the object is a read-only "view" of a string. However, std::array_view was meant to allow changes to the array's content. Therefore, it was renamed std::span to distinguish between a "read-only view" and a "span".

In the case of BitIndicesView, it is a read-only view of an array or vector of indices, so calling it "view" makes sense.
I continued using the "Bit" prefix for naming consistency with the other classes in the public API.

Calling it ByteSpan could also be confusing because it is now a generic span class.
It can be a span over any type, not just bytes (although, technically we are always talking about bytes in memory).

Perhaps we could simply call it Span.

In the early days of the project, I chose the "Bit" prefix because I came from the Qt world, where the "Q" prefix is common for every class, and I liked the idea of having a common prefix.
It was also a way to distinguish bit7z's classes from others'.
Over time, however, I started to dislike this decision because there are namespaces for this kind of thing.
Still, I kept using it for backward compatibility and naming consistency.

I think new classes can drop the prefix if doing so results in a less confusing name. BitSpan might be just as confusing as ByteSpan.

Also, we could add a type alias like this to reduce the verbosity when passing spans over buffers:

using ByteSpan = Span< byte_t >;

Let me know what you think about it!

Comment thread include/bit7z/bitview.hpp
public:
using element_type = T; // T in std::span<T>.
using value_type = typename std::remove_cv< element_type >::type;
using size_type = std::uint32_t; // 7-Zip uses 32-bits for the size of the indices array.
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a potential issue here due to the generalization of BitIndicesView into a span class.

7-Zip uses 32-bit integers for the size of the indices array. For this reason, I made BitIndicesView use std::uint32_t for its size.
However, generalizing this class to a span class makes it incompatible with large buffers (≥ ~4 GB), unlike std::span, which uses a 64-bit integer (std::size_t) on 64-bit platforms.

So we likely need to define size_type as std::size_t, and then static cast the numItems variable in the BitInputArchive::extractArchive method from std::size_t to std::uint32_t (for backward compatibility).

const auto numItems = indices.empty() ? std::numeric_limits< std::uint32_t >::max() : indices.size();

Comment thread include/bit7z/bitview.hpp

constexpr BitView( const_pointer bufferStart, std::size_t size ) noexcept
: mStart{ size == 0 ? nullptr : bufferStart },
mSize{ static_cast< size_type >( size ) } {}
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With size_type = std::size_t, this cast is not needed anymore.


std::vector<std::uint32_t> indices;
indices.resize(indicesWithBuffers.size());
for (const auto& [key, value] : indicesWithBuffers) {
Copy link
Copy Markdown
Owner

@rikyoz rikyoz Sep 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will likely need to be changed, as bit7z v4.x must still compile using the C++14 standard (for keeping old compiler support).
Unfortunately, structured bindings are available from C++17.

Yeah, I can't wait to drop C++14 support and make bit7z a full C++17 library 😞.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apologies. As of today I barely remember what is C++14 😅 it's just muscle memory typing for range loops :)

*/
void extract( Input inArchive, const std::map< std::uint32_t, BitView< byte_t > > & indicesWithBuffers ) const {
BitInputArchive inputArchive( *this, inArchive );
struct ExtractionCallback : FileAwareExtraction {
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main problem I have with the ExtractionCallback/FileAwareExtraction approach is that it differs greatly from the other bit7z callbacks. 7-Zip uses interfaces and classes for its callbacks, whereas bit7z uses std::functions.

Don't get me wrong, though. I'm not completely opposed to this idea; I'm just wondering if there's another way to implement the same functionality.

From the PR description, I can't understand why changing the RawDataCallback signature by passing an additional parameter with the index of the extracted file was problematic.
I think it could be used to detect when a new file is being extracted, unless I'm missing something.

Another possible approach might be similar to BufferCallback: provide a callback function that, given the index and the path (and size?) of the current item, returns a ByteSpan to the buffer that will contain the extracted item.
This would simplify the implementation of the std::map function, as it would use a simple lambda like this:

auto callback = [&indicesWithBuffers](std::uint32_t index, const tstring& path, std::size_t size) -> ByteSpan {
    return indicesWithBuffers.at(index);
};

But I didn't have the chance to test this, maybe I didn't take into account some drawbacks.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking on few variants of the interface as well, but wasn't really sure of any of them, but in my opinion at the end it made most sense. Initially the interface was even more complex, but all in all here's my thinking:

goal:

I was aiming for one interface, that may not be the easiest, but would definitely be most flexible.
Reason is, that I think there is no point on endlessly expanding list of extract/extractTo/extractMatching set of functions to fit every use case - I just don't think it's possible to predict every single use case library clients may have. Hence it would be good to have some kind of extraction mechanism, that users can tune to fit their needs.

(probably I already mentioned, some of those cases are memory mapped files, or files created in temporary manner with mkstemp, or combination of both)

auto callback = [&indicesWithBuffers](std::uint32_t index, const tstring& path, std::size_t size) -> ByteSpan {
    return indicesWithBuffers.at(index);
};

Here my doubts resolve around availability of the size information. As far as I recall, size is not always available information and it's not there in the streaming archives like bzip2. Hence the callback may not work well for such cases.

If on the other hand we skip the size - user may not be able to create a buffer on demand - especially that we are talking about the ByteSpan, which is not resizeable.

Then we could also have something along the lines of:

auto callback = [](std::uint32_t index, span<byte_t> inputBuffer){
    buffers[index].write(inputBuffer);
}

which is basically RawDataCallback enriched with index parameter, and that kind of callback could work, but my doubts resolve around:

  • does the user know, that index will be changed sequentially - that is, sequence of call with index 1, 2, 1 is not possible, it will always be 1, 1, 2 ... etc.
  • if user needs some specific logic after file extraction (for example commiting the file, fflush ing it etc.) would it be easier for him to track id, or have a separate function called (I guess the second option seems easier).
  • then also comes a question of combining that information together with file path information etc. which needs to be done on a separate FileCallback, which makes user two separate callbacks communicate with each other.

And above made me think, that interface like this - where generally I can relatively easy control extraction of elements (write()) and I get information about the elements at the same time (onNewFile) may really be valueable.

Copy link
Copy Markdown
Owner

@rikyoz rikyoz Sep 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reason is, that I think there is no point on endlessly expanding list of extract/extractTo/extractMatching set of functions to fit every use case

I agree. That's why I introduced the concept of BufferCallback. It allows us to use std::unordered_map instead of std::map to store and retrieve buffers, among other things.

Of course, there are many other ways to make these methods more general. However, this was a good compromise, at least for the time being, until the API is revised for version 5.

I just don't think it's possible to predict every single use case library clients may have. Hence it would be good to have some kind of extraction mechanism, that users can tune to fit their needs.

I agree!

Here my doubts resolve around availability of the size information. As far as I recall, size is not always available information and it's not there in the streaming archives like bzip2. Hence the callback may not work well for such cases.

Yes, we could allow a zero size if it's unavailable.
However, I'm not particularly fond of the idea.

If on the other hand we skip the size - user may not be able to create a buffer on demand - especially that we are talking about the ByteSpan, which is not resizeable.

Yeah, but that would be problematic with any approach you use. In this specific case, you either have external knowledge of how big the buffer should be or you can't do much except use dynamic buffers instead of fixed-size ones.

does the user know, that index will be changed sequentially - that is, sequence of call with index 1, 2, 1 is not possible, it will always be 1, 1, 2 ... etc.

Probably not, I don't think we can assume any particular order.
7-Zip can behave differently based on the archive format being used.

if user needs some specific logic after file extraction (for example commiting the file, fflush ing it etc.) would it be easier for him to track id, or have a separate function called (I guess the second option seems easier).

The second option might be easier/cleaner on the user side, yeah.

On bit7z side, I think it's better to reduce the number of callbacks.

then also comes a question of combining that information together with file path information etc. which needs to be done on a separate FileCallback, which makes user two separate callbacks communicate with each other.

One problem with using RawDataCallback is that it is called for every write request during file extraction. This creates the need for a second callback to indicate when extraction begins, as well as a way for the two callbacks to communicate, whether via lambda captures or class member variables.
Also, if you're using std::map, you need to store the reference to the current value to avoid performing search on the map on every write.

The BufferCallback/ByteSpanCallback approach, on the other hand, doesn't necessarily need a second callback, as the callback is called only once before extracting a file. For example:

BitArchiveReader info( lib, inputArchive, testArchive.format );

std::map< std::uint32_t, buffer_t > buffers{};
auto callback = [&info, &buffers]( std::uint32_t index, const tstring& /*path*/ ) -> ByteSpan {
    const auto& item = info.itemAt( index );

    // Allocation can happen here while creating the entry in the map.
    // You can use try_emplace in C++17.
    const auto buffer = buffers.emplace( index, buffer_t( item.size(), 0 ) );

    // Returning a ByteSpan over the just allocated buffer (no need to search it again in the map).
    return buffer.first->second;
};
info.extractTo(std::move(callback));

You can also handle specific logic after file extraction inside the same ByteSpanCallback (after the extractTo method for the last file).

However, if you prefer to allocate in a separate function, we could use a FilterCallback. This is supported by the ExtractCallback class and is called before extracting each file in the archive.
As I said, using two callbacks inevitably involves communication between them, but it's not much different than communicating via member variables in a class, as the interface approach requires.

And above made me think, that interface like this - where generally I can relatively easy control extraction of elements (write()) and I get information about the elements at the same time (onNewFile) may really be valuable.

I have a few concerns about the interface approach:

  • It is a bit less flexible than you might think. You're forcing the user to create a class to implement the interface, even for basic use cases.
    • Using std::function callbacks might be more flexible because they can be any callable, such as lambdas or class methods.
  • As I said, this approach differs from all the other callbacks provided by bit7z, so the library would lose consistency in the API.
    • This is not a significant drawback, however, if this approach proves superior to others.

I ran some tests on the ByteSpanCallback approach I was considering, and it has some interesting advantages:

  • It allows you to control where to extract the elements and get information about them (if the user needs it).
    • Of course, to get information, the callback needs access to the archive reader object from outside. This could be avoided if the callback passed the current element object as a parameter:
auto callback = [&buffers]( const BitArchiveItem& item ) -> ByteSpan {
    // As before, but without the call to info.itemAt(),
    // which would be done by bit7z before calling this callback...
};
  • It generalizes all the BitInputArchive methods that output to fixed-size buffers. This may allow us to deprecate some of these methods in the future.
  • There is no need for an additional ExtractCallback implementation. We can simply change FixedBufferExtractCallback to store a ByteSpanCallback instead of the buffer pointer and its size.
  • It requires minimal changes to Span to make it closer to std::span.
  • We can also use CFixedBufferOutStream and modify it to store a ByteSpan object instead of the buffer pointer and its size.
    • This class already handles all writes to the ByteSpan, so users don't need to reinvent the wheel by tinkering with std::copy_n.

One possible downside to this approach is that, if BufferCallback and ByteSpanCallback have the same parameters, a function that returns a buffer_t could be ambiguous (as it is implicitly convertible to ByteSpan). This means it could be either a BufferCallback or a ByteSpanCallback. Therefore, the user must specify which it is.

Additionally, it introduces some overhead when handling the simple case of extracting a single file to a fixed-size buffer (because a lambda must now be created to return the ByteSpan over the buffer).

However, I believe these are minor issues that can be addressed in the future.

What do you think of this approach? Is it general enough? Does it suit your use cases?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for late reply.

The problem with your proposal, that I have is that it will not work for every kind of archive - for those, that we cannot programatically preallocate memory, because the size of file after extraction is unknown.

I don't think it would be a nice experience if some functions will work for certain types of archives and other functions for others - leaving the user with resposibility to pick the right one. Oftentimes, the user will want to extract regardless of type of archive (of course opposite cases will also exist).

It is a bit less flexible than you might think. You're forcing the user to create a class to implement the interface, even for basic use cases.
Using std::function callbacks might be more flexible because they can be any callable, such as lambdas or class methods.

I don't think what you are mentioning is flexibility issue at all - rather easiness to use issue - as there is nothing you can achieve with lambda that you couldn't with a hand made type.

It's often simpler to just create ad-hoc lambda than creating a new type manually. And for simple cases - there are other functions that extract to vectors or extract to directory - so I do not want to force users to go through this heavy API for simplest cases - only users that have specific needs could use this API to give them flexibility in terms of how library should write the extracted data.

In any case. Let me propose some middle ground, where we could still have lambdas. Something of the following pattern:

// Abstraction wrapper over different kinds of writing mechanism
// you can write to preallocated buffer with span constructor,
// or write and expand buffer as extraction progresses with Buffer ctor
// or give custom write callback with function ctor
struct Writable{
  Writable(ByteSpan);
  Writable(Buffer&); 
  Writable(std::function<void(byte_t*, size_t)>);
// some definitions of public api like Write();
}; 

auto callback = [&buffers]( const BitArchiveItem& item ) -> Writable {
    if( item.size() == 0 ){
        auto resizableBuffer= //create resizable buffer
        return resizableBuffer;
    } else {
       auto span = add_buffer(buffers, item.size());
       return span;
    }
};

Copy link
Copy Markdown
Owner

@rikyoz rikyoz Oct 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about this late reply as well!

The problem with your proposal, that I have is that it will not work for every kind of archive - for those, that we cannot programatically preallocate memory, because the size of file after extraction is unknown.

Yes, but I'm not sure if there's a generic way to handle both cases with only one callback function.
Take for example your callback proposal that returns a Writable object: you cannot create the resizable buffer inside the callback function, as the returned Writable object would store a dangling reference/pointer to the local buffer.
The user can take the buffer from outside using captures, of course.
But this is error prone, as there is not safety guarantee that prevents from passing a local variable to Writable's constructor, it might not be detected by the compiler or static analyzer.
On the other hand, the current signature of BufferCallback returns a reference to an existing buffer.
Of course, being C++, you can still shoot yourself in the foot and return a local object.
But any decent compiler/static analyzer will warn you that you're returning a reference to a local object.

I don't think it would be a nice experience if some functions will work for certain types of archives and other functions for others - leaving the user with resposibility to pick the right one. Oftentimes, the user will want to extract regardless of type of archive (of course opposite cases will also exist).

I agree.

I don't think what you are mentioning is flexibility issue at all - rather easiness to use issue

With "flexibility" I was referring to "expressiveness" flexibility.
With std::functions, you can use any approach you want: not just lambdas, but also simple functions, and class member functions.

  • as there is nothing you can achieve with lambda that you couldn't with a hand made type.

Yes, both approaches have the same capabilities, of course.
But in the end, I think it's more the other way around: the class approach forces one single way of writing code, while the std::function approach doesn't.
Don't get me wrong though, I'm open to use the interface approach, I'm just evaluating every possible API for this feature and trying to understand which is the best compromise in terms of ease-of-use, expressiveness, and performance.


There are several points to address.

First, users should not have to manually write the code that handles the writes to the buffer.

  • In most cases, users simply want to recreate the extracted file in a buffer, which always involves the same copy operations.

    • The only difference is whether the output buffer is dynamically resizable.
    • The recreation of the file in the buffer should be handled by the already existing output stream classes of bit7z.
    • As a side note, while re-reading bit7z's code for replying to this issue, I noticed important improvements that can be done in how bit7z supports extraction to dynamic buffers.
      In particular, the library currently doesn't call reserve on the buffer when the size of the extracted file is known.
      This should provide a significant performance boost in many cases. I'm surprised I didn't think of it before.
  • The remaining cases are when the user doesn't want to recreate the extracted file in a buffer (e.g., when computing the file's hash).
    This case is already covered by the RawDataCallback and doesn't require a buffer.

    • A generic approach could cover all cases, but for now, I will focus on the most common case: recreating the extracted file in memory.

Secondly, users shouldn't have to manually check whether to create a dynamic or preallocated buffer.
This can be handled internally by bit7z because relying on the user to perform the check is not ideal.

As I said, the "one callback function for all" approach is problematic. An alternative solution would be to use two callback functions: one that returns a dynamic buffer when the size is unknown, and another that returns a ByteSpan for fixed-size buffers. Then the library would choose the callback function to call depending on the availability of the size information.
Now, I can imagine two approaches: an extraction function that takes the two callback std::functions, or an extraction function that takes a reference to one single object implementing an interface.

In other words:

void BitInputArchive::extractTo(const BufferCallback& dynamicBufferCb, const ByteSpanCallback& fixedBufferCb);

// ...

reader.extractTo(
    [&](const BitArchiveItem& item) -> buffer_t& {
        // return a reference to an existing dynamic buffer.
    },
    [&](const BitArchiveItem& item) -> ByteSpan {
        // return a byte span to an existing fixed buffer.
    }
);

or

struct WritableBufferCallback {
    virtual auto getDynamicBuffer(const BitArchiveItem& item) -> buffer_t& = 0;
    virtual auto getFixedBuffer(const BitArchiveItem& item) -> ByteSpan = 0;
};

struct WritableBufferCallbackImpl : WritableBufferCallback {
    // implementation
};

void BitInputArchive::extractTo(const WritableBufferCallback& cb);

// ...

reader.extractTo(WritableBufferCallbackImpl{});

On one hand, a part of me prefers the conciseness of the std::functions approach.
On the other hand, I find the class approach a bit more readable.
What do you think?

}
}

TEMPLATE_TEST_CASE( "BitInputArchive: Extract to callback",
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, multiple_items is a better candidate here, as we can check if we handle folders correctly.

#else
const auto testArchive = GENERATE( as< TestInputFormat >(),
TestInputFormat{ "7z", BitFormat::SevenZip },
// TestInputFormat{ "iso", BitFormat::Iso }, iso format has reverse ids of the elements? todo
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem. Yeah, there are little nuances in every format (and also in every version of 7-Zip/p7zip) that make testing a real pain!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants