v2.0.0 #36

Open
wants to merge 4 commits into
base: master
74 changes: 50 additions & 24 deletions README.md
@@ -1,4 +1,6 @@
Gzip C++ lib for gzip compression and decompression. Extracted from [mapnik-vector-tile](https://github.com/mapbox/mapnik-vector-tile) for light-weight modularity.
Gzip C++ lib for gzip compression and decompression.

This library is designed for **non-streaming** gzip decompression and compression using C++ strings.

[![Build Status](https://travis-ci.org/mapbox/gzip-hpp.svg?branch=master)](https://travis-ci.com/mapbox/gzip-hpp) [![hpp-skel badge](https://mapbox.s3.amazonaws.com/cpp-assets/hpp-skel-badge_blue.svg)](https://github.com/mapbox/hpp-skel)

@@ -14,45 +16,69 @@ Gzip C++ lib for gzip compression and decompression. Extracted from [mapnik-vect
// All function calls must pass in a pointer of an
// immutable character sequence (aka a string in C) and its size
std::string data = "hello";
const char * pointer = data.data();
std::size_t size = data.size();

// Check if compressed. Can check both gzip and zlib.
bool c = gzip::is_compressed(pointer, size); // false
bool c = gzip::is_compressed(data); // false

// Compress returns a std::string
std::string compressed_data = gzip::compress(pointer, size);
std::string compressed_data = gzip::compress(data);

// Decompress returns a std::string and decodes both zlib and gzip
const char * compressed_pointer = compressed_data.data();
std::string decompressed_data = gzip::decompress(compressed_pointer, compressed_data.size());
std::string decompressed_data = gzip::decompress(compressed_data);

// Can also compress or decompress from a const char * and size
std::string compressed_data = gzip::compress(data.data(), data.size());
std::string decompressed_data = gzip::decompress(compressed_data.data(), compressed_data.size());

// You can also pass in an existing string to receive the result of compression or
// decompression. In both cases the result is appended to the end of that string.
std::string compressed_data;
gzip::compress(data, compressed_data);
std::string decompressed_data;
gzip::decompress(compressed_data, decompressed_data);

// This also works using pointers and sizes
std::string compressed_data;
gzip::compress(data.data(), data.size(), compressed_data);
std::string decompressed_data;
gzip::decompress(compressed_data.data(), compressed_data.size(), decompressed_data);
```
#### Compress

// Or like so
std::string compressed_data = gzip::compress(tile->data(), tile->data.size());
All forms of `compress` shown above support the following optional arguments:
* Compression Level
* Buffering Size

// Or like so
std::string value = gzip::compress(node::Buffer::Data(obj), node::Buffer::Length(obj));
Compression level is a number from 0 to 9, based on the zlib compression levels. Higher levels spend more time
compressing and may produce smaller output. The default is the zlib library's default level,
`Z_DEFAULT_COMPRESSION`.

// Or...etc
Buffering size controls the amount of memory allocated as a buffer during compression with zlib. By default the buffer
is 75% of the size of the data provided, plus 16 bytes. Passing `0` as the buffer size (the default) selects this calculation.

```
#### Compress
```c++
// Optionally include compression level
std::size_t size; // No default value, but what happens when not passed??
int level = Z_DEFAULT_COMPRESSION; // Z_DEFAULT_COMPRESSION is the default if no arg is passed
const int level = Z_DEFAULT_COMPRESSION;
const std::size_t buffer_size = 1024; // 1 KB

std::string compressed_data = gzip::compress(tile->data(), size, level);
std::string compressed_data = gzip::compress(data, level, buffer_size);
```
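
The default buffer size described above corresponds to the `buffering_size == 0` branch of `compress()` in `include/gzip/compress.hpp` in this pull request. A minimal sketch of that calculation follows; `effective_buffer_size` is a hypothetical helper name used only for illustration and is not part of the library.

```c++
#include <cstddef>

// Hypothetical helper (not part of gzip-hpp): mirrors the default that
// compress() in this PR applies when the caller passes a buffering size of 0.
inline std::size_t effective_buffer_size(std::size_t input_size,
                                         std::size_t buffering_size = 0)
{
    if (buffering_size == 0)
    {
        // 75% of the input size plus 16 bytes of headroom
        return (input_size * 3 / 4) + 16;
    }
    // Any non-zero value is used as given
    return buffering_size;
}
```

For a 1000-byte input this gives a 766-byte buffer, while an explicit value such as 1024 is returned unchanged.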

#### Decompress
```c++
// No args other than the std:string
std::string data = "hello";
std::string compressed_data = gzip::compress(data);
const char * compressed_pointer = compressed_data.data();

std::string decompressed_data = gzip::decompress(compressed_pointer, compressed_data.size());
All forms of `decompress` shown above support the following optional arguments:
* Maximum Uncompressed Size
* Buffering Size

Maximum uncompressed size limits the total amount of memory that may be used during decompression. It is provided to prevent heavily
compressed malicious input from exhausting memory in an application. By default this value is `0`, which disables this protection.

Buffering size controls the amount of memory allocated as a buffer during decompression with zlib. By default the buffer
is 150% of the size of the compressed data provided. Passing `0` as the buffer size (the default) selects this 150% calculation.

```c++
const std::size_t max_decompressed_size = 1024 * 1024 * 1024; // 1 GB
const std::size_t buffer_size = 1024; // 1 KB
std::string decompressed_data = gzip::decompress(compressed_data, max_decompressed_size, buffer_size);
```
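
One way such a limit can be enforced is to check the running output size before each chunk is appended. The sketch below only illustrates that idea. `append_with_limit` is a hypothetical helper and is not taken from `decompress.hpp`, which is not shown here. As in the description above, a limit of `0` disables the check.

```c++
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (an assumption, not the library's implementation):
// appends a decompressed chunk to `output` unless doing so would exceed
// `max_size`. A `max_size` of 0 means "no limit".
inline void append_with_limit(std::string& output,
                              const char* chunk,
                              std::size_t chunk_size,
                              std::size_t max_size)
{
    // Written as a subtraction so the size check itself cannot overflow
    if (max_size != 0 &&
        (output.size() > max_size || chunk_size > max_size - output.size()))
    {
        throw std::runtime_error("decompressed data would exceed the maximum allowed size");
    }
    output.append(chunk, chunk_size);
}
```

Checking before the append means the output string is left untouched when the limit is hit.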

## Test
52 changes: 23 additions & 29 deletions bench/run.cpp
@@ -42,69 +42,63 @@ static void BM_decompress(benchmark::State& state) // NOLINT google-runtime-refe

BENCHMARK(BM_decompress);

static void BM_compress_class(benchmark::State& state) // NOLINT google-runtime-references
static void BM_compress_string(benchmark::State& state) // NOLINT google-runtime-references
{
std::string buffer = open_file("./bench/14-4685-6265.mvt");
gzip::Compressor comp;

for (auto _ : state)
{
std::string output;
comp.compress(output, buffer.data(), buffer.size());
std::string value = gzip::compress(buffer);
benchmark::DoNotOptimize(value.data());
}
}

BENCHMARK(BM_compress_class);
BENCHMARK(BM_compress_string);

static void BM_compress_class_no_reallocations(benchmark::State& state) // NOLINT google-runtime-references
static void BM_decompress_string(benchmark::State& state) // NOLINT google-runtime-references
{
std::string buffer = open_file("./bench/14-4685-6265.mvt");
gzip::Compressor comp;
std::string output;
// Run once prior to pre-allocate
comp.compress(output, buffer.data(), buffer.size());

std::string buffer_uncompressed = open_file("./bench/14-4685-6265.mvt");
std::string buffer = gzip::compress(buffer_uncompressed.data(), buffer_uncompressed.size());
for (auto _ : state)
{
comp.compress(output, buffer.data(), buffer.size());
std::string value = gzip::decompress(buffer);
benchmark::DoNotOptimize(value.data());
benchmark::ClobberMemory();
}
}

BENCHMARK(BM_compress_class_no_reallocations);
BENCHMARK(BM_decompress_string);

static void BM_decompress_class(benchmark::State& state) // NOLINT google-runtime-references
static void BM_compress_modify_string(benchmark::State& state) // NOLINT google-runtime-references
{

std::string buffer_uncompressed = open_file("./bench/14-4685-6265.mvt");
std::string buffer = gzip::compress(buffer_uncompressed.data(), buffer_uncompressed.size());
gzip::Decompressor decomp;
std::string buffer = open_file("./bench/14-4685-6265.mvt");

for (auto _ : state)
{
std::string output;
decomp.decompress(output, buffer.data(), buffer.size());
gzip::compress(buffer, output);
benchmark::DoNotOptimize(output.data());
benchmark::ClobberMemory();
}
}

BENCHMARK(BM_decompress_class);
BENCHMARK(BM_compress_modify_string);

static void BM_decompress_class_no_reallocations(benchmark::State& state) // NOLINT google-runtime-references
static void BM_decompress_modify_string(benchmark::State& state) // NOLINT google-runtime-references
{

std::string buffer_uncompressed = open_file("./bench/14-4685-6265.mvt");
std::string buffer = gzip::compress(buffer_uncompressed.data(), buffer_uncompressed.size());
gzip::Decompressor decomp;
std::string output;
// Run once prior to pre-allocate
decomp.decompress(output, buffer.data(), buffer.size());

for (auto _ : state)
{
decomp.decompress(output, buffer.data(), buffer.size());
std::string output;
gzip::decompress(buffer, output);
benchmark::DoNotOptimize(output.data());
benchmark::ClobberMemory();
}
}

BENCHMARK(BM_decompress_class_no_reallocations);
BENCHMARK(BM_decompress_modify_string);

#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wpedantic"
145 changes: 73 additions & 72 deletions include/gzip/compress.hpp
@@ -10,103 +10,104 @@

namespace gzip {

class Compressor
inline void compress(const char* data,
std::size_t size,
std::string& output,
int level = Z_DEFAULT_COMPRESSION,
std::size_t buffering_size = 0)
Review comment:

All the different overloads of the compress() function are somewhat confusing and I wonder if they are all correct and unique. The compiler will happily convert between int and size_t and the last parameters have a default value, so I fear in some situation the wrong overload could be chosen or the compiler would detect an ambiguity making the library hard to use. The same applies to the decompress() function below. The original class-based approach had the advantage at least that the configuration parameters were set at construction of the Compressor class.
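
To make this concern concrete, here is a small self-contained sketch. `which_overload` is a hypothetical stand-in that copies only the parameter lists of the two value-returning `compress()` overloads in this diff. It performs no compression and simply reports which overload was selected. The default level is written as `-1`, the value of zlib's `Z_DEFAULT_COMPRESSION`, so the sketch does not need zlib.

```c++
#include <cstddef>
#include <string>

// Hypothetical stand-in for compress(const char*, std::size_t, int, std::size_t)
inline std::string which_overload(const char* /*data*/,
                                  std::size_t /*size*/,
                                  int /*level*/ = -1,
                                  std::size_t /*buffering_size*/ = 0)
{
    return "pointer+size";
}

// Hypothetical stand-in for compress(std::string const&, int, std::size_t)
inline std::string which_overload(std::string const& /*input*/,
                                  int level = -1,
                                  std::size_t /*buffering_size*/ = 0)
{
    return "string, level=" + std::to_string(level);
}

// which_overload(text.data(), text.size()) selects the pointer+size overload.
// which_overload(text, 1024) selects the string overload, and 1024 is taken
// as the compression level, not as a buffer size.
// which_overload("hello", 5) does not compile: the literal prefers the
// pointer overload, the int prefers the string overload, so the call is ambiguous.
```

The second case is the silent one: a caller who meant to pass a buffer size gets an out-of-range compression level instead, with no diagnostic from the compiler.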

{
std::size_t max_;
int level_;

public:
Compressor(int level = Z_DEFAULT_COMPRESSION,
std::size_t max_bytes = 2000000000) // by default refuse operation if uncompressed data is > 2GB
: max_(max_bytes),
level_(level)
if (buffering_size == 0)
{
buffering_size = (size * 3 / 4) + 16;
}

template <typename InputType>
void compress(InputType& output,
const char* data,
std::size_t size) const
{

#ifdef DEBUG
// Verify if size input will fit into unsigned int, type used for zlib's avail_in
if (size > std::numeric_limits<unsigned int>::max())
{
throw std::runtime_error("size arg is too large to fit into unsigned int type");
}
#endif
if (size > max_)
{
throw std::runtime_error("size may use more memory than intended when decompressing");
}

z_stream deflate_s;
deflate_s.zalloc = Z_NULL;
deflate_s.zfree = Z_NULL;
deflate_s.opaque = Z_NULL;
deflate_s.avail_in = 0;
deflate_s.next_in = Z_NULL;

// The windowBits parameter is the base two logarithm of the window size (the size of the history buffer).
// It should be in the range 8..15 for this version of the library.
// Larger values of this parameter result in better compression at the expense of memory usage.
// This range of values also changes the decoding type:
// -8 to -15 for raw deflate
// 8 to 15 for zlib
// (8 to 15) + 16 for gzip
// (8 to 15) + 32 to automatically detect gzip/zlib header (decompression/inflate only)
constexpr int window_bits = 15 + 16; // gzip with windowbits of 15

constexpr int mem_level = 8;
// The memory requirements for deflate are (in bytes):
// (1 << (window_bits+2)) + (1 << (mem_level+9))
// with a default value of 8 for mem_level and our window_bits of 15
// this is 128Kb
z_stream deflate_s;
deflate_s.zalloc = Z_NULL;
deflate_s.zfree = Z_NULL;
deflate_s.opaque = Z_NULL;
deflate_s.avail_in = 0;
deflate_s.next_in = Z_NULL;

// The windowBits parameter is the base two logarithm of the window size (the size of the history buffer).
// It should be in the range 8..15 for this version of the library.
// Larger values of this parameter result in better compression at the expense of memory usage.
// This range of values also changes the decoding type:
// -8 to -15 for raw deflate
// 8 to 15 for zlib
// (8 to 15) + 16 for gzip
// (8 to 15) + 32 to automatically detect gzip/zlib header (decompression/inflate only)
constexpr int window_bits = 15 + 16; // gzip with windowbits of 15

constexpr int mem_level = 8;
// The memory requirements for deflate are (in bytes):
// (1 << (window_bits+2)) + (1 << (mem_level+9))
// with a default value of 8 for mem_level and our window_bits of 15
// this is 256 KB (128 KB for the window plus 128 KB for mem_level)

#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wold-style-cast"
if (deflateInit2(&deflate_s, level_, Z_DEFLATED, window_bits, mem_level, Z_DEFAULT_STRATEGY) != Z_OK)
{
throw std::runtime_error("deflate init failed");
}
if (deflateInit2(&deflate_s, level, Z_DEFLATED, window_bits, mem_level, Z_DEFAULT_STRATEGY) != Z_OK)
{
throw std::runtime_error("deflate init failed");
}
#pragma GCC diagnostic pop

std::string buffer;
Review comment:

Do we need the extra buffer here? Can't we put the result in output directly?

Review comment (Member, Author):


I used a buffer here rather than writing directly to output because we need to resize at the end, and if multiple loops are required we would be moving a much larger string rather than just reusing the existing buffer and appending it. In the old code this was done by using output.resize(...) followed by another output.resize(...) to shrink the size if necessary.

The problem with my solution is that it might allocate more memory than required if too large a buffer size is selected. I will continue to think about this.

Review comment:

Allocations/deallocations are expensive, copying the buffer probably less so. Using an extra buffer will give you more allocations. Not sure how the different sizes of allocations will impact the performance here. In the end you'd have to benchmark to decide this.
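
The two strategies discussed in this thread can be put side by side in a small sketch. This is only an illustration under stated assumptions: `fake_produce` is a hypothetical stand-in for `deflate()` that just copies bytes, and `via_scratch_buffer` and `via_resize` are illustrative names, not library functions. It shows that both approaches yield the same output. It does not measure allocations or speed.

```c++
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical stand-in for deflate(): copies up to `capacity` bytes of
// `src`, starting at `offset`, into `dst` and returns the number written.
inline std::size_t fake_produce(const std::string& src, std::size_t& offset,
                                char* dst, std::size_t capacity)
{
    const std::size_t n = std::min(capacity, src.size() - offset);
    std::copy_n(src.data() + offset, n, dst);
    offset += n;
    return n;
}

// Strategy from this PR: reuse one scratch buffer and append each chunk.
inline std::string via_scratch_buffer(const std::string& src, std::size_t chunk)
{
    if (chunk == 0) chunk = 16; // avoid an endless loop on a zero chunk size
    std::string output;
    std::string buffer(chunk, '\0');
    std::size_t offset = 0;
    std::size_t written = 0;
    do
    {
        written = fake_produce(src, offset, &buffer[0], chunk);
        output.append(buffer, 0, written);
    } while (written == chunk);
    return output;
}

// Strategy from the old code: grow output, write in place, shrink at the end.
inline std::string via_resize(const std::string& src, std::size_t chunk)
{
    if (chunk == 0) chunk = 16; // avoid an endless loop on a zero chunk size
    std::string output;
    std::size_t offset = 0;
    std::size_t total = 0;
    std::size_t written = 0;
    do
    {
        output.resize(total + chunk);
        written = fake_produce(src, offset, &output[total], chunk);
        total += written;
    } while (written == chunk);
    output.resize(total);
    return output;
}
```

Wrapping each function in a benchmark similar to those in `bench/run.cpp` would be one way to settle which variant allocates less in practice.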

do
{
constexpr std::size_t max_step = static_cast<std::size_t>(std::numeric_limits<unsigned int>::max());
const unsigned int step_size = size > max_step ? max_step : static_cast<unsigned int>(size);
size -= step_size;
const unsigned int buffer_size = buffering_size > step_size ? step_size : static_cast<unsigned int>(buffering_size);

deflate_s.next_in = reinterpret_cast<z_const Bytef*>(data);
deflate_s.avail_in = static_cast<unsigned int>(size);
data = data + step_size;
deflate_s.avail_in = step_size;

std::size_t size_compressed = 0;
buffer.resize(static_cast<std::size_t>(buffer_size));
do
{
size_t increase = size / 2 + 1024;
if (output.size() < (size_compressed + increase))
{
output.resize(size_compressed + increase);
}
// There is no way we see that "increase" would not fit in an unsigned int,
// hence we use static cast here to avoid -Wshorten-64-to-32 error
deflate_s.avail_out = static_cast<unsigned int>(increase);
deflate_s.next_out = reinterpret_cast<Bytef*>((&output[0] + size_compressed));
deflate_s.avail_out = buffer_size;
deflate_s.next_out = reinterpret_cast<Bytef*>(&buffer[0]);
// From http://www.zlib.net/zlib_how.html
// "deflate() has a return value that can indicate errors, yet we do not check it here.
// Why not? Well, it turns out that deflate() can do no wrong here."
// Basically only possible error is from deflateInit not working properly
deflate(&deflate_s, Z_FINISH);
size_compressed += (increase - deflate_s.avail_out);
output.append(buffer, 0, static_cast<std::size_t>(buffer_size - deflate_s.avail_out));
} while (deflate_s.avail_out == 0);

deflateEnd(&deflate_s);
output.resize(size_compressed);
} while (size > 0);
const int ret = deflateEnd(&deflate_s);
if (ret != Z_OK)
{
throw std::runtime_error("Unexpected gzip compression error, stream was inconsistent or freed prematurely");
}
};
}

inline void compress(std::string const& input,
std::string& output,
int level = Z_DEFAULT_COMPRESSION,
std::size_t buffering_size = 0)
{
compress(input.data(), input.size(), output, level, buffering_size);
}

inline std::string compress(const char* data,
std::size_t size,
int level = Z_DEFAULT_COMPRESSION)
int level = Z_DEFAULT_COMPRESSION,
std::size_t buffering_size = 0)
{
std::string output;
compress(data, size, output, level, buffering_size);
return output;
}

inline std::string compress(std::string const& input,
int level = Z_DEFAULT_COMPRESSION,
std::size_t buffering_size = 0)
{
Compressor comp(level);
std::string output;
comp.compress(output, data, size);
compress(input.data(), input.size(), output, level, buffering_size);
return output;
}
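
The outer `do`/`while` loop in the pointer-based `compress()` above splits the input into steps that fit into an `unsigned int`, the type zlib uses for `avail_in`. A minimal sketch of just that size arithmetic follows. `split_into_steps` is a hypothetical helper, and `max_step` is made a parameter only so the behaviour can be shown with small numbers.

```c++
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper (not part of gzip-hpp): returns the sequence of step
// sizes the loop would use for an input of `size` bytes.
inline std::vector<unsigned int> split_into_steps(
    std::size_t size,
    unsigned int max_step = std::numeric_limits<unsigned int>::max())
{
    if (max_step == 0)
    {
        max_step = 1; // avoid an endless loop on a zero step size
    }
    std::vector<unsigned int> steps;
    do
    {
        // Take as much as fits into one step, then move on to the remainder
        const unsigned int step = size > max_step ? max_step : static_cast<unsigned int>(size);
        size -= step;
        steps.push_back(step);
    } while (size > 0);
    return steps;
}
```

With `size = 10` and `max_step = 4` this yields steps of 4, 4 and 2. As in the loop above, an empty input still produces a single zero-length step.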
