zstd: Add experimental encoder dictionaries #281

Merged · 9 commits · Sep 3, 2020
10 changes: 10 additions & 0 deletions huff0/huff0.go
@@ -119,6 +119,16 @@ type Scratch struct {
huffWeight [maxSymbolValue + 1]byte
}

// TransferCTable will transfer the previously used compression table.
func (s *Scratch) TransferCTable(src *Scratch) {
if cap(s.prevTable) < len(src.prevTable) {
s.prevTable = make(cTable, 0, maxSymbolValue+1)
}
s.prevTable = s.prevTable[:len(src.prevTable)]
copy(s.prevTable, src.prevTable)
s.prevTableLog = src.prevTableLog
}

func (s *Scratch) prepare(in []byte) (*Scratch, error) {
if len(in) > BlockSizeMax {
return nil, ErrTooBig
51 changes: 15 additions & 36 deletions zstd/README.md
@@ -5,7 +5,6 @@ It offers a very wide range of compression / speed trade-off, while being backed
A high performance compression algorithm is implemented. For now it is focused on speed.

This package provides [compression](#Compressor) to and [decompression](#Decompressor) of Zstandard content.
Note that custom dictionaries are only supported for decompression.

This package is pure Go and without use of "unsafe".

@@ -232,41 +231,6 @@ nyc-taxi-data-10M.csv gzstd 1 3325605752 928656485 23876 132.83
nyc-taxi-data-10M.csv gzkp 1 3325605752 924718719 16388 193.53
```

### Converters

As part of the development process a *Snappy* -> *Zstandard* converter was also built.

This can convert a *framed* [Snappy Stream](https://godoc.org/github.com/golang/snappy#Writer) to a zstd stream.
Note that a single block is not framed.

Conversion is done by converting the stream directly from Snappy without intermediate full decoding.
Therefore the compression ratio is much lower than what can be achieved by a full decompression
and recompression, and a faulty Snappy stream may lead to a faulty Zstandard stream without
any errors being generated.
No CRC value is generated, and not all CRC values of the Snappy stream are checked.
However, it provides really fast re-compression of Snappy streams.


```
BenchmarkSnappy_ConvertSilesia-8 1 1156001600 ns/op 183.35 MB/s
Snappy len 103008711 -> zstd len 82687318

BenchmarkSnappy_Enwik9-8 1 6472998400 ns/op 154.49 MB/s
Snappy len 508028601 -> zstd len 390921079
```


```Go
s := zstd.SnappyConverter{}
n, err = s.Convert(input, output)
if err != nil {
	// handle the conversion error
}
fmt.Println("Re-compressed stream to", n, "bytes")
```

The converter `s` can be reused to avoid allocations, even after errors.


## Decompressor

Status: STABLE - there may still be subtle bugs, but a wide variety of content has been tested.
@@ -337,6 +301,21 @@ A re-used Decoder will still contain the dictionaries registered.

When registering multiple dictionaries with the same ID, the last one will be used.
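
A minimal sketch of registering several dictionaries up front; `dictA` and `dictB` are placeholder byte slices holding previously loaded dictionary files:

```Go
// Each zstd frame header records the ID of the dictionary it was
// compressed with; the decoder selects the matching registered dictionary.
dec, err := zstd.NewReader(nil, zstd.WithDecoderDicts(dictA, dictB))
if err != nil {
	panic(err)
}
defer dec.Close()
```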

It is possible to use dictionaries when compressing data.

To enable a dictionary, use `WithEncoderDict(dict []byte)`. Only one dictionary can be supplied,
and it will likely be used even if it doesn't improve compression.

The same dictionary must then be used to decompress the content.

For any real gains, the dictionary should be built with similar data.
If an unsuitable dictionary is used, the output may be slightly larger than with no dictionary at all.
Use the [zstd commandline tool](https://github.com/facebook/zstd/releases) to build a dictionary from sample data.
For information see [zstd dictionary information](https://github.com/facebook/zstd#the-case-for-small-data-compression).
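
For example, a typical training invocation (the sample paths and output name are placeholders):

```
zstd --train samples/* -o dictionary.bin
```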

For now there is a fixed startup performance penalty for compressing content with dictionaries.
This will likely be improved over time, so be sure to measure performance when integrating dictionaries.
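
Below is a minimal sketch of a dictionary round trip; the `dictionary.bin` file name and the payload are placeholders, and error handling is kept deliberately terse:

```Go
package main

import (
	"fmt"
	"io/ioutil"
	"log"

	"github.com/klauspost/compress/zstd"
)

func main() {
	// Load a dictionary built from sample data, e.g. with `zstd --train`.
	dict, err := ioutil.ReadFile("dictionary.bin")
	if err != nil {
		log.Fatal(err)
	}

	// Compress with the dictionary enabled.
	enc, err := zstd.NewWriter(nil, zstd.WithEncoderDict(dict))
	if err != nil {
		log.Fatal(err)
	}
	defer enc.Close()
	compressed := enc.EncodeAll([]byte("some sample payload"), nil)

	// The same dictionary must be registered to decompress the content.
	dec, err := zstd.NewReader(nil, zstd.WithDecoderDicts(dict))
	if err != nil {
		log.Fatal(err)
	}
	defer dec.Close()
	out, err := dec.DecodeAll(compressed, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(out))
}
```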

### Allocation-less operation

The decoder has been designed to operate without allocations after a warmup.
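
As an illustration, a sketch of the intended reuse pattern; the long-lived `dec` decoder and the `buf` scratch slice are assumed to be managed by the caller:

```Go
// DecodeAll appends to the dst slice it is given, so passing buf[:0]
// reuses buf's capacity instead of allocating a new buffer per call.
out, err := dec.DecodeAll(compressed, buf[:0])
if err != nil {
	return err
}
buf = out // keep the possibly grown buffer for the next call
```
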
2 changes: 1 addition & 1 deletion zstd/blockdec.go
@@ -646,7 +646,7 @@ func (b *blockDec) decodeCompressed(hist *history) error {
}
} else {
if hist.huffTree != nil && huff != nil {
if hist.dict == nil || hist.dict.litDec != hist.huffTree {
if hist.dict == nil || hist.dict.litEnc != hist.huffTree {
huffDecoderPool.Put(hist.huffTree)
}
hist.huffTree = nil
67 changes: 42 additions & 25 deletions zstd/blockenc.go
@@ -14,12 +14,13 @@ import (
)

type blockEnc struct {
size int
literals []byte
sequences []seq
coders seqCoders
litEnc *huff0.Scratch
wr bitWriter
size int
literals []byte
sequences []seq
coders seqCoders
litEnc *huff0.Scratch
dictLitEnc *huff0.Scratch
wr bitWriter

extraLits int
last bool
@@ -314,19 +315,19 @@ func (b *blockEnc) encodeRawTo(dst, src []byte) []byte {
}

// encodeLits can be used if the block is only litLen.
func (b *blockEnc) encodeLits(raw bool) error {
func (b *blockEnc) encodeLits(lits []byte, raw bool) error {
var bh blockHeader
bh.setLast(b.last)
bh.setSize(uint32(len(b.literals)))
bh.setSize(uint32(len(lits)))

// Don't compress extremely small blocks
if len(b.literals) < 32 || raw {
if len(lits) < 8 || (len(lits) < 32 && b.dictLitEnc == nil) || raw {
if debug {
println("Adding RAW block, length", len(b.literals), "last:", b.last)
println("Adding RAW block, length", len(lits), "last:", b.last)
}
bh.setType(blockTypeRaw)
b.output = bh.appendTo(b.output)
b.output = append(b.output, b.literals...)
b.output = append(b.output, lits...)
return nil
}

@@ -335,33 +336,38 @@ func (b *blockEnc) encodeLits(raw bool) error {
reUsed, single bool
err error
)
if len(b.literals) >= 1024 {
if b.dictLitEnc != nil {
b.litEnc.TransferCTable(b.dictLitEnc)
b.litEnc.Reuse = huff0.ReusePolicyAllow
b.dictLitEnc = nil
}
if len(lits) >= 1024 {
// Use 4 Streams.
out, reUsed, err = huff0.Compress4X(b.literals, b.litEnc)
} else if len(b.literals) > 32 {
out, reUsed, err = huff0.Compress4X(lits, b.litEnc)
} else if len(lits) > 32 {
// Use 1 stream
single = true
out, reUsed, err = huff0.Compress1X(b.literals, b.litEnc)
out, reUsed, err = huff0.Compress1X(lits, b.litEnc)
} else {
err = huff0.ErrIncompressible
}

switch err {
case huff0.ErrIncompressible:
if debug {
println("Adding RAW block, length", len(b.literals), "last:", b.last)
println("Adding RAW block, length", len(lits), "last:", b.last)
}
bh.setType(blockTypeRaw)
b.output = bh.appendTo(b.output)
b.output = append(b.output, b.literals...)
b.output = append(b.output, lits...)
return nil
case huff0.ErrUseRLE:
if debug {
println("Adding RLE block, length", len(b.literals))
println("Adding RLE block, length", len(lits))
}
bh.setType(blockTypeRLE)
b.output = bh.appendTo(b.output)
b.output = append(b.output, b.literals[0])
b.output = append(b.output, lits[0])
return nil
default:
return err
Expand All @@ -384,7 +390,7 @@ func (b *blockEnc) encodeLits(raw bool) error {
lh.setType(literalsBlockCompressed)
}
// Set sizes
lh.setSizes(len(out), len(b.literals), single)
lh.setSizes(len(out), len(lits), single)
bh.setSize(uint32(len(out) + lh.size() + 1))

// Write block headers.
@@ -444,13 +450,19 @@ func fuzzFseEncoder(data []byte) int {
}

// encode will encode the block and append the output in b.output.
func (b *blockEnc) encode(raw, rawAllLits bool) error {
// Previous offset codes must be pushed if more blocks are expected.
func (b *blockEnc) encode(org []byte, raw, rawAllLits bool) error {
if len(b.sequences) == 0 {
return b.encodeLits(rawAllLits)
return b.encodeLits(b.literals, rawAllLits)
}
// We want some difference
if len(b.literals) > (b.size - (b.size >> 5)) {
return errIncompressible
// We want some difference to at least account for the headers.
saved := b.size - len(b.literals) - (b.size >> 5)
if saved < 16 {
if org == nil {
return errIncompressible
}
b.popOffsets()
return b.encodeLits(org, rawAllLits)
}

var bh blockHeader
@@ -466,6 +478,11 @@ func (b *blockEnc) encode(raw, rawAllLits bool) error {
reUsed, single bool
err error
)
if b.dictLitEnc != nil {
b.litEnc.TransferCTable(b.dictLitEnc)
b.litEnc.Reuse = huff0.ReusePolicyAllow
b.dictLitEnc = nil
}
if len(b.literals) >= 1024 && !raw {
// Use 4 Streams.
out, reUsed, err = huff0.Compress4X(b.literals, b.litEnc)
8 changes: 7 additions & 1 deletion zstd/decoder.go
@@ -187,7 +187,7 @@ func (d *Decoder) Reset(r io.Reader) error {
d.current.err = err
d.current.flushed = true
if debug {
println("sync decode to ", len(dst), "bytes, err:", err)
println("sync decode to", len(dst), "bytes, err:", err)
}
return nil
}
@@ -303,6 +303,9 @@ func (d *Decoder) DecodeAll(input, dst []byte) ([]byte, error) {
frame.history.reset()
err := frame.reset(&frame.bBuf)
if err == io.EOF {
if debug {
println("frame reset return EOF")
}
return dst, nil
}
if frame.DictionaryID != nil {
@@ -341,6 +344,9 @@ func (d *Decoder) DecodeAll(input, dst []byte) ([]byte, error) {
return dst, err
}
if len(frame.bBuf) == 0 {
if debug {
println("frame dbuf empty")
}
break
}
}
26 changes: 22 additions & 4 deletions zstd/dict.go
@@ -13,14 +13,31 @@ import (
type dict struct {
id uint32

litDec *huff0.Scratch
litEnc *huff0.Scratch
llDec, ofDec, mlDec sequenceDec
offsets [3]int
content []byte
//llEnc, ofEnc, mlEnc []*fseEncoder
offsets [3]int
content []byte
}

var dictMagic = [4]byte{0x37, 0xa4, 0x30, 0xec}

// ID returns the dictionary id or 0 if d is nil.
func (d *dict) ID() uint32 {
if d == nil {
return 0
}
return d.id
}

// DictContentSize returns the dictionary content size or 0 if d is nil.
func (d *dict) DictContentSize() int {
if d == nil {
return 0
}
return len(d.content)
}

// Load a dictionary as described in
// https://github.com/facebook/zstd/blob/master/doc/zstd_compression_format.md#dictionary-format
func loadDict(b []byte) (*dict, error) {
@@ -43,10 +60,11 @@ func loadDict(b []byte) (*dict, error) {

// Read literal table
var err error
d.litDec, b, err = huff0.ReadTable(b[8:], nil)
d.litEnc, b, err = huff0.ReadTable(b[8:], nil)
if err != nil {
return nil, err
}
d.litEnc.Reuse = huff0.ReusePolicyMust

br := byteReader{
b: b,