
zstd: expose header decoder as public API mimicking zstd -lv foo.zst #237

Closed
ribasushi opened this issue Feb 28, 2020 · 13 comments · Fixed by #299

Comments

@ribasushi

As per subject it would be awesome if

compress/zstd/framedec.go

Lines 183 to 206 in 98b287b

d.FrameContentSize = 0
if fcsSize > 0 {
	b := br.readSmall(fcsSize)
	if b == nil {
		println("Reading Frame content", io.ErrUnexpectedEOF)
		return io.ErrUnexpectedEOF
	}
	switch fcsSize {
	case 1:
		d.FrameContentSize = uint64(b[0])
	case 2:
		// When FCS_Field_Size is 2, the offset of 256 is added.
		d.FrameContentSize = uint64(b[0]) | (uint64(b[1]) << 8) + 256
	case 4:
		d.FrameContentSize = uint64(b[0]) | (uint64(b[1]) << 8) | (uint64(b[2]) << 16) | (uint64(b[3]) << 24)
	case 8:
		d1 := uint32(b[0]) | (uint32(b[1]) << 8) | (uint32(b[2]) << 16) | (uint32(b[3]) << 24)
		d2 := uint32(b[4]) | (uint32(b[5]) << 8) | (uint32(b[6]) << 16) | (uint32(b[7]) << 24)
		d.FrameContentSize = uint64(d1) | (uint64(d2) << 32)
	}
	if debug {
		println("field size bits:", v, "fcsSize:", fcsSize, "FrameContentSize:", d.FrameContentSize, hex.EncodeToString(b[:fcsSize]), "singleseg:", d.SingleSegment, "window:", d.WindowSize)
	}
}
were available for use. At present the only ways to learn the size are to either call out to a zstd binary or to decompress the stream and count the bytes 🙀
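For context, the frame-header parsing quoted above can be reproduced with the standard library alone. The sketch below is a hypothetical, simplified reading of the `Frame_Content_Size` field as laid out in RFC 8878; `parseFCS` is illustrative only (it is not part of this package) and it skips validation of skippable frames and reserved bits:

```Go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// parseFCS reads the optional Frame_Content_Size field from the start of a
// zstd frame (RFC 8878) without touching any compressed data. It returns the
// size and whether the header actually stored one.
func parseFCS(b []byte) (uint64, bool, error) {
	if len(b) < 5 || binary.LittleEndian.Uint32(b[:4]) != 0xFD2FB528 {
		return 0, false, errors.New("not a zstd frame")
	}
	fhd := b[4] // Frame_Header_Descriptor
	singleSegment := fhd&(1<<5) != 0
	var fcsSize int
	switch fhd >> 6 { // Frame_Content_Size_flag
	case 0:
		if singleSegment {
			fcsSize = 1
		}
	case 1:
		fcsSize = 2
	case 2:
		fcsSize = 4
	case 3:
		fcsSize = 8
	}
	if fcsSize == 0 {
		return 0, false, nil // size was not recorded in the header
	}
	pos := 5
	if !singleSegment {
		pos++ // skip Window_Descriptor
	}
	pos += []int{0, 1, 2, 4}[fhd&3] // skip Dictionary_ID, if present
	if len(b) < pos+fcsSize {
		return 0, false, errors.New("truncated frame header")
	}
	var fcs uint64
	for i := 0; i < fcsSize; i++ {
		fcs |= uint64(b[pos+i]) << (8 * i)
	}
	if fcsSize == 2 {
		fcs += 256 // the 2-byte field stores size minus 256
	}
	return fcs, true, nil
}

func main() {
	// Hand-built header: magic number, descriptor selecting a 4-byte FCS,
	// a zero Window_Descriptor, then FCS = 1000 little-endian.
	hdr := []byte{0x28, 0xB5, 0x2F, 0xFD, 0x80, 0x00, 0xE8, 0x03, 0x00, 0x00}
	size, ok, err := parseFCS(hdr)
	fmt.Println(size, ok, err) // 1000 true <nil>
}
```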

@klauspost
Owner

klauspost commented Feb 28, 2020

It could be added, but it would not be super reliable.

  1. As you can see the frame size is optional. It will only be set if the size can be determined when the encode starts, meaning EncodeAll was used.

  2. Frames can be concatenated. There may be several frames on the stream and you would only get the first one.

My suggestion in general would be to keep the size outside the compression part.

@ribasushi
Author

I work with externally supplied archives, without a side-channel for supplying size. It is true that the presence of the value is not guaranteed, but when it is there, it is there. At present I am doing an abomination like this:

// imports: bytes, os/exec, regexp, strconv (io/ioutil for the fallback);
// `fn` and `decompressFile` are defined elsewhere
sizeFinder := regexp.MustCompile(`(?m)^Decompressed Size:.+\(([0-9]+) B\)`)
var size int64
{
	var out bytes.Buffer
	nativeZstd := exec.Command("zstd", "-lv", fn)
	nativeZstd.Stdout = &out
	if err := nativeZstd.Run(); err == nil {
		if sizeStr := sizeFinder.FindSubmatch(out.Bytes()); len(sizeStr) == 2 && len(sizeStr[1]) > 0 {
			size, _ = strconv.ParseInt(
				string(sizeStr[1]),
				10,
				64,
			)
		}
	}

	// :cryingbear:
	if size == 0 {
		size = decompressFile(fn, ioutil.Discard)
	}
}

Replacing the exec() abomination with reliable code you've already written would be a plus. Please reconsider! 😉

@klauspost
Owner

:cryingbear: indeed.

A quite reliable way would be to parse all block headers of a stream. It would of course require reading through the stream, but the blocks would not need to be decompressed, so it should be possible to do at IO speed.

I am not a super fan of returning the frame header values since it is unreliable when multiple frames are present.
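The block-header scan described above can be sketched with the standard library. This is a hypothetical illustration (the function name and types are mine, not the package's) of decoding the 3-byte block header defined in RFC 8878; note that `Block_Size` is the uncompressed payload size only for raw blocks, and the regenerated size for RLE blocks, while for compressed blocks it is the compressed size:

```Go
package main

import "fmt"

// parseBlockHeader decodes a 3-byte little-endian zstd block header:
// bit 0 is Last_Block, bits 1-2 are Block_Type, bits 3-23 are Block_Size.
func parseBlockHeader(b [3]byte) (last bool, blockType uint32, blockSize uint32) {
	v := uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16
	return v&1 == 1, (v >> 1) & 3, v >> 3
}

func main() {
	// A last (bit 0 set), raw (type 0) block carrying 5 bytes:
	// header value 1 | 5<<3 = 41 = 0x29, stored little-endian.
	last, typ, size := parseBlockHeader([3]byte{0x29, 0x00, 0x00})
	fmt.Println(last, typ, size) // true 0 5
}
```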

@ribasushi
Author

I am not a super fan of returning the frame header values since it is unreliable when multiple frames are present.

I apologize for being imprecise - I was not actually suggesting that directly. I was asking more about the possibility of a more general method that would internally use the part I linked. Something that works only when the input is a regular file, or some such.

For reference gzip -lv and xz -lv are a thing as well, so there is definitely sufficient surface for such an "opportunistic API"

@ribasushi
Author

Something that works only when the input buffer is a regular file or somesuch

Note that there is precedent on this already:

echo | xz -c | xz -clv
xz: --list does not support reading from standard input

@klauspost
Owner

klauspost commented Feb 28, 2020

Agree. I would like something that always works no matter how the file was compressed, but it could optionally allow unreliable numbers.

It could be something like:

// UncompressedSize returns the uncompressed size of the provided stream.
// If scanBlocks is true, all blocks of the stream will be scanned.
// While very fast, this requires reading the entire stream.
// If onlyUseHeader is true, any size value present in the header is returned.
// If no header size is set, blocks will be scanned if scanBlocks is set.
func UncompressedSize(r io.Reader, scanBlocks, onlyUseHeader bool) (int64, error)

@ribasushi
Author

@klauspost I love it!

@klauspost
Owner

UncompressedSize(r, false, false) is the only combination that doesn't make sense. I can live with that.

It shouldn't take a crazy amount of time to do, but I can't make specific promises.

@ribasushi
Author

I can't make specific promises.

No worries at all. As you can see I have a ( horrid ) workaround in place for this. I will rip it out when your version lands.

Thank you for considering my request!

@ribasushi
Author

// it will require to read the entire stream ... If onlyUseHeader is true any value in the header size is returned.

Actually now that I think about this - the "lighter" option will have to consume the first header in the stream anyway, so further decompression will fail. I wonder if there is a way to keep some state in an instance to make the following possible:

decompressor, initErr := zstd.NewReader(tap)
projectedSize, sizeErr := decompressor.UncompressedSize(false, true)
written, copyErr := io.Copy(sink, decompressor)

@klauspost
Owner

I know that would be convenient, but it would also mean a lot of housekeeping and scanning blocks will be impossible since it would have to buffer the entire stream.

So if you want the size you must be able to reset the stream yourself.

@ribasushi
Author

So if you want the size you must be able to reset the stream yourself.

This is fair, and I agree with your justification. However in this case the proposed API
func UncompressedSize(r io.Reader, scanBlocks, onlyUseHeader bool) (int64, error)
becomes somewhat redundant - scanBlocks will be nearly as fast as onlyUseHeader, therefore one might as well always give the correct answer if there are zero guarantees on the state of the io.Reader when the process is done.

Moreover, at least for zstd - in both cases ( either scanBlocks or onlyUseHeader ) you have to read until EOF anyway, since there is no other way to determine whether there is another independent frame after the one you are currently parsing.

TLDR: it seems that the following would be both simpler and more robust:
func UncompressedSize(r io.Reader) (int64, error)

@klauspost
Owner

@ribasushi I looked into this. Unfortunately compressed blocks don't have their uncompressed size available and the sequences must be decoded, which is quite expensive.

I will still add a function to get the frame header and the first block.

klauspost added a commit that referenced this issue Dec 5, 2020
Fixes #253
Fixes #237

TODO: Test(s)
klauspost added a commit that referenced this issue Dec 18, 2020
* zstd: Add header decoder

Fixes #253
Fixes #237

# Usage

```Go
	var header zstd.Header
	err = header.Decode(b)
```