
zstd: expose header decoder as public API mimicking zstd -lv foo.zst #237

Closed
ribasushi opened this issue Feb 28, 2020 · 13 comments · Fixed by #299

Comments

@ribasushi

As per subject it would be awesome if

compress/zstd/framedec.go

Lines 183 to 206 in 98b287b

d.FrameContentSize = 0
if fcsSize > 0 {
	b := br.readSmall(fcsSize)
	if b == nil {
		println("Reading Frame content", io.ErrUnexpectedEOF)
		return io.ErrUnexpectedEOF
	}
	switch fcsSize {
	case 1:
		d.FrameContentSize = uint64(b[0])
	case 2:
		// When FCS_Field_Size is 2, the offset of 256 is added.
		d.FrameContentSize = uint64(b[0]) | (uint64(b[1]) << 8) + 256
	case 4:
		d.FrameContentSize = uint64(b[0]) | (uint64(b[1]) << 8) | (uint64(b[2]) << 16) | (uint64(b[3]) << 24)
	case 8:
		d1 := uint32(b[0]) | (uint32(b[1]) << 8) | (uint32(b[2]) << 16) | (uint32(b[3]) << 24)
		d2 := uint32(b[4]) | (uint32(b[5]) << 8) | (uint32(b[6]) << 16) | (uint32(b[7]) << 24)
		d.FrameContentSize = uint64(d1) | (uint64(d2) << 32)
	}
	if debug {
		println("field size bits:", v, "fcsSize:", fcsSize, "FrameContentSize:", d.FrameContentSize, hex.EncodeToString(b[:fcsSize]), "singleseg:", d.SingleSegment, "window:", d.WindowSize)
	}
}
were available for use. At present the only ways to learn the size are to either call out to a zstd binary or to decompress the stream and count the bytes 🙀
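For context, the frame-header parsing quoted above can be reproduced with the standard library alone. The sketch below is a hypothetical, simplified reading of the `Frame_Content_Size` field as laid out in RFC 8878; `parseFCS` is illustrative only (it is not part of this package) and it skips validation of skippable frames and reserved bits:

```Go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// parseFCS reads the optional Frame_Content_Size field from the start of a
// zstd frame (RFC 8878) without touching any compressed data. It returns the
// size and whether the header actually stored one.
func parseFCS(b []byte) (uint64, bool, error) {
	if len(b) < 5 || binary.LittleEndian.Uint32(b[:4]) != 0xFD2FB528 {
		return 0, false, errors.New("not a zstd frame")
	}
	fhd := b[4] // Frame_Header_Descriptor
	singleSegment := fhd&(1<<5) != 0
	var fcsSize int
	switch fhd >> 6 { // Frame_Content_Size_flag
	case 0:
		if singleSegment {
			fcsSize = 1
		}
	case 1:
		fcsSize = 2
	case 2:
		fcsSize = 4
	case 3:
		fcsSize = 8
	}
	if fcsSize == 0 {
		return 0, false, nil // size was not recorded in the header
	}
	pos := 5
	if !singleSegment {
		pos++ // skip Window_Descriptor
	}
	pos += []int{0, 1, 2, 4}[fhd&3] // skip Dictionary_ID, if present
	if len(b) < pos+fcsSize {
		return 0, false, errors.New("truncated frame header")
	}
	var fcs uint64
	for i := 0; i < fcsSize; i++ {
		fcs |= uint64(b[pos+i]) << (8 * i)
	}
	if fcsSize == 2 {
		fcs += 256 // the 2-byte field stores size minus 256
	}
	return fcs, true, nil
}

func main() {
	// Hand-built header: magic number, descriptor selecting a 4-byte FCS,
	// a zero Window_Descriptor, then FCS = 1000 little-endian.
	hdr := []byte{0x28, 0xB5, 0x2F, 0xFD, 0x80, 0x00, 0xE8, 0x03, 0x00, 0x00}
	size, ok, err := parseFCS(hdr)
	fmt.Println(size, ok, err) // 1000 true <nil>
}
```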

@klauspost
Owner

klauspost commented Feb 28, 2020

It could be added, but it would not be super reliable.

  1. As you can see the frame size is optional. It will only be set if the size can be determined when the encode starts, meaning EncodeAll was used.

  2. Frames can be concatenated. There may be several frames on the stream and you would only get the first one.

My suggestion in general would be to keep the size outside the compression part.

@ribasushi
Author

I work with externally supplied archives, without a side-channel for supplying size. It is true that the presence of the value is not guaranteed, but when it is there, it is there. At present I am doing an abomination like this:

// imports: bytes, os/exec, regexp, strconv (io/ioutil for the fallback);
// `fn` and `decompressFile` are defined elsewhere
sizeFinder := regexp.MustCompile(`(?m)^Decompressed Size:.+\(([0-9]+) B\)`)
var size int64
{
	var out bytes.Buffer
	nativeZstd := exec.Command("zstd", "-lv", fn)
	nativeZstd.Stdout = &out
	if err := nativeZstd.Run(); err == nil {
		if sizeStr := sizeFinder.FindSubmatch(out.Bytes()); len(sizeStr) == 2 && len(sizeStr[1]) > 0 {
			size, _ = strconv.ParseInt(
				string(sizeStr[1]),
				10,
				64,
			)
		}
	}

	// :cryingbear:
	if size == 0 {
		size = decompressFile(fn, ioutil.Discard)
	}
}

Replacing the exec() abomination with reliable code you've already written would be a plus. Please reconsider! 😉

@klauspost
Owner

:cryingbear: indeed.

A quite reliable way would be to parse all block headers of a stream. It would of course require reading through the stream, but the blocks would not need to be decompressed, so it should be possible to do at IO speed.

I am not a super fan of returning the frame header values since it is unreliable when multiple frames are present.
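The block-header scan described above can be sketched with the standard library. This is a hypothetical illustration (the function name and types are mine, not the package's) of decoding the 3-byte block header defined in RFC 8878; note that `Block_Size` is the uncompressed payload size only for raw blocks, and the regenerated size for RLE blocks, while for compressed blocks it is the compressed size:

```Go
package main

import "fmt"

// parseBlockHeader decodes a 3-byte little-endian zstd block header:
// bit 0 is Last_Block, bits 1-2 are Block_Type, bits 3-23 are Block_Size.
func parseBlockHeader(b [3]byte) (last bool, blockType uint32, blockSize uint32) {
	v := uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16
	return v&1 == 1, (v >> 1) & 3, v >> 3
}

func main() {
	// A last (bit 0 set), raw (type 0) block carrying 5 bytes:
	// header value 1 | 5<<3 = 41 = 0x29, stored little-endian.
	last, typ, size := parseBlockHeader([3]byte{0x29, 0x00, 0x00})
	fmt.Println(last, typ, size) // true 0 5
}
```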

@ribasushi
Author

I am not a super fan of returning the frame header values since it is unreliable when multiple frames are present.

I apologize for being imprecise - I was not actually suggesting that directly. I was asking more about the possibility of a more general method that would internally use the part I linked. Something that works only when the input is a regular file, or some such.

For reference gzip -lv and xz -lv are a thing as well, so there is definitely sufficient surface for such an "opportunistic API"

@ribasushi
Author

Something that works only when the input buffer is a regular file or somesuch

Note that there is precedent on this already:

echo | xz -c | xz -clv
xz: --list does not support reading from standard input

@klauspost
Owner

klauspost commented Feb 28, 2020

Agree. I would like something that always works no matter how the file was compressed, but it could optionally allow unreliable numbers.

It could be something like:

// UncompressedSize returns the uncompressed size of the provided stream.
// If scanBlocks is true, all blocks of the stream will be scanned.
// While very fast, this requires reading the entire stream.
// If onlyUseHeader is true, any size value present in the header is returned.
// If no header size is set, blocks will be scanned if scanBlocks is set.
func UncompressedSize(r io.Reader, scanBlocks, onlyUseHeader bool) (int64, error)

@ribasushi
Author

@klauspost I love it!

@klauspost
Owner

UncompressedSize(r, false, false) is the only combination that doesn't make sense. I can live with that.

It shouldn't take a crazy amount of time to do, but I can't make specific promises.

@ribasushi
Author

I can't make specific promises.

No worries at all. As you can see I have a ( horrid ) workaround in place for this. I will rip it out when your version lands.

Thank you for considering my request!

@ribasushi
Author

// it will require to read the entire stream ... If onlyUseHeader is true any value in the header size is returned.

Actually now that I think about this - the "lighter" option will have to consume the first header in the stream anyway, so further decompression will fail. I wonder if there is a way to keep some state in an instance to make the following possible:

decompressor, initErr := zstd.NewReader(tap)
projectedSize, sizeErr := decompressor.UncompressedSize(false, true)
written, copyErr := io.Copy(sink, decompressor)

@klauspost
Owner

I know that would be convenient, but it would also mean a lot of housekeeping and scanning blocks will be impossible since it would have to buffer the entire stream.

So if you want the size you must be able to reset the stream yourself.

@ribasushi
Author

So if you want the size you must be able to reset the stream yourself.

This is fair, and I agree with your justification. However in this case the proposed API
func UncompressedSize(r io.Reader, scanBlocks, onlyUseHeader bool) (int64, error)
becomes somewhat redundant - scanBlocks will be nearly as fast as onlyUseHeader, therefore one might as well always give the correct answer if there are zero guarantees on the state of the io.Reader when the process is done.

Moreover, at least for zstd - in both cases ( either scanBlocks or onlyUseHeader ) you have to read until EOF anyway, since there is no other way to determine whether there is another independent frame after the one you are currently parsing.

TLDR: it seems that the following would be both simpler and more robust:
func UncompressedSize(r io.Reader) (int64, error)

@klauspost
Owner

@ribasushi I looked into this. Unfortunately compressed blocks don't have their uncompressed size available and the sequences must be decoded, which is quite expensive.

I will still add a function to get the frame header and the first block.

klauspost added a commit that referenced this issue Dec 5, 2020
Fixes #253
Fixes #237

TODO: Test(s)
klauspost added a commit that referenced this issue Dec 18, 2020
* zstd: Add header decoder

Fixes #253
Fixes #237

# Usage

```Go
	var header zstd.Header
	err = header.Decode(b)
```