Skip to content

Detect truncated GZip streams #61768

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
May 16, 2022
Merged

Detect truncated GZip streams #61768

merged 17 commits into from
May 16, 2022

Conversation

mfkl
Copy link
Contributor

@mfkl mfkl commented Nov 18, 2021

For context, please see:

TL;DR: The System.IO.Compression.GZipStream implementation over zlib does not currently do strict native error reporting when it comes to truncated streams. Other zlib .NET managed implementation or bindings over the native API do report issues on corrupt data, but they rely on an error code returned by zlib, which should be ignored sometimes. This PR tries to address this edge case.

I initially used the approach of relying on the zlib buf errors to implement this (as some other community wrappers have done), but it failed some of your tests. So, I decided to add a new API parameter to enable "strict validation" as an opt-in.

But, after more troubleshooting with the help from @marcin-krystianc, it seems we could do without the new API if we don't rely on zlib buf errors. Some of the existing unit tests should actually be changed to account for the new behavior (e.g. throw on invalid data) as I've done in mfkl@196783b

I'm eager to hear your feedback on this as I'm no compression expert.

@ghost ghost added community-contribution Indicates that the PR has been added by a community member area-System.IO.Compression and removed community-contribution Indicates that the PR has been added by a community member labels Nov 18, 2021
@ghost
Copy link

ghost commented Nov 18, 2021

Tagging subscribers to this area: @dotnet/area-system-io-compression
See info in area-owners.md if you want to be subscribed.

Issue Details

For context, please see:

TL;DR: The System.IO.Compression.GZipStream implementation over zlib does not currently do strict native error reporting when it comes to truncated streams. Other zlib .NET managed implementation or bindings over the native API do report issues on corrupt data, but they rely on an error code returned by zlib, which should be ignored sometimes. This PR tries to address this edge case.

I initially used the approach of relying on the zlib buf errors to implement this (as some other community wrappers have done), but it failed some of your tests. So, I decided to add a new API parameter to enable "strict validation" as an opt-in.

But, after more troubleshooting with the help from @marcin-krystianc, it seems we could do without the new API if we don't rely on zlib buf errors. Some of the existing unit tests should actually be changed to account for the new behavior (e.g. throw on invalid data) as I've done in mfkl@196783b

I'm eager to hear your feedback on this as I'm no compression expert.

Author: mfkl
Assignees: -
Labels:

area-System.IO.Compression

Milestone: -

@mfkl mfkl force-pushed the gzip-strict-validation branch from 196783b to 6adb821 Compare November 19, 2021 07:22
@mfkl
Copy link
Contributor Author

mfkl commented Nov 26, 2021

Hmm the failed jobs look unrelated to this PR. This is ready for review /cc @adamsitnik @danmoseley @carlossanlop

{
ThrowGenericInvalidData();
}
else
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the 'else' needed here - eg codegen improves because it does not realize that the throw method never returns?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how to properly check this but I'll look into it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I checked things correctly (by codegen you meant IL and not x86, right?), this is the result. With the else

      // [286 28 - 286 54]
      IL_00bc: call         void System.IO.Compression.DeflateStream::ThrowGenericInvalidData()
      IL_00c1: nop          

      // [287 25 - 287 26]
      IL_00c2: nop          

      IL_00c3: br.s         IL_00c8

      // [289 25 - 289 26]
      IL_00c5: nop          

      // [290 29 - 290 35]
      IL_00c6: br.s         IL_0120

      // [292 21 - 292 22]
      IL_00c8: nop          

      IL_00c9: br.s         IL_00fc

Without the else

      // [286 28 - 286 54]
      IL_00bc: call         void System.IO.Compression.DeflateStream::ThrowGenericInvalidData()
      IL_00c1: nop          

      // [287 25 - 287 26]
      IL_00c2: nop          

      // [290 29 - 290 35]
      IL_00c3: br.s         IL_011a

So it seems you are correct! I updated the PR accordingly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By codegen we generally mean assembly code. Usually the quickest way to verify that is with sharplab.io. Not sure how much the IL correlates. But it makes sense to change as you suggest.

data = compressed.ToArray();
}

for (var i = 1; i < data.Length; i++)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this manage to hit all the 4 new throw statements?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it does.

@@ -17,6 +17,7 @@ internal sealed class Inflater : IDisposable
private const int MinWindowBits = -15; // WindowBits must be between -8..-15 to ignore the header, 8..15 for
private const int MaxWindowBits = 47; // zlib headers, 24..31 for GZip headers, or 40..47 for either Zlib or GZip

private bool _nonZeroInput; // Whether there is any non zero input
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, would nonEmptyInput be better name? a zero input might be a buffer of zeroes.
or perhaps this means nonEmptyOutput?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to nonEmptyInput.

@danmoseley
Copy link
Member

cc @adamsitnik @carlossanlop @jozkee are the full compression crew.

@mfkl mfkl force-pushed the gzip-strict-validation branch from 9c54663 to def1924 Compare December 6, 2021 10:34
@mfkl
Copy link
Contributor Author

mfkl commented Jan 21, 2022

As mentioned in #61768 (comment), CI failure is unrelated, please feel free to restart a CI build, thanks!

@danmoseley
Copy link
Member

@mfkl you can restart CI by close and reopen actually ..

@danmoseley danmoseley closed this Jan 21, 2022
@danmoseley danmoseley reopened this Jan 21, 2022
@mfkl
Copy link
Contributor Author

mfkl commented Feb 7, 2022

Anything I can do to help move the review process forward?

Even though it is quite an edge case, it'd be nice to get this merged in, as it fixes one really old bug with regard to gzip streams in dotnet.

@stackedsax
Copy link

Hey there @danmoseley @jozkee, any updates from your side on this?

@danmoseley
Copy link
Member

Apologies for the delay @mfkl . Perhaps @jozkee can set expectations as I know he's juggling several tasks.

Comment on lines 893 to 896
if (!_deflateStream._inflater.Finished())
{
ThrowGenericInvalidData();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These conditions still hold true in CopyTo[Async].

Suggested change
if (!_deflateStream._inflater.Finished())
{
ThrowGenericInvalidData();
}
if (!_deflateStream._inflater.Finished())
{
Debug.Assert(_deflateStream._inflater.NonEmptyInput());
Debug.Assert(_deflateStream._inflater.AvailableOutput > 0);
ThrowGenericInvalidData();
}

@@ -418,6 +422,10 @@ async ValueTask<int> Core(Memory<byte> buffer, CancellationToken cancellationTok
int n = await _stream.ReadAsync(new Memory<byte>(_buffer, 0, _buffer.Length), cancellationToken).ConfigureAwait(false);
if (n <= 0)
{
if (!_inflater.Finished() && _inflater.NonEmptyInput() && _inflater.AvailableOutput > 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the condition as "if we haven't finished inflating AND there's partial input set AND there's already output available; we are in the bad state and we should throw".
Is it possible (i.e: an scenario) where there could be partial input but not available output?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please double check test-code-coverage of the three conditions. If I remove _inflater.AvailableOutput > 0 all tests still pass.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, without _inflater.AvailableOutput > 0, the following test fails.

  System.IO.InvalidDataException : Found invalid data while decoding.

  Stack Trace: 
DeflateStream.ThrowGenericInvalidData() line 348
DeflateStream.ReadCore(Span`1 buffer) line 280
DeflateStream.Read(Byte[] buffer, Int32 offset, Int32 count) line 235
ZipFileTestBase.ReadAllBytes(Stream stream, Byte[] buffer, Int32 offset, Int32 count) line 83
ZipFileTestBase.StreamsEqual(Stream ast, Stream bst, Int32 blocksToRead) line 142
ZipFileTestBase.StreamsEqual(Stream ast, Stream bst) line 111
ZipFile_Create.CreateFromDirectory_IncludeBaseDirectory() line 42

This is the reason I added an async version of the test in this PR, to cover the async code path.

continue;
}

throw new XunitException($"Truncated stream was decompressed successfully: length={i}");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside of truncation tests, could we test changing byte values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, I'll re-use the test from the original SO thread, if that is fine with you.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mfkl it's fine to reuse that from SO per the SO license. If you end up essentially pasting as-is, please add a few lines to our THIRD-PARTY-NOTICES to ensure that the poster is recognized. If you write something that effectively does the same thing, that's probably not necessary.

@mfkl
Copy link
Contributor Author

mfkl commented May 2, 2022

Yes, I believe so @danmoseley.

// - Inflation is not finished yet.
// - Provided input wasn't completely empty
// In such case, we are dealing with a truncated input stream.
if (!buffer.IsEmpty && !_inflater.Finished() && _inflater.NonEmptyInput())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been rethinking it and I think it makes sense to not throw for buffer.IsEmpty, we don't pinvoke inflate in such case so there's no way to tell if there's still available output on the inflater.

Comment on lines 598 to 606
try
{
while(ZipFileTestBase.ReadAllBytes(s, buffer, 0, buffer.Length) != 0) { };

Assert.Equal(source, buffer);
}
catch (InvalidDataException)
{
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will eat-up false positives. You can do this instead:

Assert.Throws<InvalidDataException>(() => { while (ZipFileTestBase.ReadAllBytes(s, buffer, 0, buffer.Length) != 0) ; });

If there are cases where a corruption can't be expected (e.g: changing a header byte), we should skip them, but they should be deliberated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I did this (code may not be elegant though, let me know if you have a more elegant version in mind).

Skipping though means that we do not check the decompression process was successful in these cases (e.g. comparing the decompressed buffer with the initial source). It may or may not be relevant to do so, I'm not sure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not exactly sure what you mean here, but I think the way you have it in the latest commits is the way to go, we should always assert for the exception when desired, not just ignore it (unless there's a reason for it).

Assert.Throws<InvalidDataException>(() =>
{
while (ZipFileTestBase.ReadAllBytes(decompressor, buffer, 0, buffer.Length) != 0);
});

@mfkl mfkl force-pushed the gzip-strict-validation branch from 5f52914 to 0c492ab Compare May 5, 2022 08:57
@mfkl
Copy link
Contributor Author

mfkl commented May 5, 2022

zlib should be able to detect corruption as with gzip currently (but not brotli). I'm looking for a way to add this test without duplicating code (as some types of streams are unable to detect such corruption).

@mfkl
Copy link
Contributor Author

mfkl commented May 6, 2022

For the truncated error case, should a new error message be created to more precisely describe the problem?

Currently Found invalid data while decoding is used (and Decoder ran into invalid data for brotli). It could be Found truncated data while decoding to provide a more meaningful hint.

Aside from this, I think it's OK now. Let me know!

Comment on lines 295 to 296
// corrupting these bytes goes undetected by gzip, skip them
int[] byteToSkip = { 3, 4, 5, 6, 7, 8, 9 };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add, as a comment, the reason why these bytes don't corrupt the compressed data?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, done.

@@ -16,5 +18,44 @@ public class ZLibStreamUnitTests : CompressionStreamUnitTestBase
public override Stream CreateStream(Stream stream, CompressionLevel level, bool leaveOpen) => new ZLibStream(stream, level, leaveOpen);
public override Stream BaseStream(Stream stream) => ((ZLibStream)stream).BaseStream;
protected override string CompressedTestFile(string uncompressedPath) => Path.Combine("ZLibTestData", Path.GetFileName(uncompressedPath) + ".z");

[Fact]
public void StreamCorruption_IsDetected()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we add this test to CompressionStreamUnitTests.Deflate.cs as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly this is not really an option as DeflateStream does not do CRC, hence fails to detect stream corruption (unlike zlib and gzip streams). Truncation detection works fine though.

@jozkee
Copy link
Member

jozkee commented May 10, 2022

For the truncated error case, should a new error message be created to more precisely describe the problem?

Currently Found invalid data while decoding is used (and Decoder ran into invalid data for brotli). It could be Found truncated data while decoding to provide a more meaningful hint.

I think we can do that, feel free to add your proposed error message to the .\src\libraries\System.IO.Compression\src\Resources\Strings.resx file and use it as needed.

Copy link
Member

@jozkee jozkee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @mfkl

@jozkee jozkee added breaking-change Issue or PR that represents a breaking API or functional change over a prerelease. needs-breaking-change-doc-created Breaking changes need an issue opened with https://github.com/dotnet/docs/issues/new?template=dotnet labels May 16, 2022
@jozkee jozkee merged commit e71a46b into dotnet:main May 16, 2022
@ghost
Copy link

ghost commented May 16, 2022

Added needs-breaking-change-doc-created label because this PR has the breaking-change label.

When you commit this breaking change:

  1. Create and link to this PR and the issue a matching issue in the dotnet/docs repo using the breaking change documentation template, then remove this needs-breaking-change-doc-created label.
  2. Ask a committer to mail the .NET Breaking Change Notification DL.

Tagging @dotnet/compat for awareness of the breaking change.

@mfkl
Copy link
Contributor Author

mfkl commented May 17, 2022

Thanks for the review feedback @jozkee :)

@mfkl mfkl deleted the gzip-strict-validation branch May 17, 2022 06:13
@ghost ghost locked as resolved and limited conversation to collaborators Jun 16, 2022
@stephentoub
Copy link
Member

Reverted in #72742

@stephentoub stephentoub removed breaking-change Issue or PR that represents a breaking API or functional change over a prerelease. needs-breaking-change-doc-created Breaking changes need an issue opened with https://github.com/dotnet/docs/issues/new?template=dotnet labels Sep 23, 2022
@stephentoub
Copy link
Member

@ericstj, I've removed the breaking change labels as this was reverted.

@ericstj
Copy link
Member

ericstj commented Sep 23, 2022

Thank you @stephentoub

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants