Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Respect general-purpose bit flags when decoding ZipArchiveEntry names and comments #103271

Merged
merged 4 commits into from
Aug 6, 2024
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
Resolving 92283
If bit 11 in the general purpose bit flags is set, forces the use of UTF-8 instead of the encoding specified in the ZipArchive constructor.
  • Loading branch information
edwardneal committed Jun 11, 2024
commit f2723da69a9218468546f0804497c45024aa0ef4
Original file line number Diff line number Diff line change
Expand Up @@ -77,7 +77,7 @@ internal ZipArchiveEntry(ZipArchive archive, ZipCentralDirectoryFileHeader cd)
_outstandingWriteStream = null;

_storedEntryNameBytes = cd.Filename;
_storedEntryName = (_archive.EntryNameAndCommentEncoding ?? Encoding.UTF8).GetString(_storedEntryNameBytes);
_storedEntryName = DecodeEntryString(_storedEntryNameBytes);
DetectEntryNameVersion();

_lhUnknownExtraFields = null;
Expand Down Expand Up @@ -200,7 +200,7 @@ public int ExternalAttributes
[AllowNull]
public string Comment
{
get => (_archive.EntryNameAndCommentEncoding ?? Encoding.UTF8).GetString(_fileComment);
get => DecodeEntryString(_fileComment);
set
{
_fileComment = ZipHelper.GetEncodedTruncatedBytesFromString(value, _archive.EntryNameAndCommentEncoding, ushort.MaxValue, out bool isUTF8);
Expand Down Expand Up @@ -352,6 +352,18 @@ public override string ToString()
return FullName;
}

private string DecodeEntryString(byte[] entryStringBytes)
{
Debug.Assert(entryStringBytes != null);

Encoding readEntryStringEncoding =
(_generalPurposeBitFlag & BitFlagValues.UnicodeFileNameAndComment) == BitFlagValues.UnicodeFileNameAndComment
? Encoding.UTF8
: _archive?.EntryNameAndCommentEncoding ?? Encoding.UTF8;

return readEntryStringEncoding.GetString(entryStringBytes);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you're bringing back a condition similar to the one we used to have in .NET 6, as proposed in the issue description. I can see that it fixes the problem. I'd like to provide some context.

Before this change:

The archive's encoding was set either via the ZipArchive public constructors or the ZipFile Open or Create methods that take an entryNameEncoding argument. This encoding was then passed to the ZipArchiveEntry constructors that take an encoding argument as well. This is important because we need to respect that encoding when the uses passes it as an argument.

We would only read the encoding from each entry's general purpose bit when the user did not pass an entryNameEncoding argument or when the value was passed as null, in which case the archive entries were constructed using the internal ZipArchiveEntry constructor that take an archive and a central directory header as arguments.

What we were not doing was read this general purpose bit flag value when the public ZipArchiveEntry constructors were called, meaning that when the time came to check the encoding of the entry, we would check if the archive's EntryNameAndCommentEncoding had a value, but it would always be unset (because of the constructor used to create this entry). So the next step was to default to UTF8 and read the file comment using that encoding.

After this change:

We will always first read the entry's general purpose bit flag to see if the value was set. If not, then we will do what we were doing before: check the EntryNameAndCommentEncoding, and if that is unset, we default to UTF8 to read the file comment.

The bit flag will only be set when the internal entry constructor is called, which was the bug. This should not affect entries created with the public constructors, meaning the behavior remains unmodified for them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the context @carlossanlop. I think we've both read the pre-PR code for decoding existing ZipArchiveEntrys in the same way.

I'll re-check and test for the correct behaviour when creating a new ZipArchiveEntry - I can see that L136 of ZipArchiveEntry sets FullName, and this property's setter uses the existing behaviour you've described to convert the value to a byte array. This looks correct to me, but I'll check.

Was prioritising the supplied encoding over the general-purpose bit flags deliberate? The ZIP file spec. seems to indicate that it's the other way around: if bit 11 of the flags is set, the filename must always be UTF-8. I'm looking at the latest version, sections 4.4.4 and appendix D.

Section 4.4.4:

Bit 11: Language encoding flag (EFS). If this bit is set,
the filename and comment fields for this file
MUST be encoded using UTF-8. (see APPENDIX D)

Appendix D:

D.1 The ZIP format has historically supported only the original IBM PC character
encoding set, commonly referred to as IBM Code Page 437. This limits storing
file name characters to only those within the original MS-DOS range of values
and does not properly support file names in other character encodings, or
languages. To address this limitation, this specification will support the
following change.

D.2 If general purpose bit 11 is unset, the file name and comment SHOULD conform
to the original ZIP character encoding. If general purpose bit 11 is set, the
filename and comment MUST support The Unicode Standard, Version 4.1.0 or
greater using the character encoding form defined by the UTF-8 storage
specification. The Unicode Standard is published by the The Unicode
Consortium (www.unicode.org). UTF-8 encoded data stored within ZIP files
is expected to not include a byte order mark (BOM).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm inclined to sign-off. I looked at the change history in this area, I wanted to double check that we were not regressing this: #87883 Note that I added the tests for that fix in a follow-up PR: #88978

I can see that we did not break them in this PR.


// Only allow opening ZipArchives with large ZipArchiveEntries in update mode when running in a 64-bit process.
// This is for compatibility with old behavior that threw an exception for all process bitnesses, because this
// will not work in a 32-bit process.
Expand Down