[Breaking change]: ZipArchiveEntry names and comments now respect UTF8 flag when decoded #42003

Closed
@edwardneal

Description

Relates to dotnet/runtime#103271.

A ZipArchive can be created with an Encoding parameter, which is used to decode the names and comments of entries in the ZIP archive. .NET 7 and 8 introduced a regression where this encoding, when supplied, was always used, even for entries flagged as UTF8; if no encoding was supplied, decoding fell back to the system default code page (UTF8 in .NET Core). This regression is being corrected in .NET 9: if the entry's general purpose bit flags indicate that UTF8 should be used, this will be respected; otherwise, the user-supplied encoding will be used (with the existing fallback to the system default code page if none is supplied).

I've stated that .NET 9 RC 1 introduced this change - the PR hasn't yet been merged (it's pending this work) so I've selected the next known release. It'll definitely be in .NET 9.

Version

.NET 9 RC 1

Previous behavior

If ZipArchive was instantiated with a user-specified entryNameEncoding parameter, this encoding would always be used when decoding the names and comments of entries in the ZIP archive, even if the entry had the bit set to signify that its name and comment were encoded in UTF8.

New behavior

When a ZIP archive entry's name and comment are being decoded, its UTF8 bit flag will be respected. The user-supplied entryNameEncoding parameter will only be used to decode the entry's name and comment if this bit flag is unset.
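The difference between the two behaviors can be sketched in a few lines. This is an illustration in Python of the decision described above, not the .NET implementation: the function names are hypothetical, and the "utf-8" fallback stands in for the system default code page mentioned in the description.

```python
from typing import Optional

UTF8_FLAG = 0x0800  # general purpose bit 11 (language encoding flag, EFS)
DEFAULT_ENCODING = "utf-8"  # assumption: stands in for the system default code page


def decode_previous(raw: bytes, flag_bits: int, entry_name_encoding: Optional[str] = None) -> str:
    """Model of the .NET 7/8 behavior: a supplied encoding always wins."""
    # flag_bits is deliberately ignored here; that is the regression.
    return raw.decode(entry_name_encoding or DEFAULT_ENCODING)


def decode_new(raw: bytes, flag_bits: int, entry_name_encoding: Optional[str] = None) -> str:
    """Model of the .NET 9 behavior: bit 11 takes priority over the supplied encoding."""
    if flag_bits & UTF8_FLAG:
        return raw.decode("utf-8")
    return raw.decode(entry_name_encoding or DEFAULT_ENCODING)
```

With a name stored as UTF8 and flagged as such, `decode_new` returns the original text whatever encoding the caller supplies, while `decode_previous` reinterprets the bytes in the supplied encoding.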

Type of breaking change

  • Binary incompatible: Existing binaries might encounter a breaking change in behavior, such as failure to load or execute, and if so, require recompilation.
  • Source incompatible: When recompiled using the new SDK or component or to target the new runtime, existing source code might require source changes to compile successfully.
  • Behavioral change: Existing binaries might behave differently at run time.

Reason for change

This corrects a regression in .NET 7 and .NET 8 (reported in dotnet/runtime#92283). It also returns ZipArchive to compliance with section 4.4.4 and Appendix D of the ZIP file format specification.

Section 4.4.4:

Bit 11: Language encoding flag (EFS). If this bit is set,
the filename and comment fields for this file
MUST be encoded using UTF-8. (see APPENDIX D)

Appendix D:

D.1 The ZIP format has historically supported only the original IBM PC character
encoding set, commonly referred to as IBM Code Page 437. This limits storing
file name characters to only those within the original MS-DOS range of values
and does not properly support file names in other character encodings, or
languages. To address this limitation, this specification will support the
following change.

D.2 If general purpose bit 11 is unset, the file name and comment SHOULD conform
to the original ZIP character encoding. If general purpose bit 11 is set, the
filename and comment MUST support The Unicode Standard, Version 4.1.0 or
greater using the character encoding form defined by the UTF-8 storage
specification. The Unicode Standard is published by the The Unicode
Consortium (www.unicode.org). UTF-8 encoded data stored within ZIP files
is expected to not include a byte order mark (BOM).
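To see where bit 11 lives in practice, the following sketch inspects the general purpose flags of each entry with Python's standard zipfile module, which exposes them as `ZipInfo.flag_bits`. The helper names are hypothetical, and the sample archive relies on the Python writer setting the flag for names that are not pure ASCII.

```python
import io
import zipfile

UTF8_FLAG = 1 << 11  # general purpose bit 11, as defined in section 4.4.4


def utf8_flag_by_entry(zip_bytes: bytes) -> dict:
    """Map each entry name to whether its UTF-8 flag (bit 11) is set."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {
            info.filename: bool(info.flag_bits & UTF8_FLAG)
            for info in archive.infolist()
        }


def build_sample_archive() -> bytes:
    """Build a small in-memory archive with one ASCII and one non-ASCII name."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("plain.txt", "ascii name")
        archive.writestr("r\u00e9sum\u00e9.txt", "non-ASCII name")
    return buffer.getvalue()


if __name__ == "__main__":
    for name, flagged in utf8_flag_by_entry(build_sample_archive()).items():
        print(f"{name}: UTF-8 flag set = {flagged}")
```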

Recommended action

Users passing an encoding to the ZipArchive constructor should be aware that this encoding will not be used in all situations. It will only be used if the entry's UTF8 bit is not set.

Users who rely on ZipArchive to read ZIP entries whose names are not encoded in UTF8, but which have the UTF8 bit flag set, will no longer be able to do so. Such entries do not conform to the ZIP specification, and decoding them with the supplied encoding was always a bug.
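One way to find out whether existing archives are affected is to scan the central directory for entries that have bit 11 set but whose name or comment bytes are not valid UTF8. The sketch below does this in Python by reading the raw bytes directly. It is a simplified reader under stated assumptions: it does not handle ZIP64 or multi-disk archives, it takes the last end-of-central-directory signature in the data as the real one, and all function names are hypothetical.

```python
import struct

UTF8_FLAG = 0x0800  # general purpose bit 11
EOCD_SIGNATURE = b"PK\x05\x06"  # end of central directory record
CENTRAL_HEADER_SIGNATURE = b"PK\x01\x02"  # central directory file header


def iter_central_directory(data: bytes):
    """Yield (flags, raw_name, raw_comment) for each central directory entry."""
    eocd = data.rfind(EOCD_SIGNATURE)
    if eocd < 0:
        raise ValueError("end of central directory record not found")
    # Total entry count (2 bytes), directory size (4), directory offset (4).
    entry_count, _size, offset = struct.unpack_from("<HII", data, eocd + 10)
    pos = offset
    for _ in range(entry_count):
        if data[pos:pos + 4] != CENTRAL_HEADER_SIGNATURE:
            raise ValueError("malformed central directory header")
        (flags,) = struct.unpack_from("<H", data, pos + 8)
        name_len, extra_len, comment_len = struct.unpack_from("<HHH", data, pos + 28)
        name_start = pos + 46  # fixed-size part of the header is 46 bytes
        comment_start = name_start + name_len + extra_len
        yield (
            flags,
            data[name_start:name_start + name_len],
            data[comment_start:comment_start + comment_len],
        )
        pos = comment_start + comment_len


def is_valid_utf8(raw: bytes) -> bool:
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def find_misflagged_entries(data: bytes) -> list:
    """Return raw names of entries flagged as UTF-8 whose bytes are not valid UTF-8."""
    return [
        name
        for flags, name, comment in iter_central_directory(data)
        if flags & UTF8_FLAG and not (is_valid_utf8(name) and is_valid_utf8(comment))
    ]
```

Any entry this reports is one whose name or comment could previously be read by passing an explicit encoding and can no longer be decoded that way. A name that happens to be valid UTF8 in both interpretations will not be reported, so an empty result is not proof that an archive is unaffected.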

Feature area

Core .NET libraries

Affected APIs

ZipArchive..ctor(Stream, ZipArchiveMode, Boolean, Encoding)
ZipFile.ExtractToDirectory(String, String, Encoding, Boolean)
ZipFile.ExtractToDirectory(Stream, String, Encoding, Boolean)
ZipFile.ExtractToDirectory(String, String, Encoding)
ZipFile.ExtractToDirectory(Stream, String, Encoding)
ZipFile.Open(String, ZipArchiveMode, Encoding)

Associated WorkItem - 292500

Labels

  • 📌 seQUESTered: Identifies that an issue has been imported into Quest.
  • breaking-change: Indicates a .NET Core breaking change
  • in-pr: This issue will be closed (fixed) by an active pull request.
