Add support for the language encoding flag #1

Merged

Conversation

mstock
Contributor

@mstock mstock commented Mar 7, 2019

I recently noticed that IO::Compress::Zip does not currently support setting the language encoding flag of a zip file member. When the member names are UTF-8 encoded, this can result in file names with 'weird' characters after unpacking (for example with Windows Explorer on Windows 7, but likely with other tools as well).

This pull request adds support for setting the flag when creating a zip file, and to get back the value of the flag when reading a zip file. It doesn't do any encoding or decoding of the name and comment fields.
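For readers unfamiliar with the flag: it is bit 11 (0x0800) of the general purpose bit flag stored in each member's header. The following is a minimal sketch of what "set on write, read back on read" means at the byte level. It uses Python's standard `zipfile` module purely for illustration and says nothing about the IO::Compress::Zip interface; the helper names `make_zip` and `local_header_flags` are made up for this example.

```python
import io
import struct
import zipfile

EFS_BIT = 0x0800  # general purpose bit 11: name and comment are UTF-8


def make_zip(member_name: str) -> bytes:
    """Build an in-memory zip archive holding one small member."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(member_name, b"hello")
    return buffer.getvalue()


def local_header_flags(zip_bytes: bytes) -> int:
    """Return the general purpose bit flag of the first local file header."""
    # Local file header layout: signature (4 bytes), version needed (2), flags (2), ...
    signature, _version, flags = struct.unpack_from("<IHH", zip_bytes, 0)
    if signature != 0x04034B50:
        raise ValueError("archive does not start with a local file header")
    return flags


ascii_flags = local_header_flags(make_zip("plain.txt"))
utf8_flags = local_header_flags(make_zip("grüße.txt"))

print("ASCII name, bit 11 set:", bool(ascii_flags & EFS_BIT))
print("UTF-8 name, bit 11 set:", bool(utf8_flags & EFS_BIT))
```

Python's `zipfile` sets the bit on its own whenever a name cannot be encoded as ASCII, which is one possible policy; this pull request instead leaves the decision to the caller.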

This is also a very partial revert of 02889ff where the support for the language encoding flag was commented out in IO::Compress::Zip. Unfortunately, I couldn't find a reason why this was done.

I've already prepared a branch in Archive-Zip-SimpleZip based on this, so should this get merged, I'm planning to create a pull request for Archive-Zip-SimpleZip as well.

This allows setting the flag when creating a zip file, and getting back the
value of the flag when reading a zip file. It doesn't do any encoding or
decoding of the name and comment fields.

This is also a very partial revert of 02889ff where the support for the
language encoding flag was commented out in IO::Compress::Zip.
@pmqs
Owner

pmqs commented Mar 8, 2019

Hey Manfred

Thanks for the patch. This feature has been on my todo list for a very long time.

From memory, the primary reason support was removed was that there are a couple of ways to store a UTF8 filename/comment in a zip file. At the time I hadn't decided which method to use. The one used here is the most common, but I pulled support until I bottomed out on that analysis.

In principle I'm ok with (finally) enabling this feature, but I need to take another quick look at the zip specs before I reinstate the change.

There are a few changes I'll need to make before this can be released into the wild:

  1. Change the option name from -utf8 to something else.
    I know that’s what I used myself, but I don't think it is the best name for this feature. Might use -UTF8Header instead.
  2. Decide what (if anything) to do if badly formed UTF8 is passed in for a filename/comment.
  3. Add the encoding (obviously)
  4. Get the test harness to actually write UTF8 filename/comments.
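
Item 2 is a policy question that this thread leaves open. As a sketch of just one possible answer (reject rather than repair), written in Python for illustration and not reflecting anything in IO::Compress::Zip, a strict decode is enough to detect badly formed input. `check_utf8_name` is a hypothetical helper; the byte order mark check follows the APPNOTE expectation that UTF-8 data in zip files carries no BOM.

```python
def check_utf8_name(raw: bytes) -> str:
    """Decode a member name strictly, rejecting malformed UTF-8 and a leading BOM."""
    try:
        name = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(f"name is not well-formed UTF-8: {exc}") from exc
    if name.startswith("\ufeff"):
        raise ValueError("name must not start with a byte order mark")
    return name


for candidate in ("grüße.txt".encode("utf-8"), b"\xc3\x28.txt"):
    try:
        print("accepted:", check_utf8_name(candidate))
    except ValueError as err:
        print("rejected:", err)
```

Other policies are equally plausible, for example replacing bad sequences or writing the bytes unchanged with the flag left unset.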

@pmqs
Owner

pmqs commented Mar 8, 2019

References

APPNOTE.TXT, APPENDIX D - Language Encoding (EFS)

D.1 The ZIP format has historically supported only the original IBM PC character 
encoding set, commonly referred to as IBM Code Page 437.  This limits storing 
file name characters to only those within the original MS-DOS range of values 
and does not properly support file names in other character encodings, or 
languages. To address this limitation, this specification will support the 
following change. 

D.2 If general purpose bit 11 is unset, the file name and comment SHOULD conform 
to the original ZIP character encoding.  If general purpose bit 11 is set, the 
filename and comment MUST support The Unicode Standard, Version 4.1.0 or 
greater using the character encoding form defined by the UTF-8 storage 
specification.  The Unicode Standard is published by the The Unicode
Consortium (www.unicode.org).  UTF-8 encoded data stored within ZIP files 
is expected to not include a byte order mark (BOM). 
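
(Reader's note on D.2, not part of the quoted APPNOTE text.) The rule in D.2 amounts to choosing a decoder based on bit 11. A minimal sketch in Python, assuming the general purpose flags and the raw name bytes have already been read from the header; `decode_member_name` is a hypothetical helper:

```python
EFS_BIT = 0x0800  # general purpose bit 11


def decode_member_name(raw_name: bytes, flags: int) -> str:
    """Decode a stored file name according to general purpose bit 11."""
    if flags & EFS_BIT:
        # Bit 11 set: name and comment are UTF-8.
        return raw_name.decode("utf-8")
    # Bit 11 unset: fall back to the original ZIP encoding, IBM Code Page 437.
    return raw_name.decode("cp437")


raw = "grüße.txt".encode("utf-8")
print(decode_member_name(raw, EFS_BIT))  # decoded as UTF-8
print(decode_member_name(raw, 0))        # same bytes misread as Code Page 437
```

The second call shows the kind of garbled name that appears when UTF-8 bytes are written without the flag and the reader assumes Code Page 437.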

D.3 Applications MAY choose to supplement this file name storage through the use 
of the 0x0008 Extra Field.  Storage for this optional field is currently 
undefined, however it will be used to allow storing extended information 
on source or target encoding that MAY further assist applications with file 
name, or file content encoding tasks.  Please contact PKWARE with any
requirements on how this field SHOULD be used.

D.4 The 0x0008 Extra Field storage MAY be used with either setting for general 
purpose bit 11.  Examples of the intended usage for this field is to store 
whether "modified-UTF-8" (JAVA) is used, or UTF-8-MAC.  Similarly, other 
commonly used character encoding (code page) designations can be indicated 
through this field.  Formalized values for use of the 0x0008 record remain 
undefined at this time.  The definition for the layout of the 0x0008 field
will be published when available.  Use of the 0x0008 Extra Field provides
for storing data within a ZIP file in an encoding other than IBM Code
Page 437 or UTF-8.

D.5 General purpose bit 11 will not imply any encoding of file content or
password.  Values defining character encoding for file content or 
password MUST be stored within the 0x0008 Extended Language Encoding 
Extra Field.

D.6 Ed Gordon of the Info-ZIP group has defined a pair of "extra field" records 
that can be used to store UTF-8 file name and file comment fields.  These
records can be used for cases when the general purpose bit 11 method
for storing UTF-8 data in the standard file name and comment fields is
not desirable.  A common case for this alternate method is if backward
compatibility with older programs is required.

D.7 Definitions for the record structure of these fields are included above 
in the section on 3rd party mappings for "extra field" records.  These
records are identified by Header ID's 0x6375 (Info-ZIP Unicode Comment 
Extra Field) and 0x7075 (Info-ZIP Unicode Path Extra Field).

D.8 The choice of which storage method to use when writing a ZIP file is left
to the implementation.  Developers SHOULD expect that a ZIP file MAY 
contain either method and SHOULD provide support for reading data in 
either format. Use of general purpose bit 11 reduces storage requirements 
for file name data by not requiring additional "extra field" data for
each file, but can result in older ZIP programs not being able to extract 
files.  Use of the 0x6375 and 0x7075 records will result in a ZIP file 
that SHOULD always be readable by older ZIP programs, but requires more 
storage per file to write file name and/or file comment fields.
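
To make the alternative in D.6 to D.8 concrete, here is a rough Python sketch of the Info-ZIP Unicode Path record (0x7075). The layout is taken from the third-party extra field section of APPNOTE.TXT, which is not quoted in this thread: a 2-byte header ID, a 2-byte data size, a 1-byte version (1), a 4-byte CRC-32 of the standard name field, then the UTF-8 name. Both function names are hypothetical, and nothing here reflects what IO::Compress::Zip does.

```python
import struct
import zlib
from typing import Optional

UNICODE_PATH_ID = 0x7075  # Info-ZIP Unicode Path Extra Field


def build_unicode_path_field(standard_name: bytes, unicode_name: str) -> bytes:
    """Build a 0x7075 record tied to the name stored in the header."""
    payload = struct.pack("<BI", 1, zlib.crc32(standard_name))
    payload += unicode_name.encode("utf-8")
    return struct.pack("<HH", UNICODE_PATH_ID, len(payload)) + payload


def find_unicode_path(extra: bytes, standard_name: bytes) -> Optional[str]:
    """Return the UTF-8 name from a 0x7075 record, or None if absent or stale."""
    offset = 0
    while offset + 4 <= len(extra):
        header_id, size = struct.unpack_from("<HH", extra, offset)
        data = extra[offset + 4 : offset + 4 + size]
        offset += 4 + size
        if header_id != UNICODE_PATH_ID or len(data) < 5:
            continue
        version, name_crc = struct.unpack_from("<BI", data, 0)
        # The CRC detects a header name that was changed by a tool
        # that does not know about this record.
        if version == 1 and name_crc == zlib.crc32(standard_name):
            return data[5:].decode("utf-8")
    return None


field = build_unicode_path_field(b"gr__e.txt", "grüße.txt")
print(find_unicode_path(field, b"gr__e.txt"))
```

With this method the standard name field keeps a name that older tools can read, which is the backward-compatibility trade-off D.8 describes, at the cost of the extra bytes per member.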

@pmqs pmqs added the enhancement New feature or request label Mar 8, 2019
@pmqs pmqs changed the base branch from master to language-encoding-flag March 8, 2019 22:00
@pmqs pmqs merged commit 3ba1ed6 into pmqs:language-encoding-flag Mar 8, 2019
p5p pushed a commit to Perl/perl5 that referenced this pull request Jun 3, 2019
  [DELTA]

  2.086 31 March 2019

      * IO::Compress::Zip & IO::Uncompress::Unzip
        Added support for Language Encoding Flag via the EFS option.
        Starting point was pull request pmqs/IO-Compress#1

      * zipdetails - some support for MVS (Z390) zip files

      * IO::Uncompress::Base
        Issue with trailing data after zip archive
        #128626 for IO-Compress: mainframe zip archive

      * t/cz-14gzopen.t
        cperl error found in http://www.cpantesters.org/cpan/report/448cafc4-3108-11e9-9b6b-d3d33d7b1231
        Perl has this: "Not enough arguments for Compress::Zlib::gzopen"
        cperl uses this: "Not enough arguments for subroutine entry Compress::Zlib::gzopen"

      * Handlers being called when optional modules are not installed
        #128538:  $SIG{__DIE__}

      *  #128194: Beef up diag when system returns error

      * Moved source to github https://github.com/pmqs/IO-Compress

      * Add META_MERGE to Makefile.PL

      * Added meta-json.t & meta-yaml.t
@pmqs pmqs added unzip Issues regarding unzip zip Issues regarding zip labels Sep 8, 2019