Description
Background and Motivation
We use System.IO.Compression.ZipArchive to manage the creation of large .zip files in a streaming fashion. When creating new archives, this works quite well (with ZipArchiveMode.Create, it writes through to the underlying stream so long as you only write to one entry at a time).
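For illustration, the Create-mode pattern described above might look roughly like the following. This is a minimal sketch, not our production code: the `ZipCreateStreaming` class, the `WriteEntries` method, and the tuple-based input are hypothetical names chosen for the example. Only one entry stream is open at any time.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

static class ZipCreateStreaming
{
    // Hypothetical helper: writes each (name, content) pair as its own entry,
    // closing one entry before creating the next.
    public static void WriteEntries(Stream output, IEnumerable<(string Name, string Content)> entries)
    {
        // leaveOpen: true keeps the caller's stream usable after the archive is disposed.
        using var zip = new ZipArchive(output, ZipArchiveMode.Create, leaveOpen: true);
        foreach (var (name, content) in entries)
        {
            // The writer (and the entry stream it wraps) is disposed at the end of
            // each iteration, so at most one entry is open for writing at a time.
            using var writer = new StreamWriter(zip.CreateEntry(name).Open());
            writer.Write(content);
        }
    }
}
```

Create mode only requires a writable stream, so `output` does not need to support seeking.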
However, when we want to append to an existing archive, we have to use ZipArchiveMode.Update. According to the doc comments, with this mode the contents of the entire archive must be held in memory! This caused our system to crash due to array length restrictions when working with a particularly large file.
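For reference, appending with today's API might be sketched as follows. The `ZipAppendToday` class and `AppendEntry` method are hypothetical names for this example. The code is functionally correct, but because it uses Update mode, nothing is written to the file until the archive is disposed, and the archive contents are buffered in memory in the meantime.

```csharp
using System.IO;
using System.IO.Compression;

static class ZipAppendToday
{
    // Hypothetical helper: appends one text entry to an existing archive
    // using ZipArchiveMode.Update, the only mode that currently allows this.
    public static void AppendEntry(string zipPath, string entryName, string content)
    {
        // Update mode requires a stream that supports reading, writing, and seeking.
        using var fileStream = new FileStream(zipPath, FileMode.Open, FileAccess.ReadWrite);
        using var zip = new ZipArchive(fileStream, ZipArchiveMode.Update);

        ZipArchiveEntry entry = zip.CreateEntry(entryName);
        using var writer = new StreamWriter(entry.Open());
        writer.Write(content);
        // Disposal runs in reverse order: writer, then zip (which writes the
        // archive back to the file), then fileStream.
    }
}
```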
The zip format is designed to support efficient appending of files, so I believe it should be possible for .NET's implementation to support this use-case.
Proposed API
This could be addressed using a new ZipArchiveMode enum value, perhaps named ZipArchiveMode.Append to match FileMode.Append. This would be similar to Create but would allow for an existing file to be used.
Usage Examples
using var fileStream = File.Open("existing.zip", FileMode.Open);
// ZipArchiveMode.Append is the proposed value; it does not exist today.
using var zip = new ZipArchive(fileStream, ZipArchiveMode.Append);
var newEntry = zip.CreateEntry("new");
using var writer = new StreamWriter(newEntry.Open());
// write lots of content!
Alternative Designs
Another approach would be to change the behavior of Update so that it only brings data into memory as needed (e.g. if you change the contents of an existing entry or keep multiple entries open for writing at the same time).
This second approach would have the benefit of improving the performance of all existing programs which use Update mode to append to existing zips, which seems likely to be a common use-case for Update.
Similarly, it seems that this could also enable Update to share the streaming benefits of Read mode in many cases.
The downside would be that code could silently go from performant to non-performant if the usage limitations were violated, although that is already the case with Read when the underlying stream is not seekable and Create when writing to multiple entries at once.
Another potential downside is that Update mode today presumably defers all writes until the end of the operation, potentially allowing other readers to use the zip until then. This change would alter that behavior.
Risks
With the design approach of optimizing Update for specific scenarios, the implementation might have to switch Update from a write-through approach to an in-memory approach partway through an operation. This might add overhead for callers who actually rely on the ability to modify existing entries.