Support for tar archives #1540

Closed
@bogdanteleaga

Update: New proposal here: #65951

Summary

The TAR archive format is commonly used in Unix/Linux-native workloads. .NET applications should be able to produce and consume these archives with built-in APIs that support the most frequently used TAR features and variations.

API Proposal

Reading APIs

We could gradually offer functionality. Initially, we must offer APIs that can read archives.

namespace System.IO.Compression
{
    public class TarArchive : IDisposable
    {
        public TarOptions Options { get; }
        public TarArchive(Stream stream, TarOptions? options);
        public bool TryGetNextEntry(out TarArchiveEntry? entry);
        public void Dispose();
        protected virtual void Dispose(bool disposing);
    }
    public class TarArchiveEntry
    {
        public TarArchiveEntry(TarArchive archive, string fullName);
        public string FullName { get; }
        public string LinkName { get; }
        public int Mode { get; }
        public int Uid { get; }
        public int Gid { get; }
        public string UName { get; }
        public string GName { get; }
        public int DevMajor { get; }
        public int DevMinor { get; }
        public long Length { get; }
        public DateTime LastWriteTime { get; }
        public int CheckSum { get; }
        public TarArchiveEntryType EntryType { get; }
        public Stream Open();
        public override string ToString();
    }
    public enum TarMode
    {
        Read = 0
    }
    public class TarOptions
    {
        public TarMode Mode { get; set; }
        public bool LeaveOpen { get; set; }
        public Encoding EntryNameEncoding { get; set; }
        public TarOptions();
    }
    public enum TarArchiveEntryType
    {
        Normal, // Old normal: '\0', new normal: '0'
        Link, // 1
        SymbolicLink, // 2
        Character, // 3
        Block, // 4
        Directory, // 5
        Fifo, // 6
        Contiguous, // 7
        LongLink, // L
    }
}
Writing APIs

The next step would be to add writing capabilities:

  • Creating archives
  • Adding new entries
  • Deleting existing entries
namespace System.IO.Compression
{
    public class TarArchive : IDisposable
    {
        public void AddEntry(TarArchiveEntry entry) { }
    }
    public class TarArchiveEntry
    {
        public int Mode { set; }
        public int Uid { set; }
        public int Gid { set; }
        public string UName { set; }
        public string GName { set; }
        public int DevMajor { set; }
        public int DevMinor { set; }
        public DateTime LastWriteTime { set; }
        public TarArchiveEntryType EntryType { set; }
        public void Delete() { }
    }
    public enum TarMode
    {
        Create = 1,
        Update = 2
    }
}
Static APIs - Sync

These APIs are heavily inspired by the ZipFile APIs.

namespace System.IO.Compression
{
    public static class TarFile
    {
        public static void CreateFromDirectory(string sourceDirectoryName, string destinationArchiveFileName);
        public static void ExtractToDirectory(string sourceArchiveFileName, string destinationDirectoryName);
        public static void ExtractToDirectory(string sourceArchiveFileName, string destinationDirectoryName, bool overwriteFiles);
        public static void ExtractToDirectory(string sourceArchiveFileName, string destinationDirectoryName, Encoding? entryNameEncoding);
        public static void ExtractToDirectory(string sourceArchiveFileName, string destinationDirectoryName, Encoding? entryNameEncoding, bool overwriteFiles);
    }
}

The following static methods would be powerful because they could decompress the file and then read the tar archive inside it.

We are unsure whether it makes sense, from an API design perspective, to mix decompression and archive reading in the same API.

namespace System.IO.Compression
{
    public enum CompressionMethod
    {
        None,
        GZip,
        Deflate,
        Brotli,
        ZLib,
        // More in the future
    }
    public class TarFileOptions
    {
        public TarFileOptions();
        public CompressionMethod Method { get; set; } // Default is None
        public TarMode Mode { get; set; } // Default is Read
        public Encoding EntryNameEncoding { get; set; } // Default is ASCII
    }
    public static class TarFile
    {
        public static TarArchive Open(string archiveFileName, TarFileOptions? options);
        public static TarArchive OpenRead(string archiveFileName, CompressionMethod compressionMethod);
    }
}
Static APIs - Async
namespace System.IO.Compression
{
    public static class TarFile
    {
        public static ValueTask CreateFromDirectoryAsync(string sourceDirectoryName, string destinationArchiveFileName, CancellationToken cancellationToken);
        public static ValueTask ExtractToDirectoryAsync(string sourceArchiveFileName, string destinationDirectoryName, CancellationToken cancellationToken);
        public static ValueTask ExtractToDirectoryAsync(string sourceArchiveFileName, string destinationDirectoryName, bool overwriteFiles, CancellationToken cancellationToken);
        public static ValueTask ExtractToDirectoryAsync(string sourceArchiveFileName, string destinationDirectoryName, Encoding? entryNameEncoding, CancellationToken cancellationToken);
        public static ValueTask ExtractToDirectoryAsync(string sourceArchiveFileName, string destinationDirectoryName, Encoding? entryNameEncoding, bool overwriteFiles, CancellationToken cancellationToken);
        public static ValueTask<TarArchive> OpenAsync(string archiveFileName, TarFileOptions options, CancellationToken cancellationToken);
        public static ValueTask<TarArchive> OpenReadAsync(string archiveFileName, CompressionMethod compressionMethod, CancellationToken cancellationToken);
    }
}
Static extension APIs - Sync

These extension APIs are similar to the ZipArchiveEntry ones.

We could directly add these methods to the TarArchiveEntry class instead of making them extensions, since we are currently designing it all at the same time.

The overwriteFiles boolean argument should be clearly documented with warnings about potential tarbomb behavior.

namespace System.IO.Compression
{
    public static class TarFileExtensions
    {
        public static TarArchiveEntry CreateEntryFromFile(this TarArchive destination, string sourceFileName, string entryName);
        public static void ExtractToDirectory(this TarArchive source, string destinationDirectoryName);
        public static void ExtractToDirectory(this TarArchive source, string destinationDirectoryName, bool overwriteFiles);
        public static void ExtractToFile(this TarArchiveEntry source, string destinationFileName);
        public static void ExtractToFile(this TarArchiveEntry source, string destinationFileName, bool overwrite);
    }
}
Static extension APIs - Async
namespace System.IO.Compression
{
    public static class TarFileExtensions
    {
        public static ValueTask<TarArchiveEntry> CreateEntryFromFileAsync(this TarArchive destination, string sourceFileName, string entryName, CancellationToken cancellationToken);
        public static ValueTask ExtractToDirectoryAsync(this TarArchive source, string destinationDirectoryName, CancellationToken cancellationToken);
        public static ValueTask ExtractToDirectoryAsync(this TarArchive source, string destinationDirectoryName, bool overwriteFiles, CancellationToken cancellationToken);
        public static ValueTask ExtractToFileAsync(this TarArchiveEntry source, string destinationFileName, CancellationToken cancellationToken);
        public static ValueTask ExtractToFileAsync(this TarArchiveEntry source, string destinationFileName, bool overwrite, CancellationToken cancellationToken);
    }
}

Usage examples

Here is a basic example of opening a tar.gz file for reading. First we decompress the gzip, then we read the archive.

using FileStream fs = File.Open("file.tar.gz", FileMode.Open);

using var decompressor = new GZipStream(fs, CompressionMode.Decompress);
using var decompressedStream = new MemoryStream();
decompressor.CopyTo(decompressedStream);
// CopyTo leaves the position at the end of the MemoryStream, so rewind before reading.
decompressedStream.Position = 0;

var options = new TarOptions { Mode = TarMode.Read };
using var archive = new TarArchive(decompressedStream, options);

while (archive.TryGetNextEntry(out TarArchiveEntry? entry))
{
    Console.WriteLine($"{entry.FullName}");
}

TODO: More examples to come.

Tar format description

Optional read. Feel free to skip.

A tar archive is a linear sequence of blocks. Each block consists of a header and the file contents described by that header.

The blocks are aligned to a fixed block size, usually 512 bytes. In other words, the total size of each block (header plus file contents) must be a multiple of that size, which is achieved by appending null bytes to the file contents when necessary.

The header describes the metadata of the file contents (filename, mode, uid, gid, size, last modification time, etc.). The size of a header is fixed, and each of its fields has a predefined maximum size.
The file contents can be 0 or more raw bytes, representing the contents of the file.

If the block represents a directory, the file contents are usually empty (size 0). The size is non-zero only when the contents hold a list of the filesystem entries inside that directory, which some tar format versions allow.

A tar archive is navigated by jumping from header to header. The beginning of the next header can be found by adding up the fixed size of a header plus the size of the file contents, minding the block size padding.
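The offset arithmetic described above can be sketched as follows. This is a minimal illustration, not part of the proposed API: it assumes the common 512-byte size for both the header and the alignment, and the names `BlockSize`, `RoundUpToBlock` and `NextHeaderOffset` are hypothetical.

```csharp
using System;

// Assumption: the header and the padding alignment both use the common 512-byte size.
const int BlockSize = 512;

// Rounds the content size up to the next multiple of BlockSize (the null padding).
long RoundUpToBlock(long contentSize) =>
    (contentSize + BlockSize - 1) / BlockSize * BlockSize;

// Offset of the next header: current header, plus its fixed size, plus the padded contents.
long NextHeaderOffset(long headerOffset, long contentSize) =>
    headerOffset + BlockSize + RoundUpToBlock(contentSize);

Console.WriteLine(NextHeaderOffset(0, 0));      // 512
Console.WriteLine(NextHeaderOffset(0, 1000));   // 1536
Console.WriteLine(NextHeaderOffset(1536, 512)); // 2560
```

An entry with 1000 bytes of contents occupies 512 + 1024 bytes, so the following header starts at offset 1536.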

Tar archives do not contain a central directory like zip archives. A zip central directory is an uncompressed region of the zip archive that lists its entries and records how many there are. To learn the total number of files contained in a tar archive, the whole archive must be traversed, counting the block headers found.
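To illustrate why counting requires a full traversal, here is a rough sketch that builds a tiny archive in memory and counts its entries by walking the headers. It uses none of the proposed APIs. The helper names (`BuildEntry`, `CountEntries`, `ReadBlock`) are hypothetical, the sketch assumes a seekable stream and 512-byte blocks, and it fills in only the name field (offset 0) and the octal size field (offset 124), leaving every other header field zeroed.

```csharp
using System;
using System.IO;
using System.Text;

const int BlockSize = 512;

// Builds one entry: a 512-byte header (name at offset 0, octal size at offset 124)
// followed by the contents padded with nulls. All other header fields stay zeroed.
byte[] BuildEntry(string name, byte[] contents)
{
    int padded = (contents.Length + BlockSize - 1) / BlockSize * BlockSize;
    byte[] entry = new byte[BlockSize + padded];
    Encoding.ASCII.GetBytes(name).CopyTo(entry, 0);
    string size = Convert.ToString(contents.Length, 8).PadLeft(11, '0');
    Encoding.ASCII.GetBytes(size).CopyTo(entry, 124);
    contents.CopyTo(entry, BlockSize);
    return entry;
}

// Reads exactly one block; returns false if the stream ends first.
bool ReadBlock(Stream stream, byte[] buffer)
{
    int total = 0;
    while (total < buffer.Length)
    {
        int read = stream.Read(buffer, total, buffer.Length - total);
        if (read == 0) return false;
        total += read;
    }
    return true;
}

// Counts entries by jumping from header to header until a zero-filled header
// (the start of the end-of-archive marker) or the end of the stream is found.
int CountEntries(Stream stream)
{
    byte[] header = new byte[BlockSize];
    int count = 0;
    while (ReadBlock(stream, header) && Array.Exists(header, b => b != 0))
    {
        // This sketch only handles size fields written as 11 octal digits.
        long size = Convert.ToInt64(Encoding.ASCII.GetString(header, 124, 11), 8);
        long padded = (size + BlockSize - 1) / BlockSize * BlockSize;
        stream.Seek(padded, SeekOrigin.Current); // skip the contents and padding
        count++;
    }
    return count;
}

using var tar = new MemoryStream();
tar.Write(BuildEntry("a.txt", Encoding.ASCII.GetBytes("hello")));
tar.Write(BuildEntry("b.bin", new byte[600]));
tar.Write(new byte[2 * BlockSize]); // end-of-archive marker: two null blocks
tar.Position = 0;

Console.WriteLine(CountEntries(tar)); // 2
```

On a non-seekable stream the `Seek` call would have to be replaced by reading and discarding the padded contents, which is why the count cannot be known up front for a network stream.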

The tar spec was not designed to include compression capabilities, but tars are commonly combined with a compression method. The most popular method is to first generate the tar file, then compress it, usually with GZip (.tar.gz) or with LZMA (.tar.xz). While this method simplifies and separates the archival and compression stages, it also means that the only way the user can read the contents of the tar file is by decompressing it first.
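As a small illustration of reading through the decompressor, the sketch below gzips a single stand-in header block in memory and then reads the name field back through a `GZipStream`, without first buffering the whole decompressed archive. It uses only the existing `System.IO.Compression.GZipStream` type. The helper `ReadFirstEntryName` is hypothetical, and the stand-in header fills in only the name field (offset 0, up to 100 bytes).

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Build a tiny stand-in for a .tar.gz in memory: one 512-byte header block
// whose name field is "hello.txt", compressed with gzip.
byte[] headerBlock = new byte[512];
Encoding.ASCII.GetBytes("hello.txt").CopyTo(headerBlock, 0);

using var compressed = new MemoryStream();
using (var gzip = new GZipStream(compressed, CompressionLevel.Optimal, leaveOpen: true))
{
    gzip.Write(headerBlock, 0, headerBlock.Length);
}
compressed.Position = 0;

// Reads the name field of the first header directly from the decompressor.
string ReadFirstEntryName(Stream gzipped)
{
    using var decompressor = new GZipStream(gzipped, CompressionMode.Decompress, leaveOpen: true);
    byte[] block = new byte[512];
    int total = 0;
    while (total < block.Length)
    {
        // Read may return fewer bytes than requested, so loop until the block is full.
        int read = decompressor.Read(block, total, block.Length - total);
        if (read == 0) throw new EndOfStreamException("Truncated header.");
        total += read;
    }
    int nameLength = Array.IndexOf(block, (byte)0, 0, 100);
    if (nameLength < 0) nameLength = 100; // name fills the whole field
    return Encoding.ASCII.GetString(block, 0, nameLength);
}

Console.WriteLine(ReadFirstEntryName(compressed)); // hello.txt
```

The decompressing stream is forward-only, so entries can only be visited in order; jumping back to an earlier header would require decompressing again from the start.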

Another, less common method is to compress the contents of each file individually, leaving the headers readable. It is uncommon because the header offers no field to indicate which compression method was used for each file contents block, so the user needs to preserve that information somewhere else.

There are multiple versions of the tar format: v7, ustar, pax, gnu, oldgnu, solaris, aix, macosx. We should focus on v7, ustar, pax and gnu.

Open questions

Tar versions

  • Should we implement the different tar versions separately? As Ian suggested above, we could gradually add support to the more complex ones:
    • Add TarArchive/TarArchiveEntry with V7&UStar support and format header detection. Throw error for unsupported formats (e.g. GNU, PAX)
    • Add tar archival/de-archival support for GNU tar
    • Add tar archival/de-archival support for PAX/Posix tar

Assembly

  • In which assembly should the stream-based APIs live?
    • System.IO.Compression
    • System.IO.Compression.Tar
  • In which assembly should the static APIs live?
    • System.IO.Compression
    • System.IO.Compression.TarFile (similar to ZipFile)

TarArchiveEntry

  • Do we need to expose Mode, Uid, Gid, UName, GName?
  • How commonly used are DevMajor and DevMinor? Do users need these properties to be exposed at all?
  • Do we need the EntryType property? I'd say yes, especially because some entries are LongLink and the actual entry is expected to be located in the next position.
    • If we do, should the values of the enum be the exact values that can be found in the tar header, or should we assign default values, then map them internally to the actual value?
    • If the user adds an entry, what EntryType values should be allowed? Can the user programmatically add a Block, Fifo, Contiguous or Character entry?
  • We use FullName to be consistent with other full path properties. But should we instead use FullPath?
  • Mode, Uid and Gid are exposed in base 10, but they will be converted to base 8 internally.
  • How should we distinguish between the end of file and a corrupt entry when calling TryGetNextEntry? EOF is marked in a tar with two 512-byte blocks filled with nulls.
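The base 10 to base 8 conversion mentioned in the list above could look roughly like this. It is only a sketch: `ToOctalField` and `FromOctalField` are hypothetical helper names, values are assumed to be non-negative, and the layout assumed here (zero-padded ASCII octal digits followed by a null terminator) is the common one, though some writers terminate fields with a space instead.

```csharp
using System;
using System.Text;

// Formats a non-negative value as a zero-padded, null-terminated octal ASCII field.
byte[] ToOctalField(long value, int fieldLength)
{
    string digits = Convert.ToString(value, 8).PadLeft(fieldLength - 1, '0');
    if (digits.Length > fieldLength - 1)
        throw new ArgumentOutOfRangeException(nameof(value), "Value does not fit in the field.");
    byte[] field = new byte[fieldLength]; // the last byte stays 0 as the terminator
    Encoding.ASCII.GetBytes(digits).CopyTo(field, 0);
    return field;
}

// Parses an octal field, ignoring null or space terminators.
long FromOctalField(byte[] field)
{
    string digits = Encoding.ASCII.GetString(field).Trim('\0', ' ');
    return digits.Length == 0 ? 0 : Convert.ToInt64(digits, 8);
}

byte[] modeField = ToOctalField(493, 8); // 493 in base 10 is 755 in base 8
Console.WriteLine(Encoding.ASCII.GetString(modeField, 0, 7)); // 0000755
Console.WriteLine(FromOctalField(modeField));                 // 493
```

With this mapping, a user who sets Mode to 493 would produce the familiar `rwxr-xr-x` permission bits, which may argue for documenting the base clearly or accepting an octal-friendly type.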

TarOptions

  • Should the properties be settable, or should they be init? Consider that the TarArchive would cache it, but it may not make sense for the user to be able to change the value of the cached options.

Static APIs

  • The static extension APIs are inspired by the Zip ones. Since these are all being added together, maybe we don't need them as extensions, and they can instead be part of the class they extend. Thoughts?
  • There are many shared parameters across the different overloads. Should we have separate option classes (similar to TarOptions), one for extraction and another for creation? We could pass an instance of such a class as an argument and have only one method instead of several overloads. This would be helpful in case we grow the options in the future.

Compression

  • Notice that none of the APIs offer compression. Should we add static methods that allow the user to create a compressed tar file, and let them choose the desired algorithm?
  • .NET currently only offers GZip, Deflate, Brotli and ZLib. We are considering adding support for ZStandard and LZMA. We could consider adding static APIs that allow composability with external compression stream-based APIs like those offered by SharpCompress or SharpZipLib.

Security

  • Notice there is no Entries property, like in zip. This is because we don't have a central directory. If we receive a network stream, we wouldn't be able to know the Count.
  • Internally, we would only cache the list of visited entries if the TarArchive is opened in Create or Update mode. This is because the assumption is that we will modify the tar file on dispose, either because we want to add new entries, or because we want to delete existing entries.
    • One thing we could do is add an Entries property that can only be used if the stream is a seekable FileStream, in which case we can use the new RandomAccess APIs to get the files.
  • If the user opens an existing tar file, and an existing entry has a Mode, UName and/or GName that does not match that of the current user, should we allow the user to read, update, delete or extract that entry, or should we forbid access to it?
  • Tarbombs happen when a tar file is extracted into an existing directory and overwrites existing files.
    • Problems also arise when an entry has an absolute path, since extraction could potentially overwrite a system file.
    • The fact that tars can contain symbolic links can also be problematic if it is expected to extract files into a symlinked folder. By default, we should not follow symlinks.
    • Tar files allow multiple entries with an identical path and filename. Tarbomb behavior could occur if the first extracted entry is a symlink and the next one is a regular file, in which case the second file could end up being written to the target location of the symlink. We should avoid such behaviors. One possible solution is to cache all the extracted names, check whether each new name already exists, and extract subsequent duplicates with a suffix in their name. Another option is to throw.
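One way to guard against the absolute-path problem listed above is to resolve every entry name against the destination directory and reject anything that lands outside it. The sketch below shows the idea with existing `System.IO.Path` APIs only. `GetSafeDestinationPath` is a hypothetical helper, not part of the proposal, and it does not address the symlink or duplicate-name cases, which would need separate checks during extraction.

```csharp
using System;
using System.IO;

// Resolves an entry name against the destination directory and rejects any
// result that escapes it (absolute entry names or ".." segments).
string GetSafeDestinationPath(string destinationDirectory, string entryName)
{
    string root = Path.GetFullPath(destinationDirectory);
    if (!root.EndsWith(Path.DirectorySeparatorChar))
        root += Path.DirectorySeparatorChar;

    // Path.Combine returns entryName unchanged when it is rooted, so absolute
    // entry names resolve outside the root and are rejected below.
    string candidate = Path.GetFullPath(Path.Combine(root, entryName));
    if (!candidate.StartsWith(root, StringComparison.Ordinal))
        throw new IOException($"Entry '{entryName}' would be extracted outside '{root}'.");
    return candidate;
}

string destination = Path.Combine(Path.GetTempPath(), "tar-demo");
Console.WriteLine(GetSafeDestinationPath(destination, "docs/readme.txt"));

foreach (string name in new[] { "../escape.txt", "/etc/passwd" })
{
    try { GetSafeDestinationPath(destination, name); }
    catch (IOException e) { Console.WriteLine(e.Message); }
}
```

Throwing is only one possible policy here; stripping the leading separator and extracting relative to the destination would be another, at the cost of silently changing where the entry ends up.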

Testing

  • We have an initial set of files in dotnet/runtime-assets created with the Ubuntu tar command, which generates gnutar files.
  • @adamhathcock would it be ok with you if we reuse the test tar files you have in your repo? You have a good selection of test cases.
