[API Proposal]: APIs to support tar archives #65951

Closed
@carlossanlop

Background and motivation

Creating a new issue to get fresh feedback. The earlier discussion is in the original tar proposal.

Tar is an old, stable and robust archiving format that is heavily used, particularly in the Unix world.

The community has expressed interest in having .NET offer APIs that would allow creation, manipulation and extraction of tar files. The following proposal aims to satisfy the request.

API Proposal

namespace System.Formats.Tar
{
    // Easy to use straightforward archiving and extraction APIs.
    public static class TarFile
    {
        public static void CreateFromDirectory(string sourceDirectoryName, string destinationArchiveFileName, bool includeBaseDirectory);
        public static Task CreateFromDirectoryAsync(string sourceDirectoryName, string destinationArchiveFileName, bool includeBaseDirectory, CancellationToken cancellationToken = default);
        public static void ExtractToDirectory(string sourceArchiveFileName, string destinationDirectoryName, bool overwriteFiles);
        public static Task ExtractToDirectoryAsync(string sourceArchiveFileName, string destinationDirectoryName, bool overwriteFiles, CancellationToken cancellationToken = default);
    }

    // Enum representing the entry types that can be detected from V7, Ustar, PAX and GNU.
    public enum TarEntryType : byte
    {
        RegularFileV7 = '\0', // Used exclusively by V7

        // Used by all formats
        RegularFile = '0',
        HardLink = '1',
        SymbolicLink = '2',
        CharacterDevice = '3',
        BlockDevice = '4',
        Directory = '5',
        Fifo = '6',

        // Exclusively used by PAX
        GlobalExtendedAttributes = 'g',
        ExtendedAttributes = 'x',

        // Exclusively used by GNU
        ContiguousFile = '7',
        DirectoryList = 'D',
        LongLink = 'K',
        LongPath = 'L',
        MultiVolume = 'M',
        RenamedOrSymlinked = 'N',
        SparseFile = 'S',
        TapeVolume = 'V',
    }
    
    // The formats these APIs will be able to read
    public enum TarFormat
    {
        Unknown = 0, // For when an archive that is being read is not recognized
        V7 = 1,
        Ustar = 2,
        Pax = 3,
        Gnu = 4,
    }

    // For traversing entries in an existing tar archive
    public sealed class TarReader : System.IDisposable, System.IAsyncDisposable
    {
        public TarReader(System.IO.Stream archiveStream, bool leaveOpen = false);
        public System.Formats.Tar.TarFormat Format { get; }
        public System.Collections.Generic.IReadOnlyDictionary<string, string>? GlobalExtendedAttributes { get; }
        public void Dispose();
        public ValueTask DisposeAsync();
        public System.Formats.Tar.TarEntry? GetNextEntry(bool copyData = false);
        public ValueTask<TarEntry?> GetNextEntryAsync(bool copyData = false, CancellationToken cancellationToken = default);
    }

    // For creating a tar archive
    public sealed class TarWriter : System.IDisposable, System.IAsyncDisposable
    {
        public TarWriter(System.IO.Stream archiveStream, System.Collections.Generic.IEnumerable<System.Collections.Generic.KeyValuePair<string, string>>? globalExtendedAttributes = null, bool leaveOpen = false);
        public TarWriter(System.IO.Stream archiveStream, System.Formats.Tar.TarFormat archiveFormat, bool leaveOpen = false);
        public System.Formats.Tar.TarFormat Format { get; }
        public void Dispose();
        public ValueTask DisposeAsync();
        public void WriteEntry(string fileName, string? entryName);
        public Task WriteEntryAsync(string fileName, string? entryName, CancellationToken cancellationToken = default);
        public void WriteEntry(System.Formats.Tar.TarEntry entry);
        public Task WriteEntryAsync(TarEntry entry, CancellationToken cancellationToken = default);
    }

    // Abstract type to represent the header record's metadata fields
    // These fields are found in all the tar formats
    public abstract class TarEntry
    {
        internal TarEntry();
        public int Checksum { get; }
        public System.IO.Stream? DataStream { get; set; }
        public System.Formats.Tar.TarEntryType EntryType { get; }
        public int Gid { get; set; }
        public long Length { get; }
        public string LinkName { get; set; }
        public System.IO.UnixFileMode Mode { get; set; }
        public System.DateTimeOffset MTime { get; set; }
        public string Name { get; set; }
        public int Uid { get; set; }
        public void ExtractToFile(string destinationFileName, bool overwrite);
        public Task ExtractToFileAsync(string destinationFileName, bool overwrite, CancellationToken cancellationToken = default);
        public override string ToString();
    }

    // Allows instancing a V7 tar entry
    public sealed class TarEntryV7 : TarEntry
    {
        public TarEntryV7(System.Formats.Tar.TarEntryType entryType, string entryName);
    }

    // Abstract type that describes the fields proposed by POSIX and inherited by all the subsequent formats
    // GNU also inherits them, but notice that format is not POSIX, so the name for this abstract type can be different
    public abstract class TarEntryPosix : TarEntry
    {
        internal TarEntryPosix();
        public int DeviceMajor { get; set; }
        public int DeviceMinor { get; set; }
        public string GName { get; set; }
        public string UName { get; set; }
        public override string ToString();
    }

    // Allows instancing a Ustar tar entry, which is the first POSIX format
    public sealed class TarEntryUstar : TarEntryPosix
    {
        public TarEntryUstar(System.Formats.Tar.TarEntryType entryType, string entryName);
    }

    // Allows instancing a PAX tar entry
    // Contains a dictionary that exposes the extended attributes found in the extra metadata entry the Pax format defines
    public sealed class TarEntryPax : TarEntryPosix
    {
        public TarEntryPax(System.Formats.Tar.TarEntryType entryType, string entryName, System.Collections.Generic.IEnumerable<System.Collections.Generic.KeyValuePair<string, string>>? extendedAttributes);
        public System.Collections.Generic.IReadOnlyDictionary<string, string> ExtendedAttributes { get; }
    }

    // Allows instancing a GNU tar entry
    // Contains additional metadata fields found in GNU
    public sealed class TarEntryGnu : TarEntryPosix
    {
        public TarEntryGnu(System.Formats.Tar.TarEntryType entryType, string entryName);
        public System.DateTimeOffset ATime { get; set; }
        public System.DateTimeOffset CTime { get; set; }
    }
}

namespace System.IO
{
    [System.FlagsAttribute]
    public enum UnixFileMode
    {
        None = 0,
        OtherExecute = 1,
        OtherWrite = 2,
        OtherRead = 4,
        GroupExecute = 8,
        GroupWrite = 16,
        GroupRead = 32,
        UserExecute = 64,
        UserWrite = 128,
        UserRead = 256,
        StickyBit = 512,
        GroupSpecial = 1024,
        UserSpecial = 2048,
    }
}
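
The values in the proposed UnixFileMode enum line up with the traditional octal permission bits, so a combination of flags reads as a familiar chmod-style mode. The following is a minimal sketch of that mapping. It assumes nothing beyond the numeric values listed above: the constants are copied into plain ints (the variable names are only illustrative) so the snippet does not depend on the proposed type being available.

```csharp
using System;

// Constants copied from the proposed UnixFileMode enum; plain ints are used
// here so the sketch does not depend on the proposed type being available.
const int OtherRead = 4, GroupRead = 32, UserWrite = 128, UserRead = 256;

// rw-r--r-- expressed as a combination of flags.
int mode = UserRead | UserWrite | GroupRead | OtherRead;

// Base 8 formatting shows the familiar chmod-style digits.
string octal = Convert.ToString(mode, 8);
Console.WriteLine(octal); // 644
```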

The tar archiving format's specification is best described in the FreeBSD tar(5) man page.

The tar spec defines a set of rules to collect filesystem objects into a single stream of bytes. A tar archive consists of a series of 512-byte records. The first record for each filesystem object (the "header") contains fixed-size metadata fields describing that object, and the subsequent records hold the file's actual data. When the data size is not a multiple of 512, the last data record is zero-padded so that the next header record always starts on a multiple of 512. The end of a tar archive is marked by at least two consecutive 512-byte records filled with zeros.
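
The record arithmetic described above can be sketched as follows. This is only an illustration of the 512-byte alignment rule; the helper names PaddingFor and EntrySizeOnDisk are hypothetical and are not part of the proposed API.

```csharp
using System;

const int RecordSize = 512;

// Number of zero bytes appended after a data section of the given size.
int PaddingFor(long dataSize) =>
    (int)((RecordSize - dataSize % RecordSize) % RecordSize);

// Total bytes an entry occupies: one header record plus the padded data.
long EntrySizeOnDisk(long dataSize) =>
    RecordSize + dataSize + PaddingFor(dataSize);

Console.WriteLine(PaddingFor(700));      // 324
Console.WriteLine(EntrySizeOnDisk(700)); // 1536
```

A 700-byte file therefore takes three records: one header and two data records, the second of which carries 324 bytes of padding.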

Unlike the zip archiving+compression format, the tar format does not have a central directory. This means there is no way of knowing how many files a tar archive contains unless the whole archive is traversed.
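
To illustrate why the whole archive has to be traversed, here is a rough sketch that counts entries by reading each header and skipping its data section. It relies on two header layout facts: the name field starts at offset 0 and the size field is 12 bytes of octal ASCII at offset 124. BuildEntry and CountEntries are hypothetical helpers written for this example; the headers they produce omit the checksum and most other fields, so they are not valid input for a real tar tool.

```csharp
using System;
using System.IO;
using System.Text;

const int RecordSize = 512;

// Builds one header record plus zero-padded data. Only the name and size
// fields are filled in, which is enough for the walk below.
byte[] BuildEntry(string name, byte[] data)
{
    int paddedLength = (data.Length + RecordSize - 1) / RecordSize * RecordSize;
    byte[] entry = new byte[RecordSize + paddedLength];
    Encoding.ASCII.GetBytes(name).CopyTo(entry, 0);
    string octalSize = Convert.ToString(data.Length, 8).PadLeft(11, '0');
    Encoding.ASCII.GetBytes(octalSize).CopyTo(entry, 124);
    data.CopyTo(entry, RecordSize);
    return entry;
}

// Counts entries by reading every header and skipping over its data section.
int CountEntries(Stream stream)
{
    byte[] header = new byte[RecordSize];
    int count = 0;
    while (stream.Read(header, 0, RecordSize) == RecordSize)
    {
        if (Array.TrueForAll(header, b => b == 0))
        {
            break; // Zero-filled record: start of the end-of-archive marker
        }
        string sizeField = Encoding.ASCII.GetString(header, 124, 12).Trim('\0', ' ');
        long size = sizeField.Length == 0 ? 0 : Convert.ToInt64(sizeField, 8);
        long paddedSize = (size + RecordSize - 1) / RecordSize * RecordSize;
        stream.Seek(paddedSize, SeekOrigin.Current); // Assumes a seekable stream
        count++;
    }
    return count;
}

MemoryStream memory = new MemoryStream();
memory.Write(BuildEntry("a.txt", new byte[700]));
memory.Write(BuildEntry("b.txt", new byte[10]));
memory.Write(new byte[2 * RecordSize]); // End-of-archive marker
memory.Position = 0;

Console.WriteLine(CountEntries(memory)); // 2
```

On an unseekable stream the Seek call would have to be replaced by reading and discarding the data, which is why the count cannot be known without consuming the archive.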

The tar format evolved over time, and currently there are four well-known formats:

  • 1979 Version 7 AT&T Unix Tar Command Format. Known as "V7". This format supports regular files, directories, symbolic links and hard links. Filenames and linknames are limited to 100 bytes.

  • POSIX IEEE 1003.1-1988 Unix Standard Tar Format. Known as "Ustar". This format was an improvement of V7, so in a way it's backwards compatible with it. The main improvements were:

    • An additional 155-byte field was added to allow longer file names.
    • New filesystem objects supported: Fifo, Character devices and Block devices.
  • POSIX IEEE 1003.1-2001 ("POSIX.1") Pax Interchange Tar Format. Known as "PAX". This is the standard format, the most flexible, and the one with the fewest limitations. It's built on top of ustar, so it's backwards compatible with both ustar and V7. Advantages:

    • All entries are preceded by an extra entry exclusively for metadata (known as "Extended Attributes"): The data section stores key value pairs containing information that otherwise would not fit in the old fixed-size fields. For example: long file names, long link names, file sizes larger than the largest number that can fit in the fixed-size size field, support for UTF8, among other benefits.
    • The format allows the creation of a unique metadata entry (known as "Global Extended Attributes") that is added at the beginning of the archive to contain metadata information that is shared by all entries in the archive, except when overridden by their own metadata entry.
    • The GNU tar documentation states that GNU tar will eventually switch to this format as its default: https://www.gnu.org/software/tar/manual/html_section/Formats.html#:~:text=The%20default%20format%20for%20GNU,will%20switch%20to%20%27%20posix%20%27.
    • Allows insertion of vendor-specific metadata.
  • GNU Tar Format.

    • This format was created as a variant of ustar, but was made incompatible with it due to having a collision in the location of fields in the header.
    • Allows the header to be defined over multiple records.
    • Defines two entry types that exclusively contain long names or long link names.
    • Supports rare files like: multi volume files, sparse files, contiguous files.
  • Other formats: There is little documentation about them (schilly tar, gnu tar pax, aix tar, solaris tar, macosx tar), so these APIs should be able to read such archives and extract them as well as possible, but would not be able to write them.

The default format for the writing APIs is proposed to be PAX.
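
For context on how PAX stores its extended attributes: the data section of the metadata entry is a sequence of records of the form "<length> <key>=<value>\n", where the decimal length counts every byte of the record, including the length digits themselves. The sketch below encodes one such record. EncodePaxRecord is a hypothetical helper for illustration and is not part of the proposed API.

```csharp
using System;
using System.Text;

// Encodes one record as "<length> <key>=<value>\n", where <length> is the
// byte count of the whole record, including the length digits themselves.
string EncodePaxRecord(string key, string value)
{
    // Bytes taken by everything except the length digits: ' ', key, '=', value, '\n'.
    int rest = 1 + Encoding.UTF8.GetByteCount(key) + 1 + Encoding.UTF8.GetByteCount(value) + 1;

    // The digit count depends on the total, so grow it until it is consistent.
    int total = rest + 1;
    while (total != rest + total.ToString().Length)
    {
        total = rest + total.ToString().Length;
    }
    return $"{total} {key}={value}\n";
}

Console.Write(EncodePaxRecord("path", "some/long/file/name.txt"));
// Prints: 32 path=some/long/file/name.txt
```

Because the value has no fixed width, a "path" record can carry names far longer than the 100-byte name field of the older headers.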

Here's a table I created showing the differences between formats:

API Usage

We can split the APIs into different categories according to the type of usage they are intended for:

Stream-less APIs

The TarFile static class makes it easy to archive the contents of a directory or extract the contents of a tar archive without any need to manipulate streams:

// Generates a tar archive where all the entry paths are prefixed by the root directory 'SourceDirectory'
TarFile.CreateFromDirectory(sourceDirectoryName: "D:/SourceDirectory/", destinationArchiveFileName: "D:/destination.tar", includeBaseDirectory: true);

// Extracts the contents of a tar archive into the specified directory, but avoiding overwriting anything found inside
TarFile.ExtractToDirectory(sourceArchiveFileName: "D:/destination.tar", destinationDirectoryName: "D:/DestinationDirectory/", overwriteFiles: false);

Reading an archive entry by entry

The TarReader class allows reading an existing tar archive represented by a stream:

FileStream archiveStream = File.Open("D:/archive.tar", FileMode.Open, FileAccess.Read);

The only requirement to be able to iterate the entries of a stream representing a tar archive is that the stream is readable.
The archive format should be immediately detected upon creation of the reader, even when the first entry has not been read by the user yet.
If the Global Extended Attributes dictionary is not null, it's safe to assume the archive format is PAX, since it's the only format that supports them.
If leaveOpen: true is passed to the constructor, the stream is not disposed when the reader is disposed.
The streams created to wrap the data section of an entry are automatically disposed when the reader is disposed.

using TarReader reader = new TarReader(archiveStream, leaveOpen: true);

Console.WriteLine($"Format: {reader.Format}");

if (reader.GlobalExtendedAttributes != null)
{
    Console.WriteLine("Format is PAX");
}

TarEntry? entry;
while ((entry = reader.GetNextEntry()) != null)
{
    Console.WriteLine($"Entry name: {entry.Name}, entry type: {entry.EntryType}");
    entry.ExtractToFile(destinationFileName: Path.Join("D:/MyExtractionFolder/", entry.Name), overwrite: false);
}

What if the passed stream is unseekable, like when it comes from the network? Then the user has two options:

They can read each entry's data as it arrives, knowing that it will be lost when the next entry is read:

public void ReadTarFromNetwork(NetworkStream archiveStream) // This stream is not seekable
{
    using TarReader reader = new TarReader(archiveStream);

    TarEntry? entry;
    while ((entry = reader.GetNextEntry(copyData: false)) != null) // Not copying the data means it must be read now, before the reader advances to the next entry
    {
        if (entry.EntryType is TarEntryType.RegularFile)
        {
            // This needs to be done now because the stream cannot seek back later
            entry.ExtractToFile(destinationFileName: Path.Join("D:/MyExtractionFolder/", entry.Name), overwrite: false);

            // Reading the data again yields nothing: the data stream position
            // is already at the end and it cannot be rewound
            DoSomethingWithTheData(entry.DataStream);
        }
    }
}

Or they can request to get the data preserved internally for reading later:

public void ReadTarFromNetwork(NetworkStream archiveStream) // This stream is not seekable
{
    List<TarEntry> entries = new List<TarEntry>();

    using TarReader reader = new TarReader(archiveStream);

    TarEntry? current;
    while ((current = reader.GetNextEntry(copyData: true)) != null) // Copy the data internally for later usage
    {
        entries.Add(current);
    } // Stream position is now located at the end of the stream

    foreach (TarEntry entry in entries)
    {
        if (entry.EntryType is TarEntryType.RegularFile)
        {
            // This is possible because the data was saved internally
            entry.ExtractToFile(destinationFileName: Path.Join("D:/MyExtractionFolder/", entry.Name), overwrite: false);

            // We can also inspect the data stream now
            entry.DataStream.Seek(0, SeekOrigin.Begin);
            DoSomethingWithTheData(entry.DataStream);
        }
    }
}

Writing a new archive

The user can generate archives using streams.

FileStream archiveStream = File.Create("D:/archive.tar");

The archive can be created in V7 format:

using TarWriter writerV7 = new TarWriter(archiveStream, TarFormat.V7);

Or Ustar:

using TarWriter writerUstar = new TarWriter(archiveStream, TarFormat.Ustar);

Or Pax:

using TarWriter writerPax1 = new TarWriter(archiveStream, TarFormat.Pax); // No Global Extended Attributes entry

Or Pax with a Global Extended Attributes entry added at the beginning:

Dictionary<string, string> gea = new Dictionary<string, string>();
gea.Add("something", "global");
using TarWriter writerPaxGEA = new TarWriter(archiveStream, globalExtendedAttributes: gea); // Note there's no need to indicate the format, it's assumed

Or GNU:

using TarWriter writerGnu = new TarWriter(archiveStream, TarFormat.Gnu);

The user can add entries in two ways.

By indicating the path of the file to add, which will automatically detect the entry type of the file:

// EntryType: Directory
writer.WriteEntry(fileName: "D:/IAmADirectory/", entryName: "IAmADirectory");

// EntryType: RegularFile (or RegularFileV7 if the writer format is V7)
writer.WriteEntry(fileName: "D:/file.txt", entryName: "file.txt");

// In Unix, if the writer was opened in Ustar, Pax or Gnu, the user can also add fifo, block device and character device files to the archive
writer.WriteEntry(fileName: "/home/carlos/myfifo", entryName: "myfifo"); // EntryType: Fifo
writer.WriteEntry(fileName: "/home/carlos/myblockdevice", entryName: "myblockdevice"); // EntryType: BlockDevice
writer.WriteEntry(fileName: "/home/carlos/mycharacterdevice", entryName: "mychardevice"); // EntryType: CharacterDevice

Or by manually constructing an entry.
Notice that RegularFileV7 (V7 only) and RegularFile (all other formats) are the only two entry types the user can create with a data section. To do that, they need to assign a stream containing the data to the DataStream property, and they are responsible for disposing that stream after the entry is written.

V7:

TarEntryV7 entry = new TarEntryV7(entryType: TarEntryType.RegularFileV7, entryName: "file.txt");
using (FileStream dataStream = File.Open("D:/file.txt", FileMode.Open, FileAccess.Read))
{
  entry.DataStream = dataStream;
  entry.Gid = 5;
  entry.Uid = 7;
  writerV7.WriteEntry(entry);
} // The user created the data stream externally, so they need to dispose it

Ustar:

TarEntryUstar entry = new TarEntryUstar(entryType: TarEntryType.RegularFile, entryName: "file.txt");
using (FileStream dataStream = File.Open("D:/file.txt", FileMode.Open, FileAccess.Read))
{
  entry.DataStream = dataStream;
  entry.Mode = UnixFileMode.UserRead | UnixFileMode.GroupRead | UnixFileMode.OtherRead;
  entry.UName = "carlos";
  entry.GName = "dotnet";
  writerUstar.WriteEntry(entry);
} // As in the V7 example, the user disposes the data stream they created

PAX:

TarEntryPax entry = new TarEntryPax(entryType: TarEntryType.Directory, entryName: "directory", extendedAttributes: null); // No extended attributes, but the metadata header is created anyway
writerPax1.WriteEntry(entry);

Dictionary<string, string> ea = new Dictionary<string, string>();
// PAX stores atime and ctime as decimal seconds since the Unix epoch
string now = DateTimeOffset.UtcNow.ToUnixTimeSeconds().ToString();
ea.Add("atime", now);
ea.Add("ctime", now);
TarEntryPax entryWithEA = new TarEntryPax(entryType: TarEntryType.SymbolicLink, entryName: "symlink", extendedAttributes: ea);
entryWithEA.LinkName = "this/is/a/link/path";
writerPax1.WriteEntry(entryWithEA);

GNU:

TarEntryGnu entry = new TarEntryGnu(entryType: TarEntryType.CharacterDevice, entryName: "chardevice");
entry.DeviceMajor = 444;
entry.DeviceMinor = 555;
entry.ATime = DateTimeOffset.Now;
entry.CTime = DateTimeOffset.Now;
writerGnu.WriteEntry(entry);

Creating an archive using entries from another archive

The absence of a central directory prevents updating existing entries. But this scenario should still be possible for the user if needed. It should be especially useful if the user wants to convert entries from one format to another.

using TarReader reader = new TarReader(originStream); // The detected format of this archive should not matter
using TarWriter writer = new TarWriter(destinationStream, TarFormat.Pax);

TarEntry? entry;
while ((entry = reader.GetNextEntry(copyData: true)) != null)
{
    writer.WriteEntry(entry); // Entries should be saved in PAX format, reading as much as possible from the passed entry in a different format
}

Creating a tar.gz archive

We already offer GZip stream APIs, so it should be relatively easy to compress a tar archive when manipulating streams.

MemoryStream archiveStream = new MemoryStream();
using (TarWriter writer = new TarWriter(archiveStream, TarFormat.Pax, leaveOpen: true)) // Do not close stream on dispose
{
    TarEntryPax entry = new TarEntryPax(entryType: TarEntryType.RegularFile, entryName: "file.txt", extendedAttributes: null);
    // ... configure the entry ...
    writer.WriteEntry(entry);
} // Dispose triggers writing the empty records at the end of the archive

archiveStream.Position = 0; // Rewind so CopyTo starts at the beginning of the archive
using FileStream compressedFileStream = File.Create("file.tar.gz");
using GZipStream compressor = new GZipStream(compressedFileStream, CompressionMode.Compress);
archiveStream.CopyTo(compressor); // After disposing these two, the tar.gz will be committed

This proposal does not include TarFile APIs for compression support because we first need to decide how to standardize the compression configuration pattern for all the compression formats we support. This is being discussed here: #42820

Alternative Designs

We were originally considering offering APIs that looked more similar to ZipArchive, but the absence of a central directory and the mixture of writing and reading tasks would make the APIs very difficult to use, especially due to the existence of an "Update" mode. In Zip, the presence of a central directory helps with the complexities of modifying an existing archive, but in tar, not knowing the entries in advance makes it extremely complicated, especially with huge files or with unseekable streams. That proposal was discussed in the old tar issue.

Risks

The complexity of the formats will require a lot of testing, particularly with rare files, files generated in unsupported/rare formats, or files containing rare entry types.

The extraction APIs we offer should have a way to prevent risky behaviors like tar-bombs.
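
One related risk, beyond tar-bombs, is an entry whose name escapes the extraction directory (for example "../../somewhere/else"). The sketch below shows one way a caller could guard against that before calling ExtractToFile. ResolveSafePath is a hypothetical helper, not part of this proposal; it does not address symbolic links or archives that expand to an excessive size.

```csharp
using System;
using System.IO;

// Returns the full destination path for an entry, or throws when the entry
// name would land outside the extraction directory (for example "../x").
string ResolveSafePath(string destinationDirectory, string entryName)
{
    string root = Path.GetFullPath(destinationDirectory);
    if (!root.EndsWith(Path.DirectorySeparatorChar))
    {
        root += Path.DirectorySeparatorChar;
    }

    // Path.Combine returns entryName unchanged when it is rooted, so absolute
    // entry names are rejected by the same prefix check.
    string candidate = Path.GetFullPath(Path.Combine(root, entryName));
    if (!candidate.StartsWith(root, StringComparison.Ordinal))
    {
        throw new IOException($"Entry '{entryName}' would be extracted outside '{root}'.");
    }
    return candidate;
}

string target = Path.Combine(Path.GetTempPath(), "MyExtractionFolder");
Console.WriteLine(ResolveSafePath(target, "docs/readme.txt"));
```

The ordinal comparison is a deliberate simplification; a real implementation would need to consider case-insensitive file systems.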

There are entry types that are not supported the same way across platforms: block device, character device, fifos. This will have to be considered when extracting a file created in another OS.

There are five rare entry types in the GNU format that would not be supported at the beginning, due to their complexity and the difficulty of generating archives with the Unix tar tool to test them:

  • Contiguous files ('7'): The documentation states that this entry type should be treated as "regular file" except on one obscure "RTOS" (Real-Time Operating System, the spec does not say which) where this entry type is used to indicate the pre-allocation of a contiguous file on disk.

  • Multi-volume files ('M'): Allows splitting a file into different archives. To add support to this entry type, new APIs would be required, particularly on TarFile, to ensure multiple files can be grouped into one single extraction.

  • Files to be renamed or symlinked after extraction ('N'): This entry type is no longer generated by the 'GNU' tar due to security concerns.

  • Sparse regular files ('S'): Fragmented files that are stored split among multiple entries with this entry type.

  • Tape/volume header name ('V'): The spec says this entry type is ignored.

These can be addressed in later iterations; in the meantime, the entry types would be ignored.
