Description
Update: New proposal here: #65951
Summary
The TAR archive format is commonly used in Unix/Linux-native workloads. .NET applications should be able to produce and consume these archives with built-in APIs that support the most frequently used TAR features and variations.
API Proposal
Reading APIs
We could offer functionality gradually. The first step must be APIs that can read archives.
namespace System.IO.Compression
{
public class TarArchive : IDisposable
{
public TarOptions Options { get; }
public TarArchive(Stream stream, TarOptions? options);
public bool TryGetNextEntry(out TarArchiveEntry? entry);
public void Dispose();
protected virtual void Dispose(bool disposing);
}
public class TarArchiveEntry
{
public TarArchiveEntry(TarArchive archive, string fullName);
public string FullName { get; }
public string LinkName { get; }
public int Mode { get; }
public int Uid { get; }
public int Gid { get; }
public string UName { get; }
public string GName { get; }
public int DevMajor { get; }
public int DevMinor { get; }
public long Length { get; }
public DateTime LastWriteTime { get; }
public int CheckSum { get; }
public TarArchiveEntryType EntryType { get; }
public Stream Open();
public override string ToString();
}
public enum TarMode
{
Read = 0
}
public class TarOptions
{
public TarMode Mode { get; set; }
public bool LeaveOpen { get; set; }
public Encoding EntryNameEncoding { get; set; }
public TarOptions();
}
public enum TarArchiveEntryType
{
Normal, // Old normal: '\0', new normal: '0'
Link, // 1
SymbolicLink, // 2
Character, // 3
Block, // 4
Directory, // 5
Fifo, // 6
Contiguous, // 7
LongLink, // L
}
}
Writing APIs
The next step would be to add writing capabilities:
- Creating archives
- Adding new entries
- Deleting existing entries
namespace System.IO.Compression
{
public class TarArchive : IDisposable
{
public void AddEntry(TarArchiveEntry entry) { }
}
public class TarArchiveEntry
{
public int Mode { set; }
public int Uid { set; }
public int Gid { set; }
public string UName { set; }
public string GName { set; }
public int DevMajor { set; }
public int DevMinor { set; }
public DateTime LastWriteTime { set; }
public TarArchiveEntryType EntryType { set; }
public void Delete() { }
}
public enum TarMode
{
Create = 1,
Update = 2
}
}
Static APIs - Sync
These APIs are heavily inspired by the ZipFile APIs.
namespace System.IO.Compression
{
public static class TarFile
{
public static void CreateFromDirectory(string sourceDirectoryName, string destinationArchiveFileName);
public static void ExtractToDirectory(string sourceArchiveFileName, string destinationDirectoryName);
public static void ExtractToDirectory(string sourceArchiveFileName, string destinationDirectoryName, bool overwriteFiles);
public static void ExtractToDirectory(string sourceArchiveFileName, string destinationDirectoryName, Encoding? entryNameEncoding);
public static void ExtractToDirectory(string sourceArchiveFileName, string destinationDirectoryName, Encoding? entryNameEncoding, bool overwriteFiles);
}
}
The following static methods would be convenient because they would decompress the file and then read the tar archive inside it.
We are unsure whether mixing compression and archival in a single API makes sense from an API design perspective.
namespace System.IO.Compression
{
public enum CompressionMethod
{
None,
GZip,
Deflate,
Brotli,
ZLib,
// More in the future
}
public class TarFileOptions
{
public TarFileOptions();
public CompressionMethod Method { get; set; } // Default is None
public TarMode Mode { get; set; } // Default is Read
public Encoding EntryNameEncoding { get; set; } // Default is ASCII
}
public static class TarFile
{
public static TarArchive Open(string archiveFileName, TarFileOptions? options);
public static TarArchive OpenRead(string archiveFileName, CompressionMethod compressionMethod);
}
}
Static APIs - Async
namespace System.IO.Compression
{
public static class TarFile
{
public static ValueTask CreateFromDirectoryAsync(string sourceDirectoryName, string destinationArchiveFileName, CancellationToken cancellationToken);
public static ValueTask ExtractToDirectoryAsync(string sourceArchiveFileName, string destinationDirectoryName, CancellationToken cancellationToken);
public static ValueTask ExtractToDirectoryAsync(string sourceArchiveFileName, string destinationDirectoryName, bool overwriteFiles, CancellationToken cancellationToken);
public static ValueTask ExtractToDirectoryAsync(string sourceArchiveFileName, string destinationDirectoryName, Encoding? entryNameEncoding, CancellationToken cancellationToken);
public static ValueTask ExtractToDirectoryAsync(string sourceArchiveFileName, string destinationDirectoryName, Encoding? entryNameEncoding, bool overwriteFiles, CancellationToken cancellationToken);
public static ValueTask<TarArchive> OpenAsync(string archiveFileName, TarFileOptions options, CancellationToken cancellationToken);
public static ValueTask<TarArchive> OpenReadAsync(string archiveFileName, CompressionMethod compressionMethod, CancellationToken cancellationToken);
}
}
Static extension APIs - Sync
These extension APIs are similar to the ZipArchiveEntry ones.
We could directly add these methods to the TarArchiveEntry class instead of making them extensions, since we are currently designing it all at the same time.
The overwriteFiles boolean argument should be clearly documented with warnings about potential tarbomb behavior.
namespace System.IO.Compression
{
public static class TarFileExtensions
{
public static TarArchiveEntry CreateEntryFromFile(this TarArchive destination, string sourceFileName, string entryName);
public static void ExtractToDirectory(this TarArchive source, string destinationDirectoryName);
public static void ExtractToDirectory(this TarArchive source, string destinationDirectoryName, bool overwriteFiles);
public static void ExtractToFile(this TarArchiveEntry source, string destinationFileName);
public static void ExtractToFile(this TarArchiveEntry source, string destinationFileName, bool overwrite);
}
}
Static extension APIs - Async
namespace System.IO.Compression
{
public static class TarFileExtensions
{
public static ValueTask<TarArchiveEntry> CreateEntryFromFileAsync(this TarArchive destination, string sourceFileName, string entryName, CancellationToken cancellationToken);
public static ValueTask ExtractToDirectoryAsync(this TarArchive source, string destinationDirectoryName, CancellationToken cancellationToken);
public static ValueTask ExtractToDirectoryAsync(this TarArchive source, string destinationDirectoryName, bool overwriteFiles, CancellationToken cancellationToken);
public static ValueTask ExtractToFileAsync(this TarArchiveEntry source, string destinationFileName, CancellationToken cancellationToken);
public static ValueTask ExtractToFileAsync(this TarArchiveEntry source, string destinationFileName, bool overwrite, CancellationToken cancellationToken);
}
}
Usage examples
Here is a basic example of opening a tar.gz file for reading. First we decompress the gzip, then we read the archive.
using System.IO;
using System.IO.Compression;

using FileStream fs = File.Open("file.tar.gz", FileMode.Open);
using var decompressor = new GZipStream(fs, CompressionMode.Decompress);
using var decompressedStream = new MemoryStream();
decompressor.CopyTo(decompressedStream);
// Rewind so the archive is read from the first header, not from the end of the copied data.
decompressedStream.Position = 0;
var options = new TarOptions { Mode = TarMode.Read };
using var archive = new TarArchive(decompressedStream, options);
while (archive.TryGetNextEntry(out TarArchiveEntry? entry))
{
    Console.WriteLine($"{entry.FullName}");
}
TODO: More examples to come.
Tar format description
Optional read. Feel free to skip.
A tar archive is a linear sequence of blocks. Each block consists of a header and the file contents described by that header.
Each of these header-plus-contents units is aligned to a fixed block size, usually 512 bytes. In other words, the file contents need to occupy a multiple of the block size, which is achieved by adding trailing null bytes at the end of the file contents when necessary.
The header describes the metadata of the file contents (filename, mode, uid, gid, size, last modification time, etc.). The size of a header is fixed (one 512-byte block). Its fields all have a predefined max size.
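As a rough illustration of the fixed header layout, the sketch below reads a few fields from a hand-built 512-byte header. The offsets used (name at 0, mode at 100, uid at 108, size at 124) are the ones shared by the v7 and ustar layouts; numeric fields are stored as ASCII octal digits. The helpers `ParseOctal`, `ParseString` and `Put` are hypothetical names for this example only and are not part of the proposed API.

```csharp
using System;
using System.Text;

// Numeric header fields are ASCII octal digits, padded with spaces and/or NUL.
long ParseOctal(ReadOnlySpan<byte> field)
{
    long value = 0;
    bool seenDigit = false;
    foreach (byte b in field)
    {
        if (b >= (byte)'0' && b <= (byte)'7')
        {
            value = (value << 3) + (b - (byte)'0');
            seenDigit = true;
        }
        else if (seenDigit || b == 0)
        {
            break; // NUL or trailing padding ends the field
        }
        // Otherwise this is leading padding: keep scanning.
    }
    return value;
}

// Text fields are NUL-terminated unless they fill the whole field.
string ParseString(ReadOnlySpan<byte> field)
{
    int end = field.IndexOf((byte)0);
    if (end < 0) end = field.Length;
    return Encoding.ASCII.GetString(field.Slice(0, end));
}

// Build a sample header by hand (checksum and other fields omitted for brevity).
byte[] header = new byte[512];
void Put(string text, int offset) =>
    Encoding.ASCII.GetBytes(text, 0, text.Length, header, offset);

Put("hello.txt", 0);       // name:  offset 0,   100 bytes
Put("0000644", 100);       // mode:  offset 100, 8 bytes
Put("0001750", 108);       // uid:   offset 108, 8 bytes
Put("00000000015", 124);   // size:  offset 124, 12 bytes

string name = ParseString(header.AsSpan(0, 100));
long mode = ParseOctal(header.AsSpan(100, 8));
long uid = ParseOctal(header.AsSpan(108, 8));
long size = ParseOctal(header.AsSpan(124, 12));

Console.WriteLine($"{name}: mode={Convert.ToString(mode, 8)}, uid={uid}, size={size}");
```

The values come back in base 10 (mode 420, uid 1000, size 13), which matches the idea of exposing decimal integers publicly while storing octal text in the archive.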
The file contents can be 0 or more raw bytes, representing the contents of the file.
If the block represents a directory, the size of the file contents is usually 0. It is non-zero only when the contents hold a list of the filesystem entries inside that directory, which some tar format versions allow.
A tar archive is navigated by jumping from header to header. The beginning of the next header can be found by adding up the fixed size of a header plus the size of the file contents, minding the block size padding.
Tar archives do not contain a central directory like zip archives. A zip central directory is an uncompressed region of the zip archive that indicates the total number of files in the archive. If the user wants to know the total number of files contained in a tar archive, the whole archive needs to be traversed to count the total number of block headers found.
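The header-to-header navigation and the need to traverse the whole archive to count entries can be sketched as follows. This is a simplified illustration over an in-memory archive: the synthetic headers only fill in the name and size fields, and `MakeEntry`, `CountEntries`, `RoundUpToBlock`, `IsZeroBlock` and `ParseOctal` are hypothetical helpers, not proposed APIs.

```csharp
using System;
using System.IO;
using System.Text;

const int BlockSize = 512;

// File contents are padded with NUL bytes up to the next block boundary.
long RoundUpToBlock(long size) => (size + BlockSize - 1) / BlockSize * BlockSize;

bool IsZeroBlock(ReadOnlySpan<byte> block)
{
    foreach (byte b in block)
    {
        if (b != 0) return false;
    }
    return true;
}

// Reads a zero-padded ASCII octal field, stopping at the first non-octal byte.
long ParseOctal(ReadOnlySpan<byte> field)
{
    long value = 0;
    foreach (byte b in field)
    {
        if (b < (byte)'0' || b > (byte)'7') break;
        value = (value << 3) + (b - (byte)'0');
    }
    return value;
}

// Builds a simplified entry: a header with only name and size, then padded contents.
byte[] MakeEntry(string name, byte[] content)
{
    byte[] entry = new byte[BlockSize + RoundUpToBlock(content.Length)];
    Encoding.ASCII.GetBytes(name, 0, name.Length, entry, 0);
    string size = Convert.ToString(content.Length, 8).PadLeft(11, '0');
    Encoding.ASCII.GetBytes(size, 0, size.Length, entry, 124);
    content.CopyTo(entry, BlockSize);
    return entry;
}

// Walks from header to header until the end-of-archive marker (or the data runs out).
int CountEntries(byte[] archive)
{
    int count = 0;
    long offset = 0;
    while (offset + BlockSize <= archive.Length)
    {
        ReadOnlySpan<byte> header = archive.AsSpan((int)offset, BlockSize);
        if (IsZeroBlock(header)) break;
        long size = ParseOctal(header.Slice(124, 12));
        count++;
        offset += BlockSize + RoundUpToBlock(size); // next header position
    }
    return count;
}

using var ms = new MemoryStream();
ms.Write(MakeEntry("a.txt", new byte[13]));
ms.Write(MakeEntry("dir/", Array.Empty<byte>()));
ms.Write(MakeEntry("b.bin", new byte[600]));
ms.Write(new byte[2 * BlockSize]); // end-of-archive marker: two zero blocks
byte[] archive = ms.ToArray();

int entryCount = CountEntries(archive);
Console.WriteLine($"Entries: {entryCount}, archive length: {archive.Length}");
```

With a non-seekable stream the same walk would have to read and discard the contents of each entry instead of skipping by offset.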
The tar spec was not designed to include compression capabilities, but tars are commonly combined with a compression method. The most popular method is to first generate the tar file, then compress it, usually with GZip (.tar.gz) or with LZMA (.tar.xz). While this method simplifies and separates the archival and compression stages, it also means that the only way the user can read the contents of the tar file is by decompressing it first.
A less common method is to compress the file contents individually, leaving the headers readable by the user. It is uncommon because the header offers no field to indicate which compression method was used for each file contents block, so the user needs to preserve that information somewhere else.
There are multiple versions of the tar format: v7, ustar, pax, gnu, oldgnu, solaris, aix, macosx. We should focus on v7, ustar, pax and gnu.
Sources:
- https://en.wikipedia.org/wiki/Tar_(computing)
- https://serverfault.com/a/897948 - A great summary of the different tar versions.
- https://www.systutorials.com/docs/linux/man/5-tar/ - Man 5 tar, description of the tar file format.
- https://www.gnu.org/software/tar/manual/html_node/Standard.html - The GNU tar spec.
- https://pubs.opengroup.org/onlinepubs/9699919799/utilities/pax.html#tag_20_92_13_03 - The PAX spec.
Open questions
Tar versions
- Should we implement the different tar versions separately? As Ian suggested above, we could gradually add support to the more complex ones:
- Add TarArchive/TarArchiveEntry with V7&UStar support and format header detection. Throw error for unsupported formats (e.g. GNU, PAX)
- Add tar archival/de-archival support for GNU tar
- Add tar archival/de-archival support for PAX/Posix tar
Assembly
- In which assembly should the stream-based APIs live?
- System.IO.Compression
- System.IO.Compression.Tar
- In which assembly should the static APIs live?
- System.IO.Compression
- System.IO.Compression.TarFile (similar to ZipFile)
TarArchiveEntry
- Do we need to expose Mode, Uid, Gid, UName, GName?
- How commonly used are DevMajor and DevMinor? Do users need these properties to be exposed at all?
- Do we need the EntryType property? I'd say yes, especially because some entries are LongLink and the actual entry is expected to be located in the next position.
- If we do, should the values of the enum be the exact values that can be found in the tar header, or should we assign default values, then map them internally to the actual value?
- If the user adds an entry, what EntryType values should be allowed? Can the user programmatically add a Block, Fifo, Contiguous, or Character entry?
- We use FullName to be consistent with other full path properties. But should we instead use FullPath?
- Mode, Uid and Gid are in base 10, but they will be converted to base 8 internally.
- How should we distinguish between the end of file and a corrupt entry when calling TryGetNextEntry? EOF is marked in a tar with two 512-byte blocks filled with nulls.
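One possible way to tell the end-of-archive marker apart from a damaged header is to combine an all-zero check with the header checksum. The sketch below assumes the standard checksum rule (the unsigned sum of all 512 header bytes, with the 8-byte checksum field at offset 148 counted as spaces). The helper names are hypothetical, and this is only one option for what TryGetNextEntry could do internally.

```csharp
using System;
using System.Text;

bool IsZeroBlock(ReadOnlySpan<byte> block)
{
    foreach (byte b in block)
    {
        if (b != 0) return false;
    }
    return true;
}

// Sum of all header bytes, treating the checksum field (offset 148, 8 bytes) as spaces.
int ComputeChecksum(ReadOnlySpan<byte> header)
{
    int sum = 0;
    for (int i = 0; i < 512; i++)
    {
        sum += (i >= 148 && i < 156) ? (byte)' ' : header[i];
    }
    return sum;
}

// Reads ASCII octal digits, stopping at the first non-octal byte.
int ParseOctal(ReadOnlySpan<byte> field)
{
    int value = 0;
    foreach (byte b in field)
    {
        if (b < (byte)'0' || b > (byte)'7') break;
        value = (value << 3) + (b - (byte)'0');
    }
    return value;
}

bool HasValidChecksum(ReadOnlySpan<byte> header) =>
    ParseOctal(header.Slice(148, 8)) == ComputeChecksum(header);

// A minimal header with a name and a correct checksum (six octal digits, NUL, space).
byte[] header = new byte[512];
Encoding.ASCII.GetBytes("a.txt", 0, 5, header, 0);
string digits = Convert.ToString(ComputeChecksum(header), 8).PadLeft(6, '0');
Encoding.ASCII.GetBytes(digits, 0, 6, header, 148);
header[154] = 0;
header[155] = (byte)' ';

// The same header with one bit flipped in the name field.
byte[] corrupt = (byte[])header.Clone();
corrupt[0] ^= 0x01;

bool zeroIsEndMarker = IsZeroBlock(new byte[512]);
bool goodHeaderValid = !IsZeroBlock(header) && HasValidChecksum(header);
bool corruptHeaderValid = !IsZeroBlock(corrupt) && HasValidChecksum(corrupt);

Console.WriteLine($"zero block: {zeroIsEndMarker}, good: {goodHeaderValid}, corrupt: {corruptHeaderValid}");
```

Under this approach, an all-zero block would make TryGetNextEntry return false, while a non-zero block with a mismatched checksum could throw instead, so callers can distinguish the two cases.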
TarOptions
- Should the properties be settable, or should they be init? Consider that the TarArchive would cache the options, but it may not make sense for the user to be able to change the value of the cached options.
Static APIs
- The static extension APIs are inspired by the Zip ones. Since these are all being added together, maybe we don't need them as extensions, and they can be part of the class they extend. Thoughts?
- There are many shared parameters in the different overloads. Should we have separate options classes (similar to TarOptions), one for extraction and another one for creation? We can pass an instance of the class as an argument, and have only one method instead of several overloads. This would be helpful in case we grow the options in the future.
Compression
- Notice that none of the APIs offer compression. Should we add static methods that allow the user to create a compressed tar file, and let them choose the desired algorithm?
- .NET currently only offers GZip, Deflate, Brotli and ZLib. We are considering adding support for ZStandard and LZMA. We could consider adding static APIs that allow composability with external compression stream-based APIs like those offered by SharpCompress or SharpZipLib.
Security
- Notice there is no Entries property, like in zip. This is because we don't have a central directory. If we receive a network stream, we wouldn't be able to know the Count.
- Internally, we would only cache the list of visited entries if the TarArchive is opened in Create or Update mode. This is because the assumption is that we will modify the tar file on dispose, either because we want to add new entries, or because we want to delete existing entries.
- One thing we could do is add an Entries property that can only be used if the stream is a seekable FileStream, in which case we can use the new RandomAccess APIs to get the files.
- If the user opens an existing tar file, and an existing entry has a Mode, UName and/or GName that does not match that of the current user, should we allow the user to read/update/delete/extract that entry, or should we forbid access to it?
- Tarbombs happen when a tar file is extracted into an existing directory and overwrites existing files.
- They also give problems when an entry has an absolute path, and on extraction, it could potentially overwrite a system file.
- The fact that tars can contain symbolic links can also be problematic if it is expected to extract files into a symlinked folder. By default, we should not follow symlinks.
- Tar files allow having multiple files with identical path and filename. A tarbomb behavior could happen if the first extracted file is a symlink, and the next one is a regular file, in which case the second file could end up being written in the target location of the symlink. We should avoid such behaviors. One possible solution is to cache all the names, see if it already existed, and the subsequent duplicates are extracted with a suffix in their name. Another behavior is to throw.
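To make the absolute-path concern concrete, here is a minimal sketch of a guard that resolves an entry name against the destination directory and rejects anything that would land outside it. `ResolveSafePath` is a hypothetical helper, not a proposed API. It only covers absolute paths and `..` segments; it does not address symlinks or duplicate entry names, which would need separate handling.

```csharp
using System;
using System.IO;

// Resolves entryName under destinationDirectory, throwing if the result escapes it.
string ResolveSafePath(string destinationDirectory, string entryName)
{
    string root = Path.GetFullPath(destinationDirectory);
    if (!root.EndsWith(Path.DirectorySeparatorChar))
    {
        root += Path.DirectorySeparatorChar;
    }

    // Path.Combine returns entryName unchanged when it is rooted,
    // so absolute entry paths are caught by the prefix check below.
    string candidate = Path.GetFullPath(Path.Combine(root, entryName));
    if (!candidate.StartsWith(root, StringComparison.Ordinal))
    {
        throw new IOException($"Entry '{entryName}' would be extracted outside the destination directory.");
    }
    return candidate;
}

string destination = Path.Combine(Path.GetTempPath(), "tar-extract-demo");

foreach (string name in new[] { "docs/readme.txt", "../evil.txt", "/etc/passwd" })
{
    try
    {
        Console.WriteLine($"OK: {name} -> {ResolveSafePath(destination, name)}");
    }
    catch (IOException ex)
    {
        Console.WriteLine($"Rejected: {ex.Message}");
    }
}
```

Throwing is only one possible policy here. An implementation could also skip the offending entry or strip the leading root, as long as the behavior is documented.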
Testing
- We have an initial set of files in dotnet/runtime-assets created with the Ubuntu tar command, which generates gnutar files.
- @adamhathcock would it be ok with you if we reuse the test tar files you have in your repo? You have a good selection of test cases.