Draft of Par3 specification

mdnahas · mdnahas · commit 873af93c0dac · 2022-01-28T16:01:06.000-06:00
diff --git a/doc/Parity_Volume_Set_Specification_v3.0.txt b/doc/Parity_Volume_Set_Specification_v3.0.txt
@@ -1,19 +1,171 @@
+Parity Volume Set Specification 3.0
 
-Goal:
+Michael Nahas
 
-Provide a complete solution for the bottom two layers of archiving - redundant data and splitting.  The other layers are best supported by other programs (tar, zip/gzip/7zip, pgp/gpg, etc.).  Provide minimal support for other layers, for ease and integration (e.g., multiple input files).
+Started January 16th, 2020
 
-Expanded goals from Par2:
 
-* support any linear code (Reed-Solomon with Vandermonde matrix or Cauchy, LDPC, random sparse martix)
+Based on Parity Volume Set Specification 1.0 [2001-10-14] by Stefan Wehlus and others.
+Based on Parity Volume Set Specification 2.0 [2003-05-11] by Michael Nahas with ideas from Peter Clements, Paul Nettle, and Ryan Gallagher
+
+
+Introduction:
+
+This document describes a file format for storing redundant data for a set of files.  If any of the original set of files is damaged in storage or transmission, the redundant data can be used to regenerate the original input.  Of course, not all damage can be repaired, but much of it can.
+
+In operation, a user will select a set of files from which the redundant data is to be made. These are known as "input files" and the set of them is known as the "recovery set". The user will provide these to a program which generates file(s) that match the specification in this document. The program is known as a "PAR 3.0 client" or "client" for short, and the generated files are known as "PAR 3.0 files" or "PAR files". If the files in the recovery set ever get damaged (e.g. when they are transmitted or stored on a faulty disk) the client can read the damaged input files, read the (possibly damaged) PAR files, and regenerate the original input files.  Again, not all damage can be repaired, but much of it can.
+
+In addition to being a file format, the Par 3.0 standard can be used as a network protocol for forward error correction.  Instead of files, data objects can be packaged in the Par 3.0 format and sent over an unreliable channel, like UDP.  Redundant data can be generated and also sent over the channel.  The receiver can use the redundant data to recover data objects that experienced data loss.
+
+In Par 3.0, the redundant data is calculated using any linear systematic code. This includes a wide variety of error correcting codes, including Reed-Solomon and many Low Density Parity Check (LDPC) codes. This flexibility allows Par 3.0 clients to choose between speed and the number of errors that can be recovered.  It also allows Par 3.0 to support any codes whose patents expire or new codes that are developed.
+
+During Par 3.0 file creation, the input files are chopped into equal-sized blocks known as "input slices".  If a file does not fill out its last slice, i.e. it ends mid-slice, the rest of the slice is treated as if it is padded with zero bytes.  Each slice is treated as an array of values in a "Galois Field", which is a mathematical concept that behaves like an unsigned integer with a strange/useful overflow behavior.  The redundant data is generated by multiplying data from each input slice by a matrix.  The redundant data is generated in the same equal-sized blocks, which are called "recovery slices".
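
The slicing step described above can be sketched as follows. This is only an illustration, not part of the specification: the function name `split_into_slices` and the choice to hold a whole file in memory are assumptions made for the example.

```python
def split_into_slices(data: bytes, slice_size: int) -> list[bytes]:
    """Chop data into equal-sized slices, zero-padding the final slice."""
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    slices = []
    for offset in range(0, len(data), slice_size):
        chunk = data[offset:offset + slice_size]
        # A file that ends mid-slice is treated as if padded with zero bytes.
        slices.append(chunk.ljust(slice_size, b"\x00"))
    return slices


if __name__ == "__main__":
    for s in split_into_slices(b"abcdefghij", 4):
        print(s)
```

A real client would more likely stream the file slice by slice; the padding rule is the same either way.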
+
+Recovery is accomplished by subtracting the effect of every input slice that arrived intact from the recovery data, inverting the matrix, and then multiplying the inverted matrix by the data from each recovery slice that arrived intact.
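
To make the arithmetic concrete, here is a minimal sketch of the smallest possible case: one recovery word computed from three input words, and one lost input word rebuilt from it. It assumes a 16-bit field in which the value 0x1100B (listed later in the Matrix packet table) acts as the reduction polynomial, and it uses made-up coefficients; the function names are hypothetical. With a single missing word the "matrix" is one coefficient, so inverting it is a single field inversion.

```python
GF_BITS = 16
GF_POLY = 0x1100B  # assumption: used here as the reduction polynomial


def gf_mul(a: int, b: int) -> int:
    """Multiply two field elements (shift-and-add with reduction)."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in GF(2^n) is XOR
        b >>= 1
        a <<= 1
        if a >> GF_BITS:         # overflowed 16 bits: reduce
            a ^= GF_POLY
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse, computed as a^(2^16 - 2)."""
    if a == 0:
        raise ZeroDivisionError("zero has no inverse")
    result, exponent = 1, (1 << GF_BITS) - 2
    while exponent:
        if exponent & 1:
            result = gf_mul(result, a)
        a = gf_mul(a, a)
        exponent >>= 1
    return result


def make_recovery_word(coeffs, words):
    """One row of the matrix times the column of input words."""
    acc = 0
    for c, w in zip(coeffs, words):
        acc ^= gf_mul(c, w)
    return acc


def recover_missing_word(coeffs, words, missing, recovery):
    """Rebuild words[missing] from the intact words and one recovery word."""
    acc = recovery
    for i, (c, w) in enumerate(zip(coeffs, words)):
        if i != missing:
            acc ^= gf_mul(c, w)  # subtract (XOR out) each intact input
    return gf_mul(gf_inv(coeffs[missing]), acc)


if __name__ == "__main__":
    coeffs = [1, 2, 3]                      # illustrative coefficients only
    words = [0x1234, 0xABCD, 0x0F0F]
    rec = make_recovery_word(coeffs, words)
    damaged = [0x1234, None, 0x0F0F]
    print(hex(recover_missing_word(coeffs, damaged, 1, rec)))
```

With several missing slices the same idea applies, except that a square submatrix of coefficients has to be inverted instead of a single element.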
+
+The PAR 3.0 file itself is made of packets - self-contained parts with their own checksum. This design prevents damage to one part of the file from making the whole file unusable. 
+
+Packets have a type and each type of packet serves a different purpose. One describes a file.  One describes the matrix.  Another contains the checksums of the slices in a file. And yet another contains a recovery slice.  There are a few other types too.
+
+A PAR 3.0 file is only required to contain 1 specific packet - the packet that identifies the type of client that created the file. This way, if clients are creating files that don't match the specification in some way, they can be tracked down.
+
+The packets can be packaged into multiple files. Files can contain duplicate packets - in fact, this is recommended for vital packets, such as the ones that describe the input files. Packets can appear in any order in a file, but there is a recommended order if you want to support clients that recover the file(s) in a single pass.  
+
+
+Design Goals:
+
+Par 3.0's goal is to provide a complete solution for the bottom two layers of archiving - redundant data and splitting.  The other layers are best supported by other programs (tar, zip/gzip/7zip, pgp/gpg, etc.).  Par 3.0 does provide minimal support for other layers, for ease and integration.  For example, it does support multiple input files.
+
+Major differences from Par 2.0 are:
+* support any systematic linear code (Reed-Solomon with Vandermonde matrix or Cauchy, LDPC, random sparse matrix)
 * support streaming / single-pass recovery
-* support parity data inside a ZIP file or other file
-* better support splitting files into equal-sized outputs
+* support parity data inside a ZIP, ISO 9660, or other file
+
+Part of supporting any linear code is to fix the major bug in Par 2.0, which is that it did not do Reed-Solomon encoding as it promised.  There was a major mistake in the paper that Par 2.0 relied on.  The problem manifested as a bug in Par 1.0 and, while Par 2.0 reduced its occurrence, it did not fix the problem.  Par 2.0 did not use an always invertible matrix; it essentially used a random matrix, which (luckily) is invertible with high probability.  Par 3.0 fixes that bug.
+
+Supporting new linear codes, like LDPC and sparse random matrices, will allow much faster Par file generation.
+
+Some minor differences are:
+* UTF-8 filenames (which were supported as a never-published Par 2.1 standard)
+* support for recovery from multiple files with overlapping recovery sets
+* the ability to change file names without regenerating all Par files
+* empty directories
+* more than 2^16 files
+
+Par 3.0 drops explicit support for the non-recovery set.  It can be simulated by zero entries in the matrix.
+
+
+
+Conventions
+
+There are a number of conventions used in the design of this specification.
+
+The data is 8-byte aligned. That is, every field starts on an index in the file which is congruent to zero, modulo 8. (That is, address % 8 == 0) This is because some memory systems function faster if 64-bit quantities are 8-byte aligned. It should be noted that a file could be corrupted (bytes inserted or deleted) to throw off the alignment.
+
+All integers in this version of the spec are unsigned integers of either 4 or 8 bytes in length.
+
+Strings are not null-terminated. This is to prevent hackers from using stack-overflow attacks. In order to make a string 8-byte aligned, 1 to 7 zero bytes may be appended.  If an N-byte field contains a string, a null-terminated string can be created by copying the N-byte field into a character array of length N+1 and then setting the final (N+1)th character to '\0'.
+
+The lengths of arrays and strings are often implicit. For example, if a region is known to be 32 bytes and that region contains an 8-byte integer and a string, then the string is known to take up 24 bytes. The string is then at least 17 bytes in length, since the 24 bytes contain 0 to 7 bytes of NUL padding at the end.
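
The padding and implicit-length conventions above might be handled as in the sketch below. The function names are hypothetical, and the decoder assumes that stored strings never legitimately end in a NUL character, since trailing zero bytes are treated as padding.

```python
def encode_string_field(text: str) -> bytes:
    """Encode text as UTF-8 and zero-pad it to a multiple of 8 bytes."""
    raw = text.encode("utf-8")
    padding = -len(raw) % 8          # 0 to 7 zero bytes
    return raw + b"\x00" * padding


def decode_string_field(field: bytes) -> str:
    """Recover the text from a field whose length is known implicitly."""
    if len(field) % 8 != 0:
        raise ValueError("field is not 8-byte aligned")
    return field.rstrip(b"\x00").decode("utf-8")


if __name__ == "__main__":
    field = encode_string_field("hello")
    print(len(field), decode_string_field(field))
```

In Python there is no need for the N+1 copy described above; that step only matters in languages that expect a terminating '\0'.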
+
+All strings are UTF-8.  
+
+The lengths of files and parts of files are determined by 8-byte integers. This is to support OSes that can handle files longer than 4GB.
+
+All integers are Intel-endian. (That is, little endian.)
+
+The recovery set is identified by a 16-byte value known as the Recovery Set ID. Every part of the PAR file that affects a recovery set contains the recovery set ID. In this 3.0 version, the Recovery Set ID is a random number.  It is recommended that clients use the MD5 hash of a user identifier (account name), machine identifier (hostname, IP address, etc.) and a high-resolution clock.  The way of calculating this value could change in future versions; clients reading files should not rely on how it is calculated.
+
+Files are also identified by a 16-byte value. In this 3.0 Version, it is an MD5 Hash of their length and the MD5 Hash of their first 16kB. The way of calculating this value could change in future versions; clients reading files should not rely on how it is calculated.
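
One possible reading of the file identifier rule is sketched below. The draft does not say how the length and the inner hash are combined, so several details are assumptions made only for illustration: the length is encoded as an 8-byte little-endian integer, it is placed before the inner hash, and "16kB" is taken to mean 16384 bytes. The name `file_id` is hypothetical.

```python
import hashlib
import struct

FIRST_BLOCK = 16 * 1024  # assumption: "16kB" means 16384 bytes


def file_id(data: bytes) -> bytes:
    """MD5 over (length, MD5 of the first 16kB); field order is assumed."""
    head_hash = hashlib.md5(data[:FIRST_BLOCK]).digest()
    length = struct.pack("<Q", len(data))  # 8-byte little-endian integer
    return hashlib.md5(length + head_hash).digest()


if __name__ == "__main__":
    print(file_id(b"example contents").hex())
```

Under this reading, two files of equal length that differ only after the first 16kB get the same identifier, which is consistent with the identifier being a cheap label rather than a full-content checksum.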
+
+Every byte of a PAR file is specified. There are no places to throw junk bytes that can be any value. Padding, where needed, is specified to be zero bytes. The order of items in all arrays is specified.
+
+The specification is designed so that if two clients generate a packet with the same parameters, the packets are identical (except for client-identifying or client-specific packets). Thus, client writers can compare the output of their program against the reference implementation by comparing packets byte-for-byte.
+
+
+
+
+Description:
+
+A PAR 3.0 file consists of a sequence of "packets". A packet has a fixed-size header and a variable-length body. The packet header contains a checksum for the packet - if the packet is damaged, the packet is ignored. The packet header also contains a packet-type. If the client does not understand the packet type, the packet is ignored. To be compliant with this specification, a client must understand the "core" set of packets. Clients may process the optional packets or create their own application-specific packets.
+
+The packet header is:
+
+Table: Packet Header
+Length (bytes)	Type	Description
+8	byte[8]	Magic sequence. Used to quickly identify location of packets. Value = {'P', 'A', 'R', '2', '\0', 'P', 'K', 'T'} (ASCII)
+8	8-byte uint	Length of the entire packet. Must be multiple of 8. (NB: Includes length of header.)
+16	MD5 Hash	MD5 Hash of packet. Used as a checksum for the packet. Calculation starts at first byte of Recovery Set ID and ends at last byte of body. Does not include the magic sequence, length field or this field. NB: The MD5 Hash, by its definition, includes the length as if it were appended to the packet.
+16	MD5 Hash	Recovery Set ID. All packets that belong together have the same recovery set ID. (See "Conventions" for how it is calculated.)
+16	byte[16]	Type. Can be anything. All types beginning with "PAR " (ASCII) are reserved for specification-defined packets. Application-specific packets are recommended to begin with the ASCII name of the client.
+?*8	?	Body of Packet. Must be a multiple of 8 bytes.
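
The header layout in the table can be exercised with a short sketch like the one below. It follows the table as written (8-byte magic, 8-byte little-endian length, 16-byte MD5, 16-byte Recovery Set ID, 16-byte type, body), but the function names are hypothetical and the error handling is simplified: any malformed or damaged packet is reported as `None`.

```python
import hashlib
import struct

MAGIC = b"PAR2\x00PKT"
HEADER_SIZE = 8 + 8 + 16 + 16 + 16  # 64 bytes
CREATOR_TYPE = b"PAR 2.0\x00Creator\x00"


def build_packet(set_id: bytes, packet_type: bytes, body: bytes) -> bytes:
    """Assemble a packet; the MD5 covers set ID, type, and body."""
    if len(set_id) != 16 or len(packet_type) != 16:
        raise ValueError("set ID and type must be 16 bytes each")
    if len(body) % 8 != 0:
        raise ValueError("body must be a multiple of 8 bytes")
    hashed = set_id + packet_type + body
    length = struct.pack("<Q", HEADER_SIZE + len(body))
    return MAGIC + length + hashlib.md5(hashed).digest() + hashed


def parse_packet(buf: bytes):
    """Return (set_id, type, body), or None if the packet is unusable."""
    if len(buf) < HEADER_SIZE or buf[:8] != MAGIC:
        return None
    (length,) = struct.unpack_from("<Q", buf, 8)
    if length < HEADER_SIZE or length % 8 != 0 or length > len(buf):
        return None
    hashed = buf[32:length]
    if hashlib.md5(hashed).digest() != buf[16:32]:
        return None  # damaged packets are ignored
    return hashed[:16], hashed[16:32], hashed[32:]


if __name__ == "__main__":
    packet = build_packet(bytes(16), CREATOR_TYPE, b"example!")
    print(len(packet), parse_packet(packet) is not None)
```

A scanner looking for packets in a damaged file could search for the magic sequence at every offset and call something like `parse_packet` on each hit, skipping those that return `None`.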
+
+There are various types of packets. The "core" set of packets - the set of packets that all clients must recognize and process - are listed next. For each, the value for the "type" field will be listed along with the contents of the body of the packet. 
+
+It is important to notice that the Magic Sequence and, later, some Type fields refer to "PAR2".  This is because the packet header format and those particular packet types are also valid according to the PAR 2.0 file specification.  A PAR 2.0 client will be able to read them and do something with them, even if it could not use all the features introduced in this PAR 3.0 specification.
+
+
+Creator packet
+
+This packet is used to identify the client that created the file. It is required to be in every PAR file. If a client is unable to process a recovery set, the contents of the creator packet must be shown to the user. The goal of this is that any client incompatibilities can be found and resolved quickly.
+
+The creator packet has a type value of "PAR 2.0\0Creator\0" (ASCII). The packet's body contains the following:
+
+Table: Creator Packet Body Contents
+Length (bytes)	Type	Description
+?*8	UTF-8 char array	UTF-8 text identifying the client. This should also include a way to contact the client's creator - either through a URL or an email address. NB: This is not a null terminated string!
+
+It is recommended that the text in the creator packet include any parameters used to generate the file.  For example, the command line arguments.  This will aid in debugging problems.
+
+
+Matrix packet
+
+This packet describes a Galois Field and matrix used to recover data.   
+
+The matrix packet has a type value of "PAR 3.0\0Matrix\0\0" (ASCII). The packet's body contains the following:
+
+Table: Matrix Packet Body Contents
+Length (bytes)	Type	Description
+8	unsigned int	The size of the Galois field in bits.  Clients are only required to support values of 16.
+8	unsigned int	The generator of the Galois field.  Clients are only required to support the generator 0x0001100B.
+?*8	various	
+?*8	UTF-8 char array	UTF-8 text identifying the client. This should also include a way to contact the client's creator - either through a URL or an email address. NB: This is not a null terminated string!
+
+It is recommended that the text in the creator packet include any parameters used to generate the file.  For example, the command line arguments.  This will aid in debugging problems.
+
+
+
+
+File hash  (license, code, projects using, speed, ...)
+  Blake3 :
+    8 times faster than MD5, using single thread SSE, 16kB input
+       ---> Blake3 paper says it is roughly the same speed, maybe a touch faster, than KangarooTwelve
+       ---> same paper says it is much faster on an ARM (Raspberry Pi)
+       ---> VERY multi-threadable
+    Public Domain CC0 1.0
+    Rust is default implementation; C doesn't use threads.
+    GCC or MSVC
+    256-bit output
+  KangarooTwelve:
+    Mostly Public Domain CC0
+    Python or Rust or C
+    GCC  (MSVC support is experimental)
+    Variable sized output, suggested 128-bit
+
+
+K12
+  -- require xsltproc
+
+
+
+
+NOTE: non-systematic linear codes
+
+WARNING: Unicode filenames sometimes use 1 or 2 characters for umlaut, circumflex, ...  "diaeresis"
+
+
 
-? support using Par files from other recovery sets for recovery?
-? changing filenames after generating parity files?  (Previous version sorted fileids, so changing name meant changing fileid, recovery set id, everything.)
 
-* also: empty directories, more than 2^16 files, UTF8
 
 Use cases:
 
@@ -71,4 +223,3 @@ NOTE: This use case is file-based.  I don't think we can support stream-based op
 
 
 
-