Skip to content
This repository was archived by the owner on Nov 20, 2020. It is now read-only.
/ vstruct Public archive
forked from ToxicFrog/vstruct

A Lua library for packing and unpacking binary data, supporting arbitrary (byte-aligned) widths, named fields, and repetition.

License

Notifications You must be signed in to change notification settings

LuaDist/vstruct

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

    Contents
    ========

1.	Overview
2.  Warning about numeric precision
3.  Setup
    3.1 Installation
    3.2 Testing
    3.3 Loading
4.  API
    4.1 Variables
    4.2 Functions
5. Format string syntax
    5.1 Repeat markers
    5.2 Tables
    5.3 Names
    5.4 Bitpacks
6. Data items
    6.1 Controlling endianness
    6.2 Seeking
    6.3 Reading and writing
7. Adding new formats
8. Credits



    1. Overview
    ===========

VStruct is a library for Lua 5.1. It provides functions for manipulating binary data, in particular for unpacking binary files or byte buffers into Lua values and for packing Lua values back into files or buffers. Supported data types include:

  * signed and unsigned integers of arbitrary byte width
  * fixed and floating point numbers
  * plain, counted and null-terminated strings
  * booleans and bitmasks
  * bit-packed integers, booleans and bitmasks
  
In addition, the library supports seeking, alignment, and byte order controls, repetition, grouping of data into tables, and named fields.



    2. Warning about numeric precision
    ==================================

When reading and writing numeric formats, vstruct is inherently limited by lua's number format, which is by default the IEEE 754 double. What this means in practice is that formats i, u, c, and p (see FIXME) may be subject to data loss, if they contain more than 52 significant bits. (The same is true of numeric constants declared in Lua itself, of course, and other libraries which store values in lua numbers). In other words - be careful when manipulating data, especially 64-bit ints, that may not fit in Lua's native types.

Formats not listed above are not subject to this limitation, as they either do not use Lua numbers at all, or do so only in ways that are guaranteed to be lossless.



    3. Setup
    ========
    
    3.1 Installation
    ----------------
    
vstruct is a pure-lua and module, and as such requires no seperate build step; it can be installed as-is.

To install it, copy the 'vstruct/' directory from the vstruct distribution into a directory that is searched by package.path, such that 'require "vstruct"' will load vstruct/init.lua and 'require "vstruct.foo"' will load vstruct/foo.lua. On linux, this can be accomplished by installing it to any of the following directories:

    /usr/lib/lua/5.1/
    /usr/share/lua/5.1/
    /usr/local/lib/lua/5.1/
    /usr/local/share/lua/5.1/

Or by modifying package.path to include the following entries (assuming that $LIBDIR is the directory *containing* vstruct/):

    $LIBDIR/?.lua
    $LIBDIR/?/init.lua
    
Note that installing it to ./ will *not* work unless you also add an entry for './?/init.lua' to package.path; by default, lua will look for './?.lua' but not './?/init.lua'.


    3.2 Testing
    -----------
    
vstruct comes with a number of builtin test suites. To run them, simply invoke vstruct/test.lua:

    lua vstruct/test.lua
    
If any of the tests fail, it will report them on standard output and then raise an error; if all of the tests pass, it will exit cleanly.

If vstruct is properly installed, you can also load the tests as a library:

    lua -lvstruct.test
    require "vstruct.test"
    
With the same behaviour - it will load cleanly if the tests succeeded and raise an error if any of them failed.


    3.3 Loading
    -----------
    
vstruct makes itself available under name 'vstruct'; the module returns itself when loaded, and also sets a global 'vstruct', so either of the following will work:

    require "vstruct"
    print(vstruct._VERSION)
    
    local vstruct = require "vstruct"
    print(vstruct._VERSION)
    
As automatic setting of globals by modules is increasingly discouraged, the latter form is preferred; the former is likely to go away at some point.



    4. API
    ======

vstruct, once loaded, exports the following variables and functions.

In this section, a *format string* means the string used to describe an on-disk format, controlling how data is packed or unpacked. Format strings are described in sections 5 (the actual syntax of the format string) and 6 (the individual data formats supported).
    
    4.1 Variables
    -------------
    
    vstruct.cache = (true|false|nil)

Enables or disables cacheing of compiled format strings. This can noticeably improve performance in the common case where you are re-using the same strings many times; if, on the other hand, you are generating lots of different strings, this will increase memory usage - perhaps significantly - for no performance benefit. 

If true, cacheing is enabled.
If false, new cache entries will not be created but old ones will still be used.
If nil, the cache is disabled entirely.

    4.2 Functions
    -------------
    
    vstruct.cursor(string)

Wraps a string so that it can be used as a file. The returned object ('cur') supports cur:seek, cur:read(num_bytes) and cur:write(string), with the same behaviours as the file methods of the same names. In general, vstruct will attempt to automatically wrap strings if they are passed to it where a file is expected; this function is primarily useful when more control over the process is required.

To access the wrapped string, use cur.str; to determine where the read/write pointer is, use cur.pos.

--------
    
    vstruct.unpack(fmt, <fd or string>, [unpacked])

unpack takes a format string and a buffer or file to unpack from, and returns a table of unpacked data items. The items will appear in the table with the same order and/or names that they appeared in the format string.

If the 'unpacked' argument is present and true, vstruct will return the data items as individual return values rather than as a table, as in the following example:

    vstruct.unpack("3*u1", buf)         -- { 1, 2, 3 }
    vstruct.unpack("3*u1", buf, true)   -- 1, 2, 3

If the 'unpacked' argument is present and a table, vstruct will pack the items into this table in-place and return the table. Existing entries in the table will not be cleared (but might be overwritten by named entries). Numbered entries will always be appended, not overwritten:

    t = { 0, 0, 0 }
    vstruct.unpack("3*u1", buf, t)      -- t == { 0, 0, 0, 1, 2, 3 }

--------

    vstruct.pack(fmt, [fd], data)

pack takes a format string and a table of data and packs the contents of the table. If the fd argument is present, it will write the data directly to it using standard file io methods; if fd is a string, it will wrap it with vstruct.cursor first.

If fd is omitted entirely, it will pack the data into a string and return the string (not the cursor wrapping it); in effect, the following two expressions are equivalent:

    vstruct.pack(fmt, vstruct.cursor(""), data).str
    vstruct.pack(fmt, data)

The structure of the table is expected to be the same as the structure would be created by a call to unpack with the same format string; in effect, you should be able to take the result of unpack and pass it to pack unaltered (and get the same packed data back), and vice versa.
    
--------
    
    vstruct.explode(int)
	vstruct.implode(table)

explode converts a bitmask (in integer format) into a list of booleans, and implode does the converse. In such lists, list[1] is the least significant bit, and list[n] the most significant.

--------

    vstruct.compile(format)

compile takes a format string and runs it through the compiler and code generator, but does not actually pack or unpack anything. Instead, it returns a table containing three values -

    pack(fd, data) - equivalent to struct.pack(format, fd, data)
    unpack(fd, [unpacked]) - equivalent to struct.unpack(format, fd, [unpacked])
    source - a string containing the generated source for the format string.
             Primarily of use when debugging.

In effect, the following code:

	t = vstruct.compile(fmt)
	d = t.unpack(fd)
    t.pack(fd, d)

Is equivalent to:

	d = vstruct.unpack(fmt, fd)
    vstruct.pack(fmt, fd, d)


    
    5. Format string syntax
    =======================

A format string consists of a series of *format items*. A format item is:

* a data item, seek control, or endianness control (see section 6)
* a repeat marker 'N*' followed by a format item, or a sequence of format items enclosed in '(' and ')'
* a table '{ ... }' enclosing any number of format items
* a name 'foo:' followed by a data item or table
* a bitpack '[S| ... ]' enclosing any number of *bitpack-capable* format items
* a comment, starting with '--' and ending at the next newline

These are explained in detail in the rest of this section, apart from data items, seek controls, and endianness controls, which are a sufficiently lengthy topic that they have a section of their own (section 6).

In general, whitespace may be omitted where the result is unambiguous, and when present, the amount and type of whitespace is irrelevant. Comments are considered to be whitespace.


    5.1 Repeat markers
    ------------------

A repeat marker consists of a decimal number N, followed by a *, followed by a format item (or a group of such items enclosed in parentheses). The following item is repeated N times. For example, these three format strings:

    "u4 u4 u4 u4"
    "{ u2 b1 } { u2 b1 }"
    "u2 u2 u4 u2 u2 u4 u2 u2 u4 m2"
    
Can be expressed more concisely as these:

    "4*u4"
    "2*{ u2 b1 }"
    "3*(u2 u2 u4) m2"

    
    5.2 Tables
    ----------
    
A table consists of any number of format items enclosed in curlybraces. When unpacking, the items contained in the table will be packed into their own subtable in the output; when packing, items contained in the table will e searched for in a subtable. For example, this format string:

    "{ u4 u4 u4 } { { b1 b1 } s8 }"

Describes the following table (or something similarly structured):

    {
        { 1, 2, 3 };
        {
            { true, false };
            "test";
        };
    };
    
Note the outer table; unpack returns all unpacked values in a table by default, and pack expects the values it packs to be contained in one.

Within a format string, tables may be nested arbitrarily.


    5.3 Names
    ---------
    
A name consists of a valid Lua identifier, or sequence of such identifiers separated with '.', followed by a ':'. It must be followed by a data item or a table. The following item will be stored in/retrieved from a field with the given name, rather than being (un)packed sequentially as is the default.

For example, this table:

    {
        coords = { x=1, y=2, z=3 };
    }
    
Could be expressed by either of these format strings:

    "coords:{ x:u4 y:u4 z:u4 }"
    "coords:{} coords.x:u4 coords.y:u4 coords.z:u4"

    
    5.4 Bitpacks
    ------------
    
A bitpack consists of multiple data items packed into a single item (typically an int) with no regard for byte-alignment - for example, something that stores three five-bit ints and a one-bit boolean in a single 16-bit int.

These are expressed in vstruct as '[' size '|' items ']', where 'size' is the size of the entire bitpack in *bytes* and 'items' is a sequence of data items with their sizes in 'bits'. The above example would be expressed thus:

    "[2| u5 u5 u5 b1 ]"
    
Bitpacks read their contents MSB to LSB; in the above example, the boolean is the least significant bit of the enclosing 2-byte int.

Despite being conceptually packed into ints, bitpacks are not subject to numeric precision limitations (although the data items inside them might be, if they are large enough); a 100-byte bitpack will unpack just a accurately as a 1-byte one.

Packing and unpacking of bitpacks respects *byte* endianness; in little-endian mode, the least significant *byte* is expected to come first. Bits are always MSB first, LSB last. That is to say, given the following bytes on disk:

    01 02
    
It will be read in the following manners in big- and little-endian modes

    "> [2| 4*u4 ]"      -- { 0, 1, 0, 2 }
    "< [2| 4*u4 ]"      -- { 0, 2, 0, 1 }

At present, only formats 'b', 'u', 'i', 'x' and 'm' are supported inside bitpacks.

    
    6. Data items
    =============

This section describes the individual fragments of data that make up the bulk of a format string, as well as the seek and endianness controls available.

By convention, in this section, upper-case single letters represent decimal numbers to be filled in when the format string is written; W in particular represents the width of the given field, and A an address or offset in the packed data - both in bytes.


    6.1 Controlling endianness
    --------------------------
    
At any given moment, when pacing or unpacking, vstruct is in either "big-endian" or "little-endian" most. These affect the order in which it expects bytes to appear for formats 'f', 'u', 'i', and 'm'; the initial string length in 'c'; and when reading or writing the bytes that make up a bitpack. In big-endian mode it expects the most significant byte to occur first; in little-endian mode, the least sigificant byte. (There is at present no support for more esoteric modes like middle-endian.)

Each operation starts in *host endian* mode - the endianness is set to match that of the host system. It can subsequently be controlled with the following characters in the format string:

<
    Sets the endianness to little-endian (eg, Intel processors)
>
    Sets the endianness to big-endian (eg, PPC processors)
=
    Resets the endianness to match that of the host system
    
    
    6.2 Seeking
    -----------

vstruct supports seeking in the underlying buffer or file; these operations will translate into a call to the :seek() method. Note that attempting to use these when unpacking from or packing to a non-seekable stream (such as stdout) will generate a runtime error. In that case, use 'x' (the skip/pad format) instead.

@A
	Seek to absolute address A.
+A
	Seek forward A bytes.
-A
	Seek backwards A bytes.
aW
	Align to word width W (seek to the next address which is a multiple of W)

	
	6.3 Reading and writing
	-----------------------
	
The following items perform actual reading and writing of data.
	
bW	    Boolean.
	Read: as uW, but returns true if the result is non-zero and false otherwise.
	Write: as uW with input 1 if true and 0 otherwise.
	
	Note that when writing, the output for true is integer 1, not all bits 1.

cW	    Counted string.
	Read: uW to determine the length of the string W', followed by sW'.
	Write: the length of the string as uW, followed by the string itself.
	
	The counted string is a common idiom where a string is immediately prefixed with its length, as in:
		size_t len;
		char[] str;
	The counted string format can be used to easily read and write these. The width provided is the width of the len field, which is treated as an unsigned int. Only the string itself is returned (when unpacking) or required (when packing). Internally, this is implemented as 'u' followed by 's'; consequently, it is affected by endianness.
	
fW	    IEEE 754 floating point.
	Valid widths are 4 (float) 8 (double) and 16 (quad). Note that quads have more precision than the default lua number format (double), and thus may not unpack exactly unless you're using a custom lua build.
	
	Affected by endianness.

iW	    Signed integer.
	Read: a signed integer of width W bytes.
	Write: a signed integer of width W bytes.
	
	When writing, floating point values will be truncated.
	Affected by endianness.

mW	    Bitmask.
	Read: as uW, but explodes the result into a list of booleans, one per bit.
	Write: implodes the input value, then writes it as uW.
	
	In effect, a 'u' that automatically calls vstruct.implode/vstruct.explode.
	Affected by endianness.

pW,F    Signed fixed point.
    W is, as usual, the width of the entire number in *bytes*. F is the number of *bits* of fractional precision. Thus, a 24.8 fixed point number would have format "p4,8".

	Read: a W-byte fixed point number with F bits of fractional precision.
	Write: a W-byte fixed point number long with F bits of fractional precision.
	
	When writing, values which cannot be precisely expressed in the given precision will be truncated.
	Affected by endianness.

sW	    Fixed-length string.
    W is optional.
	Read: reads exactly W bytes and returns them as a string. If W is omitted, reads until EOF.
	Write: writes exactly W bytes (truncating or nul-padding the given string, if W does not match the string length). If W is omitted, autodetects it from the string length.

uW	    Unsigned integer.
	Read: an unsigned integer of width W bytes.
	Write: an unsigned integer of width W bytes.
	
	On write, non-integer values will be truncated.
	Negative values will be written in absolute form.
	Affected by endianness.

xW	    Skip/pad.
	Read: read and discard the next W bytes.
	Write: write W zero bytes.

zW,C    Nul terminated/nul padded string.
    W and C are both optional.
    The ',' is mandatory if C is present and optional otherwise.
    
    W, if present, is the length of the string (in *bytes*, not characters).
    C, if present, is the size in bytes of individual characters in the string. The default is 1.  It is important when operating on wide-character strings to specify C correctly, so that sequences like "00 66 00 67" are not incorrectly interpreted as ending the string.
    
    Read: reads exactly W bytes, and returns everything up to the first nul *character*. If W is omitted, reads until it encounters a nul. The nul is read, but not included in the returned string.

    Write: writes exactly W bytes. If the input is shorter than W, nul pads the output; if longer, truncates it to (W - C) bytes and ends it with a nul. If W is omitted entirely, the string is written out in full and nul terminated.
    
    When nul terminating, or looking for a nul character to detect the end of the string, C zero bytes, C-aligned relative to the start of the string, are used. In particular, this means that the following 6-byte string:
    
        6600 0066 0000
        
    Will be one byte long (plus one byte termination) under "z6,1", but four bytes long (plus two bytes termination) under "z6,2" - the second and third bytes make up 0000, nul, but since it is not character-aligned it is ignored.


    7. Adding new IO operations
    ===========================
    
If you want to add support for a new data type, or modify or replace an existing one, this is how you do it. To see how the current ones are implemented, look at the files in vstruct/io/; any new formats added will use the same API.


    7.1 How IO operations are loaded
    --------------------------------
    
When vstruct first sees an operation - say "p2,8" - it first breaks it down into two parts - the op ("p") and the args (2,8). It then attempts to load a handler for this operation using:

    require("vstruct.io."..op)
    
Typically, this will load the file vstruct/io/<op>.lua - in our above example, vstruct/io/p.lua. The easiest way to install new operations, then, is just to put them in the "io" directory.

Package.preload can also be used; for example, package.preload["vstruct.io.p"] can be used to override the version of 'p' that comes with vstruct.


    7.2 How they are used
    ---------------------

When loaded, the handler for an IO operation returns a table containing some or all of the following functions. Note that '...' here means the arguments as given in the format string - '2,8' in the above example.

--------
    
    hasdata()
    
Returns true if, when packing, this format consumes a value from the table of input. Formats that actually read and write data will typically return true; formats that merely adjust the output stream or internal vstruct state (eg, seek and endianness controls) or those that generate the data on the fly (skip/pad) will return false.

--------

    width(...)

Returns the exact amount of data this format will consume from the input if unpack is called, or the exact amount it will append to the output if pack is called, in bytes. If this cannot yet be determined (for example, 'z' with no arguments or any usage of 'c'), it should return nil. 

--------

    pack(fd, data, ...)
    
If width() returned a value earlier, this *must* ignore fd, pack data into a string, and return the string - the caller will handle writing the string to the fd in an efficient manner. If it did not, this function may freely choose either to return a packed string, or to manipulate fd directly (in which case it should return nil).

Some operators, of course, must manipulate fd directly by their very nature (such as seek controls); when possible, however, one should endeavour to pre-calculate widths and return packed strings.

--------

    unpack(fd, buffer, ...)
    
If width() returned a value earlier, buffer will be a string of exactly that many bytes; unpack *must* ignore fd and return the value represented by the buffer. If width() returned nil, unpack must manipulate fd directly to get the data it needs.

--------

    packbits(bit, data, ...)
    
This is called when the operation is performed inside a bitpack. Each call to bit() should pass it either 0 or 1, and it will pack that bit, MSB first. bit() does not presently accept multiple arguments to pack multiple bits at once.

--------

    unpackbits(bits, ...)

The converse of packbits. Each call to bits() returns the next bit, MSB first.


    
    8. Credits
    ==========
    
While most of the library code was written by me (Ben Kelly), the existence of this library owes itself to many others.

* The original inspiration came from Roberto Ierusalimschy's "struct" library and Luiz Henrique de Figueiredo's "lpack" library, as well as the "struct" available in Python.
* The floating point code was contributed by Peter Cawley on lua-l.
* sanooj, from #lua on freenode, has done so much testing and bug reporting that at this point he's practically a co-author; the 'struct-test-gen' module in test/ is his work.
* The overall library design and interface are the result of much discussion with rici, sanooj, Keffo, snogglethorpe, Spark, kozure, Vornicus, McMartin, and probably several others I've forgotten about on IRC (#lua on freenode and #code on nightstar).
* Finally, without Looking Glass Studios to make System Shock, and Team TSSHP (in particular Jim "hairyjim" Cameron) to reverse engineer it, I wouldn't have had a reason to write this library in the first place.

About

A Lua library for packing and unpacking binary data, supporting arbitrary (byte-aligned) widths, named fields, and repetition.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Lua 100.0%