@@ -41,10 +41,10 @@ record batches have a particular structure, defined next.

### Record batches

- The record batch metadata is written as
- a flatbuffer (see format/Message.fbs -- the RecordBatch message type)
- prefixed by its size, followed by each of the memory buffers in the batch
- written end to end (with appropriate alignment and padding):
+ The record batch metadata is written as a flatbuffer (see
+ [format/Message.fbs][2] -- the RecordBatch message type) prefixed by its size,
+ followed by each of the memory buffers in the batch written end to end (with
+ appropriate alignment and padding):

```
<int32: metadata flatbuffer size>
@@ -53,6 +53,48 @@ written end to end (with appropriate alignment and padding):
<body: buffers end to end>
```

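As a rough illustration of the layout above, the following sketch writes a size-prefixed metadata blob followed by padded buffers. The function names, the 8-byte alignment, and the little-endian size prefix are assumptions made for the example; the text only says "appropriate alignment and padding", and the metadata bytes here stand in for a real flatbuffer.

```python
import struct
from typing import Sequence


def pad_to(n: int, alignment: int = 8) -> int:
    """Bytes needed to round n up to a multiple of alignment (assumed value)."""
    return (-n) % alignment


def write_record_batch(metadata: bytes, buffers: Sequence[bytes],
                       alignment: int = 8) -> bytes:
    """Hypothetical writer: <int32 size><metadata><padding><buffers end to end>."""
    out = bytearray()
    # int32 prefix holding the metadata size (little-endian is an assumption).
    out += struct.pack("<i", len(metadata))
    out += metadata
    out += b"\x00" * pad_to(len(out), alignment)
    # Body: each buffer written end to end, padded to the alignment boundary.
    for buf in buffers:
        out += buf
        out += b"\x00" * pad_to(len(out), alignment)
    return bytes(out)


block = write_record_batch(b"META!", [b"abc", b"defghijkl"])
```

Here the prefix records the unpadded metadata size, so a reader can tell the metadata apart from the padding that follows it.
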
+ The `RecordBatch` metadata contains a depth-first (pre-order) flattened set of
+ field metadata and physical memory buffers (some comments from [Message.fbs][2]
+ have been shortened / removed):
+
+ ```
+ table RecordBatch {
+   length: int;
+   nodes: [FieldNode];
+   buffers: [Buffer];
+ }
+
+ struct FieldNode {
+   /// The number of value slots in the Arrow array at this level of a nested
+   /// tree
+   length: int;
+
+   /// The number of observed nulls. Fields with null_count == 0 may choose not
+   /// to write their physical validity bitmap out as a materialized buffer,
+   /// instead setting the length of the bitmap buffer to 0.
+   null_count: int;
+ }
+
+ struct Buffer {
+   /// The shared memory page id where this buffer is located. Currently this is
+   /// not used
+   page: int;
+
+   /// The relative offset into the shared memory page where the bytes for this
+   /// buffer starts
+   offset: long;
+
+   /// The absolute length (in bytes) of the memory buffer. The memory is found
+   /// from offset (inclusive) to offset + length (non-inclusive).
+   length: long;
+ }
+ ```
+
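To make "depth-first (pre-order) flattened" concrete, here is a small sketch that walks a nested field tree and emits one `(length, null_count)` pair per node, parent before children. The `Field` class and `flatten_nodes` function are hypothetical helpers for illustration, not part of any Arrow implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Field:
    """Hypothetical in-memory field tree (not an Arrow API)."""
    name: str
    length: int
    null_count: int
    children: List["Field"] = field(default_factory=list)


def flatten_nodes(root: Field) -> List[Tuple[int, int]]:
    """Pre-order traversal: emit the parent node, then each child subtree in order."""
    nodes = [(root.length, root.null_count)]
    for child in root.children:
        nodes.extend(flatten_nodes(child))
    return nodes


# A struct with three children; the second child is itself nested.
tree = Field("s", 3, 0, [
    Field("a", 3, 1),
    Field("b", 3, 0, [Field("item", 7, 2)]),
    Field("c", 3, 0),
])
flat = flatten_nodes(tree)
```

With this tree, the nested `item` node appears immediately after its parent `b` and before the sibling `c`, which is what distinguishes pre-order from a level-by-level ordering.
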
+ In the context of a file, the `page` is not used, and the `Buffer` offsets are
+ relative to the start of the segment where they are written in the file. So,
+ while in a general IPC setting these offsets may be anywhere in one or more
+ shared memory regions, in the file format the offsets start from 0.
+
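A minimal sketch of how a reader might resolve a `Buffer` entry in the file case, assuming the "segment" is the record batch body already loaded as a `bytes` object. The `Buffer` tuple mirrors the struct fields above; `slice_buffer` is a hypothetical helper.

```python
from typing import NamedTuple


class Buffer(NamedTuple):
    """Mirrors the Buffer struct fields; page is ignored in the file format."""
    page: int
    offset: int
    length: int


def slice_buffer(body: bytes, buf: Buffer) -> bytes:
    """Return bytes from offset (inclusive) to offset + length (exclusive)."""
    return body[buf.offset:buf.offset + buf.length]


# Example body: a 1-byte validity bitmap padded to 8 bytes, then 8 data bytes.
body = b"\xff" + b"\x00" * 7 + bytes(range(8))
bitmap = slice_buffer(body, Buffer(page=0, offset=0, length=1))
values = slice_buffer(body, Buffer(page=0, offset=8, length=8))
# A field with null_count == 0 may carry a zero-length bitmap buffer.
empty = slice_buffer(body, Buffer(page=0, offset=0, length=0))
```
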
The location of a record batch, the size of its metadata block, and the size of
its body of buffers are stored in the file footer:

@@ -70,10 +112,12 @@ Some notes about this
* The metadata length includes the flatbuffer size, the record batch metadata
  flatbuffer, and any padding bytes

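The note on metadata length can be sketched as follows: given the offset and the two lengths recorded in the footer, a reader can separate the flatbuffer from its padding and then locate the body. The function and parameter names are hypothetical, and the little-endian int32 prefix is an assumption not stated in this section.

```python
import struct
from typing import Tuple


def split_block(data: bytes, offset: int, metadata_length: int,
                body_length: int) -> Tuple[bytes, bytes]:
    """Hypothetical reader: return (metadata flatbuffer bytes, body bytes)."""
    # The metadata region covers the size prefix, the flatbuffer, and padding.
    meta_region = data[offset:offset + metadata_length]
    (fb_size,) = struct.unpack_from("<i", meta_region, 0)  # assumed little-endian
    flatbuffer = meta_region[4:4 + fb_size]
    # The body starts right after the full (padded) metadata region.
    body_start = offset + metadata_length
    body = data[body_start:body_start + body_length]
    return flatbuffer, body


# Fake file contents: 4 header bytes, then a 16-byte metadata block, then the body.
data = b"HEAD" + struct.pack("<i", 5) + b"META!" + b"\x00" * 7 + b"BODYBODY"
flatbuffer, body = split_block(data, offset=4, metadata_length=16, body_length=8)
```
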
+
### Dictionary batches

Dictionary batches have not yet been implemented, although the metadata
provides for them. For the time being, the `DICTIONARY` segments shown above in
the file do not appear in any of the file implementations.

- [1]: https://github.com/apache/arrow/blob/master/format/File.fbs
+ [1]: https://github.com/apache/arrow/blob/master/format/File.fbs
+ [2]: https://github.com/apache/arrow/blob/master/format/Message.fbs