Commit c2ffde4

More notes about the file format
Change-Id: I6d77fb2944098e6c8815a3c20871838ab3c62b67
1 parent aef4382 commit c2ffde4

1 file changed

+49
-5
lines changed

format/IPC.md

Lines changed: 49 additions & 5 deletions
@@ -41,10 +41,10 @@ record batches have a particular structure, defined next.
4141

4242
### Record batches
4343

44-
The record batch metadata is written as
45-
a flatbuffer (see format/Message.fbs -- the RecordBatch message type)
46-
prefixed by its size, followed by each of the memory buffers in the batch
47-
written end to end (with appropriate alignment and padding):
44+
The record batch metadata is written as a flatbuffer (see
45+
[format/Message.fbs][2] -- the RecordBatch message type) prefixed by its size,
46+
followed by each of the memory buffers in the batch written end to end (with
47+
appropriate alignment and padding):
4848

4949
```
5050
<int32: metadata flatbuffer size>
@@ -53,6 +53,48 @@ written end to end (with appropriate alignment and padding):
5353
<body: buffers end to end>
5454
```
5555

56+
The `RecordBatch` metadata contains a depth-first (pre-order) flattened set of
57+
field metadata and physical memory buffers (some comments from [Message.fbs][2]
58+
have been shortened / removed):
59+
60+
```
61+
table RecordBatch {
62+
length: int;
63+
nodes: [FieldNode];
64+
buffers: [Buffer];
65+
}
66+
67+
struct FieldNode {
68+
/// The number of value slots in the Arrow array at this level of a nested
69+
/// tree
70+
length: int;
71+
72+
/// The number of observed nulls. Fields with null_count == 0 may choose not
73+
/// to write their physical validity bitmap out as a materialized buffer,
74+
/// instead setting the length of the bitmap buffer to 0.
75+
null_count: int;
76+
}
77+
78+
struct Buffer {
79+
/// The shared memory page id where this buffer is located. Currently this is
80+
/// not used
81+
page: int;
82+
83+
/// The relative offset into the shared memory page where the bytes for this
84+
/// buffer start
85+
offset: long;
86+
87+
/// The absolute length (in bytes) of the memory buffer. The memory is found
88+
/// from offset (inclusive) to offset + length (non-inclusive).
89+
length: long;
90+
}
91+
```
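
To illustrate the depth-first (pre-order) flattening described above, here is a minimal Python sketch. The `Array` class and `flatten_field_nodes` function are hypothetical names invented for this example and are not part of any Arrow library; the sketch only assumes that each nested level contributes one `(length, null_count)` pair and that a parent is emitted before its children.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Array:
    """Hypothetical in-memory array; not a type from the Arrow libraries."""
    name: str
    validity: List[bool]  # True = value present, False = null
    children: List["Array"] = field(default_factory=list)


def flatten_field_nodes(array: Array) -> List[Tuple[int, int]]:
    """Emit (length, null_count) for the parent first, then each child subtree."""
    nodes = [(len(array.validity), array.validity.count(False))]
    for child in array.children:
        nodes.extend(flatten_field_nodes(child))
    return nodes


# A struct<a, b: list<item>> with 3 rows; the list child holds 4 items.
item = Array("item", [True] * 4)
root = Array("root", [True, True, True], [
    Array("a", [True, False, True]),
    Array("b", [True, True, False], [item]),
])

# Visit order is root, a, b, item.
nodes = flatten_field_nodes(root)
```

In this sketch the resulting list plays the role of the `nodes` vector in `RecordBatch`; the `buffers` vector would be flattened in the same visiting order.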
92+
93+
In the context of a file, the `page` is not used, and the `Buffer` offsets use
94+
as a frame of reference the start of the segment where they are written in the
95+
file. So, while in a general IPC setting these offsets may be anywhere in one
96+
or more shared memory regions, in the file format the offsets start from 0.
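
As a rough sketch of how a reader might apply this in the file case, the Python below slices each buffer out of a body segment using `offset` and `length`. It assumes offsets are relative to the start of the bytes passed in as `body`; the `Buffer` tuple mirrors the struct above, while `slice_buffers` is a hypothetical helper and not an Arrow API.

```python
from typing import List, NamedTuple


class Buffer(NamedTuple):
    page: int    # unused in the file format
    offset: int  # relative to the start of the body segment (assumption)
    length: int  # size in bytes


def slice_buffers(body: bytes, buffers: List[Buffer]) -> List[bytes]:
    """Return the bytes of each buffer: offset inclusive, offset + length exclusive."""
    out = []
    for buf in buffers:
        end = buf.offset + buf.length
        if buf.offset < 0 or buf.length < 0 or end > len(body):
            raise ValueError(f"buffer {buf} lies outside the body segment")
        out.append(body[buf.offset:end])
    return out


body = bytes(range(24))
buffers = [
    Buffer(page=0, offset=0, length=0),   # omitted validity bitmap (null_count == 0)
    Buffer(page=0, offset=0, length=16),  # first data buffer
    Buffer(page=0, offset=16, length=8),  # second data buffer
]
slices = slice_buffers(body, buffers)
```

A zero-length buffer yields an empty slice, which matches the convention in the `FieldNode` comment for fields that skip writing their validity bitmap.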
97+
5698
The location of a record batch and the size of the metadata block as well as
5799
the body of buffers are stored in the file footer:
58100

@@ -70,10 +112,12 @@ Some notes about this
70112
* The metadata length includes the flatbuffer size, the record batch metadata
71113
flatbuffer, and any padding bytes
72114

115+
73116
### Dictionary batches
74117

75118
Dictionary batches have not yet been implemented, although they are provided for
76119
in the metadata. For the time being, the `DICTIONARY` segments shown above in
77120
the file do not appear in any of the file implementations.
78121

79-
[1]: https://github.com/apache/arrow/blob/master/format/File.fbs
122+
[1]: https://github.com/apache/arrow/blob/master/format/File.fbs
123+
[2]: https://github.com/apache/arrow/blob/master/format/Message.fbs
