
Dimension Vector

Jeremy H. Shi edited this page Jul 19, 2019 · 5 revisions

This documentation describes the format of the dimension vector.

V1

Dimension Vector

  1. The dimension vector holds all dimension values and their validities before the sort and reduction phases.
  2. A dimension value can be 1 (int8), 2 (int16), or 4 (int32, float) bytes wide.
  3. For each row, the dimension values are written first.
  4. Each row then has the validity bytes (1 byte per dimension value, representing true/false) following the dimension values.
  5. Each row is padded to a multiple of 4 bytes.

Dimension Row Format

| dim1 | dim2 | dim3  | validity1 | validity2 | validity3|

For example:

| city_id (uint16) | status (uint16) | vvid (uint32) |
| --- | --- | --- |
| 1 | 0 | 1 |
| 2 | 1 | null |
| 1 | null | 2 |
| null | null | 3 |

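will be packed into the following rows, with the values in the order vvid, status, city_id and nulls stored as 0:

| vvid (int32) | status (int16) | city_id (int16) | vvid validity (int8) | status validity (int8) | city_id validity (int8) | padding |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | 1 | true | true | true | padding |
| 0 | 1 | 2 | false | true | true | padding |
| 2 | 0 | 1 | true | false | true | padding |
| 3 | 0 | 0 | true | false | false | padding |

Each row in the example uses 4 + 2 + 2 + 1 * 3 = 11 bytes of data plus 5 bytes of padding, 16 bytes in total.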
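The row packing above can be sketched in Python. This is a minimal illustration, not the actual implementation: the names `pack_row_v1`, `WIDTH_CODES`, and `alignment` are hypothetical, and little-endian signed integers are an assumption (float values are left out). The alignment is kept as a parameter because the rule above asks for a multiple of 4 bytes while the worked example arrives at 16 bytes per row.

```python
import struct

# Assumed struct codes per value width; little-endian signed ints are an assumption.
WIDTH_CODES = {1: "b", 2: "h", 4: "i"}


def pack_row_v1(values, widths, alignment=4):
    """Pack one row: values first, then one validity byte per value, then padding.

    `values` holds ints or None (null). Nulls are stored as 0 with validity 0.
    """
    fmt = "<" + "".join(WIDTH_CODES[w] for w in widths) + "B" * len(widths)
    stored = [0 if v is None else v for v in values]
    validity = [0 if v is None else 1 for v in values]
    body = struct.pack(fmt, *stored, *validity)
    # Round the row size up to the next multiple of `alignment`.
    padded_size = -(-len(body) // alignment) * alignment
    return body.ljust(padded_size, b"\x00")


# Example rows, already arranged in the packed order (vvid, status, city_id).
widths = [4, 2, 2]
rows = [(1, 0, 1), (None, 1, 2), (2, None, 1), (3, None, None)]
packed = [pack_row_v1(row, widths, alignment=16) for row in rows]
```

With `alignment=16` every packed row is 16 bytes long, of which the first 11 carry values and validity bytes; with the default `alignment=4` the same row would take 12 bytes.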

V2

To optimize read performance, we rewrote the dimension vector in a column-oriented format.

What has not changed:

  1. Dimension value size (1, 2, or 4 bytes).
  2. Validity bytes: 1 byte per value, indicating whether the value is valid or null.

What has changed:

  1. Instead of a row-oriented layout, all dimension values are written first, one dimension after another (dimensions are sorted in descending order of data width).
  2. The validity bytes are written next, in the same dimension order.
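The ordering rule can be turned into byte offsets with a short sketch. The function name `layout_v2` and its return shape are hypothetical, and ties between equal-width dimensions are broken by input order here, which is only one of the valid choices.

```python
def layout_v2(dims, num_rows):
    """Return ({name: (value_offset, validity_offset)}, total_bytes).

    `dims` is a list of (name, width_in_bytes) pairs.
    """
    # sorted() is stable, so equal-width dimensions keep their input order.
    ordered = sorted(dims, key=lambda d: d[1], reverse=True)

    # First region: one value column per dimension, widest first.
    value_offsets = {}
    offset = 0
    for name, width in ordered:
        value_offsets[name] = offset
        offset += width * num_rows

    # Second region: one validity byte per row, in the same dimension order.
    layout = {}
    for name, _ in ordered:
        layout[name] = (value_offsets[name], offset)
        offset += num_rows

    return layout, offset
```

For three dimensions `city_id` (2 bytes), `status` (2 bytes), and `vvid` (4 bytes) over 4 rows, `layout_v2([("city_id", 2), ("status", 2), ("vvid", 4)], 4)` places the `vvid` values at offset 0, `city_id` at 16, and `status` at 24, with the validity bytes starting at 32 and a total size of 44 bytes.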

For example:

| city_id (uint16) | status (uint16) | vvid (uint32) |
| --- | --- | --- |
| 1 | 0 | 1 |
| 2 | 1 | null |
| 1 | null | 2 |
| null | null | 3 |

One valid packing is:

| vvid values (uint32) | city_id values (uint16) | status values (uint16) | vvid validity bytes (byte) | city_id validity bytes (byte) | status validity bytes (byte) |
| --- | --- | --- | --- | --- | --- |
| 4*4 | 2*4 | 2*4 | 1*4 | 1*4 | 1*4 |

The total number of bytes needed is 16 + 8 + 8 + 4*3 = 44.

The number of bytes needed for each row is 4 + 2 + 2 + 3 = 11.
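As a rough check of the arithmetic, the example table can be packed into a single column-oriented buffer. This is a sketch under assumptions: `pack_v2` and `WIDTH_CODES` are hypothetical names, values are encoded as little-endian signed integers, and nulls are stored as 0.

```python
import struct

# Assumed struct codes per value width; little-endian signed ints are an assumption.
WIDTH_CODES = {1: "b", 2: "h", 4: "i"}


def pack_v2(columns):
    """Pack {name: (width, [values or None])} into one column-oriented buffer."""
    # Widest dimensions first; sorted() keeps ties in their input order.
    ordered = sorted(columns.items(), key=lambda kv: kv[1][0], reverse=True)
    values = b""
    validity = b""
    for _, (width, col) in ordered:
        stored = [0 if v is None else v for v in col]
        values += struct.pack(f"<{len(col)}{WIDTH_CODES[width]}", *stored)
        validity += bytes(0 if v is None else 1 for v in col)
    # All value columns come first, followed by all validity columns.
    return values + validity


table = {
    "city_id": (2, [1, 2, 1, None]),
    "status": (2, [0, 1, None, None]),
    "vvid": (4, [1, None, 2, 3]),
}
buf = pack_v2(table)
```

The resulting buffer is 44 bytes long: bytes 0 to 31 hold the three value columns and bytes 32 to 43 hold the validity bytes, with no padding between rows.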