Vortex Layouts

[edit] I've hijacked the "Refactor VortexFileWriter" issue to cover the broader work towards vortex layouts.

- [ ] Add better Debug impl for LayoutData that shows the full tree. Requires a visitor.
- [ ] Expression splitting (so we can push-down expressions to struct fields and recombine the results)
- [ ] Layout statistics for a given row mask, including whether the results are exact / inexact.
- [ ] Change the segment read/write APIs to use aligned ByteBuffer. Have ArrayParts choose a reasonable alignment.
- [ ] Configure the writer to include/exclude padding
- [ ] Make it easy to run segment reader against memory mapped file
- [ ] Cache segments
- [ ] Coalesce segments
- [ ] Add option to dedupe segments at write time
- [ ] Add option to LZ4 compress flatbuffers

**The description here is WIP**

## Vortex Layouts

Vortex Layouts should be seen as symmetrical to Vortex Arrays, except that the data is stored externally and they can be lazily queried with push-down pruning. They are fully independent of the storage system, requiring only a key(u32)/value(bytes) API.

A secondary benefit of separating the layout logic from storage, is that we remove all I/O from this code path allowing us to have much better control over when, where and how we interact with I/O (as well as removing all traces of async Rust from the layout logic).

### Built-in Layouts

Our initial prototype ships with three layouts:
* **Flat** - an atomic array (of any DType). The serialized form does not include the dtype (to prevent duplication).
* **Chunked** - row-based partitioning of an array.
* **Struct** - col-based partitioning of a struct array.

Additional layouts for the future:
* **Dictionary** - pull out a shared dictionary across another child layout, e.g. store values as flat layout, store codes in a chunked layout to use one dictionary across all chunks of a column.
  * See https://github.com/spiraldb/vortex/issues/252
* **FlatStruct** - same as Struct, except with flattened nested fields.

As with everything in Vortex, these will be extensible by consumers of the Rust library.

### Layout Strategies

The power of layouts is in how we choose them. I think by default Vortex will ship with a `StructOfChunks` layout strategy that first splits into columns, then each column is independently chunked based on a target byte size (rather than row-count). This can be pretty useful when storing columns with large values. It means the returned batches are sized based on the data that's scanned, and doesn't suffer the write-time skew that e.g. Parquet might introduce with its "ChunksOfStruct" (row-groups of columns) strategy.

A `LayoutStrategy` is fed a series of `ArrayData` batches that represent a row-wise chunk of an array. Each strategy may manipulate, buffer, split, this batch however it likes and pass it to some child strategies, eventually culminating in a FlatStrategy that simply serializes the data. At the end of this process, the layout strategy returns the `LayoutData` (remember the analogy to `ArrayData`?).

### Layout Scan

Scanning a layout is sort of analogous to an array compute function. It's probably the _only_ "compute function" we support on layouts for a while (maybe we support sum/min/max etc?).

From a `LayoutData`, we construct a `LayoutScan`, and from this we construct a `RangeScan` per horizontal "split" of the file that we wish to read. This two-level approach is so for a single scan operator layouts can share some state across ranges, e.g. DictLayout would want to share the canonicalized values array.

---

The old issue:

 Tracking a bunch of breaking changes to the format and refactoring of the writer

- [ ] VortexFileWriter to pass chunks into an abstract LayoutWriter.
  - We ship a default LayoutWriter that is struct-of-chunked.
- [ ] Impl LayoutWriter for struct, chunked, flat, and later dict layouts.
- [ ] Layout flat buffer to hold BlockIds instead of offset/length. This allows Layout to be abstract over the storage format and not assume linear bytes (e.g. can store in an arbitrary block device).
- [ ] Postscript to point to DType + FileLayout and not assume contiguous bytes.
- [ ] Ensure we don't write extra padding / framing, e.g. we sometimes use IPCMessage and MessageWriter for writing to the file, even though this adds framing for a streaming format.
- [ ] Allow configurable padding in the file to support mem-mapping with zero-copy. 
  - See #115
- [ ] Allow configurable compression for buffers.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vortex Layouts #1676

Vortex Layouts

Built-in Layouts

Layout Strategies

Layout Scan

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Vortex Layouts #1676

Description

Vortex Layouts

Built-in Layouts

Layout Strategies

Layout Scan

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions