Closed
Labels
epic: A collection of issues with a certain theme
Description
Motivation
When we compact data files, the row ids change, so we have to update the index files on every compaction. Updating the index files invalidates them in the cache, which degrades query performance. If row ids were stable when rows were moved, this would not happen.
Scope
This epic makes row ids stable after moving. It does not make them stable after updates. Rows that are updated will be deleted and appended under new ids.
A future epic will cover "primary keys", at which point row ids will be stable after updates in addition to moves. This is kept out of scope for now to keep the workload of this epic manageable.
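To make the move-versus-update distinction concrete, here is a minimal sketch using a toy in-memory table. `ToyDataset` and all of its fields and methods are hypothetical names invented for illustration; they are not Lance APIs, and the real storage layout is different.

```rust
use std::collections::BTreeMap;

/// Toy dataset mapping row id -> (row address, value).
/// Hypothetical illustration only; not the Lance data model.
#[derive(Debug, Default)]
struct ToyDataset {
    rows: BTreeMap<u64, (u64, String)>,
    next_row_id: u64,
    next_address: u64,
}

impl ToyDataset {
    /// Append a row under a fresh row id and return that id.
    fn append(&mut self, value: &str) -> u64 {
        let id = self.next_row_id;
        self.rows.insert(id, (self.next_address, value.to_string()));
        self.next_row_id += 1;
        self.next_address += 1;
        id
    }

    /// Delete a row by id. Returns false if the id is not present.
    fn delete(&mut self, row_id: u64) -> bool {
        self.rows.remove(&row_id).is_some()
    }

    /// Move (compaction): rows get new, dense addresses,
    /// but every surviving row keeps its row id.
    fn compact(&mut self) {
        for (new_address, (_, entry)) in self.rows.iter_mut().enumerate() {
            entry.0 = new_address as u64;
        }
        self.next_address = self.rows.len() as u64;
    }

    /// Update: the old row is deleted and the new version is appended
    /// under a new row id, so the id is not stable across updates.
    fn update(&mut self, row_id: u64, value: &str) -> Option<u64> {
        self.rows.remove(&row_id)?;
        Some(self.append(value))
    }
}
```

In this sketch, anything keyed on a row id (such as an index entry) remains valid after `compact`, but has to be refreshed after `update`, which matches the scope stated above.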
Design
In very simple terms:
- Add row ids as auto-incrementing u64 ids. The manifest will track `max_row_id` and assign ids in a process similar to how fragment ids are assigned during the commit loop.
- Each fragment's metadata will contain a small row id index. This index maps from row id to row address. (Row address is what we currently call `_rowid`.) In most cases, such as after an append, this will be a simple range of values `(max_row_id + 1)..(physical_rows + max_row_id + 1)`. Deletion files will be superseded by tombstones contained in the row id index. This cuts down on the total number of files to manage.
- A new feature flag will be introduced to make sure older readers don't try to interpret these new row ids.
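The design above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual implementation: `Manifest`, `RowIdIndex`, and their methods are hypothetical names, the index only covers the simple-range case, and the row address encoding (fragment id in the upper 32 bits, row offset in the lower 32 bits) is an assumption made for the example.

```rust
use std::collections::BTreeSet;
use std::ops::Range;

/// Stand-in for the manifest field described above (hypothetical shape).
#[derive(Debug, Default)]
struct Manifest {
    /// Highest row id assigned so far; `None` before any row is written.
    max_row_id: Option<u64>,
}

impl Manifest {
    /// Reserve ids for `physical_rows` newly appended rows, as would happen
    /// during the commit loop, and return the assigned range
    /// `(max_row_id + 1)..(physical_rows + max_row_id + 1)`.
    fn assign_row_ids(&mut self, physical_rows: u64) -> Range<u64> {
        let start = self.max_row_id.map_or(0, |max| max + 1);
        let end = start + physical_rows;
        if physical_rows > 0 {
            self.max_row_id = Some(end - 1);
        }
        start..end
    }
}

/// Per-fragment row id index for the common "simple range" case.
#[derive(Debug)]
struct RowIdIndex {
    fragment_id: u32,
    /// Row ids stored in this fragment, in physical order.
    ids: Range<u64>,
    /// Deleted row ids, kept in the index instead of a separate deletion file.
    tombstones: BTreeSet<u64>,
}

impl RowIdIndex {
    fn new(fragment_id: u32, ids: Range<u64>) -> Self {
        Self { fragment_id, ids, tombstones: BTreeSet::new() }
    }

    /// Tombstone a row. Returns false if the id is not live in this fragment.
    fn delete(&mut self, row_id: u64) -> bool {
        self.ids.contains(&row_id) && self.tombstones.insert(row_id)
    }

    /// Map a row id to its row address, or `None` if absent or tombstoned.
    /// Assumed encoding: (fragment_id << 32) | offset within the fragment.
    fn address_of(&self, row_id: u64) -> Option<u64> {
        if !self.ids.contains(&row_id) || self.tombstones.contains(&row_id) {
            return None;
        }
        let offset = row_id - self.ids.start;
        Some(((self.fragment_id as u64) << 32) | offset)
    }
}
```

Storing a range plus a tombstone set keeps the index small after a plain append, which is the case the design expects to be most common. A fragment produced by compaction would need a more general mapping, since its row ids are no longer contiguous.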
Plan
- Create the row id index data structure #2308
- Manifest changes for stable row id (feature flag + fields) (feat: stable row id manifest changes #2363)
- Overwrite
- Update
- Create table
- Scan with row id and address (feat: scan with stable row id #2441)
- Take with stable row id (feat: support stable row ids in Dataset::take_rows() #2447)
- Create index, query (including prefilter) (feat: stable row id support in queries #2452)
- Compaction and querying (feat: move-stable row ids in compaction #2544)
The following tasks have been moved into the Primary Keys epic:
- Follow ups for stabilization
  - Replace custom bitmap implementation
  - Finalize serialization format
  - Optimize row id access given real benchmarks
- External files and cleanup
  - Write out external files if large enough
  - Cleanup implementation
Week of August 12
- Benchmark prefilter performance with stable row id (perf: stable row id prefilter #2706)
- Move on to #2454