[diskann-wide] Optimize load_simd_first for 8-bit and 16-bit element types.#747
hildebrandmw merged 3 commits into main.
Pull request overview
Optimizes partial SIMD loads on x86_64::V3 for u8/i8 and u16 element types by replacing the previous cascaded load-chain logic with overlapping-load strategies that preserve the “no out-of-bounds access” safety contract while improving throughput in distance-function epilogues.
Changes:
- Added a new helper to efficiently load `(8, 16)` bytes using two 8-byte loads + `pshufb` (`_mm_shuffle_epi8`).
- Reworked `__load_first_of_16_bytes` to use the new helper for `first > 8` and overlapping GP-register reads for `first <= 8`.
- Reworked `__load_first_u16_of_16_bytes` to use the new helper for `bytes > 8` and GP-register reads for `bytes <= 8`, removing prior masked-load/insert logic.
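To illustrate the "overlapping GP-register reads" idea for 8 bytes or fewer, here is a minimal sketch in safe, portable Rust. The function name `load_first_up_to_8` and the exact case split are assumptions made for illustration and are not taken from the PR. The point is that two fixed-width reads, one anchored at the start and one anchored at the end of the valid range, cover every length without touching memory past the slice.

```rust
/// Hypothetical helper (not the PR's code): packs the first `src.len()` bytes
/// (at most 8) into a u64 in little-endian byte order, zero-filling the rest.
/// Every read stays inside `src`.
fn load_first_up_to_8(src: &[u8]) -> u64 {
    let n = src.len();
    assert!(n <= 8);

    // Fixed-width little-endian reads at a given offset.
    let u16_at = |i: usize| u16::from_le_bytes([src[i], src[i + 1]]) as u64;
    let u32_at =
        |i: usize| u32::from_le_bytes([src[i], src[i + 1], src[i + 2], src[i + 3]]) as u64;

    match n {
        0 => 0,
        1 => src[0] as u64,
        // Two 16-bit reads: one at the start, one ending at the last valid byte.
        // Bytes covered by both reads agree, so OR-ing them is harmless.
        2..=3 => u16_at(0) | (u16_at(n - 2) << (8 * (n - 2))),
        // Two 32-bit reads cover every length from 4 to 8 in the same way.
        _ => u32_at(0) | (u32_at(n - 4) << (8 * (n - 4))),
    }
}
```

For a 5-byte input, the first read covers bytes 0..4 and the second covers bytes 1..5, shifted left by 8 bits so that it lands in its final position.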
The benchmark results can be reproduced locally by using the following JSON input file:
{
"search_directories": [],
"jobs": [
{
"type": "simd-op",
"content": {
"query_type": "uint8",
"data_type": "uint8",
"arch": "x86-64-v3",
"runs": [
{
"distance": "squared_l2",
"dim": 100,
"num_points": 50,
"loops_per_measurement": 5000,
"num_measurements": 100
},
{
"distance": "squared_l2",
"dim": 101,
"num_points": 50,
"loops_per_measurement": 5000,
"num_measurements": 100
},
{
"distance": "squared_l2",
"dim": 102,
"num_points": 50,
"loops_per_measurement": 5000,
"num_measurements": 100
},
{
"distance": "squared_l2",
"dim": 103,
"num_points": 50,
"loops_per_measurement": 5000,
"num_measurements": 100
},
{
"distance": "squared_l2",
"dim": 104,
"num_points": 50,
"loops_per_measurement": 5000,
"num_measurements": 100
},
{
"distance": "squared_l2",
"dim": 105,
"num_points": 50,
"loops_per_measurement": 5000,
"num_measurements": 100
},
{
"distance": "squared_l2",
"dim": 128,
"num_points": 50,
"loops_per_measurement": 5000,
"num_measurements": 100
},
{
"distance": "squared_l2",
"dim": 160,
"num_points": 50,
"loops_per_measurement": 5000,
"num_measurements": 100
}
]
}
},
{
"type": "simd-op",
"content": {
"query_type": "float16",
"data_type": "float16",
"arch": "x86-64-v3",
"runs": [
{
"distance": "squared_l2",
"dim": 100,
"num_points": 50,
"loops_per_measurement": 5000,
"num_measurements": 100
},
{
"distance": "squared_l2",
"dim": 101,
"num_points": 50,
"loops_per_measurement": 5000,
"num_measurements": 100
},
{
"distance": "squared_l2",
"dim": 102,
"num_points": 50,
"loops_per_measurement": 5000,
"num_measurements": 100
},
{
"distance": "squared_l2",
"dim": 103,
"num_points": 50,
"loops_per_measurement": 5000,
"num_measurements": 100
},
{
"distance": "squared_l2",
"dim": 104,
"num_points": 50,
"loops_per_measurement": 5000,
"num_measurements": 100
},
{
"distance": "squared_l2",
"dim": 105,
"num_points": 50,
"loops_per_measurement": 5000,
"num_measurements": 100
},
{
"distance": "squared_l2",
"dim": 128,
"num_points": 50,
"loops_per_measurement": 5000,
"num_measurements": 100
},
{
"distance": "squared_l2",
"dim": 160,
"num_points": 50,
"loops_per_measurement": 5000,
"num_measurements": 100
},
{
"distance": "inner_product",
"dim": 100,
"num_points": 50,
"loops_per_measurement": 5000,
"num_measurements": 100
},
{
"distance": "inner_product",
"dim": 101,
"num_points": 50,
"loops_per_measurement": 5000,
"num_measurements": 100
},
{
"distance": "inner_product",
"dim": 102,
"num_points": 50,
"loops_per_measurement": 5000,
"num_measurements": 100
},
{
"distance": "inner_product",
"dim": 103,
"num_points": 50,
"loops_per_measurement": 5000,
"num_measurements": 100
},
{
"distance": "inner_product",
"dim": 104,
"num_points": 50,
"loops_per_measurement": 5000,
"num_measurements": 100
},
{
"distance": "inner_product",
"dim": 105,
"num_points": 50,
"loops_per_measurement": 5000,
"num_measurements": 100
},
{
"distance": "inner_product",
"dim": 128,
"num_points": 50,
"loops_per_measurement": 5000,
"num_measurements": 100
},
{
"distance": "inner_product",
"dim": 160,
"num_points": 50,
"loops_per_measurement": 5000,
"num_measurements": 100
}
]
}
}
]
}
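The `runs` arrays above repeat the same entry with only `distance` and `dim` changing. As a convenience, the following sketch generates such an array with plain string formatting. The function name `runs_json` is hypothetical, the fixed values are copied from the input file above, and nothing here is part of the benchmark tool itself.

```rust
/// Hypothetical helper: builds the JSON text of a "runs" array for one
/// distance function and a list of dimensions, using the same fixed
/// measurement parameters as the input file shown above.
fn runs_json(distance: &str, dims: &[usize]) -> String {
    let entries: Vec<String> = dims
        .iter()
        .map(|dim| {
            format!(
                "{{\"distance\": \"{}\", \"dim\": {}, \"num_points\": 50, \
                 \"loops_per_measurement\": 5000, \"num_measurements\": 100}}",
                distance, dim
            )
        })
        .collect();
    format!("[{}]", entries.join(", "))
}
```

Calling `runs_json("squared_l2", &[100, 101, 102, 103, 104, 105, 128, 160])` yields the contents of the first job's `runs` field, up to whitespace.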
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #747 +/- ##
==========================================
- Coverage 89.01% 89.00% -0.01%
==========================================
Files 428 428
Lines 78294 78295 +1
==========================================
- Hits 69691 69687 -4
- Misses 8603 8608 +5
arkrishn94 left a comment:
LGTM, cool trick with the _load_8_to_16_bytes logic.
## What's Changed
### API Breaking Changes
* Remove the `experimental_avx512` feature. by @hildebrandmw in #732
* Use VirtualStorageProvider::new_overlay(test_data_root()) in tests by @Copilot in #726
* save and load max_record_size and leaf_page_size for bftrees by @backurs in #724
* [multi-vector] Verify `Standard` won't overflow in its constructor. by @hildebrandmw in #757
* VirtualStorageProvider: Make new() private, add new_physical by @Copilot in #764
* [minmax] Refactor full query by @arkrishn94 in #770
* Bump diskann-quantization to edition 2024. by @hildebrandmw in #772
### Additions
* [multi-vector] Enable cloning of `Mat` and friends. by @hildebrandmw in #759
* adding bftreepaths in mod.rs by @backurs in #775
* [quantization] Add `as_raw_ptr`. by @hildebrandmw in #774
### Bug Fixes
* Fix `diskann` compilation without default-features and add CI tests. by @hildebrandmw in #722
### Docs and Comments
* Updating the benchmark README to use diskann-benchmark by @bryantower in #709
* Fix doc comment: Windows line endings are \r\n not \n\r by @Copilot in #717
* Fix spelling errors in streaming API documentation by @Copilot in #715
* Add performance diagnostic to `diskann-benchmark` by @hildebrandmw in #744
* Add agents.md onboarding guide for coding agents by @Copilot in #765
* [doc] Fix lots of little typos in `diskann-wide` by @hildebrandmw in #771
### Performance
* [diskann-wide] Optimize `load_simd_first` for 8-bit and 16-bit element types. by @hildebrandmw in #747
### Dependencies
* Bump bytes from 1.11.0 to 1.11.1 by @dependabot[bot] in #723
* [diskann] Add note on the selection of `PruneKind` in `graph::config::Builder`. by @hildebrandmw in #734
* [diskann-providers] Remove the LRU dependency and make `vfs` and `serde_json` optional. by @hildebrandmw in #733
### Infrastructure
* Add initial QEMU tests for `diskann-wide`. by @hildebrandmw in #719
* [CI] Skip coverage for Dependabot. by @hildebrandmw in #725
* Add miri test coverage to CI workflow by @Copilot in #729
* [CI] Add minimal ARM checks by @hildebrandmw in #745
* Enable CodeQL security analysis by @Copilot in #754
## New Contributors
* @backurs made their first contribution in #724
* @arkrishn94 made their first contribution in #770

**Full Changelog**: 0.45.0...0.46.0
Optimize `SIMDVector::load_simd_first` for the `u8`, `i8` and `u16` data types on the `x86_64::V3` architecture.

These types use the `__load_first*` algorithms since AVX2 does not have masked loads for 8/16-bit types. The current implementation uses a cascaded load-chain to ensure the safety contract is upheld. This results in a lot of fiddly conditional logic.

This new implementation uses at most 2 data loads (plus sometimes one more load from a const variable for the shuffle-mask) to avoid the data-dependent chain and avoids using the `u128` type directly, which saves a bunch of LLVM register shenanigans.

These functions are called in the epilogue handling of many distance function implementations.
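The "two data loads plus a shuffle mask" approach for 9 to 16 bytes can be sketched as follows. This is an illustrative reconstruction and not the code from the PR: the names `shuffle_mask`, `load_9_to_16_ssse3` and `load_first_9_to_16` are hypothetical, the mask is built at run time here rather than read from a const table, and a scalar fallback is included only so the sketch runs on machines without SSSE3.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Shuffle control for an `n`-byte load (9 <= n <= 16). Result byte `i` comes
/// from the low load for `i < 8` and from the overlapping high load otherwise.
/// Index 0x80 makes `pshufb` write a zero byte.
fn shuffle_mask(n: usize) -> [u8; 16] {
    let mut mask = [0x80u8; 16];
    for i in 0..n {
        mask[i] = if i < 8 { i as u8 } else { (i + 16 - n) as u8 };
    }
    mask
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "ssse3")]
unsafe fn load_9_to_16_ssse3(src: &[u8]) -> [u8; 16] {
    let n = src.len();
    let mask = shuffle_mask(n);
    let mut out = [0u8; 16];
    unsafe {
        let p = src.as_ptr();
        // Load 1: bytes [0, 8). Load 2: bytes [n - 8, n). Both lie inside `src`.
        let lo = _mm_loadl_epi64(p as *const __m128i);
        let hi = _mm_loadl_epi64(p.add(n - 8) as *const __m128i);
        // Place `lo` in the low half and `hi` in the high half of one register.
        let v = _mm_unpacklo_epi64(lo, hi);
        // Move the high bytes into position and zero the lanes at or beyond `n`.
        let m = _mm_loadu_si128(mask.as_ptr() as *const __m128i);
        _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, _mm_shuffle_epi8(v, m));
    }
    out
}

/// Returns the first `src.len()` bytes zero-padded to 16 bytes.
pub fn load_first_9_to_16(src: &[u8]) -> [u8; 16] {
    assert!((9..=16).contains(&src.len()));
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("ssse3") {
            return unsafe { load_9_to_16_ssse3(src) };
        }
    }
    // Scalar fallback so the sketch also runs without SSSE3.
    let mut out = [0u8; 16];
    out[..src.len()].copy_from_slice(src);
    out
}
```

Neither load depends on the result of the other, which is the property the description contrasts with the earlier cascaded load-chain.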
Performance results are below. This is a pretty clear win for the 8-bit case. It appears to be kind of a wash for the 16-bit case though.