Open
Labels: area/runtime (YDB runtime issues)
Description
As noted by @vladl2802 in #11416, values are distributed unevenly between buckets during spilling.
The root cause is that we rely on a hash function here:
| auto bucketId = hash % SpilledBucketCount; |
The hash appears to come from std::hash, which for integer types just returns the value itself: https://godbolt.org/z/es8dxMGeY
The hash function is set here:
| return std::hash<T>()(value.Get<T>()); |
As a temporary measure, the bucket selection algorithm is changed from hash % 128 to XXHASH(hash) % 128 in PR #11471.
Also, with std::hash we can face compatibility issues when the MKQL_RUNTIME version changes.
So the proposal of this task is to replace std::hash with some other hash function. Hash functions to consider:
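To illustrate why an identity-like hash skews the buckets and why an extra mixing step helps, here is a minimal sketch. It is not the YDB code: `IdentityHash`, `Mix64`, and `UsedBuckets` are hypothetical names. `Mix64` is a splitmix64-style finalizer used only as a self-contained stand-in for XXHASH. The bucket count of 128 is taken from the `hash%128` expression above.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Assumption: 128 buckets, as in the hash % 128 expression from the issue.
constexpr std::size_t kBucketCount = 128;

// Hypothetical stand-in for an identity-like hash of an integer key.
inline std::uint64_t IdentityHash(std::uint64_t value) {
    return value;
}

// splitmix64-style finalizer; a stand-in for XXHASH, not the real thing.
inline std::uint64_t Mix64(std::uint64_t x) {
    x ^= x >> 30;
    x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27;
    x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

// Counts how many distinct buckets are hit by the keys 0, stride, 2*stride, ...
template <typename Hash>
std::size_t UsedBuckets(Hash hash, std::uint64_t stride, std::size_t keyCount) {
    std::array<bool, kBucketCount> used{};
    std::size_t count = 0;
    for (std::size_t i = 0; i < keyCount; ++i) {
        const std::size_t bucketId = hash(i * stride) % kBucketCount;
        if (!used[bucketId]) {
            used[bucketId] = true;
            ++count;
        }
    }
    return count;
}
```

With keys that are multiples of 128, `UsedBuckets(IdentityHash, 128, 10000)` puts every key into a single bucket, because `hash % 128` only looks at the low 7 bits. `UsedBuckets(Mix64, 128, 10000)` should spread the same keys over most or all of the buckets.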
rh hash:
| ui64 bucket = ((SelfHash ^ hash) * 11400714819323198485llu) >> capacityShift; |
xxhash: https://github.com/Cyan4973/xxHash. We already use xxhash in GraceJoin:
| XXH64_hash_t hash = XXH64(TempTuple.data() + NullsBitmapSize_, (TempTuple.size() - NullsBitmapSize_) * sizeof(ui64), 0); |
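The "rh hash" option above is a form of multiplicative (Fibonacci) hashing: multiply by a 64-bit odd constant and keep the top bits. A minimal sketch under stated assumptions follows. `FibonacciBucket`, `seed`, and `bucketBits` are hypothetical names; `seed` plays the role of `SelfHash`, and the shift is assumed to be `64 - bucketBits` so that the result lies in `[0, 2^bucketBits)`. The constant is the one from the snippet above.

```cpp
#include <cassert>
#include <cstdint>

// Constant taken from the rh hash snippet in the issue.
constexpr std::uint64_t kFibonacciMultiplier = 11400714819323198485ULL;

// Maps a hash to a bucket in [0, 2^bucketBits) using the top bits of the product.
// Assumption: the bucket count is a power of two (128 buckets means bucketBits == 7).
inline std::uint64_t FibonacciBucket(std::uint64_t hash, std::uint64_t seed, unsigned bucketBits) {
    assert(bucketBits >= 1 && bucketBits <= 63); // shifting by 64 would be undefined
    const unsigned shift = 64 - bucketBits;
    return ((seed ^ hash) * kFibonacciMultiplier) >> shift;
}
```

Unlike `hash % 128`, this does not depend only on the low bits of the input. For example, the keys 0, 128, and 256 all land in bucket 0 under identity-plus-modulo, while `FibonacciBucket(key, 0, 7)` maps them to buckets 0, 13, and 27.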