This repository was archived by the owner on Jun 14, 2024. It is now read-only.
As a developer, I want to introduce a dataset similarity threshold for Hybrid Scan in order to avoid performance regressions.
Currently, Hybrid Scan considers an index a candidate if the input relation shares at least one source file with the index. However, when the "diff" dataset is much larger than the common dataset, the shuffling overhead for the "diff" dataset might outweigh the benefit of using the index.
To avoid this regression, we could introduce a threshold so that an index covering too small a portion of the given relation's data is not considered a candidate index.
The threshold might be based on either:
the number of common files; only works for append-only dataset
the total size of common files
We could then make the threshold ("N% similarity") adjustable through a new Hyperspace config and set a proper default value based on experimental results.
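The two threshold options above can be sketched as follows. This is only an illustration in Python, not Hyperspace code: the function names, the `min_ratio` parameter, and the assumption that a file is identified by its path alone are all hypothetical.

```python
from typing import Dict, Set


def common_count_ratio(relation_files: Dict[str, int], index_files: Set[str]) -> float:
    """Fraction of the relation's files that are also source files of the index."""
    if not relation_files:
        return 0.0
    common = sum(1 for path in relation_files if path in index_files)
    return common / len(relation_files)


def common_bytes_ratio(relation_files: Dict[str, int], index_files: Set[str]) -> float:
    """Fraction of the relation's bytes that are covered by the index's source files."""
    total = sum(relation_files.values())
    if total == 0:
        return 0.0
    common = sum(size for path, size in relation_files.items() if path in index_files)
    return common / total


def is_candidate(relation_files: Dict[str, int], index_files: Set[str], min_ratio: float) -> bool:
    """Hypothetical check: keep the index only if enough bytes are shared."""
    return common_bytes_ratio(relation_files, index_files) >= min_ratio


# Relation: path -> size in bytes. One large appended file dominates the data.
relation = {"part-0": 100, "part-1": 100, "part-2": 800}
# Source files recorded for the index (hypothetical paths).
index_sources = {"part-0", "part-1", "part-old"}

count_ratio = common_count_ratio(relation, index_sources)
bytes_ratio = common_bytes_ratio(relation, index_sources)
candidate = is_candidate(relation, index_sources, min_ratio=0.5)
```

In this example two of the three files are shared, yet they hold only a fifth of the bytes, so a size-based threshold of 0.5 rejects the index while a count-based one would accept it. That difference is one reason the size-based measure may be the safer choice when file sizes vary widely.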
Acceptance criteria
The threshold should be configurable through HyperspaceConf.
The default value should be determined from experimental results on a large dataset.
Calculating the similarity for the threshold check should finish in a reasonable time.
Success criteria
The measured overhead of the threshold calculation.
The experimental results supporting the default value.
Additional context
It would be good to log the reason whenever Hybrid Scan is not applied.
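As a rough sketch of what such a log message could contain, the Python snippet below returns the decision together with a human-readable reason and logs it. The function name, logger name, message wording, and parameters are assumptions made for illustration and do not reflect Hyperspace's actual logging.

```python
import logging
from typing import Tuple

logger = logging.getLogger("hybrid_scan_sketch")  # hypothetical logger name


def check_hybrid_scan(index_name: str, common_bytes: int, total_bytes: int,
                      min_ratio: float) -> Tuple[bool, str]:
    """Return (applied, reason); reason is empty when Hybrid Scan is applied."""
    ratio = common_bytes / total_bytes if total_bytes else 0.0
    if ratio < min_ratio:
        reason = (
            f"Hybrid Scan not applied for index '{index_name}': "
            f"common data ratio {ratio:.2f} is below threshold {min_ratio:.2f}"
        )
        logger.info(reason)
        return False, reason
    return True, ""


applied, reason = check_hybrid_scan("idx_sales", 200, 1000, min_ratio=0.5)
```

Including the index name, the computed ratio, and the configured threshold in the message should make it easier to tell whether the default value needs tuning for a given workload.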