[FEATURE REQUEST]: Introduce a similarity threshold for Hybrid Scan

**Feature requested**

As a developer, I want to introduce a dataset similarity threshold for Hybrid Scan in order to avoid performance regressions.

Currently, Hybrid Scan considers an index a candidate if the input relation includes at least one of the index's source files. However, when the "diff" dataset (the files not covered by the index) is much larger than the common dataset, the shuffling overhead for the "diff" dataset might outweigh the benefit of using the index.

To avoid a regression caused by Hybrid Scan, we could introduce a threshold so that an index covering too small a portion of the given relation's data is not considered a candidate index.

The threshold could be based on:
- the number of common files (only works for append-only datasets)
- the total size of common files
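
To make the two options concrete, here is a minimal sketch in plain Python (not Hyperspace code). It assumes each file is described by its path and size in bytes, that a file counts as "common" only when both match, and that similarity is measured relative to the input relation. The function names are hypothetical.

```python
from typing import Dict


def _common_paths(relation_files: Dict[str, int], index_files: Dict[str, int]) -> set:
    # Assumption: a file is "common" when path and size both match.
    return {p for p, size in relation_files.items() if index_files.get(p) == size}


def common_file_ratio(relation_files: Dict[str, int], index_files: Dict[str, int]) -> float:
    """Share of the relation's files (by count) that the index also covers."""
    if not relation_files:
        return 0.0
    return len(_common_paths(relation_files, index_files)) / len(relation_files)


def common_bytes_ratio(relation_files: Dict[str, int], index_files: Dict[str, int]) -> float:
    """Share of the relation's bytes that the index also covers."""
    total = sum(relation_files.values())
    if total == 0:
        return 0.0
    common = _common_paths(relation_files, index_files)
    return sum(relation_files[p] for p in common) / total


# Illustrative data: one large appended file dominates the relation.
relation = {"a.parquet": 100, "b.parquet": 100, "c.parquet": 800}
index_source = {"a.parquet": 100, "b.parquet": 100}

count_ratio = common_file_ratio(relation, index_source)
bytes_ratio = common_bytes_ratio(relation, index_source)
```

With this illustrative input, two of the three files are common, but they hold only 200 of the 1000 bytes, so the two measures can disagree substantially when file sizes are skewed.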

We could express the threshold as "N% similarity", make it adjustable through a new Hyperspace config, and set a proper default value based on experimental results.
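
As a rough sketch of how such a threshold might be applied, the Python snippet below filters candidate indexes by a precomputed similarity ratio. The function name, the index names, and the value `0.5` are hypothetical placeholders; the actual config key and default are still to be decided from experiments.

```python
from typing import Dict, List


def filter_candidates(similarity_by_index: Dict[str, float], threshold: float) -> List[str]:
    """Keep only the indexes whose similarity ratio meets the threshold.

    `threshold` is a fraction in [0, 1], i.e. "N% similarity" divided by 100.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(f"threshold must be within [0, 1], got {threshold}")
    return [name for name, ratio in similarity_by_index.items() if ratio >= threshold]


# Placeholder value for illustration only, not a proposed default.
PLACEHOLDER_THRESHOLD = 0.5

candidates = filter_candidates({"idx1": 0.9, "idx2": 0.2}, PLACEHOLDER_THRESHOLD)
```

Here only `idx1` remains a candidate, because `idx2` covers just 20% of the relation.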

**Acceptance criteria** 

- [x] The threshold should be configurable via `HyperspaceConf`.
- [x] The default value should be determined from experimental results on a large dataset.
- [x] Calculating the threshold should finish in a reasonable time.

**Success criteria**

- [ ] The overhead of calculating the threshold.
- [ ] The experimental results for the default value.

**Additional context**

It would be good to log the reason whenever Hybrid Scan is not applied.
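
One possible shape for such a log message, sketched in Python with the standard `logging` module. The function name, logger name, and message wording are assumptions for illustration, not existing Hyperspace behavior.

```python
import logging
from typing import Tuple

logger = logging.getLogger("hybrid_scan_sketch")  # hypothetical logger name


def explain_hybrid_scan_decision(index_name: str, similarity: float, threshold: float) -> Tuple[bool, str]:
    """Return (applied, message); log the message when the index is skipped."""
    if similarity >= threshold:
        return True, f"Hybrid Scan applicable for index '{index_name}'"
    message = (
        f"Hybrid Scan not applied for index '{index_name}': "
        f"similarity {similarity:.1%} is below threshold {threshold:.1%}"
    )
    logger.info(message)
    return False, message


applied, reason = explain_hybrid_scan_decision("idx2", 0.2, 0.5)
```

Including both the measured similarity and the configured threshold in the message lets users see whether adjusting the config would change the outcome.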
