This repository was archived by the owner on Jun 14, 2024. It is now read-only.

[FEATURE REQUEST]: Introduce a similarity threshold for Hybrid Scan #159

@sezruby

Feature requested

As a developer, I want to introduce a dataset similarity threshold for Hybrid Scan in order to avoid performance regressions.

Currently, Hybrid Scan considers an index a candidate if the input relation includes one or more of the index's source files. However, when the "diff" dataset is much larger than the common dataset, the shuffling overhead for the "diff" dataset might outweigh the benefit from the index.

To avoid regressions caused by Hybrid Scan, we could introduce a threshold so that indexes covering only a small portion of the given relation's data are not considered candidate indexes.

The threshold could be based on:

  • the number of common files (only works for append-only datasets)
  • the total size of common files

We could make the threshold ("N% similarity") adjustable through a new Hyperspace config and set a proper default value based on experimental results.
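To make the size-based option concrete, here is a minimal sketch in Python. It is not Hyperspace code: the function names, the `{path: size}` representation of file listings, and the rule that a file counts as "common" only when both path and size match are all assumptions made for illustration.

```python
from typing import Dict


def common_bytes_ratio(relation_files: Dict[str, int],
                       index_source_files: Dict[str, int]) -> float:
    """Fraction of the relation's bytes that the index also covers.

    Both arguments map file path -> size in bytes (hypothetical layout).
    """
    total = sum(relation_files.values())
    if total == 0:
        return 0.0
    # Assumption: a file is "common" only if path and size both match.
    common = sum(size for path, size in relation_files.items()
                 if index_source_files.get(path) == size)
    return common / total


def is_hybrid_scan_candidate(relation_files: Dict[str, int],
                             index_source_files: Dict[str, int],
                             min_similarity: float) -> bool:
    """True if the index covers at least `min_similarity` of the relation."""
    return common_bytes_ratio(relation_files, index_source_files) >= min_similarity


# Example: the index covers 200 of 1000 bytes, so similarity is 20%.
relation = {"part-0": 100, "part-1": 100, "part-2": 800}
index_sources = {"part-0": 100, "part-1": 100}
ratio = common_bytes_ratio(relation, index_sources)
```

With a 50% threshold this index would be skipped, while a 10% threshold would keep it as a candidate. A file-count variant would replace the byte sums with counts, subject to the append-only limitation noted above.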

Acceptance criteria

  • The threshold should be configurable by HyperspaceConf.
  • The default value should be determined based on experimental results using a large dataset.
  • Calculating the threshold should finish in a reasonable time.

Success criteria

  • The measured overhead of calculating the threshold.
  • The experimental results supporting the default value.

Additional context

It would be good to log the reason whenever Hybrid Scan is not applied.
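One possible shape for such a log message, sketched in Python with the standard `logging` module. The logger name, function name, and message wording are hypothetical and not taken from Hyperspace.

```python
import logging
from typing import Optional

logger = logging.getLogger("hybrid_scan_sketch")  # hypothetical logger name


def log_if_skipped(index_name: str,
                   similarity: float,
                   min_similarity: float) -> Optional[str]:
    """Log and return the reason an index is skipped, or None if it is kept."""
    if similarity >= min_similarity:
        return None
    message = (
        f"Hybrid Scan not applied for index '{index_name}': "
        f"similarity {similarity:.1%} is below threshold {min_similarity:.1%}"
    )
    logger.info(message)
    return message
```

Including both the measured similarity and the configured threshold in the message would let users see how far an index is from qualifying and decide whether to adjust the config.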
