Index scan analysis perf

The `generateIndexScans` rule is a noticeable on several sysbench profiles. Here is `index_scan`, about 20% is `generateIndexScans`:

<img width="1606" alt="image" src="https://github.com/dolthub/dolt/assets/18337807/92e64697-242b-46d5-81f0-48a5e781eb9a">

Secondary to making analysis faster, indexScan selection should be a costed choice. Having the memo include single-table relations and generate index scans for comparison would be a useful improvement.

With these two goals in mind, here is what `generateIndexScans` does:
- Find the set of tables with filters available for index (range) scans
- Load the table indexes and columns into a format amenable to filter comparison
- Load the filters into a format amenable for column comparisons
- Find all index scan options for each tablescan where the index range has selectivity < 1.
- Merge/simplify filter ranges
- Pick the index scan that reads the least amount of data from disk.

## Consideration 1a: Expression simplification could be upstream

We use a red/black tree to build efficient range representations for index scans. The way this is implemented is subject to a lot of duplicate work, because we will build a red black tree for all subsets of filters trying to find ranges that 1) satisfy index conditions, 2) with the fewest/smallest ranges.

It would be nice if filter expressions could be simplified/normalized upstream.

## Consideration 1b: Range merging could be pushed further into Dolt

GMS currently does a lot of work to convert filters into range expressions. Range expressions are not executable, and have to be converted to Dolt ranges on the integrator side at execution. In a perfect information world, Dolt would judge a set of range filters based on what chunks they would actually scan. Having Dolt read the beginning/end boundaries of each chunk during analysis might be too expensive, but there is probably a middle ground where a statistics tracker already has this information and can create an `sql.IndexRange` that embeds dolt-specific logic for reading chunk ranges and costing a range scan. This would ideally replace the red/black tree logic. The relevant sysbench queries are `select_random_ranges` and `select_random_points`.

These two considerations are speculative, and probably not in any first-pass.

## Consideration 2: Generate index scans in memo

`generateIndexScans` could just be another memo rule that makes new physical expressions for indexScans. The indexScans would not be allowed for use in the RHS of certain join types, but otherwise costing would work the same.

The memo is going to be fast at caching representations of columns that makes it easy to determine which set of filters can be used with which indexes. Part of the speedup will be avoiding reading indexes from disk multiple times (a PR that transfers the applicability logic with minimal changes). Maybe bit arithmetic can avoid nested index lookups, string comparisons, and awkward "index merging" for combining multi-arity index filters (PR that tries to choose indexes in a smarter way).

One barrier is that the memo does not support single-relation expressions right now. The memo is nested awkwardly in `reorderJoins` for historical reasons that side-step join planning conflicting with other rules.

## Potential plan:

Reorder joins would, perhaps confusingly, need to build the memo for most single-relation queries (the rule name could change, `exploreCostedPlans` or something). All of the actual reordering logic would need to be disabled. GroupBy and Window would need new memo nodes. Hopefully this would be an innocuous change, GroupBy and Window become pass-through expression groups, the reordering step is skipped, and single-relations are built correctly exiting the memo to resume as plan nodes. If that refactor succeeds, `generateIndexScans` could then be removed and rewritten inside. We would need new nodes to represent indexScans separately from index lookups. Costing would need to trade-off tables with various indexScan options. Some queries with edge case nodes failed to trigger `generateIndexScans` is an acceptable short term trade-off. `applyIndexesFromOuterScope` should probably still work and not be rewritten in the same PR that rewrites `generateIndexScans`.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Index scan analysis perf #6553

Consideration 1a: Expression simplification could be upstream

Consideration 1b: Range merging could be pushed further into Dolt

Consideration 2: Generate index scans in memo

Potential plan:

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Index scan analysis perf #6553

Description

Consideration 1a: Expression simplification could be upstream

Consideration 1b: Range merging could be pushed further into Dolt

Consideration 2: Generate index scans in memo

Potential plan:

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions