Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow users to influence sorting of user-supplied completions #4813

Closed
wants to merge 16 commits into from

Conversation

krobelus
Copy link
Contributor

@krobelus krobelus commented Dec 22, 2022

This depends on other PRs for convenience, can drop them:


This adds an optional "priority" field to completion candidates, which affects
fuzzy match scoring. Example usage is in kakoune-lsp/kakoune-lsp#657.
I haven't tested this much so the scoring math might not be perfect.

May close #1709

@krobelus krobelus changed the title Allow user-specified sorting of completions Allow users to influence sorting of user-supplied completions Dec 22, 2022
…oundaries only at word start

This implies a bunch of renamings but the following commits will
replace count_word_boundaries_match() anyway, so I'll do that on-the-fly.

The only functional change in this patch is that something like "/"
at the start of a string no longer counts as word start. That should
hardly matter.
We always use the same query string when comparing two candidate
strings. Let's remove the redundant query literals before we add
more tests.
I broke most of these while working on the new implementation.
Some are redundant though.
…ters

We use candidate/query in the constructor. The next patch adds the
same names in the subsequence scoring function.  Let's use the same
terminology in the subsequence matching function.

While at it, rename the subsequence matching function to prepare
for adding an optimal subsequence matching.
This allows using CharCount in operator[] instead of casting to size_t
every time. I'm afraid that some other operations like "end()" are
not supported yet with types like CharCount; for some reason adding
a cast there didn't work. I got some error about not being able to
convert CharCount to size_t.
…ions

When using fuzzy completion with candidates from "git ls-files" in
a project with many files I often get unexpected ordering even when
one candidate is clearly the best.
This usually happens if both candidates are the same kind of match
(typically only a subsequence match), which means that the leftmost
match wins (via max_index), which is often wrong.
Also there are some other matching heuristics like FirstCharMatch
that can lead to unexpected ordering by dominating everything else
(for example with query "remcc" we rank "README.asciidoc" over
"src/remote.cc").

Fix these two issues by
1. switching to the Gotoh algorithm[1], a refinement of the
   Smith-Waterman algorithm that is optimized for affine gap
   penalties[2].  This makes us  find an optimal match instead of
   the leftmost one.
2. dropping old heuristics that are obsoleted by the new one.

Optimality is defined by a new distance heuristic which favors longer
matching subsequences (as opposed to multiple shorter ones) by adding
a penalty to each new gap between matches.

For most of our RankedMatch test cases, we match the behavior of
other popular fuzzy matchers.

The algorithm needs quadratic time/space but that's okay, since
our input is expected to be small. For example, candidates for
insert-mode completion are usually below max_word_len=50. Second,
if there's ever a very large input candidate, only match against the
first 1000 characters.  Third, a follow-up commit will switch to the
linear space variant.

Every successful match adds a temporary heap allocation.  This feels
bad but it's still dominated by the number of string allocations,
so there's not much reason to optimize that.

Besides, with the following patches, it's fast enough for me.

In src/diff.h we have an implementation of a similar algorithm, but
that one is optimized for matching strings with small differences.
It is also not guaranteed to find the optimal match.

Closes mawww#3806

[1]: as described in "An Improved Algorithm for Matching Biological Sequences", see
     https://courses.cs.duke.edu/spring21/compsci260/resources/AlignmentPapers/1982.gotoh.pdf
[2]: https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm#Affine
…and towlower

Our fuzzy-matching algorithm uses iswlower and towlower to implement
smartcase matching.  Sadly, when using glibc these functions are quite
slow, even for ASCII inputs.  With the new fuzzy-matching algorithm,
this shows up prominently in a CPU profile. We tirelessly call iswlower
and towlower on the query chars, so just cache them in a prepared query
called "RankedMatchQuery", which allows to implement smartcase_eq()
more efficiently.

When matching query "clang" against 100k files in the LLVM monorepo
(of which 30k match), this commit makes us go from 1.8 billion cycles
to just 1.2 (same as the old fuzzy-matching algorithm).

The implementation is a bit ugly because the RankedMatchQuery object
needs to outlive all uses of RankedMatch::operator<.  I guess we
could try to create a type that contains both the query and query
results to make this less ugly.  We could use the same type to get
rid of the allocations in subsequence_match_scores(), though I don't
know if that matters.

A previous approach added a fast path for ASCII input to our wrappers
like is_lower and to_lower. This was not ideal because it doesn't
solve the problem for non-ASCII input.
… only ranking matches

We can compute the fuzzy matching score without keeping in memory the
scores of all prefixes.
Currently we only use the full score matrix for debugging.  In future
I want to use it to get the positions of matched characters (which
could be underlined in the UI).
A recent commit moved "doc" completions to use the new
"complete-command" but forgot to carry over the "-menu" flag. Fix that.
…ssed as well

The "define-command -menu" flag does not do anything unless there is
a completer.  Let's forbid it so there is no confusion about whether
this can be mixed with "complete-command".
This is inspired by a recent change in [Helix] that fixes sorting of
code actions.
We have the same problem because kak-lsp uses ":prompt
-shell-script-candidates" to show code actions.
For example, on this Rust file:

	fn main() {
	    let f: FnOnce(HashMap<i32, i32>);
	}

with the cursor on "HashMap", a ":lsp-code-actions" will offer two
code actions (from rust-analyzer):

	Extract type as type alias"
	Import `std::collections::HashMap`

The first one is a refactoring and the second one is a quickfix.

If fuzzy match scores are equal, Kakoune sorts completions
lexicographically, which is suboptimal because the user will almost
always want to run the quickfix first.

Allow users to influence the order via a new  "-priority" switch.
When this switch is used, Kakoune expects a second field in
shell-script-candidates completions, like so:

	Extract type as type alias"|2
	Import `std::collections::HashMap`|1

The priority field is taken into account when computing fuzzy match
scores. Due to the lack of test cases, the math to do so does not
have a solid footing yet. Here's how it works for now.

- "distance" is the fuzzy match score (lower is better)
- "priority" is the new user-specificed ranking, a positive integer (lower is better)
- "query_length" is the length of the string that is used to filter completions

	effective_priority = priority ^ (1 / query_length) if query_length != 0 else priority
	prioritized_distance = distance * (effective_priority ^ sign(distance))

The ideas are that
1. A priority of 1 is neutral. Higher values increase the distance (making it worse).
2. The longer the query, the lower the impact of "priority".

---

Used by kakoune-lsp/kakoune-lsp#657

[Helix]: helix-editor/helix#4134
Part of mawww#1709
This function has a loop that suggests it can create a sparsely
populated array, but if we peek at the array constructor below, we
see that this is not actually the case.  Given that we always pass a
std::index_sequence with adjacent indices, we don't need this anyway,
so let's simplify.
The static_gather function has a template parameter that decides if
it shall fail when there are *more* elements than expected.

The next commit wants to add another template parameter that decides
if static_gather shall succeed when there are *fewer* elements than
expected.

Let's rename the first template parameter to avoid a perceived overlap.
Closes mawww#1709

This also allows supporting LSP's sortText (to a reasonable extent).
@krobelus
Copy link
Contributor Author

krobelus commented Mar 5, 2023

This feature is quite helpful when working with chatty language servers like rust-analzer for two scenarios:

  1. When hovering over a symbol that is missing an import statement, we can use lsp-code-actions to bring up a prompt -menu to pick a code action.
    Sometimes there are loads of code actions but 80% of the time I want to run the "Add missing imports" one.
    By allowing to specify the order, the import fix will be first hence easier to run.
  2. Sometimes there are loads of insert-mode completions, typically after typing someobject.
    Some of the completions are not even specific to the object and usually not interesting for discovery.

Both cases affect mostly scenarios where the query is empty.
I wonder if this simple change is good enough

if (query.empty())
	// use user-supplied ranking
else
	// use fuzzy ranking

The formula in this PR tries to interpolate between the two cases, for a more natural feel.


The other question is the UI. The extra priority field is ugly because it's not obvious how it's interpreted.

I wonder if we can find a safe and elegant way to allow users to pass their own ranked-match comparator.
It is a function of two completion candidates and a query (or the precomputed fuzzy-match score).
Maybe a mini-DSL. It would need to support table lookups for cases like rust-analyzer where each completion is assigned a priority.i

krobelus added a commit to krobelus/kakoune that referenced this pull request Nov 18, 2023
…cified completions

When using either of

	set-option g completers option=my_option
	prompt -shell-script-candidates ...

While the search text is empty, the completions will be sorted
alphabetically.
This is bad because it means the most important entries are not listed
first, making them harder to select or even spot.

Let's compare input order before sorting alphabetically.

Closes mawww#1709, mawww#4813

TODO: this does it for all completions, can do for user-specified ones only.
krobelus added a commit to krobelus/kakoune that referenced this pull request Nov 19, 2023
…cified completions

When using either of

	set-option g completers option=my_option
	prompt -shell-script-candidates ...

While the search text is empty, the completions will be sorted
alphabetically.
This is bad because it means the most important entries are not listed
first, making them harder to select or even spot.

Let's compare input order before sorting alphabetically.

Closes mawww#1709, mawww#4813
krobelus added a commit to krobelus/kakoune that referenced this pull request Nov 20, 2023
…cified completions

When using either of

	set-option g completers option=my_option
	prompt -shell-script-candidates ...

While the search text is empty, the completions will be sorted
alphabetically.
This is bad because it means the most important entries are not listed
first, making them harder to select or even spot.

Let's compare input order before sorting alphabetically.

In theory there is a more elegant solution: sort candidates (except
if they're user input) before passing them to RankedMatch, and only
use stable sort. However that doesn't work because we use a heap
which doesn't support stable sort.

Closes mawww#1709, mawww#4813
krobelus added a commit to krobelus/kakoune that referenced this pull request Dec 2, 2023
…cified completions

When using either of

	set-option g completers option=my_option
	prompt -shell-script-candidates ...

While the search text is empty, the completions will be sorted
alphabetically.
This is bad because it means the most important entries are not listed
first, making them harder to select or even spot.

Let's apply input order before resorting to sorting alphabetically.

In theory there is a more elegant solution: sort candidates (except
if they're user input) before passing them to RankedMatch, and then
always use stable sort. However that doesn't work because we use a
heap which doesn't support stable sort.

Closes mawww#1709, mawww#4813
krobelus added a commit to krobelus/kakoune that referenced this pull request Dec 2, 2023
…cified completions

When using either of

	set-option g completers option=my_option
	prompt -shell-script-candidates ...

While the search text is empty, the completions will be sorted
alphabetically.
This is bad because it means the most important entries are not listed
first, making them harder to select or even spot.

Let's apply input order before resorting to sorting alphabetically.

In theory there is a more elegant solution: sort candidates (except
if they're user input) before passing them to RankedMatch, and then
always use stable sort. However that doesn't work because we use a
heap which doesn't support stable sort.

Closes mawww#1709, mawww#4813
krobelus added a commit to krobelus/kakoune that referenced this pull request Dec 2, 2023
…cified completions

When using either of

	set-option g completers option=my_option
	prompt -shell-script-candidates ...

While the search text is empty, the completions will be sorted
alphabetically.
This is bad because it means the most important entries are not listed
first, making them harder to select or even spot.

Let's apply input order before resorting to sorting alphabetically.

In theory there is a more elegant solution: sort candidates (except
if they're user input) before passing them to RankedMatch, and then
always use stable sort. However that doesn't work because we use a
heap which doesn't support stable sort.

Closes mawww#1709, mawww#4813
@krobelus
Copy link
Contributor Author

fixed for practical cases by #5035

@krobelus krobelus closed this Dec 13, 2023
evannjohnson pushed a commit to evannjohnson/kakoune-ev that referenced this pull request Jan 25, 2024
…cified completions

When using either of

	set-option g completers option=my_option
	prompt -shell-script-candidates ...

While the search text is empty, the completions will be sorted
alphabetically.
This is bad because it means the most important entries are not listed
first, making them harder to select or even spot.

Let's apply input order before resorting to sorting alphabetically.

In theory there is a more elegant solution: sort candidates (except
if they're user input) before passing them to RankedMatch, and then
always use stable sort. However that doesn't work because we use a
heap which doesn't support stable sort.

Closes mawww#1709, mawww#4813
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove sorting of user-supplied insert completions
1 participant