Various search improvement suggestions #210

Open
5 of 7 tasks
joepio opened this issue Nov 12, 2021 · 0 comments
Labels
help wanted Extra attention is needed server atomic-server

joepio commented Nov 12, 2021

I've just implemented Full-Text Search #40 and it works pretty well! Good enough for now. However, I noticed some things could be improved upon:

  • Besides indexing only triples, consider indexing full resources. That way, a user could combine terms present in various fields. For example, say I'm looking for a red shirt. This shirt would have two relevant properties: its type (shirt) and its color (red). As search currently only indexes triples, it would find one triple for `red` and one for `shirt`, but it would not find something that contains both. If we index full resources, we fix this. The new JSON fields in Tantivy (see Consider the new JSON fields for Tantivy full-text search #336) might be a solution: you can add a json_field to a Schema.
  • Boost titles
  • Fuzzy searching does not, at the moment, score items at all. In other words, we get more or less random hits for fuzzy matches, and fuzzy matching is what we use for all short strings. That's bad. I think there are people working on this though, see PR: Use Levenshtein distance to score documents in fuzzy term queries quickwit-oss/tantivy#998. But in another comment, the PR's creator said we could think of this PR as discarded.
  • Search inside collections or in some hierarchy: Search inside collections & hierarchies #226
  • Tokenize the search sentence into separate parts (a new query for each token). Inspiration (permalink), thanks @ChillFish8!
  • There is no scoring system to make important resources rank higher (think PageRank from Google). There is no user feedback to make the system learn what is relevant to me. There are no synonyms.
  • Consider indexing connected resources, too. Say that in the previous example, red was not a literal string but a resource somewhere else, possibly with a very obscure Subject URL. This would mean that we would not even hit the red shirt if we searched for red! We could fix this by indexing connected resources and including these in the initial item. Perhaps we'd add a new field, connected, and serialize all values of all directly connected nodes into it. I think doing this for a depth of 1 is doable, although it would make indexing about 10x slower and increase the size of the index by a similar amount. But it would open up some cool possibilities, such as searching for a user name + class type (e.g. joep document) and seeing all documents of that user, without any form of explicit filters. That's pretty cool, right?
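
To make the full-resource, tokenized-query, and connected-resource ideas a bit more concrete, here is a rough sketch in plain Rust. It does not use Tantivy at all, and `Value`, `Resource`, `Store`, `tokenize`, `document_tokens` and `search` are all hypothetical names made up for this example, not the atomic-server data model or API. It only illustrates the intended behavior: one document per resource, with literals of directly linked resources (depth 1) folded in, and a query that is split into tokens which must all match.

```rust
use std::collections::HashMap;

/// A property value: a literal string, or a link to another resource.
/// Hypothetical and heavily simplified for this sketch.
pub enum Value {
    Literal(String),
    /// Subject URL of the connected resource.
    Link(String),
}

/// One resource: property name -> value (hypothetical).
pub type Resource = HashMap<String, Value>;

/// All resources, keyed by subject URL (hypothetical).
pub type Store = HashMap<String, Resource>;

/// Split text into lowercase alphanumeric tokens.
pub fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Build the searchable tokens for one resource: its own literals, plus
/// the literals of directly linked resources (depth 1 only).
pub fn document_tokens(store: &Store, subject: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    if let Some(resource) = store.get(subject) {
        for value in resource.values() {
            match value {
                Value::Literal(text) => tokens.extend(tokenize(text)),
                Value::Link(target) => {
                    // The "connected" part: follow the link one step and
                    // include the literals found there. Links inside the
                    // connected resource are not followed.
                    if let Some(connected) = store.get(target) {
                        for v in connected.values() {
                            if let Value::Literal(text) = v {
                                tokens.extend(tokenize(text));
                            }
                        }
                    }
                }
            }
        }
    }
    tokens
}

/// Return the subjects (sorted) whose document contains every query token.
/// An empty query matches nothing.
pub fn search(store: &Store, query: &str) -> Vec<String> {
    let wanted = tokenize(query);
    let mut hits: Vec<String> = store
        .keys()
        .filter(|subject| {
            let doc = document_tokens(store, subject.as_str());
            !wanted.is_empty() && wanted.iter().all(|t| doc.contains(t))
        })
        .cloned()
        .collect();
    hits.sort();
    hits
}
```

With a shirt resource whose color property links to a separate resource named Red, `search(&store, "red shirt")` returns the shirt's subject, which a per-triple index would miss. This is a linear scan with exact token matches and no scoring, so it says nothing about the indexing cost or the fuzzy-scoring problem mentioned above.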