
Experiment/Draft: Search engine #1986

Draft: wants to merge 48 commits into develop
Conversation

@SmallCoccinelle (Contributor) commented Nov 10, 2021

This branch implements a search engine in the stash backend. It is currently very much a draft, as there are numerous unclosed cases which need tracking. But in the spirit of making the experiment a bit more visible, I'll make a draft PR for it.

Status

Currently we implement:

  • A new GraphQL field, Query.search(..).
  • Dataloaders for scene, performer, tag
  • Full reindexing of the sqlite database into a single search index
  • On-the-fly reindexing of changes, piggybacking on plugin change hooks
  • Limited documents for Scene, Performer, and Tag. These are deliberately kept small until the main design has settled.

We can query for scenes and performers. Scoring is TF-IDF.

Query string support is what Bleve supports. I.e., the query kitty redhead searches for documents with kitty OR redhead, but scores documents where both match higher than documents with a partial match. You can write "kitty redhead" for an exact phrase match, +kitty +redhead for a conjunctive match (i.e., AND), or +kitty /red(head)?/ for the corresponding regex match, and so on.

We also support dates. I.e., +date:>=2013 +date:<2020 searches for objects in that date range.

We support a simple date range facet separating recent items (less than 30 days old) from older ones. But clearly, this needs to be a GraphQL input type so the front-end can manipulate the facets.
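
For illustration, here is roughly how these query strings and a 30-day date facet could map onto Bleve (a sketch assuming Bleve v2; the index handle and the date field name are placeholders, not the code in this branch):

package search

import (
    "time"

    "github.com/blevesearch/bleve/v2"
)

// querySketch shows the query-string and facet features described above.
// Field names are illustrative only.
func querySketch(index bleve.Index) (*bleve.SearchResult, error) {
    // Bleve's query string syntax covers "kitty redhead", "+kitty +redhead",
    // phrases, regexes, and date ranges such as +date:>=2013 +date:<2020.
    q := bleve.NewQueryStringQuery(`+kitty +date:>=2013 +date:<2020`)

    req := bleve.NewSearchRequest(q)
    req.Size = 10

    // A two-bucket date facet: "recent" (last 30 days) versus older documents.
    recency := bleve.NewFacetRequest("date", 2)
    cutoff := time.Now().AddDate(0, 0, -30)
    recency.AddDateTimeRange("recent", cutoff, time.Now())
    recency.AddDateTimeRange("older", time.Time{}, cutoff)
    req.AddFacet("recency", recency)

    return index.Search(req)
}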

Performance

Currently, indexing 20k scenes and 300 performers takes about 5-6 seconds. This is fast, but as we flesh out the documents, it's going to take a lot longer because we need to analyze more and more data. I have a ~4TB stash and the current index size is 7 megabytes. Again, this is going to climb. My SQLite db is 148 megabytes.

Things missing

  • We are currently indexing a small subset of what is in sqlite. This is deliberate: it's relatively easy to flesh out once the other parts fall into place, and those other parts are where the hard work is.
  • Index management. We want to occasionally reindex everything to remove staleness. We also want to reindex if the index version changes. Doing so should happen in the background, with an index swap once the reindex has caught up to the current state. The tool here is index aliases (see the sketch after this list).
  • Handling of tags. With performers and tags in the experiment, we would know way more about data interlinkage.
  • Proper deletion. If a tag/performer is being deleted, it must also be removed from all scenes. This is currently not implemented, but can be done by batched index queries. We want to track a deletion list for a batch and handle it accordingly.
  • Proper configuration. I sorta hacked in a new config-knob, but it does require some more attention, and I'm not sure how to handle this correctly so it can eventually show in the UI.
  • Galleries. To be handled later.
  • Movies. To be handled later.
  • Studios. To be handled later.
  • Images. Large document set.
  • Reindexing is currently a mess. The plan is to fix this once we have tags indexed as well, because there's enough going on already. It's slowly being improved, but there's some repeated code in there I don't like.
  • There's a lot of batching in the indexing process to avoid using too many resources. This is currently untuned, and somewhat requires indexing not being the mess it currently is.
  • A lot of this code should be fairly easy to test. But there are way too few test cases at the moment.
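
For the index management item above, the rough shape with Bleve index aliases could be as follows (a sketch only; the path, the mapping, and the reindex step itself are placeholders):

package search

import "github.com/blevesearch/bleve/v2"

// swapInFreshIndex rebuilds the index in the background and atomically
// repoints the alias once the rebuild has caught up with current state.
func swapInFreshIndex(alias bleve.IndexAlias, old bleve.Index) (bleve.Index, error) {
    fresh, err := bleve.New("index.bleve.rebuild", bleve.NewIndexMapping())
    if err != nil {
        return nil, err
    }

    // ... full reindex into fresh while old keeps serving queries,
    // then replay the changeset accumulated in the meantime ...

    // Readers going through the alias switch to the new index in one step.
    alias.Swap([]bleve.Index{fresh}, []bleve.Index{old})
    return fresh, nil
}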

Rough code overview

The code is mostly orthogonal to the rest of stash. Our interfacing points are:

  • Some transactions using repositories.
  • The event bus, for tracking on-the-fly changes.
  • Dataloaders. I added them to pkg/models.

The go files are as follows:

  • documents/documents.go - Implements the documents which get stored in the search index, together with the index mappings of those documents (a sketch follows this list).
  • loaders.go - A collection of data loaders used by the search engine. Can be pushed into the resolver code as well.
  • changeset.go - implements changesets. Changesets can be turned into index batches and applied to the index.
  • rollup.go - A rollup goroutine tracks event changes into a changeset and hands them off to the search engine upon request.
  • search.go - The main API. Implements search.Search(..), the main entry point.
  • engine.go - Implements the meat of the search engine backend. An engine is governed by a managing goroutine, communicating with a rollup goroutine to maintain the index. The engine handles reindexing.
  • engine_indexing.go - Indexing code for the engine
  • engine_preprocess.go - Preprocessing code for the index. This analyzes a changeset to figure out collateral updates.
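
To make the documents/documents.go item concrete, a cut-down sketch of a scene document and its index mapping might look like this (field names are illustrative, not the actual document layout):

package documents

import (
    "github.com/blevesearch/bleve/v2"
    "github.com/blevesearch/bleve/v2/mapping"
)

// Scene is a deliberately small search document. Tags are stored as a
// flat field so that `tag:woodworking` works without a nested sub-document.
type Scene struct {
    Title     string   `json:"title"`
    Details   string   `json:"details"`
    Date      string   `json:"date"`
    Tag       []string `json:"tag"`
    Performer []string `json:"performer"`
}

// Type lets Bleve pick the matching document mapping for this document.
func (s Scene) Type() string { return "scene" }

// BuildIndexMapping wires field mappings to the scene document type.
func BuildIndexMapping() mapping.IndexMapping {
    text := bleve.NewTextFieldMapping()
    date := bleve.NewDateTimeFieldMapping()

    scene := bleve.NewDocumentMapping()
    scene.AddFieldMappingsAt("title", text)
    scene.AddFieldMappingsAt("details", text)
    scene.AddFieldMappingsAt("date", date)
    scene.AddFieldMappingsAt("tag", text)
    scene.AddFieldMappingsAt("performer", text)

    m := bleve.NewIndexMapping()
    m.AddDocumentMapping("scene", scene)
    return m
}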

Playground

Add

search: /path/to/index/directory

into your stash config and start stash. It should generate an index and start a full reindex after 5s. Every 15s it should write out stats for the reindexing. If you trash the index.bleve folder, it will reindex. Then go to your GraphQL playground. A typical query would look something like:

query Q {
  search(query: "kitty redhead") {
    took
    total

    edges {
      score
      node {
        id
        __typename
        ... on Performer {
          name
        }
        ... on Scene {
          title
          details
          date
          updated_at
          performers {
            name
            eye_color
          }
        }
      }
    }
  }
}

Introduce package event with a dispatcher subsystem. Add this to the
manager and to the plugin subsystem.

Whenever the plugin subsystem executes a PostHook, we dispatch a
Change event on the event dispatcher bus. This currently has no
effect, but it allows us to register subsystems on the event bus for
further processing. In particular, search.

By design, we opt to hook the plugin system and pass to the event bus
for now. One, it makes it easier to remove again, and two, the context
handling inside the plugin subsystem doesn't want to live on the
other side of an event bus.

While here, write a test for the dispatcher code.
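
Roughly, such a dispatcher could look like this (names and shapes are hypothetical, not the actual pkg/event code):

package event

import "sync"

// Change describes a mutated object, mirroring what a plugin PostHook
// carries (hypothetical shape).
type Change struct {
    ID   int
    Type string // e.g. "scene", "performer", "tag"
}

// Dispatcher fans Change events out to registered subscribers.
type Dispatcher struct {
    mu   sync.RWMutex
    subs []chan Change
}

// Register returns a channel on which the caller receives future events.
func (d *Dispatcher) Register() <-chan Change {
    ch := make(chan Change, 64)
    d.mu.Lock()
    d.subs = append(d.subs, ch)
    d.mu.Unlock()
    return ch
}

// Publish delivers ev to every subscriber without blocking the hook path;
// a subscriber with a full buffer simply misses the event.
func (d *Dispatcher) Publish(ev Change) {
    d.mu.RLock()
    defer d.mu.RUnlock()
    for _, ch := range d.subs {
        select {
        case ch <- ev:
        default:
        }
    }
}
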
The rollup service turns events into batches for processing. The search
engine wraps the rollup engine, making it private to the search
subsystem.

Add a search experiment to the code:

Schema in GraphQL is extended with an early search system.

Engine is extended with search, and gets passed through the resolver.

Some conversion is currently done to glue things together, but that
structure is ugly and needs some improvementificationism.

In Go 1.18, strings.Cut becomes a reality. However, since it is such
a useful tool, add it to the utils package for now. Once we are on
Go 1.18, we can replace utils.Cut with strings.Cut.
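
For reference, the stand-in presumably has the same shape as the Go 1.18 signature:

package utils

import "strings"

// Cut slices s around the first instance of sep, mirroring strings.Cut
// from Go 1.18 so call sites can be switched over verbatim later.
func Cut(s, sep string) (before, after string, found bool) {
    if i := strings.Index(s, sep); i >= 0 {
        return s[:i], s[i+len(sep):], true
    }
    return s, "", false
}
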
This is almost not needed, but to be safe, add the ability to protect
changes to the engine, and lock most usage via an RLock().

Search results are Connection objects. Wrap each result in a contextual
object. This can be used for scoring/highlighting/facets later.

Introduce interface SearchResultItem. Implement the interface for
models.Scene. Add hydration code for scenes.

Add scores into search results. Move Search-internal NodeIDs into the
search system.

Introduce search.Item which protects the rest of the system against
search-specific structures. Simplify hydration since it can now use
search.Item.
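
The shapes involved are roughly as follows (hypothetical; the real definitions live in the search package and the generated GraphQL models):

package search

// Item is the search-internal result: a typed ID plus a score. It keeps
// engine-specific structures out of the rest of the system.
type Item struct {
    ID    string
    Type  string // "scene", "performer", ...
    Score float64
}

// SearchResultItem is the GraphQL-facing interface; models.Scene and
// models.Performer satisfy it so both can sit in one result edge.
type SearchResultItem interface {
    IsSearchResultItem()
}
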
This experiment tells us facets want to be an input type rather than
the current enum of predefined facets.

Reindexing covers scenes only at the moment, because that's what we have.
The core idea is fairly simple: batch-process a table, 1000 entries
at a time, and index them. Replace the data loader every 10 rounds
(10k entries) so it doesn't grow too big.

While reindexing is ongoing, the online changemap is still being built
in the background. If reindexing takes more than the timer ticker,
it will fire immediately after. If reindexing takes more than twice
the timer ticker, the ticker protects against this and only fires once.
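
The batching loop described above is roughly the following (the repository query is a placeholder):

package search

import "github.com/blevesearch/bleve/v2"

// doc pairs an index key with the document body stored under it.
type doc struct {
    ID   string
    Body interface{}
}

// fullReindex walks a table in pages of batchSize, indexing each page as
// one Bleve batch. fetch stands in for the repository query; in the real
// loop the data loader is also rebuilt every 10 pages (10k entries) so
// its cache doesn't grow unbounded.
func fullReindex(idx bleve.Index, fetch func(offset, limit int) ([]doc, error)) error {
    const batchSize = 1000
    for offset := 0; ; offset += batchSize {
        page, err := fetch(offset, batchSize)
        if err != nil {
            return err
        }
        if len(page) == 0 {
            return nil // table exhausted
        }
        b := idx.NewBatch()
        for _, d := range page {
            if err := b.Index(d.ID, d.Body); err != nil {
                return err
            }
        }
        if err := idx.Batch(b); err != nil {
            return err
        }
    }
}
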
It is really a set of changes. The map used to implement the set is an
implementation detail that shouldn't be part of the name.

Pull stat tracking outward. Set up a reporting ticker and use it for
reporting progress. This rolls up the log lines into something a bit
more comprehensible.

Change the schema to support performer searches. Performers are
SearchResultItems. Make the search type optional, default to searching
everything.

Enable hydration of performers.

Add performers to the data loader code.

Introduce a performer document for the search index.

Load performers before loading scenes, to utilize the dataloader
cache maximally.

When considering scenes, find the needed performers, and prime the
cache with them.

When processing scenes, denormalize the performer into the scene.

If we update performers, all scenes those performers are in should
also change. Push this in.

Currently, we over-apply on a full reindex, which can be fixed later,
perhaps by moving preprocessing upward, or by having a flag on the
batch processing layer. It's plenty fast right now though.
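
The collateral-update lookup could look roughly like this (the field name and batch size are illustrative):

package search

import "github.com/blevesearch/bleve/v2"

// scenesForPerformer returns the IDs of indexed scenes that denormalize
// the given performer, so they can be queued for reindexing when that
// performer changes or is deleted.
func scenesForPerformer(idx bleve.Index, performerID string) ([]string, error) {
    q := bleve.NewTermQuery(performerID)
    q.SetField("performer_id")

    req := bleve.NewSearchRequest(q)
    req.Size = 1000 // batched in practice

    res, err := idx.Search(req)
    if err != nil {
        return nil, err
    }
    ids := make([]string, 0, len(res.Hits))
    for _, hit := range res.Hits {
        ids = append(ids, hit.ID)
    }
    return ids, nil
}
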
Plug a hole with scenes that can be nil.

This change anticipates far better batch processing in the future.
By explicitly preprocessing, we can do this in the online processing
loop, but avoid it in the offline processing loop. This will avoid
processing elements twice.
SmallCoccinelle commented Nov 10, 2021

Things which are easily salvaged out of this:

  • Dataloaders
  • SearchResultItem. This interface is usually called Node in GraphQL parlance, but unfortunately, our IDs don't contain the type, so supporting
interface Node {
    id: ID!
}

type Scene implements Node {
    ...
}

type Query {
    nodes(ids: [ID]): [Node]
    ...
}

isn't really possible.

Early tag support setup.
People will expect a tag to be fairly easy to grab. So prefer a direct
encoding over a nested subdocument. This allows a search for
`tag:woodworking` rather than `tag.name:woodworking`.

While here, add the tag ids into the scene document as well. This will
help with deletion.

Changesets will keep growing.

Implement Stringer formatting for event.Change.

Introduce engine_preprocess.go. Move preprocessing code into the engine
itself. Use the engine to pull data which needs a change on a performer
deletion. Rework changeset into changeset code only.
SmallCoccinelle commented Nov 11, 2021

  • Properly handle Tag deletion. It seems quite within reach now that we properly delete performers from the index.

Code is a bit spammy at the moment with logging, but that will be
fixed at some point.
SmallCoccinelle commented Nov 12, 2021

  • Handle tag merging. No event is being caught, so is one being generated?

Introduce studios

* In data loading
* In search documents
* In changesets
* In the search path
* In the GraphQL schema

No functional indexing yet.

The strategy is to fold reindexing into a worklist which we
process through systematically. This reduces the full reindexer
into a single loop, which then collapses the code to a far simpler
code path, where the only variance is a switch on the document type.

Use this new strategy to handle studios as well for full reindexing.
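
The worklist shape is roughly the following (document construction elided; the types are hypothetical):

package search

import "github.com/blevesearch/bleve/v2"

// workItem is one pending (type, id) pair on the reindex worklist.
type workItem struct {
    kind string // "scene", "performer", "studio", "tag"
    id   int
}

// processWorklist drains a worklist into a single Bleve batch; the only
// per-type variance is the switch that builds each document.
func processWorklist(idx bleve.Index, items []workItem) error {
    b := idx.NewBatch()
    for _, w := range items {
        switch w.kind {
        case "scene":
            // load the scene, build its search document, b.Index(...)
        case "performer":
            // likewise for performers
        case "studio":
            // likewise for studios
        }
    }
    return idx.Batch(b)
}
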
SmallCoccinelle commented Nov 16, 2021

  • Studios should be linked to Scenes

Rather than having a single large function, split the work
into smaller functions and let the function names describe what
is being done. This should make the code more local and easier to
read.

Introduce indexing of studios in scenes.

Introduce documents.DocType to properly type the documents as an enum.
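
documents.DocType would then replace raw strings like those above, along these lines (the constant names are a guess):

package documents

// DocType enumerates the kinds of documents the index can hold.
type DocType string

const (
    TypeScene     DocType = "scene"
    TypePerformer DocType = "performer"
    TypeStudio    DocType = "studio"
    TypeTag       DocType = "tag"
)
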
Facets are going to be a thing we add later on. An MVP doesn't need
facets, and we can remove lots of complexity if we don't have to worry
about them right now.

If a merge is called, we should process all sources and the destination.
Create an event for each of these.
@skier233 (Contributor) commented:

@SmallCoccinelle Is there any interest in finishing this, or would it need to be started from scratch?
