
Experiment/Draft: Search engine #1986

Draft: wants to merge 48 commits into develop
Conversation

@SmallCoccinelle (Contributor) commented Nov 10, 2021

This branch implements a search engine in the stash backend. It is currently very much a draft, as there are numerous unclosed cases which need tracking. But in the spirit of making the experiment a bit more visible, I'll make a draft PR for it.

Status

Currently we implement:

  • A new GraphQL field, Query.search(..).
  • Dataloaders for scene, performer, tag
  • Full reindexing of the sqlite database into a single search index
  • On-the-fly reindexing of changes, piggybacking on plugin change hooks
  • Limited documents for Scene, Performer, and Tag. These are deliberately kept small until the main design has settled.

We can query for scenes and performers. Scoring is TF-IDF.

Query string support is what Bleve supports. I.e., the query kitty redhead searches for documents with kitty OR redhead, but scores documents where both match higher than documents with a partial match. You can write "kitty redhead" for an exact phrase match, +kitty +redhead for a conjunctive match (i.e., AND), or +kitty /red(head)?/ for the corresponding regex match, and so on.

We also support dates. I.e., +date:>=2013 +date:<2020 searches for objects in that date range.

We support a simple date range facet separating recent items (less than 30 days old) from older ones. But clearly, this needs to be a GraphQL input type so the front-end can manipulate the facets.
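
For illustration, here is roughly how these query strings and a 30-day date facet could map onto Bleve (a sketch assuming Bleve v2; the index handle and the date field name are placeholders, not the code in this branch):

package search

import (
    "time"

    "github.com/blevesearch/bleve/v2"
)

// querySketch shows the query-string and facet features described above.
// Field names are illustrative only.
func querySketch(index bleve.Index) (*bleve.SearchResult, error) {
    // Bleve's query string syntax covers "kitty redhead", "+kitty +redhead",
    // phrases, regexes, and date ranges such as +date:>=2013 +date:<2020.
    q := bleve.NewQueryStringQuery(`+kitty +date:>=2013 +date:<2020`)

    req := bleve.NewSearchRequest(q)
    req.Size = 10

    // A two-bucket date facet: "recent" (last 30 days) versus older documents.
    recency := bleve.NewFacetRequest("date", 2)
    cutoff := time.Now().AddDate(0, 0, -30)
    recency.AddDateTimeRange("recent", cutoff, time.Now())
    recency.AddDateTimeRange("older", time.Time{}, cutoff)
    req.AddFacet("recency", recency)

    return index.Search(req)
}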

Performance

Currently, indexing 20k scenes and 300 performers takes about 5-6 seconds. This is fast, but as we flesh out the documents, it's going to take a lot longer because we need to analyze more and more data. I have a ~4TB stash and the current index size is 7 megabytes. Again, this is going to climb. My SQLite db is 148 megabytes.

Things missing

  • We are currently indexing a small subset of what is in sqlite. This is deliberate: it's relatively easy to flesh out once the other parts fall into place, and those other parts are where the hard work is.
  • Index management. We want to occasionally reindex everything to remove staleness. We also want to reindex if the index version changes. Doing so should happen in the background, with an index swap once the reindex has caught up to the current state. The tool here is index aliases (see the sketch after this list).
  • Handling of tags. With performers and tags in the experiment, we would know way more about data interlinkage.
  • Proper deletion. If a tag/performer is being deleted, it must also be removed from all scenes. This is currently not implemented, but can be done by batched index queries. We want to track a deletion list for a batch and handle it accordingly.
  • Proper configuration. I sorta hacked in a new config-knob, but it does require some more attention, and I'm not sure how to handle this correctly so it can eventually show in the UI.
  • Galleries. To be handled later.
  • Movies. To be handled later.
  • Studios. To be handled later.
  • Images. Large document set.
  • Reindexing is currently a mess. The plan is to fix this once we have tags indexed as well, because there's enough going on already. It's slowly being improved, but there's some repeated code in there I don't like.
  • There's a lot of batching in the indexing process to avoid using too many resources. This is currently untuned, and somewhat requires indexing not being the mess it currently is.
  • A lot of this code should be fairly easy to test. But there are way too few test cases at the moment.
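
For the index management item above, the rough shape with Bleve index aliases could be as follows (a sketch only; the path, the mapping, and the reindex step itself are placeholders):

package search

import "github.com/blevesearch/bleve/v2"

// swapInFreshIndex rebuilds the index in the background and atomically
// repoints the alias once the rebuild has caught up with current state.
func swapInFreshIndex(alias bleve.IndexAlias, old bleve.Index) (bleve.Index, error) {
    fresh, err := bleve.New("index.bleve.rebuild", bleve.NewIndexMapping())
    if err != nil {
        return nil, err
    }

    // ... full reindex into fresh while old keeps serving queries,
    // then replay the changeset accumulated in the meantime ...

    // Readers going through the alias switch to the new index in one step.
    alias.Swap([]bleve.Index{fresh}, []bleve.Index{old})
    return fresh, nil
}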

Rough code overview

The code is mostly orthogonal to the rest of stash. Our interfacing points are:

  • Some transactions using repositories.
  • The event bus, for tracking on-the-fly changes.
  • Dataloaders. I added them to pkg/models.

The go files are as follows:

  • documents/documents.go - Implements the documents which get stored in the search index, together with the index mappings of those documents (a sketch follows this list).
  • loaders.go - A collection of data loaders used by the search engine. Can be pushed into the resolver code as well.
  • changeset.go - implements changesets. Changesets can be turned into index batches and applied to the index.
  • rollup.go - A rollup goroutine tracks event changes into a changeset and hands them off to the search engine upon request.
  • search.go - The main API. Implements search.Search(..), the main entry point.
  • engine.go - Implements the meat of the search engine backend. An engine is governed by a managing goroutine, communicating with a rollup goroutine to maintain the index. The engine handles reindexing.
  • engine_indexing.go - Indexing code for the engine
  • engine_preprocess.go - Preprocessing code for the index. This analyzes a changeset to figure out collateral updates.
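
To make the documents/documents.go item concrete, a cut-down sketch of a scene document and its index mapping might look like this (field names are illustrative, not the actual document layout):

package documents

import (
    "github.com/blevesearch/bleve/v2"
    "github.com/blevesearch/bleve/v2/mapping"
)

// Scene is a deliberately small search document. Tags are stored as a
// flat field so that `tag:woodworking` works without a nested sub-document.
type Scene struct {
    Title     string   `json:"title"`
    Details   string   `json:"details"`
    Date      string   `json:"date"`
    Tag       []string `json:"tag"`
    Performer []string `json:"performer"`
}

// Type lets Bleve pick the matching document mapping for this document.
func (s Scene) Type() string { return "scene" }

// BuildIndexMapping wires field mappings to the scene document type.
func BuildIndexMapping() mapping.IndexMapping {
    text := bleve.NewTextFieldMapping()
    date := bleve.NewDateTimeFieldMapping()

    scene := bleve.NewDocumentMapping()
    scene.AddFieldMappingsAt("title", text)
    scene.AddFieldMappingsAt("details", text)
    scene.AddFieldMappingsAt("date", date)
    scene.AddFieldMappingsAt("tag", text)
    scene.AddFieldMappingsAt("performer", text)

    m := bleve.NewIndexMapping()
    m.AddDocumentMapping("scene", scene)
    return m
}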

Playground

Add

search: /path/to/index/directory

into your stash config and start stash. It should generate an index and start a full reindex after 5s. Every 15s it should write out stats for the reindexing. If you trash the index.bleve folder, it will reindex. Then go to your GraphQL playground. A typical query would look something like:

query Q {
  search(query: "kitty redhead") {
    took
    total

    edges {
      score
      node {
        id
        __typename
        ... on Performer {
          name
        }
        ... on Scene {
          title
          details
          date
          updated_at
          performers {
            name
            eye_color
          }
        }
      }
    }
  }
}

Introduce package event with a dispatcher subsystem. Add this to the
manager and to the plugin subsystem.

Whenever the plugin subsystem executes a PostHook, we dispatch a
Change event on the event dispatcher bus. This currently has no
effect, but it allows us to register subsystems on the event bus for
further processing. In particular, search.

By design, we opt to hook the plugin system and pass to the event bus
for now. One, it makes it easier to remove again, and two, the context
handling inside the plugin subsystem doesn't want to live on the
other side of an event bus.

While here, write a test for the dispatcher code.
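
Roughly, such a dispatcher could look like this (names and shapes are hypothetical, not the actual pkg/event code):

package event

import "sync"

// Change describes a mutated object, mirroring what a plugin PostHook
// carries (hypothetical shape).
type Change struct {
    ID   int
    Type string // e.g. "scene", "performer", "tag"
}

// Dispatcher fans Change events out to registered subscribers.
type Dispatcher struct {
    mu   sync.RWMutex
    subs []chan Change
}

// Register returns a channel on which the caller receives future events.
func (d *Dispatcher) Register() <-chan Change {
    ch := make(chan Change, 64)
    d.mu.Lock()
    d.subs = append(d.subs, ch)
    d.mu.Unlock()
    return ch
}

// Publish delivers ev to every subscriber without blocking the hook path;
// a subscriber with a full buffer simply misses the event.
func (d *Dispatcher) Publish(ev Change) {
    d.mu.RLock()
    defer d.mu.RUnlock()
    for _, ch := range d.subs {
        select {
        case ch <- ev:
        default:
        }
    }
}
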
The rollup service turns events into batches for processing. The search
engine wraps the rollup engine, making it private to the search
subsystem.

Add a search experiment to the code:

Schema in GraphQL is extended with an early search system.

Engine is extended with search, and gets passed through the resolver.

Some conversion is currently done to glue things together, but that
structure is ugly and needs some improvementificationism.

In Go 1.18, strings.Cut becomes a reality. However, since it is such
a useful tool, add it to the utils package for now. Once we are on
Go 1.18, we can replace utils.Cut with strings.Cut.
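
For reference, the stand-in presumably has the same shape as the Go 1.18 signature:

package utils

import "strings"

// Cut slices s around the first instance of sep, mirroring strings.Cut
// from Go 1.18 so call sites can be switched over verbatim later.
func Cut(s, sep string) (before, after string, found bool) {
    if i := strings.Index(s, sep); i >= 0 {
        return s[:i], s[i+len(sep):], true
    }
    return s, "", false
}
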
This is almost not needed, but to be safe, add the ability to protect
changes to the engine, and lock most usage via an RLock().

Search results are Connection objects. Wrap each result in a contextual
object. This can be used for scoring/highlighting/facets later.

Introduce interface SearchResultItem. Implement the interface for
models.Scene. Add hydration code for scenes.

Add scores into search results. Move Search-internal NodeIDs into the
search system.

Introduce search.Item which protects the rest of the system against
search-specific structures. Simplify hydration since it can now use
search.Item.
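
The shapes involved are roughly as follows (hypothetical; the real definitions live in the search package and the generated GraphQL models):

package search

// Item is the search-internal result: a typed ID plus a score. It keeps
// engine-specific structures out of the rest of the system.
type Item struct {
    ID    string
    Type  string // "scene", "performer", ...
    Score float64
}

// SearchResultItem is the GraphQL-facing interface; models.Scene and
// models.Performer satisfy it so both can sit in one result edge.
type SearchResultItem interface {
    IsSearchResultItem()
}
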
This experiment tells us facets want to be an input type rather than
the current enum of predefined facets.

Reindexing covers scenes only at the moment, because that's what we have.
The core idea is fairly simple: batch-process a table, 1000 entries
at a time, and index them. Replace the data loader every 10 rounds
(10k entries) so it doesn't grow too big.

While reindexing is ongoing, the online changemap is still being built
in the background. If reindexing takes more than the timer ticker,
it will fire immediately after. If reindexing takes more than twice
the timer ticker, the ticker protects against this and only fires once.
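
The batching loop described above is roughly the following (the repository query is a placeholder):

package search

import "github.com/blevesearch/bleve/v2"

// doc pairs an index key with the document body stored under it.
type doc struct {
    ID   string
    Body interface{}
}

// fullReindex walks a table in pages of batchSize, indexing each page as
// one Bleve batch. fetch stands in for the repository query; in the real
// loop the data loader is also rebuilt every 10 pages (10k entries) so
// its cache doesn't grow unbounded.
func fullReindex(idx bleve.Index, fetch func(offset, limit int) ([]doc, error)) error {
    const batchSize = 1000
    for offset := 0; ; offset += batchSize {
        page, err := fetch(offset, batchSize)
        if err != nil {
            return err
        }
        if len(page) == 0 {
            return nil // table exhausted
        }
        b := idx.NewBatch()
        for _, d := range page {
            if err := b.Index(d.ID, d.Body); err != nil {
                return err
            }
        }
        if err := idx.Batch(b); err != nil {
            return err
        }
    }
}
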
It is really a set of changes. The map used to implement the set is an
implementation detail that shouldn't be part of the name.

Pull stat tracking outward. Set up a reporting ticker and use it for
reporting progress. This rolls up the log lines into something a bit
more comprehensible.

Change the schema to support performer searches. Performers are
SearchResultItems. Make the search type optional, default to searching
everything.

Enable hydration of performers.

Add performers to the data loader code.

Introduce a performer document for the search index.

Load performers before loading scenes, to utilize the dataloader
cache maximally.

When considering scenes, find the needed performers, and prime the
cache with them.

When processing scenes, denormalize the performer into the scene.

If we update performers, all scenes those performers are in should
also change. Push this in.

Currently, we over-apply on a full reindex, which can be fixed later,
perhaps by moving preprocessing upward, or by having a flag on the
batch processing layer. It's plenty fast right now though.
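
The collateral-update lookup could look roughly like this (the field name and batch size are illustrative):

package search

import "github.com/blevesearch/bleve/v2"

// scenesForPerformer returns the IDs of indexed scenes that denormalize
// the given performer, so they can be queued for reindexing when that
// performer changes or is deleted.
func scenesForPerformer(idx bleve.Index, performerID string) ([]string, error) {
    q := bleve.NewTermQuery(performerID)
    q.SetField("performer_id")

    req := bleve.NewSearchRequest(q)
    req.Size = 1000 // batched in practice

    res, err := idx.Search(req)
    if err != nil {
        return nil, err
    }
    ids := make([]string, 0, len(res.Hits))
    for _, hit := range res.Hits {
        ids = append(ids, hit.ID)
    }
    return ids, nil
}
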
Plug a hole with scenes that can be nil.

This change anticipates far better batch processing in the future.
By explicitly preprocessing, we can do this in the online processing
loop, but avoid it in the offline processing loop. This will avoid
processing elements twice.
SmallCoccinelle commented Nov 10, 2021

Things which are easily salvaged out of this:

  • Dataloaders
  • SearchResultItem. This interface is usually called Node in GraphQL parlance, but unfortunately, our IDs don't contain the type, so supporting
interface Node {
    id: ID!
}

type Scene implements Node {
    ...
}

type Query {
    nodes(ids: [ID]): [Node]
    ...
}

isn't really possible.

Early tag support setup.
People will expect a tag to be fairly easy to grab. So prefer a direct
encoding over a nested subdocument. This allows a search for
`tag:woodworking` rather than `tag.name:woodworking`.

While here, add the tag ids into the scene document as well. This will
help with deletion.

Changesets will keep growing.

Implement Stringer formatting for event.Change.

Introduce engine_preprocess.go. Move preprocessing code into the engine
itself. Use the engine to pull data which needs a change on a performer
deletion. Rework changeset into changeset code only.
SmallCoccinelle commented Nov 11, 2021

  • Properly handle Tag deletion. It seems quite within reach now that we properly delete performers from the index.

Code is a bit spammy at the moment with logging, but that will be
fixed at some point.
SmallCoccinelle commented Nov 12, 2021

  • Handle tag merging. No event is being caught, so is one being generated?

Introduce studios

* In data loading
* In search documents
* In changesets
* In the search path
* In the GraphQL schema

No functional indexing yet.

The strategy is to fold reindexing into a worklist which we
process through systematically. This reduces the full reindexer
into a single loop, which then collapses the code to a far simpler
code path, where the only variance is a switch on the document type.

Use this new strategy to handle studios as well for full reindexing.
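
The worklist shape is roughly the following (document construction elided; the types are hypothetical):

package search

import "github.com/blevesearch/bleve/v2"

// workItem is one pending (type, id) pair on the reindex worklist.
type workItem struct {
    kind string // "scene", "performer", "studio", "tag"
    id   int
}

// processWorklist drains a worklist into a single Bleve batch; the only
// per-type variance is the switch that builds each document.
func processWorklist(idx bleve.Index, items []workItem) error {
    b := idx.NewBatch()
    for _, w := range items {
        switch w.kind {
        case "scene":
            // load the scene, build its search document, b.Index(...)
        case "performer":
            // likewise for performers
        case "studio":
            // likewise for studios
        }
    }
    return idx.Batch(b)
}
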
SmallCoccinelle commented Nov 16, 2021

  • Studios should be linked to Scenes

Rather than having a single large function, split the work
into smaller functions and let the function names describe what
is being done. This should make the code more local and easier to
read.

Introduce indexing of studios in scenes.

Introduce documents.DocType to properly type the documents as an enum.
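
documents.DocType would then replace raw strings like those above, along these lines (the constant names are a guess):

package documents

// DocType enumerates the kinds of documents the index can hold.
type DocType string

const (
    TypeScene     DocType = "scene"
    TypePerformer DocType = "performer"
    TypeStudio    DocType = "studio"
    TypeTag       DocType = "tag"
)
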
Facets are going to be a thing we add later on. An MVP doesn't need
facets, and we can remove lots of complexity if we don't have to worry
about them right now.

If a merge is called, we should process all sources and the destination.
Create an event for each of these.
@skier233 (Contributor) commented:

@SmallCoccinelle Is there any interest in finishing this, or would it need to be started from scratch?
