Explore options for fuzzy-match and search suggestions #605

Open
danielballan opened this issue Nov 14, 2023 · 9 comments
Labels
smse Scientific Metadata Search Engine, everything pertaining to natural language search

Comments

@danielballan
Member

The built-in MapAdapter and the external databroker.mongo_normalized adapter both support the FullText query. We will add support for FullText in the built-in SQL-backed Catalog Adapter in #456 and #457, for SQLite and PostgreSQL respectively.

Next, we should consider fuzzy match and search suggestions. This has often been done with the ELK stack, but that is a heavy stack to take on for the sake of just one of its features. What are our options?

@Kezzsim highlighted the project typesense, which is exactly targeted at serving this use case without taking on the weight of ELK.

Also, I believe there is some functionality in this space available in SQLite and PostgreSQL. While not at the level of ELK, it would be good to understand precisely how far we can get with the tech stack we already have, and what its limitations are.

@Kezzsim Kezzsim added the smse Scientific Metadata Search Engine, everything pertaining to natural language search label Dec 29, 2023
@danielballan
Member Author

In discussions with @Kezzsim, we are going ahead with TypeSense, as an optional add-on in the same way that Prometheus is an optional add-on.

I think that this will involve:

  1. Adding a new optional argument typesense to the Catalog constructors, which takes None (the default, meaning no Typesense) or a config dict like
{
  'api_key': 'Hu52dwsas2AdxdE',
  'nodes': [{
    'host': 'localhost',
    'port': '8108',
    'protocol': 'http'
  }],
  'connection_timeout_seconds': 2
}

tiled/tiled/catalog/adapter.py

Lines 1135 to 1169 in c76d1b3

def in_memory(
    *,
    metadata=None,
    specs=None,
    access_policy=None,
    writable_storage=None,
    readable_storage=None,
    echo=DEFAULT_ECHO,
    adapters_by_mimetype=None,
):
    uri = "sqlite+aiosqlite:///:memory:"
    return from_uri(
        uri=uri,
        metadata=metadata,
        specs=specs,
        access_policy=access_policy,
        writable_storage=writable_storage,
        readable_storage=readable_storage,
        echo=echo,
        adapters_by_mimetype=adapters_by_mimetype,
    )


def from_uri(
    uri,
    *,
    metadata=None,
    specs=None,
    access_policy=None,
    writable_storage=None,
    readable_storage=None,
    init_if_not_exists=False,
    echo=DEFAULT_ECHO,
    adapters_by_mimetype=None,
):

Tiled config like:

trees:
 - tree: catalog
   args:
     uri: postgresql+asyncpg://...
     typesense:
       api_key: $TYPESENSE_API_KEY
       nodes:
         - host: localhost
           port: 8108
           protocol: http
       connection_timeout_seconds: 2

will just work, with no code changes to the config parser.
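
One detail worth noting: the YAML above yields the port as an integer, while the Python dict above uses a string. A minimal sketch of how the constructors might normalize the optional argument is shown here. The function name normalize_typesense_config, the required-key check, and the default timeout of 2 seconds are assumptions made for illustration, not existing Tiled code.

```python
from typing import Optional


def normalize_typesense_config(config: Optional[dict]) -> Optional[dict]:
    """Return None (Typesense disabled) or a cleaned copy of the config dict.

    Hypothetical helper: the name and the validation rules are assumptions.
    """
    if config is None:
        # Default: no Typesense integration.
        return None
    missing = {"api_key", "nodes"} - set(config)
    if missing:
        raise ValueError(f"typesense config is missing keys: {sorted(missing)}")
    nodes = [
        {
            "host": node["host"],
            # YAML parses 8108 as an int; cast to match the dict form above.
            "port": str(node["port"]),
            "protocol": node.get("protocol", "http"),
        }
        for node in config["nodes"]
    ]
    return {
        "api_key": config["api_key"],
        "nodes": nodes,
        # Assumed default, taken from the example value above.
        "connection_timeout_seconds": config.get("connection_timeout_seconds", 2),
    }
```

With a helper like this, both in_memory and from_uri could accept typesense=None and forward the normalized result unchanged.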

  2. Passing that config dict into Context.__init__ and creating an instance of typesense.Client, held as self.typesense_client on the Context.

class Context:
    def __init__(
        self,
        engine,
        writable_storage=None,
        readable_storage=None,
        adapters_by_mimetype=None,
        key_maker=lambda: str(uuid.uuid4()),
    ):

  3. Also in Context.__init__, registering after_insert (https://docs.sqlalchemy.org/en/20/orm/events.html#sqlalchemy.orm.MapperEvents.after_insert) and after_update SQLAlchemy events that make the relevant calls on self.typesense_client. (I am still not entirely clear on what these hooks give you access to, but the docs look promising.)

  4. Adding a new module tiled.commandline._typesense and updating tiled.commandline.main to add a tiled typesense subcommand to the CLI. I imagine we will need:

tiled typesense init TYPESENSE_URL [ANOTHER_TYPESENSE_URL] # define schemas
tiled typesense rebuild TYPESENSE_URL [ANOTHER_TYPESENSE_URL]  # drop data (if any) and rebuild

The utility urllib.parse.urlparse can be used to get from a CLI-friendly string like http://localhost:8108?api_key=Hu52dwsas2AdxdE into the structure:

{
  'api_key': 'Hu52dwsas2AdxdE',
  'nodes': [{
    'host': 'localhost',
    'port': '8108',
    'protocol': 'http'
  }],
  'connection_timeout_seconds': 2
}
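
A minimal sketch of that conversion, using only the standard library. The function name typesense_config_from_urls is hypothetical, and the choices to read api_key from the query string and to require one shared key across all URLs are assumptions based on the example string above.

```python
from urllib.parse import parse_qs, urlparse


def typesense_config_from_urls(*urls: str, connection_timeout_seconds: int = 2) -> dict:
    """Build a Typesense-style config dict from one or more CLI-friendly URLs.

    Hypothetical helper; each URL is expected to look like
    protocol://host:port?api_key=...
    """
    api_keys = set()
    nodes = []
    for url in urls:
        parsed = urlparse(url)
        if parsed.hostname is None or parsed.port is None:
            raise ValueError(f"Expected protocol://host:port, got {url!r}")
        nodes.append(
            {
                "host": parsed.hostname,
                "port": str(parsed.port),  # string, as in the structure above
                "protocol": parsed.scheme,
            }
        )
        # parse_qs maps each query key to a list of values.
        api_keys.update(parse_qs(parsed.query).get("api_key", []))
    if len(api_keys) > 1:
        raise ValueError("All URLs must carry the same api_key")
    return {
        "api_key": api_keys.pop() if api_keys else "",
        "nodes": nodes,
        "connection_timeout_seconds": connection_timeout_seconds,
    }


config = typesense_config_from_urls("http://localhost:8108?api_key=Hu52dwsas2AdxdE")
```

Putting the API key in the URL is convenient for a sketch, but it will end up in shell history; reading it from an environment variable may be preferable in practice.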

@danielballan
Member Author

All of the above is up for a rethink. It is just meant as a quick sketch to highlight the sections of the Tiled code that I can see will need to be touched.

@danielballan
Member Author

From discussion on 20 Feb:

  • The TypeSense ingestion (both at initialization and via the trigger) will ignore any node whose specs are not on a list of approved "specs" that TypeSense knows how to handle.
  • There will be additional configuration, passed to tiled, along these lines:
typesense_ingestion:
 - spec: BlueskyRun
   fields:
   - name: detectors  # field name in TypeSense
     path: "start.detectors"  # path into Tiled JSON metadata
     # Also type?
 - spec: SomeOtherThing
   ...
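
To make the two points above concrete, here is a minimal sketch of how such a configuration could drive ingestion. The function names lookup and build_document are hypothetical, specs are represented as plain strings for simplicity, and returning None for a missing path is an assumption rather than a decision from the discussion.

```python
from typing import Any, Optional

# Parsed form of the typesense_ingestion YAML above (first entry only).
INGESTION_CONFIG = [
    {
        "spec": "BlueskyRun",
        "fields": [{"name": "detectors", "path": "start.detectors"}],
    },
]


def lookup(metadata: dict, path: str) -> Any:
    """Follow a dotted path into nested dicts; return None if a key is missing."""
    value: Any = metadata
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def build_document(specs: list, metadata: dict, config: list) -> Optional[dict]:
    """Return the document to send to TypeSense, or None to skip the node."""
    for entry in config:
        if entry["spec"] in specs:
            return {
                field["name"]: lookup(metadata, field["path"])
                for field in entry["fields"]
            }
    # No approved spec matched, so the node is ignored.
    return None


doc = build_document(
    ["BlueskyRun"], {"start": {"detectors": ["det1"]}}, INGESTION_CONFIG
)
```

The same build_document call could serve both the initial bulk load and the per-insert trigger, so the approved-spec filter lives in one place.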

@danielballan
Member Author

# config.yml
authentication:
  # The default is false. Set to true to enable any HTTP client that can
  # connect to _read_. An API key is still required to write.
  allow_anonymous_access: false
  single_user_api_key: "secret"  # for dev
trees:
  - path: /
    tree: catalog
    args:
      uri: "sqlite+aiosqlite:///:memory:"
      # or, uri: "sqlite+aiosqlite:////catalog.db"
      # or, "postgresql+asyncpg://..."
      writable_storage: "data/"
      init_if_not_exists: true
      typesense_client:
        schema:
        connection_info:

$ tiled serve config config.yml

@danielballan
Member Author

image

@danielballan
Member Author

Diagram from discussion today:

image

@danielballan
Member Author

Phased approach:

  1. Python client grows FuzzySearch and server grows integration with an (optional) TypeSense microservice. Demonstrations use a very simple schema, for example, just extracting plan_name from the RunStart document.
  2. Explore richer schemas, with input from beamline staff.
  3. Add support to the React client (and maybe other clients, like PyMCA) for search-results-as-you-type fuzzy match in a text bar.

We do not plan to add support for FuzzySearch to the databroker.mongo_normalized adapter. (This would require us to redo a large fraction of the work.) Support for fuzzy text search will be an incentive to migrate to the new storage.

@Kezzsim
Contributor

Kezzsim commented Nov 8, 2024

image
We're going to make some changes to the schema to have a base and run on a per-beamline basis.
The name of the catalog will need to match the key in the Typesense schema in order for the schema to be applied.
