Opened on Jan 18, 2024
## Summary
This is a proposal, much like #346 or #351, for the ability to bundle static data within a package such that, when the package is installed, the data stream is created and the bundled data is ingested into it.
The specific use case here is shipping 'Knowledge Base' content for use by the Elastic Assistants. For example, both the Security and Observability Assistants currently bundle our ES|QL docs with the Kibana distribution for each release. We then take this data, optionally chunk it, and embed/ingest it using ELSER into a 'knowledge base' data stream so the assistants can query it for their ES|QL query generation features. Each release we'll need to update this content and ship it as part of the Kibana distribution, with no ability to ship intermediate content updates outside of the Kibana release cycle.
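To illustrate the chunking step described above, here is a minimal sketch of a naive character-window chunker. The window size and overlap are hypothetical; the real pipeline may well chunk by tokens or semantic boundaries before embedding with ELSER:

```python
def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> list[str]:
    """Split a document into overlapping character-window chunks.

    Hypothetical parameters for illustration only; not the values
    the assistants actually use.
    """
    if max_chars <= overlap:
        raise ValueError("max_chars must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks
```

Overlap between adjacent chunks helps keep sentences that straddle a boundary retrievable from either side.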
Additionally, as mentioned in #346 (comment), this essentially provides us the ability to ship 'Custom GPTs' that can integrate with our assistants, and so opens up a world of possibilities for users to configure and expand the capabilities of the Security and Observability Assistants.
## Requirement Details
### Configuration
The core requirement here is for the ability to include the following when creating a package:
- Any number of data streams to create, though realistically one is probably sufficient
- An arbitrary number of documents, perhaps in JSON format, or zipped as detailed in [discuss] Support (fairly large) sample data set package #346
- This generally won't be a large amount of data, as detailed in [discuss] Support (fairly large) sample data set package #346 (our ES|QL docs are 196 documents and ~125KB); however, I would expect some users to push this to enable RAG over larger data sets
- Some configuration for the destination data stream of the bundled documents. If we include a raw dump of the documents from ES, perhaps we can use just the `_index` fields to route them accordingly?
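To sketch the `_index`-based routing idea, the bundled documents (assumed here to be a raw ES dump, each with `_index` and `_source` fields) could be turned into an NDJSON `_bulk` body that sends each doc to the data stream named in its own `_index`:

```python
import json

def build_bulk_payload(docs: list[dict]) -> str:
    """Build an NDJSON _bulk request body, routing each raw-dump doc
    to the data stream named in its _index field.

    Data streams only accept the 'create' op type, not 'index'.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"create": {"_index": doc["_index"]}}))
        lines.append(json.dumps(doc["_source"]))
    return "\n".join(lines) + "\n"
```

The document shape is an assumption; if packages bundle documents in some other format, the routing field would just need to be mapped into the bulk action line the same way.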
### Behavior
Upon installation, the package should create the included data streams, then ingest the bundled documents into their destination data streams. This initial data should persist for as long as the package is installed. If the package is removed, the data streams and initial data should be removed as well. When the package is updated, it would be fine to wipe the data streams/initial data and treat it as a fresh install; whatever is easiest/most resilient would be fine for the first iteration here. There's no need to worry about appending new data on upgrade or dealing with mapping changes: just delete the data streams and re-install/re-ingest the initial data.
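The delete-and-reinstall lifecycle above could be sketched as follows. The `store` dict is an in-memory stand-in for Elasticsearch, and the package shape (`data_streams`, `documents`) is hypothetical, not part of any existing package spec:

```python
def install(store: dict, package: dict) -> None:
    """Create the package's data streams and ingest its bundled docs."""
    for stream in package["data_streams"]:
        store[stream] = []
    for doc in package["documents"]:
        store[doc["_index"]].append(doc["_source"])

def uninstall(store: dict, package: dict) -> None:
    """Remove the data streams along with their initial data."""
    for stream in package["data_streams"]:
        store.pop(stream, None)

def upgrade(store: dict, old: dict, new: dict) -> None:
    """First iteration: wipe and treat as a fresh install.

    No appending on upgrade, no mapping migrations.
    """
    uninstall(store, old)
    install(store, new)
```

Treating upgrade as uninstall-then-install keeps the first iteration resilient: there is never a partially migrated state to reason about.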
The above would be sufficient for us to start bundling knowledge base documents in packages, at which point we could install them as needed in support of specific assistant features.