Opensearch/Elasticsearch support run-llama#542 (run-llama#548)

Co-authored-by: Jerry Liu <jerry@robustintelligence.com>
kenryu42 · Mar 1, 2023 · 0fa6423
1 parent 34e0961
commit 0fa6423

Showing 10 changed files with 627 additions and 0 deletions.
diff --git a/examples/vector_indices/OpensearchDemo.ipynb b/examples/vector_indices/OpensearchDemo.ipynb
@@ -0,0 +1,228 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Using Opensearch as a vector index\n",
+    "\n",
+    "Elasticsearch only supports Lucene indices, so only Opensearch is supported."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Note on setup**: We set up a local Opensearch instance by following this doc: https://opensearch.org/docs/1.0/\n",
+    "\n",
+    "If you run into SSL issues, try the following `docker run` command instead: \n",
+    "```\n",
+    "docker run -p 9200:9200 -p 9600:9600 -e \"discovery.type=single-node\" -e \"plugins.security.disabled=true\" opensearchproject/opensearch:1.0.1\n",
+    "```\n",
+    "\n",
+    "Reference: https://github.com/opensearch-project/OpenSearch/issues/1598"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    }
+   },
+   "outputs": [],
+   "source": [
+    "from os import getenv\n",
+    "from llama_index import SimpleDirectoryReader\n",
+    "from llama_index.indices.vector_store import GPTOpensearchIndex\n",
+    "from llama_index.vector_stores import OpensearchVectorClient\n",
+    "# http endpoint for your cluster (opensearch required for vector index usage)\n",
+    "endpoint = getenv(\"OPENSEARCH_ENDPOINT\", \"http://localhost:9200\")\n",
+    "# index to demonstrate the VectorStore impl\n",
+    "idx = getenv(\"OPENSEARCH_INDEX\", \"gpt-index-demo\")\n",
+    "# load some sample data\n",
+    "documents = SimpleDirectoryReader('../paul_graham_essay/data').load_data()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    }
+   },
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "INFO:root:> [build_index_from_documents] Total LLM token usage: 0 tokens\n",
+      "INFO:root:> [build_index_from_documents] Total embedding token usage: 17598 tokens\n"
+     ]
+    }
+   ],
+   "source": [
+    "# OpensearchVectorClient stores text in this field by default\n",
+    "text_field = \"content\"\n",
+    "# OpensearchVectorClient stores embeddings in this field by default\n",
+    "embedding_field = \"embedding\"\n",
+    "# OpensearchVectorClient encapsulates logic for a\n",
+    "# single opensearch index with vector search enabled\n",
+    "client = OpensearchVectorClient(endpoint, idx, 1536, embedding_field=embedding_field, text_field=text_field)\n",
+    "# initialize an index using our sample data and the client we just created\n",
+    "index = GPTOpensearchIndex(documents=documents, client=client)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    }
+   },
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "INFO:root:> [query] Total LLM token usage: 29628 tokens\n",
+      "INFO:root:> [query] Total embedding token usage: 8 tokens\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "'\\n\\nThe author grew up writing short stories, programming on an IBM 1401, and building a computer kit from Heathkit. They also wrote programs for a TRS-80, such as games, a program to predict model rocket flight, and a word processor. After years of nagging, they convinced their father to buy a TRS-80, and they wrote simple games, a program to predict how high their model rockets would fly, and a word processor that their father used to write at least one book. In college, they studied philosophy and AI, and wrote a book about Lisp hacking. They also took art classes and applied to art schools, and experimented with computer graphics and animation, exploring the use of algorithms to create art. Additionally, they experimented with machine learning algorithms, such as using neural networks to generate art, and exploring the use of numerical values to create art. They also took classes in fundamental subjects like drawing, color, and design, and applied to two art schools, RISD in the US, and the Accademia di Belli Arti in Florence. They were accepted to RISD, and while waiting to hear back from the Accademia, they learned Italian and took the entrance exam in Florence. They eventually graduated from RISD'"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# run query\n",
+    "res = index.query(\"What did the author do growing up?\")\n",
+    "res.response"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Use the reader to inspect what GPTOpensearchIndex just created in our index\n",
+    "\n",
+    "The reader also works with Elasticsearch, since it only uses basic search features."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "embedding dimension: 1536\n",
+      "all fields in index: dict_keys(['content', 'embedding'])\n"
+     ]
+    }
+   ],
+   "source": [
+    "# create a reader to check out the index used in previous section.\n",
+    "from llama_index.readers import ElasticsearchReader\n",
+    "\n",
+    "rdr = ElasticsearchReader(endpoint, idx)\n",
+    "# set embedding_field optionally to read embedding data from the elasticsearch index\n",
+    "docs = rdr.load_data(text_field, embedding_field=embedding_field)\n",
+    "# docs have embeddings in them\n",
+    "print(\"embedding dimension:\", len(docs[0].embedding))\n",
+    "# full document is stored in extra_info\n",
+    "print(\"all fields in index:\", docs[0].extra_info.keys())"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "total number of chunks created: 10\n"
+     ]
+    }
+   ],
+   "source": [
+    "# we can check out how the text was chunked by the `GPTOpensearchIndex`\n",
+    "print(\"total number of chunks created:\", len(docs))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "chunks that mention Lisp: 10\n",
+      "chunks that mention Yahoo: 8\n"
+     ]
+    }
+   ],
+   "source": [
+    "# search index using standard elasticsearch query DSL\n",
+    "docs = rdr.load_data(text_field, {\"query\": {\"match\": {text_field: \"Lisp\"}}})\n",
+    "print(\"chunks that mention Lisp:\", len(docs))\n",
+    "docs = rdr.load_data(text_field, {\"query\": {\"match\": {text_field: \"Yahoo\"}}})\n",
+    "print(\"chunks that mention Yahoo:\", len(docs))"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
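
Note on the notebook above: the last cell passes a query DSL body to `ElasticsearchReader.load_data` as a plain dict. A minimal sketch of building such bodies follows. `match_query` and `knn_query` are hypothetical helpers and are not part of this commit. The `knn` body follows the approximate search format of the OpenSearch k-NN plugin; it is only an assumption about the kind of query a vector client might issue, since the client implementation is not shown in this part of the diff.

```python
from typing import Any, Dict, List


def match_query(field: str, text: str) -> Dict[str, Any]:
    """Build a full-text match query body, like the one in the last notebook cell."""
    return {"query": {"match": {field: text}}}


def knn_query(field: str, vector: List[float], k: int) -> Dict[str, Any]:
    """Build an approximate k-NN query body (OpenSearch k-NN plugin format, assumed)."""
    return {"size": k, "query": {"knn": {field: {"vector": vector, "k": k}}}}


# Field names mirror the notebook defaults ("content" and "embedding").
lisp_query = match_query("content", "Lisp")
nearest = knn_query("embedding", [0.1, 0.2, 0.3], k=2)
```

Keeping the bodies as plain dicts means the same structure can be handed to the reader or to any HTTP client without a dedicated query-builder dependency.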
diff --git a/gpt_index/data_structs/data_structs.py b/gpt_index/data_structs/data_structs.py
@@ -371,3 +371,12 @@ class ChromaIndexDict(IndexDict):
     def get_type(cls) -> str:
         """Get type."""
         return IndexStructType.CHROMA
+
+
+class OpensearchIndexDict(IndexDict):
+    """Index dict for Opensearch vector index."""
+
+    @classmethod
+    def get_type(cls) -> str:
+        """Get type."""
+        return IndexStructType.OPENSEARCH
diff --git a/gpt_index/data_structs/struct_type.py b/gpt_index/data_structs/struct_type.py
@@ -30,6 +30,9 @@ class IndexStructType(str, Enum):
         CHROMA ("chroma"): Chroma Vector Store Index.
             See :ref:`Ref-Indices-VectorStore`
             for more information on the Chroma vector store index.
+        OPENSEARCH ("opensearch"): Opensearch Vector Store Index.
+            See :ref:`Ref-Indices-VectorStore`
+            for more information on the Opensearch vector store index.
         SQL ("SQL"): SQL Structured Store Index.
             See :ref:`Ref-Indices-StructStore`
             for more information on the SQL vector store index.
@@ -54,6 +57,7 @@ class IndexStructType(str, Enum):
     QDRANT = "qdrant"
     CHROMA = "chroma"
     VECTOR_STORE = "vector_store"
+    OPENSEARCH = "opensearch"
 
     # for SQL index
     SQL = "sql"
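
Note on the two hunks above: `OpensearchIndexDict.get_type` returns a member of a `str`-based enum, so callers can compare the result against the raw string `"opensearch"`. The sketch below illustrates that pattern with reduced stand-ins; the `IndexStructType` and `IndexDict` classes here are simplified assumptions, not the real gpt_index classes.

```python
from enum import Enum


class IndexStructType(str, Enum):
    """Reduced stand-in for the enum in struct_type.py."""

    CHROMA = "chroma"
    OPENSEARCH = "opensearch"


class IndexDict:
    """Stub base class; the real IndexDict also carries the index data."""

    @classmethod
    def get_type(cls) -> str:
        raise NotImplementedError


class OpensearchIndexDict(IndexDict):
    """Index dict stand-in that reports the Opensearch struct type."""

    @classmethod
    def get_type(cls) -> str:
        return IndexStructType.OPENSEARCH


struct_type = OpensearchIndexDict.get_type()
# The str mixin makes the member compare equal to its raw string value.
is_plain_match = struct_type == "opensearch"
# Looking a member up by value returns the same enum member.
round_trip = IndexStructType("opensearch")
```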

diff --git a/gpt_index/indices/query/vector_store/queries.py b/gpt_index/indices/query/vector_store/queries.py
@@ -8,11 +8,13 @@
 from gpt_index.vector_stores import (
     ChromaVectorStore,
     FaissVectorStore,
+    OpensearchVectorStore,
     PineconeVectorStore,
     QdrantVectorStore,
     SimpleVectorStore,
     WeaviateVectorStore,
 )
+from gpt_index.vector_stores.opensearch import OpensearchVectorClient
 
 
 class GPTSimpleVectorIndexQuery(GPTVectorStoreIndexQuery):
@@ -190,3 +192,28 @@ def __init__(
             raise ValueError("chroma_collection is required.")
         vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
         super().__init__(index_struct=index_struct, vector_store=vector_store, **kwargs)
+
+
+class GPTOpensearchIndexQuery(GPTVectorStoreIndexQuery):
+    """GPT Opensearch vector index query.
+
+    Args:
+        text_qa_template (Optional[QuestionAnswerPrompt]): A Question-Answer Prompt
+            (see :ref:`Prompt-Templates`).
+        embed_model (Optional[BaseEmbedding]): Embedding model to use for
+            embedding similarity.
+        client (Optional[OpensearchVectorClient]): Opensearch vector client.
+
+    """
+
+    def __init__(
+        self,
+        index_struct: IndexDict,
+        client: Optional[OpensearchVectorClient] = None,
+        **kwargs: Any,
+    ) -> None:
+        """Initialize params."""
+        if client is None:
+            raise ValueError("client is required.")
+        vector_store = OpensearchVectorStore(client=client)
+        super().__init__(index_struct=index_struct, vector_store=vector_store, **kwargs)
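
Note on `GPTOpensearchIndexQuery` above: the constructor declares `client` as optional in its signature but rejects `None` at runtime, then wraps the client in a vector store before delegating to the base class. A minimal, self-contained sketch of that pattern follows. Every class in it (`StubClient`, `StubVectorStore`, `StubBaseQuery`, `OpensearchQuerySketch`) is a hypothetical stand-in, not the gpt_index implementation.

```python
from typing import Any, Optional


class StubClient:
    """Hypothetical stand-in for OpensearchVectorClient."""


class StubVectorStore:
    """Hypothetical stand-in for OpensearchVectorStore."""

    def __init__(self, client: StubClient) -> None:
        self.client = client


class StubBaseQuery:
    """Hypothetical stand-in for GPTVectorStoreIndexQuery."""

    def __init__(self, index_struct: Any, vector_store: Any, **kwargs: Any) -> None:
        self.index_struct = index_struct
        self.vector_store = vector_store


class OpensearchQuerySketch(StubBaseQuery):
    """Query class that requires a client even though the argument defaults to None."""

    def __init__(
        self,
        index_struct: Any,
        client: Optional[StubClient] = None,
        **kwargs: Any,
    ) -> None:
        # Fail early with a clear message instead of failing later inside the store.
        if client is None:
            raise ValueError("client is required.")
        vector_store = StubVectorStore(client=client)
        super().__init__(index_struct=index_struct, vector_store=vector_store, **kwargs)


query = OpensearchQuerySketch(index_struct={}, client=StubClient())
```

The `None` default keeps the signature uniform with the other query classes, which are constructed from keyword arguments, while the explicit check turns a missing client into an immediate `ValueError`.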
diff --git a/gpt_index/indices/vector_store/__init__.py b/gpt_index/indices/vector_store/__init__.py
@@ -4,6 +4,7 @@
 from gpt_index.indices.vector_store.vector_indices import (
     GPTChromaIndex,
     GPTFaissIndex,
+    GPTOpensearchIndex,
     GPTPineconeIndex,
     GPTQdrantIndex,
     GPTSimpleVectorIndex,
@@ -18,4 +19,5 @@
     "GPTWeaviateIndex",
     "GPTQdrantIndex",
     "GPTChromaIndex",
+    "GPTOpensearchIndex",
 ]