forked from run-llama/llama_index
Opensearch/Elasticsearch support run-llama#542 (run-llama#548)
Co-authored-by: Jerry Liu <jerry@robustintelligence.com>
1 parent 34e0961, commit 0fa6423
Showing 10 changed files with 627 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,228 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Using OpenSearch as a vector index\n", | ||
"\n", | ||
"Elasticsearch only supports Lucene indices, so the vector index works only with OpenSearch." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"**Note on setup**: We set up a local OpenSearch instance by following the docs at https://opensearch.org/docs/1.0/\n", | ||
"\n", | ||
"If you run into SSL issues, try the following `docker run` command instead: \n", | ||
"```\n", | ||
"docker run -p 9200:9200 -p 9600:9600 -e \"discovery.type=single-node\" -e \"plugins.security.disabled=true\" opensearchproject/opensearch:1.0.1\n", | ||
"```\n", | ||
"\n", | ||
"Reference: https://github.com/opensearch-project/OpenSearch/issues/1598" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"collapsed": false, | ||
"jupyter": { | ||
"outputs_hidden": false | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"from os import getenv\n", | ||
"from llama_index import SimpleDirectoryReader\n", | ||
"from llama_index.indices.vector_store import GPTOpensearchIndex\n", | ||
"from llama_index.vector_stores import OpensearchVectorClient\n", | ||
"# http endpoint for your cluster (opensearch required for vector index usage)\n", | ||
"endpoint = getenv(\"OPENSEARCH_ENDPOINT\", \"http://localhost:9200\")\n", | ||
"# index to demonstrate the VectorStore impl\n", | ||
"idx = getenv(\"OPENSEARCH_INDEX\", \"gpt-index-demo\")\n", | ||
"# load some sample data\n", | ||
"documents = SimpleDirectoryReader('../paul_graham_essay/data').load_data()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"metadata": { | ||
"collapsed": false, | ||
"jupyter": { | ||
"outputs_hidden": false | ||
} | ||
}, | ||
"outputs": [ | ||
{ | ||
"name": "stderr", | ||
"output_type": "stream", | ||
"text": [ | ||
"INFO:root:> [build_index_from_documents] Total LLM token usage: 0 tokens\n", | ||
"INFO:root:> [build_index_from_documents] Total embedding token usage: 17598 tokens\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# OpensearchVectorClient stores text in this field by default\n", | ||
"text_field = \"content\"\n", | ||
"# OpensearchVectorClient stores embeddings in this field by default\n", | ||
"embedding_field = \"embedding\"\n", | ||
"# OpensearchVectorClient encapsulates logic for a\n", | ||
"# single opensearch index with vector search enabled\n", | ||
"client = OpensearchVectorClient(endpoint, idx, 1536, embedding_field=embedding_field, text_field=text_field)\n", | ||
"# initialize an index using our sample data and the client we just created\n", | ||
"index = GPTOpensearchIndex(documents=documents, client=client)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"metadata": { | ||
"collapsed": false, | ||
"jupyter": { | ||
"outputs_hidden": false | ||
} | ||
}, | ||
"outputs": [ | ||
{ | ||
"name": "stderr", | ||
"output_type": "stream", | ||
"text": [ | ||
"INFO:root:> [query] Total LLM token usage: 29628 tokens\n", | ||
"INFO:root:> [query] Total embedding token usage: 8 tokens\n" | ||
] | ||
}, | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"'\\n\\nThe author grew up writing short stories, programming on an IBM 1401, and building a computer kit from Heathkit. They also wrote programs for a TRS-80, such as games, a program to predict model rocket flight, and a word processor. After years of nagging, they convinced their father to buy a TRS-80, and they wrote simple games, a program to predict how high their model rockets would fly, and a word processor that their father used to write at least one book. In college, they studied philosophy and AI, and wrote a book about Lisp hacking. They also took art classes and applied to art schools, and experimented with computer graphics and animation, exploring the use of algorithms to create art. Additionally, they experimented with machine learning algorithms, such as using neural networks to generate art, and exploring the use of numerical values to create art. They also took classes in fundamental subjects like drawing, color, and design, and applied to two art schools, RISD in the US, and the Accademia di Belli Arti in Florence. They were accepted to RISD, and while waiting to hear back from the Accademia, they learned Italian and took the entrance exam in Florence. They eventually graduated from RISD'" | ||
] | ||
}, | ||
"execution_count": 4, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"# run query\n", | ||
"res = index.query(\"What did the author do growing up?\")\n", | ||
"res.response" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Use the reader to inspect what GPTOpensearchIndex just created in our index\n", | ||
"\n", | ||
"The reader works with Elasticsearch too, since it only uses basic search features." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 9, | ||
"metadata": { | ||
"collapsed": false, | ||
"jupyter": { | ||
"outputs_hidden": false | ||
} | ||
}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"embedding dimension: 1536\n", | ||
"all fields in index: dict_keys(['content', 'embedding'])\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# create a reader to inspect the index used in the previous section.\n", | ||
"from llama_index.readers import ElasticsearchReader\n", | ||
"\n", | ||
"rdr = ElasticsearchReader(endpoint, idx)\n", | ||
"# optionally set embedding_field to read embedding data from the elasticsearch index\n", | ||
"docs = rdr.load_data(text_field, embedding_field=embedding_field)\n", | ||
"# docs have embeddings in them\n", | ||
"print(\"embedding dimension:\", len(docs[0].embedding))\n", | ||
"# full document is stored in extra_info\n", | ||
"print(\"all fields in index:\", docs[0].extra_info.keys())" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 10, | ||
"metadata": { | ||
"collapsed": false, | ||
"jupyter": { | ||
"outputs_hidden": false | ||
} | ||
}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"total number of chunks: 10\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# we can check out how the text was chunked by the `GPTOpensearchIndex`\n", | ||
"print(\"total number of chunks:\", len(docs))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 13, | ||
"metadata": { | ||
"collapsed": false, | ||
"jupyter": { | ||
"outputs_hidden": false | ||
} | ||
}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"chunks that mention Lisp: 10\n", | ||
"chunks that mention Yahoo: 8\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# search index using standard elasticsearch query DSL\n", | ||
"docs = rdr.load_data(text_field, {\"query\": {\"match\": {text_field: \"Lisp\"}}})\n", | ||
"print(\"chunks that mention Lisp:\", len(docs))\n", | ||
"docs = rdr.load_data(text_field, {\"query\": {\"match\": {text_field: \"Yahoo\"}}})\n", | ||
"print(\"chunks that mention Yahoo:\", len(docs))" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.8.4" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 4 | ||
} |
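
For readers wondering what the notebook's two kinds of search look like on the wire, here is a minimal sketch of the request bodies involved, built as plain Python dicts so it runs without a cluster. The helper names (`knn_index_body`, `knn_query_body`, `match_query_body`) are hypothetical and are not part of this commit. The shapes follow the OpenSearch k-NN plugin and the standard query DSL, and it is an assumption that `OpensearchVectorClient` sends something similar.

```python
from typing import Any, Dict, List


def knn_index_body(embedding_field: str = "embedding", dim: int = 1536) -> Dict[str, Any]:
    """Settings and mapping for an OpenSearch index with k-NN search enabled.

    Hypothetical helper: the field name and dimension mirror the notebook's
    defaults ("embedding", 1536), not necessarily the client's internals.
    """
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                embedding_field: {"type": "knn_vector", "dimension": dim},
            }
        },
    }


def knn_query_body(
    vector: List[float], k: int = 2, embedding_field: str = "embedding"
) -> Dict[str, Any]:
    """Approximate k-NN query: return the k stored vectors closest to `vector`."""
    return {
        "size": k,
        "query": {"knn": {embedding_field: {"vector": vector, "k": k}}},
    }


def match_query_body(text_field: str, term: str) -> Dict[str, Any]:
    """Full-text match query, the same shape the notebook passes to the reader."""
    return {"query": {"match": {text_field: term}}}


if __name__ == "__main__":
    print(knn_index_body())
    print(match_query_body("content", "Lisp"))
```

The match query needs only basic search features, which is why the reader also works against Elasticsearch, while the `knn_vector` mapping and `knn` query depend on the OpenSearch k-NN plugin.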