
Commit

small typo fixes to code comments and README
bojieli committed Mar 24, 2023
1 parent ed2f8c7 commit 9940b71
Showing 10 changed files with 24 additions and 24 deletions.
10 changes: 5 additions & 5 deletions README.md
@@ -95,15 +95,15 @@ The plugin exposes the following endpoints for upserting, querying, and deleting

- `/upsert`: This endpoint allows uploading one or more documents and storing their text and metadata in the vector database. The documents are split into chunks of around 200 tokens, each with a unique ID. The endpoint expects a list of documents in the request body, each with a `text` field, and optional `id` and `metadata` fields. The `metadata` field can contain the following optional subfields: `source`, `source_id`, `url`, `created_at`, and `author`. The endpoint returns a list of the IDs of the inserted documents (an ID is generated if not initially provided).

- `/upsert-file`: This endpoint allows uploading a single file (PDF, TXT, DOCX, PPTX, or MD) and store its text and metadata in the vector database. The file is converted to plain text and split into chunks of around 200 tokens, each with a unique ID. The endpoint returns a list containing the generated id of the inserted file.
- `/upsert-file`: This endpoint allows uploading a single file (PDF, TXT, DOCX, PPTX, or MD) and storing its text and metadata in the vector database. The file is converted to plain text and split into chunks of around 200 tokens, each with a unique ID. The endpoint returns a list containing the generated id of the inserted file.

- `/query`: This endpoint allows querying the vector database using one or more natural language queries and optional metadata filters. The endpoint expects a list of queries in the request body, each with a `query` and optional `filter` and `top_k` fields. The `filter` field should contain a subset of the following subfields: `source`, `source_id`, `document_id`, `url`, `created_at`, and `author`. The `top_k` field specifies how many results to return for a given query, and the default value is 3. The endpoint returns a list of objects that each contain a list of the most relevant document chunks for the given query, along with their text, metadata and similarity scores.

- `/delete`: This endpoint allows deleting one or more documents from the vector database using their IDs, a metadata filter, or a delete_all flag. The endpoint expects at least one of the following parameters in the request body: `ids`, `filter`, or `delete_all`. The `ids` parameter should be a list of document IDs to delete; all document chunks for the document with these IDS will be deleted. The `filter` parameter should contain a subset of the following subfields: `source`, `source_id`, `document_id`, `url`, `created_at`, and `author`. The `delete_all` parameter should be a boolean indicating whether to delete all documents from the vector database. The endpoint returns a boolean indicating whether the deletion was successful.

The detailed specifications and examples of the request and response models can be found by running the app locally and navigating to http://0.0.0.0:8000/openapi.json, or in the OpenAPI schema [here](/.well-known/openapi.yaml). Note that the OpenAPI schema only contains the `/query` endpoint, because that is the only function that ChatGPT needs to access. This way, ChatGPT can use the plugin only to retrieve relevant documents based on natural language queries or needs. However, if developers want to also give ChatGPT the ability to remember things for later, they can use the `/upsert` endpoint to save snippets from the conversation to the vector database. An example of a manifest and OpenAPI schema that give ChatGPT access to the `/upsert` endpoint can be found [here](/examples/memory).
The detailed specifications and examples of the request and response models can be found by running the app locally and navigating to http://0.0.0.0:8000/openapi.json, or in the OpenAPI schema [here](/.well-known/openapi.yaml). Note that the OpenAPI schema only contains the `/query` endpoint, because that is the only function that ChatGPT needs to access. This way, ChatGPT can use the plugin only to retrieve relevant documents based on natural language queries or needs. However, if developers want to also give ChatGPT the ability to remember things for later, they can use the `/upsert` endpoint to save snippets from the conversation to the vector database. An example of a manifest and OpenAPI schema that gives ChatGPT access to the `/upsert` endpoint can be found [here](/examples/memory).

To include custom metadata fields, edit the `DocumentMetadata` and `DocumentMetadataFilter` data models [here](/models/models.py), and update the OpenAPI schema [here](/.well-known/openapi.yaml). You can update this easily by running the app locally, copying the json found at http://0.0.0.0:8000/sub/openapi.json, and converting it to YAML format with [Swagger Editor](https://editor.swagger.io/). Alternatively, you can replace the `openapi.yaml` file with an `openapi.json` file.
To include custom metadata fields, edit the `DocumentMetadata` and `DocumentMetadataFilter` data models [here](/models/models.py), and update the OpenAPI schema [here](/.well-known/openapi.yaml). You can update this easily by running the app locally, copying the JSON found at http://0.0.0.0:8000/sub/openapi.json, and converting it to YAML format with [Swagger Editor](https://editor.swagger.io/). Alternatively, you can replace the `openapi.yaml` file with an `openapi.json` file.
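
For orientation alongside the endpoint descriptions above, here is a minimal sketch of calling the locally running app with Python's `requests` library. It assumes the server is at http://0.0.0.0:8000 with a bearer token configured; the token value, document text, and metadata are illustrative placeholders, and the exact payload shapes should be confirmed against the generated OpenAPI schema.

```python
# Minimal sketch of hitting /upsert, /query, and /delete on a local instance.
# The bearer token and document contents below are placeholders.
import requests

BASE_URL = "http://0.0.0.0:8000"
HEADERS = {"Authorization": "Bearer <YOUR_BEARER_TOKEN>"}

# Upsert a document; `id` and `metadata` are optional, and an ID is
# generated if none is provided.
upsert_body = {
    "documents": [
        {
            "text": "Our Q1 planning meeting is scheduled for April 3rd.",
            "metadata": {"source": "chat", "author": "alice"},
        }
    ]
}
print(requests.post(f"{BASE_URL}/upsert", json=upsert_body, headers=HEADERS).json())

# Query with an optional metadata filter and top_k (defaults to 3).
query_body = {
    "queries": [
        {
            "query": "When is the Q1 planning meeting?",
            "filter": {"author": "alice"},
            "top_k": 3,
        }
    ]
}
print(requests.post(f"{BASE_URL}/query", json=query_body, headers=HEADERS).json())

# Delete by document IDs (at least one of ids, filter, or delete_all is required).
delete_body = {"ids": ["<document-id-returned-by-upsert>"]}
print(requests.delete(f"{BASE_URL}/delete", json=delete_body, headers=HEADERS).json())
```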

## Quickstart

@@ -339,11 +339,11 @@ Find more information [here](https://zilliz.com).
**Self Hosted vs SaaS**
Zilliz is a SaaS database, but offers an open source solution, Milvus. Both options offer fast searches at the billion scale, but Zilliz handles data management for you. It automatically scales compute and storage resources and creates optimal indexes for your data. See the comparison [here](https://zilliz.com/doc/about_zilliz_cloud).
Zilliz is a SaaS database, but offers an open-source solution, Milvus. Both options offer fast searches at the billion scale, but Zilliz handles data management for you. It automatically scales compute and storage resources and creates optimal indexes for your data. See the comparison [here](https://zilliz.com/doc/about_zilliz_cloud).
##### Deploying the Database
Zilliz Cloud is deployable in a few simple steps. First, create an account [here](https://cloud.zilliz.com/signup). Once you have an account set up, follow the guide [here](https://zilliz.com/doc/quick_start) to setup a database and get the parameters needed for this application.
Zilliz Cloud is deployable in a few simple steps. First, create an account [here](https://cloud.zilliz.com/signup). Once you have an account set up, follow the guide [here](https://zilliz.com/doc/quick_start) to set up a database and get the parameters needed for this application.
Environment Variables:
12 changes: 6 additions & 6 deletions datastore/providers/milvus_datastore.py
@@ -205,14 +205,14 @@ def _create_collection(self, create_new: bool) -> None:

self.col.create_index("embedding", index_params=i_p)
self.index_params = i_p
print("Creation of Milvus default index succesful")
print("Creation of Milvus default index successful")
# If create fails, most likely due to being Zilliz Cloud instance, try to create an AutoIndex
except MilvusException:
print("Attempting creation of Zilliz Cloud default index")
i_p = {"metric_type": "L2", "index_type": "AUTOINDEX", "params": {}}
self.col.create_index("embedding", index_params=i_p)
self.index_params = i_p
print("Creation of Zilliz Cloud default index succesful")
print("Creation of Zilliz Cloud default index successful")
# If an index already exists, grab its params
else:
self.index_params = self.col.indexes[0].to_dict()['index_param']
@@ -353,7 +353,7 @@ async def _single_query(query: QueryWithEmbedding) -> QueryResult:
# Grab the values that correspond to our fields, ignore pk and embedding.
for x in [field[0] for field in SCHEMA[2:]]:
metadata[x] = hit.entity.get(x)
# If the source isnt valid, conver to None
# If the source isn't valid, convert to None
if metadata["source"] not in Source.__members__:
metadata["source"] = None
# Text falls under the DocumentChunk
@@ -387,7 +387,7 @@ async def delete(
Args:
ids (Optional[List[str]], optional): The document_ids to delete. Defaults to None.
filter (Optional[DocumentMetadataFilter], optional): The filter to delet by. Defaults to None.
filter (Optional[DocumentMetadataFilter], optional): The filter to delete by. Defaults to None.
delete_all (Optional[bool], optional): Whether to drop the collection and recreate it. Defaults to None.
"""
# If deleting all, drop and create the new collection
@@ -416,7 +416,7 @@ async def delete(
if len(ids) != 0:
# Delete the entries for each pk
res = self.col.delete(f"pk in [{','.join(ids)}]")
# Incremet our deleted count
# Increment our deleted count
delete_count += int(res.delete_count) # type: ignore

# Check if empty filter
@@ -436,7 +436,7 @@ async def delete(
# Increment our delete count
delete_count += int(res.delete_count) # type: ignore

# This setting perfoms flushes after delete. Small delete == bad to use
# This setting performs flushes after delete. Small delete == bad to use
# self.col.flush()

return True
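
As a standalone illustration of the pk-based deletion pattern in the hunk above, here is a minimal PyMilvus sketch, assuming a running Milvus instance on localhost and an existing collection; the host, collection name, and primary keys are placeholders rather than values from this repository.

```python
# Minimal sketch of deleting entities by primary key with PyMilvus,
# mirroring the `pk in [...]` boolean expression used above.
# Host, collection name, and IDs are placeholders, not repo values.
from pymilvus import Collection, connections

connections.connect(alias="default", host="localhost", port="19530")
col = Collection("example_collection")  # assumes the collection already exists

ids = ["437164237570366466", "437164237570366467"]  # example int64 pks as strings
expr = f"pk in [{','.join(ids)}]"  # -> "pk in [437164237570366466,437164237570366467]"

res = col.delete(expr)    # marks the matching entities as deleted
print(res.delete_count)   # number of entities deleted
# col.flush()  # optional; as noted above, flushing after small deletes is discouraged
```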
10 changes: 5 additions & 5 deletions datastore/providers/zilliz_datastore.py
@@ -133,7 +133,7 @@ def _create_collection(self, create_new: bool) -> None:

# Check if the collection doesnt exist
if utility.has_collection(ZILLIZ_COLLECTION, using=self.alias) is False:
# If it doesnt exist use the field params from init to create a new schem
# If it doesnt exist use the field params from init to create a new schema
schema = [field[1] for field in SCHEMA]
schema = CollectionSchema(schema)
# Use the schema to create a new collection
@@ -201,7 +201,7 @@ async def _upsert(self, chunks: Dict[str, List[DocumentChunk]]) -> List[str]:
print(f"Error upserting batch: {e}")
raise e

# This setting perfoms flushes after insert. Small insert == bad to use
# This setting performs flushes after insert. Small insert == bad to use
# self.col.flush()

return doc_ids
@@ -321,7 +321,7 @@ async def delete(
Args:
ids (Optional[List[str]], optional): The document_ids to delete. Defaults to None.
filter (Optional[DocumentMetadataFilter], optional): The filter to delet by. Defaults to None.
filter (Optional[DocumentMetadataFilter], optional): The filter to delete by. Defaults to None.
delete_all (Optional[bool], optional): Whether to drop the collection and recreate it. Defaults to None.
"""
# If deleting all, drop and create the new collection
@@ -350,7 +350,7 @@ async def delete(
if len(ids) != 0:
# Delete the entries for each pk
res = self.col.delete(f"pk in [{','.join(ids)}]")
# Incremet our deleted count
# Increment our deleted count
delete_count += int(res.delete_count) # type: ignore

# Check if empty filter
Expand All @@ -370,7 +370,7 @@ async def delete(
# Increment our delete count
delete_count += int(res.delete_count) # type: ignore

# This setting perfoms flushes after delete. Small delete == bad to use
# This setting performs flushes after delete. Small delete == bad to use
# self.col.flush()

return True
4 changes: 2 additions & 2 deletions examples/memory/main.py
@@ -80,7 +80,7 @@ async def upsert_main(
@sub_app.post(
"/upsert",
response_model=UpsertResponse,
# NOTE: We are describing the the shape of the API endpoint input due to a current limitation in parsing arrays of objects from OpenAPI schemas. This will not be necessary in future.
# NOTE: We are describing the shape of the API endpoint input due to a current limitation in parsing arrays of objects from OpenAPI schemas. This will not be necessary in the future.
description="Save chat information. Accepts an array of documents with text (potential questions + conversation text), metadata (source 'chat' and timestamp, no ID as this will be generated). Confirm with the user before saving, ask for more details/context.",
)
async def upsert(
@@ -116,7 +116,7 @@ async def query_main(
@sub_app.post(
"/query",
response_model=QueryResponse,
# NOTE: We are describing the the shape of the API endpoint input due to a current limitation in parsing arrays of objects from OpenAPI schemas. This will not be necessary in future.
# NOTE: We are describing the shape of the API endpoint input due to a current limitation in parsing arrays of objects from OpenAPI schemas. This will not be necessary in the future.
description="Accepts search query objects array each with query and optional filter. Break down complex questions into sub-questions. Refine results by criteria, e.g. time / source, don't do this often. Split queries if ResponseTooLargeError occurs.",
)
async def query(
2 changes: 1 addition & 1 deletion examples/providers/pinecone/semantic-search.ipynb
@@ -88,7 +88,7 @@
"INFO: Application startup complete.\n",
"```\n",
"\n",
"In that case, the app has automatically connected to our index (specified by `PINECONE_INDEX`), if no index with that name existed beforehand, the app creates one for us.\n",
"In that case, the app is automatically connected to our index (specified by `PINECONE_INDEX`), if no index with that name existed beforehand, the app creates one for us.\n",
"\n",
"Now we're ready to move on to populating our index with some data."
]
2 changes: 1 addition & 1 deletion scripts/process_json/process_json.py
@@ -118,7 +118,7 @@ async def main():
"--screen_for_pii",
default=False,
type=bool,
help="A boolean flag to indicate whether to try to the PII detection function (using a language model)",
help="A boolean flag to indicate whether to try the PII detection function (using a language model)",
)
parser.add_argument(
"--extract_metadata",
2 changes: 1 addition & 1 deletion scripts/process_jsonl/process_jsonl.py
@@ -116,7 +116,7 @@ async def main():
"--screen_for_pii",
default=False,
type=bool,
help="A boolean flag to indicate whether to try to the PII detection function (using a language model)",
help="A boolean flag to indicate whether to try the PII detection function (using a language model)",
)
parser.add_argument(
"--extract_metadata",
2 changes: 1 addition & 1 deletion scripts/process_zip/process_zip.py
@@ -123,7 +123,7 @@ async def main():
"--screen_for_pii",
default=False,
type=bool,
help="A boolean flag to indicate whether to try to the PII detection function (using a language model)",
help="A boolean flag to indicate whether to try the PII detection function (using a language model)",
)
parser.add_argument(
"--extract_metadata",
2 changes: 1 addition & 1 deletion server/main.py
@@ -94,7 +94,7 @@ async def query_main(
@sub_app.post(
"/query",
response_model=QueryResponse,
# NOTE: We are describing the the shape of the API endpoint input due to a current limitation in parsing arrays of objects from OpenAPI schemas. This will not be necessary in future.
# NOTE: We are describing the shape of the API endpoint input due to a current limitation in parsing arrays of objects from OpenAPI schemas. This will not be necessary in the future.
description="Accepts search query objects array each with query and optional filter. Break down complex questions into sub-questions. Refine results by criteria, e.g. time / source, don't do this often. Split queries if ResponseTooLargeError occurs.",
)
async def query(
2 changes: 1 addition & 1 deletion services/file.py
@@ -100,7 +100,7 @@ async def extract_text_from_form_file(file: UploadFile):

temp_file_path = "/tmp/temp_file"

# write the file to a temporary locatoin
# write the file to a temporary location
with open(temp_file_path, "wb") as f:
f.write(file_stream)

