Added k-nn user guide and samples. (#449)
* Added k-nn user guide and samples.

Signed-off-by: dblock <dblock@amazon.com>

* Added async samples.

Signed-off-by: dblock <dblock@amazon.com>

* Renamed Lucene Filters with Efficient Filters.

Signed-off-by: dblock <dblock@amazon.com>

* Fixing TOC from Lucene filters to Efficient filters

Signed-off-by: Vacha Shah <vachshah@amazon.com>

---------

Signed-off-by: dblock <dblock@amazon.com>
Signed-off-by: Vacha Shah <vachshah@amazon.com>
Co-authored-by: Vacha Shah <vachshah@amazon.com>
dblock and VachaShah authored Jul 26, 2023
1 parent 58217d9 commit f54973e
Showing 12 changed files with 1,265 additions and 10 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -18,6 +18,7 @@ Inspired from [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
- Added support for latest OpenSearch versions 2.7.0, 2.8.0 ([#445](https://github.com/opensearch-project/opensearch-py/pull/445))
- Added samples ([#447](https://github.com/opensearch-project/opensearch-py/pull/447))
- Improved CI performance of integration with unreleased OpenSearch ([#318](https://github.com/opensearch-project/opensearch-py/pull/318))
- Added k-NN guide and samples ([#449](https://github.com/opensearch-project/opensearch-py/pull/449))
### Changed
- Moved security from `plugins` to `clients` ([#442](https://github.com/opensearch-project/opensearch-py/pull/442))
### Deprecated
12 changes: 5 additions & 7 deletions USER_GUIDE.md
@@ -26,11 +26,7 @@ Then import it like any other module:
from opensearchpy import OpenSearch
```

For better performance we recommend the async client. To add the async client to your project, install it using [pip](https://pip.pypa.io/):

```bash
pip install opensearch-py[async]
```
For better performance we recommend the async client. See [Asynchronous I/O](guides/async.md) for more information.

In general, we recommend using a package manager, such as [poetry](https://python-poetry.org/docs/), for your projects. This is the package manager used for [samples](samples).

@@ -61,7 +57,7 @@ info = client.info()
print(f"Welcome to {info['version']['distribution']} {info['version']['number']}!")
```

See [hello.py](samples/hello/hello.py) for a working sample, and [guides/ssl](guides/ssl.md) for how to set up SSL certificates.
See [hello.py](samples/hello/hello.py) for a working synchronous sample, and [guides/ssl](guides/ssl.md) for how to set up SSL certificates.

### Creating an Index

@@ -148,6 +144,7 @@ print(response)

## Advanced Features

- [Asynchronous I/O](guides/async.md)
- [Authentication (IAM, SigV4)](guides/auth.md)
- [Configuring SSL](guides/ssl.md)
- [Bulk Indexing](guides/bulk.md)
@@ -161,4 +158,5 @@ print(response)

- [Security](guides/plugins/security.md)
- [Alerting](guides/plugins/alerting.md)
- [Index Management](guides/plugins/index_management.md)
- [k-NN](guides/plugins/knn.md)
152 changes: 152 additions & 0 deletions guides/async.md
@@ -0,0 +1,152 @@
- [Asynchronous I/O](#asynchronous-io)
  - [Setup](#setup)
  - [Async Loop](#async-loop)
  - [Connect to OpenSearch](#connect-to-opensearch)
  - [Create an Index](#create-an-index)
  - [Index Documents](#index-documents)
  - [Refresh the Index](#refresh-the-index)
  - [Search](#search)
  - [Delete Documents](#delete-documents)
  - [Delete the Index](#delete-the-index)

# Asynchronous I/O

This client supports asynchronous I/O, which can improve performance and increase throughput when many requests are in flight at once. See [hello-async.py](../samples/hello/hello-async.py) or [knn-async-basics.py](../samples/knn/knn-async-basics.py) for a working asynchronous sample.
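
To illustrate where the throughput gain comes from, the following minimal sketch compares awaiting simulated requests one at a time with issuing them together through `asyncio.gather`. It does not contact OpenSearch: `fake_request` is a hypothetical stand-in that sleeps to mimic network latency, and the timings are illustrative only.

```python
import asyncio
import time


async def fake_request(i, latency=0.05):
    # Hypothetical stand-in for an awaitable client call such as client.index(...).
    await asyncio.sleep(latency)
    return i


async def run_sequential(n):
    # Each request starts only after the previous one has finished.
    return [await fake_request(i) for i in range(n)]


async def run_concurrent(n):
    # All requests are in flight at once; results keep their input order.
    return await asyncio.gather(*[fake_request(i) for i in range(n)])


def timed(coro_fn, n):
    # Run the coroutine function to completion and measure wall-clock time.
    start = time.perf_counter()
    result = asyncio.run(coro_fn(n))
    return result, time.perf_counter() - start


seq_result, seq_time = timed(run_sequential, 5)
con_result, con_time = timed(run_concurrent, 5)
print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

The sequential run waits roughly five latencies in a row, while the concurrent run waits roughly one. The same shape applies when `fake_request` is replaced by real client calls.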

## Setup

To add the async client to your project, install it using [pip](https://pip.pypa.io/):

```bash
pip install opensearch-py[async]
```

In general, we recommend using a package manager, such as [poetry](https://python-poetry.org/docs/), for your projects. This is the package manager used for [samples](../samples). The following example includes `opensearch-py[async]` in `pyproject.toml`.

```toml
[tool.poetry.dependencies]
opensearch-py = { path = "../", extras=["async"] }
```

## Async Loop

```python
import asyncio

from opensearchpy import AsyncOpenSearch

async def main():
    client = AsyncOpenSearch(...)
    try:
        # your code here
        pass
    finally:
        # close() is a coroutine on the async client, so it must be awaited
        await client.close()

if __name__ == "__main__":
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    loop.run_until_complete(main())
    loop.close()
```
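
On Python 3.7 and later, `asyncio.run` creates and closes the event loop for you, so the explicit loop management can be shortened. The following is a sketch of that pattern under one assumption: `StubClient` is a hypothetical stand-in for `AsyncOpenSearch`, used only so that the example runs without a cluster.

```python
import asyncio


class StubClient:
    """Hypothetical stand-in for AsyncOpenSearch; it only records whether it was closed."""

    def __init__(self):
        self.closed = False

    async def info(self):
        return {"version": {"distribution": "stub", "number": "0.0.0"}}

    async def close(self):
        self.closed = True


async def main(client):
    try:
        info = await client.info()
        return info["version"]["distribution"]
    finally:
        # Runs even if the body raises, so the client is always closed.
        await client.close()


client = StubClient()
distribution = asyncio.run(main(client))
print(distribution, client.closed)
```

With the real client, `main` would construct `AsyncOpenSearch(...)` itself, and the `try`/`finally` block would stay the same.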

## Connect to OpenSearch

```python
host = 'localhost'
port = 9200
auth = ('admin', 'admin') # For testing only. Don't store credentials in code.

client = AsyncOpenSearch(
    hosts = [{'host': host, 'port': port}],
    http_auth = auth,
    use_ssl = True,
    verify_certs = False,
    ssl_show_warn = False
)

info = await client.info()
print(f"Welcome to {info['version']['distribution']} {info['version']['number']}!")
```

## Create an Index

```python
index_name = 'test-index'

index_body = {
    'settings': {
        'index': {
            'number_of_shards': 4
        }
    }
}

if not await client.indices.exists(index=index_name):
    await client.indices.create(
        index_name,
        body=index_body
    )
```

## Index Documents

```python
await asyncio.gather(*[
    client.index(
        index = index_name,
        body = {
            'title': f"Moneyball {i}",
            'director': 'Bennett Miller',
            'year': '2011'
        },
        id = i
    ) for i in range(10)
])
```

## Refresh the Index

```python
await client.indices.refresh(index=index_name)
```

## Search

```python
q = 'miller'

query = {
    'size': 5,
    'query': {
        'multi_match': {
            'query': q,
            'fields': ['title^2', 'director']
        }
    }
}

results = await client.search(
    body = query,
    index = index_name
)

for hit in results["hits"]["hits"]:
    print(hit)
```

## Delete Documents

```python
await asyncio.gather(*[
    client.delete(
        index = index_name,
        id = i
    ) for i in range(10)
])
```

## Delete the Index

```python
await client.indices.delete(
    index = index_name
)
```
117 changes: 117 additions & 0 deletions guides/plugins/knn.md
@@ -0,0 +1,117 @@
- [k-NN Plugin](#k-nn-plugin)
  - [Basic Approximate k-NN](#basic-approximate-k-nn)
    - [Create an Index](#create-an-index)
    - [Index Vectors](#index-vectors)
    - [Search for Nearest Neighbors](#search-for-nearest-neighbors)
  - [Approximate k-NN with a Boolean Filter](#approximate-k-nn-with-a-boolean-filter)
  - [Approximate k-NN with an Efficient Filter](#approximate-k-nn-with-an-efficient-filter)

# k-NN Plugin

Short for k-nearest neighbors, the k-NN plugin enables users to search for the k nearest neighbors of a query point across an index of vectors. See the [k-NN documentation](https://opensearch.org/docs/latest/search-plugins/knn/index/) for more information.

## Basic Approximate k-NN

In the following example we create a 5-dimensional k-NN index with random data. You can find a synchronous version of this working sample in [samples/knn/knn-basics.py](../../samples/knn/knn-basics.py) and an asynchronous one in [samples/knn/knn-async-basics.py](../../samples/knn/knn-async-basics.py).

```bash
$ poetry run knn/knn-basics.py

Searching for [0.61, 0.05, 0.16, 0.75, 0.49] ...
{'_index': 'my-index', '_id': '3', '_score': 0.9252405, '_source': {'values': [0.64, 0.3, 0.27, 0.68, 0.51]}}
{'_index': 'my-index', '_id': '4', '_score': 0.802375, '_source': {'values': [0.49, 0.39, 0.21, 0.42, 0.42]}}
{'_index': 'my-index', '_id': '8', '_score': 0.7826564, '_source': {'values': [0.33, 0.33, 0.42, 0.97, 0.56]}}
```

### Create an Index

```python
dimensions = 5
client.indices.create(index_name,
    body={
        "settings":{
            "index.knn": True
        },
        "mappings":{
            "properties": {
                "values": {
                    "type": "knn_vector",
                    "dimension": dimensions
                },
            }
        }
    }
)
```

### Index Vectors

Create 10 random vectors and insert them using the bulk API.

```python
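import random

from opensearchpy import helpers

vectors = []
for i in range(10):
    # one random vector with `dimensions` components, rounded to 2 decimals
    vec = []
    for j in range(dimensions):
        vec.append(round(random.uniform(0, 1), 2))

    vectors.append({
        "_index": index_name,
        "_id": i,
        "values": vec,
    })

# send all documents in a single bulk request
helpers.bulk(client, vectors)

client.indices.refresh(index=index_name)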
```

### Search for Nearest Neighbors

Create a random vector of the same size and search for its nearest neighbors.

```python
vec = []
for j in range(dimensions):
    vec.append(round(random.uniform(0, 1), 2))

search_query = {
    "query": {
        "knn": {
            "values": {
                "vector": vec,
                "k": 3
            }
        }
    }
}

results = client.search(index=index_name, body=search_query)
for hit in results["hits"]["hits"]:
    print(hit)
```
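
The `_score` values in the sample output at the start of this section are consistent with the k-NN plugin's scoring for the `l2` space type, `1 / (1 + d²)`, where `d²` is the squared Euclidean distance between the query and the stored vector. The sketch below recomputes the score of the first hit in pure Python. It assumes the index uses the `l2` space type (the mapping in this guide does not set one explicitly), and `l2_score` is a name chosen here for illustration, not part of the client.

```python
def l2_score(query, candidate):
    # Squared Euclidean distance between the two vectors.
    squared_distance = sum((q - c) ** 2 for q, c in zip(query, candidate))
    # Smaller distances map to scores closer to 1.
    return 1.0 / (1.0 + squared_distance)


query = [0.61, 0.05, 0.16, 0.75, 0.49]
first_hit = [0.64, 0.3, 0.27, 0.68, 0.51]  # '_id': '3' in the sample output

score = l2_score(query, first_hit)
print(round(score, 4))  # → 0.9252
```

Because the score decreases as the distance grows, hits are returned nearest first.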

## Approximate k-NN with a Boolean Filter

In [the knn-boolean-filter.py sample](../../samples/knn/knn-boolean-filter.py) we create a 5-dimensional k-NN index with random data and a `metadata` field that contains a book genre (e.g. `fiction`). The search query is a k-NN search filtered by genre. Because the filter clause sits outside the k-NN query clause, it is applied after the k-NN search, so fewer than `k` hits may be returned.

```bash
$ poetry run knn/knn-boolean-filter.py

Searching for [0.08, 0.42, 0.04, 0.76, 0.41] with the 'romance' genre ...

{'_index': 'my-index', '_id': '445', '_score': 0.95886475, '_source': {'values': [0.2, 0.54, 0.08, 0.87, 0.43], 'metadata': {'genre': 'romance'}}}
{'_index': 'my-index', '_id': '2816', '_score': 0.95256233, '_source': {'values': [0.22, 0.36, 0.01, 0.75, 0.57], 'metadata': {'genre': 'romance'}}}
```
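
As a rough illustration of this behavior, the sketch below builds a query of the shape described and then models post-filtering in pure Python. Several things here are assumptions: the exact query body used by the sample may differ, the field names `values` and `metadata.genre` are taken from the output above, and the function names and the four documents are made up for the example. The model is also simplified, since a real approximate search runs per shard and segment.

```python
def boolean_filter_query(vector, genre, k):
    # Assumed query shape: the genre filter sits beside the k-NN clause, not inside it.
    return {
        "query": {
            "bool": {
                "filter": {"term": {"metadata.genre": genre}},
                "must": {"knn": {"values": {"vector": vector, "k": k}}},
            }
        }
    }


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def post_filter_search(docs, vector, genre, k):
    # Step 1: pick the k nearest documents, ignoring genre.
    nearest = sorted(docs, key=lambda d: squared_distance(d["values"], vector))[:k]
    # Step 2: apply the filter to those k candidates only.
    return [d for d in nearest if d["metadata"]["genre"] == genre]


# Made-up documents for illustration.
docs = [
    {"_id": 0, "values": [0.0, 0.0], "metadata": {"genre": "romance"}},
    {"_id": 1, "values": [0.1, 0.0], "metadata": {"genre": "fiction"}},
    {"_id": 2, "values": [0.2, 0.0], "metadata": {"genre": "fiction"}},
    {"_id": 3, "values": [0.3, 0.0], "metadata": {"genre": "romance"}},
]

body = boolean_filter_query([0.0, 0.0], "romance", k=3)
hits = post_filter_search(docs, [0.0, 0.0], "romance", k=3)
print([d["_id"] for d in hits])  # → [0]
```

Document 3 matches the genre but is not among the three nearest neighbors, so it never reaches the filter and only one hit comes back.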

## Approximate k-NN with an Efficient Filter

In [the knn-efficient-filter.py sample](../../samples/knn/knn-efficient-filter.py) we implement the example from [the k-NN documentation](https://opensearch.org/docs/latest/search-plugins/knn/filter-search-knn/). It creates an index of hotel location and parking data whose mapping uses the Lucene engine with HNSW as the method, then searches for the top three hotels near the coordinates `[5, 4]` that are rated between 8 and 10, inclusive, and provide parking.

```bash
$ poetry run knn/knn-efficient-filter.py

{'_index': 'hotels-index', '_id': '3', '_score': 0.72992706, '_source': {'location': [4.9, 3.4], 'parking': 'true', 'rating': 9}}
{'_index': 'hotels-index', '_id': '6', '_score': 0.3012048, '_source': {'location': [6.4, 3.4], 'parking': 'true', 'rating': 9}}
{'_index': 'hotels-index', '_id': '5', '_score': 0.24154587, '_source': {'location': [3.3, 4.5], 'parking': 'true', 'rating': 8}}
```
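
The following sketch shows what such a query body could look like, with the filter placed inside the k-NN clause rather than beside it. It is modeled on the linked documentation and is not copied from the sample; the field names `location`, `rating`, and `parking` come from the output above, and `efficient_filter_query` and `l2_score` are names chosen here for illustration. The last lines recompute the first hit's score, assuming the `l2` space type.

```python
def efficient_filter_query(vector, k):
    return {
        "size": k,
        "query": {
            "knn": {
                "location": {
                    "vector": vector,
                    "k": k,
                    # The filter is part of the k-NN clause, so it is applied
                    # during the search instead of after it.
                    "filter": {
                        "bool": {
                            "must": [
                                {"range": {"rating": {"gte": 8, "lte": 10}}},
                                {"term": {"parking": "true"}},
                            ]
                        }
                    },
                }
            }
        },
    }


def l2_score(query, candidate):
    # Assumes the l2 space type: score = 1 / (1 + squared Euclidean distance).
    return 1.0 / (1.0 + sum((q - c) ** 2 for q, c in zip(query, candidate)))


body = efficient_filter_query([5, 4], k=3)
score = l2_score([5, 4], [4.9, 3.4])  # hotel '_id': '3' in the sample output
print(round(score, 4))  # → 0.7299
```

With a configured client, the body would be passed as `client.search(index="hotels-index", body=body)`.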