
Commit edd6de9

feat: Add elastic store (#34)
* add elastic search store
* feat add vector search store
* update documentation
1 parent bc912c1 commit edd6de9

15 files changed: +569 -18 lines changed

docs/how-to/custom_views_code.py

Lines changed: 0 additions & 1 deletion

```diff
@@ -1,6 +1,5 @@
 # pylint: disable=missing-return-doc, missing-param-doc, missing-function-docstring, missing-class-docstring, missing-raises-doc
 import dbally
-import os
 import asyncio
 from dataclasses import dataclass
 from typing import Iterable, Callable, Any
```

docs/how-to/pandas_views_code.py

Lines changed: 0 additions & 3 deletions

```diff
@@ -1,9 +1,6 @@
 # pylint: disable=missing-return-doc, missing-param-doc, missing-function-docstring, missing-class-docstring, missing-raises-doc
 import dbally
-import os
 import asyncio
-from dataclasses import dataclass
-from typing import Iterable, Callable, Any
 import pandas as pd

 from dbally import decorators, DataFrameBaseView
```

docs/how-to/use_elastic_store.md

Lines changed: 103 additions & 0 deletions

# How-To Use Elastic to Store Similarity Index

[ElasticStore](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-store.html) can be used as a store in a SimilarityIndex. In this guide, we show how to execute a similarity search using Elasticsearch.
In the examples, the Elasticsearch engine is provided by the official Docker image. Two approaches to similarity search are available: Elastic Search Store and Elastic Vector Search.
Elastic Search Store uses embeddings and kNN search to find similarities, while Elastic Vector Search performs semantic search using the ELSER (Elastic Learned Sparse EncodeR) model to encode and search the data.

## Prerequisites

[Download and deploy the Elasticsearch Docker image](https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html). Please note that for Elastic Vector Search, the Elasticsearch Docker container requires at least 8GB of RAM and [license activation](https://www.elastic.co/guide/en/kibana/current/managing-licenses.html) to use the Machine Learning capabilities.

```commandline
docker network create elastic
docker pull docker.elastic.co/elasticsearch/elasticsearch:8.13.4
docker run --name es01 --net elastic -p 9200:9200 -it -m 2GB docker.elastic.co/elasticsearch/elasticsearch:8.13.4
```

Copy the generated elastic password and enrollment token; these credentials are shown only the first time you start Elasticsearch, but they can be regenerated later if needed. Copy the CA certificate out of the container and use it to verify the connection:

```commandline
docker cp es01:/usr/share/elasticsearch/config/certs/http_ca.crt .
curl --cacert http_ca.crt -u elastic:$ELASTIC_PASSWORD https://localhost:9200
```
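
If you lose the password or the enrollment token, both can be regenerated inside the running container. A typical invocation, assuming the container name `es01` used above:

```commandline
docker exec -it es01 /usr/share/elasticsearch/bin/elasticsearch-reset-password -u elastic
docker exec -it es01 /usr/share/elasticsearch/bin/elasticsearch-create-enrollment-token -s kibana
```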

To manage the Elasticsearch engine, create a Kibana container:

```commandline
docker run --name kib01 --net elastic -p 5601:5601 docker.elastic.co/kibana/kibana:8.13.4
```

By default, the Kibana management dashboard is available at [http://localhost:5601/](http://localhost:5601/).

For vector search, it is necessary to enroll in an [appropriate subscription level](https://www.elastic.co/subscriptions) or a trial version that supports machine learning.
Additionally, the [ELSER model must be downloaded](https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html), which can be done through Kibana. Instructions can be found in the hosted Kibana instance:
<br />downloading and deploying the model - **Analytics -> Machine Learning -> Trained Models**,
<br />vector search configuration - **Search -> Elastic Search -> Vector Search**.
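
If you prefer the API to the Kibana UI, the trial license can also be activated and the ELSER deployment inspected from Python. A rough sketch, assuming the official `elasticsearch` client and the credentials generated earlier (none of this is part of dbally itself):

```python
from elasticsearch import Elasticsearch

# Connection details from the Docker setup above; adjust the cert path and password.
es = Elasticsearch(
    "https://localhost:9200",
    ca_certs="http_ca.crt",
    basic_auth=("elastic", "your-elastic-password"),
)

# Start a 30-day trial license that unlocks the Machine Learning features.
print(es.license.post_start_trial(acknowledge=True))

# After ELSER v2 has been downloaded and deployed (e.g. through Kibana),
# check the deployment status of the model.
print(es.ml.get_trained_models_stats(model_id=".elser_model_2"))
```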

* Install the dbally elasticsearch extension

```commandline
pip install dbally[elasticsearch]
```

## Implementing a SimilarityIndex

To use similarity search, you need to define a data fetcher and a data store.

### Data fetcher

```python
class DummyCountryFetcher(SimilarityFetcher):
    async def fetch(self):
        return ["United States", "Canada", "Mexico"]
```

### Data store

The Elastic store similarity search works on embeddings. To create the embeddings, an embedding client is passed as an argument.
You can use [one of dbally embedding clients][dbally.embeddings.EmbeddingClient], such as [LiteLLMEmbeddingClient][dbally.embeddings.LiteLLMEmbeddingClient],

```python
from dbally.embeddings.litellm import LiteLLMEmbeddingClient

embedding_client = LiteLLMEmbeddingClient(api_key="your-api-key")
```

to define your [`ElasticsearchStore`][dbally.similarity.ElasticsearchStore].

```python
from dbally.similarity.elasticsearch_store import ElasticsearchStore

data_store = ElasticsearchStore(
    index_name="country_similarity",
    host="https://localhost:9200",
    ca_cert_path="path_to_cert/http_ca.crt",
    http_user="elastic",
    http_password="password",
    embedding_client=embedding_client,
)
```

After this setup, you can initialize the SimilarityIndex

```python
from dbally.similarity import SimilarityIndex

country_similarity = SimilarityIndex(
    fetcher=DummyCountryFetcher(),
    store=data_store,
)
```

and [update it and find the closest matches in the same way as in built-in similarity indices](use_custom_similarity_store.md/#using-the-similar).
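
For example, a minimal usage sketch (assuming the `update` and `similar` methods described in the linked guide):

```python
import asyncio


async def demo():
    # Populate the Elasticsearch index with the values returned by the fetcher.
    await country_similarity.update()

    # Map free-form user input to the closest stored value.
    closest = await country_similarity.similar("USA")
    print(closest)  # expected to return "United States"


asyncio.run(demo())
```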

You can then use this index to map user input to the closest matching value; for example, a user may type 'USA' and the index would return 'United States'.

To use Elastic Vector Search, download and deploy the [ELSER v2](https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html#elser-v2) model and create an [ingest pipeline](https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html#elasticsearch-ingest-pipeline).
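
The exact pipeline depends on your Elasticsearch version and on how the vector store indexes its documents. As a rough sketch (the pipeline id and field names below are purely illustrative, and the `input_output` processor syntax assumes Elasticsearch 8.11+), such a pipeline can be created with the official `elasticsearch` Python client:

```python
from elasticsearch import Elasticsearch

# Connection settings reused from the store configuration above.
es = Elasticsearch(
    "https://localhost:9200",
    ca_certs="path_to_cert/http_ca.crt",
    basic_auth=("elastic", "password"),
)

# Illustrative pipeline: run ELSER v2 on the "column" field at ingest time
# and store the resulting sparse embedding in "column_embedding".
es.ingest.put_pipeline(
    id="elser-ingest-pipeline",
    processors=[
        {
            "inference": {
                "model_id": ".elser_model_2",
                "input_output": [
                    {"input_field": "column", "output_field": "column_embedding"},
                ],
            }
        }
    ],
)
```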

## Links

* [Similarity Indexes](use_custom_similarity_store.md)
* [Example Elastic Search Store](use_elasticsearch_store_code.py)
* [Example Elastic Vector Search](use_elastic_vector_store_code.py)
docs/how-to/use_elastic_vector_store_code.py

Lines changed: 102 additions & 0 deletions

```python
# pylint: disable=missing-return-doc, missing-param-doc, missing-function-docstring
import os
import asyncio
from typing_extensions import Annotated

import asyncclick as click
from dotenv import load_dotenv
import sqlalchemy
from sqlalchemy import create_engine
from sqlalchemy.ext.automap import automap_base

import dbally
from dbally import decorators, SqlAlchemyBaseView
from dbally.audit.event_handlers.cli_event_handler import CLIEventHandler
from dbally.llms.litellm import LiteLLM
from dbally.similarity import SimpleSqlAlchemyFetcher, SimilarityIndex
from dbally.similarity.elastic_vector_search import ElasticVectorStore

load_dotenv()
engine = create_engine("sqlite:///candidates.db")


Base = automap_base()
Base.prepare(autoload_with=engine)

Candidate = Base.classes.candidates

country_similarity = SimilarityIndex(
    fetcher=SimpleSqlAlchemyFetcher(
        engine,
        table=Candidate,
        column=Candidate.country,
    ),
    store=ElasticVectorStore(
        index_name="country_vector_similarity",
        host=os.environ["ELASTIC_STORE_CONNECTION_STRING"],
        ca_cert_path=os.environ["ELASTIC_CERT_PATH"],
        http_user=os.environ["ELASTIC_AUTH_USER"],
        http_password=os.environ["ELASTIC_USER_PASSWORD"],
    ),
)


class CandidateView(SqlAlchemyBaseView):
    """
    A view for retrieving candidates from the database.
    """

    def get_select(self) -> sqlalchemy.Select:
        """
        Creates the initial SqlAlchemy select object, which will be used to build the query.
        """
        return sqlalchemy.select(Candidate)

    @decorators.view_filter()
    def at_least_experience(self, years: int) -> sqlalchemy.ColumnElement:
        """
        Filters candidates with at least `years` of experience.
        """
        return Candidate.years_of_experience >= years

    @decorators.view_filter()
    def senior_data_scientist_position(self) -> sqlalchemy.ColumnElement:
        """
        Filters candidates that can be considered for a senior data scientist position.
        """
        return sqlalchemy.and_(
            Candidate.position.in_(["Data Scientist", "Machine Learning Engineer", "Data Engineer"]),
            Candidate.years_of_experience >= 3,
        )

    @decorators.view_filter()
    def from_country(self, country: Annotated[str, country_similarity]) -> sqlalchemy.ColumnElement:
        """
        Filters candidates from a specific country.
        """
        return Candidate.country == country


@click.command()
@click.argument("country", type=str, default="United States")
@click.argument("years_of_experience", type=str, default="2")
async def main(country="United States", years_of_experience="2"):
    await country_similarity.update()

    llm = LiteLLM(model_name="gpt-3.5-turbo", api_key=os.environ["OPENAI_API_KEY"])
    collection = dbally.create_collection("recruitment", llm, event_handlers=[CLIEventHandler()])
    collection.add(CandidateView, lambda: CandidateView(engine))

    result = await collection.ask(
        f"Find someone from the {country} with more than {years_of_experience} years of experience."
    )

    print(f"The generated SQL query is: {result.context.get('sql')}")
    print()
    print(f"Retrieved {len(result.results)} candidates:")
    for candidate in result.results:
        print(candidate)


if __name__ == "__main__":
    asyncio.run(main())
```

docs/how-to/use_elasticsearch_store_code.py

Lines changed: 106 additions & 0 deletions

```python
# pylint: disable=missing-return-doc, missing-param-doc, missing-function-docstring
import os
import asyncio
from typing_extensions import Annotated

import asyncclick as click
from dotenv import load_dotenv
import sqlalchemy
from sqlalchemy import create_engine
from sqlalchemy.ext.automap import automap_base

import dbally
from dbally import decorators, SqlAlchemyBaseView
from dbally.audit.event_handlers.cli_event_handler import CLIEventHandler
from dbally.similarity import SimpleSqlAlchemyFetcher, SimilarityIndex
from dbally.embeddings.litellm import LiteLLMEmbeddingClient
from dbally.llms.litellm import LiteLLM
from dbally.similarity.elasticsearch_store import ElasticsearchStore

load_dotenv()
engine = create_engine("sqlite:///candidates.db")


Base = automap_base()
Base.prepare(autoload_with=engine)

Candidate = Base.classes.candidates

country_similarity = SimilarityIndex(
    fetcher=SimpleSqlAlchemyFetcher(
        engine,
        table=Candidate,
        column=Candidate.country,
    ),
    store=ElasticsearchStore(
        index_name="country_similarity",
        host=os.environ["ELASTIC_STORE_CONNECTION_STRING"],
        ca_cert_path=os.environ["ELASTIC_CERT_PATH"],
        http_user=os.environ["ELASTIC_AUTH_USER"],
        http_password=os.environ["ELASTIC_USER_PASSWORD"],
        embedding_client=LiteLLMEmbeddingClient(
            api_key=os.environ["OPENAI_API_KEY"],
        ),
    ),
)


class CandidateView(SqlAlchemyBaseView):
    """
    A view for retrieving candidates from the database.
    """

    def get_select(self) -> sqlalchemy.Select:
        """
        Creates the initial SqlAlchemy select object, which will be used to build the query.
        """
        return sqlalchemy.select(Candidate)

    @decorators.view_filter()
    def at_least_experience(self, years: int) -> sqlalchemy.ColumnElement:
        """
        Filters candidates with at least `years` of experience.
        """
        return Candidate.years_of_experience >= years

    @decorators.view_filter()
    def senior_data_scientist_position(self) -> sqlalchemy.ColumnElement:
        """
        Filters candidates that can be considered for a senior data scientist position.
        """
        return sqlalchemy.and_(
            Candidate.position.in_(["Data Scientist", "Machine Learning Engineer", "Data Engineer"]),
            Candidate.years_of_experience >= 3,
        )

    @decorators.view_filter()
    def from_country(self, country: Annotated[str, country_similarity]) -> sqlalchemy.ColumnElement:
        """
        Filters candidates from a specific country.
        """
        return Candidate.country == country


@click.command()
@click.argument("country", type=str, default="United States")
@click.argument("years_of_experience", type=str, default="2")
async def main(country="United States", years_of_experience="2"):
    await country_similarity.update()

    llm = LiteLLM(model_name="gpt-3.5-turbo", api_key=os.environ["OPENAI_API_KEY"])
    collection = dbally.create_collection("recruitment", llm, event_handlers=[CLIEventHandler()])
    collection.add(CandidateView, lambda: CandidateView(engine))

    result = await collection.ask(
        f"Find someone from the {country} with more than {years_of_experience} years of experience."
    )

    print(f"The generated SQL query is: {result.context.get('sql')}")
    print()
    print(f"Retrieved {len(result.results)} candidates:")
    for candidate in result.results:
        print(candidate)


if __name__ == "__main__":
    asyncio.run(main())
```

docs/quickstart/quickstart2_code.py

Lines changed: 8 additions & 3 deletions

```diff
@@ -1,28 +1,30 @@
 # pylint: disable=missing-return-doc, missing-param-doc, missing-function-docstring
-import dbally
 import os
 import asyncio
 from typing_extensions import Annotated
 
+from dotenv import load_dotenv
 import sqlalchemy
 from sqlalchemy import create_engine
 from sqlalchemy.ext.automap import automap_base
 
+import dbally
 from dbally import decorators, SqlAlchemyBaseView
 from dbally.audit.event_handlers.cli_event_handler import CLIEventHandler
 from dbally.similarity import SimpleSqlAlchemyFetcher, FaissStore, SimilarityIndex
 from dbally.embeddings.litellm import LiteLLMEmbeddingClient
 from dbally.llms.litellm import LiteLLM
 
-engine = create_engine('sqlite:///candidates.db')
+load_dotenv()
+engine = create_engine("sqlite:///candidates.db")
 
 Base = automap_base()
 Base.prepare(autoload_with=engine)
 
 Candidate = Base.classes.candidates
 
 country_similarity = SimilarityIndex(
-    fetcher=SimpleSqlAlchemyFetcher(
+    fetcher=SimpleSqlAlchemyFetcher(
         engine,
         table=Candidate,
         column=Candidate.country,
@@ -37,10 +39,12 @@
     ),
 )
 
+
 class CandidateView(SqlAlchemyBaseView):
     """
     A view for retrieving candidates from the database.
     """
+
     def get_select(self) -> sqlalchemy.Select:
         """
         Creates the initial SqlAlchemy select object, which will be used to build the query.
@@ -71,6 +75,7 @@ def from_country(self, country: Annotated[str, country_similarity]) -> sqlalchem
         """
         return Candidate.country == country
 
+
 async def main():
     await country_similarity.update()
 
```