Rename eland to opensearch #3

Merged
merged 20 commits into from
Sep 9, 2022
PR feedback
LEFTA98 committed Sep 6, 2022
commit ed1f7f9ec802a3aef2db3646de7e46f93a295603
6 changes: 3 additions & 3 deletions .ci/jobs/defaults.yml
@@ -20,7 +20,7 @@
<commitId>, etc.)
properties:
- github:
-url: https://github.com/elastic/opensearch_py_ml
+url: https://github.com/opensearch-project/opensearch-py-ml
- inject:
properties-content: HOME=$JENKINS_HOME
concurrent: true
@@ -32,7 +32,7 @@
reference-repo: /var/lib/jenkins/.git-references/opensearch_py_ml.git
branches:
- ${branch_specifier}
-url: git@github.com:elastic/opensearch_py_ml.git
+url: git@github.com:opensearch-project/opensearch-py-ml.git
basedir: ''
wipe-workspace: 'True'
triggers:
@@ -46,7 +46,7 @@
- axis:
type: yaml
filename: .ci/test-matrix.yml
-name: ELASTICSEARCH_VERSION
+name: OPENSEARCH_VERSION
- axis:
type: yaml
filename: .ci/test-matrix.yml
2 changes: 1 addition & 1 deletion .ci/jobs/elastic+eland+7.x.yml
@@ -1,7 +1,7 @@
---
- job:
name: elastic+opensearch_py_ml+7.x
-display-name: 'elastic / opensearch_py_ml # 7.x'
+display-name: 'opensearch-project / opensearch-py-ml # 7.x'
description: Eland is a data science client with a Pandas-like interface
junit_results: "*-junit.xml"
parameters:
2 changes: 1 addition & 1 deletion .ci/jobs/elastic+eland+main.yml
@@ -1,7 +1,7 @@
---
- job:
name: elastic+opensearch_py_ml+main
-display-name: 'elastic / opensearch_py_ml # main'
+display-name: 'opensearch-project / opensearch-py-ml # main'
description: Eland is a data science client with a Pandas-like interface
junit_results: "*-junit.xml"
parameters:
2 changes: 1 addition & 1 deletion .ci/jobs/elastic+eland+pull-request.yml
@@ -1,7 +1,7 @@
---
- job:
name: elastic+opensearch_py_ml+pull-request
-display-name: 'elastic / opensearch_py_ml # pull-request'
+display-name: 'opensearch-project / opensearch-py-ml # pull-request'
description: Testing of opensearch_py_ml pull requests.
scm:
- git:
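Reviewer note: the three job files above change only `display-name`, while the `name:` keys (`elastic+opensearch_py_ml+...`) and two `description:` lines still use the earlier naming. Whether those should change is a project decision. A minimal sketch of a check that lists such leftovers follows. The pattern and the helper name `find_stale_names` are assumptions made for illustration and are not part of this PR.

```python
import re

# Hypothetical list of pre-rename names to look for. Plain substring matching
# is used on purpose, so unrelated words containing "eland" would also match.
STALE_PATTERN = re.compile(r"eland|elastic", re.IGNORECASE)


def find_stale_names(text):
    """Return (line_number, line) pairs that still mention a pre-rename name."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if STALE_PATTERN.search(line)
    ]


# Sample lines copied from the 7.x job file as it appears in this commit.
sample = """\
name: elastic+opensearch_py_ml+7.x
display-name: 'opensearch-project / opensearch-py-ml # 7.x'
description: Eland is a data science client with a Pandas-like interface
"""

for number, line in find_stale_names(sample):
    print(f"{number}: {line}")
```

The same function could be run over each file under `.ci/` to get a list of lines to review by hand.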
5 changes: 2 additions & 3 deletions .ci/test-matrix.yml
@@ -1,8 +1,7 @@
---

-ELASTICSEARCH_VERSION:
-- '8.1.0-SNAPSHOT'
-- '8.0.0-SNAPSHOT'
+OPENSEARCH_VERSION:
+- '2.2.0-SNAPSHOT'

PANDAS_VERSION:
- '1.2.0'
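Reviewer note: `.ci/jobs/defaults.yml` reads each axis from `.ci/test-matrix.yml`, so the number of builds depends on how many values each axis lists. A minimal sketch of that expansion follows, assuming the matrix job runs every combination of axis values. The dictionary copies only the values visible in this diff (the `PANDAS_VERSION` list is truncated here and may have more entries), and `expand_matrix` is a hypothetical helper, not something the CI tooling provides.

```python
from itertools import product

# Values copied from the visible part of .ci/test-matrix.yml after this commit.
# The PANDAS_VERSION list is cut off in the diff, so this is not the full matrix.
TEST_MATRIX = {
    "OPENSEARCH_VERSION": ["2.2.0-SNAPSHOT"],
    "PANDAS_VERSION": ["1.2.0"],
}


def expand_matrix(matrix):
    """Return one dict per combination of axis values (cartesian product)."""
    axes = list(matrix)
    value_lists = [matrix[axis] for axis in axes]
    return [dict(zip(axes, values)) for values in product(*value_lists)]


for combo in expand_matrix(TEST_MATRIX):
    print(combo)
```

Under that assumption, replacing two Elasticsearch versions with one OpenSearch version halves the number of combinations.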
264 changes: 7 additions & 257 deletions README.md
@@ -1,260 +1,10 @@
<div align="center">
<a href="https://github.com/elastic/eland">
<img src="https://raw.githubusercontent.com/elastic/eland/main/docs/sphinx/logo/opensearch_py_ml.png" width="30%"
alt="Eland" />
</a>
</div>
<br />
<div align="center">
<a href="https://pypi.org/project/eland"><img src="https://img.shields.io/pypi/v/opensearch_py_ml.svg" alt="PyPI Version"></a>
<a href="https://anaconda.org/conda-forge/eland"><img src="https://img.shields.io/conda/vn/conda-forge/eland"
alt="Conda Version"></a>
<a href="https://pepy.tech/project/eland"><img src="https://pepy.tech/badge/eland" alt="Downloads"></a>
<a href="https://pypi.org/project/eland"><img src="https://img.shields.io/pypi/status/opensearch_py_ml.svg"
alt="Package Status"></a>
<a href="https://clients-ci.elastic.co/job/elastic+eland+main"><img
src="https://clients-ci.elastic.co/buildStatus/icon?job=elastic%2Beland%2Bmain" alt="Build Status"></a>
<a href="https://github.com/elastic/eland/blob/main/LICENSE.txt"><img src="https://img.shields.io/pypi/l/opensearch_py_ml.svg"
alt="License"></a>
<a href="https://opensearch_py_ml.readthedocs.io"><img
src="https://readthedocs.org/projects/eland/badge/?version=latest" alt="Documentation Status"></a>
</div>

## About

Eland is a Python Elasticsearch client for exploring and analyzing data in Elasticsearch with a familiar
Pandas-compatible API.

Where possible the package uses existing Python APIs and data structures to make it easy to switch between numpy,
pandas, or scikit-learn to their Elasticsearch powered equivalents. In general, the data resides in Elasticsearch and
not in memory, which allows Eland to access large datasets stored in Elasticsearch.

Eland also provides tools to upload trained machine learning models from common libraries like
[scikit-learn](https://scikit-learn.org), [XGBoost](https://xgboost.readthedocs.io), and
[LightGBM](https://lightgbm.readthedocs.io) into Elasticsearch.

## Getting Started

Eland can be installed from [PyPI](https://pypi.org/project/eland) with Pip:

```bash
$ python -m pip install opensearch_py_ml
```

Eland can also be installed from [Conda Forge](https://anaconda.org/conda-forge/eland) with Conda:

```bash
$ conda install -c conda-forge opensearch_py_ml
```

### Compatibility

- Supports Python 3.7+ and Pandas 1.3
- Supports Elasticsearch clusters that are 7.11+, recommended 7.14 or later for all features to work.
Make sure your Eland major version matches the major version of your Elasticsearch cluster.

### Prerequisites

Users installing Eland on Debian-based distributions may need to install prerequisite packages for the transitive
dependencies of Eland:

```bash
$ sudo apt-get install -y \
build-essential pkg-config cmake \
python3-dev libzip-dev libjpeg-dev
```

Note that other distributions such as CentOS, RedHat, Arch, etc. may require using a different package manager and
specifying different package names.

### Docker

Users wishing to use Eland without installing it, in order to just run the available scripts, can build the Docker
container:

```bash
$ docker build -t elastic/opensearch_py_ml .
```

The container can now be used interactively:

```bash
$ docker run -it --rm --network host elastic/opensearch_py_ml
```

Running installed scripts is also possible without an interactive shell, e.g.:

```bash
$ docker run -it --rm --network host \
elastic/opensearch_py_ml \
eland_import_hub_model \
--url http://host.docker.internal:9200/ \
--hub-model-id elastic/distilbert-base-cased-finetuned-conll03-english \
--task-type ner \
--start
```

### Connecting to Elasticsearch

Eland uses the [Elasticsearch low level client](https://elasticsearch-py.readthedocs.io) to connect to Elasticsearch.
This client supports a range of [connection options and authentication options](https://elasticsearch-py.readthedocs.io/en/stable/api.html#elasticsearch).

You can pass either an instance of `elasticsearch.Elasticsearch` to Eland APIs
or a string containing the host to connect to:

```python
import opensearch_py_ml as ed

# Connecting to an Elasticsearch instance running on 'localhost:9200'
df = ed.DataFrame("localhost:9200", es_index_pattern="flights")

# Connecting to an Elastic Cloud instance
from elasticsearch import Elasticsearch

es = Elasticsearch(
cloud_id="cluster-name:...",
http_auth=("elastic", "<password>")
)
df = ed.DataFrame(es, es_index_pattern="flights")
```

## DataFrames in Eland

`opensearch_py_ml.DataFrame` wraps an Elasticsearch index in a Pandas-like API
and defers all processing and filtering of data to Elasticsearch
instead of your local machine. This means you can process large
amounts of data within Elasticsearch from a Jupyter Notebook
without overloading your machine.

➤ [Eland DataFrame API documentation](https://opensearch_py_ml.readthedocs.io/en/latest/reference/dataframe.html)

➤ [Advanced examples in a Jupyter Notebook](https://opensearch_py_ml.readthedocs.io/en/latest/examples/demo_notebook.html)

```python
>>> import opensearch_py_ml as ed

>>> # Connect to 'flights' index via localhost Elasticsearch node
>>> df = ed.DataFrame('localhost:9200', 'flights')

# opensearch_py_ml.DataFrame instance has the same API as pandas.DataFrame
# except all data is in Elasticsearch. See .info() memory usage.
>>> df.head()
AvgTicketPrice Cancelled ... dayOfWeek timestamp
0 841.265642 False ... 0 2018-01-01 00:00:00
1 882.982662 False ... 0 2018-01-01 18:27:00
2 190.636904 False ... 0 2018-01-01 17:11:14
3 181.694216 True ... 0 2018-01-01 10:33:28
4 730.041778 False ... 0 2018-01-01 05:13:00

[5 rows x 27 columns]

>>> df.info()
<class 'opensearch_py_ml.dataframe.DataFrame'>
Index: 13059 entries, 0 to 13058
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 AvgTicketPrice 13059 non-null float64
1 Cancelled 13059 non-null bool
2 Carrier 13059 non-null object
...
24 OriginWeather 13059 non-null object
25 dayOfWeek 13059 non-null int64
26 timestamp 13059 non-null datetime64[ns]
dtypes: bool(2), datetime64[ns](1), float64(5), int64(2), object(17)
memory usage: 80.0 bytes
Elasticsearch storage usage: 5.043 MB

# Filtering of rows using comparisons
>>> df[(df.Carrier=="Kibana Airlines") & (df.AvgTicketPrice > 900.0) & (df.Cancelled == True)].head()
AvgTicketPrice Cancelled ... dayOfWeek timestamp
8 960.869736 True ... 0 2018-01-01 12:09:35
26 975.812632 True ... 0 2018-01-01 15:38:32
311 946.358410 True ... 0 2018-01-01 11:51:12
651 975.383864 True ... 2 2018-01-03 21:13:17
950 907.836523 True ... 2 2018-01-03 05:14:51

[5 rows x 27 columns]

# Running aggregations across an index
>>> df[['DistanceKilometers', 'AvgTicketPrice']].aggregate(['sum', 'min', 'std'])
DistanceKilometers AvgTicketPrice
sum 9.261629e+07 8.204365e+06
min 0.000000e+00 1.000205e+02
std 4.578263e+03 2.663867e+02
```

## Machine Learning in Eland

### Regression and classification

Eland allows transforming trained regression and classification models from scikit-learn, XGBoost, and LightGBM
libraries to be serialized and used as an inference model in Elasticsearch.

➤ [Eland Machine Learning API documentation](https://opensearch_py_ml.readthedocs.io/en/latest/reference/ml.html)

➤ [Read more about Machine Learning in Elasticsearch](https://www.elastic.co/guide/en/machine-learning/current/ml-getting-started.html)

```python
>>> from xgboost import XGBClassifier
>>> from opensearch_py_ml.ml import MLModel

# Train and exercise an XGBoost ML model locally
>>> xgb_model = XGBClassifier(booster="gbtree")
>>> xgb_model.fit(training_data[0], training_data[1])

>>> xgb_model.predict(training_data[0])
[0 1 1 0 1 0 0 0 1 0]

# Import the model into Elasticsearch
>>> es_model = MLModel.import_model(
es_client="localhost:9200",
model_id="xgb-classifier",
model=xgb_model,
feature_names=["f0", "f1", "f2", "f3", "f4"],
)

# Exercise the ML model in Elasticsearch with the training data
>>> es_model.predict(training_data[0])
[0 1 1 0 1 0 0 0 1 0]
```

### NLP with PyTorch

For NLP tasks, Eland allows importing PyTorch trained BERT models into Elasticsearch. Models can be either plain PyTorch
models, or supported [transformers](https://huggingface.co/transformers) models from the
[Hugging Face model hub](https://huggingface.co/models).

```bash
$ eland_import_hub_model \
--url http://localhost:9200/ \
--hub-model-id elastic/distilbert-base-cased-finetuned-conll03-english \
--task-type ner \
--start
```

```python
>>> import elasticsearch
>>> from pathlib import Path
>>> from opensearch_py_ml.ml.pytorch import PyTorchModel
>>> from opensearch_py_ml.ml.pytorch.transformers import TransformerModel

# Load a Hugging Face transformers model directly from the model hub
>>> tm = TransformerModel("elastic/distilbert-base-cased-finetuned-conll03-english", "ner")
Downloading: 100%|██████████| 257/257 [00:00<00:00, 108kB/s]
Downloading: 100%|██████████| 954/954 [00:00<00:00, 372kB/s]
Downloading: 100%|██████████| 208k/208k [00:00<00:00, 668kB/s]
Downloading: 100%|██████████| 112/112 [00:00<00:00, 43.9kB/s]
Downloading: 100%|██████████| 249M/249M [00:23<00:00, 11.2MB/s]

# Export the model in a TorchScrpt representation which Elasticsearch uses
>>> tmp_path = "models"
>>> Path(tmp_path).mkdir(parents=True, exist_ok=True)
>>> model_path, config, vocab_path = tm.save(tmp_path)
+`opensearch-py-ml` is a Python client that provides a suite of data analytics and machine learning tools for OpenSearch.
+It is a fork of [eland](https://github.com/elastic/eland), which provides data analysis and machine learning
+support for Elasticsearch.

# Import model into Elasticsearch
>>> es = elasticsearch.Elasticsearch("http://elastic:mlqa_admin@localhost:9200", timeout=300) # 5 minute timeout
>>> ptm = PyTorchModel(es, tm.elasticsearch_model_id())
>>> ptm.import_model(model_path=model_path, config_path=None, vocab_path=vocab_path, config=config)
100%|██████████| 63/63 [00:12<00:00, 5.02it/s]
```
+`opensearch-py-ml` lets users call OpenSearch indices and manipulate them as if they were pandas DataFrames, supporting
+complex filtering and aggregation operations. It also provides rudimentary support for uploading models to OpenSearch
+clusters using the [ml-commons](https://github.com/opensearch-project/ml-commons) plugin, and provides integration with
+AWS SageMaker, allowing users to upload OpenSearch indices to deployed SageMaker endpoints for real-time prediction.
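
Reviewer note: the new README text says indices can be filtered as if they were pandas DataFrames. One way a DataFrame-style API can keep the data in the cluster is to turn comparison expressions into query DSL instead of evaluating them locally. The sketch below illustrates that idea only. It is not the opensearch-py-ml implementation, and `Field` and `Filter` are hypothetical names. The column names come from the `flights` example in the removed README.

```python
class Filter:
    """A query fragment in query DSL form (hypothetical helper)."""

    def __init__(self, query):
        self.query = query

    def __and__(self, other):
        # Combine two filters so that both must match.
        return Filter({"bool": {"must": [self.query, other.query]}})


class Field:
    """Stands in for a DataFrame column; comparisons build filters, not results."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, value):
        # Exact-value match on this field.
        return Filter({"term": {self.name: value}})

    def __gt__(self, value):
        # Strictly-greater-than range on this field.
        return Filter({"range": {self.name: {"gt": value}}})


carrier = Field("Carrier")
price = Field("AvgTicketPrice")

# Reads like a pandas boolean mask, but only builds a query body.
combined = (carrier == "Kibana Airlines") & (price > 900.0)
print(combined.query)
```

The resulting dictionary could be sent as the `query` part of a search request, so only matching documents leave the cluster.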