
added download link to the notebook #73

Merged
merged 4 commits into from
Feb 6, 2023
4 changes: 4 additions & 0 deletions docs/source/examples/demo_ml_commons_integration.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,10 @@
"metadata": {},
"source": [
"# Demo Notebook for MLCommons Integration\n",
"\n",
"#### [download notebook](https://github.com/opensearch-project/opensearch-py-ml/blob/main/docs/source/examples/demo_ml_commons_integration.ipynb)\n",
"\n",
"\n",
"This notebook provides a walkthrough for users to invoke the ML Commons APIs to upload ML models to an OpenSearch cluster.\n",
"\n",
"Step 0: Import packages and set up client\n",
Expand Down
4 changes: 3 additions & 1 deletion docs/source/examples/demo_notebook.ipynb
Expand Up @@ -4,7 +4,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Demo Notebook for Dataframe"
"# Demo Notebook for Dataframe\n",
"\n",
"#### [download notebook](https://github.com/opensearch-project/opensearch-py-ml/blob/main/docs/source/examples/demo_notebook.ipynb)"
]
},
{
Expand Down
213 changes: 148 additions & 65 deletions docs/source/examples/demo_tracing_model_torchscript_onnx.ipynb

Large diffs are not rendered by default.

Expand Up @@ -7,6 +7,8 @@
"source": [
"# Demo Notebook for Sentence Transformer Model Training, Saving and Uploading to OpenSearch\n",
"\n",
"#### [download notebook](https://github.com/opensearch-project/opensearch-py-ml/blob/main/docs/source/examples/demo_transformer_model_train_save_upload_to_openSearch.ipynb)\n",
"\n",
"\n",
"## Introduction\n",
"\n",
Expand All @@ -19,15 +21,10 @@
"\n",
"### Synthetic query generation\n",
"\n",
"In the absence of such labelled data we provide a synthetic query generator (SQG) model that can be used to create synthetic queries given a passage. The SQG model is a large transformer model that has been trained to generate human like queries given a passage. Thus it can be used to create a labelled dataset of (synthetic queries, passage). A BERT model can be trained on this synthetic data and used for semantic search. In fact, we have shown that such synthetically trained models beat the current state-of-the-art models.\n",
"\n",
"### Train BERT Model with synthetic query data\n",
"\n",
"After generating synthetic query we can train Sentence Transformer model to get more precise embedding. \n",
"\n",
"In the absence of such labelled data we provide a synthetic query generator (SQG) model that can be used to create synthetic queries given a passage. The SQG model is a large transformer model that has been trained to generate human-like queries given a passage. Thus it can be used to create a labelled dataset of (synthetic queries, passage). A BERT model can be trained on this synthetic data and used for semantic search. In fact, we find that such synthetically trained models beat the current state-of-the-art models. Note that the resulting BERT model is a customized model since it has been trained on a specific corpus (and corresponding synthetic queries).\n",
"\n",
"\n",
"This notebook provides a walkthrough guidance for users use their synthetic queries to fine tune and train a sentence transformer model. In this notebook, you use opensearch_py_ml to accomplish the following:\n",
"This notebook provides an end-to-end guide for users to generate synthetic queries and fine-tune a sentence transformer model on them using opensearch_py_ml. It consists of the following steps:\n",
"\n",
"Step 1: Import packages and set up client\n",
"\n",
Expand All @@ -37,9 +34,9 @@
"\n",
"Step 4: Read synthetic queries and train/fine-tune model using a hugging face sentence transformer model\n",
"\n",
"Step 5: (Optional) Save model\n",
"Step 5: Upload the model to OpenSearch cluster\n",
"\n",
"Step 6: Upload the model to OpenSearch cluster"
"Steps 3 and 4 are compute-intensive, and we recommend running them on a machine with 4 or more GPUs, such as an EC2 `p3.8xlarge` or `p3.16xlarge`."
]
},
{
Expand Down Expand Up @@ -68,18 +65,14 @@
},
"outputs": [],
"source": [
"# pip install pandas matplotlib numpy torch accelerate sentence_transformers tqdm transformers opensearch-py opensearch-py-ml detoxify datasets "
"# !pip install pandas matplotlib numpy torch accelerate sentence_transformers tqdm transformers opensearch-py opensearch-py-ml detoxify datasets"
]
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 1,
"id": "87c021df",
"metadata": {
"pycharm": {
"is_executing": true
}
},
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
Expand Down Expand Up @@ -311,8 +304,8 @@
"outputs": [],
"source": [
"three_step_query = ss.generate_synthetic_queries(num_machines = 1,\n",
" overwrite = True,\n",
" total_queries = 10, \n",
" tokenize_data = True,\n",
" total_queries = 10,\n",
" numseq = 5,\n",
" num_gpu = 0,\n",
" toxic_cutoff = 0.01)"
Expand All @@ -325,21 +318,21 @@
"source": [
"A lot of actions are executed in the above cell. We elaborate on them step by step:\n",
"\n",
" 1) Convert the data into a form that can be consumed by the Synthetic query generator (SQG) model. This amounts to tokenizing the data using a tokenizer. The SQG model is a fine-tuned version of the GPT-XL model https://huggingface.co/gpt2-xl and the tokenizer is the GPT tokenizer. \n",
" \n",
" 2) The tokenizer has a max input length of 512 tokens. Every passage is tokenized with the special tokens <|startoftext|> and QRY: appended to the beginning and the end of every passage respectively.\n",
" \n",
" 3) Load the SQG model i.e. 1.5B parameter GPT2-XL model that has been trained to ask questions given passages. This model has been made publicly available and can be found here https://ci.opensearch.org/ci/dbc/models/ml-models/amazon/gpt/GPT2_xl_sqg/1.0.0/GPT2_xl_sqg.zip. \n",
" \n",
" 4) Once the model has been loaded and the data has been tokenized, the model starts the process of query generation. \"total_queries\" is number of synthetic queries generated for every passage and \"numseq\" is the number of queries that are generated by a model at a given time. Ideally total_queries = numseq, but this can lead to out of memory issues. So set numseq to an integer that is around 10 or less, and is a divisor of total_queries. \n",
" \n",
" It also needs the number of GPUs and the number of machines/nodes that it can use. Since we are using a single node instance with no GPUs we pass 0 and 1 to the function. \n",
" \n",
" 5) The function now begins to generate queries and displays a progress bar. We create total_queries queries per passage. Empirically we find that generating more queries leads to better performance but there are diminishing returns since the total inference time increases with total_queries.\n",
" \n",
" 6) After generating the queries, the function uses a publicly available package called Detoxify to remove inappropriate queries from the dataset. \"toxic_cutoff\" is a float. The script rejects all queries that have a toxicity score greater than toxic_cutoff.\n",
" \n",
" 7) Finally, the synthetic queries along with their corresponding passages are saved in a zipped file in the current working directory."
" 1) Convert the data into a form that can be consumed by the Synthetic query generator (SQG) model. This amounts to tokenizing the data using a tokenizer. The SQG model is a fine-tuned version of the GPT-XL model https://huggingface.co/gpt2-xl and the tokenizer is the GPT tokenizer.\n",
"\n",
" 2) The tokenizer has a max input length of 512 tokens. Every passage is tokenized with the special tokens <|startoftext|> and QRY: appended to the beginning and the end of every passage respectively. Note that tokenization is a time-intensive process and the script saves the tokenized data after the first pass. We recommend setting tokenize_data = False subsequently. \n",
"\n",
" 3) Load the SQG model i.e. 1.5B parameter GPT2-XL model that has been trained to ask questions given passages. This model has been made publicly available and can be found here https://ci.opensearch.org/ci/dbc/models/ml-models/amazon/gpt/GPT2_xl_sqg/1.0.0/GPT2_xl_sqg.zip.\n",
"\n",
" 4) Once the model has been loaded and the data has been tokenized, the model starts the process of query generation. \"total_queries\" is the number of synthetic queries generated for every passage and \"numseq\" is the number of queries that are generated by the model at a given time. Ideally total_queries = numseq, but this can lead to out-of-memory issues. So set numseq to an integer that is around 10 or less and is a divisor of total_queries.\n",
"\n",
" 5) The script also needs to know the number of GPUs and the number of machines/nodes that it can use. Since we are using a single-node instance with no GPUs, we pass 0 and 1 to the function respectively. Our recommended setting is to use 1 machine/node with at least 4 (ideally 8) GPUs. \n",
"\n",
" 6) The script now begins to generate queries and displays a progress bar. We create total_queries queries per passage. Empirically we find that generating more queries leads to better performance but there are diminishing returns since the total inference time increases with total_queries.\n",
"\n",
" 7) After generating the queries, the function uses a publicly available package called Detoxify to remove inappropriate queries from the dataset. \"toxic_cutoff\" is a float. The script rejects all queries that have a toxicity score greater than toxic_cutoff\n",
"\n",
" 8) Finally, the synthetic queries along with their corresponding passages are saved in a zipped file in the current working directory."
]
},
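The special-token wrapping described in step 2 can be sketched in plain Python. The function name and exact whitespace here are illustrative assumptions, not the library's actual implementation:

```python
def format_passage_for_sqg(passage: str) -> str:
    # Wrap a passage with the markers the SQG model was trained on:
    # "<|startoftext|>" opens the sequence, and the trailing "QRY:"
    # cues the model to generate a query for the passage.
    return f"<|startoftext|> {passage} QRY:"

prompt = format_passage_for_sqg("OpenSearch is a community-driven search suite.")
print(prompt)
# Note: the GPT tokenizer caps input at 512 tokens, so longer
# passages must be truncated before wrapping.
```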
{
Expand Down Expand Up @@ -413,60 +406,12 @@
" verbose = False)"
]
},
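As a worked example of the numseq guidance above (an integer around 10 or less that divides total_queries), a small helper can pick a valid value. This helper is purely illustrative and not part of opensearch-py-ml:

```python
def pick_numseq(total_queries: int, cap: int = 10) -> int:
    # Return the largest divisor of total_queries that does not exceed
    # cap, so queries are generated in evenly sized batches without
    # risking out-of-memory errors from an oversized numseq.
    for candidate in range(min(cap, total_queries), 0, -1):
        if total_queries % candidate == 0:
            return candidate
    return 1

print(pick_numseq(10))  # -> 10, the whole batch fits under the cap
print(pick_numseq(24))  # -> 8, the largest divisor of 24 at most 10
```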
{
"cell_type": "markdown",
"id": "4a655a9c",
"metadata": {},
"source": [
"## Step 5: (Optional) Save model\n",
"If following step 1, the model zip file is auto-generated, and the printed message indicates the zip file path as shown above. \n",
"\n",
"But if using another pretrained sentence transformer model from Hugging Face, users can use the `save_as_pt` function to save it for inference or benchmarking against other models. \n",
"\n",
"The `save_as_pt` function prepares the model in the proper format (TorchScript) along with the tokenizer configuration file for upload to OpenSearch. Please visit [SentenceTransformerModel.save_as_pt](https://opensearch-project.github.io/opensearch-py-ml/reference/api/sentence_transformer.html#opensearch_py_ml.sentence_transformer_model.SentenceTransformerModel.save_as_pt) for the API reference. "
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "503f8136",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"model file is saved to /Volumes/workplace/upload_content/all-MiniLM-L6-v2.pt\n",
"zip file is saved to /Volumes/workplace/upload_content/all-MiniLM-L6-v2.zip \n",
"\n"
]
},
{
"data": {
"text/plain": [
"'/Volumes/workplace/upload_content/all-MiniLM-L6-v2.zip'"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# default to download model id, \"sentence-transformers/msmarco-distilbert-base-tas-b\" from hugging face \n",
"# and output a model in a zip file containing model.pt file and tokenizers.json file. \n",
"pre_trained_model = SentenceTransformerModel(folder_path = '/Volumes/workplace/upload_content/', overwrite = True)\n",
"pre_trained_model.save_as_pt(model_id = \"sentence-transformers/all-MiniLM-L6-v2\", sentences=[\"Sentences needs to be bigger\"])"
]
},
{
"cell_type": "markdown",
"id": "c9bd0405",
"metadata": {},
"source": [
"## Step 6: Upload the model to OpenSearch cluster\n",
"## Step 5: Upload the model to OpenSearch cluster\n",
"After generating a model zip file, users need to describe the model configuration in an ml-commons_model_config.json file. The `make_model_config_json` function in the SentenceTransformerModel class parses the Hugging Face config.json file to build this configuration. If users would like a different configuration than the pre-trained sentence transformer's, the `make_model_config_json` function provides arguments to change the configuration content and generate an ml-commons_model_config.json file. Please visit [SentenceTransformerModel.make_model_config_json](https://opensearch-project.github.io/opensearch-py-ml/reference/api/sentence_transformer.html#opensearch_py_ml.sentence_transformer_model.SentenceTransformerModel.make_model_config_json) for the API reference. \n",
"\n",
"In general, the ML Commons client supports uploading sentence transformer models. With a zip file containing the model in TorchScript format and a tokenizer configuration file in JSON format, the `upload_model` function connects to OpenSearch through the ML client and uploads the model. Please visit [MLCommonClient.upload_model](https://opensearch-project.github.io/opensearch-py-ml/reference/api/ml_commons_upload_api.html#opensearch_py_ml.ml_commons_integration.MLCommonClient.upload_model) for the API reference. "
Expand Down Expand Up @@ -551,14 +496,6 @@
"model_config_path = '/Volumes/workplace/upload_content/model_config.json'\n",
"ml_client.upload_model( model_path, model_config_path, isVerbose=True)"
]
},
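For orientation, the kind of configuration that `make_model_config_json` produces might look like the following. The exact keys and values here are assumptions based on ML Commons model-upload conventions, not the function's verbatim output:

```python
import json

# Illustrative ml-commons_model_config.json content for a sentence
# transformer upload; field names and values are assumed, not taken
# from the actual generated file.
model_config = {
    "name": "all-MiniLM-L6-v2",
    "version": 1,
    "model_format": "TORCH_SCRIPT",
    "model_config": {
        "model_type": "bert",
        "embedding_dimension": 384,
        "framework_type": "sentence_transformers",
    },
}

print(json.dumps(model_config, indent=2))
```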
{
"cell_type": "code",
"execution_count": null,
"id": "5a605df2",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
Expand Down
17 changes: 16 additions & 1 deletion docs/source/examples/index.rst
Expand Up @@ -4,12 +4,27 @@
Examples
========

Demo notebooks for Data Exploration Panda like DataFrame
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. toctree::
:maxdepth: 1

demo_notebook
online_retail_analysis

Demo notebooks for Model Training and Tracing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. toctree::
:maxdepth: 1

demo_transformer_model_train_save_upload_to_openSearch
demo_ml_commons_integration
demo_tracing_model_torchscript_onnx

Demo notebooks for ML Commons plugin integration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. toctree::
:maxdepth: 1

demo_ml_commons_integration
4 changes: 1 addition & 3 deletions docs/source/examples/online_retail_analysis.ipynb
Expand Up @@ -7,9 +7,7 @@
"name": "#%% md\n"
}
},
"source": [
"# Online Retail analysis"
]
"source": []
},
{
"cell_type": "markdown",
Expand Down
4 changes: 2 additions & 2 deletions noxfile.py
Expand Up @@ -63,7 +63,7 @@ def format(session):
session.install("black", "isort", "flynt")
session.run("python", "utils/license-headers.py", "fix", *SOURCE_FILES)
session.run("flynt", *SOURCE_FILES)
session.run("black", "--target-version=py37", *SOURCE_FILES)
session.run("black", "--target-version=py38", *SOURCE_FILES)
session.run("isort", "--profile=black", *SOURCE_FILES)
lint(session)

Expand All @@ -73,7 +73,7 @@ def lint(session):
# Install numpy to use its mypy plugin
# https://numpy.org/devdocs/reference/typing.html#mypy-plugin
session.install("black", "flake8", "mypy", "isort", "numpy")
session.install("--pre", "opensearch-py>=2")
session.install("--pre", "opensearch-py==2.1.1")
session.run("python", "utils/license-headers.py", "check", *SOURCE_FILES)
session.run("black", "--check", "--target-version=py37", *SOURCE_FILES)
session.run("isort", "--check", "--profile=black", *SOURCE_FILES)
Expand Down
1 change: 0 additions & 1 deletion opensearch_py_ml/ml_commons/ml_commons_client.py
Expand Up @@ -183,7 +183,6 @@ def unload_model(self, model_id: str, node_ids: List[str] = []) -> object:
)

def delete_model(self, model_id: str) -> object:

"""
This method deletes a model from opensearch cluster (using ml commons api)

Expand Down
1 change: 0 additions & 1 deletion opensearch_py_ml/ml_commons/model_uploader.py
Expand Up @@ -43,7 +43,6 @@ def __init__(self, os_client: OpenSearch):
def _upload_model(
self, model_path: str, model_meta_path: str, isVerbose: bool
) -> str:

"""
This method uploads model into opensearch cluster using ml-common plugin's api.
first this method creates a model id to store model metadata and then breaks the model zip file into
Expand Down