
added download link to the notebook #73

Merged
merged 4 commits into from
Feb 6, 2023
4 changes: 4 additions & 0 deletions docs/source/examples/demo_ml_commons_integration.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,10 @@
"metadata": {},
"source": [
"# Demo Notebook for MLCommons Integration\n",
"\n",
"#### [download notebook](https://github.com/opensearch-project/opensearch-py-ml/blob/main/docs/source/examples/demo_ml_commons_integration.ipynb)\n",
"\n",
"\n",
"This notebook provides a walkthrough for users to invoke the ML Commons APIs to upload ML models to an OpenSearch cluster.\n",
"\n",
"Step 0: Import packages and set up client\n",
Expand Down
4 changes: 3 additions & 1 deletion docs/source/examples/demo_notebook.ipynb
Expand Up @@ -4,7 +4,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Demo Notebook for Dataframe"
"# Demo Notebook for Dataframe\n",
"\n",
"#### [download notebook](https://github.com/opensearch-project/opensearch-py-ml/blob/main/docs/source/examples/demo_notebook.ipynb)"
]
},
{
Expand Down
213 changes: 148 additions & 65 deletions docs/source/examples/demo_tracing_model_torchscript_onnx.ipynb

Large diffs are not rendered by default.

Expand Up @@ -7,6 +7,8 @@
"source": [
"# Demo Notebook for Sentence Transformer Model Training, Saving and Uploading to OpenSearch\n",
"\n",
"#### [download notebook](https://github.com/opensearch-project/opensearch-py-ml/blob/main/docs/source/examples/demo_transformer_model_train_save_upload_to_openSearch.ipynb)\n",
"\n",
"\n",
"## Introduction\n",
"\n",
Expand All @@ -19,15 +21,10 @@
"\n",
"### Synthetic query generation\n",
"\n",
"In the absence of such labelled data we provide a synthetic query generator (SQG) model that can be used to create synthetic queries given a passage. The SQG model is a large transformer model that has been trained to generate human like queries given a passage. Thus it can be used to create a labelled dataset of (synthetic queries, passage). A BERT model can be trained on this synthetic data and used for semantic search. In fact, we have shown that such synthetically trained models beat the current state-of-the-art models.\n",
"\n",
"### Train BERT Model with synthetic query data\n",
"\n",
"After generating synthetic query we can train Sentence Transformer model to get more precise embedding. \n",
"\n",
"In the absence of such labelled data we provide a synthetic query generator (SQG) model that can be used to create synthetic queries given a passage. The SQG model is a large transformer model that has been trained to generate human-like queries given a passage. Thus it can be used to create a labelled dataset of (synthetic queries, passage). A BERT model can be trained on this synthetic data and used for semantic search. In fact, we find that such synthetically trained models beat the current state-of-the-art models. Note that the resulting BERT model is a customized model since it has been trained on a specific corpus (and corresponding synthetic queries).\n",
"\n",
"\n",
"This notebook provides a walkthrough guidance for users use their synthetic queries to fine tune and train a sentence transformer model. In this notebook, you use opensearch_py_ml to accomplish the following:\n",
"This notebook provides an end-to-end guide for users to generate synthetic queries and fine-tune a sentence transformer model on them using opensearch_py_ml. It consists of the following steps:\n",
"\n",
"Step 1: Import packages and set up client\n",
"\n",
Expand All @@ -37,9 +34,9 @@
"\n",
"Step 4: Read synthetic queries and train/fine-tune model using a hugging face sentence transformer model\n",
"\n",
"Step 5: (Optional) Save model\n",
"Step 5: Upload the model to OpenSearch cluster\n",
"\n",
"Step 6: Upload the model to OpenSearch cluster"
"Steps 3 and 4 are compute-intensive, and we recommend running them on a machine with 4 or more GPUs, such as an EC2 `p3.8xlarge` or `p3.16xlarge`."
]
},
{
Expand Down Expand Up @@ -68,18 +65,14 @@
},
"outputs": [],
"source": [
"# pip install pandas matplotlib numpy torch accelerate sentence_transformers tqdm transformers opensearch-py opensearch-py-ml detoxify datasets "
"# !pip install pandas matplotlib numpy torch accelerate sentence_transformers tqdm transformers opensearch-py opensearch-py-ml detoxify datasets"
]
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 1,
"id": "87c021df",
"metadata": {
"pycharm": {
"is_executing": true
}
},
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
Expand Down Expand Up @@ -311,8 +304,8 @@
"outputs": [],
"source": [
"three_step_query = ss.generate_synthetic_queries(num_machines = 1,\n",
" overwrite = True,\n",
" total_queries = 10, \n",
" tokenize_data = True,\n",
" total_queries = 10,\n",
" numseq = 5,\n",
" num_gpu = 0,\n",
" toxic_cutoff = 0.01)"
Expand All @@ -325,21 +318,21 @@
"source": [
"A lot of actions are executed in the above cell. We elaborate on them step by step:\n",
"\n",
" 1) Convert the data into a form that can be consumed by the Synthetic query generator (SQG) model. This amounts to tokenizing the data using a tokenizer. The SQG model is a fine-tuned version of the GPT-XL model https://huggingface.co/gpt2-xl and the tokenizer is the GPT tokenizer. \n",
" \n",
" 2) The tokenizer has a max input length of 512 tokens. Every passage is tokenized with the special tokens <|startoftext|> and QRY: appended to the beginning and the end of every passage respectively.\n",
" \n",
" 3) Load the SQG model i.e. 1.5B parameter GPT2-XL model that has been trained to ask questions given passages. This model has been made publicly available and can be found here https://ci.opensearch.org/ci/dbc/models/ml-models/amazon/gpt/GPT2_xl_sqg/1.0.0/GPT2_xl_sqg.zip. \n",
" \n",
" 4) Once the model has been loaded and the data has been tokenized, the model starts the process of query generation. \"total_queries\" is number of synthetic queries generated for every passage and \"numseq\" is the number of queries that are generated by a model at a given time. Ideally total_queries = numseq, but this can lead to out of memory issues. So set numseq to an integer that is around 10 or less, and is a divisor of total_queries. \n",
" \n",
" It also needs the number of GPUs and the number of machines/nodes that it can use. Since we are using a single node instance with no GPUs we pass 0 and 1 to the function. \n",
" \n",
" 5) The function now begins to generate queries and displays a progress bar. We create total_queries queries per passage. Empirically we find that generating more queries leads to better performance but there are diminishing returns since the total inference time increases with total_queries.\n",
" \n",
" 6) After generating the queries, the function uses a publicly available package called Detoxify to remove inappropriate queries from the dataset. \"toxic_cutoff\" is a float. The script rejects all queries that have a toxicity score greater than toxic_cutoff.\n",
" \n",
" 7) Finally, the synthetic queries along with their corresponding passages are saved in a zipped file in the current working directory."
" 1) Convert the data into a form that can be consumed by the Synthetic query generator (SQG) model. This amounts to tokenizing the data using a tokenizer. The SQG model is a fine-tuned version of the GPT-XL model https://huggingface.co/gpt2-xl and the tokenizer is the GPT tokenizer.\n",
"\n",
" 2) The tokenizer has a max input length of 512 tokens. Every passage is tokenized with the special tokens <|startoftext|> and QRY: appended to the beginning and the end of every passage respectively. Note that tokenization is a time-intensive process and the script saves the tokenized data after the first pass. We recommend setting tokenize_data = False subsequently. \n",
"\n",
" 3) Load the SQG model i.e. 1.5B parameter GPT2-XL model that has been trained to ask questions given passages. This model has been made publicly available and can be found here https://ci.opensearch.org/ci/dbc/models/ml-models/amazon/gpt/GPT2_xl_sqg/1.0.0/GPT2_xl_sqg.zip.\n",
"\n",
" 4) Once the model has been loaded and the data has been tokenized, the model starts the process of query generation. \"total_queries\" is the number of synthetic queries generated for every passage and \"numseq\" is the number of queries that are generated by the model at a given time. Ideally total_queries = numseq, but this can lead to out-of-memory issues. So set numseq to an integer that is around 10 or less and is a divisor of total_queries.\n",
"\n",
" 5) The script also needs to know the number of GPUs and the number of machines/nodes that it can use. Since we are using a single-node instance with no GPUs, we pass 0 and 1 to the function respectively. Our recommended setting is to use 1 machine/node with at least 4 (ideally 8) GPUs. \n",
"\n",
" 6) The script now begins to generate queries and displays a progress bar. We create total_queries queries per passage. Empirically we find that generating more queries leads to better performance but there are diminishing returns since the total inference time increases with total_queries.\n",
"\n",
" 7) After generating the queries, the function uses a publicly available package called Detoxify to remove inappropriate queries from the dataset. \"toxic_cutoff\" is a float. The script rejects all queries that have a toxicity score greater than toxic_cutoff\n",
"\n",
" 8) Finally, the synthetic queries along with their corresponding passages are saved in a zipped file in the current working directory."
]
},
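The special-token wrapping described in step 2 can be sketched in plain Python. The function name and exact whitespace here are illustrative assumptions, not the library's actual implementation:

```python
def format_passage_for_sqg(passage: str) -> str:
    # Wrap a passage with the markers the SQG model was trained on:
    # "<|startoftext|>" opens the sequence, and the trailing "QRY:"
    # cues the model to generate a query for the passage.
    return f"<|startoftext|> {passage} QRY:"

prompt = format_passage_for_sqg("OpenSearch is a community-driven search suite.")
print(prompt)
# Note: the GPT tokenizer caps input at 512 tokens, so longer
# passages must be truncated before wrapping.
```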
{
Expand Down Expand Up @@ -413,60 +406,12 @@
" verbose = False)"
]
},
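As a worked example of the numseq guidance above (an integer around 10 or less that divides total_queries), a small helper can pick a valid value. This helper is purely illustrative and not part of opensearch-py-ml:

```python
def pick_numseq(total_queries: int, cap: int = 10) -> int:
    # Return the largest divisor of total_queries that does not exceed
    # cap, so queries are generated in evenly sized batches without
    # risking out-of-memory errors from an oversized numseq.
    for candidate in range(min(cap, total_queries), 0, -1):
        if total_queries % candidate == 0:
            return candidate
    return 1

print(pick_numseq(10))  # -> 10, the whole batch fits under the cap
print(pick_numseq(24))  # -> 8, the largest divisor of 24 at most 10
```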
{
"cell_type": "markdown",
"id": "4a655a9c",
"metadata": {},
"source": [
"## Step 5: (Optional) Save model\n",
"If following step 1, the model zip file is auto-generated, and the printed message indicates the zip file path as shown above. \n",
"\n",
"But if using another pretrained sentence transformer model from Hugging Face, users can use the `save_as_pt` function to save it for inference or benchmarking against other models. \n",
"\n",
"The `save_as_pt` function prepares the model in the proper format (TorchScript) along with the tokenizer configuration file for upload to OpenSearch. Please visit [SentenceTransformerModel.save_as_pt](https://opensearch-project.github.io/opensearch-py-ml/reference/api/sentence_transformer.html#opensearch_py_ml.sentence_transformer_model.SentenceTransformerModel.save_as_pt) for the API reference. "
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "503f8136",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"model file is saved to /Volumes/workplace/upload_content/all-MiniLM-L6-v2.pt\n",
"zip file is saved to /Volumes/workplace/upload_content/all-MiniLM-L6-v2.zip \n",
"\n"
]
},
{
"data": {
"text/plain": [
"'/Volumes/workplace/upload_content/all-MiniLM-L6-v2.zip'"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# default to download model id, \"sentence-transformers/msmarco-distilbert-base-tas-b\" from hugging face \n",
"# and output a model in a zip file containing model.pt file and tokenizers.json file. \n",
"pre_trained_model = SentenceTransformerModel(folder_path = '/Volumes/workplace/upload_content/', overwrite = True)\n",
"pre_trained_model.save_as_pt(model_id = \"sentence-transformers/all-MiniLM-L6-v2\", sentences=[\"Sentences needs to be bigger\"])"
]
},
{
"cell_type": "markdown",
"id": "c9bd0405",
"metadata": {},
"source": [
"## Step 6: Upload the model to OpenSearch cluster\n",
"## Step 5: Upload the model to OpenSearch cluster\n",
"After generating a model zip file, users need to describe the model configuration in an ml-commons_model_config.json file. The `make_model_config_json` function in the SentenceTransformerModel class parses the Hugging Face config.json file to build this configuration. If users would like a different configuration than the pre-trained sentence transformer's, the `make_model_config_json` function provides arguments to change the configuration content and generate an ml-commons_model_config.json file. Please visit [SentenceTransformerModel.make_model_config_json](https://opensearch-project.github.io/opensearch-py-ml/reference/api/sentence_transformer.html#opensearch_py_ml.sentence_transformer_model.SentenceTransformerModel.make_model_config_json) for the API reference. \n",
"\n",
"In general, the ML Commons client supports uploading sentence transformer models. With a zip file containing the model in TorchScript format and a tokenizer configuration file in JSON format, the `upload_model` function connects to OpenSearch through the ML client and uploads the model. Please visit [MLCommonClient.upload_model](https://opensearch-project.github.io/opensearch-py-ml/reference/api/ml_commons_upload_api.html#opensearch_py_ml.ml_commons_integration.MLCommonClient.upload_model) for the API reference. "
Expand Down Expand Up @@ -551,14 +496,6 @@
"model_config_path = '/Volumes/workplace/upload_content/model_config.json'\n",
"ml_client.upload_model( model_path, model_config_path, isVerbose=True)"
]
},
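For orientation, the kind of configuration that `make_model_config_json` produces might look like the following. The exact keys and values here are assumptions based on ML Commons model-upload conventions, not the function's verbatim output:

```python
import json

# Illustrative ml-commons_model_config.json content for a sentence
# transformer upload; field names and values are assumed, not taken
# from the actual generated file.
model_config = {
    "name": "all-MiniLM-L6-v2",
    "version": 1,
    "model_format": "TORCH_SCRIPT",
    "model_config": {
        "model_type": "bert",
        "embedding_dimension": 384,
        "framework_type": "sentence_transformers",
    },
}

print(json.dumps(model_config, indent=2))
```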
{
"cell_type": "code",
"execution_count": null,
"id": "5a605df2",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
Expand Down
17 changes: 16 additions & 1 deletion docs/source/examples/index.rst
Expand Up @@ -4,12 +4,27 @@
Examples
========

Demo notebooks for Data Exploration Panda like DataFrame
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. toctree::
:maxdepth: 1

demo_notebook
online_retail_analysis

Demo notebooks for Model Training and Tracing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. toctree::
:maxdepth: 1

demo_transformer_model_train_save_upload_to_openSearch
demo_ml_commons_integration
demo_tracing_model_torchscript_onnx

Demo notebooks for ML Commons plugin integration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. toctree::
:maxdepth: 1

demo_ml_commons_integration
4 changes: 1 addition & 3 deletions docs/source/examples/online_retail_analysis.ipynb
Expand Up @@ -7,9 +7,7 @@
"name": "#%% md\n"
}
},
"source": [
"# Online Retail analysis"
]
"source": []
},
{
"cell_type": "markdown",
Expand Down
4 changes: 2 additions & 2 deletions noxfile.py
Expand Up @@ -63,7 +63,7 @@ def format(session):
session.install("black", "isort", "flynt")
session.run("python", "utils/license-headers.py", "fix", *SOURCE_FILES)
session.run("flynt", *SOURCE_FILES)
session.run("black", "--target-version=py37", *SOURCE_FILES)
session.run("black", "--target-version=py38", *SOURCE_FILES)
session.run("isort", "--profile=black", *SOURCE_FILES)
lint(session)

Expand All @@ -73,7 +73,7 @@ def lint(session):
# Install numpy to use its mypy plugin
# https://numpy.org/devdocs/reference/typing.html#mypy-plugin
session.install("black", "flake8", "mypy", "isort", "numpy")
session.install("--pre", "opensearch-py>=2")
session.install("--pre", "opensearch-py==2.1.1")
session.run("python", "utils/license-headers.py", "check", *SOURCE_FILES)
session.run("black", "--check", "--target-version=py37", *SOURCE_FILES)
session.run("isort", "--check", "--profile=black", *SOURCE_FILES)
Expand Down
1 change: 0 additions & 1 deletion opensearch_py_ml/ml_commons/ml_commons_client.py
Expand Up @@ -183,7 +183,6 @@ def unload_model(self, model_id: str, node_ids: List[str] = []) -> object:
)

def delete_model(self, model_id: str) -> object:

"""
This method deletes a model from opensearch cluster (using ml commons api)

Expand Down
1 change: 0 additions & 1 deletion opensearch_py_ml/ml_commons/model_uploader.py
Expand Up @@ -43,7 +43,6 @@ def __init__(self, os_client: OpenSearch):
def _upload_model(
self, model_path: str, model_meta_path: str, isVerbose: bool
) -> str:

"""
This method uploads model into opensearch cluster using ml-common plugin's api.
first this method creates a model id to store model metadata and then breaks the model zip file into
Expand Down