Update all notebooks to GPT-4o and GPT-4o-mini and new datasets
pablomarin committed Oct 4, 2024
1 parent b40b2c2 commit dd052c0
Showing 25 changed files with 31,541 additions and 2,277 deletions.
1 change: 0 additions & 1 deletion .gitignore
@@ -8,7 +8,6 @@ common/__pycache__/
 .streamlit/
 *.amltmp
 *.amltemp
-data/
 credentials.env
 .azure/
 .vscode/
139 changes: 109 additions & 30 deletions 01-Load-Data-ACogSearch.ipynb
@@ -24,13 +24,9 @@
 "In this Jupyter Notebook, we create and run enrichment steps to unlock searchable content in the specified Azure blob. It performs operations over mixed content in Azure Storage, such as images and application files, using a skillset that analyzes and extracts text information that becomes searchable in Azure Cognitive Search. \n",
 "The reference sample can be found at [Tutorial: Use Python and AI to generate searchable content from Azure blobs](https://docs.microsoft.com/azure/search/cognitive-search-tutorial-blob-python).\n",
 "\n",
-"In this demo we are going to be using a private (so we can mimic a private data lake scenario) Blob Storage container that has ~9.8k Computer Science publication PDFs from the Arxiv dataset.\n",
-"https://www.kaggle.com/datasets/Cornell-University/arxiv\n",
+"In this demo we are going to be using a private (so we can mimic a private data lake scenario) Blob Storage container that has all the dialogues of each episode of the TV series FRIENDS: 3.1k text files.\n",
 "\n",
-"If you want to explore the dataset, go [HERE](https://console.cloud.google.com/storage/browser/arxiv-dataset/arxiv/cs/pdf?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&prefix=&forceOnObjectsSortingFiltering=false)<br>\n",
-"Note: This dataset has been copy to a public azure blob container for this demo\n",
-"\n",
-"Although only PDF files are used here, this can be done at a much larger scale and Azure Cognitive Search supports a range of other file formats including: Microsoft Office (DOCX/DOC, XSLX/XLS, PPTX/PPT, MSG), HTML, XML, ZIP, and plain text files (including JSON).\n",
+"Although only TXT files are used here, this can be done at a much larger scale and Azure Cognitive Search supports a range of other file formats including: Microsoft Office (DOCX/DOC, XSLX/XLS, PPTX/PPT, MSG), HTML, XML, ZIP, and plain text files (including JSON).\n",
 "Azure Search supports the following sources: [Data Sources Gallery](https://learn.microsoft.com/EN-US/AZURE/search/search-data-sources-gallery)\n",
 "\n",
 "This notebook creates the following objects on your search service:\n",
@@ -57,23 +53,27 @@
 {
 "cell_type": "code",
 "execution_count": 1,
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "import os\n",
 "import json\n",
+"import shutil\n",
 "import requests\n",
 "from dotenv import load_dotenv\n",
 "load_dotenv(\"credentials.env\")\n",
 "\n",
-"# Name of the container in your Blob Storage Datasource ( in credentials.env)\n",
-"BLOB_CONTAINER_NAME = \"arxivcs\""
+"from common.utils import upload_file_to_blob, extract_zip_file, upload_directory_to_blob\n"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": 2,
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "# Define the names for the data source, skillset, index and indexer\n",
@@ -86,7 +86,9 @@
 {
 "cell_type": "code",
 "execution_count": 3,
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "# Setup the Payloads header\n",
@@ -98,13 +100,74 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Create Data Source (Blob container with the Arxiv CS pdfs)"
+"## Upload local dataset to Blob Container"
 ]
 },
+{
+"cell_type": "code",
+"execution_count": 4,
+"metadata": {
+"tags": []
+},
+"outputs": [
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"Extracting ./data/friends_transcripts.zip ... \n",
+"Extracted ./data/friends_transcripts.zip to ./data/temp_extract\n"
+]
+},
+{
+"name": "stderr",
+"output_type": "stream",
+"text": [
+"Uploading Files: 100%|██████████████████████████████████████████| 3107/3107 [08:47<00:00, 5.89it/s]\n"
+]
+},
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"Temp Folder: ./data/temp_extract removed\n",
+"CPU times: user 34.9 s, sys: 5.76 s, total: 40.6 s\n",
+"Wall time: 11min 21s\n"
+]
+}
+],
+"source": [
+"%%time\n",
+"\n",
+"# Define connection string and other parameters\n",
+"BLOB_CONTAINER_NAME = \"friends\"\n",
+"BLOB_NAME = \"friends_transcripts.zip\"\n",
+"LOCAL_FILE_PATH = \"./data/\" + BLOB_NAME # Path to the local file you want to upload\n",
+"upload_directory = \"./data/temp_extract\" # Temporary directory to extract the zip file\n",
+"\n",
+"# Extract the zip file\n",
+"extract_zip_file(LOCAL_FILE_PATH, upload_directory)\n",
+"\n",
+"# Upload the extracted files and folder structure\n",
+"upload_directory_to_blob(upload_directory, BLOB_CONTAINER_NAME)\n",
+"\n",
+"# Clean up: Optionally, you can remove the temp folder after uploading\n",
+"shutil.rmtree(upload_directory)\n",
+"print(f\"Temp Folder: {upload_directory} removed\")"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Create Data Source (Blob container with the FRIENDS txt files)"
+]
+},
 {
 "cell_type": "code",
-"execution_count": 4,
-"metadata": {},
+"execution_count": 5,
+"metadata": {
+"tags": []
+},
 "outputs": [
 {
 "name": "stdout",
Expand All @@ -120,7 +183,7 @@
"\n",
"datasource_payload = {\n",
" \"name\": datasource_name,\n",
" \"description\": \"Demo files to demonstrate cognitive search capabilities.\",\n",
" \"description\": \"Demo files to demonstrate ai search capabilities.\",\n",
" \"type\": \"azureblob\",\n",
" \"credentials\": {\n",
" \"connectionString\": os.environ['BLOB_CONNECTION_STRING']\n",
@@ -154,8 +217,10 @@
 },
 {
 "cell_type": "code",
-"execution_count": 5,
-"metadata": {},
+"execution_count": 6,
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "# If you get a 403 code, you probably have a wrong endpoint or key; you can debug by uncommenting this\n",
@@ -191,8 +256,10 @@
 },
 {
 "cell_type": "code",
-"execution_count": 6,
-"metadata": {},
+"execution_count": 7,
+"metadata": {
+"tags": []
+},
 "outputs": [
 {
 "name": "stdout",
@@ -284,8 +351,10 @@
 },
 {
 "cell_type": "code",
-"execution_count": 7,
-"metadata": {},
+"execution_count": 8,
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "# r.text"
@@ -337,8 +406,10 @@
 },
 {
 "cell_type": "code",
-"execution_count": 8,
-"metadata": {},
+"execution_count": 9,
+"metadata": {
+"tags": []
+},
 "outputs": [
 {
 "name": "stdout",
@@ -490,8 +561,10 @@
 },
 {
 "cell_type": "code",
-"execution_count": 9,
-"metadata": {},
+"execution_count": 10,
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "# print(r.text)"
@@ -513,8 +586,10 @@
 },
 {
 "cell_type": "code",
-"execution_count": 10,
-"metadata": {},
+"execution_count": 11,
+"metadata": {
+"tags": []
+},
 "outputs": [
 {
 "name": "stdout",
@@ -569,7 +644,9 @@
 {
 "cell_type": "code",
 "execution_count": 12,
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "outputs": [],
 "source": [
 "# Uncomment if you find an error\n",
@@ -585,7 +662,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 18,
+"execution_count": 20,
 "metadata": {
 "tags": []
 },
@@ -596,7 +673,7 @@
 "text": [
 "200\n",
 "Status: inProgress\n",
-"Items Processed: 154\n",
+"Items Processed: 2180\n",
 "True\n"
 ]
 }
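The 200 / Status / Items Processed output above comes from polling the indexer's status endpoint; a sketch, assuming the indexer_name defined near the top of the notebook and the same headers/params as before:

```python
# Sketch (assumption): poll the most recent indexer run.
r = requests.get(
    os.environ["AZURE_SEARCH_ENDPOINT"] + "/indexers/" + indexer_name + "/status",
    headers=headers,
    params=params,
)
print(r.status_code)
last_run = r.json()["lastResult"]
print("Status:", last_run["status"])                  # "inProgress", then "success"
print("Items Processed:", last_run["itemsProcessed"])
print(last_run["status"] in ("inProgress", "success"))
```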
@@ -620,7 +697,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"**When the indexer finishes running we will have all 9.8k documents indexed in your Search Engine!.**"
+"**When the indexer finishes running we will have all 994 documents indexed in your Search Engine!**\n",
+"\n",
+"**Note:** Notice that it only indexed 1 document (the zip file), but the AI Search service did the work of uncompressing it and indexing each individual doc."
 ]
 },
 {
@@ -666,7 +745,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.10.11"
+"version": "3.10.14"
 },
 "vscode": {
 "interpreter": {