docs: adding repetitions cookbook #9774
Open: SrilakshmiC wants to merge 3 commits into main from sri-rep-notebook
373 changes: 373 additions & 0 deletions
tutorials/experiments/running_experiments_with_repetitions.ipynb
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,373 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "dca316ce", | ||
"metadata": {}, | ||
"source": [ | ||
"<center>\n", | ||
" <p style=\"text-align:center\">\n", | ||
" <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg\" width=\"200\"/>\n", | ||
" <br>\n", | ||
" <a href=\"https://docs.arize.com/phoenix/\">Docs</a>\n", | ||
" |\n", | ||
" <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n", | ||
" |\n", | ||
" <a href=\"https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q\">Community</a>\n", | ||
" </p>\n", | ||
"</center>" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "e0dbe604", | ||
"metadata": {}, | ||
"source": [ | ||
"# Evaluation of Customer Reviews using Repetitions in Phoenix\n", | ||
"\n", | ||
"This notebook walks through how to generate synthetic customer reviews, upload them into **Phoenix**, and run evaluations to identify patterns and repetitions. \n", | ||
"We’ll go step by step: generating data, structuring it into a dataset, and finally running experiments inside Phoenix to compare model outputs against reference labels. \n", | ||
"Along the way, we’ll also look at screenshots of the Phoenix UI to see how datasets and experiments are visualized. \n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "a82d6871", | ||
"metadata": {}, | ||
"source": [ | ||
"### Setup & Installation\n", | ||
"We start by installing the required dependencies: \n", | ||
"- **pandas** for data manipulation \n", | ||
"- **openai** for LLM calls \n", | ||
"- **arize-phoenix** to log and evaluate results \n", | ||
"- **nest_asyncio** to avoid issues when running async code in notebooks \n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "2935272c", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%pip install pandas openai arize-phoenix nest_asyncio" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "6facc3b2", | ||
"metadata": {}, | ||
"source": [ | ||
"### Importing Libraries\n", | ||
"Next, we import the libraries needed to: \n", | ||
"- Generate synthetic customer reviews using the OpenAI API \n", | ||
"- Register Phoenix for tracking experiments \n", | ||
"- Prepare data and send it into Phoenix for evaluation \n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "91129268", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import json\n", | ||
"import os\n", | ||
"import re\n", | ||
"from getpass import getpass\n", | ||
"\n", | ||
"import nest_asyncio\n", | ||
"import pandas as pd\n", | ||
"from openai import OpenAI\n", | ||
"\n", | ||
"from phoenix.client import Client\n", | ||
"from phoenix.otel import register\n", | ||
"\n", | ||
"nest_asyncio.apply()\n", | ||
"\n", | ||
"if not (openai_api_key := os.getenv(\"OPENAI_API_KEY\")):\n", | ||
" openai_api_key = getpass(\"🔑 Enter your OpenAI API key: \")\n", | ||
"\n", | ||
"os.environ[\"OPENAI_API_KEY\"] = openai_api_key\n", | ||
"\n", | ||
"openai_client = OpenAI()\n", | ||
"\n", | ||
"client = Client()\n", | ||
"\n", | ||
"tracer_provider = register(project_name=\"generating-datasets\", auto_instrument=True)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "137a42d3", | ||
"metadata": {}, | ||
"source": [ | ||
"### Generating Synthetic Customer Reviews\n", | ||
"Here, we create a **few-shot prompt** that instructs the model to generate 25 product reviews for clothing items. \n", | ||
"This ensures we have a realistic dataset to evaluate with multiple tones and sentiments. \n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "6a8a6926", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"few_shot_prompt = \"\"\"\n", | ||
"You are a creative writer simulating customer product reviews for a clothing brand.\n", | ||
"Generate 25 unique reviews. Each review should be a few sentences long (max 200 words each) and sound like something a real customer might write.\n", | ||
"\n", | ||
"Balance them across the following categories:\n", | ||
"1. Highly Positive & Actionable → clear praise AND provides constructive suggestions for improvement.\n", | ||
"2. Positive but Generic → generally favorable but vague.\n", | ||
"3. Neutral / Mixed → highlights both pros and cons.\n", | ||
"4. Negative but Actionable → critical but with constructive feedback.\n", | ||
"5. Highly Negative & Non-Constructive → strongly negative, unhelpful venting.\n", | ||
"6. Off-topic → not about clothing at all (e.g., a review mistakenly left about a different product or service). Don't say anything about how the product is not about clothing.\n", | ||
"\n", | ||
"Constraints:\n", | ||
"- Cover all 6 categories across the 25 reviews\n", | ||
"- Make them varied: mention different clothing items (e.g., jeans, jackets, dresses, shirts, shoes).\n", | ||
"- Use a natural human voice, with realistic details.\n", | ||
"- Constructive feedback should be specific and actionable.\n", | ||
"- Do not number or label the reviews. Output just the raw reviews as paragraphs, separated by line breaks.\n", | ||
"- Make some of them ambiguous and harder to classify.\n", | ||
"- Decide the classified label first and then write the review. Double check all the reviews and make sure you classify them correctly.\n", | ||
"\n", | ||
"OUTPUT SHAPE (JSON array ONLY; no extra text):\n", | ||
"[\n", | ||
" {\n", | ||
" \"input\": str,\n", | ||
" \"label\": \"highly positive & actionable\" | \"positive but generic\" | \"neutral\" | \"negative but actionable\" | \"highly negative\" | \"off-topic\",\n", | ||
" }\n", | ||
"]\n", | ||
"\n", | ||
"Style Examples, Here are examples for guidance (do not repeat):\n", | ||
"{\n", | ||
" \"input\": \"I absolutely love the new denim jacket I purchased. The fit is perfect, the stitching feels durable, and I’ve already gotten compliments. The inside lining is soft and makes it comfortable to wear for hours. One small suggestion would be to add an inner pocket for a phone or keys — that would make it perfect. Overall, I’ll definitely be back for more.\",\n", | ||
" \"label\": \"highly positive & actionable\"\n", | ||
"}\n", | ||
"{\n", | ||
" \"input\": \"The T-shirt I bought was nice. The color was good and it felt comfortable. I liked it overall and would probably buy again.\",\n", | ||
" \"label\": \"positive but generic\"\n", | ||
"}\n", | ||
"{\n", | ||
" \"input\": \"The dress arrived on time and the material is soft. However, the sizing runs a bit small, and the shade of blue was lighter than pictured. It’s not bad, but I’m not as excited about it as I hoped.\",\n", | ||
" \"label\": \"neutral\"\n", | ||
"}\n", | ||
"{\n", | ||
" \"input\": \"The shoes looked stylish but the soles wore down quickly after just a month. If the company improved the durability of the soles, these would be a great buy. Right now, I don’t think they’re worth the price.\",\n", | ||
" \"label\": \"negative but actionable\"\n", | ||
"}\n", | ||
"{\n", | ||
" \"input\": \"This sweater is terrible. The worst thing I’ve ever bought. Waste of money.\",\n", | ||
" \"label\": \"highly negative & non-constructive\"\n", | ||
"}\n", | ||
"{\n", | ||
" \"input\": \"I'm very disappointed in my delivery. The dog food arrived late and was leaking.\",\n", | ||
" \"label\": \"off-topic\"\n", | ||
"}\n", | ||
"\"\"\"" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "e07ca64a", | ||
"metadata": {}, | ||
"source": [ | ||
"### Running the LLM to Generate Data\n", | ||
"We send our prompt to the OpenAI model (`gpt-4o-mini`) to generate the reviews. \n", | ||
"The output will be a structured set of text responses simulating customer feedback. \n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "e88bc163", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"resp = openai_client.chat.completions.create(\n", | ||
" model=\"gpt-4o-mini\",\n", | ||
" messages=[{\"role\": \"user\", \"content\": few_shot_prompt}],\n", | ||
" temperature=0.9,\n", | ||
")\n", | ||
"content = resp.choices[0].message.content.strip()\n", | ||
"\n", | ||
"try:\n", | ||
" data = json.loads(content)\n", | ||
"except json.JSONDecodeError:\n", | ||
" m = re.search(r\"\\[\\s*{.*}\\s*\\]\\s*$\", content, re.S)\n", | ||
" assert m, \"Model did not return a JSON array.\"\n", | ||
" data = json.loads(m.group(0))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "c08662cd", | ||
"metadata": {}, | ||
"source": [ | ||
"### Creating a DataFrame\n", | ||
"We load the generated responses into a **pandas DataFrame** with two columns: \n", | ||
"- `input`: the customer review text \n", | ||
"- `label`: the sentiment category we will later evaluate \n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "659fd27e", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"df = pd.DataFrame(data)[[\"input\", \"label\"]]\n", | ||
"df" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "d91ac12e", | ||
"metadata": {}, | ||
"source": [ | ||
"### Uploading to Phoenix\n", | ||
"We now create a **Phoenix dataset** named `clothing-product-reviews` from our DataFrame. This allows us to track, explore, and evaluate the generated reviews inside Phoenix. \n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "ed69072e", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"dataset = client.datasets.create_dataset(\n", | ||
" name=\"clothing-product-reviews\",\n", | ||
" dataframe=df,\n", | ||
" input_keys=[\"input\"],\n", | ||
" output_keys=[\"label\"],\n", | ||
")\n", | ||
"print(\"Dataset created.\")\n", | ||
"\n", | ||
"dataset = client.datasets.get_dataset(dataset=\"clothing-product-reviews\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "22b4a183", | ||
"metadata": {}, | ||
"source": [ | ||
"This is what your uploaded dataset will look like in the Phoenix UI! \n", | ||
"\n", | ||
"<img alt=\"uploaded dataset image\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/phoenix-docs-images/repetitions_dataset_view.png\" width=\"900\"/>" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "96c013ba", | ||
"metadata": {}, | ||
"source": [ | ||
"### Defining the Evaluation Task\n", | ||
"We define a task function that represents how we want to evaluate each review. \n", | ||
"This is where you could run another LLM pass (or a heuristic) to classify the review. \n", | ||
"Phoenix wraps each run into an **Example** object for easy logging. \n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "6714c759", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from phoenix.experiments.types import Example\n", | ||
"\n", | ||
"\n", | ||
"# Define your task\n", | ||
"def my_task(example: Example) -> str:\n", | ||
" TASK_PROMPT = f\"\"\"\n", | ||
" You are an expert in customer experience analysis for fashion and apparel.\n", | ||
"\n", | ||
" TASK\n", | ||
" You will be given a single customer review. Your job is to classify the overall type of review into exactly one category.\n", | ||
" Consider the review’s overall tone, level of detail, presence of suggestions, and whether it is truly about clothing.\n", | ||
" Do not overthink minor wording differences—choose the label that feels most appropriate overall.\n", | ||
" Double check your review and make sure you classify them correctly.\n", | ||
"\n", | ||
" Labels:\n", | ||
" - Highly Positive & Actionable\n", | ||
" - Positive but Generic\n", | ||
" - Neutral / Mixed\n", | ||
" - Negative but Actionable\n", | ||
" - Highly Negative & Non-Constructive\n", | ||
" - Off-topic\n", | ||
"\n", | ||
" Here is the customer review: {example.input}\n", | ||
"\n", | ||
" RESPONSE FORMAT:\n", | ||
" Return ONLY the label string from the allowed list.\n", | ||
" No punctuation, no extra words, no explanation, no JSON.\n", | ||
" \"\"\"\n", | ||
" resp = openai_client.chat.completions.create(\n", | ||
" model=\"gpt-4o-mini\",\n", | ||
" messages=[{\"role\": \"user\", \"content\": TASK_PROMPT}],\n", | ||
" temperature=0.9,\n", | ||
" )\n", | ||
" content = resp.choices[0].message.content.strip()\n", | ||
" return content" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "9c1e8957", | ||
"metadata": {}, | ||
"source": [ | ||
"### Running an Experiment\n", | ||
"We run an experiment on our dataset using the defined task. \n", | ||
"This produces a labeled set of outputs that we can compare against our expectations. \n", | ||
"Phoenix records: \n", | ||
"- inputs (customer reviews) \n", | ||
"- outputs (model classifications) \n", | ||
"- metadata (timing, tokens, cost, etc.) \n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "8c5afa1f", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"experiment = client.experiments.run_experiment(\n", | ||
" dataset=dataset, task=my_task, experiment_name=\"testing labels\", repetitions=3\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "cbb164c6", | ||
"metadata": {}, | ||
"source": [ | ||
"This is what your uploaded experiment will look like in the Phoenix UI! You can click through the arrows as you want to look through each of the repetitions\n", | ||
"\n", | ||
"<img alt=\"uploaded dataset image\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/phoenix-docs-images/repetitions_experiment_view.png\" width=\"900\"/>" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "bfa60a8c", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"language_info": { | ||
"name": "python" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
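Review note on comparing task outputs with the reference labels: the generation prompt emits lowercase labels such as `neutral`, while the task prompt asks the model to answer with `Neutral / Mixed`, so a raw string comparison would report a mismatch even when the classification is right. A minimal sketch of a normalizing comparison follows. `ALIASES`, `normalize_label`, and `label_matches` are hypothetical helpers for illustration only; they are not part of the notebook or of Phoenix.

```python
# Hypothetical helpers for illustration; not part of the notebook or Phoenix.
# Maps spelling variants seen in the two prompts onto one canonical label.
ALIASES = {
    "neutral / mixed": "neutral",
    "neutral/mixed": "neutral",
    "highly negative": "highly negative & non-constructive",
}


def normalize_label(label: str) -> str:
    """Lowercase, collapse whitespace, trim stray punctuation, apply aliases."""
    cleaned = " ".join(label.lower().split()).strip(" .\"'")
    return ALIASES.get(cleaned, cleaned)


def label_matches(output: str, expected: dict) -> bool:
    """Compare a task output with the reference label stored under 'label'."""
    return normalize_label(output) == normalize_label(expected["label"])


print(label_matches("Neutral / Mixed", {"label": "neutral"}))
```

A function of this shape might also serve as an experiment evaluator, assuming Phoenix binds evaluator parameters named `output` and `expected`; that binding is an assumption here and should be checked against the Phoenix experiments documentation for the installed version.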
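Review note on reading the repetitions: with `repetitions=3`, each review receives three classifications, and the interesting question is how often they agree. The sketch below computes a majority label and an agreement ratio per review. It assumes the repeated outputs have already been collected into plain Python lists; `summarize_repetitions` is a hypothetical helper and the `runs` dictionary is made-up illustration data, not output from this experiment.

```python
from collections import Counter


def summarize_repetitions(outputs):
    """Return (majority_label, agreement) for one example's repeated outputs.

    agreement is the share of repetitions that produced the majority label.
    """
    if not outputs:
        raise ValueError("need at least one output")
    counts = Counter(label.strip().lower() for label in outputs)
    majority, votes = counts.most_common(1)[0]
    return majority, votes / len(outputs)


# Made-up illustration data: three repetitions for each of two reviews.
runs = {
    "review-1": ["Off-topic", "Off-topic", "Off-topic"],
    "review-2": ["Neutral / Mixed", "Negative but Actionable", "Neutral / Mixed"],
}

for review_id, outputs in runs.items():
    majority, agreement = summarize_repetitions(outputs)
    print(f"{review_id}: {majority} (agreement {agreement:.2f})")
```

Reviews with low agreement are good candidates for a closer look: either the review is genuinely ambiguous, or the label definitions in the task prompt need tightening.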