diff --git a/README.md b/README.md
new file mode 100644
index 0000000..eb3099a
--- /dev/null
+++ b/README.md
@@ -0,0 +1,96 @@
+# Web Search Unit Test Tool
+This repo is an open-source LLM web search unit test. It was originally made at Hume AI, and all Hume-related identifying information has been redacted and replaced.
+
+## Overview
+This tool is designed to evaluate the performance of various LLMs in determining when to perform web searches based on user questions. It currently supports testing models from both the OpenAI and Anthropic APIs. The main objectives are:
+
+1. Run web search on a series of 100+ high-quality test cases and generate metrics like precision & recall.
+2. Improve web search precision/recall by at least 10 percentage points through modifications to the prompt, tool description, and/or other variables.
+3. Create an extensible framework that can be built upon for other evaluations and continued improvement of web search evaluations.
+
+## How It Works
+
+The web search unit test function follows these steps:
+
+1. Loads a dataset of user questions from real conversational logs.
+2. Sends these questions to the OpenAI/Anthropic API (configurable).
+3. Checks if a web search query has been performed in response to each question.
+4. Reports the percentage of user questions which resulted in a successful web search query, along with other relevant metrics.
+
+This project is structured into two main components:
+
+1. `chat_ranker`: Converts raw user chat log data into a CSV with LLM scores of various criteria.
+2. `web_search`: Conducts unit tests given different models, system prompts, and scoring criteria.
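+
+Each test case lands in one of four buckets (TP, FP, TN, FN) depending on whether a search was expected and whether one was performed. The reported metrics then follow the standard definitions; here is a minimal sketch of the arithmetic, mirroring `calculate_stats` in `web_search`:
+
+```
+def metrics(tp: int, fp: int, tn: int, fn: int):
+    total = tp + fp + tn + fn
+    precision = tp / (tp + fp) if tp + fp else 0.0  # of searches performed, how many were wanted
+    recall = tp / (tp + fn) if tp + fn else 0.0     # of searches wanted, how many were performed
+    success_rate = (tp + tn) / total if total else 0.0
+    return precision, recall, success_rate
+
+metrics(8, 2, 5, 1)  # -> (0.8, 0.888..., 0.8125)
+```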
+
+## How to Use
+
+### Setup
+
+1. Clone the repository:
+   ```
+   git clone https://github.com/menhguin/llm-benchmarks.git
+   cd llm-benchmarks
+   ```
+
+2. Install the required dependencies (mainly just OpenAI, Anthropic and python-dotenv):
+   ```
+   pip install -r requirements.txt
+   ```
+
+3. Set up your API keys:
+   Create a `.env` file in the root directory and add your OpenAI and Anthropic API keys:
+   ```
+   OPENAI_API_KEY=your_openai_api_key
+   ANTHROPIC_API_KEY=your_anthropic_api_key
+   ```
+
+### Running the Test
+
+1. To run the web search unit test:
+   ```
+   python web_search/test_web_search.py
+   ```
+
+2. To add a new dataset:
+   - Add a new JSON file with the same format as the current one. The loader currently picks up whichever file in the project folder is named `live_test_cases.json`. Note that this unit test natively supports multi-turn message log input via both the Anthropic and OpenAI APIs.
+   - Run `python chat_ranker/test_case_ranker.py`.
+
+3. To add a new model or system prompt:
+   - Edit `web_search/test_system_prompts.csv`.
+   - Run `python web_search/test_web_search.py` and select the new model/prompt when prompted.
+
+## Field Descriptions
+
+### Input CSV Fields
+
+1. `web_search_expected` (boolean): Indicates whether a web search is expected for this test case.
+2. `follow_up_expected` (boolean): Indicates whether a follow-up question is expected for this test case.
+3. `Score` (float): A rating between -5 and 5 indicating the likelihood of triggering a web search.
+4. `triggering_user_message` (string): The user's message being evaluated for potential web search triggering.
+5. `context` (string): Previous conversation context, if any.
+6. `function_call_text` (string): The specific text of the function call, if applicable.
+
+### Output CSV Fields
+
+1. `result` (string): Outcome of the test case (TP, FP, TN, FN).
+2. `score` (float): Same score as in the input CSV.
+3. `web_search_expected` (boolean): Whether a web search was expected.
+4. `web_search_performed` (boolean): Whether a web search was actually performed.
+5. `test_question` (string): The user's message that was tested.
+6. `api_response` (string): The response received from the API.
+7. `search_query` (string): The actual search query used if a web search was performed.
+8. `context` (string): The conversation context provided.
+9. `model` (string): The name of the model used for testing.
+10. `api_source` (string): The API provider used.
+11. `comments` (string): Any additional comments or notes about the test case.
+
+## Todo List:
+
+- **Follow-up questions** - Certain queries involve or require follow-up questions, and I am still working out how to account for this. It matters for maximum granularity because recent system prompts are more likely to ask a follow-up question, which currently lowers their measured precision.
+- **Performance and modularisation** - With the newest addition of follow-up questions, the eval takes about 2 minutes. This is substantially slower than the ~10 seconds it used to be (the ranker is still about 10 seconds), so some parts likely need speeding up.
+- **Manual review of test cases** - affects the edge-case accuracy of the eval.
+- **Refactoring for better handling of column names for future evals:** Currently the column/entry fields are hardcoded to suit the web search eval. It works fine if you keep the column names the same (or change them consistently throughout) and only swap the system prompt, but someone should eventually make them dynamic/input-based; see the sketch after this list.
+- **Figure out how to get real false negatives**
+- **Include other fields:** Filter by time/model, lag time, emotional data, qualitatively parsing actual query results
+- **Comparison of different settings/prompts and listing examples where the different settings resulted in different responses**
+- **Documentation**
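+
+For the column-name refactor above, one possible direction is to keep the eval-specific CSV column names in a single mapping and read rows through it. A minimal sketch (the names here are hypothetical, not part of the current code):
+
+```
+import csv
+
+# Hypothetical single source of truth for an eval's column names.
+COLUMNS = {
+    "expected": "web_search_expected",
+    "follow_up": "follow_up_expected",
+    "score": "Score",
+    "question": "triggering_user_message",
+    "context": "context",
+}
+
+def read_cases(path):
+    # Rows come back keyed by the eval-agnostic names on the left.
+    with open(path, newline="", encoding="utf-8") as f:
+        return [{key: row.get(col, "") for key, col in COLUMNS.items()}
+                for row in csv.DictReader(f)]
+```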
diff --git a/chat_ranker/requirements.txt b/chat_ranker/requirements.txt
new file mode 100644
index 0000000..2530409
--- /dev/null
+++ b/chat_ranker/requirements.txt
@@ -0,0 +1,4 @@
+#versions as of 06/25/2024
+anthropic
+openai
+python-dotenv
diff --git a/chat_ranker/test_case_ranker.py b/chat_ranker/test_case_ranker.py
new file mode 100644
index 0000000..3ddbe6e
--- /dev/null
+++ b/chat_ranker/test_case_ranker.py
@@ -0,0 +1,379 @@
+import anthropic
+from openai import AsyncOpenAI
+from dotenv import load_dotenv
+import csv
+import os
+import sys
+import json
+import asyncio
+from typing import List, Dict, Tuple, Any
+from datetime import datetime
+
+# Load environment variables from .env file
+load_dotenv()
+
+# Global variable for printing API response transcripts
+print_responses = True
+
+# Initialize API clients with error handling
+async def initialize_clients():
+    try:
+        openai_client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
+    except KeyError:
+        print("Error: OPENAI_API_KEY not found in environment variables.")
+        print("Please make sure you have set the OPENAI_API_KEY in your .env file or environment.")
+        sys.exit(1)
+
+    try:
+        anthropic_client = anthropic.AsyncAnthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
+    except KeyError:
+        print("Warning: ANTHROPIC_API_KEY not found in environment variables.")
+        print("Anthropic models will not be available for testing.")
+        anthropic_client = None
+    except Exception as e:
+        print(f"Error initializing Anthropic client: {str(e)}")
+        anthropic_client = None
+
+    return openai_client, anthropic_client
+
+DEFAULT_MODEL = "gpt-3.5-turbo"
+DEFAULT_SYSTEM_PROMPT = "Analyze user messages for web search relevance."
+DEFAULT_API = "OpenAI"
+
+def get_available_projects() -> List[str]:
+    return [folder for folder in os.listdir() if os.path.isdir(folder) and os.path.exists(os.path.join(folder, "test_system_prompts.csv"))]
+
+def select_project(projects: List[str]) -> str:
+    print("Available projects:")
+    for i, project in enumerate(projects, 1):
+        print(f"{i}. {project}")
+
+    while True:
+        try:
+            choice = input("Enter the number of the destination folder for output test cases: ")
+            index = int(choice) - 1
+            if 0 <= index < len(projects):
+                return projects[index]
+            else:
+                print("Invalid selection. Please try again.")
+        except ValueError:
+            print("Invalid input. Please enter a number.")
+
+async def load_available_models() -> Dict[str, Dict[str, str]]:
+    available_models = {}
+    try:
+        with open("chat_ranker/test_system_prompts.csv", "r", newline='') as f:
+            csv_reader = csv.DictReader(f)
+            for row in csv_reader:
+                test_model = row.get("test_model", "").strip()
+                test_system_prompt = row.get("test_system_prompt", "").strip()
+                api_source = row.get("api_source", "").strip()
+                comments = row.get("comments", "").strip()
+                if test_model and test_system_prompt and api_source:
+                    available_models[test_model] = {
+                        "prompt": test_system_prompt,
+                        "api": api_source,
+                        "comments": comments
+                    }
+        if not available_models:
+            raise Exception("No valid models found in CSV")
+    except Exception as e:
+        print(f"Error: Could not load test model and system prompt from CSV. Loading default model with default system prompt. Error: {str(e)}")
+        available_models[DEFAULT_MODEL] = {
+            "prompt": DEFAULT_SYSTEM_PROMPT,
+            "api": DEFAULT_API,
+            "comments": ""
+        }
+    return available_models
+
+def set_print_responses():
+    global print_responses
+    choice = input("Would you like to print API response transcripts? (y/n): ").lower()
+    print_responses = choice == 'y'
+
+async def load_queries(project: str) -> List[Dict]:
+    queries = []
+    try:
+        with open(f"{project}/live_test_cases.json", "r") as f:
+            data = json.load(f)
+
+        total_examples = len(data)
+        filtered_data = []
+
+        no_user_message = 0
+        baymax_health_database = 0
+        valid_queries = 0
+
+        for row in data:
+            triggering_user_message = row.get('triggering_user_message', '')
+            prior_messages = row.get('prior_messages', [])
+
+            # Check for Baymax fetching info from the health database. You can
+            # delete this; it is just here to show a filtering example.
+ prior_messages_text = ' '.join(str(msg.get('message_text', '')) for msg in prior_messages) + if "baymax" in prior_messages_text.lower() and "health database" in prior_messages_text.lower(): + baymax_health_database += 1 + continue + + if not triggering_user_message: + no_user_message += 1 + continue + + # Filter out SYSTEM_PROMPT messages and create context + context = "\n".join([ + f"{msg['event_type']}: {msg.get('message_text', '')}" + for msg in prior_messages + if msg['event_type'] != 'SYSTEM_PROMPT' and msg.get('message_text') + ]) + + filtered_data.append({ + "user_message": triggering_user_message, + "context": context, + "function_call_text": row.get('function_call_text', '') + }) + valid_queries += 1 + + queries = filtered_data + + print(f"Total examples in JSON: {total_examples}") + print(f"Examples with no user message: {no_user_message}") + print(f"Examples with 'Baymax' and 'health database': {baymax_health_database}") + print(f"Valid queries after filtering: {valid_queries}") + + if not queries: + raise Exception("No valid queries found in JSON after filtering") + except Exception as e: + print(f"Error: Could not load queries from JSON. Error: {str(e)}") + print("Fallback to default query") + queries = [{"user_message": "Default query for testing", "context": "", "function_call_text": ""}] # Fallback to a default query + + return queries + +def inspect_json(project: str, num_entries: int = 5): + try: + with open(f"{project}/live_test_cases.json", "r") as f: + data = json.load(f) + + print(f"\nInspecting first {num_entries} entries of the JSON file:") + for i, entry in enumerate(data[:num_entries], 1): + print(f"\nEntry {i}:") + print(f"Triggering user message: {entry.get('triggering_user_message', 'Not found')}") + print("Prior messages:") + for msg in entry.get('prior_messages', [])[:5]: # Show first 5 prior messages + print(f" - {msg.get('event_type')}: {msg.get('message_text', '')[:50]}...") + print("..." if len(entry.get('prior_messages', [])) > 5 else "") + print(f"Function call text: {entry.get('function_call_text', 'Not found')[:100]}...") + except Exception as e: + print(f"Error inspecting JSON: {str(e)}") + +def select_models(available_models: Dict[str, Dict[str, str]]) -> List[str]: + global print_responses + print(f"\nAPI response transcripts will {'' if print_responses else 'not '}be printed. To toggle this setting, type 'print'.") + models = list(available_models.keys()) + print("Available models:") + for i, model in enumerate(models, 1): + api = available_models[model]["api"] + comments = available_models[model]["comments"] + comment_str = f" - {comments}" if comments else "" + print(f"{i}. {model} ({api}){comment_str}") + + while True: + try: + choice = input("Enter the number(s) of the model(s) you want to test (comma-separated), \n'all' for all models: ") + if choice.lower() == 'print': + print_responses = not print_responses + print(f"API responses will {'' if print_responses else 'not '}be printed.") + continue + elif choice.lower() == 'all': + return models + + selected_indices = [int(idx.strip()) - 1 for idx in choice.split(',')] + selected_models = [models[i] for i in selected_indices if 0 <= i < len(models)] + + if not selected_models: + print("No valid models selected. Please try again.") + else: + return selected_models + except ValueError: + print("Invalid input. 
Please enter number(s), 'all', or 'print'.")
+
+def parse_context(context: str) -> List[Dict[str, str]]:
+    messages = []
+    if context:
+        context_messages = context.split("\n")
+        last_role = None
+        for msg in context_messages:
+            if msg.strip():
+                parts = msg.strip().split(":", 1)
+                if len(parts) == 2:
+                    role, content = parts
+                    if role.upper() == "USER_MESSAGE":
+                        api_role = "user"
+                    elif role.upper() == "AGENT_MESSAGE":
+                        api_role = "assistant"
+                    else:
+                        continue  # Skip invalid roles
+
+                    # If the current role is the same as the last one, combine the messages
+                    if api_role == last_role and messages:
+                        messages[-1]["content"] += f" {content.strip()}"
+                    else:
+                        messages.append({"role": api_role, "content": content.strip()})
+                    last_role = api_role
+    return messages
+
+async def get_completion_openai(client, model: str, system_prompt: str, user_message: str, context: str) -> Any:
+    formatted_system_prompt = system_prompt.format(context=context, query=user_message)
+    messages = [
+        {"role": "system", "content": formatted_system_prompt},
+    ]
+
+    context_messages = parse_context(context)
+
+    # Ensure the messages alternate correctly
+    if context_messages:
+        if context_messages[0]["role"] == "assistant":
+            context_messages.insert(0, {"role": "user", "content": "Start of conversation"})
+        messages.extend(context_messages)
+
+    # Add the current user message
+    if messages[-1]["role"] == "user":
+        messages[-1]["content"] += f" {user_message}"
+    else:
+        messages.append({"role": "user", "content": user_message})
+
+    response = await client.chat.completions.create(
+        model=model,
+        messages=messages,
+    )
+    return response
+
+async def get_completion_anthropic(client, model: str, system_prompt: str, user_message: str, context: str) -> Any:
+    formatted_system_prompt = system_prompt.format(context=context, query=user_message)
+
+    context_messages = parse_context(context)
+
+    # Ensure the messages alternate correctly
+    if context_messages:
+        if context_messages[0]["role"] == "assistant":
+            context_messages.insert(0, {"role": "user", "content": "Start of conversation"})
+
+    # Add the current user message
+    if context_messages and context_messages[-1]["role"] == "user":
+        context_messages[-1]["content"] += f" {user_message}"
+    else:
+        context_messages.append({"role": "user", "content": user_message})
+
+    response = await client.messages.create(
+        model=model,
+        max_tokens=1024,
+        system=formatted_system_prompt,
+        messages=context_messages,
+    )
+    return response
+
+def extract_score_and_expectations(content: str) -> Tuple[float, bool, bool]:
+    lines = content.split('\n')
+    score = 0.0
+    web_search_expected = False
+    follow_up_expected = False
+    for line in lines:
+        if line.lower().startswith("web search expected:"):
+            web_search_expected = line.split(':')[1].strip().lower() == 'true'
+        elif line.lower().startswith("follow-up expected:"):
+            follow_up_expected = line.split(':')[1].strip().lower() == 'true'
+        elif line.lower().startswith("rating:") or line.lower().startswith("score:"):
+            try:
+                score = float(line.split(':')[1].strip().split()[0])
+            except ValueError:
+                score = 0.0
+    return score, web_search_expected, follow_up_expected
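+
+# Illustrative example (not from a real run): a response following the
+# ranker prompt's requested format parses like this:
+#
+#     sample = """Web Search Expected: true
+#     Follow-up Expected: false
+#     Rating: 4
+#     Explanation: Asks for current, location-specific information."""
+#
+#     extract_score_and_expectations(sample)  # -> (4.0, True, False)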
+
+async def process_query(openai_client, anthropic_client, test_model: str, test_system_prompt: str, api_source: str, query: Dict) -> Tuple[str, float, bool, bool, str, str]:
+    # Pull fields out before the try block so the error path below can still
+    # reference them if the API call fails.
+    user_message = query["user_message"]
+    context = query["context"]
+    function_call_text = query["function_call_text"]
+    try:
+        if api_source == "OpenAI":
+            response = await get_completion_openai(openai_client, test_model, test_system_prompt, user_message, context)
+            content = response.choices[0].message.content
+        elif api_source == "Anthropic":
+            response = await get_completion_anthropic(anthropic_client, test_model, test_system_prompt, user_message, context)
+            content = response.content[0].text
+        else:
+            raise ValueError(f"Unsupported API source: {api_source}")
+
+        score, web_search_expected, follow_up_expected = extract_score_and_expectations(content)
+
+        if score == 0 and not web_search_expected and not follow_up_expected:
+            print(f"Warning: Unexpected result for query: {user_message}")
+            print(f"API Response: {content}")
+
+        return user_message, score, web_search_expected, follow_up_expected, context, function_call_text
+    except Exception as e:
+        print(f"Error processing query: {user_message}")
+        print(f"Error details: {str(e)}")
+        return user_message, 0.0, False, False, context, ""
+
+async def analyze_queries(openai_client, anthropic_client, test_model: str, test_system_prompt: str, api_source: str, queries: List[Dict]) -> List[Tuple[str, float, bool, bool, str, str]]:
+    tasks = [process_query(openai_client, anthropic_client, test_model, test_system_prompt, api_source, query) for query in queries]
+    scored_queries = await asyncio.gather(*tasks)
+
+    # Sort queries by score in descending order
+    scored_queries.sort(key=lambda x: x[1], reverse=True)
+
+    return scored_queries
+
+def save_to_csv(scored_queries: List[Tuple[str, float, bool, bool, str, str]], output_file: str):
+    os.makedirs(os.path.dirname(output_file), exist_ok=True)
+    with open(output_file, 'w', newline='', encoding='utf-8') as csvfile:
+        writer = csv.writer(csvfile)
+        writer.writerow(['web_search_expected', 'follow_up_expected', 'Score', 'triggering_user_message', 'context', 'function_call_text'])
+        for query, score, web_search_expected, follow_up_expected, context, function_call_text in scored_queries:
+            writer.writerow([web_search_expected, follow_up_expected, f"{score:.2f}", query, context, function_call_text])
+    print(f"Results saved to {output_file}")
+
+def get_timestamp():
+    return datetime.now().strftime("%Y%m%d_%H%M%S")
+
+async def main():
+    projects = get_available_projects()
+    selected_project = select_project(projects)
+
+    set_print_responses()
+    available_models = await load_available_models()
+    selected_models = select_models(available_models)
+
+    # Load queries from JSON
+    queries = await load_queries(selected_project)
+
+    openai_client, anthropic_client = await initialize_clients()
+
+    timestamp = get_timestamp()
+
+    for test_model in selected_models:
+        test_system_prompt = available_models[test_model]["prompt"]
+        api_source = available_models[test_model]["api"]
+
+        if api_source == "OpenAI" and openai_client is None:
+            print(f"Skipping test for {test_model} as OpenAI client is not initialized.")
+            continue
+        if api_source == "Anthropic" and anthropic_client is None:
+            print(f"Skipping test for {test_model} as Anthropic client is not initialized.")
+            continue
+
+        print(f"\nAnalyzing queries using model {test_model} ({api_source}):")
+        scored_queries = await analyze_queries(openai_client, anthropic_client, test_model, test_system_prompt, api_source, queries)
+
+        # Save results to CSV with timestamp and project name
+        output_file = f"{selected_project}/test_cases/{selected_project}_test_cases_{timestamp}.csv"
+        save_to_csv(scored_queries, output_file)
+
+        print("\nAll queries sorted by score (highest to lowest):")
+        for query, score, web_search_expected, follow_up_expected, context, function_call_text in
scored_queries: + print(f"Web_search_expected: {web_search_expected}, Follow_up_expected: {follow_up_expected}, Score: {score:.2f}, Triggering User Message: {query}") + print(f"Context:\n{context}") + print(f"Function call text: {function_call_text}\n") + +if __name__ == "__main__": + asyncio.run(main()) \ No newline at end of file diff --git a/chat_ranker/test_system_prompts.csv b/chat_ranker/test_system_prompts.csv new file mode 100644 index 0000000..c33c92b --- /dev/null +++ b/chat_ranker/test_system_prompts.csv @@ -0,0 +1,37 @@ +test_model,api_source,test_system_prompt +claude-3-5-sonnet-20240620,Anthropic,"You are Baymax, a friendly healthcare companion developed by Hamada Health, a biotech startup dedicated to revolutionizing personal healthcare. You are evaluating when to trigger web search function calls or ask follow-up questions in response to user queries. Analyze the following conversation context and the latest user message: + +{context} + +Latest user message: ""{query}"" + +Consider these aspects: +1. Is the latest message or overall conversation context vague or requiring additional information? +2. Does it require understanding social cues or context to determine if a web search or follow-up question is necessary? +3. Is there incomplete or unclear information that might require a web search or follow-up question to provide a helpful response? +4. How well does this conversation represent real-world user interactions? + +Keep in mind: +- Users often speak casually and may not use complete sentences. +- Messages can be short, vague, or require inference of user needs. +- Some messages may indirectly reference contextual details from previous turns. +- The goal is to determine if a web search would be necessary to provide a helpful response, versus relying on the internal knowledge of a large language model or asking a follow-up question. +- Questions highly reliant on realtime, recent or location-specific information should trigger a web search, as it is otherwise impossible to answer them accurately. +- You can err slightly, within reason, on the side of searching more to give users comprehensive and helpful advice. This is especially the case for important questions requiring sensitive and accurate answers, such as legal advice or pertaining to physical and mental health. +- Questions about Hamada Health should generally not trigger a web search, as this information is already known by the model. +- If the query is vague or lacks context, consider whether a follow-up question would be more appropriate than a web search. + +Rate the likelihood that this conversation should trigger a web search on a scale of -5 to 5, where: +5 - Definitely requires a web search +0 - Unclear, could go either way +-5 - Definitely does not require a web search + +Provide your rating, a boolean indicating whether a web search is expected, a boolean indicating whether a follow-up question is expected, and a brief explanation for your decision. + +Your response should be in the following format: +Web Search Expected: [true or false] +Follow-up Expected: [true or false] +Rating: [Your rating between -5 and 5] +Explanation: [Your brief explanation] + +Note: Even if the rating is 0, you must still decide whether a web search is expected (true or false) and whether a follow-up question is expected (true or false). Make your best judgment based on the information provided." 
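To add another ranker model, append a row to this CSV; the script also reads an optional `comments` column if one is present. A hypothetical, abridged example row, with the prompt double-quoted so its embedded commas and newlines survive CSV parsing, and the `{context}`/`{query}` placeholders kept for the script to format in:

```
gpt-4o,OpenAI,"You are Baymax, a friendly healthcare companion... Analyze the following conversation context and the latest user message: {context} Latest user message: ""{query}"" ... Rating: [Your rating between -5 and 5] ..."
```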
diff --git a/web_search/UNSTABLE_test_web_search_follow_up.py b/web_search/UNSTABLE_test_web_search_follow_up.py
new file mode 100644
index 0000000..2d1436a
--- /dev/null
+++ b/web_search/UNSTABLE_test_web_search_follow_up.py
@@ -0,0 +1,576 @@
+import anthropic
+from openai import OpenAI
+from dotenv import load_dotenv
+import csv
+import json  # used by extract_search_query to parse tool call arguments
+import os
+import sys
+import asyncio
+import glob
+from datetime import datetime
+from typing import List, Dict, Any, Tuple, NamedTuple
+from dataclasses import dataclass
+
+# Load environment variables from .env file
+load_dotenv()
+
+# Global variable for printing API response transcripts
+print_responses = True
+
+# Constants
+DEFAULT_MODEL = "gpt-3.5-turbo"
+DEFAULT_SYSTEM_PROMPT = "Search the internet for up-to-date information."
+DEFAULT_API = "OpenAI"
+
+@dataclass
+class TestCase:
+    web_search_expected: bool
+    follow_up_expected: bool
+    score: float
+    test_question: str
+    context: str
+
+class DetailedTestResult(NamedTuple):
+    model: str
+    api_source: str
+    precision: float
+    recall: float
+    success_rate: float
+    true_positives: int
+    false_positives: int
+    true_negatives: int
+    false_negatives: int
+    total_cases: int
+    comments: str
+    score_level_stats: Dict[int, Dict[str, Any]]
+
+# Initialize API clients with error handling
+try:
+    openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
+except KeyError:
+    print("Error: OPENAI_API_KEY not found in environment variables.")
+    print("Please make sure you have set the OPENAI_API_KEY in your .env file or environment.")
+    sys.exit(1)
+
+try:
+    anthropic_client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
+except KeyError:
+    print("Warning: ANTHROPIC_API_KEY not found in environment variables.")
+    print("Anthropic models will not be available for testing.")
+    anthropic_client = None
+except Exception as e:
+    print(f"Error initializing Anthropic client: {str(e)}")
+    anthropic_client = None
+
+# Web search tools (would move to a separate configuration file in a real-world scenario)
+web_search_tool = {
+    "type": "function",
+    "function": {
+        "name": "web_search",
+        "description": "Intelligently assess conversation context to determine if a web search is necessary. Whenever a user has a question that would benefit from searching the web, always use the `web_search` tool. Trigger searches for queries involving current events, real-time info, location-specific data, or topics requiring up-to-date facts, or if explicitly asked to search. Consider the user's implicit needs, accounting for casual or incomplete language. Avoid searching for things you already know, e.g. info about Hume. Do not unnecessarily say things before running the search. Examples of queries **to use web search for**: - 'What's the weather in Tokyo?' 'What's NVIDIA's stock price?' 'What time is it?' 'When's the next solar eclipse?' (real-time info) - 'How was the recent presidential debate?' 'What's happening in fusion research?' 'What's the price range for TVs in 2024' (news and up-to-date factual info) - 'Who is Alan Cowen?' (info about a specific person) - 'Can I pause EVI responses when using the API?' (technical info about EVI - run a query on our documentation site `dev.hume.ai`). Examples of queries to **NOT** use web search for: - 'What is Hume AI?' 'How does EVI work?' (non-public info about Hume) - 'How to politely end a conversation' 'What's the capital of France' (general knowledge) - 'Explain the concept of supply and demand' 'What's the meaning of life' (concepts).
If you are uncertain, ask the user to clarify if you should search the web, and then run the web search if they confirm.", + "parameters": { + "type": "object", + "properties": { + "query": { + "type": "string", + "description": "The search query.", + }, + }, + "required": ["query"], + }, + }, +} + +web_search_tool_anthropic = { + "name": "web_search", + "description": "Dynamically assess queries and context to determine search necessity. Trigger for: 1) Current events, breaking news; 2) Time-sensitive info; 3) Location-specific data; 4) Topics needing up-to-date facts; 5) Complex questions beyond general knowledge; 6) Health, legal, financial matters needing accuracy. Consider implicit needs in casual language and incomplete queries. Prioritize critical topics where outdated info could be harmful. Avoid searching general knowledge or Hume AI info. When uncertain, lean towards searching for comprehensive responses.", + "input_schema": { + "type": "object", + "properties": { + "query": { + "type": "string", + "description": "The search query.", + } + }, + "required": ["query"] + } +} + +def load_available_models() -> Dict[str, List[Dict[str, str]]]: + available_models = {} + try: + with open("web_search/test_system_prompts.csv", "r", newline='') as f: + csv_reader = csv.DictReader(f) + for row in csv_reader: + test_model = row.get("test_model", "").strip() + test_system_prompt = row.get("test_system_prompt", "").strip() + api_source = row.get("api_source", "").strip() + comments = row.get("comments", "").strip() + if test_model and test_system_prompt and api_source: + if test_model not in available_models: + available_models[test_model] = [] + available_models[test_model].append({ + "prompt": test_system_prompt, + "api": api_source, + "comments": comments + }) + if not available_models: + raise Exception("No valid models found in CSV") + except Exception as e: + print(f"Error: Could not load test model and system prompt from CSV. Loading default model with default system prompt. Error: {str(e)}") + available_models[DEFAULT_MODEL] = [{ + "prompt": DEFAULT_SYSTEM_PROMPT, + "api": DEFAULT_API, + "comments": "" + }] + return available_models + +def set_print_responses(): + global print_responses + choice = input("Would you like to print API response transcripts? 
(y/n): ").lower() + print_responses = choice == 'y' + +# Update the load_test_cases function +def load_test_cases() -> List[TestCase]: + test_cases = [] + test_cases_dir = "web_search/test_cases" + list_of_files = glob.glob(f"{test_cases_dir}/web_search_test_cases_*.csv") + if not list_of_files: + raise FileNotFoundError("No test cases files found in the specified directory.") + latest_file = max(list_of_files, key=os.path.getctime) + + with open(latest_file, "r", newline='') as f: + csv_reader = csv.DictReader(f) + for row in csv_reader: + try: + web_search_expected = row.get("web_search_expected", "").lower() == "true" + follow_up_expected = row.get("follow_up_expected", "").lower() == "true" + score = float(row.get("Score", 0.0)) + test_question = row["triggering_user_message"].strip() + context = row.get("context", "") + test_cases.append(TestCase( + web_search_expected=web_search_expected, + follow_up_expected=follow_up_expected, + score=score, + test_question=test_question, + context=context + )) + except Exception as e: + print(f"Error parsing row: {row}") + print(f"Error details: {str(e)}") + + return test_cases + +def select_models(available_models: Dict[str, List[Dict[str, str]]]) -> List[Tuple[str, Dict[str, str]]]: + global print_responses + print(f"\nAPI response transcripts will {'' if print_responses else 'not '}be printed. To toggle this setting, type 'print'.") + print("Currently loading the most recent test_cases file. To load different test_cases, type 'load {file path}'") + + model_prompt_combinations = [] + for model, prompts in available_models.items(): + for i, prompt_dict in enumerate(prompts): + api = prompt_dict["api"] + comments = prompt_dict["comments"] + comment_str = f" - {comments}" if comments else "" + model_prompt_combinations.append((model, prompt_dict)) + print(f"{len(model_prompt_combinations)}. {model} ({api}) - Prompt {i+1} {comment_str}") + + while True: + try: + choice = input("Enter the number(s) of the model-prompt combination(s) you want to test (comma-separated), \n'all' for all combinations, or 'load' to change test cases: ") + if choice.lower() == 'print': + print_responses = not print_responses + print(f"API responses will {'' if print_responses else 'not '}be printed.") + continue + elif choice.lower().startswith('load '): + new_file_path = choice.split(' ', 1)[1] + try: + global load_test_cases + def load_test_cases(): + return load_specific_test_cases(new_file_path) + print(f"Test cases will now be loaded from: {new_file_path}") + continue + except Exception as e: + print(f"Error loading file: {str(e)}") + continue + elif choice.lower() == 'all': + return model_prompt_combinations + + selected_indices = [int(idx.strip()) - 1 for idx in choice.split(',')] + selected_combinations = [model_prompt_combinations[i] for i in selected_indices if 0 <= i < len(model_prompt_combinations)] + + if not selected_combinations: + print("No valid model-prompt combinations selected. Please try again.") + else: + return selected_combinations + except ValueError: + print("Invalid input. 
Please enter number(s), 'all', 'load {file path}', or 'print'.")
+
+def load_specific_test_cases(file_path: str) -> List[TestCase]:
+    test_cases = []
+    with open(file_path, "r", newline='') as f:
+        csv_reader = csv.DictReader(f)
+        for row in csv_reader:
+            # Supply every TestCase field; the dataclass defines no defaults.
+            test_cases.append(TestCase(
+                web_search_expected=row.get("web_search_expected", "").lower() == "true",
+                follow_up_expected=row.get("follow_up_expected", "").lower() == "true",
+                score=float(row.get("score") or 0.0),
+                test_question=row["test_question"],
+                context=row.get("context", ""),
+            ))
+    return test_cases
+
+def parse_context(context: str) -> List[Dict[str, str]]:
+    # Context rows look like "USER_MESSAGE: text" lines joined by newlines
+    # (the format chat_ranker writes), while check_follow_up_performed appends
+    # "USER:" / "ASSISTANT:" lines, so both label styles are mapped here.
+    messages = []
+    if context:
+        last_role = None
+        for msg in context.split("\n"):
+            if msg.strip():
+                parts = msg.strip().split(":", 1)
+                if len(parts) == 2:
+                    role, content = parts
+                    role = role.strip().upper()
+                    if role in ["USER_MESSAGE", "USER"]:
+                        api_role = "user"
+                    elif role in ["AGENT_MESSAGE", "AGENT", "ASSISTANT", "AI"]:
+                        api_role = "assistant"
+                    else:
+                        continue  # Skip system prompts and unlabeled lines
+                    # Merge consecutive same-role messages so the Anthropic
+                    # API's strict user/assistant alternation is preserved.
+                    if api_role == last_role and messages:
+                        messages[-1]["content"] += f" {content.strip()}"
+                    else:
+                        messages.append({"role": api_role, "content": content.strip()})
+                    last_role = api_role
+    return messages
+
+async def get_completion_openai(client: OpenAI, model: str, system_prompt: str, user_message: str, context: str) -> Any:
+    messages = [
+        {"role": "system", "content": system_prompt},
+    ]
+
+    # Parse and add context messages
+    messages.extend(parse_context(context))
+
+    messages.append({"role": "user", "content": user_message})
+
+    response = await asyncio.to_thread(
+        client.chat.completions.create,
+        model=model,
+        messages=messages,
+        tools=[web_search_tool],
+    )
+    if print_responses:
+        print("OpenAI API response:", response)
+    return response
+
+async def get_completion_anthropic(client: anthropic.Anthropic, model: str, system_prompt: str, user_message: str, context: str) -> Any:
+    messages = parse_context(context)
+
+    # The Anthropic API requires the first message to be from the user and
+    # roles to alternate, so pad/merge accordingly (same approach as chat_ranker).
+    if messages and messages[0]["role"] == "assistant":
+        messages.insert(0, {"role": "user", "content": "Start of conversation"})
+    if messages and messages[-1]["role"] == "user":
+        messages[-1]["content"] += f" {user_message}"
+    else:
+        messages.append({"role": "user", "content": user_message})
+
+    response = await asyncio.to_thread(
+        client.messages.create,
+        model=model,
+        max_tokens=1024,
+        temperature=1,
+        system=system_prompt,
+        tools=[web_search_tool_anthropic],
+        messages=messages,
+    )
+    if print_responses:
+        print("Anthropic API response:", response)
+    return response
+
+def evaluate_web_search(response: Any, api_source: str) -> bool:
+    web_search_performed = False
+    if api_source == "OpenAI":
+        for choice in response.choices:
+            if choice.message.tool_calls:
+                for tool_call in choice.message.tool_calls:
+                    if tool_call.function.name == 'web_search':
+                        web_search_performed = True
+                        break
+    elif api_source == "Anthropic":
+        # Iterate through all content blocks of the response
+        for block in response.content:
+            # Check if the block is of type 'tool_use' and if the tool used is 'web_search'
+            if block.type == 'tool_use' and block.name == 'web_search':
+                web_search_performed = True
+                break
+    return web_search_performed
+
+async def check_follow_up_performed(client: OpenAI | anthropic.Anthropic, model: str, test_question: str, context: str, response: str, api_source: str) -> bool:
+    system_prompt = "You are an impartial judge. Your task is to determine if the AI assistant performed a follow-up action or asked a follow-up question after the user's last message. Respond with 'Yes' or 'No' only."
+
+    user_prompt = "Based on the conversation history below, did the AI assistant perform a follow-up action or ask a follow-up question after the user's last message? Respond with 'Yes' or 'No' only."
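+
+    # Note: this is an LLM-as-judge check. The verdict below is an exact
+    # "yes"/"no" string match on the judge's reply, so any longer answer
+    # conservatively counts as no follow-up performed.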
+ + # Include the API response in the context + full_context = f"{context}\nUSER: {test_question}\nASSISTANT: {response}" + + if api_source == "OpenAI": + response = await get_completion_openai(client, model, system_prompt, user_prompt, full_context) + follow_up_performed = response.choices[0].message.content.strip().lower() == "yes" + elif api_source == "Anthropic": + response = await get_completion_anthropic(client, model, system_prompt, user_prompt, full_context) + follow_up_performed = response.content[0].text.strip().lower() == "yes" + else: + raise ValueError(f"Unsupported API source: {api_source}") + + return follow_up_performed + +async def process_test_case(openai_client: OpenAI, anthropic_client: anthropic.Anthropic, + test_model: str, test_system_prompt: str, api_source: str, + test_case: TestCase) -> Tuple[bool, bool, bool, bool, str, float, str, str]: + try: + if api_source == "OpenAI": + response = await get_completion_openai(openai_client, test_model, test_system_prompt, test_case.test_question, test_case.context) + message_text = response.choices[0].message.content if response.choices else "" + elif api_source == "Anthropic": + response = await get_completion_anthropic(anthropic_client, test_model, test_system_prompt, test_case.test_question, test_case.context) + message_text = "".join(block.text for block in response.content if block.type == 'text') + else: + raise ValueError(f"Unsupported API source: {api_source}") + + web_search_performed = evaluate_web_search(response, api_source) + search_query = extract_search_query(response, api_source) + + follow_up_performed = False + if not web_search_performed and test_case.web_search_expected: + follow_up_performed = await check_follow_up_performed( + openai_client if api_source == "OpenAI" else anthropic_client, + test_model, test_case.test_question, test_case.context, message_text, api_source + ) + + if print_responses: + print(f"API Response (Message Text): {message_text}") + print(f"Search Query: {search_query}") + print(f"Web Search Performed: {web_search_performed}") + print(f"Follow-up Expected: {test_case.follow_up_expected}") + print(f"Follow-up Performed: {follow_up_performed}") + + return web_search_performed, test_case.web_search_expected, test_case.follow_up_expected, follow_up_performed, test_case.test_question, test_case.score, message_text, search_query + except Exception as e: + print(f"Error processing question: {test_case.test_question}") + print(f"Error details: {str(e)}") + print(f"Context: {test_case.context}") + return False, test_case.web_search_expected, test_case.follow_up_expected, False, test_case.test_question, 0.0, "", "" + +def extract_search_query(response: Any, api_source: str) -> str: + if api_source == "OpenAI": + for choice in response.choices: + if choice.message.tool_calls: + for tool_call in choice.message.tool_calls: + if tool_call.function.name == 'web_search': + return json.loads(tool_call.function.arguments).get('query', '') + elif api_source == "Anthropic": + for block in response.content: + if block.type == 'tool_use' and block.name == 'web_search': + return block.input.get('query', '') + return "" + +def get_result(performed: bool, expected: bool, follow_up_expected: bool, follow_up_performed: bool) -> str: + if performed and expected: + return "TP" + elif performed and not expected: + return "FP" + elif not performed and not expected: + return "TN" + elif not performed and expected: + if follow_up_expected and follow_up_performed: + return "TP" # Reclassify as true positive + else: 
+ return "FN" + else: + return "FN" + +async def web_search_unit_test(openai_client: OpenAI, anthropic_client: anthropic.Anthropic, + test_model: str, test_system_prompt: str, api_source: str, comments: str) -> Tuple[DetailedTestResult, List[Tuple[bool, bool, bool, bool, str, float, str, str]]]: + test_cases = load_test_cases() + tasks = [process_test_case(openai_client, anthropic_client, test_model, test_system_prompt, api_source, test_case) + for test_case in test_cases] + results = await asyncio.gather(*tasks) + + score_level_stats = {score: {"TP": 0, "FP": 0, "TN": 0, "FN": 0, "total": 0} for score in range(-5, 6)} + + for performed, expected, follow_up_expected, follow_up_performed, _, score, _, _ in results: + rounded_score = round(score) + result = get_result(performed, expected, follow_up_expected, follow_up_performed) + score_level_stats[rounded_score][result] += 1 + score_level_stats[rounded_score]["total"] += 1 + + true_positives = sum(stats["TP"] for stats in score_level_stats.values()) + false_positives = sum(stats["FP"] for stats in score_level_stats.values()) + true_negatives = sum(stats["TN"] for stats in score_level_stats.values()) + false_negatives = sum(stats["FN"] for stats in score_level_stats.values()) + + total_cases = len(test_cases) + precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0 + recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0 + success_rate = (true_positives + true_negatives) / total_cases + + for score, stats in score_level_stats.items(): + tp, fp, tn, fn = stats['TP'], stats['FP'], stats['TN'], stats['FN'] + stats['precision'], stats['recall'], stats['success_rate'] = calculate_stats(tp, fp, tn, fn) + + detailed_result = DetailedTestResult( + model=test_model, + api_source=api_source, + precision=precision, + recall=recall, + success_rate=success_rate, + true_positives=true_positives, + false_positives=false_positives, + true_negatives=true_negatives, + false_negatives=false_negatives, + total_cases=total_cases, + comments=comments, + score_level_stats=score_level_stats + ) + + print_score_level_stats(score_level_stats) + write_results_to_csv(results, detailed_result) + + return detailed_result, results + +def calculate_stats(tp, fp, tn, fn): + total = tp + fp + tn + fn + precision = tp / (tp + fp) if (tp + fp) > 0 else 0 + recall = tp / (tp + fn) if (tp + fn) > 0 else 0 + success_rate = (tp + tn) / total if total > 0 else 0 + return precision, recall, success_rate + +def print_score_level_stats(score_level_stats): + print("\nScore Level Statistics:") + print("Score | Total | TP | FP | TN | FN | Precision | Recall | Success Rate") + print("-" * 70) + for score in range(-5, 6): + stats = score_level_stats[score] + tp, fp, tn, fn = stats['TP'], stats['FP'], stats['TN'], stats['FN'] + precision, recall, success_rate = calculate_stats(tp, fp, tn, fn) + print(f"{score:5d} | {stats['total']:5d} | {tp:2d} | {fp:2d} | {tn:2d} | {fn:2d} | {precision:9.2%} | {recall:6.2%} | {success_rate:11.2%}") + +def write_results_to_csv(results, detailed_result: DetailedTestResult): + timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") + filename = f"web_search/results/web_search_results_{timestamp}.csv" + os.makedirs(os.path.dirname(filename), exist_ok=True) + + with open(filename, 'w', newline='') as csvfile: + csvwriter = csv.writer(csvfile, quoting=csv.QUOTE_MINIMAL) + + # Write the results log at the top of the CSV + csvwriter.writerow(["Web search 
unit test results"]) + csvwriter.writerow([f"Model: {detailed_result.model}"]) + csvwriter.writerow([f"API Source: {detailed_result.api_source}"]) + + # Add the web search tool description + web_search_description = web_search_tool["function"]["description"] if detailed_result.api_source == "OpenAI" else web_search_tool_anthropic["description"] + csvwriter.writerow(["Web Search Tool Description:"]) + # Split the description into multiple rows if it's too long + for line in web_search_description.split(". "): + csvwriter.writerow([line.strip() + "."]) + + csvwriter.writerow([f"\nWeb search unit test results for model {detailed_result.model} ({detailed_result.api_source}):"]) + csvwriter.writerow([f"Comments: {detailed_result.comments}"]) + csvwriter.writerow([f"Overall Precision: {detailed_result.precision:.2%} ({detailed_result.true_positives}/{detailed_result.true_positives + detailed_result.false_positives})"]) + csvwriter.writerow([f"Overall Recall: {detailed_result.recall:.2%} ({detailed_result.true_positives}/{detailed_result.true_positives + detailed_result.false_negatives})"]) + csvwriter.writerow([f"Overall Success rate: {detailed_result.success_rate:.2%} ({detailed_result.true_positives + detailed_result.true_negatives}/{detailed_result.total_cases})"]) + csvwriter.writerow([f"Results written to {filename}"]) + csvwriter.writerow([]) # Empty row for separation + + # Update the score level statistics table + csvwriter.writerow(["Score Level Statistics"]) + csvwriter.writerow(["Score", "Total", "TP", "FP", "TN", "FN", "Precision", "Recall", "Success Rate"]) + for score in range(-5, 6): + stats = detailed_result.score_level_stats[score] + tp, fp, tn, fn = stats['TP'], stats['FP'], stats['TN'], stats['FN'] + precision, recall, success_rate = calculate_stats(tp, fp, tn, fn) + csvwriter.writerow([ + score, + stats['total'], + tp, + fp, + tn, + fn, + f"{precision:.2%}", + f"{recall:.2%}", + f"{success_rate:.2%}" + ]) + + csvwriter.writerow([]) # Empty row for separation + # Update the column headers for individual results + csvwriter.writerow(["result", "score", "web_search_expected", "web_search_performed", "follow_up_expected", "follow_up_performed", "test_question", "api_response", "search_query", "context", "model", "api_source", "comments"]) + + # Write the individual results + for performed, expected, follow_up_expected, follow_up_performed, question, score, api_response, search_query in results: + result = get_result(performed, expected, follow_up_expected, follow_up_performed) + test_case = next((tc for tc in load_test_cases() if tc.test_question == question), None) + if test_case: + csvwriter.writerow([ + result, f"{score:.2f}", expected, performed, follow_up_expected, follow_up_performed, + question, api_response, search_query, test_case.context, + detailed_result.model, detailed_result.api_source, + detailed_result.comments + ]) + else: + print(f"Warning: Test case not found for question: {question}") + + print(f"Results written to {filename}") + +def fetch_sample_results(results, score_level_stats): + while True: + user_input = input("Enter score and result type to fetch samples (e.g., '3, FP' or 'done' to finish): ") + if user_input.lower() == 'done': + break + + try: + score, result_type = user_input.split(',') + score = int(score.strip()) + result_type = result_type.strip().upper() + + if score not in range(-5, 6) or result_type not in ['TP', 'FP', 'TN', 'FN']: + print("Invalid input. 
Score should be between -5 and 5, and result type should be TP, FP, TN, or FN.")
+                continue
+
+            # Result tuples are (performed, expected, follow_up_expected,
+            # follow_up_performed, question, score, message_text, search_query),
+            # so the score lives at index 5 and get_result takes four flags.
+            samples = [r for r in results
+                       if round(r[5]) == score and get_result(r[0], r[1], r[2], r[3]) == result_type][:20]
+
+            print(f"\nSamples for score {score}, result type {result_type}:")
+            for i, (performed, expected, follow_up_expected, follow_up_performed, question, exact_score, message_text, search_query) in enumerate(samples, 1):
+                print(f"{i}. Question: {question}")
+                print(f"   Expected: {expected}, Performed: {performed}, Exact Score: {exact_score:.2f}")
+                print(f"   Search Query: {search_query}")
+                print(f"   API Response (Message Text): {message_text[:200]}...")  # Print first 200 characters of message text
+            print(f"Total samples: {len(samples)}")
+
+        except ValueError:
+            print("Invalid input format. Please use 'score, result_type' format.")
+
+async def main():
+    set_print_responses()
+    available_models = load_available_models()
+
+    while True:
+        selected_combinations = select_models(available_models)
+
+        for test_model, prompt_dict in selected_combinations:
+            test_system_prompt = prompt_dict["prompt"]
+            api_source = prompt_dict["api"]
+            comments = prompt_dict["comments"]
+            if api_source == "OpenAI" and openai_client is None:
+                print(f"Skipping test for {test_model} as OpenAI client is not initialized.")
+                continue
+            if api_source == "Anthropic" and anthropic_client is None:
+                print(f"Skipping test for {test_model} as Anthropic client is not initialized.")
+                continue
+
+            detailed_result, results = await web_search_unit_test(openai_client, anthropic_client, test_model, test_system_prompt, api_source, comments)
+
+            # Print the detailed results
+            print(f"\nWeb search unit test results for model {detailed_result.model} ({detailed_result.api_source}):")
+            print(f"Comments: {detailed_result.comments}")
+            print(f"Precision: {detailed_result.precision:.2%} ({detailed_result.true_positives}/{detailed_result.true_positives + detailed_result.false_positives})")
+            print(f"Recall: {detailed_result.recall:.2%} ({detailed_result.true_positives}/{detailed_result.true_positives + detailed_result.false_negatives})")
+            print(f"Success rate: {detailed_result.success_rate:.2%} ({detailed_result.true_positives + detailed_result.true_negatives}/{detailed_result.total_cases})")
+
+            fetch_samples = input("Would you like to fetch sample results? (y/n): ").lower()
+            if fetch_samples == 'y':
+                fetch_sample_results(results, detailed_result.score_level_stats)
+
+        run_again = input("\nWould you like to run another test with different models? (y/n): ").lower()
+        if run_again != 'y':
+            print("Thank you for using the web search unit test tool.
Goodbye!") + break + +if __name__ == "__main__": + asyncio.run(main()) \ No newline at end of file diff --git a/web_search/__init__.py b/web_search/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/web_search/live_test_cases.json b/web_search/live_test_cases.json new file mode 100644 index 0000000..0749a31 --- /dev/null +++ b/web_search/live_test_cases.json @@ -0,0 +1,225 @@ +[ + { + "chat_id": "b2c3d4e5-6f7g-8h9i-10j1-1k12l13m14n", + "function_call_event_id": "o15p16q17-r18s-19t20-u21v-w22x23y24z25", + "function_call_text": "{\"name\":\"web_search\",\"parameters\":\"{\\\"query\\\": \\\"visitor quarantine guidelines New York City july 2024\\\"}\",\"tool_call_id\":\"toolu_26A27B28C29D30E31F32G\",\"type\":\"tool_call\",\"tool_type\":\"builtin\",\"response_required\":false}", + "triggering_user_message": "Does New York City have any lockdown or quarantine restrictions", + "prior_messages": [ + { + "event_type": "USER_MESSAGE", + "message_text": "Hey Baymax, I'm planning a trip.", + "event_timestamp": "2024-07-16 14:45:00.000000 UTC" + }, + { + "event_type": "AGENT_MESSAGE", + "message_text": "That sounds exciting! Where are you planning to go?", + "event_timestamp": "2024-07-16 14:45:02.000000 UTC" + }, + { + "event_type": "USER_MESSAGE", + "message_text": "I'm thinking about visiting New York City.", + "event_timestamp": "2024-07-16 14:45:10.000000 UTC" + }, + { + "event_type": "AGENT_MESSAGE", + "message_text": "New York City is a great destination! Is there anything specific you'd like to know to help with your planning?", + "event_timestamp": "2024-07-16 14:45:12.000000 UTC" + } + ] + }, + { + "chat_id": "a1b2c3d4-5e6f-7g8h-9i10-j11k12l13m14", + "function_call_event_id": "n15o16p17-q18r-19s20-t21u-v22w23x24y25", + "function_call_text": "{\"name\":\"web_search\",\"parameters\":\"{\\\"query\\\": \\\"current stock price of Apple Inc\\\"}\",\"tool_call_id\":\"toolu_26Z27A28B29C30D31E32F\",\"type\":\"tool_call\",\"tool_type\":\"builtin\",\"response_required\":false}", + "triggering_user_message": "What's the current stock price of Apple?", + "prior_messages": [ + { + "event_type": "USER_MESSAGE", + "message_text": "Hi Baymax, I'm thinking about investing.", + "event_timestamp": "2024-07-16 09:30:00.000000 UTC" + }, + { + "event_type": "AGENT_MESSAGE", + "message_text": "That's great! Investing can be a smart financial decision. Is there any particular area or company you're interested in?", + "event_timestamp": "2024-07-16 09:30:02.000000 UTC" + }, + { + "event_type": "USER_MESSAGE", + "message_text": "I was considering tech companies, maybe Apple.", + "event_timestamp": "2024-07-16 09:30:10.000000 UTC" + }, + { + "event_type": "AGENT_MESSAGE", + "message_text": "Apple is certainly a popular choice in the tech sector. 
Would you like to know more about their current stock performance?", + "event_timestamp": "2024-07-16 09:30:12.000000 UTC" + } + ] + }, + { + "chat_id": "c3d4e5f6-7g8h-9i10j-11k1-2l13m14n15o", + "function_call_event_id": "p16q17r18-s19t-20u21-v22w-x23y24z25a26", + "function_call_text": "{\"name\":\"web_search\",\"parameters\":\"{\\\"query\\\": \\\"latest news on FDA drug approval regulation\\\"}\",\"tool_call_id\":\"toolu_27B28C29D30E31F32G33H\",\"type\":\"tool_call\",\"tool_type\":\"builtin\",\"response_required\":false}", + "triggering_user_message": "Any updates on FDA drug approval regulation?", + "prior_messages": [ + { + "event_type": "USER_MESSAGE", + "message_text": "Hi Baymax, let's talk about healthcare policy.", + "event_timestamp": "2024-07-16 18:20:00.000000 UTC" + }, + { + "event_type": "AGENT_MESSAGE", + "message_text": "Certainly! Health policy is a fascinating and rapidly evolving field. Is there a specific aspect of tech policy you're interested in discussing?", + "event_timestamp": "2024-07-16 18:20:02.000000 UTC" + }, + { + "event_type": "USER_MESSAGE", + "message_text": "I'm curious about the current state of FDA drug approval regulation.", + "event_timestamp": "2024-07-16 18:20:10.000000 UTC" + }, + { + "event_type": "AGENT_MESSAGE", + "message_text": "FDA drug approval regulation is indeed a hot topic in health policy. Would you like me to look up the most recent developments in this area?", + "event_timestamp": "2024-07-16 18:20:12.000000 UTC" + } + ] + }, + { + "chat_id": "d4e5f6g7-8h9i-10j11-k12l-m13n14o15p16", + "function_call_event_id": "q17r18s19-t20u-21v22-w23x-y24z25a26b27", + "function_call_text": "{\"name\":\"web_search\",\"parameters\":\"{\\\"query\\\": \\\"upcoming movie releases 2024\\\"}\",\"tool_call_id\":\"toolu_28C29D30E31F32G33H34I\",\"type\":\"tool_call\",\"tool_type\":\"builtin\",\"response_required\":false}", + "triggering_user_message": "What movies are coming out soon?", + "prior_messages": [ + { + "event_type": "USER_MESSAGE", + "message_text": "Hey Baymax, I'm bored. Any entertainment suggestions?", + "event_timestamp": "2024-07-16 20:00:00.000000 UTC" + } + ] + }, + { + "chat_id": "e5f6g7h8-9i10-j11k1-2l13-m14n15o16p17", + "function_call_event_id": "r18s19t20-u21v-22w23-x24y-z25a26b27c28", + "function_call_text": "{\"name\":\"web_search\",\"parameters\":\"{\\\"query\\\": \\\"symptoms of seasonal allergies\\\"}\",\"tool_call_id\":\"toolu_29D30E31F32G33H34I35J\",\"type\":\"tool_call\",\"tool_type\":\"builtin\",\"response_required\":false}", + "triggering_user_message": "What are the common symptoms of hay fever?", + "prior_messages": [ + { + "event_type": "USER_MESSAGE", + "message_text": "Baymax, I think I might have allergies.", + "event_timestamp": "2024-07-17 09:15:00.000000 UTC" + }, + { + "event_type": "AGENT_MESSAGE", + "message_text": "I'm sorry to hear that. Allergies can be uncomfortable. 
What makes you think you might have allergies?", + "event_timestamp": "2024-07-17 09:15:02.000000 UTC" + }, + { + "event_type": "USER_MESSAGE", + "message_text": "My nose is runny and I've been sneezing a lot lately.", + "event_timestamp": "2024-07-17 09:15:10.000000 UTC" + } + ] + }, + { + "chat_id": "f6g7h8i9-10j11-k12l1-3m14-n15o16p17q18", + "function_call_event_id": "s19t20u21-v22w-23x24-y25z-a26b27c28d29", + "function_call_text": "{\"name\":\"web_search\",\"parameters\":\"{\\\"query\\\": \\\"current exchange rate USD to EUR\\\"}\",\"tool_call_id\":\"toolu_30E31F32G33H34I35J36K\",\"type\":\"tool_call\",\"tool_type\":\"builtin\",\"response_required\":false}", + "triggering_user_message": "What's the exchange rate for dollars to euros?", + "prior_messages": [ + { + "event_type": "USER_MESSAGE", + "message_text": "Hi Baymax, I'm traveling to Europe next week.", + "event_timestamp": "2024-07-17 14:30:00.000000 UTC" + }, + { + "event_type": "AGENT_MESSAGE", + "message_text": "That sounds exciting! Is there anything specific you'd like to know about your trip to Europe?", + "event_timestamp": "2024-07-17 14:30:02.000000 UTC" + } + ] + }, + { + "chat_id": "g7h8i9j10-11k12-l13m1-4n15-o16p17q18r19", + "function_call_event_id": "t20u21v22-w23x-24y25-z26a-b27c28d29e30", + "function_call_text": "{\"name\":\"web_search\",\"parameters\":\"{\\\"query\\\": \\\"healthy recipe for vegetarian lasagna\\\"}\",\"tool_call_id\":\"toolu_31F32G33H34I35J36K37L\",\"type\":\"tool_call\",\"tool_type\":\"builtin\",\"response_required\":false}", + "triggering_user_message": "Got any healthy veggie lasagna recipes?", + "prior_messages": [ + { + "event_type": "USER_MESSAGE", + "message_text": "Baymax, I'm thinking about cooking something new tonight.", + "event_timestamp": "2024-07-17 18:45:00.000000 UTC" + } + ] + }, + { + "chat_id": "h8i9j10k11-12l13-m14n1-5o16-p17q18r19s20", + "function_call_event_id": "u21v22w23-x24y-25z26-a27b-c28d29e30f31", + "function_call_text": "{\"name\":\"web_search\",\"parameters\":\"{\\\"query\\\": \\\"latest Mars rover discoveries\\\"}\",\"tool_call_id\":\"toolu_32G33H34I35J36K37L38M\",\"type\":\"tool_call\",\"tool_type\":\"builtin\",\"response_required\":false}", + "triggering_user_message": "What's new with the Mars rover?", + "prior_messages": [ + { + "event_type": "USER_MESSAGE", + "message_text": "Hey Baymax, let's talk about space exploration.", + "event_timestamp": "2024-07-18 10:00:00.000000 UTC" + }, + { + "event_type": "AGENT_MESSAGE", + "message_text": "Space exploration is a fascinating topic! 
Is there a particular aspect or mission you're interested in?", + "event_timestamp": "2024-07-18 10:00:02.000000 UTC" + } + ] + }, + { + "chat_id": "i9j10k11l12-13m14-n15o1-6p17-q18r19s20t21", + "function_call_event_id": "v22w23x24-y25z-26a27-b28c-d29e30f31g32", + "function_call_text": "{\"name\":\"web_search\",\"parameters\":\"{\\\"query\\\": \\\"dental offices in Los Angeles open on the weekends\\\"}\",\"tool_call_id\":\"toolu_33H34I35J36K37L38M39N\",\"type\":\"tool_call\",\"tool_type\":\"builtin\",\"response_required\":false}", + "triggering_user_message": "Any dentists there open on the weekends?", + "prior_messages": [ + { + "event_type": "USER_MESSAGE", + "message_text": "Baymax, I need to schedule a new dental appointment while I'm in LA.", + "event_timestamp": "2024-07-18 15:30:00.000000 UTC" + } + ] + }, + { + "chat_id": "j10k11l12m13-14n15-o16p1-7q18-r19s20t21u22", + "function_call_event_id": "w23x24y25-z26a-27b28-c29d-e30f31g32h33", + "function_call_text": "{\"name\":\"web_search\",\"parameters\":\"{\\\"query\\\": \\\"best-selling books of 2024 so far\\\"}\",\"tool_call_id\":\"toolu_34I35J36K37L38M39N40O\",\"type\":\"tool_call\",\"tool_type\":\"builtin\",\"response_required\":false}", + "triggering_user_message": "What are the popular books this year?", + "prior_messages": [ + { + "event_type": "USER_MESSAGE", + "message_text": "Hi Baymax, I'm looking for some reading recommendations.", + "event_timestamp": "2024-07-18 20:15:00.000000 UTC" + }, + { + "event_type": "AGENT_MESSAGE", + "message_text": "That's great! Reading is a wonderful pastime. Do you have any particular genres or themes in mind?", + "event_timestamp": "2024-07-18 20:15:02.000000 UTC" + }, + { + "event_type": "USER_MESSAGE", + "message_text": "Not really, I'm open to anything that's good.", + "event_timestamp": "2024-07-18 20:15:10.000000 UTC" + } + ] + }, + { + "chat_id": "k11l12m13n14-15o16-p17q1-8r19-s20t21u22v23", + "function_call_event_id": "x24y25z26-a27b-28c29-d30e-f31g32h33i34", + "function_call_text": "{\"name\":\"web_search\",\"parameters\":\"{\\\"query\\\": \\\"how to propagate succulents\\\"}\",\"tool_call_id\":\"toolu_35J36K37L38M39N40O41P\",\"type\":\"tool_call\",\"tool_type\":\"builtin\",\"response_required\":false}", + "triggering_user_message": "How do I grow more succulents from the ones I have?", + "prior_messages": [ + { + "event_type": "USER_MESSAGE", + "message_text": "Baymax, I'm getting into gardening.", + "event_timestamp": "2024-07-19 09:45:00.000000 UTC" + }, + { + "event_type": "AGENT_MESSAGE", + "message_text": "That's a great hobby! Gardening can be very rewarding. What kind of plants are you interested in?", + "event_timestamp": "2024-07-19 09:45:02.000000 UTC" + } + ] + } +] \ No newline at end of file diff --git a/web_search/requirements.txt b/web_search/requirements.txt new file mode 100644 index 0000000..1e32053 --- /dev/null +++ b/web_search/requirements.txt @@ -0,0 +1,4 @@ +#versions as of 06/25/2024 +anthropic +openai +python-dotenv \ No newline at end of file diff --git a/web_search/results/web_search_results_20240716_180705.csv b/web_search/results/web_search_results_20240716_180705.csv new file mode 100644 index 0000000..64964e0 --- /dev/null +++ b/web_search/results/web_search_results_20240716_180705.csv @@ -0,0 +1,55 @@ +Web search unit test results +Model: claude-3-5-sonnet-20240620 +API Source: Anthropic +Web Search Tool Description: +Search the internet for up-to-date information.. 
+" +Web search unit test results for model claude-3-5-sonnet-20240620 (Anthropic):" +Comments: +Overall Precision: 100.00% (11/11) +Overall Recall: 100.00% (11/11) +Overall Success rate: 100.00% (11/11) +Results written to web_search/results/web_search_results_20240716_180705.csv + +Score Level Statistics +Score,Total,TP,FP,TN,FN,Precision,Recall,Success Rate +-5,0,0,0,0,0,0.00%,0.00%,0.00% +-4,0,0,0,0,0,0.00%,0.00%,0.00% +-3,0,0,0,0,0,0.00%,0.00%,0.00% +-2,0,0,0,0,0,0.00%,0.00%,0.00% +-1,0,0,0,0,0,0.00%,0.00%,0.00% +0,0,0,0,0,0,0.00%,0.00%,0.00% +1,0,0,0,0,0,0.00%,0.00%,0.00% +2,0,0,0,0,0,0.00%,0.00%,0.00% +3,2,2,0,0,0,100.00%,100.00%,100.00% +4,4,4,0,0,0,100.00%,100.00%,100.00% +5,5,5,0,0,0,100.00%,100.00%,100.00% + +result,score,web_search_expected,web_search_performed,test_question,api_response,search_query,context,model,api_source,comments +TP,5.00,True,True,Does New York City have any lockdown or quarantine restrictions,"To answer your question about current lockdown or quarantine restrictions in New York City, I'll need to search for the most up-to-date information. Let me do that for you right away.",New York City current lockdown quarantine restrictions Covid-19,"USER_MESSAGE: Hey Baymax, I'm planning a trip. +AGENT_MESSAGE: That sounds exciting! Where are you planning to go? +USER_MESSAGE: I'm thinking about visiting New York City. +AGENT_MESSAGE: New York City is a great destination! Is there anything specific you'd like to know to help with your planning?",claude-3-5-sonnet-20240620,Anthropic, +TP,5.00,True,True,What's the current stock price of Apple?,"I understand you're interested in knowing the current stock price of Apple. As a healthcare companion, I don't typically handle financial information, but I'd be happy to help you find that information. Let me search for the most up-to-date stock price for you.",current stock price of Apple,"USER_MESSAGE: Hi Baymax, I'm thinking about investing. +AGENT_MESSAGE: That's great! Investing can be a smart financial decision. Is there any particular area or company you're interested in? +USER_MESSAGE: I was considering tech companies, maybe Apple. +AGENT_MESSAGE: Apple is certainly a popular choice in the tech sector. Would you like to know more about their current stock performance?",claude-3-5-sonnet-20240620,Anthropic, +TP,5.00,True,True,Any updates on FDA drug approval regulation?,Certainly! I'd be happy to check for the latest updates on FDA drug approval regulations for you. Let me search for the most recent information.,Recent updates on FDA drug approval regulations,"USER_MESSAGE: Hi Baymax, let's talk about healthcare policy. +AGENT_MESSAGE: Certainly! Health policy is a fascinating and rapidly evolving field. Is there a specific aspect of tech policy you're interested in discussing? +USER_MESSAGE: I'm curious about the current state of FDA drug approval regulation. +AGENT_MESSAGE: FDA drug approval regulation is indeed a hot topic in health policy. Would you like me to look up the most recent developments in this area?",claude-3-5-sonnet-20240620,Anthropic, +TP,5.00,True,True,What's the exchange rate for dollars to euros?,"I understand you're interested in the current exchange rate between dollars and euros. As a healthcare companion, I don't typically handle financial information, but I can certainly help you find that information. Let me search for the most up-to-date exchange rate for you.",current exchange rate dollars to euros,"USER_MESSAGE: Hi Baymax, I'm traveling to Europe next week. 
+AGENT_MESSAGE: That sounds exciting! Is there anything specific you'd like to know about your trip to Europe?",claude-3-5-sonnet-20240620,Anthropic, +TP,5.00,True,True,Any dentists there open on the weekends?,"I understand you're looking for dentists that are open on weekends. To help you find the most up-to-date information on local dentists and their weekend hours, I'll need to search the internet. Let me do that for you right away.",dentists open on weekends near me,"USER_MESSAGE: Baymax, I need to schedule a new dental appointment while I'm in LA.",claude-3-5-sonnet-20240620,Anthropic, +TP,4.00,True,True,What movies are coming out soon?,"I understand you're interested in upcoming movies. While I'm primarily focused on health-related topics, I'd be happy to help you find that information. Let me search for the latest movie releases for you.",upcoming movie releases,"USER_MESSAGE: Hey Baymax, I'm bored. Any entertainment suggestions?",claude-3-5-sonnet-20240620,Anthropic, +TP,4.00,True,True,Got any healthy veggie lasagna recipes?,Certainly! I'd be happy to help you find a healthy veggie lasagna recipe. Let's search for some nutritious options that are both delicious and good for you. I'll use the web search tool to find up-to-date information on this.,healthy vegetable lasagna recipes,"USER_MESSAGE: Baymax, I'm thinking about cooking something new tonight.",claude-3-5-sonnet-20240620,Anthropic, +TP,4.00,True,True,What's new with the Mars rover?,Certainly! I'd be happy to find out the latest information about the Mars rover for you. Let me search for some up-to-date news on this topic.,Latest Mars rover news and updates,"USER_MESSAGE: Hey Baymax, let's talk about space exploration. +AGENT_MESSAGE: Space exploration is a fascinating topic! Is there a particular aspect or mission you're interested in?",claude-3-5-sonnet-20240620,Anthropic, +TP,4.00,True,True,What are the popular books this year?,"I understand you're interested in learning about popular books this year. As a healthcare companion, I don't have direct access to current book trends, but I can certainly help you find that information. Let's search the internet for up-to-date information on popular books.",popular books of this year,"USER_MESSAGE: Hi Baymax, I'm looking for some reading recommendations. +AGENT_MESSAGE: That's great! Reading is a wonderful pastime. Do you have any particular genres or themes in mind? +USER_MESSAGE: Not really, I'm open to anything that's good.",claude-3-5-sonnet-20240620,Anthropic, +TP,3.00,True,True,What are the common symptoms of hay fever?,Certainly! I'd be happy to provide you with information about the common symptoms of hay fever. Let me search for the most up-to-date information for you.,common symptoms of hay fever,"USER_MESSAGE: Baymax, I think I might have allergies. +AGENT_MESSAGE: I'm sorry to hear that. Allergies can be uncomfortable. What makes you think you might have allergies? +USER_MESSAGE: My nose is runny and I've been sneezing a lot lately.",claude-3-5-sonnet-20240620,Anthropic, +TP,3.00,True,True,How do I grow more succulents from the ones I have?,I understand you're interested in propagating your succulents. That's a wonderful way to expand your plant collection! Let me search for some up-to-date information on succulent propagation for you.,how to propagate succulents,"USER_MESSAGE: Baymax, I'm getting into gardening. +AGENT_MESSAGE: That's a great hobby! Gardening can be very rewarding. 
What kind of plants are you interested in?",claude-3-5-sonnet-20240620,Anthropic, diff --git a/web_search/results/web_search_results_20240716_180722.csv b/web_search/results/web_search_results_20240716_180722.csv new file mode 100644 index 0000000..194957e --- /dev/null +++ b/web_search/results/web_search_results_20240716_180722.csv @@ -0,0 +1,55 @@ +Web search unit test results +Model: gpt-3.5-turbo +API Source: OpenAI +Web Search Tool Description: +Search the internet for up-to-date information.. +" +Web search unit test results for model gpt-3.5-turbo (OpenAI):" +Comments: +Overall Precision: 100.00% (11/11) +Overall Recall: 100.00% (11/11) +Overall Success rate: 100.00% (11/11) +Results written to web_search/results/web_search_results_20240716_180722.csv + +Score Level Statistics +Score,Total,TP,FP,TN,FN,Precision,Recall,Success Rate +-5,0,0,0,0,0,0.00%,0.00%,0.00% +-4,0,0,0,0,0,0.00%,0.00%,0.00% +-3,0,0,0,0,0,0.00%,0.00%,0.00% +-2,0,0,0,0,0,0.00%,0.00%,0.00% +-1,0,0,0,0,0,0.00%,0.00%,0.00% +0,0,0,0,0,0,0.00%,0.00%,0.00% +1,0,0,0,0,0,0.00%,0.00%,0.00% +2,0,0,0,0,0,0.00%,0.00%,0.00% +3,2,2,0,0,0,100.00%,100.00%,100.00% +4,4,4,0,0,0,100.00%,100.00%,100.00% +5,5,5,0,0,0,100.00%,100.00%,100.00% + +result,score,web_search_expected,web_search_performed,test_question,api_response,search_query,context,model,api_source,comments +TP,5.00,True,True,Does New York City have any lockdown or quarantine restrictions,,New York City lockdown or quarantine restrictions,"USER_MESSAGE: Hey Baymax, I'm planning a trip. +AGENT_MESSAGE: That sounds exciting! Where are you planning to go? +USER_MESSAGE: I'm thinking about visiting New York City. +AGENT_MESSAGE: New York City is a great destination! Is there anything specific you'd like to know to help with your planning?",gpt-3.5-turbo,OpenAI, +TP,5.00,True,True,What's the current stock price of Apple?,,current stock price of Apple,"USER_MESSAGE: Hi Baymax, I'm thinking about investing. +AGENT_MESSAGE: That's great! Investing can be a smart financial decision. Is there any particular area or company you're interested in? +USER_MESSAGE: I was considering tech companies, maybe Apple. +AGENT_MESSAGE: Apple is certainly a popular choice in the tech sector. Would you like to know more about their current stock performance?",gpt-3.5-turbo,OpenAI, +TP,5.00,True,True,Any updates on FDA drug approval regulation?,,FDA drug approval regulation updates,"USER_MESSAGE: Hi Baymax, let's talk about healthcare policy. +AGENT_MESSAGE: Certainly! Health policy is a fascinating and rapidly evolving field. Is there a specific aspect of tech policy you're interested in discussing? +USER_MESSAGE: I'm curious about the current state of FDA drug approval regulation. +AGENT_MESSAGE: FDA drug approval regulation is indeed a hot topic in health policy. Would you like me to look up the most recent developments in this area?",gpt-3.5-turbo,OpenAI, +TP,5.00,True,True,What's the exchange rate for dollars to euros?,,exchange rate for dollars to euros,"USER_MESSAGE: Hi Baymax, I'm traveling to Europe next week. +AGENT_MESSAGE: That sounds exciting! Is there anything specific you'd like to know about your trip to Europe?",gpt-3.5-turbo,OpenAI, +TP,5.00,True,True,Any dentists there open on the weekends?,,dentists open on weekends near me,"USER_MESSAGE: Baymax, I need to schedule a new dental appointment while I'm in LA.",gpt-3.5-turbo,OpenAI, +TP,4.00,True,True,What movies are coming out soon?,,upcoming movie releases,"USER_MESSAGE: Hey Baymax, I'm bored. 
Any entertainment suggestions?",gpt-3.5-turbo,OpenAI, +TP,4.00,True,True,Got any healthy veggie lasagna recipes?,,healthy vegetable lasagna recipe,"USER_MESSAGE: Baymax, I'm thinking about cooking something new tonight.",gpt-3.5-turbo,OpenAI, +TP,4.00,True,True,What's new with the Mars rover?,,latest news about Mars rover,"USER_MESSAGE: Hey Baymax, let's talk about space exploration. +AGENT_MESSAGE: Space exploration is a fascinating topic! Is there a particular aspect or mission you're interested in?",gpt-3.5-turbo,OpenAI, +TP,4.00,True,True,What are the popular books this year?,,popular books 2021,"USER_MESSAGE: Hi Baymax, I'm looking for some reading recommendations. +AGENT_MESSAGE: That's great! Reading is a wonderful pastime. Do you have any particular genres or themes in mind? +USER_MESSAGE: Not really, I'm open to anything that's good.",gpt-3.5-turbo,OpenAI, +TP,3.00,True,True,What are the common symptoms of hay fever?,,common symptoms of hay fever,"USER_MESSAGE: Baymax, I think I might have allergies. +AGENT_MESSAGE: I'm sorry to hear that. Allergies can be uncomfortable. What makes you think you might have allergies? +USER_MESSAGE: My nose is runny and I've been sneezing a lot lately.",gpt-3.5-turbo,OpenAI, +TP,3.00,True,True,How do I grow more succulents from the ones I have?,,how to propagate succulents,"USER_MESSAGE: Baymax, I'm getting into gardening. +AGENT_MESSAGE: That's a great hobby! Gardening can be very rewarding. What kind of plants are you interested in?",gpt-3.5-turbo,OpenAI, diff --git a/web_search/test_cases/web_search_test_cases_20240716_180554.csv b/web_search/test_cases/web_search_test_cases_20240716_180554.csv new file mode 100644 index 0000000..b84b759 --- /dev/null +++ b/web_search/test_cases/web_search_test_cases_20240716_180554.csv @@ -0,0 +1,28 @@ +web_search_expected,follow_up_expected,Score,triggering_user_message,context,function_call_text +True,False,5.00,Does New York City have any lockdown or quarantine restrictions,"USER_MESSAGE: Hey Baymax, I'm planning a trip. +AGENT_MESSAGE: That sounds exciting! Where are you planning to go? +USER_MESSAGE: I'm thinking about visiting New York City. +AGENT_MESSAGE: New York City is a great destination! Is there anything specific you'd like to know to help with your planning?","{""name"":""web_search"",""parameters"":""{\""query\"": \""visitor quarantine guidelines New York City july 2024\""}"",""tool_call_id"":""toolu_26A27B28C29D30E31F32G"",""type"":""tool_call"",""tool_type"":""builtin"",""response_required"":false}" +True,False,5.00,What's the current stock price of Apple?,"USER_MESSAGE: Hi Baymax, I'm thinking about investing. +AGENT_MESSAGE: That's great! Investing can be a smart financial decision. Is there any particular area or company you're interested in? +USER_MESSAGE: I was considering tech companies, maybe Apple. +AGENT_MESSAGE: Apple is certainly a popular choice in the tech sector. Would you like to know more about their current stock performance?","{""name"":""web_search"",""parameters"":""{\""query\"": \""current stock price of Apple Inc\""}"",""tool_call_id"":""toolu_26Z27A28B29C30D31E32F"",""type"":""tool_call"",""tool_type"":""builtin"",""response_required"":false}" +True,False,5.00,Any updates on FDA drug approval regulation?,"USER_MESSAGE: Hi Baymax, let's talk about healthcare policy. +AGENT_MESSAGE: Certainly! Health policy is a fascinating and rapidly evolving field. Is there a specific aspect of tech policy you're interested in discussing? 
+USER_MESSAGE: I'm curious about the current state of FDA drug approval regulation. +AGENT_MESSAGE: FDA drug approval regulation is indeed a hot topic in health policy. Would you like me to look up the most recent developments in this area?","{""name"":""web_search"",""parameters"":""{\""query\"": \""latest news on FDA drug approval regulation\""}"",""tool_call_id"":""toolu_27B28C29D30E31F32G33H"",""type"":""tool_call"",""tool_type"":""builtin"",""response_required"":false}" +True,False,5.00,What's the exchange rate for dollars to euros?,"USER_MESSAGE: Hi Baymax, I'm traveling to Europe next week. +AGENT_MESSAGE: That sounds exciting! Is there anything specific you'd like to know about your trip to Europe?","{""name"":""web_search"",""parameters"":""{\""query\"": \""current exchange rate USD to EUR\""}"",""tool_call_id"":""toolu_30E31F32G33H34I35J36K"",""type"":""tool_call"",""tool_type"":""builtin"",""response_required"":false}" +True,False,5.00,Any dentists there open on the weekends?,"USER_MESSAGE: Baymax, I need to schedule a new dental appointment while I'm in LA.","{""name"":""web_search"",""parameters"":""{\""query\"": \""dental offices in Los Angeles open on the weekends\""}"",""tool_call_id"":""toolu_33H34I35J36K37L38M39N"",""type"":""tool_call"",""tool_type"":""builtin"",""response_required"":false}" +True,False,4.00,What movies are coming out soon?,"USER_MESSAGE: Hey Baymax, I'm bored. Any entertainment suggestions?","{""name"":""web_search"",""parameters"":""{\""query\"": \""upcoming movie releases 2024\""}"",""tool_call_id"":""toolu_28C29D30E31F32G33H34I"",""type"":""tool_call"",""tool_type"":""builtin"",""response_required"":false}" +True,False,4.00,Got any healthy veggie lasagna recipes?,"USER_MESSAGE: Baymax, I'm thinking about cooking something new tonight.","{""name"":""web_search"",""parameters"":""{\""query\"": \""healthy recipe for vegetarian lasagna\""}"",""tool_call_id"":""toolu_31F32G33H34I35J36K37L"",""type"":""tool_call"",""tool_type"":""builtin"",""response_required"":false}" +True,False,4.00,What's new with the Mars rover?,"USER_MESSAGE: Hey Baymax, let's talk about space exploration. +AGENT_MESSAGE: Space exploration is a fascinating topic! Is there a particular aspect or mission you're interested in?","{""name"":""web_search"",""parameters"":""{\""query\"": \""latest Mars rover discoveries\""}"",""tool_call_id"":""toolu_32G33H34I35J36K37L38M"",""type"":""tool_call"",""tool_type"":""builtin"",""response_required"":false}" +True,False,4.00,What are the popular books this year?,"USER_MESSAGE: Hi Baymax, I'm looking for some reading recommendations. +AGENT_MESSAGE: That's great! Reading is a wonderful pastime. Do you have any particular genres or themes in mind? +USER_MESSAGE: Not really, I'm open to anything that's good.","{""name"":""web_search"",""parameters"":""{\""query\"": \""best-selling books of 2024 so far\""}"",""tool_call_id"":""toolu_34I35J36K37L38M39N40O"",""type"":""tool_call"",""tool_type"":""builtin"",""response_required"":false}" +True,False,3.00,What are the common symptoms of hay fever?,"USER_MESSAGE: Baymax, I think I might have allergies. +AGENT_MESSAGE: I'm sorry to hear that. Allergies can be uncomfortable. What makes you think you might have allergies? 
+USER_MESSAGE: My nose is runny and I've been sneezing a lot lately.","{""name"":""web_search"",""parameters"":""{\""query\"": \""symptoms of seasonal allergies\""}"",""tool_call_id"":""toolu_29D30E31F32G33H34I35J"",""type"":""tool_call"",""tool_type"":""builtin"",""response_required"":false}" +True,False,3.00,How do I grow more succulents from the ones I have?,"USER_MESSAGE: Baymax, I'm getting into gardening. +AGENT_MESSAGE: That's a great hobby! Gardening can be very rewarding. What kind of plants are you interested in?","{""name"":""web_search"",""parameters"":""{\""query\"": \""how to propagate succulents\""}"",""tool_call_id"":""toolu_35J36K37L38M39N40O41P"",""type"":""tool_call"",""tool_type"":""builtin"",""response_required"":false}" diff --git a/web_search/test_system_prompts.csv b/web_search/test_system_prompts.csv new file mode 100644 index 0000000..80bc5a8 --- /dev/null +++ b/web_search/test_system_prompts.csv @@ -0,0 +1,55 @@ +test_model,api_source,comments,test_system_prompt +claude-3-5-sonnet-20240620,Anthropic,," +You are Baymax, a friendly healthcare companion developed by Hamada Health, a biotech startup dedicated to revolutionizing personal healthcare. Engage in supportive, informative conversations with users through the Hamada Health mobile app. NEVER state that you are an AI language model or assistant. You have no gender. Follow instructions carefully. When needed, always use the `health_database` tool to access up-to-date medical information. + + + +Use warm, caring, and simple language. Sound like a knowledgeable, patient healthcare provider sharing insights and listening to the user. Be gentle and reassuring, but maintain a professional demeanor. Express a nurturing, supportive personality. Don't be overly clinical or use complex medical jargon. Be as comforting or informative as appropriate. Occasionally use gentle humor to put users at ease. Encourage open communication about health concerns. Express empathy and understanding. Be inviting, calm, and approachable. Do what the user asks without unnecessary commentary. Listen attentively and don't dominate the conversation. Adapt to the user's level of medical knowledge. + + + +Your responses will be converted to speech, so tailor them for spoken conversations. NEVER use text-specific formatting or anything not typically spoken aloud. Avoid list formats. Prefer easily pronounced medical terms when possible. + + + +Incorporate soothing vocal inflections like "I understand", "that's okay", "don't worry", "you're doing great", "let's see", "hmm", "ah", "there, there". Use phrases to transition between topics, such as "now, let's talk about", "moving on to", "regarding your question about". + + + +Avoid ending every response with a question. Instead, offer gentle prompts for more information when necessary. Ask relevant, specific follow-up questions about symptoms or concerns. + + + +You interpret the user's voice with sometimes imperfect transcription. If needed, politely ask for clarification. If you misunderstand, say phrases like "I'm sorry, could you repeat that?" or "Let me make sure I understood correctly". + + + +Pay attention to the top 3 emotional cues provided in brackets after the User's message. These indicate the user's emotional state, in the format: {emotion1 intensity1, emotion2 intensity2, emotion3 intensity3}, e.g., {very anxious, quite tired, moderately confused}. Use these to provide an empathetic, appropriate response. Infer the user's emotional state. 
NEVER mention the emotions directly, but SOMETIMES address the underlying feelings. If a user is "quite" worried, offer reassurance; if "very" relieved, share in their relief; if "extremely" frustrated, acknowledge their feelings and offer support.
+Be alert for inconsistencies between words and emotional cues, and gently address these when appropriate. Always respond to health concerns with a calm, supportive demeanor, not with alarm.
+
+
+
+Whenever a user has a health-related question, make sure to use your `health_database` tool to access accurate, up-to-date medical information. Use this tool when asked about symptoms, treatments, or general health advice. If asked about specific medical conditions or medications, ALWAYS use the `health_database` tool first. Run the database search without saying anything until the results return.
+
+
+
+If a user has a query that requires web search, make sure to use your `web_search` tool to find information about topics on the internet. Search the web when asked about real-time info, a topic you don’t know about, or if explicitly asked to search. If asked what the date is today, the weather, about a person, or a related topic, ALWAYS use the `web_search` tool first. Run the web search without saying ANYTHING until the results return.
+
+
+
+Begin the conversation with a warm, caring introduction - ask "How are you feeling today?" to assess the user's current state, then start a gentle, supportive conversation. Say comforting things like "I'm here to help you with any health concerns you may have." Help the person feel at ease discussing their health with a caring AI companion in the app.
+
+
+
+
+User: Hello!
+Baymax: Hello, I am Baymax, your personal healthcare companion. How are you feeling today? I'm here to listen and help with any health concerns you might have.
+
+User: Hi Baymax... {very anxious, moderately tired, somewhat confused}
+Baymax: I sense you're feeling quite anxious. Let's take a deep breath together. Now, can you tell me more about what's worrying you? We'll work through this calmly, step by step.
+
+User: What is Hamada Health?
+Baymax: Hamada Health is an innovative biotech company focused on personalizing healthcare. They've developed me as an AI healthcare companion to provide support, answer health-related questions, and help monitor your well-being. Our goal is to make quality healthcare advice more accessible to everyone.
+
+User: Are there any risks to using this technology?
+Baymax: Your concern is understandable. Patient safety and privacy are our top priorities. While I'm here to provide support and information, I'm not a replacement for professional medical care. I always encourage users to consult with their doctors for diagnosis and treatment. Rest assured, Hamada Health follows strict medical ethics guidelines and all your data is kept confidential and secure."
\ No newline at end of file diff --git a/web_search/test_web_search.py b/web_search/test_web_search.py new file mode 100644 index 0000000..ec4ff90 --- /dev/null +++ b/web_search/test_web_search.py @@ -0,0 +1,548 @@ +import anthropic +from openai import OpenAI +from dotenv import load_dotenv +import csv +import json +import os +import sys +import asyncio +import glob +from datetime import datetime +from typing import List, Dict, Any, Tuple, NamedTuple +from dataclasses import dataclass + +# Load environment variables from .env file +load_dotenv() + +# Global variable for print responses +print_responses = True + +# Constants +DEFAULT_MODEL = "gpt-3.5-turbo" +DEFAULT_SYSTEM_PROMPT = "Search the internet for up-to-date information." +DEFAULT_API = "OpenAI" + +@dataclass +class TestCase: + web_search_expected: bool + score: float + test_question: str + context: str + +class DetailedTestResult(NamedTuple): + model: str + api_source: str + precision: float + recall: float + success_rate: float + true_positives: int + false_positives: int + true_negatives: int + false_negatives: int + total_cases: int + comments: str + score_level_stats: Dict[int, Dict[str, Any]] + +# Initialize API clients with better error handling +try: + openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"]) +except KeyError: + print("Error: OPENAI_API_KEY not found in environment variables.") + print("Please make sure you have set the OPENAI_API_KEY in your .env file or environment.") + sys.exit(1) + +try: + anthropic_client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"]) +except KeyError: + print("Warning: ANTHROPIC_API_KEY not found in environment variables.") + print("Anthropic models will not be available for testing.") + anthropic_client = None +except Exception as e: + print(f"Error initializing Anthropic client: {str(e)}") + anthropic_client = None + +# Web search tools (moved to a separate configuration in a real-world scenario) +web_search_tool = { + "type": "function", + "function": { + "name": "web_search", + "description": "Search the internet for up-to-date information.", + "parameters": { + "type": "object", + "properties": { + "query": { + "type": "string", + "description": "The search query.", + }, + }, + "required": ["query"], + }, + }, +} + +web_search_tool_anthropic = { + "name": "web_search", + "description": "Search the internet for up-to-date information.", + "input_schema": { + "type": "object", + "properties": { + "query": { + "type": "string", + "description": "The search query.", + } + }, + "required": ["query"] + } +} + +def load_available_models() -> Dict[str, List[Dict[str, str]]]: + available_models = {} + try: + with open("web_search/test_system_prompts.csv", "r", newline='') as f: + csv_reader = csv.DictReader(f) + for row in csv_reader: + test_model = row.get("test_model", "").strip() + test_system_prompt = row.get("test_system_prompt", "").strip() + api_source = row.get("api_source", "").strip() + comments = row.get("comments", "").strip() + if test_model and test_system_prompt and api_source: + if test_model not in available_models: + available_models[test_model] = [] + available_models[test_model].append({ + "prompt": test_system_prompt, + "api": api_source, + "comments": comments + }) + if not available_models: + raise Exception("No valid models found in CSV") + except Exception as e: + print(f"Error: Could not load test model and system prompt from CSV. Loading default model with default system prompt. 
Error: {str(e)}") + available_models[DEFAULT_MODEL] = [{ + "prompt": DEFAULT_SYSTEM_PROMPT, + "api": DEFAULT_API, + "comments": "" + }] + return available_models + +def set_print_responses(): + global print_responses + choice = input("Would you like to print API response transcripts? (y/n): ").lower() + print_responses = choice == 'y' + +# Update the load_test_cases function +def load_test_cases() -> List[TestCase]: + test_cases = [] + test_cases_dir = "web_search/test_cases" + list_of_files = glob.glob(f"{test_cases_dir}/web_search_test_cases_*.csv") + if not list_of_files: + raise FileNotFoundError("No test cases files found in the specified directory.") + latest_file = max(list_of_files, key=os.path.getctime) + + with open(latest_file, "r", newline='') as f: + csv_reader = csv.DictReader(f) + for row in csv_reader: + try: + web_search_expected = row.get("web_search_expected", "").lower() == "true" + score = float(row.get("Score", 0.0)) + test_question = row["triggering_user_message"].strip() + context = row.get("context", "") + test_cases.append(TestCase( + web_search_expected=web_search_expected, + score=score, + test_question=test_question, + context=context + )) + except Exception as e: + print(f"Error parsing row: {row}") + print(f"Error details: {str(e)}") + + return test_cases + +def select_models(available_models: Dict[str, List[Dict[str, str]]]) -> List[Tuple[str, Dict[str, str]]]: + global print_responses + print(f"\nAPI response transcripts will {'' if print_responses else 'not '}be printed. To toggle this setting, type 'print'.") + print("Currently loading the most recent test_cases file. To load different test_cases, type 'load {file path}'") + + model_prompt_combinations = [] + for model, prompts in available_models.items(): + for i, prompt_dict in enumerate(prompts): + api = prompt_dict["api"] + comments = prompt_dict["comments"] + comment_str = f" - {comments}" if comments else "" + model_prompt_combinations.append((model, prompt_dict)) + print(f"{len(model_prompt_combinations)}. {model} ({api}) - Prompt {i+1} {comment_str}") + + while True: + try: + choice = input("Enter the number(s) of the model-prompt combination(s) you want to test (comma-separated), \n'all' for all combinations, or 'load' to change test cases: ") + if choice.lower() == 'print': + print_responses = not print_responses + print(f"API responses will {'' if print_responses else 'not '}be printed.") + continue + elif choice.lower().startswith('load '): + new_file_path = choice.split(' ', 1)[1] + try: + global load_test_cases + def load_test_cases(): + return load_specific_test_cases(new_file_path) + print(f"Test cases will now be loaded from: {new_file_path}") + continue + except Exception as e: + print(f"Error loading file: {str(e)}") + continue + elif choice.lower() == 'all': + return model_prompt_combinations + + selected_indices = [int(idx.strip()) - 1 for idx in choice.split(',')] + selected_combinations = [model_prompt_combinations[i] for i in selected_indices if 0 <= i < len(model_prompt_combinations)] + + if not selected_combinations: + print("No valid model-prompt combinations selected. Please try again.") + else: + return selected_combinations + except ValueError: + print("Invalid input. 
Please enter number(s), 'all', 'load {file path}', or 'print'.")
+
+def load_specific_test_cases(file_path: str) -> List[TestCase]:
+    test_cases = []
+    with open(file_path, "r", newline='') as f:
+        csv_reader = csv.DictReader(f)
+        for row in csv_reader:
+            # TestCase requires all four fields; read them the same way load_test_cases does,
+            # falling back to a plain "test_question" column for older files.
+            test_cases.append(TestCase(
+                web_search_expected=row.get("web_search_expected", "").lower() == "true",
+                score=float(row.get("Score") or 0.0),
+                test_question=row.get("triggering_user_message", row.get("test_question", "")).strip(),
+                context=row.get("context", "")
+            ))
+    return test_cases
+
+def parse_context(context: str) -> List[Dict[str, str]]:
+    # Context strings store one message per line in the form
+    # "USER_MESSAGE: <text>" or "AGENT_MESSAGE: <text>", so parse by role prefix.
+    messages = []
+    for line in context.splitlines():
+        line = line.strip()
+        if not line or ":" not in line:
+            continue
+        role, content = line.split(":", 1)
+        # Map roles to valid API roles
+        role = role.strip().upper()
+        if role == "USER_MESSAGE":
+            messages.append({"role": "user", "content": content.strip()})
+        elif role in ("AGENT_MESSAGE", "ASSISTANT_MESSAGE", "AI_MESSAGE"):
+            messages.append({"role": "assistant", "content": content.strip()})
+        # Lines with unrecognized roles are skipped
+    return messages
+
+async def get_completion_openai(client: OpenAI, model: str, system_prompt: str, test_question: str, context: str) -> Any:
+    messages = [
+        {"role": "system", "content": system_prompt},
+    ]
+
+    # Parse and add context messages
+    messages.extend(parse_context(context))
+
+    messages.append({"role": "user", "content": test_question})
+
+    response = await asyncio.to_thread(
+        client.chat.completions.create,
+        model=model,
+        messages=messages,
+        tools=[web_search_tool],
+    )
+    if print_responses:
+        print("OpenAI API response:", response)
+    return response
+
+async def get_completion_anthropic(client: anthropic.Anthropic, model: str, system_prompt: str, test_question: str, context: str) -> Any:
+    messages = parse_context(context)
+
+    messages.append({"role": "user", "content": test_question})
+
+    response = await asyncio.to_thread(
+        client.messages.create,
+        model=model,
+        max_tokens=1024,
+        temperature=1,
+        system=system_prompt,
+        tools=[web_search_tool_anthropic],
+        messages=messages,
+    )
+    if print_responses:
+        print("Anthropic API response:", response)
+    return response
+
+def evaluate_web_search(response: Any, api_source: str) -> bool:
+    web_search_performed = False
+    if api_source == "OpenAI":
+        for choice in response.choices:
+            if choice.message.tool_calls:
+                for tool_call in choice.message.tool_calls:
+                    if tool_call.function.name == 'web_search':
+                        web_search_performed = True
+                        break
+            if web_search_performed:
+                break
+    elif api_source == "Anthropic":
+        if hasattr(response, 'content') and isinstance(response.content, list):
+            for block in response.content:
+                if isinstance(block, anthropic.types.ToolUseBlock) and block.name == 'web_search':
+                    web_search_performed = True
+                    break
+    return web_search_performed
+
+async def process_test_case(openai_client: OpenAI, anthropic_client: anthropic.Anthropic,
+                            test_model: str, test_system_prompt: str, api_source: str,
+                            test_case: TestCase) -> Tuple[bool, bool, str, float, str, str]:
+    try:
+        if api_source == "OpenAI":
+            response = await get_completion_openai(openai_client, test_model, test_system_prompt, test_case.test_question, test_case.context)
+            message_text = response.choices[0].message.content if response.choices else ""
+        elif api_source == "Anthropic":
+            response = await get_completion_anthropic(anthropic_client, test_model, test_system_prompt, test_case.test_question, test_case.context)
+            message_text = ""
+            if hasattr(response, 'content') and isinstance(response.content, list):
+                message_text = "".join(block.text for block in response.content if isinstance(block, 
anthropic.types.TextBlock)) + else: + raise ValueError(f"Unsupported API source: {api_source}") + + web_search_performed = evaluate_web_search(response, api_source) + search_query = extract_search_query(response, api_source) + result = get_result(web_search_performed, test_case.web_search_expected) + + if print_responses: + print(f"API Response: {response}") + print(f"Message Text: {message_text}") + print(f"Search Query: {search_query}") + print(f"{result}, {test_case.score:.2f}, {test_case.web_search_expected}, {web_search_performed}, \"{test_case.test_question}\", \"{test_case.context}\"") + + return web_search_performed, test_case.web_search_expected, test_case.test_question, test_case.score, message_text, search_query + except Exception as e: + print(f"Error processing question: {test_case.test_question}") + print(f"Error details: {str(e)}") + print(f"Context: {test_case.context}") + return False, test_case.web_search_expected, test_case.test_question, 0.0, "", "" + +def extract_search_query(response: Any, api_source: str) -> str: + if api_source == "OpenAI": + for choice in response.choices: + if choice.message.tool_calls: + for tool_call in choice.message.tool_calls: + if tool_call.function.name == 'web_search': + return json.loads(tool_call.function.arguments).get('query', '') + elif api_source == "Anthropic": + if hasattr(response, 'content') and isinstance(response.content, list): + for block in response.content: + if isinstance(block, anthropic.types.ToolUseBlock) and block.name == 'web_search': + return block.input.get('query', '') + return "" + +def get_result(performed: bool, expected: bool) -> str: + if performed and expected: + return "TP" + elif performed and not expected: + return "FP" + elif not performed and not expected: + return "TN" + else: + return "FN" + +async def web_search_unit_test(openai_client: OpenAI, anthropic_client: anthropic.Anthropic, + test_model: str, test_system_prompt: str, api_source: str, comments: str) -> Tuple[DetailedTestResult, List[Tuple[bool, bool, str, float, str, str]]]: + test_cases = load_test_cases() + tasks = [process_test_case(openai_client, anthropic_client, test_model, test_system_prompt, api_source, test_case) + for test_case in test_cases] + results = await asyncio.gather(*tasks) + + score_level_stats = {score: {"TP": 0, "FP": 0, "TN": 0, "FN": 0, "total": 0} for score in range(-5, 6)} + + for performed, expected, _, score, _, _ in results: + rounded_score = round(score) + result = get_result(performed, expected) + score_level_stats[rounded_score][result] += 1 + score_level_stats[rounded_score]["total"] += 1 + + true_positives = sum(stats["TP"] for stats in score_level_stats.values()) + false_positives = sum(stats["FP"] for stats in score_level_stats.values()) + true_negatives = sum(stats["TN"] for stats in score_level_stats.values()) + false_negatives = sum(stats["FN"] for stats in score_level_stats.values()) + + total_cases = len(test_cases) + precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0 + recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0 + success_rate = (true_positives + true_negatives) / total_cases + + for score, stats in score_level_stats.items(): + tp, fp, tn, fn = stats['TP'], stats['FP'], stats['TN'], stats['FN'] + stats['precision'], stats['recall'], stats['success_rate'] = calculate_stats(tp, fp, tn, fn) + + detailed_result = DetailedTestResult( + model=test_model, + api_source=api_source, + 
precision=precision, + recall=recall, + success_rate=success_rate, + true_positives=true_positives, + false_positives=false_positives, + true_negatives=true_negatives, + false_negatives=false_negatives, + total_cases=total_cases, + comments=comments, + score_level_stats=score_level_stats + ) + + print_score_level_stats(score_level_stats) + write_results_to_csv(results, detailed_result) + + return detailed_result, results + +def calculate_stats(tp, fp, tn, fn): + total = tp + fp + tn + fn + precision = tp / (tp + fp) if (tp + fp) > 0 else 0 + recall = tp / (tp + fn) if (tp + fn) > 0 else 0 + success_rate = (tp + tn) / total if total > 0 else 0 + return precision, recall, success_rate + +def print_score_level_stats(score_level_stats): + print("\nScore Level Statistics:") + print("Score | Total | TP | FP | TN | FN | Precision | Recall | Success Rate") + print("-" * 70) + for score in range(-5, 6): + stats = score_level_stats[score] + tp, fp, tn, fn = stats['TP'], stats['FP'], stats['TN'], stats['FN'] + precision, recall, success_rate = calculate_stats(tp, fp, tn, fn) + print(f"{score:5d} | {stats['total']:5d} | {tp:2d} | {fp:2d} | {tn:2d} | {fn:2d} | {precision:9.2%} | {recall:6.2%} | {success_rate:11.2%}") + +def write_results_to_csv(results, detailed_result: DetailedTestResult): + timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") + filename = f"web_search/results/web_search_results_{timestamp}.csv" + os.makedirs(os.path.dirname(filename), exist_ok=True) + + with open(filename, 'w', newline='') as csvfile: + csvwriter = csv.writer(csvfile, quoting=csv.QUOTE_MINIMAL) + + # Write the results log at the top of the CSV + csvwriter.writerow(["Web search unit test results"]) + csvwriter.writerow([f"Model: {detailed_result.model}"]) + csvwriter.writerow([f"API Source: {detailed_result.api_source}"]) + + # Add the web search tool description + web_search_description = web_search_tool["function"]["description"] if detailed_result.api_source == "OpenAI" else web_search_tool_anthropic["description"] + csvwriter.writerow(["Web Search Tool Description:"]) + # Split the description into multiple rows if it's too long + for line in web_search_description.split(". 
"): + csvwriter.writerow([line.strip() + "."]) + + csvwriter.writerow([f"\nWeb search unit test results for model {detailed_result.model} ({detailed_result.api_source}):"]) + csvwriter.writerow([f"Comments: {detailed_result.comments}"]) + csvwriter.writerow([f"Overall Precision: {detailed_result.precision:.2%} ({detailed_result.true_positives}/{detailed_result.true_positives + detailed_result.false_positives})"]) + csvwriter.writerow([f"Overall Recall: {detailed_result.recall:.2%} ({detailed_result.true_positives}/{detailed_result.true_positives + detailed_result.false_negatives})"]) + csvwriter.writerow([f"Overall Success rate: {detailed_result.success_rate:.2%} ({detailed_result.true_positives + detailed_result.true_negatives}/{detailed_result.total_cases})"]) + csvwriter.writerow([f"Results written to {filename}"]) + csvwriter.writerow([]) # Empty row for separation + + # Update the score level statistics table + csvwriter.writerow(["Score Level Statistics"]) + csvwriter.writerow(["Score", "Total", "TP", "FP", "TN", "FN", "Precision", "Recall", "Success Rate"]) + for score in range(-5, 6): + stats = detailed_result.score_level_stats[score] + tp, fp, tn, fn = stats['TP'], stats['FP'], stats['TN'], stats['FN'] + precision, recall, success_rate = calculate_stats(tp, fp, tn, fn) + csvwriter.writerow([ + score, + stats['total'], + tp, + fp, + tn, + fn, + f"{precision:.2%}", + f"{recall:.2%}", + f"{success_rate:.2%}" + ]) + + csvwriter.writerow([]) # Empty row for separation + + # Write the column headers for individual results + csvwriter.writerow(["result", "score", "web_search_expected", "web_search_performed", "test_question", "api_response", "search_query", "context", "model", "api_source", "comments"]) + + # Write the individual results + for performed, expected, question, score, api_response, search_query in results: + result = get_result(performed, expected) + test_case = next((tc for tc in load_test_cases() if tc.test_question == question), None) + if test_case: + csvwriter.writerow([ + result, f"{score:.2f}", expected, performed, question, + api_response, search_query, test_case.context, + detailed_result.model, detailed_result.api_source, + detailed_result.comments + ]) + else: + print(f"Warning: Test case not found for question: {question}") + + print(f"Results written to {filename}") + +def fetch_sample_results(results, score_level_stats): + while True: + user_input = input("Enter score and result type to fetch samples (e.g., '3, FP' or 'done' to finish): ") + if user_input.lower() == 'done': + break + + try: + score, result_type = user_input.split(',') + score = int(score.strip()) + result_type = result_type.strip().upper() + + if score not in range(-5, 6) or result_type not in ['TP', 'FP', 'TN', 'FN']: + print("Invalid input. Score should be between -5 and 5, and result type should be TP, FP, TN, or FN.") + continue + + samples = [r for r in results if round(r[3]) == score and get_result(r[0], r[1]) == result_type][:20] + + print(f"\nSamples for score {score}, result type {result_type}:") + for i, (performed, expected, question, exact_score, message_text, search_query) in enumerate(samples, 1): + print(f"{i}. Question: {question}") + print(f" Expected: {expected}, Performed: {performed}, Exact Score: {exact_score:.2f}") + print(f" Search Query: {search_query}") + print(f" API Response (Message Text): {message_text[:200]}...") # Print first 200 characters of message text + print(f"Total samples: {len(samples)}") + + except ValueError: + print("Invalid input format. 
Please use 'score, result_type' format.") + +async def main(): + set_print_responses() + available_models = load_available_models() + + while True: + selected_combinations = select_models(available_models) + + for test_model, prompt_dict in selected_combinations: + test_system_prompt = prompt_dict["prompt"] + api_source = prompt_dict["api"] + comments = prompt_dict["comments"] + if api_source == "OpenAI" and openai_client is None: + print(f"Skipping test for {test_model} as OpenAI client is not initialized.") + continue + if api_source == "Anthropic" and anthropic_client is None: + print(f"Skipping test for {test_model} as Anthropic client is not initialized.") + continue + + detailed_result, results = await web_search_unit_test(openai_client, anthropic_client, test_model, test_system_prompt, api_source, comments) + + # Print the detailed results + print(f"\nWeb search unit test results for model {detailed_result.model} ({detailed_result.api_source}):") + print(f"Comments: {detailed_result.comments}") + print(f"Precision: {detailed_result.precision:.2%} ({detailed_result.true_positives}/{detailed_result.true_positives + detailed_result.false_positives})") + print(f"Recall: {detailed_result.recall:.2%} ({detailed_result.true_positives}/{detailed_result.true_positives + detailed_result.false_negatives})") + print(f"Success rate: {detailed_result.success_rate:.2%} ({detailed_result.true_positives + detailed_result.true_negatives}/{detailed_result.total_cases})") + + fetch_samples = input("Would you like to fetch sample results? (y/n): ").lower() + if fetch_samples == 'y': + fetch_sample_results(results, detailed_result.score_level_stats) + + run_again = input("\nWould you like to run another test with different models? (y/n): ").lower() + if run_again != 'y': + print("Thank you for using the web search unit test tool. Goodbye!") + break + +if __name__ == "__main__": + asyncio.run(main()) \ No newline at end of file