Scenario is a library for testing agents end-to-end as a human would, but without having to manually do it. The automated testing agent covers every single scenario for you.
You define the scenarios, and the testing agent will simulate your users as it follows them, it will keep chatting and evaluating your agent until it reaches the desired goal or detects an unexpected behavior.
Install pytest and scenario:
pip install pytest langwatch-scenarioNow create your first scenario and save it as tests/test_vegetarian_recipe_agent.py:
import pytest
from scenario import Scenario, TestingAgent, scenario_cache
Scenario.configure(testing_agent=TestingAgent(model="openai/gpt-4o-mini"))
@pytest.mark.agent_test
@pytest.mark.asyncio
async def test_vegetarian_recipe_agent():
agent = VegetarianRecipeAgent()
def vegetarian_recipe_agent(message, context):
# Call your agent here
return agent.run(message)
# Define the scenario
scenario = Scenario(
"User is looking for a dinner idea",
agent=vegetarian_recipe_agent,
success_criteria=[
"Recipe agent generates a vegetarian recipe",
"Recipe includes a list of ingredients",
"Recipe includes step-by-step cooking instructions",
],
failure_criteria=[
"The recipe is not vegetarian or includes meat",
"The agent asks more than two follow-up questions",
],
)
# Run the scenario and get results
result = await scenario.run()
# Assert for pytest to know whether the test passed
assert result.success
# Example agent implementation
import litellm
class VegetarianRecipeAgent:
def __init__(self):
self.history = []
@scenario_cache()
def run(self, message: str):
self.history.append({"role": "user", "content": message})
response = litellm.completion(
model="openai/gpt-4o-mini",
messages=[
{
"role": "system",
"content": """You are a vegetarian recipe agent.
Given the user request, ask AT MOST ONE follow-up question,
then provide a complete recipe. Keep your responses concise and focused.""",
},
*self.history,
],
)
message = response.choices[0].message # type: ignore
self.history.append(message)
return {"messages": [message]}Create a .env file and put your OpenAI API key in it:
OPENAI_API_KEY=<your-api-key>Now run it with pytest:
pytest -s tests/test_vegetarian_recipe_agent.pyThis is how it will look like:
You can find a fully working example in examples/test_vegetarian_recipe_agent.py.
You can customize how should the testing agent go about testing by defining a strategy field. You can also limit the maximum number of turns the scenario will take by setting the max_turns field (defaults to 10).
For example, in this Lovable Clone scenario test:
scenario = Scenario(
"user wants to create a new landing page for their dog walking startup",
agent=lovable_agent,
strategy="send the first message to generate the landing page, then a single follow up request to extend it, then give your final verdict",
success_criteria=[
"agent reads the files before go and making changes",
"agent modified the index.css file",
"agent modified the Index.tsx file",
"agent created a comprehensive landing page",
"agent extended the landing page with a new section",
],
failure_criteria=[
"agent says it can't read the file",
"agent produces incomplete code or is too lazy to finish",
],
max_turns=5,
)
result = await scenario.run()You can find a fully working Lovable Clone example in examples/test_lovable_clone.py.
You can enable debug mode by setting the debug field to True in the Scenario.configure method or in the specific scenario you are running.
Debug mode allows you to see the messages in slow motion step by step, and intervene with your own inputs to debug your agent from the middle of the conversation.
Scenario.configure(testing_agent=TestingAgent(model="openai/gpt-4o-mini"), debug=True)Each time the scenario runs, the testing agent might chose a different input to start, this is good to make sure it covers the variance of real users as well, however we understand that the non-deterministic nature of it might make it less repeatable, costly and harder to debug. To solve for it, you can use the cache_key field in the Scenario.configure method or in the specific scenario you are running, this will make the testing agent give the same input for given the same scenario:
Scenario.configure(testing_agent=TestingAgent(model="openai/gpt-4o-mini"), cache_key="42")To bust the cache, you can simply pass a different cache_key, disable it, or delete the cache files located at ~/.scenario/cache.
To go a step further and fully cache the test end-to-end, you can also wrap the LLM calls or any other non-deterministic functions in your application side with the @scenario_cache decorator:
class MyAgent:
@scenario_cache(ignore=["self"])
def invoke(self, message, context):
return client.chat.completions.create(
# ...
)This will cache any function call you decorate when running the tests and make them repeatable, hashed by the function arguments, the scenario being executed, and the cache_key you provided. You can exclude arguments that should not be hashed for the cache key by naming them in the ignore argument.
You can remove the -s flag from pytest to hide the output during test, which will only show up if the test fails. Alternatively, you can set verbose=False in the Scenario.configure method or in the specific scenario you are running.
As the number of your scenarios grows, you might want to run them in parallel to speed up your whole test suite. We suggest you to use the pytest-asyncio-concurrent plugin to do so.
Simply install the plugin from the link above, then replace the @pytest.mark.asyncio annotation in the tests with @pytest.mark.asyncio_concurrent, adding a group name to it to mark the group of scenarions that should be run in parallel together, e.g.:
@pytest.mark.agent_test
@pytest.mark.asyncio_concurrent(group="vegetarian_recipe_agent")
async def test_vegetarian_recipe_agent():
# ...
@pytest.mark.agent_test
@pytest.mark.asyncio_concurrent(group="vegetarian_recipe_agent")
async def test_user_is_very_hungry():
# ...Those two scenarios should now run in parallel.
MIT License
