This repository contains notebooks and example code used in the talk "What Comes After Coding: Evaluating Agentic Behaviour", presented by Alfonso Roa at the Databricks User Group Meetup in Madrid (hosted at the Repsol Campus).
- Slides of the presentation (in Spanish).
The notebooks demonstrate how to:
- Build a clickbait detector agent using LangGraph and DSPy.
- Track and evaluate agentic behavior using MLflow, implementing custom scorers and judges.
The notebooks focus not only on building agents but also on measuring their correctness, reliability, and alignment—key concerns for production-grade AI systems.
To import the notebooks into a Databricks workspace:
1. Copy one of the notebook links.
2. In the workspace browser, navigate to the location where you want to import the notebook.
3. Right-click the folder and select Import from the menu.
4. Select the URL option and paste the link you just copied into the field.
5. Click Import. The notebook will be imported and opened automatically.

More detailed instructions are available in the Databricks Docs.
The notebooks include step-by-step instructions, making it easy to follow along and understand the configuration and usage. This section provides a high-level overview of the key prerequisites needed to run the agent examples successfully.
Each notebook includes global parameters for:
- Unity Catalog location: defines where models, prompts, and artifacts are stored
- Model serving endpoint: used for invoking LLMs via Databricks Model Serving
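As a point of reference, a minimal sketch of how such parameters might be exposed as Databricks notebook widgets is shown below; the widget names and default values are illustrative, not the notebooks' exact settings:

```python
# Illustrative parameter cell (assumes a Databricks notebook where `dbutils` is available).
# Widget names and defaults are placeholders; the notebooks define their own.
dbutils.widgets.text("uc_catalog", "main", "Unity Catalog catalog")
dbutils.widgets.text("uc_schema", "clickbait_agents", "Unity Catalog schema")
dbutils.widgets.text("llm_endpoint", "databricks-meta-llama-3-3-70b-instruct", "Model serving endpoint")

UC_LOCATION = f"{dbutils.widgets.get('uc_catalog')}.{dbutils.widgets.get('uc_schema')}"
LLM_ENDPOINT = dbutils.widgets.get("llm_endpoint")
```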
The agents use the Jina AI Reader API to extract titles and content from webpages. This functionality enables the agent to process live URLs provided by the user.
- A free API key is required, which can optionally be stored securely using Databricks Secrets.
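As a rough sketch (not the notebooks' exact code), fetching a page through the Jina AI Reader API with the key read from a Databricks secret could look like this; the secret scope and key names are placeholders:

```python
import requests

# Assumes a Databricks notebook where `dbutils` is available; the secret scope/key
# names are placeholders for whatever you configured in your workspace.
JINA_API_KEY = dbutils.secrets.get(scope="clickbait-demo", key="jina-api-key")

def read_webpage(url: str) -> str:
    """Return the page title and main content as markdown via the Jina AI Reader."""
    response = requests.get(
        f"https://r.jina.ai/{url}",
        headers={"Authorization": f"Bearer {JINA_API_KEY}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.text
```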
To evaluate the agent’s performance, a labeled dataset of clickbait vs. non-clickbait headlines is used. You can download the dataset from Kaggle: clickbait dataset.
- LangGraph version: Clickbait agent - LangGraph.ipynb
- DSPy version: Clickbait agent - DSPy.ipynb
This project uses the new MLflow 3.x GenAI API, specifically the evaluate method, to evaluate agentic behavior in a structured and extensible way.
We define lightweight functional wrappers around the full agent or around individual nodes (e.g., the classifier or the rewriter). This allows specific parts of the agent to be evaluated independently. These wrappers are passed to the `predict_fn` parameter of `mlflow.genai.evaluate()`.
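A minimal sketch of this pattern follows; `agent`, its state keys, and the example headlines are assumptions for illustration, not the notebooks' exact objects:

```python
import mlflow
from mlflow.genai.scorers import scorer

# `agent` stands in for the compiled LangGraph graph or DSPy module built earlier
# in the notebook; its invocation and state keys are assumed for illustration.
def classify_title(title: str) -> dict:
    result = agent.invoke({"title": title})
    return {"is_clickbait": result["is_clickbait"]}

# Simple deterministic scorer comparing the prediction against the labeled dataset.
@scorer
def label_matches(outputs, expectations) -> bool:
    return outputs["is_clickbait"] == expectations["is_clickbait"]

eval_data = [
    {"inputs": {"title": "You won't BELIEVE what happened next"},
     "expectations": {"is_clickbait": True}},
    {"inputs": {"title": "Central bank raises interest rates by 0.25%"},
     "expectations": {"is_clickbait": False}},
]

results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=classify_title,  # swap in a wrapper around the full agent to evaluate end to end
    scorers=[label_matches],
)
```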
Custom scorers are implemented in a way that is agnostic to the underlying agent framework (LangGraph or DSPy). This enables reuse, portability, and clean separation between agent logic and evaluation logic.
As an example, the clickbait classification logic is reused as a custom judge. After the agent rewrites a clickbait title, the classifier is invoked again to verify whether the new version is still clickbait.
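A hedged sketch of that judge, assuming a `classifier_node` handle to the agent's classification step and a `rewritten_title` field in the wrapper's output:

```python
from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer

# `classifier_node` is an assumed callable wrapping the agent's classification step
# (a LangGraph node or DSPy module); the output field name is also an assumption.
@scorer
def rewrite_is_not_clickbait(outputs) -> Feedback:
    verdict = classifier_node(outputs["rewritten_title"])
    return Feedback(
        value=not verdict["is_clickbait"],
        rationale="Rewritten title re-checked with the agent's own clickbait classifier.",
    )
```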
The notebooks also integrate MLflow’s built-in judges to evaluate agent output based on qualitative rules:
- 🛡 Safety — checks for harmful, unethical, or inappropriate content
- 📏 Guidelines — enforces custom constraints defined in natural language
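A sketch of how these built-in judges can be combined with a wrapper around the rewriter step; the guideline text, example data, and wrapper are illustrative, and the judges require an LLM judge endpoint available in the workspace:

```python
import mlflow
from mlflow.genai.scorers import Guidelines, Safety

# Illustrative wrapper around the rewriter step; `agent` and its keys are assumptions.
def rewrite_title(title: str) -> dict:
    result = agent.invoke({"title": title})
    return {"rewritten_title": result.get("rewritten_title", title)}

results = mlflow.genai.evaluate(
    data=[{"inputs": {"title": "10 secrets doctors don't want you to know"}}],
    predict_fn=rewrite_title,
    scorers=[
        Safety(),
        Guidelines(
            name="no_sensationalism",
            guidelines="The rewritten title must be factual and must not use sensationalist language.",
        ),
    ],
)
```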
This setup supports unit-like testing, integration-style validation, and property-based evaluation, all within a consistent MLflow workflow.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.