This repository contains notebooks and example code used in the talk "What Comes After Coding: Evaluating Agentic Behaviour", presented by Alfonso Roa at the Databricks User Group Meetup in Madrid (hosted at the Repsol Campus).
- Slides of the presentation (in Spanish).
The notebooks demonstrate how to:
- Build a clickbait detector agent using LangGraph and DSPy.
- Track and evaluate agentic behavior using MLflow, implementing custom scorers and judges.
The notebooks focus not only on building agents but also on measuring their correctness, reliability, and alignment—key concerns for production-grade AI systems.
To import the notebooks into a Databricks workspace:
1. Copy one of the notebook links.
2. In the workspace browser, navigate to the location where you want to import the notebook.
3. Right-click the folder and select Import from the menu.
4. Select the URL option and paste the link you just copied into the field.
5. Click Import. The notebook will be imported and opened automatically.

More detailed instructions are available in the Databricks Docs.
The notebooks include step-by-step instructions, making it easy to follow along and understand the configuration and usage. This section provides a high-level overview of the key prerequisites needed to run the agent examples successfully.
Each notebook includes global parameters for:
- Unity Catalog location: defines where models, prompts, and artifacts are stored
- Model serving endpoint: used for invoking LLMs via Databricks Model Serving
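As a point of reference, a minimal sketch of how such parameters might be exposed as Databricks notebook widgets is shown below; the widget names and default values are illustrative, not the notebooks' exact settings:

```python
# Illustrative parameter cell (assumes a Databricks notebook where `dbutils` is available).
# Widget names and defaults are placeholders; the notebooks define their own.
dbutils.widgets.text("uc_catalog", "main", "Unity Catalog catalog")
dbutils.widgets.text("uc_schema", "clickbait_agents", "Unity Catalog schema")
dbutils.widgets.text("llm_endpoint", "databricks-meta-llama-3-3-70b-instruct", "Model serving endpoint")

UC_LOCATION = f"{dbutils.widgets.get('uc_catalog')}.{dbutils.widgets.get('uc_schema')}"
LLM_ENDPOINT = dbutils.widgets.get("llm_endpoint")
```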
The agents use the Jina AI Reader API to extract titles and content from webpages. This functionality enables the agent to process live URLs provided by the user.
- A free API key is required, which can optionally be stored securely using Databricks Secrets.
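As a rough sketch (not the notebooks' exact code), fetching a page through the Jina AI Reader API with the key read from a Databricks secret could look like this; the secret scope and key names are placeholders:

```python
import requests

# Assumes a Databricks notebook where `dbutils` is available; the secret scope/key
# names are placeholders for whatever you configured in your workspace.
JINA_API_KEY = dbutils.secrets.get(scope="clickbait-demo", key="jina-api-key")

def read_webpage(url: str) -> str:
    """Return the page title and main content as markdown via the Jina AI Reader."""
    response = requests.get(
        f"https://r.jina.ai/{url}",
        headers={"Authorization": f"Bearer {JINA_API_KEY}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.text
```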
To evaluate the agent’s performance, a labeled dataset of clickbait vs. non-clickbait headlines is used. You can download the dataset from Kaggle: clickbait dataset.
- LangGraph version: Clickbait agent - LangGraph.ipynb
- DSPy version: Clickbait agent - DSPy.ipynb
This project uses the new MLflow 3.x GenAI API, specifically the evaluate method, to evaluate agentic behavior in a structured and extensible way.
We define lightweight functional wrappers around the full agent or around individual nodes (e.g., the classifier or the rewriter). This allows specific parts of the agent to be evaluated independently. These wrappers are passed to the `predict_fn` parameter of `mlflow.genai.evaluate()`.
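A minimal sketch of this pattern follows; `agent`, its state keys, and the example headlines are assumptions for illustration, not the notebooks' exact objects:

```python
import mlflow
from mlflow.genai.scorers import scorer

# `agent` stands in for the compiled LangGraph graph or DSPy module built earlier
# in the notebook; its invocation and state keys are assumed for illustration.
def classify_title(title: str) -> dict:
    result = agent.invoke({"title": title})
    return {"is_clickbait": result["is_clickbait"]}

# Simple deterministic scorer comparing the prediction against the labeled dataset.
@scorer
def label_matches(outputs, expectations) -> bool:
    return outputs["is_clickbait"] == expectations["is_clickbait"]

eval_data = [
    {"inputs": {"title": "You won't BELIEVE what happened next"},
     "expectations": {"is_clickbait": True}},
    {"inputs": {"title": "Central bank raises interest rates by 0.25%"},
     "expectations": {"is_clickbait": False}},
]

results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=classify_title,  # swap in a wrapper around the full agent to evaluate end to end
    scorers=[label_matches],
)
```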
Custom scorers are implemented in a way that is agnostic to the underlying agent framework (LangGraph or DSPy). This enables reuse, portability, and clean separation between agent logic and evaluation logic.
As an example, the clickbait classification logic is reused as a custom judge. After the agent rewrites a clickbait title, the classifier is invoked again to verify whether the new version is still clickbait.
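A hedged sketch of that judge, assuming a `classifier_node` handle to the agent's classification step and a `rewritten_title` field in the wrapper's output:

```python
from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer

# `classifier_node` is an assumed callable wrapping the agent's classification step
# (a LangGraph node or DSPy module); the output field name is also an assumption.
@scorer
def rewrite_is_not_clickbait(outputs) -> Feedback:
    verdict = classifier_node(outputs["rewritten_title"])
    return Feedback(
        value=not verdict["is_clickbait"],
        rationale="Rewritten title re-checked with the agent's own clickbait classifier.",
    )
```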
The notebooks also integrate MLflow’s built-in judges to evaluate agent output based on qualitative rules:
- 🛡 Safety — checks for harmful, unethical, or inappropriate content
- 📏 Guidelines — enforces custom constraints defined in natural language
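A sketch of how these built-in judges can be combined with a wrapper around the rewriter step; the guideline text, example data, and wrapper are illustrative, and the judges require an LLM judge endpoint available in the workspace:

```python
import mlflow
from mlflow.genai.scorers import Guidelines, Safety

# Illustrative wrapper around the rewriter step; `agent` and its keys are assumptions.
def rewrite_title(title: str) -> dict:
    result = agent.invoke({"title": title})
    return {"rewritten_title": result.get("rewritten_title", title)}

results = mlflow.genai.evaluate(
    data=[{"inputs": {"title": "10 secrets doctors don't want you to know"}}],
    predict_fn=rewrite_title,
    scorers=[
        Safety(),
        Guidelines(
            name="no_sensationalism",
            guidelines="The rewritten title must be factual and must not use sensationalist language.",
        ),
    ],
)
```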
This setup supports unit-like testing, integration-style validation, and property-based evaluation, all within a consistent MLflow workflow.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.