
Evaluation of LLM-based ChatBots

Overview

Business Problem

LLMs can produce sophisticated, coherent, and persuasive language on almost any topic from their general knowledge, but most real-world applications require them to answer questions grounded in the context given in the prompt and in their prior training.

Customers who develop and use applications such as ChatBots for answering product- or service-related questions need to evaluate the performance of these applications from both business and data science perspectives in order to enhance them further.

The area of LLMs assessing LLM-based applications is new but evolving quickly; however, there is no common framework or platform for conducting evaluations at a large scale for retail/commerce ChatBots.

Business Value

This solution (along with the working codebase) and the related pattern here provide a high-level approach to understanding LLM evaluations, including the data model, the data pipelines that process the input data, and the data store that holds the evaluation metrics for self-service visualization. The solution can be a catalyst for e-commerce/retail chatbot use cases, but it can also be adapted to other domains with minor changes based on their needs.

Likewise, the solution can be used as an example asset, both in ISE and in MCAPS more generally, to show the relevance and usefulness of LLM evaluations across domains, their advantages for business and data science, and the adaptability and scalability of the solution.

Features

The asset is a summary of learnings extracted from the implementation of a “Conversational commerce chatbot Evaluation Framework” for a large e-commerce customer based in India.

This asset includes a reference frontend UI that simulates a chat interface and generates conversations, which are then persisted in the data stores for further evaluation. The reference implementation has the following features:

NOTE: Several features may be reusable as-is; however, modifications or extensions may be needed depending on the use case.

  • Data Model for LLM Evaluation
  • Data Transformation Pipeline for processing application logs
  • LLM Evaluation Pipeline for generating evaluation metrics at scale
  • Visualization of evaluation metrics in Power BI
  • Automation of infrastructure deployment using IaC

PostProd Evaluation

Architecture (PostProd Evaluation)

The architecture of the solution is as follows:

[Architecture diagram]

The architecture consists of the following components:

  1. ChatBot - The LLM-based application that handles the chat conversations.
  2. Log Storage - The storage for the chat logs, generated by the ChatBot application. This contains the raw chat logs for each turn in the conversation and the LLM calls made by the ChatBot.
  3. Transformation Pipeline - The pipeline that processes the raw chat logs and generates the data model for the evaluation.
  4. Evaluation Data Store - The storage for the processed chat logs, containing the data model for the evaluation. The data will be stored as Parquet files.
  5. Evaluation Pipeline - The pipeline that processes the data model and generates the evaluation metrics by calling the LLM. In this step, Prompt Flow is used to generate the evaluation metrics at scale (a minimal sketch of this grading step appears after this list).
  6. Evaluation Metrics Store - The storage for the evaluation metrics generated by the Evaluation Pipeline. The data store will be a structured store, such as a SQL database.
  7. Visualization - The visualization tool used to visualize the evaluation metrics. This could be a BI tool, such as Power BI.
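
To make the grading step concrete, the sketch below shows one way such an evaluation step could read processed turns from the Evaluation Data Store, grade each turn with an LLM, and emit metric rows. The Parquet paths, column names (`turn_id`, `question`, `answer`), grading prompt, metric name, and deployment name are illustrative assumptions, not the framework's actual contract; in the real solution this step runs as an Azure ML pipeline using Prompt Flow.

```python
# Minimal sketch of an evaluation step: read processed turns from the
# Evaluation Data Store (Parquet), grade each turn with an LLM, and
# collect metric rows for the Evaluation Metrics Store.
# Column names, the prompt, and the metric name are illustrative assumptions.
import os

import pandas as pd
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

GRADER_PROMPT = (
    "You are grading a retail chatbot. Given the user question and the "
    "bot answer, rate the answer's relevance from 1 (irrelevant) to 5 "
    "(fully relevant). Reply with the number only.\n\n"
    "Question: {question}\nAnswer: {answer}"
)

def grade_turn(question: str, answer: str) -> int:
    """Ask the LLM to score one conversation turn for relevance."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed Azure OpenAI deployment name
        messages=[{
            "role": "user",
            "content": GRADER_PROMPT.format(question=question, answer=answer),
        }],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())

turns = pd.read_parquet("evaluation_data_store/turns.parquet")
metrics = pd.DataFrame({
    "turn_id": turns["turn_id"],
    "metric_name": "relevance",
    "metric_value": [
        grade_turn(q, a) for q, a in zip(turns["question"], turns["answer"])
    ],
})
# In the framework these rows land in the SQL metrics store; a Parquet
# drop is enough for the sketch.
metrics.to_parquet("evaluation_metrics/relevance.parquet")
```

In the framework itself, Prompt Flow orchestrates these LLM calls in batch rather than in a plain loop, which is what makes this step practical at scale.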

Getting Started

QuickStart (PostProd Evaluation)

Refer to the QuickStart guide to get started with the solution.

Prerequisites

  • Azure Subscription
  • Development Environment
    • Azure CLI
    • Python 3.11
    • Conda
    • Visual Studio Code
    • Power BI Desktop (for visualization)

Infrastructure Deployment

The infrastructure for the solution can be deployed using the provided Infrastructure as Code (IaC) templates. For more details, refer to the Infrastructure Deployment guide.

Framework Deployment

The framework can be deployed using the provided deployment scripts. For more details, refer to the Framework Deployment guide.

Demo

A quick demo of the solution can be found in the Demo guide.

Understanding the Framework

The framework is designed to be modular and extensible. It consists of the following components:

  1. Data Model - The data model for the evaluation, containing the chat logs and the evaluation metrics (a sketch of its core entities appears after this list).
  2. Azure SQL - The Azure SQL scripts for creating the database and tables for storing the evaluation metrics as per the data model.
  3. Sample Chatbot - A sample chatbot application that generates the chat logs for demonstration purposes.
  4. Framework Source - The source code for the framework, containing Python modules for the data transformation pipeline and the evaluation pipeline.
  5. Framework Deployment - The deployment and execution scripts for the framework, containing the Azure ML pipelines for the data transformation and evaluation pipelines.
  6. Dashboards - The Power BI dashboard for visualizing the evaluation metrics.
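
To give a feel for the shape of the data model, here is a minimal sketch of the kinds of entities it covers. The entity and field names below are illustrative assumptions; the authoritative schema is defined by the Data Model documentation and the Azure SQL scripts.

```python
# Illustrative sketch of the evaluation data model's core entities.
# Field names are assumptions; the actual schema is defined by the
# framework's Data Model docs and Azure SQL scripts.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Turn:
    """One user/bot exchange in a conversation, parsed from the raw chat logs."""
    conversation_id: str
    turn_id: str
    timestamp: datetime
    user_message: str
    bot_response: str

@dataclass
class EvaluationMetric:
    """One metric value produced by the Evaluation Pipeline for a single turn."""
    turn_id: str
    metric_name: str      # e.g. "relevance" or "groundedness"
    metric_value: float   # score assigned by the LLM grader
    evaluated_at: datetime
```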

NOTE: More detailed information about the framework can be found in the Developer Guide.

Extending the Framework

The framework can be extended in several ways to support different use cases. For more details, refer to the Extending the Framework guide.
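
As one flavor of extension, the sketch below adds a hypothetical custom metric. The evaluator signature and the registry are invented for illustration only; the actual extension points are documented in the Extending the Framework guide.

```python
# Hypothetical example of adding a custom metric to the evaluation pipeline.
# The evaluator signature and registry are invented for illustration; consult
# the Extending the Framework guide for the real extension points.
def answer_length(user_message: str, bot_response: str) -> float:
    """Toy metric: length of the bot response in words."""
    return float(len(bot_response.split()))

# A registry mapping metric names to evaluator callables is one simple way
# a pipeline could discover custom metrics alongside the LLM-graded ones.
CUSTOM_METRICS = {"answer_length": answer_length}
```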

Resources

PreProd Evaluation

Go to PreProd Evaluation for more details.
