LLMs can produce sophisticated, coherent, and persuasive language on almost any topic from their general knowledge, but most real-world applications require them to answer questions based on the context given in the prompt as well as their prior training.
Customers who develop and use applications such as ChatBots for answering product- or service-related questions need to evaluate the performance of these applications from both business and data science perspectives in order to improve them further.
The area of LLMs assessing LLM-based applications is new but evolving quickly; however, there is no common framework or platform for conducting evaluations at scale for retail/commerce ChatBots.
This solution (along with the working codebase) and the related pattern provide a high-level approach to understanding LLM evaluations, including the data model, the data pipelines that process the input data, and the datastore that holds the evaluation metrics for self-service visualization. The solution can be a catalyst for e-commerce/retail chatbot use cases, and it can also be adapted to other domains with minor changes.
Likewise, the solution can be used as an example asset, both in ISE and in MCAPS more generally, to show the relevance and usefulness of LLM evaluations for different domains, their advantages for business and data science, and the adaptability and scalability of such solutions.
The asset summarizes the learnings extracted from the implementation of a “Conversational commerce chatbot Evaluation Framework” for a large e-commerce customer based in India.
This asset has a reference frontend UI that simulates a chat interface and generates conversations, which are then persisted in the data stores for further evaluation. The reference implementation has the following features:
NOTE: Several features may be reusable as-is; however, modifications or extensions may be needed depending on the use case.
- Data Model for LLM Evaluation
- Data Transformation Pipeline for processing application logs
- LLM Evaluation Pipeline for generating evaluation metrics at scale
- Visualization of evaluation metrics in PowerBI
- Automation of infrastructure deployment using IaC
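To make the persisted conversations concrete, a single raw conversation log produced by the reference ChatBot might look like the following. The field names here are illustrative only; the authoritative schema is the Data Model listed above.

```python
# A hypothetical raw chat log for one conversation. Field names are
# illustrative; the actual schema is defined by the framework's Data Model.
raw_conversation_log = {
    "conversation_id": "conv-0001",
    "turns": [
        {
            "turn_id": 1,
            "timestamp": "2024-01-15T10:32:05Z",
            "user_query": "Do you have this running shoe in size 10?",
            "bot_response": "Yes, size 10 is in stock. Shall I add it to your cart?",
            "retrieved_context": ["SKU 1234: running shoe, sizes 7-12 in stock"],
            "llm_calls": [  # every LLM call the ChatBot made for this turn
                {"model": "gpt-4", "prompt_tokens": 512,
                 "completion_tokens": 48, "latency_ms": 820},
            ],
        },
    ],
}
```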
The architecture of the solution consists of the following components:
- ChatBot - The application that handles the chat conversations, built using LLMs.
- Log Storage - The storage for the chat logs, generated by the ChatBot application. This contains the raw chat logs for each turn in the conversation and the LLM calls made by the ChatBot.
- Transformation Pipeline - The pipeline that processes the raw chat logs and produces the data model for the evaluation (a minimal sketch of this step follows this list).
- Evaluation Data Store - The storage for the processed chat logs, containing the data model for the evaluation. The data will be stored as Parquet files.
- Evaluation Pipeline - The pipeline that processes the data model and generates the evaluation metrics by calling the LLM. In this step, Prompt Flow is used to generate the evaluation metrics at scale (see the sketch after this list).
- Evaluation Metrics Store - The storage for the evaluation metrics generated by the Evaluation Pipeline. The data store will be a structured store, such as a SQL database.
- Visualization - The tool used to visualize the evaluation metrics. This could be a BI tool, such as PowerBI.
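The following is a minimal sketch of the Transformation Pipeline step, assuming pandas (with pyarrow) is available and that raw logs are stored as one JSON file per conversation, shaped like the record sketched earlier. The directory layout and column names are hypothetical:

```python
import json
from pathlib import Path

import pandas as pd  # assumes pandas with pyarrow installed for Parquet support


def transform_logs(log_dir: str, output_path: str) -> None:
    """Flatten raw chat logs (one hypothetical JSON file per conversation)
    into the tabular evaluation data model and persist it as Parquet."""
    rows = []
    for log_file in Path(log_dir).glob("*.json"):
        conversation = json.loads(log_file.read_text())
        for turn in conversation["turns"]:
            rows.append(
                {
                    "conversation_id": conversation["conversation_id"],
                    "turn_id": turn["turn_id"],
                    "user_query": turn["user_query"],
                    "bot_response": turn["bot_response"],
                    "retrieved_context": "\n".join(turn.get("retrieved_context", [])),
                }
            )
    # One row per conversation turn -- the unit at which metrics are computed.
    pd.DataFrame(rows).to_parquet(output_path, index=False)


transform_logs("chat_logs/", "evaluation_data_store/turns.parquet")
```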
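And a minimal sketch of the Evaluation Pipeline step, assuming the Prompt Flow Python SDK (`promptflow`) and that the processed turns have been exported to JSONL (a format Prompt Flow batch runs accept). The flow folder and column names are hypothetical:

```python
from promptflow.client import PFClient  # assumes the promptflow SDK is installed

pf = PFClient()

# Submit a batch run of a (hypothetical) evaluation flow over the processed
# turns; each input row is scored independently, which is what lets the
# evaluation fan out at scale.
run = pf.run(
    flow="flows/groundedness_eval",            # hypothetical evaluation flow folder
    data="evaluation_data_store/turns.jsonl",  # one JSON line per conversation turn
    column_mapping={
        "question": "${data.user_query}",
        "answer": "${data.bot_response}",
        "context": "${data.retrieved_context}",
    },
)

# Retrieve per-row metric outputs, ready to load into the Evaluation Metrics Store.
metrics_df = pf.get_details(run)
print(metrics_df.head())
```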
Refer to the QuickStart guide to get started with the solution. The prerequisites are:
- Azure Subscription
- Development Environment
- Azure CLI
- Python 3.11
- Conda
- Visual Studio Code
- Power BI Desktop (for visualization)
The infrastructure for the solution can be deployed using the provided Infrastructure as Code (IaC) templates. For more details, refer to the Infrastructure Deployment guide.
The framework can be deployed using the provided deployment scripts. For more details, refer to the Framework Deployment guide.
A quick demo of the solution can be found in the Demo guide.
The framework is designed to be modular and extensible and consists of the following components:
- Data Model - The data model for the evaluation, containing the chat logs and the evaluation metrics.
- Azure SQL - The Azure SQL scripts for creating the database and tables for storing the evaluation metrics as per the data model.
- Sample Chatbot - A sample chatbot application that generates the chat logs for demonstration purposes.
- Framework Source - The source code for the framework, containing Python modules for the data transformation pipeline and the evaluation pipeline.
- Framework Deployment - The deployment and execution scripts for the framework, containing the Azure ML pipelines for the data transformation and evaluation pipelines (a submission sketch follows this list).
- Dashboards - The PowerBI dashboard for visualizing the evaluation metrics.
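As an illustration of how the deployment scripts might submit these pipelines, here is a minimal sketch using the Azure ML Python SDK v2; the workspace identifiers and YAML path are placeholders:

```python
from azure.ai.ml import MLClient, load_job
from azure.identity import DefaultAzureCredential

# Connect to the Azure ML workspace (placeholder identifiers).
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Load a pipeline job definition from YAML (hypothetical path) and submit it.
pipeline_job = load_job(source="pipelines/evaluation_pipeline.yml")
submitted = ml_client.jobs.create_or_update(pipeline_job)
print(f"Submitted pipeline job: {submitted.name}")
```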
NOTE: More detailed information about the framework can be found in the Developer Guide.
The framework can be extended in several ways to support different use cases. For more details, refer to the Extending the Framework guide.
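As one illustration of such an extension, a new metric could be added as a self-contained evaluator that plugs into the evaluation pipeline. The class and prompt below are hypothetical and not part of the shipped framework; the real extension points are described in the Extending the Framework guide.

```python
# A hypothetical custom evaluator, illustrating how a new metric could plug
# into the evaluation pipeline.
class PolitenessEvaluator:
    """Scores how polite a bot response is on a 1-5 scale using an LLM judge."""

    PROMPT = (
        "Rate the politeness of the following customer-service response "
        "on a scale of 1 (rude) to 5 (very polite). Reply with the number only.\n"
        "Response: {answer}"
    )

    def __init__(self, llm_call):
        # llm_call: any callable that takes a prompt string and returns the
        # model's text completion (injected so the evaluator stays testable).
        self.llm_call = llm_call

    def __call__(self, answer: str) -> int:
        raw = self.llm_call(self.PROMPT.format(answer=answer))
        return int(raw.strip())
```

Injecting the LLM call as a plain callable keeps the evaluator independent of any particular model client and easy to unit-test.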
Go to PreProd Evaluation for more details.
