LLMs can produce sophisticated, coherent, and persuasive language on almost any topic from their general knowledge, but most real-world applications require them to answer questions based on the context given in the prompt as well as their prior training.
Customers who develop and use applications such as ChatBots for answering product- or service-related questions need to evaluate the performance of these applications from both business and data science perspectives in order to improve them further.
The area of LLMs assessing LLM-based applications is new but evolving quickly; however, there is no common framework or platform for conducting evaluations at scale for retail/commerce ChatBots.
This solution (along with the working codebase) and the related pattern provide a high-level approach to understanding LLM evaluations, including the data model, the data pipelines that process the input data, and the datastore that holds the evaluation metrics for self-service visualization. The solution can be a catalyst for e-commerce/retail chatbot use cases, and it can also be adapted to other domains with minor changes.
Likewise, the solution can be used as an example asset, both in ISE and in MCAPS more generally, to show the relevance and usefulness of LLM evaluations for different domains, their advantages for business and data science, and the adaptability and scalability of such solutions.
The asset summarizes the learnings extracted from the implementation of a “Conversational commerce chatbot Evaluation Framework” for a large e-commerce customer based in India.
This asset has a reference frontend UI that simulates a chat interface and generates conversations, which are then persisted in the data stores for further evaluation. The reference implementation has the following features:
NOTE: Several features may be reusable as-is; however, modifications or extensions may be needed depending on the use case.
- Data Model for LLM Evaluation
- Data Transformation Pipeline for processing application logs
- LLM Evaluation Pipeline for generating evaluation metrics at scale
- Visualization of evaluation metrics in PowerBI
- Automation of infrastructure deployment using IaC
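To make the persisted conversations concrete, a single raw conversation log produced by the reference ChatBot might look like the following. The field names here are illustrative only; the authoritative schema is the Data Model listed above.

```python
# A hypothetical raw chat log for one conversation. Field names are
# illustrative; the actual schema is defined by the framework's Data Model.
raw_conversation_log = {
    "conversation_id": "conv-0001",
    "turns": [
        {
            "turn_id": 1,
            "timestamp": "2024-01-15T10:32:05Z",
            "user_query": "Do you have this running shoe in size 10?",
            "bot_response": "Yes, size 10 is in stock. Shall I add it to your cart?",
            "retrieved_context": ["SKU 1234: running shoe, sizes 7-12 in stock"],
            "llm_calls": [  # every LLM call the ChatBot made for this turn
                {"model": "gpt-4", "prompt_tokens": 512,
                 "completion_tokens": 48, "latency_ms": 820},
            ],
        },
    ],
}
```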
The architecture of the solution consists of the following components:
- ChatBot - The application that handles the chat conversations, built using LLMs.
- Log Storage - The storage for the chat logs, generated by the ChatBot application. This contains the raw chat logs for each turn in the conversation and the LLM calls made by the ChatBot.
- Transformation Pipeline - The pipeline that processes the raw chat logs and produces the data model for the evaluation (a minimal sketch of this step follows this list).
- Evaluation Data Store - The storage for the processed chat logs, containing the data model for the evaluation. The data will be stored as Parquet files.
- Evaluation Pipeline - The pipeline that processes the data model and generates the evaluation metrics by calling the LLM. In this step, Prompt Flow is used to generate the evaluation metrics at scale (see the sketch after this list).
- Evaluation Metrics Store - The storage for the evaluation metrics generated by the Evaluation Pipeline. The data store will be a structured store, such as a SQL database.
- Visualization - The tool used to visualize the evaluation metrics. This could be a BI tool, such as PowerBI.
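The following is a minimal sketch of the Transformation Pipeline step, assuming pandas (with pyarrow) is available and that raw logs are stored as one JSON file per conversation, shaped like the record sketched earlier. The directory layout and column names are hypothetical:

```python
import json
from pathlib import Path

import pandas as pd  # assumes pandas with pyarrow installed for Parquet support


def transform_logs(log_dir: str, output_path: str) -> None:
    """Flatten raw chat logs (one hypothetical JSON file per conversation)
    into the tabular evaluation data model and persist it as Parquet."""
    rows = []
    for log_file in Path(log_dir).glob("*.json"):
        conversation = json.loads(log_file.read_text())
        for turn in conversation["turns"]:
            rows.append(
                {
                    "conversation_id": conversation["conversation_id"],
                    "turn_id": turn["turn_id"],
                    "user_query": turn["user_query"],
                    "bot_response": turn["bot_response"],
                    "retrieved_context": "\n".join(turn.get("retrieved_context", [])),
                }
            )
    # One row per conversation turn -- the unit at which metrics are computed.
    pd.DataFrame(rows).to_parquet(output_path, index=False)


transform_logs("chat_logs/", "evaluation_data_store/turns.parquet")
```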
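And a minimal sketch of the Evaluation Pipeline step, assuming the Prompt Flow Python SDK (`promptflow`) and that the processed turns have been exported to JSONL (a format Prompt Flow batch runs accept). The flow folder and column names are hypothetical:

```python
from promptflow.client import PFClient  # assumes the promptflow SDK is installed

pf = PFClient()

# Submit a batch run of a (hypothetical) evaluation flow over the processed
# turns; each input row is scored independently, which is what lets the
# evaluation fan out at scale.
run = pf.run(
    flow="flows/groundedness_eval",            # hypothetical evaluation flow folder
    data="evaluation_data_store/turns.jsonl",  # one JSON line per conversation turn
    column_mapping={
        "question": "${data.user_query}",
        "answer": "${data.bot_response}",
        "context": "${data.retrieved_context}",
    },
)

# Retrieve per-row metric outputs, ready to load into the Evaluation Metrics Store.
metrics_df = pf.get_details(run)
print(metrics_df.head())
```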
Refer to the QuickStart guide to get started with the solution. The prerequisites are:
- Azure Subscription
- Development Environment
- Azure CLI
- Python 3.11
- Conda
- Visual Studio Code
- Power BI Desktop (for visualization)
The infrastructure for the solution can be deployed using the provided Infrastructure as Code (IaC) templates. For more details, refer to the Infrastructure Deployment guide.
The framework can be deployed using the provided deployment scripts. For more details, refer to the Framework Deployment guide.
A quick demo of the solution can be found in the Demo guide.
The framework is designed to be modular and extensible and consists of the following components:
- Data Model - The data model for the evaluation, containing the chat logs and the evaluation metrics.
- Azure SQL - The Azure SQL scripts for creating the database and tables for storing the evaluation metrics as per the data model.
- Sample Chatbot - A sample chatbot application that generates the chat logs for demonstration purposes.
- Framework Source - The source code for the framework, containing Python modules for the data transformation pipeline and the evaluation pipeline.
- Framework Deployment - The deployment and execution scripts for the framework, containing the Azure ML pipelines for the data transformation and evaluation pipelines (a submission sketch follows this list).
- Dashboards - The PowerBI dashboard for visualizing the evaluation metrics.
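As an illustration of how the deployment scripts might submit these pipelines, here is a minimal sketch using the Azure ML Python SDK v2; the workspace identifiers and YAML path are placeholders:

```python
from azure.ai.ml import MLClient, load_job
from azure.identity import DefaultAzureCredential

# Connect to the Azure ML workspace (placeholder identifiers).
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Load a pipeline job definition from YAML (hypothetical path) and submit it.
pipeline_job = load_job(source="pipelines/evaluation_pipeline.yml")
submitted = ml_client.jobs.create_or_update(pipeline_job)
print(f"Submitted pipeline job: {submitted.name}")
```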
NOTE: More detailed information about the framework can be found in the Developer Guide.
The framework can be extended in several ways to support different use cases. For more details, refer to the Extending the Framework guide.
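As one illustration of such an extension, a new metric could be added as a self-contained evaluator that plugs into the evaluation pipeline. The class and prompt below are hypothetical and not part of the shipped framework; the real extension points are described in the Extending the Framework guide.

```python
# A hypothetical custom evaluator, illustrating how a new metric could plug
# into the evaluation pipeline.
class PolitenessEvaluator:
    """Scores how polite a bot response is on a 1-5 scale using an LLM judge."""

    PROMPT = (
        "Rate the politeness of the following customer-service response "
        "on a scale of 1 (rude) to 5 (very polite). Reply with the number only.\n"
        "Response: {answer}"
    )

    def __init__(self, llm_call):
        # llm_call: any callable that takes a prompt string and returns the
        # model's text completion (injected so the evaluator stays testable).
        self.llm_call = llm_call

    def __call__(self, answer: str) -> int:
        raw = self.llm_call(self.PROMPT.format(answer=answer))
        return int(raw.strip())
```

Injecting the LLM call as a plain callable keeps the evaluator independent of any particular model client and easy to unit-test.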
Go to PreProd Evaluation for more details.
