
Evaluating the quality of AI document data extraction with small and large language models

This repository contains the code and data used in the analysis write-up "Evaluating the quality of AI document data extraction with small and large language models", published on the Microsoft Tech Community Blog by the ISV & Digital Native Center of Excellence team.

The repository provides a .NET NUnit test project that demonstrates the following techniques for data extraction using small and large language models:

  • Markdown Extraction with Azure AI Document Intelligence. This technique involves converting the document into Markdown using the pre-built layout model in Azure AI Document Intelligence. Read more about this technique in our detailed article.
  • Vision Capabilities of Multi-Modal Language Models. This technique converts the document pages to images and uses the vision capabilities of multi-modal models, such as GPT-4 Turbo with Vision and GPT-4 Omni, to analyze both text and visual elements. Explore this technique in more detail in our sample project.

For each technique, the model is prompted with a one-shot example that provides the expected output schema for the response. This establishes the intent of the extraction and improves the overall accuracy of the generated output.
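
As a rough illustration, a one-shot extract prompt might embed an empty instance of the expected schema directly in the instruction. This is a minimal sketch; the field names and prompt wording are illustrative assumptions, not the repository's actual prompts (see InvoiceData.cs for the real schema):

```csharp
using System.Text.Json;

// Hypothetical one-shot example: serialize an empty instance of the expected
// output schema and embed it in the extract prompt so the model knows the
// exact JSON shape to produce.
var schemaExample = new
{
    InvoiceNumber = "",
    InvoiceDate = "",
    Total = 0.0,
    Items = new[] { new { Description = "", Quantity = 0, UnitPrice = 0.0 } }
};

string extractPrompt =
    "Extract the data from this document. " +
    "Respond only with JSON that matches this example structure:\n" +
    JsonSerializer.Serialize(schemaExample, new JsonSerializerOptions { WriteIndented = true });
```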

Prerequisites - Understanding

Before exploring this repository in detail, please ensure that you have a working understanding of the following:

.NET Testing

Azure AI

Prerequisites - Setup

Before running the tests, you will need the Azure services and configuration described in the Running the tests section below.

Understanding the evaluation tests

The purpose of this repository is to demonstrate how you can effectively evaluate different language models and techniques for document data extraction. The test cases for each scenario form a consistent evaluation framework for each technique and model combination, producing results that can be compared directly, as sketched below.
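
The shape of such a framework might look like the following NUnit sketch. The identifiers and model names here are illustrative assumptions, not the repository's actual test code (see the test classes referenced below for the real implementations):

```csharp
using System.Threading.Tasks;
using NUnit.Framework;

[TestFixture]
public class DataExtractionEvaluationTests
{
    // Each model and technique combination runs through the same
    // load -> extract -> compare -> record pipeline, so results are comparable.
    [TestCase("gpt-4o", "markdown")]
    [TestCase("gpt-4o", "vision")]
    [TestCase("phi-3-mini", "markdown")]
    public async Task Extracts_expected_data(string modelDeployment, string technique)
    {
        // 1. Run the technique-specific extraction for the scenario document.
        // 2. Compare the extracted object to the expected output.
        // 3. Persist accuracy, execution time, and token usage.
        await Task.CompletedTask; // placeholder for the real pipeline
    }
}
```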

Evaluating the Markdown technique with Azure AI Document Intelligence


Test Scenario: The test scenario evaluates the quality of data extraction using the Markdown technique with Azure AI Document Intelligence and prompting language models for JSON structures.

For this technique, the test runs as follows:

  1. The scenario PDF document is loaded and sent to the Azure AI Document Intelligence service for processing with the prebuilt-layout model. (A hedged sketch of steps 1-4 follows this list.)

    See AzureAIDocumentIntelligenceMarkdownConverter.cs for more details on calling this API endpoint using the Azure AI Document Intelligence .NET SDK.

  2. The extracted Markdown content is then combined with the system and scenario extract prompts to construct a request to a language model to perform the data extraction.

    The details for the specific system and extract prompts are defined in the test cases for consistency across each model and technique. As part of the scenario extract prompt, a one-shot example is provided (see InvoiceData.cs). For details on the test case construction, see InvoiceDataExtractionTests.cs.

  3. With the request constructed, it is sent to the language model.

    See AzureMLServerlessMarkdownDocumentDataExtractor.cs and AzureOpenAIMarkdownDocumentDataExtractor.cs for more details on how the request is made.

  4. The JSON response is then deserialized into the same .NET object that was used to construct the request.

    As an example, see InvoiceData.cs for the object structure.

  5. The actual output is then compared to the expected output defined by the test case. The test stores the quality results, including the accuracy, execution time, and tokens used.

    See InvoiceDataExtractionTests.cs for more details on the test case construction and evaluation.
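
Taken together, steps 1-4 might look like the following sketch. It assumes the Azure.AI.DocumentIntelligence and Azure.AI.OpenAI SDKs; exact type and option names vary between SDK versions, and the endpoints, keys, deployment name, prompts, and the simplified InvoiceData type are placeholders, not the repository's actual code:

```csharp
using System;
using System.IO;
using System.Text.Json;
using Azure;
using Azure.AI.DocumentIntelligence;
using Azure.AI.OpenAI;
using OpenAI.Chat;

string systemPrompt = "You are an AI assistant that extracts data from documents into JSON."; // placeholder
string extractPrompt = "<scenario extract prompt including the one-shot example>"; // placeholder

// Step 1: convert the PDF to Markdown with the prebuilt-layout model.
var documentIntelligence = new DocumentIntelligenceClient(
    new Uri("https://<resource>.cognitiveservices.azure.com/"),
    new AzureKeyCredential("<key>"));

var analyzeContent = new AnalyzeDocumentContent
{
    Base64Source = BinaryData.FromBytes(File.ReadAllBytes("invoice.pdf"))
};
Operation<AnalyzeResult> analysis = await documentIntelligence.AnalyzeDocumentAsync(
    WaitUntil.Completed, "prebuilt-layout", analyzeContent,
    outputContentFormat: ContentFormat.Markdown); // beta SDK naming; GA versions differ
string markdown = analysis.Value.Content;

// Steps 2-3: combine the Markdown with the system and extract prompts and
// send the request to the language model.
var openAI = new AzureOpenAIClient(
    new Uri("https://<aoai>.openai.azure.com/"),
    new AzureKeyCredential("<key>"));
ChatClient chat = openAI.GetChatClient("<deployment>");

ChatCompletion completion = await chat.CompleteChatAsync(
    new SystemChatMessage(systemPrompt),
    new UserChatMessage(extractPrompt + "\n\n" + markdown));

// Step 4: deserialize the JSON response into the same .NET object that was
// used to construct the one-shot example.
InvoiceData? data = JsonSerializer.Deserialize<InvoiceData>(completion.Content[0].Text);

// Minimal stand-in for the repository's InvoiceData.cs.
public record InvoiceData(string InvoiceNumber, double Total);
```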

Evaluating the Vision technique with Multi-Modal Language Models


Test Scenario: The test scenario evaluates the quality of data extraction using the Vision technique with multi-modal language models and a single prompt request for JSON structures.

For this technique, the test runs as follows:

  1. The scenario PDF document is loaded, and each page is converted to an image. The page images are then combined with the system and scenario extract prompts to construct a request to a vision-capable language model to perform the data extraction. (A hedged sketch of these steps follows this list.)

    The details for the specific system and extract prompts are defined in the test cases for consistency across each model and technique. As part of the scenario extract prompt, a one-shot example is provided (see VehicleInsuranceContractData.cs). For details on the test case construction, see VehicleInsuranceContractDataExtractionTests.cs.

Important

The GPT-4 with Vision models support a maximum of 10 images per request, so the test pre-processes documents over this limit by stitching pages together to reduce the total number of images to the supported maximum. See AzureOpenAIVisionDocumentDataExtractor.cs for more details on how the images are processed.
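
As an illustration of the stitching idea, the following sketch vertically concatenates two page images using SixLabors.ImageSharp. The library choice and method are assumptions for illustration; see the extractor class above for the actual implementation:

```csharp
using System;
using SixLabors.ImageSharp;
using SixLabors.ImageSharp.PixelFormats;
using SixLabors.ImageSharp.Processing;

// Stitch two page images into one, stacked vertically, so a long document
// can fit within the model's maximum image count.
// Applied pairwise until no more than 10 images remain, e.g.
// pages[0] = StitchVertically(pages[0], pages[1]).
static Image StitchVertically(Image top, Image bottom)
{
    var stitched = new Image<Rgba32>(
        Math.Max(top.Width, bottom.Width),
        top.Height + bottom.Height);
    stitched.Mutate(ctx =>
    {
        ctx.DrawImage(top, new Point(0, 0), 1f);
        ctx.DrawImage(bottom, new Point(0, top.Height), 1f);
    });
    return stitched;
}
```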

  2. With the request constructed, it is sent to the language model.

    See AzureOpenAIVisionDocumentDataExtractor.cs for more details on how the request is made.

  3. The JSON response is then deserialized into the same .NET object that was used to construct the request.

    As an example, see VehicleInsuranceContractData.cs for the object structure.

  4. The actual output is then compared to the expected output defined by the test case. The test stores the quality results, including the accuracy, execution time, and tokens used.

    See VehicleInsuranceContractDataExtractionTests.cs for more details on the test case construction and evaluation.
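
A hedged sketch of the vision request follows, assuming the OpenAI .NET client surfaced through Azure.AI.OpenAI and pre-rendered PNG page images; the endpoint, key, deployment, prompts, paths, and the simplified contract type are placeholders, not the repository's actual code:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;
using Azure;
using Azure.AI.OpenAI;
using OpenAI.Chat;

var openAI = new AzureOpenAIClient(
    new Uri("https://<aoai>.openai.azure.com/"),
    new AzureKeyCredential("<key>"));
ChatClient chat = openAI.GetChatClient("<vision-capable-deployment>");

// Build a single user message containing the extract prompt followed by one
// image content part per (possibly stitched) page image.
var parts = new List<ChatMessageContentPart>
{
    ChatMessageContentPart.CreateTextPart("<scenario extract prompt with one-shot example>")
};
foreach (string pagePath in Directory.GetFiles("pages", "*.png"))
{
    parts.Add(ChatMessageContentPart.CreateImagePart(
        BinaryData.FromBytes(File.ReadAllBytes(pagePath)), "image/png"));
}

ChatCompletion completion = await chat.CompleteChatAsync(
    new SystemChatMessage("<system prompt>"),
    new UserChatMessage(parts));

// Deserialize the JSON response into the scenario's .NET type.
var contract = JsonSerializer.Deserialize<VehicleInsuranceContractData>(
    completion.Content[0].Text);

// Minimal stand-in for the repository's VehicleInsuranceContractData.cs.
public record VehicleInsuranceContractData(string PolicyNumber, DateTime? LastDateToCancel);
```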

Project structure

The project structure is as follows:

  • EvaluationTests: The main test project containing the scenarios, test cases, and documents for processing.
    • Structured (Invoices): Contains the test cases for extracting data from structured documents, in this case, invoices. The test cases evaluate two sub-scenarios: one for a simple invoice that contains typed data and signatures, and another for a more complex invoice that contains a grid system to separate the data, handwritten text, and overlapping content.
    • Unstructured (Vehicle Insurance Contracts): Contains the test cases for extracting data from unstructured documents, in this case, vehicle insurance contracts. The tests aim to extract data that can only be inferred from the content and structure of the document. For example, the VehicleInsuranceContractData.cs object expects a Last Date To Cancel, which is not explicitly stated in the document. The expectation is that the language model will interpret natural language, such as "You have the right to cancel this policy within the first 14 days", and calculate the date from the known effective start date.
  • Shared: Contains reusable .NET services, including the document data extractor services for Azure OpenAI and Azure AI Document Intelligence.
  • Infra: Contains the Azure Bicep template for deploying the necessary Azure services for the tests.

Running the tests

To run the tests, you must set up the necessary Azure services and configure the test project with the required environment variables.

To set up an environment in Azure, run the Setup-Environment.ps1 script from the root of the project:

.\Setup-Environment.ps1 -DeploymentName <DeploymentName> -Location <Location (e.g., eastus or swedencentral)> -SkipInfrastructure $false

This script will deploy the necessary Azure services using the Azure Bicep template in the infra folder.

Once deployed, the script will also update the appsettings.Test.json file with the necessary connection strings and keys for the Azure services.

You can then run the tests using the following command:

dotnet test

Understanding the test results

For each of the scenarios run, files will be output in the bin/Debug/net8.0/Output folder for the project.

These files contain:

  • Accuracy: The results of the comparison between the expected and actual output. Most property values will be 0 or 1, with 1 indicating a match. Objects and arrays have a rollup accuracy based on the number of properties that matched. The overall accuracy is calculated as the sum of the property accuracies divided by the total number of properties. (A sketch of this rollup follows this list.)
  • Execution Time: The time taken to execute the data extraction. This starts when the first request is made to either the Azure AI Document Intelligence or Azure OpenAI service and ends when the extracted data response is received.
  • Result: The extracted data response from the language model, including the total tokens used in both the prompt and response.
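
As a sketch of how such a rollup accuracy could be computed, the following recursive comparison scores leaf values 0 or 1 and averages upwards through objects and arrays. This illustrates the scheme described above, not the repository's exact implementation:

```csharp
using System.Linq;
using System.Text.Json;

// Example: Score(JsonDocument.Parse(expectedJson).RootElement,
//                JsonDocument.Parse(actualJson).RootElement)
static double Score(JsonElement expected, JsonElement actual)
{
    switch (expected.ValueKind)
    {
        case JsonValueKind.Object:
            // Objects roll up to the average of their property scores.
            var props = expected.EnumerateObject().ToList();
            if (props.Count == 0) return 1;
            return props.Average(p =>
                actual.ValueKind == JsonValueKind.Object
                && actual.TryGetProperty(p.Name, out JsonElement a)
                    ? Score(p.Value, a)
                    : 0);
        case JsonValueKind.Array:
            // Arrays roll up to the average of their element scores.
            var items = expected.EnumerateArray().ToList();
            if (items.Count == 0) return 1;
            return items
                .Select((e, i) =>
                    actual.ValueKind == JsonValueKind.Array && i < actual.GetArrayLength()
                        ? Score(e, actual[i])
                        : 0)
                .Average();
        default:
            // Leaf values score 1 on an exact raw-text match, 0 otherwise;
            // a real comparison may normalize numbers, dates, and casing.
            return expected.GetRawText() == actual.GetRawText() ? 1 : 0;
    }
}
```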