Add a receipt processing cookbook (#1249)
I added a receipt processing cookbook. 

- Uses Qwen or Pixtral
- General-purpose message templating, with no messy model-specific token insertion
- A simple helper for downscaling images to reduce processing and memory requirements

Should help illustrate a simple use case for vision models.
cpfiffer authored Nov 8, 2024
1 parent b93f550 commit b5d26c4
Showing 4 changed files with 298 additions and 0 deletions.
Binary file added docs/cookbook/images/trader-joes-receipt.jpg
1 change: 1 addition & 0 deletions docs/cookbook/index.md
@@ -14,4 +14,5 @@ This part of the documentation provides a few cookbooks that you can browse to g
- [ReAct Agent](react_agent.md): Build an agent with open weights models using regex-structured generation.
- [Earnings reports to CSV](earnings-reports.md): Extract data from earnings reports to CSV using regex-structured generation.
- [Vision-Language Models](atomic_caption.md): Use Outlines with vision-language models for tasks like image captioning and visual reasoning.
- [Receipt Digitization](receipt-digitization.md): Extract information from a picture of a receipt using structured generation.
- [Structured Generation from PDFs](read-pdfs.md): Use Outlines with vision-language models to read PDFs and produce structured output.
296 changes: 296 additions & 0 deletions docs/cookbook/receipt-digitization.md
@@ -0,0 +1,296 @@
# Receipt Data Extraction with VLMs

## Setup

You'll need to install the dependencies:

```bash
pip install outlines torch==2.4.0 transformers accelerate pillow rich
```

## Import libraries

Load all the necessary libraries:

```python
# LLM stuff
import outlines
import torch
from transformers import AutoProcessor
from pydantic import BaseModel, Field
from typing import Literal, Optional, List

# Image stuff
from PIL import Image
import requests

# Rich for pretty printing
from rich import print
```

## Choose a model

This example has been tested with `mistral-community/pixtral-12b` ([HF link](https://huggingface.co/mistral-community/pixtral-12b)) and `Qwen/Qwen2-VL-7B-Instruct` ([HF link](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)).

We recommend Qwen-2-VL as we have found it to be more accurate than Pixtral.

If you want to use Qwen-2-VL, you can do the following:

```python
# To use Qwen-2-VL:
from transformers import Qwen2VLForConditionalGeneration
model_name = "Qwen/Qwen2-VL-7B-Instruct"
model_class = Qwen2VLForConditionalGeneration
```

If you want to use Pixtral, you can do the following:

```python
# To use Pixtral:
from transformers import LlavaForConditionalGeneration
model_name="mistral-community/pixtral-12b"
model_class=LlavaForConditionalGeneration
```

## Load the model

Load the model into memory:

```python
model = outlines.models.transformers_vision(
    model_name,
    model_class=model_class,
    model_kwargs={
        "device_map": "auto",
        "torch_dtype": torch.bfloat16,
    },
    processor_kwargs={
        "device": "cuda",  # set to "cpu" if you don't have a GPU
    },
)
```
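The processor `device` above is hard-coded to `"cuda"`. If you are not sure whether a GPU is visible to PyTorch, a quick optional check (not part of the original recipe) is:

```python
# Pick the device based on what PyTorch can actually see,
# and use it for the "device" entry in processor_kwargs above.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
```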

## Image processing

Images can be quite large. In GPU-poor environments, you may need to downscale the image before passing it to the model.

Here's a helper function to do that:

```python
def load_and_resize_image(image_path, max_size=1024):
    """
    Load and resize an image while maintaining aspect ratio.

    Args:
        image_path: Path to the image file
        max_size: Maximum dimension (width or height) of the output image

    Returns:
        PIL Image: Resized image
    """
    image = Image.open(image_path)

    # Get current dimensions
    width, height = image.size

    # Calculate scaling factor
    scale = min(max_size / width, max_size / height)

    # Only resize if image is larger than max_size
    if scale < 1:
        new_width = int(width * scale)
        new_height = int(height * scale)
        image = image.resize((new_width, new_height), Image.Resampling.LANCZOS)

    return image
```

You can change the resolution of the image via the `max_size` argument. Smaller values produce a blurrier image, but processing is faster and uses less memory.
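For example, to cap the longer side at 512 pixels (an arbitrary value; the file path is a placeholder until the image is downloaded in the next step):

```python
# Smaller max_size: blurrier image, but faster processing and lower memory use
small_image = load_and_resize_image("receipt.png", max_size=512)
```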

## Load an image

Load an image and resize it. We've provided a sample image of a Trader Joe's receipt, but you can use any image you'd like.

Here's what the image looks like:

![Trader Joe's receipt](./images/trader-joes-receipt.jpg)

```python
# URL of the sample receipt image
image_path = "https://dottxt-ai.github.io/outlines/main/cookbook/images/trader-joes-receipt.png"

# Download the image
response = requests.get(image_path)
with open("receipt.png", "wb") as f:
    f.write(response.content)

# Load + resize the image
image = load_and_resize_image("receipt.png")
```
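If you want to confirm the resize worked, checking the dimensions is enough:

```python
# Both dimensions should now be at most 1024 pixels
print(image.size)
```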

## Define the output structure

We'll define a Pydantic model to describe the data we want to extract from the image.

In our case, we want to extract the following information:

- The store name
- The store address
- The store number
- A list of items, including the name, quantity, price per unit, and total price
- The tax
- The total
- The date
- The payment method

Most fields are optional, as not all receipts contain all information.

```python
class Item(BaseModel):
    name: str
    quantity: Optional[int]
    price_per_unit: Optional[float]
    total_price: Optional[float]

class ReceiptSummary(BaseModel):
    store_name: str
    store_address: str
    store_number: Optional[int]
    items: List[Item]
    tax: Optional[float]
    total: Optional[float]
    # Date is in the format YYYY-MM-DD. We can apply a regex pattern to ensure it's formatted correctly.
    date: Optional[str] = Field(pattern=r'\d{4}-\d{2}-\d{2}', description="Date in the format YYYY-MM-DD")
    payment_method: Literal["cash", "credit", "debit", "check", "other"]
```
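If you want to verify that the date pattern is actually enforced, a quick sanity check with made-up values (not taken from the receipt) looks like this:

```python
from pydantic import ValidationError

try:
    ReceiptSummary(
        store_name="Example Store",
        store_address="123 Main St",
        store_number=1,
        items=[],
        tax=0.0,
        total=0.0,
        date="11/04/2023",  # wrong format -- should be YYYY-MM-DD
        payment_method="cash",
    )
except ValidationError as e:
    print(e)  # reports that 'date' does not match the pattern
```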

## Prepare the prompt

We'll use the `AutoProcessor` to convert the image and the text prompt into a format that the model can understand. Practically,
this is the code that adds user, system, assistant, and image tokens to the prompt.

```python
# Set up the content you want to send to the model
messages = [
    {
        "role": "user",
        "content": [
            {
                # The image is provided as a PIL Image object
                "type": "image",
                "image": image,
            },
            {
                "type": "text",
                "text": f"""You are an expert at extracting information from receipts.
Please extract the information from the receipt. Be as detailed as possible --
missing or misreporting information is a crime.
Return the information in the following JSON schema:
{ReceiptSummary.model_json_schema()}
""",
            },
        ],
    }
]

# Convert the messages to the final prompt
processor = AutoProcessor.from_pretrained(model_name)
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
```

If you are curious, the final prompt that is sent to the model looks (roughly) like this:

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|>
You are an expert at extracting information from receipts.
Please extract the information from the receipt. Be as detailed as
possible -- missing or misreporting information is a crime.
Return the information in the following JSON schema:
<JSON SCHEMA OMITTED>
<|im_end|>
<|im_start|>assistant
```

## Run the model

```python
# Prepare a function to process receipts
receipt_summary_generator = outlines.generate.json(
    model,
    ReceiptSummary,

    # Greedy sampling is a good idea for numeric
    # data extraction -- no randomness.
    sampler=outlines.samplers.greedy()
)

# Generate the receipt summary
result = receipt_summary_generator(prompt, [image])
print(result)
```

## Output

The output should look like this:

```
ReceiptSummary(
    store_name="Trader Joe's",
    store_address='401 Bay Street, San Francisco, CA 94133',
    store_number=0,
    items=[
        Item(name='BANANA EACH', quantity=7, price_per_unit=0.23, total_price=1.61),
        Item(name='BAREBELLS CHOCOLATE DOUG', quantity=1, price_per_unit=2.29, total_price=2.29),
        Item(name='BAREBELLS CREAMY CRISP', quantity=1, price_per_unit=2.29, total_price=2.29),
        Item(name='BAREBELLS CHOCOLATE DOUG', quantity=1, price_per_unit=2.29, total_price=2.29),
        Item(name='BAREBELLS CARAMEL CASHEW', quantity=2, price_per_unit=2.29, total_price=4.58),
        Item(name='BAREBELLS CREAMY CRISP', quantity=1, price_per_unit=2.29, total_price=2.29),
        Item(name='SPINDRIFT ORANGE MANGO 8', quantity=1, price_per_unit=7.49, total_price=7.49),
        Item(name='Bottle Deposit', quantity=8, price_per_unit=0.05, total_price=0.4),
        Item(name='MILK ORGANIC GALLON WHOL', quantity=1, price_per_unit=6.79, total_price=6.79),
        Item(name='CLASSIC GREEK SALAD', quantity=1, price_per_unit=3.49, total_price=3.49),
        Item(name='COBB SALAD', quantity=1, price_per_unit=5.99, total_price=5.99),
        Item(name='PEPPER BELL RED XL EACH', quantity=1, price_per_unit=1.29, total_price=1.29),
        Item(name='BAG FEE.', quantity=1, price_per_unit=0.25, total_price=0.25),
        Item(name='BAG FEE.', quantity=1, price_per_unit=0.25, total_price=0.25)
    ],
    tax=0.68,
    total=41.98,
    date='2023-11-04',
    payment_method='debit',
)
```

Voila! You've successfully extracted information from a receipt using a vision-language model.
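Because `result` is a `ReceiptSummary` instance, you can also work with individual fields or serialize it, for example to keep the structured data next to the original image (values below are illustrative):

```python
# Access individual fields directly
print(result.total)          # e.g. 41.98
print(result.items[0].name)  # e.g. 'BANANA EACH'

# Serialize to JSON for storage or downstream processing
with open("receipt.json", "w") as f:
    f.write(result.model_dump_json(indent=2))
```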

## Bonus: roasting the user for their receipt

You can roast the user for their receipt by adding a `roast` field to the end of the `ReceiptSummary` model.

```python
class ReceiptSummary(BaseModel):
    ...
    roast: str
```

Rebuild the prompt and the generator with the updated model (a sketch of that re-run follows at the end of this section), and you'll get a result like:

```
ReceiptSummary(
    ...
    roast="You must be a fan of Trader Joe's because you bought enough
    items to fill a small grocery bag and still had to pay for a bag fee.
    Maybe you should start using reusable bags to save some money and the
    environment."
)
```

Qwen is not particularly funny, but worth a shot.
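Note that the prompt embeds the JSON schema, so after adding the `roast` field you need to re-render the prompt and recreate the generator before running again. A minimal sketch, reusing the objects defined above:

```python
# Re-render the prompt so it embeds the updated schema (roast field included)
messages[0]["content"][1]["text"] = f"""You are an expert at extracting information from receipts.
Please extract the information from the receipt. Be as detailed as possible --
missing or misreporting information is a crime.
Return the information in the following JSON schema:
{ReceiptSummary.model_json_schema()}
"""
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Recreate the generator with the updated model and run it once more
roast_generator = outlines.generate.json(
    model, ReceiptSummary, sampler=outlines.samplers.greedy()
)
result = roast_generator(prompt, [image])
print(result.roast)
```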
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -122,6 +122,7 @@ nav:
- Vision-Language Models: cookbook/atomic_caption.md
- Structured Generation from PDFs: cookbook/read-pdfs.md
- Earnings reports to CSV: cookbook/earnings-reports.md
- Digitizing receipts with vision models: cookbook/receipt-digitization.md
- Run on the cloud:
- BentoML: cookbook/deploy-using-bentoml.md
- Cerebrium: cookbook/deploy-using-cerebrium.md
