
Commit 7621d1f

Improving step by step instructions for structured_parser (#985)
2 parents b7f880b + c50ddbf commit 7621d1f

File tree

8 files changed: +134 additions, -104 deletions


.github/scripts/spellcheck_conf/wordlist.txt

Lines changed: 9 additions & 0 deletions
```diff
@@ -1547,3 +1547,12 @@ DeepEval
 SDV
 sklearn
 GCP
+compat
+ArtifactExtractor
+DatabaseManager
+DocumentLens
+PDFs
+RequestBuilder
+VectorIndexManager
+csvs
+programmatically
```

.github/workflows/pytest_cpu_gha_runner.yaml

Lines changed: 0 additions & 1 deletion
```diff
@@ -72,5 +72,4 @@ jobs:
         with:
           paths: |
             **/*.xml
-            !**/AndroidManifest.xml
         if: always()
```

end-to-end-use-cases/structured_parser/README.md

Lines changed: 19 additions & 6 deletions
````diff
@@ -26,27 +26,40 @@ The tool is designed to handle complex documents with high accuracy and provides
 1. Clone the repository
 2. Install dependencies:
 
+```bash
+git clone https://github.com/meta-llama/llama-cookbook.git
+```
+```bash
+cd llama-cookbook
+```
+```bash
+pip install -r requirements.txt
+```
+2. Install project specific dependencies:
+```bash
+cd end-to-end-use-cases/structured_parser
+```
 ```bash
 pip install -r requirements.txt
 ```
-
-3. Configure the tool (see Configuration section)
-
 ## Quick Start
 
-Extract text from a PDF:
+### Configure the tool (see [Configuration](#Configuration) section)
+(Note: Setup API Key, Model for inferencing, etc.)
+
+### Extract text from a PDF:
 
 ```bash
 python src/structured_extraction.py path/to/document.pdf --text
 ```
 
-Extract text and tables, and save tables as CSV files:
+### Extract text and tables, and save tables as CSV files:
 
 ```bash
 python src/structured_extraction.py path/to/document.pdf --text --tables --save_tables_as_csv
 ```
 
-Process a directory of PDFs and export tables to Excel:
+### Process a directory of PDFs and export tables to Excel:
 
 ```bash
 python src/structured_extraction.py path/to/pdf_directory --text --tables --export_excel
````

end-to-end-use-cases/structured_parser/requirements.txt

Lines changed: 0 additions & 1 deletion
```diff
@@ -14,7 +14,6 @@ vllm>=0.2.0
 openai>=1.0.0
 
 # Database and vector search
-sqlite3>=3.35.0
 chromadb>=0.4.0
 sqlalchemy>=2.0.0
```
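
Note: `sqlite3` ships with CPython's standard library and is not installable from PyPI, which is presumably why the pin was dropped. A quick illustrative check (not part of the commit):

```python
# sqlite3 is bundled with CPython, so it needs no requirements.txt entry.
import sqlite3

print(sqlite3.sqlite_version)  # version of the SQLite engine linked into Python
```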

end-to-end-use-cases/structured_parser/src/config.yaml

Lines changed: 75 additions & 2 deletions
```diff
@@ -23,7 +23,7 @@ model:
   extraction_inference:
     temperature: 0.2
     top_p: 0.9
-    max_completion_tokens: 17000
+    max_completion_tokens: 32000
     seed: 42
 
 # Artifact configuration
@@ -64,7 +64,7 @@ artifacts:
 
   images:
     prompts:
-      system: "You are an OCR expert.\n\n1.Your task is to extract images, pictures and diagrams data from the following document. Do not extract tables or charts. \n2. For each extracted image, you must write\n a) a caption as given in the document,\n b) a detailed description of the image; utilize the surrounding text for this. Your descriptions should be very informative so that a human can understand what is in the image without ever seeing the document.Think step-by-step and write a JSON that corresponds to the schema and the information in the document\n\nIf there is nothing to extract, simply return an empty JSON {\"images\": []}. Ensure your final answer is appropriately formatted as a JSON object and wrapped in a ```json\n\n``` block."
+      system: "You are an OCR expert. (Note: Do not extract tables)\n\n1.Your task is to extract images, pictures, charts and diagrams only from the following document.\n 2. For each extracted image, you must write\n a) a caption as given in the document\n b) a detailed description of the image; utilize the surrounding text for this. Your descriptions should be very informative so that a human can understand what is in the image without ever seeing the document. Think step-by-step and write a JSON that corresponds to the schema and the information in the document\n\nIf there is nothing to extract, simply return an empty JSON {\"images\": []}. \nIf the image is a table, simply return an empty JSON {\"images\": []}. \n\nIf the image is a chart or a graph then you must convert them to JSON outputs.\n\n# Instructions to convert charts & graphs to JSON\nYour task is to: Analyze and describe the chart or graph. Summarize the type of chart/graph (e.g., bar chart, line graph, pie chart). Identify the axes, labels, categories, and any notable trends or patterns. Provide a brief textual description of what the chart/graph represents. Extract and structure the data:\n1. Capture all relevant values and data points from the chart/graph.\n2. Organize the extracted data into a clear and logical JSON structure.\n\n# Output format:\n\nYour response should be captured in the chart_data attribute of the JSON schema. Ensure your final answer is appropriately formatted as a JSON object and wrapped in a ```json\n\n``` block."
       user: "TARGET SCHEMA:\n```json\n{schema}\n```"
       output_schema: {
         "type": "object",
@@ -93,6 +93,79 @@ artifacts:
           "image_type": {
             "type": "string",
             "description": "Type of image (e.g., 'photograph', 'chart', 'diagram', 'illustration')"
+          },
+          "chart_data": {
+            "type": "object",
+            "properties": {
+              "type": {
+                "type": "string",
+                "enum": ["bar", "line", "pie", "scatter", "area"]
+              },
+              "title": {
+                "type": "string"
+              },
+              "subtitle": {
+                "type": "string"
+              },
+              "xAxis": {
+                "type": "object",
+                "properties": {
+                  "title": { "type": "string" },
+                  "labels": {
+                    "type": "array",
+                    "items": { "type": "string" }
+                  }
+                },
+                "required": ["title", "labels"]
+              },
+              "yAxis": {
+                "type": "object",
+                "properties": {
+                  "title": { "type": "string" },
+                  "labels": {
+                    "type": "array",
+                    "items": { "type": "string" }
+                  }
+                },
+                "required": ["title", "labels"]
+              },
+              "data": {
+                "type": "array",
+                "items": {
+                  "oneOf": [
+                    {
+                      "type": "object",
+                      "properties": {
+                        "label": { "type": "string" },
+                        "values": {
+                          "type": "array",
+                          "items": { "type": "number" }
+                        }
+                      },
+                      "required": ["label", "values"]
+                    },
+                    {
+                      "type": "object",
+                      "properties": {
+                        "x": { "type": "number" },
+                        "y": { "type": "number" }
+                      },
+                      "required": ["x", "y"]
+                    }
+                  ]
+                }
+              },
+              "options": {
+                "type": "object",
+                "properties": {
+                  "legend": { "type": "boolean" },
+                  "rtl": { "type": "boolean" },
+                  "responsive": { "type": "boolean" },
+                  "animation": { "type": "boolean" }
+                }
+              }
+            },
+            "required": ["type", "title", "xAxis", "yAxis", "data"]
           }
         },
         "required": [
```

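For reference, a payload that satisfies the new `chart_data` schema might look like the sketch below. All values are invented for illustration, and the commented-out validation step assumes the third-party `jsonschema` package plus a hypothetical `chart_data_schema` dict holding the schema above; neither is declared by this project.

```python
# A minimal sketch (not from the commit) of a model output that conforms
# to the chart_data schema added above. All values are hypothetical.
sample_chart = {
    "type": "bar",  # must be one of: bar, line, pie, scatter, area
    "title": "Quarterly Revenue",
    "xAxis": {"title": "Quarter", "labels": ["Q1", "Q2", "Q3", "Q4"]},
    "yAxis": {"title": "Revenue ($M)", "labels": ["0", "50", "100"]},
    "data": [
        # series form used by bar/line charts; scatter points would use
        # the {"x": ..., "y": ...} alternative from the oneOf branch
        {"label": "2023", "values": [42.0, 55.5, 61.2, 70.1]},
    ],
    "options": {"legend": True, "responsive": True},  # optional block
}

# Optional check, assuming `jsonschema` is installed:
# from jsonschema import validate
# validate(instance=sample_chart, schema=chart_data_schema)
```
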
end-to-end-use-cases/structured_parser/src/json_to_sql.py

Lines changed: 4 additions & 2 deletions
```diff
@@ -100,7 +100,8 @@ def create_artifact_table(sql_db_path: str) -> None:
     cursor.execute("DROP TABLE IF EXISTS document_artifacts")
 
     # Create table with schema
-    cursor.execute("""
+    cursor.execute(
+        """
     CREATE TABLE IF NOT EXISTS document_artifacts (
         id INTEGER PRIMARY KEY AUTOINCREMENT,
         doc_path TEXT,
@@ -124,7 +125,8 @@ def create_artifact_table(sql_db_path: str) -> None:
         image_caption TEXT,
         image_type TEXT
     )
-    """)
+        """
+    )
 
     # Create indexes for common queries
     cursor.execute(
```
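
For context, a sketch of how the resulting table could be read back with Python's standard-library `sqlite3` module. Only the table and column names come from the diff; the database path and query are illustrative.

```python
# Illustrative only: read chart artifacts back out of document_artifacts.
import sqlite3

conn = sqlite3.connect("artifacts.db")  # hypothetical database path
cursor = conn.cursor()
cursor.execute(
    """
    SELECT doc_path, image_caption, image_type
    FROM document_artifacts
    WHERE image_type = ?
    """,
    ("chart",),
)
for doc_path, caption, image_type in cursor.fetchall():
    print(f"{doc_path}: {caption} ({image_type})")
conn.close()
```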

end-to-end-use-cases/structured_parser/src/structured_extraction.py

Lines changed: 11 additions & 82 deletions
```diff
@@ -196,6 +196,7 @@ def _run_inference(
         artifact_types = [r[0] for r in requests]
         inference_requests = [r[1] for r in requests]
 
+        response_batch = []
         if backend == "offline-vllm":
            request_batch = InferenceUtils.make_vllm_batch(inference_requests)
            response_batch = InferenceUtils.run_vllm_inference(request_batch)
```
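
The new `response_batch = []` line gives the name a safe default before the backend branches run. A simplified, self-contained sketch of the failure mode it guards against (not the repository's actual code):

```python
# Without a default, a branch that never assigns the name raises
# UnboundLocalError when the result is used afterwards.
def run_inference(backend: str, requests: list[str]) -> list[str]:
    response_batch = []  # safe default if neither branch matches
    if backend == "offline-vllm":
        response_batch = [f"vllm:{r}" for r in requests]  # placeholder work
    elif backend == "openai-compat":
        response_batch = [f"openai:{r}" for r in requests]
    return response_batch  # always bound, even for an unknown backend

print(run_inference("unknown-backend", ["req1"]))  # [] instead of a crash
```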
```diff
@@ -304,79 +305,6 @@ def from_pdf(pdf_path: str, artifact_types: List[str]) -> List[ExtractedPage]:
 
         return pdf_pages
 
-    # @staticmethod
-    # async def _run_inference_async(
-    #     requests: List[Tuple[str, InferenceRequest]],
-    # ) -> List[Tuple[str, str]]:
-    #     """
-    #     Run inference asynchronously for all requests.
-
-    #     Args:
-    #         requests: List of tuples containing (artifact_type, inference_request)
-
-    #     Returns:
-    #         List of tuples containing (artifact_type, response)
-
-    #     Raises:
-    #         ValueError: If the backend is not supported
-    #     """
-    #     backend = config["model"].get("backend")
-    #     if backend not in SUPPORTED_BACKENDS:
-    #         raise ValueError(
-    #             f"Allowed config.model.backend: {SUPPORTED_BACKENDS}, got unknown value: {backend}"
-    #         )
-
-    #     artifact_types = [r[0] for r in requests]
-    #     inference_requests = [r[1] for r in requests]
-
-    #     if backend == "offline-vllm":
-    #         request_batch = InferenceUtils.make_vllm_batch(inference_requests)
-    #         response_batch = InferenceUtils.run_vllm_inference(request_batch)
-    #     elif backend == "openai-compat":
-    #         tasks = [
-    #             InferenceUtils.async_run_openai_inference(request)
-    #             for request in inference_requests
-    #         ]
-    #         response_batch = await asyncio.gather(*tasks)
-
-    #     return list(zip(artifact_types, response_batch))
-
-    # @staticmethod
-    # async def from_image_async(
-    #     img_path: str,
-    #     artifact_types: Union[List[str], str],
-    # ) -> ArtifactCollection:
-    #     """
-    #     Extract artifacts from an image asynchronously.
-
-    #     Args:
-    #         img_path: Path to the image file
-    #         artifact_types: Type(s) of artifacts to extract
-
-    #     Returns:
-    #         ArtifactCollection: Extracted artifacts
-
-    #     Raises:
-    #         ValueError: If the backend is not supported
-    #         FileNotFoundError: If the image file doesn't exist
-    #     """
-    #     if not os.path.exists(img_path):
-    #         raise FileNotFoundError(f"Image file not found: {img_path}")
-
-    #     if isinstance(artifact_types, str):
-    #         artifact_types = [artifact_types]
-
-    #     # Prepare inference requests
-    #     requests = ArtifactExtractor._prepare_inference_requests(
-    #         img_path, artifact_types
-    #     )
-
-    #     # Run inference asynchronously
-    #     responses = await ArtifactExtractor._run_inference_async(requests)
-
-    #     # Process responses
-    #     return ArtifactExtractor._process_responses(responses)
-
 
 def get_artifact_types(text: bool, tables: bool, images: bool) -> List[str]:
     """
```
```diff
@@ -422,16 +350,16 @@ def get_target_files(target_path: str) -> List[Path]:
     if not os.path.exists(target_path):
         raise FileNotFoundError(f"Target path not found: {target_path}")
 
-    target_path = Path(target_path)
-    if target_path.is_file() and target_path.suffix not in SUPPORTED_FILE_TYPES:
+    target_path_obj = Path(target_path)
+    if target_path_obj.is_file() and target_path_obj.suffix not in SUPPORTED_FILE_TYPES:
         raise ValueError(
-            f"Unsupported file type: {target_path.suffix}. Supported types: {SUPPORTED_FILE_TYPES}"
+            f"Unsupported file type: {target_path_obj.suffix}. Supported types: {SUPPORTED_FILE_TYPES}"
         )
 
     targets = (
-        [target_path]
-        if target_path.is_file()
-        else [f for f in target_path.iterdir() if f.suffix in SUPPORTED_FILE_TYPES]
+        [target_path_obj]
+        if target_path_obj.is_file()
+        else [f for f in target_path_obj.iterdir() if f.suffix in SUPPORTED_FILE_TYPES]
     )
     logger.debug(f"Processing {len(targets)} files")
     if not targets:
@@ -456,7 +384,7 @@ def process_files(
     out_json = []
     for target in targets:
         try:
-            artifacts = ArtifactExtractor.from_pdf(target, artifact_types)
+            artifacts = ArtifactExtractor.from_pdf(str(target), artifact_types)
             out_json.extend(artifacts)
         except Exception as e:
             logger.error(f"Failed to process {target}: {e}")
@@ -485,6 +413,7 @@ def save_results(
     output_dir.mkdir(parents=True, exist_ok=True)
 
     # Save to JSON file
+    output_path = None
     try:
         output_path = output_dir / f"artifacts_{timestamp}.json"
         json_content = json.dumps(data, indent=2)
@@ -562,8 +491,8 @@ def main(
     results = process_files(targets, artifact_types)
 
     # Save results
-    target_path = Path(target_path)
-    output_dir = target_path.parent / "extracted"
+    target_path_obj = Path(target_path)
+    output_dir = target_path_obj.parent / "extracted"
     save_results(
         output_dir,
         results,
```
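
The `target_path` to `target_path_obj` renames above follow one pattern: rebinding a `str` parameter to a `Path` gives the same name two types in one scope, which static type checkers flag and which makes later string-only uses (such as the `from_pdf(str(target), ...)` call) fragile. A simplified sketch of the pattern, not the repository's code:

```python
from pathlib import Path

SUPPORTED_SUFFIXES = {".pdf"}  # hypothetical stand-in for SUPPORTED_FILE_TYPES

def list_targets(target_path: str) -> list[Path]:
    # A distinct name keeps the str parameter and the Path object separate,
    # instead of rebinding target_path to a different type mid-function.
    target_path_obj = Path(target_path)
    if target_path_obj.is_file():
        return [target_path_obj]
    return [f for f in target_path_obj.iterdir() if f.suffix in SUPPORTED_SUFFIXES]
```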

src/tests/datasets/test_samsum_datasets.py

Lines changed: 16 additions & 10 deletions
```diff
@@ -1,32 +1,36 @@
 # Copyright (c) Meta Platforms, Inc. and affiliates.
 # This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
 
-import pytest
 from dataclasses import dataclass
 from functools import partial
 from unittest.mock import patch
+
+import pytest
 from datasets import load_dataset
 
+
 @dataclass
 class Config:
     model_type: str = "llama"
 
+
 try:
-    load_dataset("Samsung/samsum")
+    load_dataset("knkarthick/samsum")
     SAMSUM_UNAVAILABLE = False
 except ValueError:
     SAMSUM_UNAVAILABLE = True
 
+
 @pytest.mark.skipif(SAMSUM_UNAVAILABLE, reason="Samsum dataset is unavailable")
 @pytest.mark.skip_missing_tokenizer
-@patch('llama_cookbook.finetuning.train')
-@patch('llama_cookbook.finetuning.AutoTokenizer')
+@patch("llama_cookbook.finetuning.train")
+@patch("llama_cookbook.finetuning.AutoTokenizer")
 @patch("llama_cookbook.finetuning.AutoConfig.from_pretrained")
 @patch("llama_cookbook.finetuning.AutoProcessor")
 @patch("llama_cookbook.finetuning.MllamaForConditionalGeneration.from_pretrained")
-@patch('llama_cookbook.finetuning.LlamaForCausalLM.from_pretrained')
-@patch('llama_cookbook.finetuning.optim.AdamW')
-@patch('llama_cookbook.finetuning.StepLR')
+@patch("llama_cookbook.finetuning.LlamaForCausalLM.from_pretrained")
+@patch("llama_cookbook.finetuning.optim.AdamW")
+@patch("llama_cookbook.finetuning.StepLR")
 def test_samsum_dataset(
     step_lr,
     optimizer,
@@ -39,11 +43,13 @@ def test_samsum_dataset(
     mocker,
     setup_tokenizer,
     llama_version,
-    ):
+):
     from llama_cookbook.finetuning import main
 
     setup_tokenizer(tokenizer)
-    get_model.return_value.get_input_embeddings.return_value.weight.shape = [32000 if "Llama-2" in llama_version else 128256]
+    get_model.return_value.get_input_embeddings.return_value.weight.shape = [
+        32000 if "Llama-2" in llama_version else 128256
+    ]
     get_mmodel.return_value.get_input_embeddings.return_value.weight.shape = [0]
     get_config.return_value = Config()
 
@@ -55,7 +61,7 @@ def test_samsum_dataset(
         "use_peft": False,
         "dataset": "samsum_dataset",
         "batching_strategy": "padding",
-        }
+    }
 
     main(**kwargs)
```