batch-doc-pipeline

This repo sets up a batch pipeline that converts PDFs into text files using Azure ML's native pipeline capabilities and Azure Form Recognizer (soon to be Azure AI Document Intelligence). This is a custom version of this repo, though this repo does not split PDFs by page.
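The core step of such a pipeline (one PDF in, one text file out) might look roughly like the sketch below. It is an illustration, not the code in this repo: the function names `output_path_for` and `pdf_to_text`, the choice of the `prebuilt-read` model, and key-based authentication are all assumptions. The Azure calls come from the `azure-ai-formrecognizer` package.

```python
import os


def output_path_for(pdf_path: str, out_dir: str) -> str:
    """Map 'some/dir/report.pdf' to '<out_dir>/report.txt' (hypothetical naming scheme)."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return os.path.join(out_dir, stem + ".txt")


def pdf_to_text(pdf_path: str, out_dir: str, endpoint: str, key: str) -> str:
    """Send one PDF to Form Recognizer and write the extracted text to a .txt file."""
    # Imported lazily so the path helper above works without the Azure SDK installed.
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentAnalysisClient(
        endpoint=endpoint, credential=AzureKeyCredential(key)
    )

    with open(pdf_path, "rb") as pdf_file:
        # Assumption: the general-purpose "prebuilt-read" model is sufficient here.
        poller = client.begin_analyze_document("prebuilt-read", document=pdf_file)
        result = poller.result()

    os.makedirs(out_dir, exist_ok=True)
    out_path = output_path_for(pdf_path, out_dir)
    with open(out_path, "w", encoding="utf-8") as out_file:
        # result.content holds the full text of the document in reading order.
        out_file.write(result.content)
    return out_path
```

In an Azure ML pipeline, a function like `pdf_to_text` would typically be called once per input file inside a pipeline step, with the endpoint and key supplied through configuration rather than hard-coded.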

Other considerations

Ensure that special characters in PDF file names, such as +, do not cause issues during processing. The operations above do not handle them explicitly.
Large PDF files can sometimes cause out-of-memory issues. Either change the compute configuration or filter out the larger files and process them separately.
As of the current update (May 2024), azure-ai-formrecognizer is at version 3.1 and generally available (GA). Over time, however, it will give way to azure-ai-documentintelligence, which is currently at version 4.0 and in preview. This repo uses the former.
In terms of RBAC, both the Azure ML workspace and the service principal have Contributor access to the storage account. Additionally, the workspace has Storage Blob Data Contributor access to the storage account.
For Form Recognizer, you can enable auto-scaling to avoid throttling issues.
It is critical to understand which SDK version maps to which API version, as listed here.
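
The first two considerations (special characters in file names and oversized PDFs) could be addressed in a small pre-processing step. The following is a minimal sketch using only the Python standard library. The helper names, the allowlist of safe characters, and the idea of a byte-size threshold are assumptions for illustration and are not part of this repo.

```python
import os
import re
from typing import Iterable, List, Tuple

# Hypothetical allowlist: letters, digits, dot, underscore, and hyphen.
_UNSAFE_CHARS = re.compile(r"[^A-Za-z0-9._-]")


def sanitize_file_name(name: str) -> str:
    """Replace characters outside the allowlist (e.g. '+' or spaces) with '_'."""
    return _UNSAFE_CHARS.sub("_", name)


def partition_by_size(
    paths: Iterable[str], max_bytes: int
) -> Tuple[List[str], List[str]]:
    """Split paths into (small, large) lists based on file size on disk."""
    small: List[str] = []
    large: List[str] = []
    for path in paths:
        if os.path.getsize(path) <= max_bytes:
            small.append(path)
        else:
            large.append(path)
    return small, large
```

Files in the `small` list could go through the regular batch run, while those in `large` are routed to a compute target with more memory or processed one at a time. Because sanitizing can map two different names to the same result, it is worth keeping a record of the original names.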

