The purpose of this repo is to set up a batch pipeline that processes PDFs into text files, leveraging Azure ML's native pipeline capabilities and Azure Form Recognizer (soon to be Azure AI Document Intelligence). This is a custom version of this repo, though the version here does not split PDFs by page.
- With PDF file names, ensure special characters like `+` don't cause issues while processing. This is not specifically handled in the above operations.
- Given the size of the PDF files being processed, out-of-memory issues can sometimes occur. Either change the compute configuration or filter out the larger files and process them independently.
- As of the current update (May 2024), azure-ai-form-recognizer was at version 3.1 and GA. Over time, however, this will give way to azure-ai-documentintelligence, which is currently at version 4.0 and in preview. This repo uses the former.
- In terms of RBAC, both the Azure ML workspace and the service principal have `Contributor` access to the storage account. Additionally, the workspace has `Storage Blob Data Contributor` access to the storage account.
- For Form Recognizer, you can enable auto-scale to avoid throttling issues.
- It is critical to understand which SDK version maps to which API version, as listed here.
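
The first two notes (special characters in file names, and oversized PDFs) are not handled by this repo. One possible way to approach them is sketched below, using only the Python standard library. The helper names `safe_name` and `partition_by_size` and the `MAX_PDF_BYTES` threshold are hypothetical and not part of this repo; treat this as a starting point under those assumptions rather than the pipeline's actual behavior.

```python
import re
from pathlib import Path
from typing import Dict, List, Tuple

# Hypothetical threshold (50 MiB); tune it to the memory of your compute target.
MAX_PDF_BYTES = 50 * 1024 * 1024


def safe_name(file_name: str) -> str:
    """Replace runs of characters outside [A-Za-z0-9._-] in the stem with '_'."""
    path = Path(file_name)
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", path.stem).strip("_")
    # Fall back to a placeholder if nothing usable is left in the stem.
    return (cleaned or "unnamed") + path.suffix.lower()


def partition_by_size(
    sizes: Dict[str, int], limit: int = MAX_PDF_BYTES
) -> Tuple[List[str], List[str]]:
    """Split a {file name: size in bytes} mapping into (regular, oversized) lists."""
    regular = [name for name, size in sizes.items() if size <= limit]
    oversized = [name for name, size in sizes.items() if size > limit]
    return regular, oversized


if __name__ == "__main__":
    print(safe_name("Q1+Q2 report (final).PDF"))
    print(partition_by_size({"small.pdf": 1_000, "huge.pdf": 10**9}))
```

Note that two different source names can map to the same sanitized name, so keep a mapping back to the original name (or append a short hash) if uniqueness matters. The oversized list can then be routed to a larger compute target or processed separately.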