[Question - Document Intelligence] Streaming large files #37662

Open
@ai-learner-00

Description

  • Package Name: azure-ai-documentintelligence
  • Package Version: 1.0.0b4
  • Operating System: Windows
  • Python Version: 3.10

Describe the bug
I want to confirm the proper way to stream large files. Does passing an AnalyzeDocumentRequest make the client build a JSON payload, which would be less efficient than sending the raw bytes?

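    from azure.ai.documentintelligence.aio import DocumentIntelligenceClient
    from azure.ai.documentintelligence.models import AnalyzeDocumentRequest, AnalyzeResult, ContentFormat
    from azure.core.credentials import AzureKeyCredential

    async def get_analyze_result(self, document_data: bytes) -> AnalyzeResult:
        """
        Analyze a document with the prebuilt layout model and return the result as Markdown
        """
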
        document_intelligence_client = DocumentIntelligenceClient(
            endpoint=self.document_intelligence_endpoint,
            credential=AzureKeyCredential(key=self.document_intelligence_key),
        )

        async with document_intelligence_client:
            poller = await document_intelligence_client.begin_analyze_document(
                analyze_request=AnalyzeDocumentRequest(
                    bytes_source=document_data),
                model_id="prebuilt-layout",
                output_content_format=ContentFormat.MARKDOWN,
            )

            analyze_result = await poller.result()
            return analyze_result
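
For a rough sense of the cost being asked about, the sketch below compares the size of the raw bytes with the size of the same bytes wrapped in a JSON body. It assumes the request model is serialized as JSON with the document base64-encoded under a `base64Source` field; that body shape and the helper `json_payload_size` are assumptions for illustration, not confirmed SDK behavior.

```python
import base64
import json


def json_payload_size(document_data: bytes) -> int:
    """Return the size in bytes of a JSON body carrying the document as base64 text.

    Assumption: the body looks like {"base64Source": "<base64 text>"}.
    """
    encoded = base64.b64encode(document_data).decode("ascii")
    return len(json.dumps({"base64Source": encoded}).encode("utf-8"))


if __name__ == "__main__":
    raw = bytes(3_000_000)  # stand-in for a 3 MB document
    wrapped = json_payload_size(raw)
    print(f"raw: {len(raw)} bytes, JSON: {wrapped} bytes, ratio: {wrapped / len(raw):.2f}")
```

Base64 encodes every 3 bytes as 4 characters, so under this assumption the body is roughly a third larger than the file, and the whole document has to be held in memory before the request is sent.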

Samples

Does the following code stream the file without blocking the event loop thread? I don't think a BufferedReader has async methods. What chunk size does it use?

with open(path_to_sample_documents, "rb") as f:
    poller = await document_intelligence_client.begin_analyze_document(
        model_id=model_id, analyze_request=f, content_type="application/octet-stream"
    )
result: AnalyzeResult = await poller.result()
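
One way to keep a blocking `BufferedReader` from stalling the event loop is to push each `read` call onto a worker thread and expose the chunks through an async generator. The following is a minimal sketch of that idea using `asyncio.to_thread` (Python 3.9+). `iter_file_chunks` and the 64 KiB chunk size are hypothetical choices, not values taken from the SDK, and whether `begin_analyze_document` accepts an async iterator as its body is not confirmed here.

```python
import asyncio
import io
from typing import AsyncIterator, BinaryIO

CHUNK_SIZE = 64 * 1024  # hypothetical chunk size, not an SDK default


async def iter_file_chunks(f: BinaryIO, chunk_size: int = CHUNK_SIZE) -> AsyncIterator[bytes]:
    """Yield chunks from a blocking file object without blocking the event loop."""
    while True:
        # The blocking read runs in a worker thread, so the loop stays responsive.
        chunk = await asyncio.to_thread(f.read, chunk_size)
        if not chunk:
            break
        yield chunk


async def collect_chunks(f: BinaryIO, chunk_size: int) -> list:
    """Drain the generator into a list (for demonstration only)."""
    return [chunk async for chunk in iter_file_chunks(f, chunk_size)]


if __name__ == "__main__":
    # An in-memory buffer stands in for a file opened with open(path, "rb").
    print(asyncio.run(collect_chunks(io.BytesIO(b"abcdefgh"), 3)))
```

This approach still relies on threads. aiofiles works the same way internally, delegating file operations to a thread pool, so it does not avoid threads either.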

Expected behavior
I expected to be able to pass an AsyncBufferedReader, so that reading the file neither blocks the current thread nor requires creating additional threads.

import aiofiles

async with aiofiles.open('t.pdf', mode='rb') as f: # AsyncBufferedReader
    content = await f.read()

I intend to use it with FastAPI's UploadFile, which has an awaitable file.read(size) method. A protocol may be needed so that the client works with both AsyncBufferedReader and UploadFile.
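
The protocol idea could look like the sketch below: a structural type with a single awaitable `read(size)` method, which both aiofiles handles and FastAPI's `UploadFile` provide, plus an async generator that turns any such reader into chunks. `AsyncReadable`, `iter_chunks`, and the in-memory `FakeAsyncReader` are hypothetical names for illustration. Whether the client accepts the resulting async iterator as a request body is an assumption that would need confirming.

```python
import asyncio
from typing import AsyncIterator, Protocol


class AsyncReadable(Protocol):
    """Anything with an awaitable read(size), e.g. an aiofiles handle or UploadFile."""

    async def read(self, size: int = -1) -> bytes: ...


async def iter_chunks(reader: AsyncReadable, chunk_size: int = 64 * 1024) -> AsyncIterator[bytes]:
    """Yield fixed-size chunks until the reader returns an empty bytes object."""
    while True:
        chunk = await reader.read(chunk_size)
        if not chunk:
            break
        yield chunk


class FakeAsyncReader:
    """In-memory stand-in so the sketch runs without aiofiles or FastAPI."""

    def __init__(self, data: bytes) -> None:
        self._data = data
        self._pos = 0

    async def read(self, size: int = -1) -> bytes:
        end = len(self._data) if size < 0 else self._pos + size
        chunk = self._data[self._pos:end]
        self._pos += len(chunk)
        return chunk


async def main() -> list:
    reader = FakeAsyncReader(b"0123456789")
    return [chunk async for chunk in iter_chunks(reader, chunk_size=4)]


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Because `Protocol` uses structural typing, neither aiofiles nor `UploadFile` needs to inherit from `AsyncReadable`; a matching `read` method is enough for a type checker to accept them.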

Labels

Client, Document Intelligence, Service Attention, customer-reported, needs-team-attention, question
