fix: Address some issues in the split pdf logic #165

Merged
merged 6 commits into release/0.25.x from fix/retry-errors on Aug 31, 2024

Conversation

@awalker4 awalker4 commented Aug 31, 2024

We've encountered some bugs in the split pdf code. For one, the split requests are not retried. With the new split_pdf_allow_failed=False behavior, this means one transient network error can interrupt the whole doc. We've also had some asyncio warnings such as `... was never awaited`.

This PR adds retries, cleans up the logic, and gives us a much better base for the V2 client release.

Changes

Return a "dummy request" in the split BeforeRequestHook

When the BeforeRequestHook was called, we would split the doc into N requests, issue coroutines for N-1 of them, and return the last one for the SDK to run itself. That leaves two paths for recombining the results. Instead, the BeforeRequestHook can return a dummy request that will get a 200. This takes us straight to the AfterSuccessHook, which awaits all of the splits and builds the response.
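For illustration, here is a minimal sketch of the dummy-request idea; the class name, the before_request signature, and the healthcheck URL are assumptions for the example, not the PR's actual hook code.

```python
import httpx

# Illustrative sketch only: the real hook also parses the form params,
# enforces the concurrency limit, and tracks allow_failed state.
class SplitPdfBeforeRequestSketch:
    def __init__(self) -> None:
        self.split_tasks = []  # one coroutine/task per split, awaited later

    def before_request(self, request: httpx.Request) -> httpx.Request:
        # 1) Split the PDF and schedule one coroutine per split (elided here).
        # 2) Return a trivial "dummy" request that is guaranteed to get a 200,
        #    so control flows straight to the AfterSuccess hook, which awaits
        #    self.split_tasks and assembles the real response.
        return httpx.Request("GET", "http://localhost:8000/healthcheck")  # assumed URL
```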

Add retries to the split requests

This is a copy of the autogenerated code in retry.py, which will work for the async calls. At some point, we should be able to reuse the SDK for this so we aren't hardcoding the retry config values here. We'll need to work with Speakeasy on this.
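As a rough sketch of the retry shape for a single split (the function name, retry count, and backoff values here are placeholders, not the values hardcoded in the PR):

```python
import asyncio
import httpx

RETRYABLE_STATUS_CODES = {502, 503, 504}

# Placeholder sketch of retrying one split call with exponential backoff;
# the real code mirrors the SDK's autogenerated retry.py.
async def send_with_retries(
    client: httpx.AsyncClient,
    request: httpx.Request,
    max_retries: int = 3,
    base_delay: float = 1.0,
) -> httpx.Response:
    for attempt in range(max_retries + 1):
        try:
            response = await client.send(request)
            if response.status_code not in RETRYABLE_STATUS_CODES:
                return response
        except httpx.TransportError:
            if attempt == max_retries:
                raise
        if attempt < max_retries:
            await asyncio.sleep(base_delay * 2 ** attempt)
    return response  # last response was still retryable; caller handles the failure
```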

Clean up error handling

When the retries fail and we do have to bubble up an error, we pass it to create_failure_response before returning to the SDK. This sets a 500 status code on the response, purely so the SDK does not see a 502/503/504 and proceed to retry the entire doc.
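A minimal sketch of that status swap, assuming a shape for the helper (this is not a copy of the real create_failure_response):

```python
import json
import requests

# Sketch only: whatever upstream status the splits ultimately failed with
# (502/503/504), hand the SDK a 500 so its own retry layer does not
# re-run the entire document.
def create_failure_response_sketch(detail: str) -> requests.Response:
    response = requests.Response()
    response.status_code = 500
    response._content = json.dumps({"detail": detail}).encode()
    response.headers["Content-Type"] = "application/json"
    return response
```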

Set an httpx timeout

Many of the failing requests right now are hi_res calls. This is because the default httpx client timeout is 5 seconds, so we immediately throw a ReadTimeout. For now, set this timeout to 10 minutes. That should be sufficient in the splitting code, where the number of pages per request is controlled. This is another hardcoded value that should go away once we're able to send our splits back into sdk.general.partition.
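Setting that timeout on the client looks roughly like this; 600 seconds matches the 10 minutes above, and the client construction details are otherwise illustrative.

```python
import httpx

# 10-minute timeout for the per-split requests; the httpx default of
# 5 seconds is too short for hi_res pages.
async_client = httpx.AsyncClient(timeout=httpx.Timeout(600.0))
```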

Testing

Any pipelines that have failed consistently should work now. For more fine-grained testing, I tend to mock up my local server to return a retryable error for specific pages, a certain number of times. In the general_partition function, I add something like:

    global num_bounces # Initialize this somewhere up above
    page_num = form_params.starting_page_number or 1
    if num_bounces > 0 and page_num == 3:
        num_bounces -= 1
        logger.info(page_num)
        raise HTTPException(status_code=502, detail="BOUNCE")

Then, send an SDK request to your local server and verify that the split request for page 3 of your doc retries up to the number of times you want.

Also, setting the max concurrency to 15 should reproduce the issue. Choose some 50+ page PDF and try the following with the current 0.25.5 branch. It will likely fail with `ServerError: {}`. Then try a local pip install off this branch.

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared, operations

s = UnstructuredClient(
    api_key_auth="my-api-key",
)

filename = "some-large-pdf"
with open(filename, "rb") as f:
    files=shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = operations.PartitionRequest(
    shared.PartitionParameters(
        files=files,
        split_pdf_page=True,
        strategy="hi_res",
        split_pdf_allow_failed=False,
        split_pdf_concurrency_level=15,
    ),
)

resp = s.general.partition(req)

if num_elements := len(resp.elements):
    print(f"Succeeded with {num_elements}")

Simplify the splitting logic like so:

BeforeRequest:
* Split up the doc, issue coroutines to send off all the splits
* Return a request that just returns 200

AfterSuccess:
* Await all of the splits, or stop early if allow_failed=False
* Build a new requests.Response with whatever elements we have

When the SDK is responsible for the last page, we need branching logic to capture the result and append the coroutine elements.
If the splitting code retries and fails on a 502, 503, or 504, we just want to return the error. Let's set a 500 status code before handing it back to the SDK, otherwise it will want to retry the whole document again.
@awalker4 awalker4 changed the base branch from main to release/0.25.x August 31, 2024 01:24
@awalker4 awalker4 changed the title from Fix/retry errors to fix: Address some issues in the split pdf logic Aug 31, 2024
@awalker4 awalker4 marked this pull request as ready for review August 31, 2024 14:25
@amanda103 amanda103 left a comment

tested locally - looks good to me!

@awalker4 awalker4 merged commit 98028ec into release/0.25.x Aug 31, 2024
@awalker4 awalker4 deleted the fix/retry-errors branch August 31, 2024 16:18
awalker4 added a commit that referenced this pull request Sep 10, 2024
This is a copy of [the
fixes](#165)
that went into the release/0.25.x branch, updated slightly for the v2
SDK. The main branch is already retrying split requests, but we do not
have an updated httpx timeout value, and the retries are not logged.

# To verify
Same as the linked PR, we can mock arbitrary errors from the server and
verify the behavior.

Apply this patch to your local unstructured-api, and do `make
run-web-app`. Whenever we send a request starting at page 3, the server
will sleep for a bit, and then return a 502 error.

```
diff --git a/prepline_general/api/general.py b/prepline_general/api/general.py
index 2ca56270..0624076c 100644
*** a/unstructured-api/prepline_general/api/general.py
--- b/unstructured-api/prepline_general/api/general.py
***************
*** 812,817 ****
--- 814,822 ----
      )
  
  
+ global num_bounces
+ num_bounces = 2
+ 
  @elasticapm.capture_span()
  @router.post(
      "/general/v0/general",
***************
*** 835,840 ****
--- 840,854 ----
      # For new parameters - add them in models/form_params.py
      form_params: GeneralFormParams = Depends(GeneralFormParams.as_form),
  ):
+ 
+     global num_bounces
+     page_num = form_params.starting_page_number or 1
+     if num_bounces > 0 and page_num == 3:
+         num_bounces -= 1
+         time.sleep(10)
+         raise HTTPException(status_code=502, detail="BOUNCE")
+ 
      # -- must have a valid API key --
      if api_key_env := os.environ.get("UNSTRUCTURED_API_KEY"):
          api_key = request.headers.get("unstructured-api-key")
```

Use the main branch to make a client request using layout-parser-paper,
which should default to two-page splits, with the second split starting at
page 3. Because the httpx timeout is not specified, this split will
raise a ReadTimeout error, and get stuck in a retry loop. This is not
logged, so it will appear to be a client hang.

```
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared, operations

filename = "_sample_docs/layout-parser-paper.pdf"

s = UnstructuredClient(
    server_url="http://localhost:5000",
)

with open(filename, "rb") as f:
    files=shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = operations.PartitionRequest(
    shared.PartitionParameters(
        files=files,
        strategy="fast",
        split_pdf_page=True,
    ),
)

resp = s.general.partition(req)

if num_elements := len(resp.elements):
    print(f"Succeeded with {num_elements}")
```

Finally, make the same request against this branch. The bad split should
be retried for the allotted 2 times, with logging, and then the doc will
succeed.
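To actually see those retry logs from the client script, enabling INFO-level logging is enough. The exact logger name inside the client isn't shown here, so this just configures the root logger, assuming the client emits its messages through the standard logging module.

```python
import logging

# Surface the client's retry messages on the console.
logging.basicConfig(level=logging.INFO)
```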