fix: Address some issues in the split pdf logic #165

Merged
merged 6 commits into release/0.25.x from fix/retry-errors on Aug 31, 2024

Conversation

@awalker4 awalker4 commented Aug 31, 2024

We've encountered some bugs in the split pdf code. For one, the split requests are not retried. With the new split_pdf_allow_failed=False behavior, this means one transient network error can interrupt the whole doc. We've also had some asyncio warnings such as `... was never awaited`.

This PR adds retries, cleans up the logic, and gives us a much better base for the V2 client release.

Changes

Return a "dummy request" in the split BeforeRequestHook

When the BeforeRequestHook was called, we would split the doc into N requests, issue coroutines for N-1 of them, and return the last one for the SDK to run itself. That leaves two paths for recombining the results. Instead, the BeforeRequestHook can return a dummy request that will get a 200. This takes us straight to the AfterSuccessHook, which awaits all of the splits and builds the response.
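For illustration, here is a minimal sketch of the dummy-request idea; the class name, the before_request signature, and the healthcheck URL are assumptions for the example, not the PR's actual hook code.

```python
import httpx

# Illustrative sketch only: the real hook also parses the form params,
# enforces the concurrency limit, and tracks allow_failed state.
class SplitPdfBeforeRequestSketch:
    def __init__(self) -> None:
        self.split_tasks = []  # one coroutine/task per split, awaited later

    def before_request(self, request: httpx.Request) -> httpx.Request:
        # 1) Split the PDF and schedule one coroutine per split (elided here).
        # 2) Return a trivial "dummy" request that is guaranteed to get a 200,
        #    so control flows straight to the AfterSuccess hook, which awaits
        #    self.split_tasks and assembles the real response.
        return httpx.Request("GET", "http://localhost:8000/healthcheck")  # assumed URL
```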

Add retries to the split requests

This is a copy of the autogenerated code in retry.py, which will work for the async calls. At some point, we should be able to reuse the SDK for this so we aren't hardcoding the retry config values here. We'll need to work with Speakeasy on this.
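As a rough sketch of the retry shape for a single split (the function name, retry count, and backoff values here are placeholders, not the values hardcoded in the PR):

```python
import asyncio
import httpx

RETRYABLE_STATUS_CODES = {502, 503, 504}

# Placeholder sketch of retrying one split call with exponential backoff;
# the real code mirrors the SDK's autogenerated retry.py.
async def send_with_retries(
    client: httpx.AsyncClient,
    request: httpx.Request,
    max_retries: int = 3,
    base_delay: float = 1.0,
) -> httpx.Response:
    for attempt in range(max_retries + 1):
        try:
            response = await client.send(request)
            if response.status_code not in RETRYABLE_STATUS_CODES:
                return response
        except httpx.TransportError:
            if attempt == max_retries:
                raise
        if attempt < max_retries:
            await asyncio.sleep(base_delay * 2 ** attempt)
    return response  # last response was still retryable; caller handles the failure
```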

Clean up error handling

When the retries fail and we do have to bubble up an error, we pass it to create_failure_response before returning to the SDK. This sets a 500 status code on the response, purely so the SDK does not see a 502/503/504 and proceed to retry the entire doc.
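A minimal sketch of that status swap, assuming a shape for the helper (this is not a copy of the real create_failure_response):

```python
import json
import requests

# Sketch only: whatever upstream status the splits ultimately failed with
# (502/503/504), hand the SDK a 500 so its own retry layer does not
# re-run the entire document.
def create_failure_response_sketch(detail: str) -> requests.Response:
    response = requests.Response()
    response.status_code = 500
    response._content = json.dumps({"detail": detail}).encode()
    response.headers["Content-Type"] = "application/json"
    return response
```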

Set an httpx timeout

Many of the failing requests right now are hi_res calls. This is because the default httpx client timeout is 5 seconds, so we immediately throw a ReadTimeout. For now, set this timeout to 10 minutes. That should be sufficient in the splitting code, where the number of pages per request is controlled. This is another hardcoded value that should go away once we're able to send our splits back into sdk.general.partition.
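Setting that timeout on the client looks roughly like this; 600 seconds matches the 10 minutes above, and the client construction details are otherwise illustrative.

```python
import httpx

# 10-minute timeout for the per-split requests; the httpx default of
# 5 seconds is too short for hi_res pages.
async_client = httpx.AsyncClient(timeout=httpx.Timeout(600.0))
```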

Testing

Any pipelines that have failed consistently should work now. For more fine-grained testing, I tend to mock up my local server to return a retryable error for specific pages, a certain number of times. In the general_partition function, I add something like:

    global num_bounces # Initialize this somewhere up above
    page_num = form_params.starting_page_number or 1
    if num_bounces > 0 and page_num == 3:
        num_bounces -= 1
        logger.info(page_num)
        raise HTTPException(status_code=502, detail="BOUNCE")

Then, send an SDK request to your local server and verify that the split request for page 3 of your doc retries up to the number of times you want.

Also, setting the max concurrency to 15 should reproduce the issue. Choose some 50+ page PDF and try the following with the current 0.25.5 branch. It will likely fail with `ServerError: {}`. Then try a local pip install off this branch.

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared, operations

s = UnstructuredClient(
    api_key_auth="my-api-key",
)

filename = "some-large-pdf"
with open(filename, "rb") as f:
    files=shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = operations.PartitionRequest(
    shared.PartitionParameters(
        files=files,
        split_pdf_page=True,
        strategy="hi_res",
        split_pdf_allow_failed=False,
        split_pdf_concurrency_level=15,
    ),
)

resp = s.general.partition(req)

if num_elements := len(resp.elements):
    print(f"Succeeded with {num_elements}")

Simplify the splitting logic like so:

BeforeRequest:
* Split up the doc, issue coroutines to send off all the splits
* Return a request that just returns 200

AfterSuccess:
* Await all of the splits, or stop early if allow_failed=False
* Build a new requests.Response with whatever elements we have

When the SDK is responsible for the last page, we need branching logic to capture the result and append the coroutine elements.
If the splitting code retries and fails on a 502, 503, or 504, we just want to return the error. Let's set a 500 status code before handing it back to the SDK, otherwise it will want to retry the whole document again.
@awalker4 awalker4 changed the base branch from main to release/0.25.x August 31, 2024 01:24
@awalker4 awalker4 changed the title from Fix/retry errors to fix: Address some issues in the split pdf logic Aug 31, 2024
@awalker4 awalker4 marked this pull request as ready for review August 31, 2024 14:25
@amanda103 amanda103 left a comment

tested locally - looks good to me!

@awalker4 awalker4 merged commit 98028ec into release/0.25.x Aug 31, 2024
@awalker4 awalker4 deleted the fix/retry-errors branch August 31, 2024 16:18
awalker4 added a commit that referenced this pull request Sep 10, 2024
This is a copy of [the
fixes](#165)
that went into the release/0.25.x branch, updated slightly for the v2
SDK. The main branch is already retrying split requests, but we do not
have an updated httpx timeout value, and the retries are not logged.

# To verify
Same as the linked PR, we can mock arbitrary errors from the server and
verify the behavior.

Apply this patch to your local unstructured-api, and do `make
run-web-app`. Whenever we send a request starting at page 3, the server
will sleep for a bit, and then return a 502 error.

```
diff --git a/prepline_general/api/general.py b/prepline_general/api/general.py
index 2ca56270..0624076c 100644
*** a/unstructured-api/prepline_general/api/general.py
--- b/unstructured-api/prepline_general/api/general.py
***************
*** 812,817 ****
--- 814,822 ----
      )
  
  
+ global num_bounces
+ num_bounces = 2
+ 
  @elasticapm.capture_span()
  @router.post(
      "/general/v0/general",
***************
*** 835,840 ****
--- 840,854 ----
      # For new parameters - add them in models/form_params.py
      form_params: GeneralFormParams = Depends(GeneralFormParams.as_form),
  ):
+ 
+     global num_bounces
+     page_num = form_params.starting_page_number or 1
+     if num_bounces > 0 and page_num == 3:
+         num_bounces -= 1
+         time.sleep(10)
+         raise HTTPException(status_code=502, detail="BOUNCE")
+ 
      # -- must have a valid API key --
      if api_key_env := os.environ.get("UNSTRUCTURED_API_KEY"):
          api_key = request.headers.get("unstructured-api-key")
```

Use the main branch to make a client request using layout-parser-paper,
which should default to two-page splits, with the second split starting at
page 3. Because the httpx timeout is not specified, this split will
raise a ReadTimeout error, and get stuck in a retry loop. This is not
logged, so it will appear to be a client hang.

```
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared, operations

filename = "_sample_docs/layout-parser-paper.pdf"

s = UnstructuredClient(
    server_url="http://localhost:5000",
)

with open(filename, "rb") as f:
    files=shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = operations.PartitionRequest(
    shared.PartitionParameters(
        files=files,
        strategy="fast",
        split_pdf_page=True,
    ),
)

resp = s.general.partition(req)

if num_elements := len(resp.elements):
    print(f"Succeeded with {num_elements}")
```

Finally, make the same request against this branch. The bad split should
be retried for the allotted 2 times, with logging, and then the doc will
succeed.
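To actually see those retry logs from the client script, enabling INFO-level logging is enough. The exact logger name inside the client isn't shown here, so this just configures the root logger, assuming the client emits its messages through the standard logging module.

```python
import logging

# Surface the client's retry messages on the console.
logging.basicConfig(level=logging.INFO)
```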