Skip to content

fix: Address some issues in the split pdf logic #170

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Sep 10, 2024
Merged

Conversation

awalker4
Copy link
Collaborator

@awalker4 awalker4 commented Sep 4, 2024

This is a copy of the fixes that went into the release/0.25.x branch, updated slightly for the v2 SDK. The main branch is already retrying split requests, but we do not have an updated httpx timeout value, and the retries are not logged.

To verify

Same as the linked pr, we can mock arbitrary errors from the server and verify the behavior.

Apply this patch to your local unstructured-api, and do make run-web-app. Whenever we send a request starting at page 3, the server will sleep for a bit, and then return a 502 error.

diff --git a/prepline_general/api/general.py b/prepline_general/api/general.py
index 2ca56270..0624076c 100644
*** a/unstructured-api/prepline_general/api/general.py
--- b/unstructured-api/prepline_general/api/general.py
***************
*** 812,817 ****
--- 814,822 ----
      )
  
  
+ global num_bounces
+ num_bounces = 2
+ 
  @elasticapm.capture_span()
  @router.post(
      "/general/v0/general",
***************
*** 835,840 ****
--- 840,854 ----
      # For new parameters - add them in models/form_params.py
      form_params: GeneralFormParams = Depends(GeneralFormParams.as_form),
  ):
+ 
+     global num_bounces
+     page_num = form_params.starting_page_number or 1
+     if num_bounces > 0 and page_num == 3:
+         num_bounces -= 1
+         time.sleep(10)
+         raise HTTPException(status_code=502, detail="BOUNCE")
+ 
      # -- must have a valid API key --
      if api_key_env := os.environ.get("UNSTRUCTURED_API_KEY"):
          api_key = request.headers.get("unstructured-api-key")

Use the main branch to make a client request using layout-parser-paper, which should default to 2 page splits, with the second split starting at page 3. Because the httpx timeout is not specified, this split will raise a ReadTimeout error, and get stuck in a retry loop. This is not logged, so it will appear to be a client hang.

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared, operations

filename = "_sample_docs/layout-parser-paper.pdf"

s = UnstructuredClient(
    server_url="http://localhost:5000",
)

with open(filename, "rb") as f:
    files=shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = operations.PartitionRequest(
    shared.PartitionParameters(
        files=files,
        strategy="fast",
        split_pdf_page=True,
    ),
)

resp = s.general.partition(req)

if num_elements := len(resp.elements):
    print(f"Succeeded with {num_elements}")

Finally, make the same request against this branch. The bad split should be retried for the allotted 2 times, with logging, and then the doc will succeed.

Simplify the splitting logic like so:

BeforeRequest:
* Split up the doc, issue coroutines to send off all the splits
* Return a request that just returns 200

AfterSuccess:
* Await all of the splits, or stop early if allow_failed=False
* Build a new requests.Response with whatever elements we have

When the SDK is responsible for the last page, we need branching logic to capture the result and
append the coroutine elements.
- Do not log when SDK calls the after_success hooks. This is now the success of the mock request and
does not mean the doc is finished.
- Move the logging of successful/failed splits. Use debug level.
The autogenerated code does not log when it retries. We need to know what's happening, otherwise the
user experiences a hanging process.
@awalker4 awalker4 changed the title Fix/retry fixes 0.26.0 fix: Address some issues in the split pdf logic Sep 4, 2024
@awalker4 awalker4 marked this pull request as ready for review September 4, 2024 22:20
Copy link

@JOSHMT0744 JOSHMT0744 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really like these changes :)
On test_decorators.py, if endpoint url isn't "/general/v0/general", this could cause issues..?

@awalker4 awalker4 force-pushed the fix/retry-fixes-0.26.0 branch 2 times, most recently from 45e561d to 5af1018 Compare September 9, 2024 18:04
@awalker4
Copy link
Collaborator Author

awalker4 commented Sep 9, 2024

Really like these changes :) On test_decorators.py, if endpoint url isn't "/general/v0/general", this could cause issues..?

Thanks for the review! For that test - I was able to change the logic that required that check, so I can remove that now.

@awalker4 awalker4 force-pushed the fix/retry-fixes-0.26.0 branch from 5af1018 to a77b554 Compare September 9, 2024 18:21
@awalker4 awalker4 force-pushed the fix/retry-fixes-0.26.0 branch from a77b554 to b8d0af9 Compare September 9, 2024 18:23
@awalker4 awalker4 enabled auto-merge (squash) September 9, 2024 20:33
# When we're able to reuse the SDK to make these calls, we can remove this var
# The SDK timeout will be controlled by parameter
client_timeout_minutes = 30
if timeout_var := os.getenv("UNSTRUCTURED_CLIENT_TIMEOUT_MINUTES"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the client user have to know to set this environment variable? if so should the default go the other way around maybe? just thinking about how it might be challenging waiting up to 30 minutes without realizing why it's hanging so long

Copy link
Collaborator Author

@awalker4 awalker4 Sep 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've had some users get stuck here. For very large pdfs, a hi_res call can sit for 30 minutes, at which point they get a httpx.ReadTimeout, the doc gets retried, and the same thing happens again. This should be rare, but we wanted the timeout to account for the worst case. And we added the env var so we could ask users to bump the timeout manually if this still wasn't resolved.

There is a http timeout parameter for the SDK now, but due to some gaps in the hook context, we don't have access to that param from down here. Longer term, that should be the value that users can configure, and we can probably have a much saner default. The env var here is a bit of a bandaid, so I'm keeping it out of the readme, etc.

The hooks do not give us access to the AsyncClient, which breaks the mock no-op request in the
splitting hook. For now, let's add it as a param to the sdk_init hook. This means we'll have to
bring this code back whenever the SDK regenerates. At least, until we have an update from Speakeasy
for this.
@awalker4 awalker4 force-pushed the fix/retry-fixes-0.26.0 branch from d1dad0e to dee638f Compare September 9, 2024 21:58
No need to test the base SDK hook behavior
Instead of copying out this code, let's reuse the SDK util function. Also, add the fix to use
asyncio.sleep in the retry function. This will be updated when we regenerate the SDK, but we can
change it manually for now.
@awalker4 awalker4 merged commit 6b81735 into main Sep 10, 2024
7 checks passed
@awalker4 awalker4 deleted the fix/retry-fixes-0.26.0 branch September 10, 2024 13:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants