fix: Address some issues in the split pdf logic #165
Merged
Conversation
Simplify the splitting logic like so:

BeforeRequest:
* Split up the doc, issue coroutines to send off all the splits
* Return a request that just returns 200

AfterSuccess:
* Await all of the splits, or stop early if allow_failed=False
* Build a new requests.Response with whatever elements we have

When the SDK is responsible for the last page, we need branching logic to capture the result and append the coroutine elements.
If the splitting code retries and fails on a 502, 503, or 504, we just want to return the error. Let's set a 500 status code before handing it back to the SDK; otherwise it will want to retry the whole document again.
amanda103 approved these changes on Aug 31, 2024
tested locally - looks good to me!
awalker4 added a commit that referenced this pull request on Sep 10, 2024
This is a copy of [the fixes](#165) that went into the release/0.25.x branch, updated slightly for the v2 SDK. The main branch is already retrying split requests, but we do not have an updated httpx timeout value, and the retries are not logged.

# To verify

Same as the linked pr, we can mock arbitrary errors from the server and verify the behavior. Apply this patch to your local unstructured-api, and do `make run-web-app`. Whenever we send a request starting at page 3, the server will sleep for a bit, and then return a 502 error.

```diff
diff --git a/prepline_general/api/general.py b/prepline_general/api/general.py
index 2ca56270..0624076c 100644
*** a/unstructured-api/prepline_general/api/general.py
--- b/unstructured-api/prepline_general/api/general.py
***************
*** 812,817 ****
--- 814,822 ----
  )

+ global num_bounces
+ num_bounces = 2
+
  @elasticapm.capture_span()
  @router.post(
      "/general/v0/general",
***************
*** 835,840 ****
--- 840,854 ----
      # For new parameters - add them in models/form_params.py
      form_params: GeneralFormParams = Depends(GeneralFormParams.as_form),
  ):
+
+     global num_bounces
+     page_num = form_params.starting_page_number or 1
+     if num_bounces > 0 and page_num == 3:
+         num_bounces -= 1
+         time.sleep(10)
+         raise HTTPException(status_code=502, detail="BOUNCE")
+
      # -- must have a valid API key --
      if api_key_env := os.environ.get("UNSTRUCTURED_API_KEY"):
          api_key = request.headers.get("unstructured-api-key")
```

Use the main branch to make a client request using layout-parser-paper, which should default to 2-page splits, with the second split starting at page 3. Because the httpx timeout is not specified, this split will raise a ReadTimeout error and get stuck in a retry loop. This is not logged, so it will appear to be a client hang.
```python
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared, operations

filename = "_sample_docs/layout-parser-paper.pdf"

s = UnstructuredClient(
    server_url="http://localhost:5000",
)

with open(filename, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = operations.PartitionRequest(
    shared.PartitionParameters(
        files=files,
        strategy="fast",
        split_pdf_page=True,
    ),
)

resp = s.general.partition(req)

if num_elements := len(resp.elements):
    print(f"Succeeded with {num_elements}")
```

Finally, make the same request against this branch. The bad split should be retried for the allotted 2 times, with logging, and then the doc will succeed.
We've encountered some bugs in the split pdf code. For one, these requests are not retried. With the new `split_pdf_allow_failed=False` behavior, this means one transient network error can interrupt the whole doc. We've also had some asyncio warnings of the form `... was never awaited`. This PR adds retries, cleans up the logic, and gives us a much better base for the V2 client release.
Changes
Return a "dummy request" in the split BeforeRequestHook
When the BeforeRequestHook is called, we would split up the doc into N requests, issue coroutines for N-1 of them, and return the last one for the SDK to run. That left us with two code paths for recombining the results. Instead, the BeforeRequestHook can return a dummy request that will get a 200. This takes us straight to the AfterSuccessHook, which awaits all of the splits and builds the response.
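The flow above can be sketched with plain asyncio. This is a hypothetical stand-in, not the SDK's real hook code: `SplitPdfHook`, `_partition_split`, and the dict-based dummy request are all illustrative assumptions.

```python
import asyncio

class SplitPdfHook:
    """Illustrative sketch of the split/recombine flow described above."""

    def __init__(self):
        self.coroutines_to_execute = {}

    def before_request(self, operation_id, pages):
        # Split the doc and fire off a task per split; none are awaited yet.
        self.coroutines_to_execute[operation_id] = [
            asyncio.create_task(self._partition_split(page)) for page in pages
        ]
        # Return a dummy request that will simply get a 200, taking the SDK
        # straight to the after-success hook.
        return {"method": "GET", "url": "http://localhost/_dummy"}

    async def _partition_split(self, page):
        await asyncio.sleep(0)  # stand-in for the real HTTP call per split
        return [{"type": "Text", "page_number": page}]

    async def after_success(self, operation_id, allow_failed=False):
        # Await all of the splits and build one combined element list.
        elements = []
        for task in self.coroutines_to_execute.pop(operation_id):
            try:
                elements.extend(await task)
            except Exception:
                if not allow_failed:
                    raise  # stop early, as with split_pdf_allow_failed=False
        return elements

async def main():
    hook = SplitPdfHook()
    hook.before_request("op-1", pages=[1, 2, 3])
    return await hook.after_success("op-1")

print(len(asyncio.run(main())))  # 3
```

Because the dummy request never touches the server, all recombination happens in one place, in the after-success path.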
Add retries to the split requests
This is a copy of the autogenerated code in `retry.py`, which will work for the async calls. At some point, we should be able to reuse the SDK for this so we aren't hardcoding the retry config values here. Need to work with Speakeasy on this.

Clean up error handling
When the retries fail and we do have to bubble up an error, we pass it to `create_failure_response` before returning to the SDK. This pops a 500 status code into the response, only so that the SDK does not see a 502/503/504 and proceed to retry the entire doc.

Set an httpx timeout
Many of the failing requests right now are hi_res calls. This is because the default httpx client timeout is 5 seconds, and we immediately throw a ReadTimeout. For now, set this timeout to 10 minutes. This should be sufficient in the splitting code, where page size per request will be controlled. This is another hardcoded value that should go away once we're able to send our splits back into `sdk.general.partition`.
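The retry and timeout behavior from the two sections above can be sketched together. This is a simulation under stated assumptions, not the SDK's actual `retry.py`: `send_with_retries`, the backoff values, and the retryable status set are illustrative, and the fake server stands in for a real httpx call.

```python
import asyncio

# Statuses we retry per split, and a generous read timeout for hi_res calls
# (both assumptions mirroring the PR description, not the real config).
RETRY_STATUS = {502, 503, 504}
TIMEOUT_SECONDS = 600  # 10 minutes instead of httpx's 5-second default

async def send_with_retries(send, max_retries=2, base_delay=0.01):
    attempt = 0
    while True:
        status, body = await asyncio.wait_for(send(), timeout=TIMEOUT_SECONDS)
        if status not in RETRY_STATUS:
            return status, body
        if attempt >= max_retries:
            # Out of retries: hand back a 500 so the caller does not see a
            # 502/503/504 and try to retry the entire doc.
            return 500, body
        attempt += 1
        await asyncio.sleep(base_delay * (2 ** attempt))  # exponential backoff

# Usage: a fake server that bounces a split twice, then succeeds.
calls = {"n": 0}

async def fake_send():
    calls["n"] += 1
    return (502, "BOUNCE") if calls["n"] <= 2 else (200, "elements")

status, body = asyncio.run(send_with_retries(fake_send))
print(status, body)  # 200 elements
```

The key design point is the final 500: the per-split retries already ran, so surfacing the original 502 would only trigger a second, whole-document retry loop in the SDK.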
Testing
Any pipelines that have failed consistently should work now. For more fine-grained testing, I tend to mock up my local server to return a retryable error for specific pages, a certain number of times. In the `general_partition` function, I add a snippet like the one in the patch above. Then, send an SDK request to your local server and verify that the split request for page 3 of your doc is retried up to the number of times you want.
Also, setting the max concurrency to 15 should reproduce the issue. Choose some 50+ page pdf and try the following with the current 0.25.5 branch. It will likely fail with `ServerError: {}`. Then try a local pip install off this branch.