-
Notifications
You must be signed in to change notification settings - Fork 17
fix: Address some issues in the split pdf logic #170
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Simplify the splitting logic like so: BeforeRequest: * Split up the doc, issue coroutines to send off all the splits * Return a request that just returns 200 AfterSuccess: * Await all of the splits, or stop early if allow_failed=False * Build a new requests.Response with whatever elements we have When the SDK is responsible for the last page, we need branching logic to capture the result and append the coroutine elements.
- Do not log when SDK calls the after_success hooks. This is now the success of the mock request and does not mean the doc is finished. - Move the logging of successful/failed splits. Use debug level.
The autogenerated code does not log when it retries. We need to know what's happening, otherwise the user experiences a hanging process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really like these changes :)
On test_decorators.py, if endpoint url isn't "/general/v0/general", this could cause issues..?
45e561d
to
5af1018
Compare
Thanks for the review! For that test - I was able to change the logic that required that check, so I can remove that now. |
5af1018
to
a77b554
Compare
a77b554
to
b8d0af9
Compare
# When we're able to reuse the SDK to make these calls, we can remove this var | ||
# The SDK timeout will be controlled by parameter | ||
client_timeout_minutes = 30 | ||
if timeout_var := os.getenv("UNSTRUCTURED_CLIENT_TIMEOUT_MINUTES"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does the client user have to know to set this environment variable? if so should the default go the other way around maybe? just thinking about how it might be challenging waiting up to 30 minutes without realizing why it's hanging so long
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've had some users get stuck here. For very large pdfs, a hi_res call can sit for 30 minutes, at which point they get a httpx.ReadTimeout, the doc gets retried, and the same thing happens again. This should be rare, but we wanted the timeout to account for the worst case. And we added the env var so we could ask users to bump the timeout manually if this still wasn't resolved.
There is a http timeout parameter for the SDK now, but due to some gaps in the hook context, we don't have access to that param from down here. Longer term, that should be the value that users can configure, and we can probably have a much saner default. The env var here is a bit of a bandaid, so I'm keeping it out of the readme, etc.
The hooks do not give us access to the AsyncClient, which breaks the mock no-op request in the splitting hook. For now, let's add it as a param to the sdk_init hook. This means we'll have to bring this code back whenever the SDK regenerates. At least, until we have an update from Speakeasy for this.
d1dad0e
to
dee638f
Compare
No need to test the base SDK hook behavior
Instead of copying out this code, let's reuse the SDK util function. Also, add the fix to use asyncio.sleep in the retry function. This will be updated when we regenerate the SDK, but we can change it manually for now.
This is a copy of the fixes that went into the release/0.25.x branch, updated slightly for the v2 SDK. The main branch is already retrying split requests, but we do not have an updated httpx timeout value, and the retries are not logged.
To verify
Same as the linked pr, we can mock arbitrary errors from the server and verify the behavior.
Apply this patch to your local unstructured-api, and do
make run-web-app
. Whenever we send a request starting at page 3, the server will sleep for a bit, and then return a 502 error.Use the main branch to make a client request using layout-parser-paper, which should default to 2 page splits, with the second split starting at page 3. Because the httpx timeout is not specified, this split will raise a ReadTimeout error, and get stuck in a retry loop. This is not logged, so it will appear to be a client hang.
Finally, make the same request against this branch. The bad split should be retried for the allotted 2 times, with logging, and then the doc will succeed.