-
-
Notifications
You must be signed in to change notification settings - Fork 629
Improve performance of GetOrderForNames. #4326
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
When there are a lot of potential orders to reuse, the query could scan unnecessary rows, sometimes leading to timeouts. The new query uses the available index more effectively and adds a LIMIT clause.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We talked offline about putting this behind a feature flag.
I pushed changes to add the feature flag. However, in local testing (both before and after the feature flag) I reliably got failures for the
When I check my local integration test database (which is persisted across runs), this checked out. Order 4 was created last Wednesday (less than a week ago), and there was one entry in the orderToAuthz table for it, but the corresponding authz ID did not exist in either pendingAuthorizations or authz. I initially thought this could be an edge case where the order was originally created with reused authzs, and the authzs expired before the order did and were cleaned up by the expired-authz-purger. But when that happens, the order expiry is shortened to match the min expiry time on any of the authzs involved. In this case, the order had its full one-week lifetime, suggesting there was no authz reuse involved. It is worth noting that this case always happens for One other data point: When I switched the Note that this error case could be avoided by moving the check for account ID so it's before the GetOrder call, but that might just be masking the symptoms. |
Found the problem: expired-authz-purger is run, during the integration test, with a date 1 year in the future, to delete all authzs: boulder/test/integration-test.py Lines 69 to 70 in e3f797f
However, since the purger doesn't delete orders, that leaves dangling orders whose authorizations have been deleted. Before this change, that wasn't a big deal. Since each run of the integration test would generate fresh accounts, and the I verified this by disabling the purger step in the integration test, and found that the test passed as expected. I'm going to fix this by doing this (from previous comment):
This is nice because it saves a query in some cases. Based on the above testing, I'm satisfied that this isn't just masking a bigger problem. |
This allows us to early-reject orders that are for a different account ID, and fixes an integration test bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work figuring out the integration test problem. I agree that moving the acct ID check earlier seems like the right fix here 👍
This looks good to me, I'm going to merge with the one 🔍 based on available reviewers.
When I originally made this branch, I wrote a patch on top of #4326, then rebased that patch on top of master. That rebase had conflicts, naturally, and I resolved those conflicts but evidently messed it up. This brings back the missing changes by: git show -v 9fa3607 | patch -p1 There was one rejected hunk; I manually made the changes from that hunk.
This rolls forward #4326 after it was reverted in #4328. Resolves #4329 The older query didn't have a `LIMIT 1` so it was returning multiple results, but gorp's `SelectOne` was okay with multiple results when the selection was going into an `int64`. When I changed this to a `struct` in #4326, gorp started producing errors. For this bug to manifest, an account needs to create an order, then fail validation, twice in a row for a given domain name, then create an order once more for the same domain name - that third request will fail because there are multiple orders in the orderFqdnSets table for that domain. Note that the bug condition doesn't happen when an account does three successful issuances in a row, because finalizing an order (that is, issuing a certificate for it) deletes the row in orderFqdnSets. Failing an authorization does not delete the row in orderFqdnSets. I believe this was an intentional design decision because an authorization can participate in many orders, and those orders can have many other authorizations, so computing the updated state of all those orders would be expensive (remember, order state is not persisted in the DB but is calculated dynamically based on the authorizations it contains). This wasn't detected in integration tests because we don't have any tests that fail validation for the same domain multiple times. I filed an issue for an integration test that would have incidentally caught this: #4332. There's also a more specific test case in #4331.
When there are a lot of potential orders to reuse, the query could scan
unnecessary rows, sometimes leading to timeouts. The new query uses the
available index more effectively and adds a LIMIT clause.
There are no new unittest cases because I think the cases at https://github.com/letsencrypt/boulder/blob/master/sa/sa_test.go#L2046-L2086 already cover what I would want to test for this change, but I'm definitely open to ideas for additional test cases.
Fixes #4325