Improve performance of GetOrderForNames. #4326

jsha · 2019-07-08T18:21:38Z

When there are a lot of potential orders to reuse, the query could scan
unnecessary rows, sometimes leading to timeouts. The new query uses the
available index more effectively and adds a LIMIT clause.

There are no new unittest cases because I think the cases at https://github.com/letsencrypt/boulder/blob/master/sa/sa_test.go#L2046-L2086 already cover what I would want to test for this change, but I'm definitely open to ideas for additional test cases.

Fixes #4325

When there are a lot of potential orders to reuse, the query could scan unnecessary rows, sometimes leading to timeouts. The new query uses the available index more effectively and adds a LIMIT clause.

cpu

We talked offline about putting this behind a feature flag.

jsha · 2019-07-08T22:20:55Z

I pushed changes to add the feature flag.

However, in local testing (both before and after the feature flag) I reliably got failures for the test_http2_http01_challenge test case under config-next:

Internal error - Error creating new order - getAllOrderAuthorizationStatuses returned the wrong number of authorization statuses (0 vs expected 1) for order 4

When I check my local integration test database (which is persisted across runs), this checked out. Order 4 was created last Wednesday (less than a week ago), and there was one entry in the orderToAuthz table for it, but the corresponding authz ID did not exist in either pendingAuthorizations or authz.

I initially thought this could be an edge case where the order was originally created with reused authzs, and the authzs expired before the order did and were cleaned up by the expired-authz-purger. But when that happens, the order expiry is shortened to match the min expiry time on any of the authzs involved. In this case, the order had its full one-week lifetime, suggesting there was no authz reuse involved.

It is worth noting that this case always happens for test_http2_http01_challenge, which is unusual among our integration test cases because it always attempts to issue for the same hostname. Also, each integration test run generates a new client (and so a new account key), which means that previously, orders generated from prior integration test runs would have been ignored (they didn't match the account ID). However, with this new code, we pick the oldest order as the sole likely candidate for reuse, and that order happens to be one that is broken (i.e. one that fails during GetOrder).

One other data point: When I switched the ORDER BY expires ASC to ORDER BY expires DESC, I still get the same error.

Note that this error case could be avoided by moving the check for account ID so it's before the GetOrder call, but that might just be masking the symptoms.

jsha · 2019-07-08T23:58:04Z

Found the problem:

expired-authz-purger is run, during the integration test, with a date 1 year in the future, to delete all authzs:

boulder/test/integration-test.py

Lines 69 to 70 in e3f797f

    
           # Run the purger once to clear out any backlog so we have a clean slate. 
        
           expect(now+datetime.timedelta(days=+365), None, "")

However, since the purger doesn't delete orders, that leaves dangling orders whose authorizations have been deleted. Before this change, that wasn't a big deal. Since each run of the integration test would generate fresh accounts, and the SELECT used by this function included the account ID as one parameter, there would never be an attempt to load orders (and therefore authorizations) associated with an account from a previous run.

I verified this by disabling the purger step in the integration test, and found that the test passed as expected.

I'm going to fix this by doing this (from previous comment):

moving the check for account ID so it's before the GetOrder call, but that might just be masking the symptoms.

This is nice because it saves a query in some cases. Based on the above testing, I'm satisfied that this isn't just masking a bigger problem.

This allows us to early-reject orders that are for a different account ID, and fixes an integration test bug.

cpu

Nice work figuring out the integration test problem. I agree that moving the acct ID check earlier seems like the right fix here 👍

This looks good to me, I'm going to merge with the one 🔍 based on available reviewers.

This reverts commit 9fa3607.

This reverts commit 9fa3607. This commit can cause "gorp: multiple rows returned for: ..." under certain situations. See #4329 for details of followup.

This rolls forward #4326 after it was reverted in #4328.

When I originally made this branch, I wrote a patch on top of #4326, then rebased that patch on top of master. That rebase had conflicts, naturally, and I resolved those conflicts but evidently messed it up. This brings back the missing changes by: git show -v 9fa3607 | patch -p1 There was one rejected hunk; I manually made the changes from that hunk.

This rolls forward #4326 after it was reverted in #4328. Resolves #4329 The older query didn't have a `LIMIT 1` so it was returning multiple results, but gorp's `SelectOne` was okay with multiple results when the selection was going into an `int64`. When I changed this to a `struct` in #4326, gorp started producing errors. For this bug to manifest, an account needs to create an order, then fail validation, twice in a row for a given domain name, then create an order once more for the same domain name - that third request will fail because there are multiple orders in the orderFqdnSets table for that domain. Note that the bug condition doesn't happen when an account does three successful issuances in a row, because finalizing an order (that is, issuing a certificate for it) deletes the row in orderFqdnSets. Failing an authorization does not delete the row in orderFqdnSets. I believe this was an intentional design decision because an authorization can participate in many orders, and those orders can have many other authorizations, so computing the updated state of all those orders would be expensive (remember, order state is not persisted in the DB but is calculated dynamically based on the authorizations it contains). This wasn't detected in integration tests because we don't have any tests that fail validation for the same domain multiple times. I filed an issue for an integration test that would have incidentally caught this: #4332. There's also a more specific test case in #4331.

Improve performance of GetOrderForNames.

001a6f3

When there are a lot of potential orders to reuse, the query could scan unnecessary rows, sometimes leading to timeouts. The new query uses the available index more effectively and adds a LIMIT clause.

jsha requested a review from a team as a code owner July 8, 2019 18:21

Add FasterGetOrderForNames flag

ba712eb

cpu suggested changes Jul 8, 2019

View reviewed changes

Look up registrationID in orderFqdnSet query.

dda9c5b

This allows us to early-reject orders that are for a different account ID, and fixes an integration test bug.

cpu approved these changes Jul 9, 2019

View reviewed changes

cpu merged commit 9fa3607 into master Jul 9, 2019

cpu deleted the faster-getorderfornames branch July 9, 2019 13:46

jsha added a commit that referenced this pull request Jul 9, 2019

Revert "SA: improve performance of GetOrderForNames. (#4326)"

1d8ee38

This reverts commit 9fa3607.

jsha mentioned this pull request Jul 9, 2019

FasterGetOrderForNames feature breaks gorp #4329

Closed

jsha added a commit that referenced this pull request Jul 11, 2019

Fix FasterGetOrderForNames and add tests.

4c643a4

This rolls forward #4326 after it was reverted in #4328.

jsha added a commit that referenced this pull request Jul 11, 2019

Fix FasterGetOrderForNames and add tests.

52377f0

This rolls forward #4326 after it was reverted in #4328.

jsha mentioned this pull request Jul 11, 2019

Fix FasterGetOrderForNames and add tests. #4331

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Improve performance of GetOrderForNames. #4326

Improve performance of GetOrderForNames. #4326

Uh oh!

jsha commented Jul 8, 2019

Uh oh!

cpu left a comment

Uh oh!

jsha commented Jul 8, 2019

Uh oh!

jsha commented Jul 8, 2019

Uh oh!

cpu left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Improve performance of GetOrderForNames. #4326

Improve performance of GetOrderForNames. #4326

Uh oh!

Conversation

jsha commented Jul 8, 2019

Uh oh!

cpu left a comment

Choose a reason for hiding this comment

Uh oh!

jsha commented Jul 8, 2019

Uh oh!

jsha commented Jul 8, 2019

Uh oh!

cpu left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants