Skip to content

Conversation

jsha
Copy link
Contributor

@jsha jsha commented Jul 8, 2019

When there are a lot of potential orders to reuse, the query could scan
unnecessary rows, sometimes leading to timeouts. The new query uses the
available index more effectively and adds a LIMIT clause.

There are no new unittest cases because I think the cases at https://github.com/letsencrypt/boulder/blob/master/sa/sa_test.go#L2046-L2086 already cover what I would want to test for this change, but I'm definitely open to ideas for additional test cases.

Fixes #4325

When there are a lot of potential orders to reuse, the query could scan
unnecessary rows, sometimes leading to timeouts. The new query uses the
available index more effectively and adds a LIMIT clause.
@jsha jsha requested a review from a team as a code owner July 8, 2019 18:21
Copy link
Contributor

@cpu cpu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We talked offline about putting this behind a feature flag.

@jsha
Copy link
Contributor Author

jsha commented Jul 8, 2019

I pushed changes to add the feature flag.

However, in local testing (both before and after the feature flag) I reliably got failures for the test_http2_http01_challenge test case under config-next:

Internal error - Error creating new order - getAllOrderAuthorizationStatuses returned the wrong number of authorization statuses (0 vs expected 1) for order 4

When I check my local integration test database (which is persisted across runs), this checked out. Order 4 was created last Wednesday (less than a week ago), and there was one entry in the orderToAuthz table for it, but the corresponding authz ID did not exist in either pendingAuthorizations or authz.

I initially thought this could be an edge case where the order was originally created with reused authzs, and the authzs expired before the order did and were cleaned up by the expired-authz-purger. But when that happens, the order expiry is shortened to match the min expiry time on any of the authzs involved. In this case, the order had its full one-week lifetime, suggesting there was no authz reuse involved.

It is worth noting that this case always happens for test_http2_http01_challenge, which is unusual among our integration test cases because it always attempts to issue for the same hostname. Also, each integration test run generates a new client (and so a new account key), which means that previously, orders generated from prior integration test runs would have been ignored (they didn't match the account ID). However, with this new code, we pick the oldest order as the sole likely candidate for reuse, and that order happens to be one that is broken (i.e. one that fails during GetOrder).

One other data point: When I switched the ORDER BY expires ASC to ORDER BY expires DESC, I still get the same error.

Note that this error case could be avoided by moving the check for account ID so it's before the GetOrder call, but that might just be masking the symptoms.

@jsha
Copy link
Contributor Author

jsha commented Jul 8, 2019

Found the problem:

expired-authz-purger is run, during the integration test, with a date 1 year in the future, to delete all authzs:

# Run the purger once to clear out any backlog so we have a clean slate.
expect(now+datetime.timedelta(days=+365), None, "")

However, since the purger doesn't delete orders, that leaves dangling orders whose authorizations have been deleted. Before this change, that wasn't a big deal. Since each run of the integration test would generate fresh accounts, and the SELECT used by this function included the account ID as one parameter, there would never be an attempt to load orders (and therefore authorizations) associated with an account from a previous run.

I verified this by disabling the purger step in the integration test, and found that the test passed as expected.

I'm going to fix this by doing this (from previous comment):

moving the check for account ID so it's before the GetOrder call, but that might just be masking the symptoms.

This is nice because it saves a query in some cases. Based on the above testing, I'm satisfied that this isn't just masking a bigger problem.

This allows us to early-reject orders that are for a different account
ID, and fixes an integration test bug.
Copy link
Contributor

@cpu cpu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work figuring out the integration test problem. I agree that moving the acct ID check earlier seems like the right fix here 👍

This looks good to me, I'm going to merge with the one 🔍 based on available reviewers.

@cpu cpu merged commit 9fa3607 into master Jul 9, 2019
@cpu cpu deleted the faster-getorderfornames branch July 9, 2019 13:46
jsha added a commit that referenced this pull request Jul 9, 2019
jsha added a commit that referenced this pull request Jul 9, 2019
This reverts commit 9fa3607.

This commit can cause "gorp: multiple rows returned for: ..." under certain situations.

See #4329 for details of followup.
jsha added a commit that referenced this pull request Jul 11, 2019
This rolls forward #4326 after it was reverted in #4328.
jsha added a commit that referenced this pull request Jul 11, 2019
This rolls forward #4326 after it was reverted in #4328.
jsha added a commit that referenced this pull request Jul 11, 2019
When I originally made this branch, I wrote a patch on top of #4326,
then rebased that patch on top of master. That rebase had conflicts,
naturally, and I resolved those conflicts but evidently messed it up.

This brings back the missing changes by:

git show -v 9fa3607 | patch -p1

There was one rejected hunk; I manually made the changes from that hunk.
cpu pushed a commit that referenced this pull request Jul 11, 2019
This rolls forward #4326 after it was reverted in #4328.

Resolves #4329

The older query didn't have a `LIMIT 1` so it was returning multiple results,
but gorp's `SelectOne` was okay with multiple results when the selection was
going into an `int64`. When I changed this to a `struct` in #4326, gorp started
producing errors.

For this bug to manifest, an account needs to create an order, then fail
validation, twice in a row for a given domain name, then create an order once
more for the same domain name - that third request will fail because there are
multiple orders in the orderFqdnSets table for that domain.

Note that the bug condition doesn't happen when an account does three successful
issuances in a row, because finalizing an order (that is, issuing a certificate
for it) deletes the row in orderFqdnSets. Failing an authorization does not
delete the row in orderFqdnSets. I believe this was an intentional design
decision because an authorization can participate in many orders, and those
orders can have many other authorizations, so computing the updated state of
all those orders would be expensive (remember, order state is not persisted in
the DB but is calculated dynamically based on the authorizations it contains).

This wasn't detected in integration tests because we don't have any tests that
fail validation for the same domain multiple times. I filed an issue for an
integration test that would have incidentally caught this:
#4332. There's also a more specific
test case in #4331.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve GetOrderForNames performance

2 participants