condfetch: Skip revalidation of an invalidated stale_oc #4082

dridi · 2024-03-12T18:35:31Z

If we can catch a dying stale_oc before initiating a condfetch, we can fall back to a regular fetch instead. If we catch it too late, when we are ready to merge headers, we transition to vcl_backend_error and get a chance to retry.

This avoids a race between a condfetch and the invalidation of its stale objcore, covered with a purge in test cases.

dridi · 2024-03-13T09:17:33Z

@nigoroll @bsdphk: I think the first commit should be merged before the release.

bin/varnishtest/tests/c00129.vtc

dridi · 2024-03-15T17:35:49Z

@hermunn brought an interesting point regarding bans. We check the stale object against newer bans during lookup, but in the absence of another lookup or if the ban lurker doesn't get a chance to evaluate post-lookup bans (and the default ban_lurker_age almost guarantees that) the stale object will not be invalidated.

To get the discussion started I pushed what could look like evaluating bans before and after 304 revalidation. I will turn this pull request into a draft.

nigoroll · 2024-03-18T10:19:22Z

I understand that Dridi prepared these patches in a rush, so once the dust has settled, I would appreciate better explanations in the commit messages.

Regarding the actual issue here, I am not convinced yet that failing with a fetch error for the race is the best option (it might be, but I would like to think about it). There will probably be many sites for which the current behavior is just fine, and who would start to see errors unless they added the return(retry), so we should at least add this to the builtin vcl and docs.

If someone issues a PURGE, they expect that the respective object go away and not survive indirectly through a revalidation. The current suggestion to fail to v_b_e {} and then retry under VCL control has the advantage of replacing the busy object and thus not delivering the corrupted stale content. But we could also just accept the revalidated object and deliver it (once), also implicitly purging it. Or we could do the retry internally without VCL involvement.

Again, not sure yet.

bin/varnishd/cache/cache_fetch.c

dridi · 2024-03-18T14:51:29Z

bugwash:

go back to the original scope of the pull request
- leave post-lookup ban checks out for now
add 304 retry in built-in vcl_backend_error
docs

dridi · 2024-03-19T11:17:37Z

Updated as per bugwash consensus.

nigoroll

This looks really good after cleanup.

TBD (missing documentation and coverage)

A valid stale objcore may be selected during lookup, enabling a conditional fetch from the backend. Depending on the backend, or the resource being requested, this may leave a long window during which the stale object may be invalidated. In that case, a 304 response would defeat the invalidation, making it possible for invalidated response header fields or bodies to be perpetuated despite Varnish reporting a successful invalidation. We should consider adding this to vcl_backend_error: if (beresp.was_304) { unset bereq.http.If-Modified-Since; unset bereq.http.If-None-Match; return (retry(fetch)); } This would mostly preserve the current behavior, as observed by a client. In particular, when the ETag of the invalidated stale object is still the same when a new copy is fetched, the client side INM processing would remain. Varnish would simply no longer merge and reuse invalidated stale objects internally. From the point of view of an operator, there would of course be a visible change in the presence of a retry.

When it is time to decide between making a bereq for a regular or conditional fetch, we need to ensure that the stale object is still valid. There is a window between the lookup and the begining of a fetch task during which the object could have been invalidated. The window is in theory much smaller than the one between the lookup and the processing of a 304 response, but it can be significant if the fetch task was queued or restarted. Catching it early allows to proceed with a regular fetch.

Named after vcl_backend_refresh from varnishcache#3994.

dridi · 2024-06-18T16:36:04Z

@AlveElde brought up an interesting point in an offline discussion. An implicit retry from the built-in VCL is not desirable since there is no guarantee that vcl_backend_fetch is idempotent.

I added at the beginning of this patch series two commits that:

split the startfetch step in two (prepfetch, then what remains for startfetch)
introduce return (retry(fetch)) and return (retry(task)) transitions

The return (retry(task)) one is present for completeness and is otherwise equivalent to return (retry). In all cases a retry increments the retries counter and is limited by the max_retries parameter.

This allows an effective retry at the actual step where we initiate the backend request, skipping vcl_backend_fetch and staying in the same transaction. The commit introducing this variation on retries is minimalist, lacking coverage and documentation.

On the other hand, the original commits from this pull request were updated with their documentation and coverage adjusted.

hermunn

I have been thinking about this a lot (I have also looked at the code), and I have some opinions:

Introducing retries in the default VCL is a change very likely cause confusion and unexpected behavior in setups where users already have some complexity in their backend subs. In particular, anyone writing if (bereq.retries > 0) might have assumptions baked in about why the retry happens (in their VCL), and they did not think about this other posibility. The fact that sub vcl_backend_fetch is not run twice (something we need coverage on, for sure) makes it better, but the fact that you can have a bereq.retries > 0 in sub vcl_backend_response after going only once through sub vcl_backend_fetch is a gotcha, and needs to be documented carefully. It might be better to have two variables you can check in VCL, one which includes these short false starts, and one which does not.
When a backend responds with 304, it is correct and reasonable to evaluate new bans against the stale_oc. Am also convinced that the right Varnish behavior is to refuse to deliver that banned stale object,with no way to override this from VCL.
It is right to drop the stale_oc reference and go to sub vcl_backend_error, like this PR does. The conundrum for me, is what Varnish and the default VCL should do when this unfortunate situation, a banned stale OC with nothing to send to the client, happens.

Maybe it is better to insist that the user ask (through VCL in sub vcl_backend_fetch, or through a parameter) for the immediate retry sans IMS headers without taking a detour through sub vcl_backend_error. This will make the change to the Varnish Final state less dramatic, but I am not sure about our users' needs in this area.

In other words, we would leave return (retry) as it is now, allowing users to, if they really need it, do relatively complex logic for this case, while most users simply will opt in or out of the immediate retry. Of course, Varnish would remove the IMS headers at the right moment, probably before going back to sub backend_fetch, while the default VCL would simply serve a 503.

dridi · 2024-07-05T16:38:58Z

I agree with the points you raise. I still think that an explicit retry(fetch) action would be very useful outside of the built-in VCL to enable retries without tearing down privs for the task, and without going through a potentially-non-idempotent vcl_backend_fetch.

It might be better to have two variables you can check in VCL, one which includes these short false starts, and one which does not.

Or maybe make this opt-in, without a detour via vcl_backend_error?

sub vcl_backend_fetch {
        set bereq.retry_dying_304 = true;
}

No rug pull, no PRIV_TASK loss, no required attention to vcl_backend_fetch idempotency.

hermunn · 2024-07-08T07:35:25Z

I think the new VCL variable retry_dying_304 is a good idea. We might also consider documenting that from September, this will/might be default true, so that everyone should make sure to set this in their own configuration (VCL) if they care.

Also, I see your point with return (retry(fetch)) in general, as many users simply want to set a new backend and retry without any sub vcl_backend_fetch logic being relevant. However, if we go with the retry_dying_304 here, the question about return (retry(fetch)) might be better discussed in a different PR.

dridi · 2024-07-08T08:02:53Z

I'm wondering to which extent we could actually tie this to vcl_backend_refresh instead of vcl_backend_error, having this return (retry(fetch)) in the built-in VCL once we move some of the core code handling 304s to VCL.

See #3994 for more details.

That would make the bereq.retry_dying_304 idea irrelevant.

edit: and this could not qualify as a breaking change in a new subroutine for which there is no preexisting VCL.

dridi added b=bug b=enhancement r=trunk c=varnishd labels Mar 12, 2024

dridi commented Mar 13, 2024

View reviewed changes

bin/varnishtest/tests/c00129.vtc Outdated Show resolved Hide resolved

dridi force-pushed the dying_304 branch 2 times, most recently from cc56c8b to 9257ac3 Compare March 13, 2024 11:11

dridi marked this pull request as draft March 15, 2024 17:36

nigoroll reviewed Mar 18, 2024

View reviewed changes

bin/varnishd/cache/cache_fetch.c Outdated Show resolved Hide resolved

dridi force-pushed the dying_304 branch from b452cdb to 3b56954 Compare March 18, 2024 13:59

dridi marked this pull request as ready for review March 18, 2024 14:27

dridi force-pushed the dying_304 branch from 3b56954 to 75c9150 Compare March 18, 2024 14:33

dridi force-pushed the dying_304 branch 2 times, most recently from a0b6a7e to 541cd7a Compare March 19, 2024 10:46

nigoroll approved these changes May 24, 2024

View reviewed changes

dridi added the a=Bugwash Today label Jun 17, 2024

dridi added 6 commits June 18, 2024 17:04

fetch: Split the startfetch step in two

456a100

DRAFT: vcl: New restart(fetch) transition

6e9b09f

TBD (missing documentation and coverage)

builtin: Split a new vcl_beresp_error subroutine

b0d5920

builtin: Retry failed 304s by default

2c12ab4

Named after vcl_backend_refresh from varnishcache#3994.

dridi force-pushed the dying_304 branch from 541cd7a to 2c12ab4 Compare June 18, 2024 16:23

dridi marked this pull request as draft June 18, 2024 16:37

dridi requested a review from nigoroll June 18, 2024 16:38

daghf requested a review from hermunn June 25, 2024 14:14

hermunn reviewed Jul 5, 2024

View reviewed changes

dridi mentioned this pull request Jul 8, 2024

Do we handle 304 responses correctly? #4026

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

condfetch: Skip revalidation of an invalidated stale_oc #4082

condfetch: Skip revalidation of an invalidated stale_oc #4082

dridi commented Mar 12, 2024

dridi commented Mar 13, 2024

dridi commented Mar 15, 2024

nigoroll commented Mar 18, 2024 •

edited

Loading

dridi commented Mar 18, 2024

dridi commented Mar 19, 2024

nigoroll left a comment

dridi commented Jun 18, 2024

hermunn left a comment

dridi commented Jul 5, 2024

hermunn commented Jul 8, 2024

dridi commented Jul 8, 2024 •

edited

Loading

condfetch: Skip revalidation of an invalidated stale_oc #4082

Are you sure you want to change the base?

condfetch: Skip revalidation of an invalidated stale_oc #4082

Conversation

dridi commented Mar 12, 2024

dridi commented Mar 13, 2024

dridi commented Mar 15, 2024

nigoroll commented Mar 18, 2024 • edited Loading

dridi commented Mar 18, 2024

dridi commented Mar 19, 2024

nigoroll left a comment

Choose a reason for hiding this comment

dridi commented Jun 18, 2024

hermunn left a comment

Choose a reason for hiding this comment

dridi commented Jul 5, 2024

hermunn commented Jul 8, 2024

dridi commented Jul 8, 2024 • edited Loading

nigoroll commented Mar 18, 2024 •

edited

Loading

dridi commented Jul 8, 2024 •

edited

Loading