
Add throttling logic per origin #1356

Merged: 3 commits merged into main on Jun 7, 2024
Conversation

@tidoust tidoust commented Jun 6, 2024

[Creating as draft as I'd be more confident if I were able to run a successful build locally]

This adds a generic per-origin throttled queue class that replaces the previous `throttle` function, adding serialization (and a sleep interval) per origin.

The code is inspired by:
https://github.com/w3c/cg-monitor/blob/5ceea24a1eaa7c1736d38909bcbb47df4ae297a3/lib/caching-queued-fetch.js#L68

There may be a way to rewrite the class in a function wrapper as the previous throttle module was doing. I just grew tired of abstractions.

(Note plan is to reuse the same class in Reffy, see w3c/reffy#1587)
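To make the idea concrete, here is a minimal sketch of such a per-origin throttled queue. It is not the actual `src/throttled-queue.js`: class and method names are illustrative, and the parallelization cap discussed further down is omitted. Tasks targeting the same origin run one after the other, with a sleep in between.

```js
// Minimal sketch of a per-origin throttled queue (illustrative names only).
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

class ThrottledQueue {
  constructor(sleepInterval = 1000) {
    this.sleepInterval = sleepInterval;
    this.lastPerOrigin = new Map();   // origin -> tail of that origin's promise chain
  }

  // Run `task(url)` once all previously queued tasks for the same origin have
  // completed, then sleep a bit before the next task for that origin may start.
  runThrottled(url, task) {
    const origin = new URL(url).origin;
    const previous = this.lastPerOrigin.get(origin) ?? Promise.resolve();
    const result = previous.then(() => task(url));
    // Keep the chain alive even if the task throws, and insert the sleep.
    this.lastPerOrigin.set(origin,
      result.catch(() => {}).then(() => sleep(this.sleepInterval)));
    return result;
  }
}

// Usage (in an async context): requests to the same origin get serialized and
// spaced out, while different origins proceed independently.
const queue = new ThrottledQueue(2000);
const urls = [
  'https://drafts.csswg.org/css-grid-3/',              // hypothetical spec URLs
  'https://drafts.csswg.org/css-view-transitions-2/'
];
const responses = await Promise.all(
  urls.map(url => queue.runThrottled(url, u => fetch(u)))
);
```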

SVG files were not intercepted in Reffy because a couple of specs used a
clunky dynamic PNG fallback mechanism that created an infinite loop. These
specs were fixed some time ago, so I don't think that's still an issue; also,
for some reason, the CSS drafts server takes a lot of time to serve SVG files
linked from specs.
tidoust added a commit to w3c/reffy that referenced this pull request Jun 6, 2024
tidoust commented Jun 6, 2024

I'm marking the PR as ready for review, but I still haven't been able to test it for real against the borked CSS drafts server, so I don't know the impact on the build duration yet.

@tidoust tidoust marked this pull request as ready for review June 6, 2024 17:59
@dontcallmedom dontcallmedom left a comment

I think I'll be interested in extracting the throttled queue class if we can make it usable here, in Reffy, and in the cg-monitor 🙂

@dontcallmedom

While we're looking at the cg-monitor, another idea from there that may be worth exploring is to build an HTTP cache and reuse it from one job to the next with GitHub caches.

tidoust commented Jun 7, 2024

> [Creating as draft as I'd be more confident if I were able to run a successful build locally]

Managed to run a build locally this morning. The fetch-info step took about 8 minutes (compared to roughly 4 minutes previously). That's not surprising for >200 specs, most of them CSS drafts or github.io specs. If we need to speed things up a bit, we may want to treat xxx.github.io URLs as different origins, but I propose to err on the conservative side for now.

Anyway, it takes longer, but it worked. Meanwhile, the job failed, this time with a 429 on a CSS stylesheet served from w3.org. I think the only reason we keep loading CSS stylesheets is that ReSpec loads a few styles dynamically and may not manage to build the spec if styles cannot be loaded. That may be worth checking again though.
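For illustration only (this helper is hypothetical, not part of the PR), the trade-off above boils down to how the throttling bucket is derived from a spec URL:

```js
// Hypothetical helper: derive the throttling bucket for a spec URL.
function throttlingBucket(url) {
  const { origin, hostname } = new URL(url);
  if (hostname.endsWith('.github.io')) {
    // Conservative: all xxx.github.io specs share one queue.
    return 'github.io';
    // Faster alternative: return `origin` so each subdomain gets its own queue.
  }
  return origin;
}
```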

@tidoust tidoust merged commit 7e3dc35 into main Jun 7, 2024
1 check passed
@tidoust tidoust deleted the throttling branch June 7, 2024 06:38
@dontcallmedom

How about we intercept CSS requests but, instead of failing them, return empty content?
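A rough sketch of what that could look like, assuming the crawler intercepts requests through Puppeteer (an assumption; the actual interception code in Reffy/browser-specs may differ):

```js
const puppeteer = require('puppeteer');

async function loadSpecWithoutStylesheets(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setRequestInterception(true);
  page.on('request', request => {
    if (request.resourceType() === 'stylesheet') {
      // Answer with an empty stylesheet instead of hitting the server (or
      // failing the request), so ReSpec still sees its styles as "loaded".
      request.respond({ status: 200, contentType: 'text/css', body: '' });
    }
    else {
      request.continue();
    }
  });
  await page.goto(url, { waitUntil: 'networkidle0' });
  const title = await page.title();   // ...or whatever info the build needs
  await browser.close();
  return title;
}
```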

@dontcallmedom

That being said, the last job I see seemed to have failed with a 429 on the fxtf server, not on the CSS stylesheet.

tidoust commented Jun 7, 2024

> That being said, the last job I see seemed to have failed with a 429 on the fxtf server, not on the CSS stylesheet.

Ah right, the protocol error on the CSS stylesheet is probably a consequence of the build stopping due to the 429, not the cause.

tidoust added a commit to w3c/reffy that referenced this pull request Jun 7, 2024
This rewrites the convoluted logic introduced in the previous commit (and the
equally convoluted logic that pre-existed) as suggested in #1582, taking
inspiration from the queue in the CG monitor:
https://github.com/w3c/cg-monitor/blob/5ceea24a1eaa7c1736d38909bcbb47df4ae297a3/lib/caching-queued-fetch.js#L68

On top of serializing requests sent to a given origin and sleeping a bit in
between these requests, the new queue controls parallelization, making sure
that the code does not load too many specs at once (I stuck to 4 in parallel as
before).

The net result is essentially the same as that achieved with the previous
commit: specs are processed as if origins had been interleaved. But the new
code is more straightforward, less verbose and more readable.

Also, the code for the throttled queue is exactly the same as the code in
browser-specs (see w3c/browser-specs#1356). The code is
duplicated here out of simplicity, and out of laziness to create and maintain a
shared dependency.

The sleep time depends on the origin being crawled:
- The CSS server breaks easily, so the code sleeps 2 seconds in between requests.
- The www.w3.org server has a limit of 180 requests per minute, so the code
sleeps 1 second in between requests.
- Other origins seem more amenable to faster crawls, so the code sleeps 100ms
in between requests.
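As a sketch only (names are illustrative, not the actual Reffy code), the per-origin sleep times above could be expressed as:

```js
// Illustrative mapping from an origin (e.g. "https://drafts.csswg.org") to the
// sleep time, in milliseconds, between two requests sent to that origin.
function sleepTimeFor(origin) {
  // csswg.org, fxtf.org and css-houdini.org are served by the same machine,
  // as noted further down in the discussion.
  const cssFamily = ['csswg.org', 'fxtf.org', 'css-houdini.org'];
  if (cssFamily.some(domain => origin.endsWith(domain))) {
    return 2000;   // the CSS server breaks easily
  }
  if (origin.endsWith('www.w3.org')) {
    return 1000;   // www.w3.org allows up to 180 requests per minute
  }
  return 100;      // other origins tolerate faster crawls
}
```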
plinss commented Jun 7, 2024

The rate limiting on the csswg.org server is fairly recent (note that the same machine serves fxtf.org and css-houdini.org).

It's currently set to 1 request per second but allows bursts to load associated resources (varies based on path, e.g. more for images).

The rate limiting was implemented to defend against several crawlers that were severely impacting the server, but it can be adjusted if needed.

Note that the server is also now running CrowdSec, and repeat rate-limit violations will result in temporary bans at the firewall (4 hours at first, increasing by an additional 4 hours per offense).

tidoust commented Jun 8, 2024

Thanks for clarifying the rate limit rules that were put in place, @plinss!

I guess browser-specs and Reffy were among the impacting crawlers ;) We have now tamed our crawling beasts, and the rules we adopted should work well with the limits you mention:

  • The crawler only sends one request at a time to a given origin
  • The crawler sleeps 2s in between requests sent to the csswg.org server
  • The crawler avoids loading associated resources (CSS stylesheets, images, some scripts that we know we do not need)
  • csswg.org, fxtf.org and css-houdini.org are considered to be the same origin

Crawls have resumed without errors since yesterday, so all looks good!
