-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add throttling logic per origin #1356
Conversation
This adds a generic throttled queue per origin class that replaces the previous `throttle` function to add serialization (and sleep interval) per origin. The code is inspired by: https://github.com/w3c/cg-monitor/blob/5ceea24a1eaa7c1736d38909bcbb47df4ae297a3/lib/caching-queued-fetch.js#L68 (Note plan is to reuse the same class in Reffy)
SVG files were not intercepted in Reffy because a couple of specs were using a clunky dynamic PNG fallback mechanism that created an infinite loop. These specs were fixed some time ago, so I don't think that's still an issue and, for some reason, the drafts CSS server takes a lot of time to serve SVG files linked from specs.
This re-writes the convoluted logic introduced in previous commit (and the also convoluted logic that pre-existed) as suggested in #1582, taking inspiration from the queue in the CG monitor: https://github.com/w3c/cg-monitor/blob/5ceea24a1eaa7c1736d38909bcbb47df4ae297a3/lib/caching-queued-fetch.js#L68 On top of serializing requests sent to a given origin and sleeping a bit in between these requests, the new queue controls parallelization, making sure that the code does not load too many specs at once (I stuck to 4 in parallel as before). The net result is essentially the same as that achieved with the previous commit: specs are processed as if origins had been interleaved. But the new code is more straightforward, less verbose and more readable. Also, the code for the throttled queue is the exact same one as the code in browser-specs (see w3c/browser-specs#1356). Code is duplicated here out of simplicity and laziness to create and maintain a shared dependency.
I'm marking the PR as ready for review, but then I still haven't been able to test it for real with the borked CSS drafts specs server, and so do not know the impact on the build duration for now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'll be interested in extracting the throttlequeue class if we can make it usable here, in reffy and cg-monitor 🙂
While we're looking at the cg-monitor, another idea from there maybe worth exploring is to build an HTTP cache and reuse it from one job to the next with GitHub caches. |
Managed to run a build locally this morning. The fetch-info step took about ~8mn (compared to ~4mn previously). That's not surprising for >200 specs, most of them being CSS drafts or Anyway, that takes longer but that worked. Meanwhile, the job failed, this time with a 429 on a CSS stylesheet served from w3.org. I think the only reason why we keep loading CSS stylesheets is that ReSpec loads a few styles dynamically and may not manage to build the spec if styles cannot be loaded. That may be worth checking again though. |
how about we intercept css requests but instead of failing them, we return empty content? |
that being said, the last job I see seemed to have 429-failed on the fxtf server, not on the CSS stylesheet |
Ah right, the protocol error on the CSS stylesheet is probably a consequence of the build stopping due to the 429, not the cause. |
This re-writes the convoluted logic introduced in previous commit (and the also convoluted logic that pre-existed) as suggested in #1582, taking inspiration from the queue in the CG monitor: https://github.com/w3c/cg-monitor/blob/5ceea24a1eaa7c1736d38909bcbb47df4ae297a3/lib/caching-queued-fetch.js#L68 On top of serializing requests sent to a given origin and sleeping a bit in between these requests, the new queue controls parallelization, making sure that the code does not load too many specs at once (I stuck to 4 in parallel as before). The net result is essentially the same as that achieved with the previous commit: specs are processed as if origins had been interleaved. But the new code is more straightforward, less verbose and more readable. Also, the code for the throttled queue is the exact same one as the code in browser-specs (see w3c/browser-specs#1356). Code is duplicated here out of simplicity and laziness to create and maintain a shared dependency. The sleep time depends on the origin being crawled: - The CSS server breaks easily, the code sleeps 2 seconds in between requests. - The www.w3.org server has a limit of 180 requests per minute, the code sleeps 1 second in between requests. - Other origins seem more amenable to faster crawls, the code sleeps 100ms in between requests.
The rate limiting on the csswg.org server is fairly recent (note that the same machine serves fxtf.org and css-houdini.org). It's currently set to 1 request per second but allows bursts to load associated resources (varies based on path, e.g. more for images). The rate limiting was implemented to defend against several crawlers that were severely impacting the server, but it can be adjusted if needed. Also note that the server is also now running CrowdSec and repeat rate limit violations will result in temporary bans at the firewall (4 hours at first, increasing an additional 4 hours per offense). |
Thanks for clarifying the rate limit rules that were put in place, @plinss! I guess browser-specs and Reffy were among the impacting crawlers ;) We have now tamed our crawling beasts, and the rules we adopted should work well with the limits you mention:
Crawls have resumed without errors since yesterday, so all looks good! |
[Creating as draft as I'd be more confident if I were able to run a successful build locally]
This adds a generic throttled queue per origin class that replaces the previous
throttle
function to add serialization (and sleep interval) per origin.The code is inspired by:
https://github.com/w3c/cg-monitor/blob/5ceea24a1eaa7c1736d38909bcbb47df4ae297a3/lib/caching-queued-fetch.js#L68
There may be a way to rewrite the class in a function wrapper as the previous throttle module was doing. I just grew tired of abstractions.
(Note plan is to reuse the same class in Reffy, see w3c/reffy#1587)