
Add throttling logic per origin #1356

Merged: 3 commits merged into main on Jun 7, 2024
Conversation

@tidoust tidoust commented Jun 6, 2024

[Creating as draft as I'd be more confident if I were able to run a successful build locally]

This adds a generic per-origin throttled queue class that replaces the previous `throttle` function, adding serialization (and a sleep interval) per origin.

The code is inspired by:
https://github.com/w3c/cg-monitor/blob/5ceea24a1eaa7c1736d38909bcbb47df4ae297a3/lib/caching-queued-fetch.js#L68

There may be a way to rewrite the class in a function wrapper as the previous throttle module was doing. I just grew tired of abstractions.

(Note plan is to reuse the same class in Reffy, see w3c/reffy#1587)
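To make the idea concrete, here is a minimal sketch of such a per-origin throttled queue. It is not the actual `src/throttled-queue.js`: class and method names are illustrative, and the parallelization cap discussed further down is omitted. Tasks targeting the same origin run one after the other, with a sleep in between.

```js
// Minimal sketch of a per-origin throttled queue (illustrative names only).
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

class ThrottledQueue {
  constructor(sleepInterval = 1000) {
    this.sleepInterval = sleepInterval;
    this.lastPerOrigin = new Map();   // origin -> tail of that origin's promise chain
  }

  // Run `task(url)` once all previously queued tasks for the same origin have
  // completed, then sleep a bit before the next task for that origin may start.
  runThrottled(url, task) {
    const origin = new URL(url).origin;
    const previous = this.lastPerOrigin.get(origin) ?? Promise.resolve();
    const result = previous.then(() => task(url));
    // Keep the chain alive even if the task throws, and insert the sleep.
    this.lastPerOrigin.set(origin,
      result.catch(() => {}).then(() => sleep(this.sleepInterval)));
    return result;
  }
}

// Usage (in an async context): requests to the same origin get serialized and
// spaced out, while different origins proceed independently.
const queue = new ThrottledQueue(2000);
const urls = [
  'https://drafts.csswg.org/css-grid-3/',              // hypothetical spec URLs
  'https://drafts.csswg.org/css-view-transitions-2/'
];
const responses = await Promise.all(
  urls.map(url => queue.runThrottled(url, u => fetch(u)))
);
```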

SVG files were not intercepted in Reffy because a couple of specs used a
clunky dynamic PNG fallback mechanism that created an infinite loop. These
specs were fixed some time ago, so I don't think that's still an issue; also,
for some reason, the CSS drafts server takes a lot of time to serve SVG files
linked from specs.
tidoust added a commit to w3c/reffy that referenced this pull request Jun 6, 2024
tidoust commented Jun 6, 2024

I'm marking the PR as ready for review, but I still haven't been able to test it for real against the borked CSS drafts server, so I don't know the impact on the build duration yet.

@tidoust tidoust marked this pull request as ready for review June 6, 2024 17:59
@dontcallmedom dontcallmedom left a comment

I think I'll be interested in extracting the throttled queue class if we can make it usable here, in Reffy, and in the cg-monitor 🙂

@dontcallmedom

While we're looking at the cg-monitor, another idea from there that may be worth exploring is to build an HTTP cache and reuse it from one job to the next with GitHub caches.

tidoust commented Jun 7, 2024

> [Creating as draft as I'd be more confident if I were able to run a successful build locally]

Managed to run a build locally this morning. The fetch-info step took about 8 minutes (compared to roughly 4 minutes previously). That's not surprising for >200 specs, most of them CSS drafts or github.io specs. If we need to speed things up a bit, we may want to treat xxx.github.io URLs as different origins, but I propose to err on the conservative side for now.

Anyway, it takes longer, but it worked. Meanwhile, the job failed, this time with a 429 on a CSS stylesheet served from w3.org. I think the only reason we keep loading CSS stylesheets is that ReSpec loads a few styles dynamically and may not manage to build the spec if styles cannot be loaded. That may be worth checking again though.
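For illustration only (this helper is hypothetical, not part of the PR), the trade-off above boils down to how the throttling bucket is derived from a spec URL:

```js
// Hypothetical helper: derive the throttling bucket for a spec URL.
function throttlingBucket(url) {
  const { origin, hostname } = new URL(url);
  if (hostname.endsWith('.github.io')) {
    // Conservative: all xxx.github.io specs share one queue.
    return 'github.io';
    // Faster alternative: return `origin` so each subdomain gets its own queue.
  }
  return origin;
}
```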

@tidoust tidoust merged commit 7e3dc35 into main Jun 7, 2024
1 check passed
@tidoust tidoust deleted the throttling branch June 7, 2024 06:38
@dontcallmedom

How about we intercept CSS requests but, instead of failing them, return empty content?
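A rough sketch of what that could look like, assuming the crawler intercepts requests through Puppeteer (an assumption; the actual interception code in Reffy/browser-specs may differ):

```js
const puppeteer = require('puppeteer');

async function loadSpecWithoutStylesheets(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setRequestInterception(true);
  page.on('request', request => {
    if (request.resourceType() === 'stylesheet') {
      // Answer with an empty stylesheet instead of hitting the server (or
      // failing the request), so ReSpec still sees its styles as "loaded".
      request.respond({ status: 200, contentType: 'text/css', body: '' });
    }
    else {
      request.continue();
    }
  });
  await page.goto(url, { waitUntil: 'networkidle0' });
  const title = await page.title();   // ...or whatever info the build needs
  await browser.close();
  return title;
}
```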

@dontcallmedom

That being said, the last job I see seemed to have failed with a 429 on the fxtf server, not on the CSS stylesheet.

tidoust commented Jun 7, 2024

> That being said, the last job I see seemed to have failed with a 429 on the fxtf server, not on the CSS stylesheet.

Ah right, the protocol error on the CSS stylesheet is probably a consequence of the build stopping due to the 429, not the cause.

tidoust added a commit to w3c/reffy that referenced this pull request Jun 7, 2024
This rewrites the convoluted logic introduced in the previous commit (and the
equally convoluted logic that pre-existed) as suggested in #1582, taking
inspiration from the queue in the CG monitor:
https://github.com/w3c/cg-monitor/blob/5ceea24a1eaa7c1736d38909bcbb47df4ae297a3/lib/caching-queued-fetch.js#L68

On top of serializing requests sent to a given origin and sleeping a bit in
between these requests, the new queue controls parallelization, making sure
that the code does not load too many specs at once (I stuck to 4 in parallel as
before).

The net result is essentially the same as that achieved with the previous
commit: specs are processed as if origins had been interleaved. But the new
code is more straightforward, less verbose and more readable.

Also, the code for the throttled queue is exactly the same as the code in
browser-specs (see w3c/browser-specs#1356). The code is
duplicated here out of simplicity, and out of laziness to create and maintain a
shared dependency.

The sleep time depends on the origin being crawled:
- The CSS server breaks easily, so the code sleeps 2 seconds in between requests.
- The www.w3.org server has a limit of 180 requests per minute, so the code
sleeps 1 second in between requests.
- Other origins seem more amenable to faster crawls, so the code sleeps 100ms
in between requests.
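As a sketch only (names are illustrative, not the actual Reffy code), the per-origin sleep times above could be expressed as:

```js
// Illustrative mapping from an origin (e.g. "https://drafts.csswg.org") to the
// sleep time, in milliseconds, between two requests sent to that origin.
function sleepTimeFor(origin) {
  // csswg.org, fxtf.org and css-houdini.org are served by the same machine,
  // as noted further down in the discussion.
  const cssFamily = ['csswg.org', 'fxtf.org', 'css-houdini.org'];
  if (cssFamily.some(domain => origin.endsWith(domain))) {
    return 2000;   // the CSS server breaks easily
  }
  if (origin.endsWith('www.w3.org')) {
    return 1000;   // www.w3.org allows up to 180 requests per minute
  }
  return 100;      // other origins tolerate faster crawls
}
```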
plinss commented Jun 7, 2024

The rate limiting on the csswg.org server is fairly recent (note that the same machine serves fxtf.org and css-houdini.org).

It's currently set to 1 request per second but allows bursts to load associated resources (varies based on path, e.g. more for images).

The rate limiting was implemented to defend against several crawlers that were severely impacting the server, but it can be adjusted if needed.

Note that the server is also now running CrowdSec, and repeat rate-limit violations will result in temporary bans at the firewall (4 hours at first, increasing by an additional 4 hours per offense).

tidoust commented Jun 8, 2024

Thanks for clarifying the rate limit rules that were put in place, @plinss!

I guess browser-specs and Reffy were among the impacting crawlers ;) We have now tamed our crawling beasts, and the rules we adopted should work well with the limits you mention:

  • The crawler only sends one request at a time to a given origin
  • The crawler sleeps 2s in between requests sent to the csswg.org server
  • The crawler avoids loading associated resources (CSS stylesheets, images, some scripts that we know we do not need)
  • csswg.org, fxtf.org and css-houdini.org are considered to be the same origin

Crawls have resumed without errors since yesterday, so all looks good!
