feat: implement per-host rate limiting and statistics #1929

thomas-zahner · 2025-11-21T13:20:16Z

Supersedes #1844. Rebased master to resolve conflicts. Since there were many conflicts and I spent some time resolving them I didn't force push to the existing branch in case I messed something up. (it seems like I didn't)

Things to be done:

Don't return Client from check & update Arcs (https://github.com/lycheeverse/lychee/pull/1844/files#r2360356264)
CLI test for host-specific headers
Fix the behavior when setting the request interval to 0
Maybe CLI test to verify expected timing?
Do through manual testing and potentially reconsider new default settings

Fixes #1605 - Add Per-Host Rate Limiting and Caching
Fixes #989 - Add custom delay between requests (prevent ban)
Addresses #1298 - restrict custom HTTP request headers to specific URL patterns
Addresses #367 - 429 Too Many Requests
Addresses #1593 - The cache is ineffective with the default concurrency
Addresses #1815 - All duplicates should get removed, and no link ever gets checked more than once

thomas-zahner · 2025-12-10T12:54:18Z

@mre I cleaned up the PR quite a bit. I think the PR is almost ready to be merged. I'm just considering some additional testing and changing some defaults, and maybe address the errors when request_interval is set to 0. If you have time, could you review my additions?

Removed things

While cleaning up I encountered a few things which I didn't understand why you added them. I've removed them to test and see if it made a difference. I think we're better of without them, unless you correct me.

I've removed cache_max_age, see 47d8b65.
As explained in the commit message I don't see why it is needed.
I've understood the cache to be only host-specific and in memory.
It has nothing to do with the lychee cache exposed to users. (.lycheecache)
Why would we need cache expiration and why would we set up a expiration time of one hour,
when a link check run normally takes a few seconds/minutes?

In c9578b9 I've removed max_concurrency and global_semaphore.
As explained in the message I didn't see any use of them either.
Did I miss something?

See 727f209 where I replace Window with Vec. Was there a good reason for using a Window? IMO it only distorts the statistics. Memory shouldn't be an issue, unless we had millions of requests, Duration is 16 bytes.

lychee-bin/tests/cli.rs

mre · 2025-12-16T11:27:51Z

I've removed cache_max_age, see 47d8b65. As explained in the commit message I don't see why it is needed.

You're right. Makes no sense. Similar to limiting the window-size, I was a bit worried about memory usage for long-running processes. My thought was that we could limit the in-memory cache in case link checking takes a long time. This might become relevant once we add recursion support, but even then I'm not 100% sure. So it's unlikely that this causes a problem, and I'm thankful that it got removed.

In c9578b9 I've removed max_concurrency and global_semaphore. As explained in the message I didn't see any use of them either. Did I miss something?

The global semaphore was to enforce an overall concurrency limit, but I somehow missed that our mpsc channel already limits that. So that's unnecessary. Thanks for the cleanup.

See 727f209 where I replace Window with Vec. Was there a good reason for using a Window? IMO it only distorts the statistics. Memory shouldn't be an issue, unless we had millions of requests, Duration is 16 bytes.

Yeah, I was a bit worried about the memory usage, actually, in the case where we handle millions of requests (which is unlikely but not impossible). But it's probably a case of premature optimization and before that part becomes a problem, other issues might arise before (such as input handling). So it's fine to use a Vec for consistency.

mre · 2025-12-16T11:35:33Z

Great work!

Can we set a different default for GitHub? As per their docs, they allow up to 100 concurrent requests and many of our users use lychee to check GitHub links specifically, which would reduce friction as it closely matches the prior behavior.

If we decide to change GitHub's default, we should mention that in the README somewhere and maybe in the extended help text.

lychee-lib/src/client.rs

lychee-lib/src/types/error.rs

lychee-lib/src/ratelimit/pool.rs

lychee-lib/src/ratelimit/mod.rs

lychee-lib/src/ratelimit/headers.rs

lychee-lib/src/ratelimit/config.rs

lychee-lib/src/ratelimit/host/host.rs

lychee-bin/src/options.rs

lychee-bin/src/formatters/stats/mod.rs

lychee-bin/src/commands/check.rs

mre

Done with my review. The logic looks correct to me. The parts I commented on were mistakes I introduced myself. I'd be thankful if you could take a look and let me know what you think.

mre · 2025-12-16T12:22:40Z

One more note: have we checked that caching still works? E.g. do the links end up in the cache file at the end?

Includes removing boilerplate code and removing instances of Arc.

Reuse reqwest client and make pool non-optional. Removed cache_max_age in the process as it didnt't seem useful.

But this time in lychee-lib client and not on demand. This reduces complexity quite a bit. Test that host-specific headers are correctly sent.

I do not see any reasons for having them. The global concurrency is already limited by the mpsc channel buffer size.

This means less code and better accuracy. As `Duration` is only 16 bytes in size, memory consumption should really not be a problem.

This makes error handling more user friendly

After previous refactoring only two variants were still used. Use the better fitting ErrorKind::InvalidUrlHost instead of RateLimitError::UrlParseError. Replace confusing RateLimitError::RateLimitExceeded with expect.

Co-authored-by: Matthias Endler <matthias@endler.dev>

Remove unneeded Result and move unwrap to compile time

The default host request interval is now more optimistic. The option names are now more consistent. We now have the host_ prefix for host related options. This prefix is dropped inside the host config.

This also reverts "Create RequestInterval" 89fd9f9.

thomas-zahner · 2025-12-21T17:54:06Z

Thanks for the high-quality additions and for following through. This is a massive step forward and I'm happy to finally see it land. A lot of our current issues will go away once merged. 😊 ⭐

Thank you for the core work! Totally agree that is quite a milestone for lychee. I did some tests where I compared lychee v0.21.0 with this branch and the link process was more than twice as fast. (so intentional slower requests leads to an overall speedup) Will do some follow up testing and this might be worth a report and blog post.

thomas-zahner changed the title ~~Per host rate limiting~~ feat: implement per-host rate limiting and statistics Nov 27, 2025

thomas-zahner mentioned this pull request Nov 27, 2025

Security: restrict custom HTTP request headers to specific URL patterns #1298

Open

thomas-zahner force-pushed the per-host-rate-limiting branch from 82f8e9b to 8c5ccbc Compare November 27, 2025 12:59

thomas-zahner force-pushed the per-host-rate-limiting branch 3 times, most recently from 65c9b4d to e8b026f Compare December 10, 2025 10:50

thomas-zahner mentioned this pull request Dec 10, 2025

feat: implement per-host rate limiting and statistics #1844

Closed

thomas-zahner marked this pull request as ready for review December 10, 2025 12:56

thomas-zahner commented Dec 10, 2025

View reviewed changes

lychee-bin/tests/cli.rs Show resolved Hide resolved

thomas-zahner force-pushed the per-host-rate-limiting branch 3 times, most recently from bdd446d to a096be7 Compare December 15, 2025 14:06