Skip to content

Conversation

@thomas-zahner
Copy link
Member

@thomas-zahner thomas-zahner commented Nov 21, 2025

Supersedes #1844. Rebased master to resolve conflicts. Since there were many conflicts and I spent some time resolving them I didn't force push to the existing branch in case I messed something up. (it seems like I didn't)

Things to be done:

Fixes #1605 - Add Per-Host Rate Limiting and Caching
Fixes #989 - Add custom delay between requests (prevent ban)
Addresses #1298 - restrict custom HTTP request headers to specific URL patterns
Addresses #367 - 429 Too Many Requests
Addresses #1593 - The cache is ineffective with the default concurrency
Addresses #1815 - All duplicates should get removed, and no link ever gets checked more than once

@thomas-zahner thomas-zahner changed the title Per host rate limiting feat: implement per-host rate limiting and statistics Nov 27, 2025
@thomas-zahner thomas-zahner force-pushed the per-host-rate-limiting branch 3 times, most recently from 65c9b4d to e8b026f Compare December 10, 2025 10:50
@thomas-zahner
Copy link
Member Author

thomas-zahner commented Dec 10, 2025

@mre I cleaned up the PR quite a bit. I think the PR is almost ready to be merged. I'm just considering some additional testing and changing some defaults, and maybe address the errors when request_interval is set to 0. If you have time, could you review my additions?

Removed things

While cleaning up I encountered a few things which I didn't understand why you added them. I've removed them to test and see if it made a difference. I think we're better of without them, unless you correct me.

I've removed cache_max_age, see 47d8b65.
As explained in the commit message I don't see why it is needed.
I've understood the cache to be only host-specific and in memory.
It has nothing to do with the lychee cache exposed to users. (.lycheecache)
Why would we need cache expiration and why would we set up a expiration time of one hour,
when a link check run normally takes a few seconds/minutes?

In c9578b9 I've removed max_concurrency and global_semaphore.
As explained in the message I didn't see any use of them either.
Did I miss something?

See 727f209 where I replace Window with Vec. Was there a good reason for using a Window? IMO it only distorts the statistics. Memory shouldn't be an issue, unless we had millions of requests, Duration is 16 bytes.

@thomas-zahner thomas-zahner marked this pull request as ready for review December 10, 2025 12:56
@thomas-zahner thomas-zahner force-pushed the per-host-rate-limiting branch 3 times, most recently from bdd446d to a096be7 Compare December 15, 2025 14:06
@mre
Copy link
Member

mre commented Dec 16, 2025

I've removed cache_max_age, see 47d8b65. As explained in the commit message I don't see why it is needed.

You're right. Makes no sense. Similar to limiting the window-size, I was a bit worried about memory usage for long-running processes. My thought was that we could limit the in-memory cache in case link checking takes a long time. This might become relevant once we add recursion support, but even then I'm not 100% sure. So it's unlikely that this causes a problem, and I'm thankful that it got removed.

In c9578b9 I've removed max_concurrency and global_semaphore. As explained in the message I didn't see any use of them either. Did I miss something?

The global semaphore was to enforce an overall concurrency limit, but I somehow missed that our mpsc channel already limits that. So that's unnecessary. Thanks for the cleanup.

See 727f209 where I replace Window with Vec. Was there a good reason for using a Window? IMO it only distorts the statistics. Memory shouldn't be an issue, unless we had millions of requests, Duration is 16 bytes.

Yeah, I was a bit worried about the memory usage, actually, in the case where we handle millions of requests (which is unlikely but not impossible). But it's probably a case of premature optimization and before that part becomes a problem, other issues might arise before (such as input handling). So it's fine to use a Vec for consistency.

@mre
Copy link
Member

mre commented Dec 16, 2025

Great work!

Can we set a different default for GitHub? As per their docs, they allow up to 100 concurrent requests and many of our users use lychee to check GitHub links specifically, which would reduce friction as it closely matches the prior behavior.

If we decide to change GitHub's default, we should mention that in the README somewhere and maybe in the extended help text.

Copy link
Member

@mre mre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with my review. The logic looks correct to me. The parts I commented on were mistakes I introduced myself. I'd be thankful if you could take a look and let me know what you think.

@mre
Copy link
Member

mre commented Dec 16, 2025

One more note: have we checked that caching still works? E.g. do the links end up in the cache file at the end?

@thomas-zahner thomas-zahner force-pushed the per-host-rate-limiting branch 3 times, most recently from b9c1172 to f61d312 Compare December 17, 2025 10:46
thomas-zahner and others added 23 commits December 21, 2025 18:30
Includes removing boilerplate code and removing instances of Arc.
Reuse reqwest client and make pool non-optional.
Removed cache_max_age in the process as it didnt't seem useful.
But this time in lychee-lib client and not on demand.
This reduces complexity quite a bit.
Test that host-specific headers are correctly sent.
I do not see any reasons for having them. The global concurrency is
already limited by the mpsc channel buffer size.
This means less code and better accuracy. As `Duration` is only 16 bytes
in size, memory consumption should really not be a problem.
This makes error handling more user friendly
After previous refactoring only two variants were still used.
Use the better fitting ErrorKind::InvalidUrlHost instead of RateLimitError::UrlParseError.
Replace confusing RateLimitError::RateLimitExceeded with expect.
Co-authored-by: Matthias Endler <matthias@endler.dev>
Remove unneeded Result and move unwrap to compile time
The default host request interval is now more optimistic.
The option names are now more consistent.
We now have the host_ prefix for host related options.
This prefix is dropped inside the host config.
This also reverts "Create RequestInterval" 89fd9f9.
@thomas-zahner
Copy link
Member Author

Thanks for the high-quality additions and for following through. This is a massive step forward and I'm happy to finally see it land. A lot of our current issues will go away once merged. 😊 ⭐

Thank you for the core work! Totally agree that is quite a milestone for lychee. I did some tests where I compared lychee v0.21.0 with this branch and the link process was more than twice as fast. (so intentional slower requests leads to an overall speedup) Will do some follow up testing and this might be worth a report and blog post.

@thomas-zahner thomas-zahner merged commit f03b1c9 into master Dec 21, 2025
7 checks passed
@thomas-zahner thomas-zahner deleted the per-host-rate-limiting branch December 21, 2025 17:54
@mre mre mentioned this pull request Dec 19, 2025
@thomas-zahner thomas-zahner self-assigned this Jan 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add Per-Host Rate Limiting and Caching Add custom delay inbetween requests (prevent ban)

3 participants