-
-
Notifications
You must be signed in to change notification settings - Fork 189
feat: implement per-host rate limiting and statistics #1929
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
82f8e9b to
8c5ccbc
Compare
65c9b4d to
e8b026f
Compare
|
@mre I cleaned up the PR quite a bit. I think the PR is almost ready to be merged. I'm just considering some additional testing and changing some defaults, and maybe address the errors when Removed thingsWhile cleaning up I encountered a few things which I didn't understand why you added them. I've removed them to test and see if it made a difference. I think we're better of without them, unless you correct me. I've removed In c9578b9 I've removed See 727f209 where I replace |
bdd446d to
a096be7
Compare
You're right. Makes no sense. Similar to limiting the window-size, I was a bit worried about memory usage for long-running processes. My thought was that we could limit the in-memory cache in case link checking takes a long time. This might become relevant once we add recursion support, but even then I'm not 100% sure. So it's unlikely that this causes a problem, and I'm thankful that it got removed.
The global semaphore was to enforce an overall concurrency limit, but I somehow missed that our mpsc channel already limits that. So that's unnecessary. Thanks for the cleanup.
Yeah, I was a bit worried about the memory usage, actually, in the case where we handle millions of requests (which is unlikely but not impossible). But it's probably a case of premature optimization and before that part becomes a problem, other issues might arise before (such as input handling). So it's fine to use a |
|
Great work! Can we set a different default for GitHub? As per their docs, they allow up to 100 concurrent requests and many of our users use lychee to check GitHub links specifically, which would reduce friction as it closely matches the prior behavior. If we decide to change GitHub's default, we should mention that in the |
mre
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done with my review. The logic looks correct to me. The parts I commented on were mistakes I introduced myself. I'd be thankful if you could take a look and let me know what you think.
|
One more note: have we checked that caching still works? E.g. do the links end up in the cache file at the end? |
b9c1172 to
f61d312
Compare
Includes removing boilerplate code and removing instances of Arc.
Reuse reqwest client and make pool non-optional. Removed cache_max_age in the process as it didnt't seem useful.
But this time in lychee-lib client and not on demand. This reduces complexity quite a bit. Test that host-specific headers are correctly sent.
I do not see any reasons for having them. The global concurrency is already limited by the mpsc channel buffer size.
This means less code and better accuracy. As `Duration` is only 16 bytes in size, memory consumption should really not be a problem.
This makes error handling more user friendly
After previous refactoring only two variants were still used. Use the better fitting ErrorKind::InvalidUrlHost instead of RateLimitError::UrlParseError. Replace confusing RateLimitError::RateLimitExceeded with expect.
Co-authored-by: Matthias Endler <matthias@endler.dev>
Remove unneeded Result and move unwrap to compile time
The default host request interval is now more optimistic. The option names are now more consistent. We now have the host_ prefix for host related options. This prefix is dropped inside the host config.
This also reverts "Create RequestInterval" 89fd9f9.
e571138 to
b815e61
Compare
Thank you for the core work! Totally agree that is quite a milestone for lychee. I did some tests where I compared lychee v0.21.0 with this branch and the link process was more than twice as fast. (so intentional slower requests leads to an overall speedup) Will do some follow up testing and this might be worth a report and blog post. |
Supersedes #1844. Rebased master to resolve conflicts. Since there were many conflicts and I spent some time resolving them I didn't force push to the existing branch in case I messed something up. (it seems like I didn't)
Things to be done:
Clientfromcheck& updateArcs (https://github.com/lycheeverse/lychee/pull/1844/files#r2360356264)Fixes #1605 - Add Per-Host Rate Limiting and Caching
Fixes #989 - Add custom delay between requests (prevent ban)
Addresses #1298 - restrict custom HTTP request headers to specific URL patterns
Addresses #367 - 429 Too Many Requests
Addresses #1593 - The cache is ineffective with the default concurrency
Addresses #1815 - All duplicates should get removed, and no link ever gets checked more than once