Merged
43 commits
f152b4d
fix: add support for name attributes in HTML fragment extraction
mre Sep 5, 2025
f165f37
feat: implement per-host rate limiting and statistics
mre Sep 7, 2025
c7480ff
fix: skip rate limiting tracking for file:// URLs
mre Sep 7, 2025
462ba32
feat: improve rate limiting logging and output formatting
mre Sep 7, 2025
7d25ea2
Fix lints
mre Sep 8, 2025
956e20a
Fix cookie jar sharing in per-host rate limiting
mre Sep 8, 2025
20844d2
Fix missing User-Agent header in per-host clients
mre Sep 8, 2025
9ecb5e9
Bring back global headers (e.g. for user-agent)
mre Sep 8, 2025
ded5bc1
Fix redirect handling in per-host clients
mre Sep 8, 2025
4577865
Pass missing args: max_redirects, timeout, allow_insecure
mre Sep 8, 2025
4e04271
Refactor host stats formatters to remove unused parameters and improv…
mre Sep 18, 2025
eed7576
remove confusing comment
mre Sep 18, 2025
ae104a0
Create `display_per_host_statistics` in separate file
mre Sep 18, 2025
8437068
Remove redundant check for `self.hosts`
mre Sep 22, 2025
92854c1
Import `std::collections::HashMap`
mre Sep 22, 2025
ada65ac
Use closures instead of if
mre Sep 22, 2025
d6f7836
Rename flags:
mre Sep 22, 2025
5960946
Fix `help` formatting
mre Sep 22, 2025
ea22e44
Reduce code duplication
thomas-zahner Oct 3, 2025
d33650d
Update documentation to reference hosts option
thomas-zahner Nov 21, 2025
dbada0d
clippy --fix
thomas-zahner Nov 27, 2025
3e0755c
Return HostPool instead of Client & code cleanup
thomas-zahner Nov 27, 2025
0939004
Move inner `Arc`s to the outside
thomas-zahner Nov 28, 2025
087ed56
Fix deadlock
thomas-zahner Nov 28, 2025
040971d
Update config options
thomas-zahner Nov 28, 2025
328ee22
Simplify host pool
thomas-zahner Dec 9, 2025
42dc072
Build host-specific reqwest clients again
thomas-zahner Dec 9, 2025
595e634
Remove max_concurrency and global_semaphore
thomas-zahner Dec 9, 2025
d374155
Update docs & reduce complexity
thomas-zahner Dec 10, 2025
3fdd992
Extract output functions
thomas-zahner Dec 10, 2025
7be4516
Replace Window with Vec
thomas-zahner Dec 10, 2025
420e822
Update RateLimitError
thomas-zahner Dec 11, 2025
00d0c13
Create RequestInterval
thomas-zahner Dec 11, 2025
08602e7
Remove RateLimitError
thomas-zahner Dec 12, 2025
75df7e8
Test and improve rate limit header handling
thomas-zahner Dec 15, 2025
3d0d4fa
Apply @mre's suggestions
thomas-zahner Dec 16, 2025
c748f7d
Apply suggestions from code review
thomas-zahner Dec 16, 2025
6297aa0
Fix tests
thomas-zahner Dec 17, 2025
a551fee
Minor improvements
thomas-zahner Dec 18, 2025
5704305
Remove reqwest_client from WebsiteChecker
thomas-zahner Dec 18, 2025
d5e8afe
Reference rate-limits crate as per @mre's suggestion
thomas-zahner Dec 18, 2025
0f21985
Update option names and the default interval value
thomas-zahner Dec 19, 2025
b815e61
Allow 0 to disable per-host rate limiting
thomas-zahner Dec 19, 2025
69 changes: 69 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default.

36 changes: 35 additions & 1 deletion README.md
@@ -423,7 +423,12 @@ Options:
--default-extension <EXTENSION>
This is the default file extension that is applied to files without an extension.

-This is useful for files without extensions or with unknown extensions. The extension will be used to determine the file type for processing. Examples: --default-extension md, --default-extension html
+This is useful for files without extensions or with unknown extensions.
+The extension will be used to determine the file type for processing.
+
+Examples:
+--default-extension md
+--default-extension html

--dump
Don't perform any link checking. Instead, dump all the links extracted from inputs that would be checked
@@ -519,10 +524,39 @@ Options:
You can specify custom headers in the format 'Name: Value'. For example, 'Accept: text/html'.
This is the same format that other tools like curl or wget use.
Multiple headers can be specified by using the flag multiple times.
+The specified headers are used for ALL requests.
+Use the `hosts` option to configure headers on a per-host basis.

--hidden
Do not skip hidden directories and files

+--host-concurrency <HOST_CONCURRENCY>
+Default maximum concurrent requests per host (default: 10)
+
+This limits the maximum amount of requests that are sent simultaneously
+to the same host. This helps to prevent overwhelming servers and
+running into rate-limits. Use the `hosts` option to configure this
+on a per-host basis.
+
+Examples:
+--host-concurrency 2 # Conservative for slow APIs
+--host-concurrency 20 # Aggressive for fast APIs
+
+--host-request-interval <HOST_REQUEST_INTERVAL>
+Minimum interval between requests to the same host (default: 50ms)
+
+Sets a baseline delay between consecutive requests to prevent
+overloading servers. The adaptive algorithm may increase this based
+on server responses (rate limits, errors). Use the `hosts` option
+to configure this on a per-host basis.
+
+Examples:
+--host-request-interval 50ms # Fast for robust APIs
+--host-request-interval 1s # Conservative for rate-limited APIs
+
+--host-stats
+Show per-host statistics at the end of the run
+
-i, --insecure
Proceed for server connections considered insecure (invalid TLS)
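
Reviewer note: the `--host-request-interval` help text describes a minimum spacing between requests to the same host. One way to picture that bookkeeping is sketched below. This is not lychee's implementation: `HostPacer` and `delay_for` are hypothetical names, the adaptive back-off mentioned in the help text is left out, and treating an interval of zero as "no pacing" is an assumption based on the commit "Allow 0 to disable per-host rate limiting".

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical helper (not part of lychee): remembers when each host
/// may next be contacted and reports how long a caller should wait.
struct HostPacer {
    interval: Duration,
    next_slot: HashMap<String, Instant>,
}

impl HostPacer {
    fn new(interval: Duration) -> Self {
        Self {
            interval,
            next_slot: HashMap::new(),
        }
    }

    /// Reserve the next free slot for `host` and return the delay the
    /// caller should sleep before sending, measured from `now`.
    fn delay_for(&mut self, host: &str, now: Instant) -> Duration {
        // Assumption: an interval of zero disables pacing entirely.
        if self.interval.is_zero() {
            return Duration::ZERO;
        }
        // Use the reserved slot if it lies in the future, otherwise send now.
        let slot = match self.next_slot.get(host) {
            Some(&next) if next > now => next,
            _ => now,
        };
        // The following request to this host must wait one more interval.
        self.next_slot.insert(host.to_owned(), slot + self.interval);
        slot - now
    }
}
```

Each host has its own slot, so a slow host does not delay requests to other hosts. A caller would sleep for the returned duration before sending.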

4 changes: 4 additions & 0 deletions fixtures/configs/headers.toml
@@ -4,3 +4,7 @@ X-Bar = "Baz"

# Alternative TOML syntax:
# header = { X-Foo = "Bar", X-Bar = "Baz" }
+
+
+[hosts."127.0.0.1"]
+headers = { "X-Host-Specific" = "Foo" }
7 changes: 6 additions & 1 deletion lychee-bin/src/client.rs
@@ -2,7 +2,7 @@ use crate::options::{Config, HeaderMapExt};
use crate::parse::{parse_duration_secs, parse_remaps};
use anyhow::{Context, Result};
use http::{HeaderMap, StatusCode};
-use lychee_lib::{Client, ClientBuilder};
+use lychee_lib::{Client, ClientBuilder, ratelimit::RateLimitConfig};
use regex::RegexSet;
use reqwest_cookie_store::CookieStoreMutex;
use std::sync::Arc;
@@ -55,6 +55,11 @@ pub(crate) fn create(cfg: &Config, cookie_jar: Option<&Arc<CookieStoreMutex>>) -
.include_fragments(cfg.include_fragments)
.fallback_extensions(cfg.fallback_extensions.clone())
.index_files(cfg.index_files.clone())
+.rate_limit_config(RateLimitConfig::from_options(
+cfg.host_concurrency,
+cfg.host_request_interval,
+))
+.hosts(cfg.hosts.clone())
.build()
.client()
.context("Failed to create request client")
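
Reviewer note: the hunk above passes `cfg.host_concurrency` and `cfg.host_request_interval` to `RateLimitConfig::from_options`, whose body is not part of this excerpt. The sketch below only illustrates the general idea of turning two optional CLI values into a configuration with defaults. `Limits` is a hypothetical stand-in, the `Option` parameter types are an assumption, and the default values are the ones quoted in the README help text (10 concurrent requests, 50ms).

```rust
use std::time::Duration;

/// Hypothetical stand-in for a per-host rate-limit configuration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Limits {
    concurrency: usize,
    interval: Duration,
}

impl Limits {
    /// Defaults as quoted in the README help text.
    const DEFAULT_CONCURRENCY: usize = 10;
    const DEFAULT_INTERVAL: Duration = Duration::from_millis(50);

    /// Assumption: each option is optional and falls back to its default.
    fn from_options(concurrency: Option<usize>, interval: Option<Duration>) -> Self {
        Self {
            concurrency: concurrency.unwrap_or(Self::DEFAULT_CONCURRENCY),
            interval: interval.unwrap_or(Self::DEFAULT_INTERVAL),
        }
    }
}
```

Keeping the defaults in one place means the CLI help text and the library cannot drift apart unnoticed, provided both read from the same constants.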
53 changes: 36 additions & 17 deletions lychee-bin/src/commands/check.rs
@@ -1,8 +1,10 @@
use std::collections::HashSet;
-use std::sync::{Arc, Mutex};
+use std::sync::Arc;
+use std::sync::Mutex;
use std::time::Duration;

use futures::StreamExt;
+use lychee_lib::ratelimit::HostPool;
use reqwest::Url;
use tokio::sync::mpsc;
use tokio_stream::wrappers::ReceiverStream;
@@ -24,7 +26,7 @@ use super::CommandParams;

pub(crate) async fn check<S>(
params: CommandParams<S>,
-) -> Result<(ResponseStats, Arc<Cache>, ExitCode), ErrorKind>
+) -> Result<(ResponseStats, Cache, ExitCode, Arc<HostPool>), ErrorKind>
where
S: futures::Stream<Item = Result<Request, RequestError>>,
{
@@ -41,7 +43,6 @@ where
} else {
ResponseStats::default()
};
-let cache_ref = params.cache.clone();

let client = params.client;
let cache = params.cache;
@@ -53,7 +54,7 @@ where
let accept = params.cfg.accept.into();

// Start receiving requests
-tokio::spawn(request_channel_task(
+let handle = tokio::spawn(request_channel_task(
recv_req,
send_resp,
max_concurrency,
@@ -74,8 +75,9 @@ where
stats,
));

-// Wait until all messages are sent
-send_inputs_loop(params.requests, send_req, &progress).await?;
+// Wait until all requests are sent
+send_requests(params.requests, send_req, &progress).await?;
+let (cache, client) = handle.await?;

// Wait until all responses are received
let result = show_results_task.await?;
@@ -103,7 +105,8 @@ where
} else {
ExitCode::LinkCheckFailure
};
-Ok((stats, cache_ref, code))
+
+Ok((stats, cache, code, client.host_pool()))
}

async fn suggest_archived_links(
@@ -143,7 +146,7 @@ async fn suggest_archived_links(
// drops the `send_req` channel on exit
// required for the receiver task to end, which closes send_resp, which allows
// the show_results_task to finish
-async fn send_inputs_loop<S>(
+async fn send_requests<S>(
requests: S,
send_req: mpsc::Sender<Result<Request, RequestError>>,
progress: &Progress,
@@ -180,17 +183,17 @@ async fn request_channel_task(
send_resp: mpsc::Sender<Result<Response, ErrorKind>>,
max_concurrency: usize,
client: Client,
-cache: Arc<Cache>,
+cache: Cache,
cache_exclude_status: HashSet<u16>,
accept: HashSet<u16>,
-) {
+) -> (Cache, Client) {
StreamExt::for_each_concurrent(
ReceiverStream::new(recv_req),
max_concurrency,
|request: Result<Request, RequestError>| async {
let response = handle(
&client,
-cache.clone(),
+&cache,
cache_exclude_status.clone(),
request,
accept.clone(),
@@ -204,6 +207,8 @@ async fn request_channel_task(
},
)
.await;
+
+(cache, client)
}

/// Check a URL and return a response.
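
Reviewer note: the hunks above replace `Arc<Cache>` with plain ownership. The spawned task takes the cache and client by value, lends `&cache` to each request handler, and hands both back through the join handle once the stream is drained. The same hand-off can be sketched with a plain thread. `std::thread` is used here instead of `tokio::spawn` only to keep the example free of dependencies, and `Cache`, `worker`, and `run_worker` are stand-ins rather than lychee's types.

```rust
use std::collections::HashMap;
use std::thread;

/// Stand-in for the real cache type (URI -> status code).
type Cache = HashMap<String, u16>;

/// Owns `cache` while working, borrows it for each lookup, and returns
/// it to the caller together with the number of cache hits.
fn worker(cache: Cache, uris: Vec<String>) -> (Cache, usize) {
    let hits = uris.iter().filter(|uri| cache.contains_key(*uri)).count();
    (cache, hits)
}

/// Moves the cache into a thread and takes it back through the join
/// handle, so no `Arc` or clone is needed to use it afterwards.
fn run_worker(cache: Cache, uris: Vec<String>) -> (Cache, usize) {
    let handle = thread::spawn(move || worker(cache, uris));
    handle.join().expect("worker thread panicked")
}
```

The caller regains full ownership after the join, which is what allows `check` to return `Cache` instead of `Arc<Cache>` in the new signature.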
@@ -235,7 +240,7 @@ async fn check_url(client: &Client, request: Request) -> Response {
/// a failed response.
async fn handle(
client: &Client,
-cache: Arc<Cache>,
+cache: &Cache,
cache_exclude_status: HashSet<u16>,
request: Result<Request, RequestError>,
accept: HashSet<u16>,
@@ -247,6 +252,8 @@ async fn handle(
};

let uri = request.uri.clone();
+
+// First check the persistent disk-based cache
if let Some(v) = cache.get(&uri) {
// Found a cached request
// Overwrite cache status in case the URI is excluded in the
@@ -260,16 +267,28 @@ async fn handle(
// code.
Status::from_cache_status(v.value().status, &accept)
};
+
+// Track cache hit in the per-host stats (only for network URIs)
+if !uri.is_file()
+&& let Err(e) = client.host_pool().record_cache_hit(&uri)
+{
+log::debug!("Failed to record cache hit for {uri}: {e}");
+}
+
return Ok(Response::new(uri.clone(), status, request.source.into()));
}

-// Request was not cached; run a normal check
+// Cache miss - track it and run a normal check (only for network URIs)
+if !uri.is_file()
+&& let Err(e) = client.host_pool().record_cache_miss(&uri)
+{
+log::debug!("Failed to record cache miss for {uri}: {e}");
+}
+
let response = check_url(client, request).await;

-// - Never cache filesystem access as it is fast already so caching has no
-// benefit.
-// - Skip caching unsupported URLs as they might be supported in a
-// future run.
+// - Never cache filesystem access as it is fast already so caching has no benefit.
+// - Skip caching unsupported URLs as they might be supported in a future run.
// - Skip caching excluded links; they might not be excluded in the next run.
// - Skip caching links for which the status code has been explicitly excluded from the cache.
let status = response.status();
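
Reviewer note: the hunk above records cache hits and misses per host and skips `file://` URIs. A rough picture of that kind of bookkeeping follows, assuming counters keyed by host name. `HostStats`, `StatsTable`, and `record` are hypothetical names, and the real `record_cache_hit` / `record_cache_miss` calls return a `Result`, which this sketch does not model.

```rust
use std::collections::HashMap;

/// Hypothetical per-host counters (not lychee's actual types).
#[derive(Debug, Default, Clone, PartialEq, Eq)]
struct HostStats {
    cache_hits: u64,
    cache_misses: u64,
}

impl HostStats {
    /// Share of lookups answered from the cache, or `None` when no
    /// lookup has been recorded yet.
    fn hit_rate(&self) -> Option<f64> {
        let total = self.cache_hits + self.cache_misses;
        if total == 0 {
            None
        } else {
            Some(self.cache_hits as f64 / total as f64)
        }
    }
}

#[derive(Debug, Default)]
struct StatsTable {
    hosts: HashMap<String, HostStats>,
}

impl StatsTable {
    /// Count one cache lookup for `host`. `None` stands for URIs without
    /// a network host (such as file paths), which are not tracked.
    fn record(&mut self, host: Option<&str>, hit: bool) {
        if let Some(host) = host {
            let stats = self.hosts.entry(host.to_owned()).or_default();
            if hit {
                stats.cache_hits += 1;
            } else {
                stats.cache_misses += 1;
            }
        }
    }
}
```

A table like this is enough to print a per-host summary at the end of a run, which is what the `--host-stats` flag in the README hunk describes.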
3 changes: 1 addition & 2 deletions lychee-bin/src/commands/mod.rs
@@ -10,7 +10,6 @@ pub(crate) use dump_inputs::dump_inputs;
use std::fs;
use std::io::{self, Write};
use std::path::PathBuf;
-use std::sync::Arc;

use crate::cache::Cache;
use crate::options::Config;
@@ -20,7 +19,7 @@ use lychee_lib::{Client, Request};
/// Parameters passed to every command
pub(crate) struct CommandParams<S: futures::Stream<Item = Result<Request, RequestError>>> {
pub(crate) client: Client,
-pub(crate) cache: Arc<Cache>,
+pub(crate) cache: Cache,
pub(crate) requests: S,
pub(crate) cfg: Config,
}