Description
crawl4ai version
0.7.6
Expected Behavior
I am trying to seed URLs from this site's sitemap. The sitemap contains the URLs I want (all the /nl/agenda/ URLs), so I expect the seeder to return them.
Current Behavior
It finds 0 URLs and logs no errors, even though the sitemap visibly contains the URLs I am requesting.
[2025-10-23 16:56:55.332] [URL_SEED] ℹ Loading latest CC index from cache: ~/.crawl4ai/seeder_cache/latest_cc_index.txt
[2025-10-23 16:56:55.333] [URL_SEED] ℹ Starting URL seeding for https://www.muziekgebouw.nl with source=sitemap
[2025-10-23 16:56:55.334] [URL_SEED] ℹ Fetching from sitemaps...
[2025-10-23 16:56:55.334] [URL_SEED] ℹ Loading sitemap URLs for https:_www.muziekgebouw.nl from cache:~/.crawl4ai/seeder_cache/sitemap_https:_www.muziekgebouw.nl_3389dae3.jsonl
[2025-10-23 16:56:55.338] [URL_SEED] ℹ Producer finished.
[2025-10-23 16:57:00.340] [URL_SEED] ℹ Finished URL seeding for https://www.muziekgebouw.nl. Total URLs: 0
[2025-10-23 16:57:00.344] [URL_SEED] ℹ Closed HTTP client
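The log above shows the sitemap URLs being loaded from a cache file, with the producer finishing a few milliseconds later. A minimal sketch for checking how many entries that cache file holds might look like this. The helper name `count_cache_lines` is hypothetical, and the sketch only assumes the file stores one record per line; it does not assume anything about the record format.

```python
from pathlib import Path


def count_cache_lines(cache_file: Path, preview: int = 5) -> tuple[int, list[str]]:
    """Return the number of non-empty lines in the file and the first few of them."""
    if not cache_file.exists():
        return 0, []
    text = cache_file.read_text(encoding="utf-8")
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return len(lines), lines[:preview]


if __name__ == "__main__":
    # Path copied from the log above; "~" is expanded to the home directory.
    cache = Path(
        "~/.crawl4ai/seeder_cache/sitemap_https:_www.muziekgebouw.nl_3389dae3.jsonl"
    ).expanduser()
    total, head = count_cache_lines(cache)
    print(f"{total} cached lines")
    for line in head:
        print(line)
```

If the count is 0, the empty result is probably coming from the cached file rather than from the pattern filter.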
Is this reproducible?
Yes
Inputs Causing the Bug
import json

from crawl4ai import AsyncUrlSeeder, SeedingConfig
from crawl4ai.async_logger import AsyncLogger


# The wrapper function name is illustrative; the snippet runs inside an async function.
async def seed_agenda_urls():
    async with AsyncUrlSeeder(
        logger=AsyncLogger(log_file="logs.txt", verbose=True)
    ) as seeder:
        config = SeedingConfig(
            source="sitemap",
            pattern="*/agenda/*",
            extract_head=True,
            max_urls=100000,
            verbose=True,
            filter_nonsense_urls=False,
            force=True,
        )
        print("Extracting URLs from sitemap...")
        urls = await seeder.urls("https://www.muziekgebouw.nl", config)
        print(f"\nFound {len(urls)} URLs:")
        for u in urls:
            print(json.dumps({"url": u}, ensure_ascii=False))
        return urls
Steps to Reproduce
1. Run the code above.
Code snippets
Same snippet as under Inputs Causing the Bug.
OS
macOS
Python version
3.13.0
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
[2025-10-23 16:56:55.332] [URL_SEED] ℹ Loading latest CC index from cache: /Users/adonistseriotis/.crawl4ai/seeder_cache/latest_cc_index.txt
[2025-10-23 16:56:55.333] [URL_SEED] ℹ Starting URL seeding for https://www.muziekgebouw.nl with source=sitemap
[2025-10-23 16:56:55.334] [URL_SEED] ℹ Fetching from sitemaps...
[2025-10-23 16:56:55.334] [URL_SEED] ℹ Loading sitemap URLs for https:_www.muziekgebouw.nl from cache: /Users/adonistseriotis/.crawl4ai/seeder_cache/sitemap_https:_www.muziekgebouw.nl_3389dae3.jsonl
[2025-10-23 16:56:55.338] [URL_SEED] ℹ Producer finished.
[2025-10-23 16:57:00.340] [URL_SEED] ℹ Finished URL seeding for https://www.muziekgebouw.nl. Total URLs: 0
[2025-10-23 16:57:00.344] [URL_SEED] ℹ Closed HTTP client
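To rule out the pattern itself, the sitemap XML can be filtered outside the seeder. The sketch below is an independent check using only the standard library, not crawl4ai's own matching logic: it assumes the pattern behaves like a shell-style glob as implemented by `fnmatch`, and the sample URLs and the helper name `matching_locs` are hypothetical placeholders. In practice the XML text would be the downloaded sitemap.

```python
import fnmatch
import xml.etree.ElementTree as ET


def matching_locs(sitemap_xml: str, pattern: str) -> list[str]:
    """Collect every <loc> value in the sitemap and keep those matching the glob pattern."""
    root = ET.fromstring(sitemap_xml)
    # Tags carry the sitemap namespace, e.g. "{http://www.sitemaps.org/...}loc".
    locs = [el.text.strip() for el in root.iter() if el.tag.endswith("loc") and el.text]
    return [u for u in locs if fnmatch.fnmatchcase(u, pattern)]


# Hypothetical sample; replace with the real sitemap content.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.muziekgebouw.nl/nl/agenda/example-event</loc></url>
  <url><loc>https://www.muziekgebouw.nl/nl/contact</loc></url>
</urlset>"""

print(matching_locs(SAMPLE, "*/agenda/*"))
```

If this check finds matches on the real sitemap while the seeder still returns 0, the pattern is unlikely to be the cause.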