
[Bug]: URL Seeding using Sitemap doesn't work in some cases #1559

@adonistseriotis

Description
crawl4ai version

0.7.6

Expected Behavior

I am trying to seed URLs from this sitemap, and I can see that it contains the desired URLs (all of the /nl/agenda/ URLs).

Current Behavior

It finds 0 URLs without logging any errors, which is odd because I can see with my own eyes that the sitemap contains the URLs I request.

[2025-10-23 16:56:55.332] [URL_SEED] ℹ Loading latest CC index from cache: ~/.crawl4ai/seeder_cache/latest_cc_index.txt 
[2025-10-23 16:56:55.333] [URL_SEED] ℹ Starting URL seeding for https://www.muziekgebouw.nl with source=sitemap 
[2025-10-23 16:56:55.334] [URL_SEED] ℹ Fetching from sitemaps... 
[2025-10-23 16:56:55.334] [URL_SEED] ℹ Loading sitemap URLs for https:_www.muziekgebouw.nl from cache: ~/.crawl4ai/seeder_cache/sitemap_https:_www.muziekgebouw.nl_3389dae3.jsonl 
[2025-10-23 16:56:55.338] [URL_SEED] ℹ Producer finished. 
[2025-10-23 16:57:00.340] [URL_SEED] ℹ Finished URL seeding for https://www.muziekgebouw.nl. Total URLs: 0 
[2025-10-23 16:57:00.344] [URL_SEED] ℹ Closed HTTP client 
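
For what it's worth, the glob pattern itself does match the kind of URL the sitemap contains, at least under Python's fnmatch semantics (whether crawl4ai applies SeedingConfig.pattern with fnmatch-style matching is an assumption here, not something I verified in its source):

```python
# Sanity check: does the glob pattern match the expected URL shape?
# Assumes fnmatch-style semantics, which is an assumption about crawl4ai.
from fnmatch import fnmatch

pattern = "*/agenda/*"
sample = "https://www.muziekgebouw.nl/nl/agenda/some-event"
print(fnmatch(sample, pattern))  # fnmatch's '*' also matches '/', so this prints True
```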

Is this reproducible?

Yes

Inputs Causing the Bug

import json

# Imports added for completeness; the snippet assumes all names come from crawl4ai.
from crawl4ai import AsyncLogger, AsyncUrlSeeder, SeedingConfig

async def seed_agenda_urls():
    async with AsyncUrlSeeder(
        logger=AsyncLogger(log_file="logs.txt", verbose=True)
    ) as seeder:
        config = SeedingConfig(
            source="sitemap",
            pattern="*/agenda/*",  # closing quote restored (lost in the original paste)
            extract_head=True,
            max_urls=100000,
            verbose=True,
            filter_nonsense_urls=False,
            force=True,
        )

        print("Extracting URLs from sitemap...")
        urls = await seeder.urls("https://www.muziekgebouw.nl", config)

        print(f"\nFound {len(urls)} URLs:")
        for u in urls:
            print(json.dumps({"url": u}, ensure_ascii=False))

        return urls
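
To rule out the seeder entirely, the sitemap XML can be parsed by hand and the same glob applied to its <loc> entries. This is a hypothetical diagnostic sketch (the inline XML is made-up sample data; in practice you would fetch the site's real sitemap first and pass its text in):

```python
# Diagnostic sketch: count pattern-matching URLs in a sitemap <urlset>
# document without going through AsyncUrlSeeder at all.
import xml.etree.ElementTree as ET
from fnmatch import fnmatch

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Made-up sample data standing in for the real sitemap response.
sample_sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.muziekgebouw.nl/nl/agenda/event-1</loc></url>
  <url><loc>https://www.muziekgebouw.nl/nl/over-ons</loc></url>
</urlset>"""

def agenda_urls(xml_text: str, pattern: str = "*/agenda/*") -> list[str]:
    """Return all <loc> URLs in the sitemap that match the glob pattern."""
    root = ET.fromstring(xml_text)
    locs = [el.text for el in root.iter(f"{SITEMAP_NS}loc")]
    return [u for u in locs if u and fnmatch(u, pattern)]

print(agenda_urls(sample_sitemap))
```

If this kind of direct parse finds the /nl/agenda/ URLs while the seeder reports 0, the problem is inside the seeding pipeline rather than in the sitemap or the pattern.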

Steps to Reproduce

1. Run my code


OS

macOS

Python version

3.13.0

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

[2025-10-23 16:56:55.332] [URL_SEED] ℹ Loading latest CC index from cache: /Users/adonistseriotis/.crawl4ai/seeder_cache/latest_cc_index.txt
[2025-10-23 16:56:55.333] [URL_SEED] ℹ Starting URL seeding for https://www.muziekgebouw.nl with source=sitemap
[2025-10-23 16:56:55.334] [URL_SEED] ℹ Fetching from sitemaps...
[2025-10-23 16:56:55.334] [URL_SEED] ℹ Loading sitemap URLs for https:_www.muziekgebouw.nl from cache: /Users/adonistseriotis/.crawl4ai/seeder_cache/sitemap_https:_www.muziekgebouw.nl_3389dae3.jsonl
[2025-10-23 16:56:55.338] [URL_SEED] ℹ Producer finished.
[2025-10-23 16:57:00.340] [URL_SEED] ℹ Finished URL seeding for https://www.muziekgebouw.nl. Total URLs: 0
[2025-10-23 16:57:00.344] [URL_SEED] ℹ Closed HTTP client
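
One thing that stands out in the logs: the sitemap is served from the seeder cache even though force=True is set. Clearing the cached sitemap files before re-running would rule out a stale-cache cause. A minimal sketch (clear_seeder_cache is a name invented here, not part of crawl4ai's API; the cache path is taken from the logs above):

```python
# Hypothetical helper: remove cached sitemap files so the next run
# re-fetches the sitemap from the network. Not part of crawl4ai's API.
from pathlib import Path

def clear_seeder_cache(cache_dir: Path) -> int:
    """Delete cached sitemap_*.jsonl files; return how many were removed."""
    removed = 0
    for f in cache_dir.glob("sitemap_*.jsonl"):
        f.unlink()
        removed += 1
    return removed

# Usage (path taken from the log output above):
# clear_seeder_cache(Path.home() / ".crawl4ai" / "seeder_cache")
```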

Metadata

Labels

⚙ Done: Bug fix, enhancement, FR that's completed pending release
🐞 Bug: Something isn't working
📌 Root caused: identified the root cause of bug
