
[Bug]: CrawlRunConfig doesn't work #642

Closed
@TejaCherukuri

Description

crawl4ai version

0.4.248

Expected Behavior

I am trying to exclude certain HTML tags and external links while scraping a web page. This is how I defined my CrawlerRunConfig:

run_config = CrawlerRunConfig(
    excluded_tags=["header", "footer", "nav", "aside", "script", "style"],  # remove entire tag blocks
    exclude_external_links=True
)

I am using it like this:

async with AsyncWebCrawler() as crawler:
    # Run the crawler on a URL
    result = await crawler.arun(
        url="https://www.example.com",
        run_config=run_config
    )

I expect not to see any header, footer, or external links in the resulting markdown.
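For reference, "excluding tags" here means dropping those elements (and everything nested inside them) before the page is converted to markdown. A minimal stdlib sketch of that behavior, not crawl4ai's actual implementation:

```python
from html.parser import HTMLParser

EXCLUDED = {"header", "footer", "nav", "aside", "script", "style"}

class TagExcluder(HTMLParser):
    """Collect text content, skipping everything inside excluded tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # nesting depth inside excluded subtrees
        self.chunks = []     # text fragments that survive filtering

    def handle_starttag(self, tag, attrs):
        if tag in EXCLUDED:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in EXCLUDED and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside any excluded element
        if self.depth == 0:
            self.chunks.append(data)

parser = TagExcluder()
parser.feed("<header>Site nav</header><p>Main content</p><footer>(c) 2025</footer>")
print("".join(parser.chunks))  # → Main content
```

With the config working as expected, only the main content should survive into the markdown output.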

Current Behavior

I can still see everything on the webpage; nothing is filtered out. Using the config below

run_config = CrawlerRunConfig(
    excluded_tags=["header", "footer", "nav", "aside", "script", "style"],  # remove entire tag blocks
    exclude_external_links=True
)
async with AsyncWebCrawler() as crawler:
    # Run the crawler on a URL
    result = await crawler.arun(
        url="https://www.example.com",
        run_config=run_config
    )

behaves exactly the same as using

async with AsyncWebCrawler() as crawler:
    # Run the crawler on a URL
    result = await crawler.arun(
        url="https://www.example.com"
    )

What might be the reason?
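One plausible cause, assuming crawl4ai's `arun` accepts the configuration via a parameter named `config` (as its documentation shows) rather than `run_config`: an unrecognized keyword argument falls into `**kwargs` and is silently discarded, so no filtering is applied. A minimal mock (illustration only, not crawl4ai code) showing the failure mode:

```python
# Hypothetical stand-in for a signature like arun(url, config=None, **kwargs)
def arun(url, config=None, **kwargs):
    # A keyword like run_config= lands in **kwargs and is silently ignored
    return {"url": url, "config": config, "ignored": sorted(kwargs)}

# Mimics the snippet above: config stays None, settings are swallowed
broken = arun("https://www.example.com", run_config={"exclude_external_links": True})

# Passing the expected keyword actually delivers the settings
fixed = arun("https://www.example.com", config={"exclude_external_links": True})
```

If this is the cause, renaming the keyword in the `arun` call (`config=run_config`) should make the filtering take effect.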

Is this reproducible?

Yes

Inputs Causing the Bug

Steps to Reproduce

I am running inside a Jupyter notebook, hence the use of nest_asyncio. See the entire code snippet below.

Code snippets

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
import nest_asyncio

nest_asyncio.apply()

async def main():
    try:
        run_config = CrawlerRunConfig(
            excluded_tags=["header", "footer", "nav", "aside", "script", "style"], # Remove entire tag blocks
            exclude_external_links=True
        )
        # Create an instance of AsyncWebCrawler
        async with AsyncWebCrawler() as crawler:
            # Run the crawler on a URL
            result = await crawler.arun(
                url="https://www.example.com",
                run_config=run_config
            )

            # Extracted markdown content
            markdown_content = result.markdown

            # Define output file path
            output_file = "output.md"

            # Write the markdown content to a file
            with open(output_file, "w", encoding="utf-8") as file:
                file.write(markdown_content)

            print(f"Markdown content successfully saved to {output_file}")

    except Exception as e:
        print(f"Error occurred: {e}")

# Run the async main function
await main()

OS

macOS

Python version

3.10.9

Browser

Chrome

Browser version

132.0.6834.112

Error logs & Screenshots (if applicable)

No response


Labels

❗ Invalid (This doesn't seem right) · 🐞 Bug (Something isn't working)
