Closed
Description
crawl4ai version
0.4.248
Expected Behavior
I am trying to exclude certain HTML tags and external links while scraping a web page. This is how I defined my CrawlerRunConfig():
run_config = CrawlerRunConfig(
    excluded_tags=["header", "footer", "nav", "aside", "script", "style"],  # Remove entire tag blocks
    exclude_external_links=True
)
I am using it as follows:
async with AsyncWebCrawler() as crawler:
    # Run the crawler on a URL
    result = await crawler.arun(
        url="https://www.example.com",
        run_config=run_config
    )
I expect the resulting markdown to contain no header or footer content and no external links.
Current Behavior
I can still see everything from the web page in the output; nothing is filtered out. Using the following
run_config = CrawlerRunConfig(
    excluded_tags=["header", "footer", "nav", "aside", "script", "style"],  # Remove entire tag blocks
    exclude_external_links=True
)
async with AsyncWebCrawler() as crawler:
    # Run the crawler on a URL
    result = await crawler.arun(
        url="https://www.example.com",
        run_config=run_config
    )
gives the same result as using
async with AsyncWebCrawler() as crawler:
    # Run the crawler on a URL
    result = await crawler.arun(
        url="https://www.example.com"
    )
What might be the reason?
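One possible explanation, offered as an assumption rather than a confirmed diagnosis: if `arun()` expects the run configuration under the keyword `config` and also accepts arbitrary extra keyword arguments, then `run_config=...` would be silently swallowed and the crawl would run with defaults. The sketch below uses a hypothetical stand-in function (not crawl4ai's code) to show how that failure mode produces no error at all.

```python
import asyncio


# Hypothetical stand-in, NOT crawl4ai's actual implementation. It assumes the
# real arun() takes the run configuration as `config` and tolerates extra
# keyword arguments.
async def arun(url, config=None, **kwargs):
    # An unrecognised keyword such as `run_config` lands in kwargs and is
    # never read, so the caller's settings are dropped without any warning.
    return {"url": url, "config_used": config}


async def demo():
    settings = {"exclude_external_links": True}
    # Misnamed keyword: accepted, but the settings never reach `config`.
    wrong = await arun(url="https://www.example.com", run_config=settings)
    # Expected keyword: the settings are picked up.
    right = await arun(url="https://www.example.com", config=settings)
    return wrong["config_used"], right["config_used"]


# In a notebook, use `await demo()` instead of asyncio.run().
wrong_config, right_config = asyncio.run(demo())
print(wrong_config)  # None
print(right_config)
```

If that assumption holds for version 0.4.248, changing `run_config=run_config` to `config=run_config` in the `arun()` call would be the first thing to try.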
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
I am running inside a Jupyter notebook, hence the use of nest_asyncio. The entire code snippet is below.
Code snippets
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
import nest_asyncio
nest_asyncio.apply()
async def main():
    try:
        run_config = CrawlerRunConfig(
            excluded_tags=["header", "footer", "nav", "aside", "script", "style"],  # Remove entire tag blocks
            exclude_external_links=True
        )
        # Create an instance of AsyncWebCrawler
        async with AsyncWebCrawler() as crawler:
            # Run the crawler on a URL
            result = await crawler.arun(
                url="https://www.example.com",
                run_config=run_config
            )
            # Extracted markdown content
            markdown_content = result.markdown
            # Define output file path
            output_file = "output.md"
            # Write the markdown content to a file
            with open(output_file, "w", encoding="utf-8") as file:
                file.write(markdown_content)
            print(f"Markdown content successfully saved to {output_file}")
    except Exception as e:
        print(f"Error occurred: {e}")
# Run the async main function
await main()
OS
macOS
Python version
3.10.9
Browser
Chrome
Browser version
132.0.6834.112
Error logs & Screenshots (if applicable)
No response