Description
crawl4ai version
0.4.248
Expected Behavior
I followed the example from this link and experimented with various configurations of PruningContentFilter. I expected changes to the threshold and min_word_threshold parameters to change the length of fit_markdown, but they had no effect. To test further, I tried extreme values (threshold close to 0 and close to 1, min_word_threshold of 1 and of 100,000), yet the output remained unchanged.
I also tried different URLs, and the behavior was the same.
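To make the expectation concrete, here is a toy sketch of the behavior I would expect from the two parameters. It is not crawl4ai's actual algorithm: `toy_prune`, the `(text, score)` block format, and the scores themselves are hypothetical stand-ins used only for illustration.

```python
# Toy stand-in for a pruning filter (hypothetical, NOT crawl4ai's implementation).
# Each block is a (text, score) pair, where score is a made-up "density" in [0, 1].

def toy_prune(blocks, threshold, min_word_threshold):
    """Keep blocks that have enough words and a score at or above threshold."""
    kept = []
    for text, score in blocks:
        if len(text.split()) < min_word_threshold:
            continue  # too few words
        if score < threshold:
            continue  # not "dense" enough
        kept.append(text)
    return "\n".join(kept)


blocks = [
    ("Login", 0.10),
    ("A long article paragraph with plenty of words in it", 0.80),
    ("Short footer note here", 0.30),
]

# Permissive settings should keep everything; strict settings should keep nothing.
loose = toy_prune(blocks, threshold=0.0, min_word_threshold=1)
strict = toy_prune(blocks, threshold=0.9, min_word_threshold=100_000)
print("loose length:", len(loose))
print("strict length:", len(strict))
```

In this toy model the two extremes give clearly different output lengths, which is the kind of difference I expected to see in fit_markdown.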
Current Behavior
This is the output regardless of how PruningContentFilter is configured:
[INIT].... → Crawl4AI 0.4.248
[FETCH]... ↓ https://news.ycombinator.com... | Status: True | Time: 0.01s
[COMPLETE] ● https://news.ycombinator.com... | Status: True | Total: 0.01s
Raw Markdown length: 17825
Fit Markdown length: 3756
Process finished with exit code 0
Is this reproducible?
Yes
Inputs Causing the Bug
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator


async def main():
    # Step 1: Create a pruning filter
    prune_filter = PruningContentFilter(
        # Lower → more content retained, higher → more content pruned
        threshold=0.45,
        # "fixed" or "dynamic"
        threshold_type="dynamic",
        # Ignore nodes with <5 words
        min_word_threshold=5
    )

    # Step 2: Insert it into a Markdown Generator
    md_generator = DefaultMarkdownGenerator(content_filter=prune_filter)

    # Step 3: Pass it to CrawlerRunConfig
    config = CrawlerRunConfig(
        markdown_generator=md_generator
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com",
            config=config
        )

        if result.success:
            # 'fit_markdown' is your pruned content, focusing on "denser" text
            print("Raw Markdown length:", len(result.markdown_v2.raw_markdown))
            print("Fit Markdown length:", len(result.markdown_v2.fit_markdown))
        else:
            print("Error:", result.error_message)


if __name__ == "__main__":
    asyncio.run(main())
Steps to Reproduce
Code snippets
OS
macOS
Python version
3.12
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
No response