MinerU-HTML: An SLM-powered HTML main content extractor that outputs clean HTML bodies. Perfect for Deep Research Agents, RAG applications, and training data generation.
Updated Mar 27, 2026 - Python
Self-hosted search + markdown harvester for AI agents. SearXNG (100+ engines) + FastAPI + trafilatura. Tavily-compatible /search plus /extract with size presets and pagination. One-command Docker Compose.
pebkac Chrome Nonautomation - A Local LLM-Driven Web Co-Browser using Smolagents, Zendriver, Trafilatura.
Fast, accurate web content extraction in Rust. ML page-type classification, per-type extraction, confidence scoring. F1=0.966 on ScrapingHub (#1), F1=0.859 across 2,008 annotated pages (1,497 development + 511 held-out test
Web scraper in Python
Tools for LLMs to anonymously search and browse the web
Fast and accurate web content extraction
Telegram Mini App that saves internet articles to read them later
Lean Python tool for extracting clean, LLM-optimized markdown from web pages. Handles dynamic content with Playwright + Trafilatura for maximum information extraction efficiency.
Selective web content extraction for AI agents — URL + query returns only the chunks that matter (Python library + MCP server)
ChatGPT AI Clone
A pipe-based news article scraping and metadata extraction library for Python
Real-time AI search and chat backend with WebSocket streaming, powered by Tavily web search and Google Gemini for Flutter apps.
Google search skill for OpenClaw via Serper. Full page content extracted concurrently with trafilatura — every result returns article text, not just snippets.
Protocol for collecting and analyzing Wayback Machine archives for textual and statistical analysis
🕵️‍♂️ Enable anonymous web searches for your LLM with the first-ever Model Context Protocol server utilizing Tor for secure and private information retrieval.
🕷️ Clean, chunked documentation crawler optimized for RAG & AnythingLLM. Dockerized.