A lightweight Model Context Protocol (MCP) server
It extracts the core content of web pages, automatically filters out ads and irrelevant elements, and converts the result to clean Markdown.
| Smart Extraction | Content Cleaning | Format Conversion | Lightweight Deployment |
|---|---|---|---|
| Axios + Cheerio + Readability | Auto-filter ads & distractions | HTML → Markdown | Zero browser dependency |
- Smart Content Extraction: Uses Axios, Cheerio, and the Readability algorithm to extract the main content of a web page
- Intelligent Content Cleaning: Automatically removes ads, navigation, sidebars, and other distracting elements
- Markdown Conversion: Converts HTML content to clean Markdown
- Image Link Optimization: Automatically handles overly long image links for better readability
- Lightweight Deployment: No browser dependencies, so deployment is simple and fast
- Multiple Output Formats: Supports pure Markdown or JSON with metadata
- MCP Protocol: Fully compatible with the Model Context Protocol standard
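The feature list does not spell out how overly long image links are handled. As a rough illustration only, the TypeScript sketch below truncates image URLs in Markdown once they exceed a length limit. The helper name `shortenImageLinks`, the regular expression, and the 80-character default are all assumptions made for this example, not code from the project.

```typescript
// Hypothetical helper, not taken from the cleanweb-mcp source.
// Matches Markdown images of the form ![alt](url) where the URL has no spaces.
const IMAGE_PATTERN = /!\[([^\]]*)\]\(([^)\s]+)\)/g;

function shortenImageLinks(markdown: string, maxLength: number = 80): string {
  return markdown.replace(
    IMAGE_PATTERN,
    (match: string, alt: string, url: string): string => {
      if (url.length <= maxLength) {
        return match; // short links are left untouched
      }
      // Keep the start of the URL so the host stays recognizable.
      return `![${alt}](${url.slice(0, maxLength)}...)`;
    }
  );
}
```

Note that a truncated URL no longer resolves, so this approach trades a working link for readability. Replacing the link with a placeholder or dropping it entirely would be equally plausible designs.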
# Install from npm
npm install cleanweb-mcp
# Or clone the repository
git clone https://github.com/guangxiangdebizi/cleanweb-mcp.git
cd cleanweb-mcp
npm install

Advantage: Uses a lightweight HTTP client, so no browser download is required and deployment is simpler. The focus is on content cleaning and optimization.

npm run build

npm run mcp:stdio

npm run mcp:sse

Server will start at http://localhost:3100/sse

npm run mcp:ws

npm run mcp:dev

Add to Claude's configuration file:
{
  "mcpServers": {
    "cleanweb-mcp": {
      "command": "node",
      "args": ["path/to/your/project/build/index.js"]
    }
  }
}

{
  "mcpServers": {
    "cleanweb-mcp-sse": {
      "type": "sse",
      "url": "http://localhost:3100/sse",
      "timeout": 600
    }
  }
}

Intelligently extract web content and convert to Markdown format.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | Yes | - | The web URL to extract content from |
| format | string | No | markdown | Return format: markdown or json |
| timeout | number | No | 30000 | Page loading timeout (milliseconds) |
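The parameter table can be read as a small validation routine: `url` is required, while `format` and `timeout` fall back to their defaults. The TypeScript sketch below shows one way to express that. The type and function names (`ExtractArgs`, `normalizeExtractArgs`) and the specific validation rules are assumptions for illustration, and the project's actual checks may differ.

```typescript
// Hypothetical types and helper; names are illustrative, not from the project source.
type OutputFormat = "markdown" | "json";

interface ExtractArgs {
  url: string;
  format: OutputFormat;
  timeout: number; // milliseconds
}

function normalizeExtractArgs(input: {
  url: string;
  format?: string;
  timeout?: number;
}): ExtractArgs {
  // Assumed rule: only http(s) URLs are accepted.
  if (!/^https?:\/\/\S+$/i.test(input.url)) {
    throw new Error("url must be an http(s) URL");
  }

  // Defaults taken from the parameter table: markdown and 30000 ms.
  const format = input.format ?? "markdown";
  if (format !== "markdown" && format !== "json") {
    throw new Error('format must be "markdown" or "json"');
  }

  const timeout = input.timeout ?? 30000;
  if (!Number.isFinite(timeout) || timeout <= 0) {
    throw new Error("timeout must be a positive number of milliseconds");
  }

  return { url: input.url, format, timeout };
}
```

Applying defaults in one place keeps the rest of the extraction code free of optional-value checks.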
// Basic usage
extract_web_content({
url: "https://example.com/article"
})
// Advanced usage
extract_web_content({
url: "https://example.com/article",
format: "json",
timeout: 60000
})

cleanweb-mcp/
├── README.md                         # Project documentation
├── package.json                      # Project configuration
├── tsconfig.json                     # TypeScript configuration
├── claude-config-example.json        # Claude configuration example
├── example-usage.md                  # Usage examples
├── build/                            # Compiled output
│   ├── index.js
│   └── tools/
│       └── web-content-extractor.js
└── src/                              # Source code
    ├── index.ts                      # MCP server main entry
    └── tools/
        └── web-content-extractor.ts  # Web content extraction tool
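For readers wiring up their own client, it may help to see what a tool invocation looks like on the wire. MCP clients send tool calls as JSON-RPC 2.0 `tools/call` requests, so the advanced `extract_web_content` usage example would roughly correspond to the message built below. This sketch assumes the tool is registered under the name `extract_web_content`. The `buildToolCall` helper and the `ToolCallRequest` type are hypothetical and exist only for this example.

```typescript
// Hypothetical helper for illustration; a real client SDK builds this message for you.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: {
    name: string;
    arguments: Record<string, unknown>;
  };
}

function buildToolCall(
  id: number,
  toolName: string,
  args: Record<string, unknown>
): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: toolName, arguments: args },
  };
}

// Mirrors the advanced usage example: JSON output with a 60-second timeout.
const toolCallRequest = buildToolCall(1, "extract_web_content", {
  url: "https://example.com/article",
  format: "json",
  timeout: 60000,
});

console.log(JSON.stringify(toolCallRequest, null, 2));
```

In practice Claude or another MCP client produces this request itself, so the sketch is mainly useful for debugging traffic against the stdio or SSE transports.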
The original Express server (server.js) can still run independently:
npm start

The MCP version provides the same core functionality but integrates with AI assistants through the MCP protocol.
- Lightweight Implementation: Uses HTTP client to fetch static content, no browser dependencies required
- Network Access: Requires access to target websites
- Static Content: Best suited to static HTML; content rendered dynamically by JavaScript may not be captured
- Timeout Settings: For slow-loading websites, increase the timeout parameter (default 30000 ms)
- Content Optimization: Automatically optimizes image link display for better readability
Issues and pull requests are welcome. If you have any questions or suggestions, feel free to contact me.
- GitHub: guangxiangdebizi
- Email: guangxiangdebizi@gmail.com
- LinkedIn: Xingyu Chen
- NPM: @xingyuchen
- GitHub Repository: https://github.com/guangxiangdebizi/cleanweb-mcp
- NPM Package: https://www.npmjs.com/package/cleanweb-mcp
MIT License - See LICENSE file for details