
A lightweight Model Context Protocol (MCP) server specialized in extracting core web content, automatically filtering ads and irrelevant elements, and converting the result to clean Markdown.


guangxiangdebizi/cleanweb-mcp


🌐 CleanWeb MCP

npm version GitHub stars License: MIT

A lightweight Model Context Protocol (MCP) server that extracts the core content of web pages, automatically filters ads and irrelevant elements, and converts the result to clean Markdown.

🚀 Quick Start • 📖 Documentation • 🔧 Configuration • 🤝 Contributing

✨ Features

| 🌐 Smart Extraction | 🧹 Content Cleaning | 📝 Format Conversion | ⚡ Lightweight Deploy |
| --- | --- | --- | --- |
| Axios + Cheerio + Readability | Auto-filter ads & distractions | HTML → Markdown | Zero browser dependency |

🎯 Core Advantages

  • 🌐 Smart Content Extraction: Uses Axios, Cheerio, and the Readability algorithm to extract the main content of a web page
  • 🧹 Intelligent Content Cleaning: Automatically removes ads, navigation, sidebars, and other distracting elements
  • 📝 Markdown Conversion: Converts HTML content to clean Markdown
  • 🖼️ Image Link Optimization: Automatically handles overly long image links for better readability
  • ⚡ Lightweight Deployment: No browser dependencies, so deployment is simple and fast
  • 🔧 Multiple Output Formats: Supports pure Markdown or JSON with metadata
  • 🚀 MCP Protocol: Fully compatible with the Model Context Protocol standard
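
The clean-then-convert flow described in the list above can be illustrated with a small, dependency-free sketch. This is not the project's implementation: the real server relies on Axios, Cheerio, and Readability, whereas the regex-based helpers below (`stripNoise`, `htmlToMarkdown`, `cleanToMarkdown`) and the list of noise tags are hypothetical stand-ins chosen only to show the idea on well-formed HTML.

```typescript
// Toy illustration only: regexes are a simplified stand-in for Cheerio + Readability.
// The tag list is an assumption made for this sketch.
const NOISE_TAGS = ["script", "style", "nav", "aside", "footer", "header"];

// Remove each noise element together with everything inside it.
function stripNoise(html: string): string {
  let out = html;
  for (const tag of NOISE_TAGS) {
    const block = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?</${tag}>`, "gi");
    out = out.replace(block, "");
  }
  return out;
}

// Convert a small subset of HTML (headings, links, paragraphs) to Markdown.
function htmlToMarkdown(html: string): string {
  return html
    // <h1>..<h6> become "#".."######" headings.
    .replace(/<h([1-6])\b[^>]*>([\s\S]*?)<\/h\1>/gi,
      (_m: string, level: string, text: string) =>
        `\n${"#".repeat(Number(level))} ${text.trim()}\n`)
    // <a href="..."> becomes [text](href).
    .replace(/<a\b[^>]*href="([^"]*)"[^>]*>([\s\S]*?)<\/a>/gi, "[$2]($1)")
    // Paragraphs become text separated by blank lines.
    .replace(/<p\b[^>]*>([\s\S]*?)<\/p>/gi, "\n$1\n")
    // Drop any remaining tags, then tidy whitespace.
    .replace(/<[^>]+>/g, "")
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}

function cleanToMarkdown(html: string): string {
  return htmlToMarkdown(stripNoise(html));
}

const sampleHtml =
  '<nav><a href="/">Home</a></nav><h1>Title</h1>' +
  '<p>Hello <a href="https://example.com">link</a>.</p><aside>Ad</aside>';

console.log(cleanToMarkdown(sampleHtml));
// # Title
//
// Hello [link](https://example.com).
```

Regular expressions are fragile on real-world HTML, which is one reason a DOM parser plus a readability heuristic is the more robust choice for production use.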

🛠️ Tech Stack

TypeScript Node.js Axios Cheerio

🚀 Quick Start

📦 Installation

# Install from npm
npm install cleanweb-mcp

# Or clone the repository
git clone https://github.com/guangxiangdebizi/cleanweb-mcp.git
cd cleanweb-mcp
npm install

💡 Advantage: Uses a lightweight HTTP client, so no browser download is required and deployment is simpler. The project focuses on content cleaning and optimization.

🔧 Build Project

npm run build

🎯 Usage

1. Stdio Mode (Local Development)

npm run mcp:stdio

2. SSE Mode (via Supergateway)

npm run mcp:sse

The server will start at http://localhost:3100/sse

3. WebSocket Mode

npm run mcp:ws

4. Development Mode (Watch file changes)

npm run mcp:dev

🛠️ Claude Configuration

Stdio Mode Configuration

Add to Claude's configuration file:

{
  "mcpServers": {
    "cleanweb-mcp": {
      "command": "node",
      "args": ["path/to/your/project/build/index.js"]
    }
  }
}

SSE Mode Configuration

{
  "mcpServers": {
    "cleanweb-mcp-sse": {
      "type": "sse",
      "url": "http://localhost:3100/sse",
      "timeout": 600
    }
  }
}
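
For readers who generate or validate these files from code, the two configuration shapes above can be described as a TypeScript union. The type names (`StdioServer`, `SseServer`, `McpConfig`) and the helper `describeServer` are hypothetical and inferred only from the two JSON examples in this README; they are not an official schema.

```typescript
// Types inferred from the JSON examples above; an assumption, not an official schema.
interface StdioServer {
  command: string;
  args: string[];
}

interface SseServer {
  type: "sse";
  url: string;
  timeout?: number;
}

type ServerEntry = StdioServer | SseServer;

interface McpConfig {
  mcpServers: Record<string, ServerEntry>;
}

// Type guard: only the SSE shape carries a "type" field.
function isSse(entry: ServerEntry): entry is SseServer {
  return "type" in entry && entry.type === "sse";
}

// Produce a one-line, human-readable summary of a server entry.
function describeServer(name: string, entry: ServerEntry): string {
  return isSse(entry)
    ? `${name}: SSE at ${entry.url}`
    : `${name}: stdio via ${entry.command} ${entry.args.join(" ")}`;
}

const config: McpConfig = {
  mcpServers: {
    "cleanweb-mcp": {
      command: "node",
      args: ["path/to/your/project/build/index.js"],
    },
    "cleanweb-mcp-sse": {
      type: "sse",
      url: "http://localhost:3100/sse",
      timeout: 600,
    },
  },
};

const summary = Object.entries(config.mcpServers).map(
  ([name, entry]) => describeServer(name, entry),
);
console.log(summary.join("\n"));
```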

🔨 API Reference

extract_web_content

Intelligently extract web content and convert to Markdown format.

Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| url | string | ✅ | - | The web URL to extract content from |
| format | string | ❌ | markdown | Return format: markdown or json |
| timeout | number | ❌ | 30000 | Page loading timeout (milliseconds) |
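
As a rough sketch of how the defaults in this table could be applied, the snippet below fills in `format` and `timeout` when they are omitted. The names `ExtractArgs`, `NormalizedArgs`, and `normalizeArgs` are hypothetical, and the restriction to `http`/`https` URLs and positive timeouts is an assumption of this example rather than documented server behavior.

```typescript
type OutputFormat = "markdown" | "json";

// Caller-supplied arguments: only `url` is required.
interface ExtractArgs {
  url: string;
  format?: OutputFormat;
  timeout?: number;
}

// Arguments after defaults have been applied.
interface NormalizedArgs {
  url: string;
  format: OutputFormat;
  timeout: number;
}

function normalizeArgs(args: ExtractArgs): NormalizedArgs {
  // Assumption for this sketch: only http(s) URLs are accepted.
  if (!/^https?:\/\//i.test(args.url)) {
    throw new Error(`url must start with http:// or https://, got "${args.url}"`);
  }
  const format: OutputFormat = args.format ?? "markdown"; // default from the table
  const timeout = args.timeout ?? 30000;                  // default from the table (ms)
  // Assumption for this sketch: the timeout must be a positive, finite number.
  if (!Number.isFinite(timeout) || timeout <= 0) {
    throw new Error(`timeout must be a positive number of milliseconds, got ${timeout}`);
  }
  return { url: args.url, format, timeout };
}

console.log(normalizeArgs({ url: "https://example.com/article" }));
// { url: 'https://example.com/article', format: 'markdown', timeout: 30000 }
```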

Usage Examples

// Basic usage
extract_web_content({
  url: "https://example.com/article"
})

// Advanced usage
extract_web_content({
  url: "https://example.com/article",
  format: "json",
  timeout: 60000
})
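
The calls above are shorthand for what an MCP client sends on the wire. Under MCP, a tool invocation is a JSON-RPC 2.0 request with the method `tools/call`; the sketch below builds such a request for `extract_web_content`. The helper name `buildToolCall` and the request `id` value are illustrative assumptions, and in practice an MCP client library would normally construct this message for you.

```typescript
// Shape of an MCP tool-call request (JSON-RPC 2.0).
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: {
    name: string;
    arguments: Record<string, unknown>;
  };
}

// Arguments accepted by extract_web_content, per the parameter table.
interface ExtractWebContentArgs {
  url: string;
  format?: "markdown" | "json";
  timeout?: number;
}

// Hypothetical helper: wraps the tool arguments in a tools/call request.
function buildToolCall(id: number, args: ExtractWebContentArgs): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: {
      name: "extract_web_content",
      arguments: { ...args },
    },
  };
}

const request = buildToolCall(1, {
  url: "https://example.com/article",
  format: "json",
  timeout: 60000,
});

console.log(JSON.stringify(request, null, 2));
```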

πŸ“ Project Structure

cleanweb-mcp/
├── 📄 README.md                 # Project documentation
├── 📦 package.json              # Project configuration
├── ⚙️ tsconfig.json             # TypeScript configuration
├── 🔧 claude-config-example.json # Claude configuration example
├── 📖 example-usage.md          # Usage examples
├── 🏗️ build/                    # Compiled output
│   ├── index.js
│   └── tools/
│       └── web-content-extractor.js
└── 📁 src/                      # Source code
    ├── index.ts                 # MCP server main entry
    └── tools/
        └── web-content-extractor.ts # Web content extraction tool

🔄 Migration from Express Server

The original Express server (server.js) can still run independently:

npm start

The MCP version provides the same core functionality but integrates with AI assistants through the MCP protocol.

🚨 Important Notes

  1. Lightweight Implementation: Uses an HTTP client to fetch static content; no browser dependencies are required
  2. Network Access: Requires network access to the target websites
  3. Static Content: Primarily suitable for static HTML content; dynamically rendered content may not be accessible
  4. Timeout Settings: For slow-loading websites, increase the timeout parameter
  5. Content Optimization: Automatically optimizes image link display for better readability
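
The README does not say how image links are optimized (note 5), so the following is only one possible interpretation: Markdown images whose URL exceeds a length threshold, such as inline `data:` URIs, are replaced with a short text placeholder. The function name `shortenImageLinks`, the 80-character default, and the placeholder format are all assumptions of this sketch.

```typescript
// One possible approach, not the project's documented behavior:
// replace Markdown images with very long URLs by a short placeholder.
function shortenImageLinks(markdown: string, maxUrlLength: number = 80): string {
  // Matches ![alt](url) where the URL contains no whitespace or ")".
  const imagePattern = /!\[([^\]]*)\]\(([^)\s]+)\)/g;
  return markdown.replace(imagePattern, (match: string, alt: string, url: string) => {
    if (url.length <= maxUrlLength) {
      return match; // short links are left untouched
    }
    const label = alt.trim() || "image";
    return `[Image: ${label}]`;
  });
}

const longDataUri = "data:image/png;base64," + "A".repeat(200);
const sampleMarkdown =
  `![logo](https://example.com/a.png) and ![chart](${longDataUri})`;

console.log(shortenImageLinks(sampleMarkdown));
// ![logo](https://example.com/a.png) and [Image: chart]
```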

🤝 Contributing

Issues and Pull Requests are welcome! If you have any questions or suggestions, feel free to contact me.

📞 Contact

🔗 Related Links

📄 License

MIT License - See LICENSE file for details
