MinerU-Webkit

MinerU-Webkit is a high-performance web content conversion toolkit. It parses HTML web pages and extracts their structured content, supporting several output formats and customizable configuration.

Key Features

🚀 High-Performance Parsing: Leverages Python and lxml for fast processing and low memory footprint.
🎯 Multi-Format Output: Supports Markdown, JSON, plain text, and other structured formats to meet diverse needs.
⚡ Asynchronous Processing: Supports asynchronous batch processing for improved efficiency with multiple web pages.
🌐 Dual-Protocol Support: A unified service gateway supports both the Model Context Protocol (MCP) and traditional RESTful APIs, so the conversion service can be invoked by AI agents (such as Claude and Cursor) as well as by traditional web clients and mobile applications.
🔧 Error Resilience: Incorporates robust error recovery mechanisms to handle malformed HTML gracefully.
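
The asynchronous batch processing mentioned above can be approximated from the caller's side. The sketch below is an assumption, not the toolkit's own async API (which this README does not show): `convert_batch` and `max_concurrency` are hypothetical names, and the converter is passed in as a plain callable that accepts the documented `main_html`, `url`, and `output_format` keyword arguments.

```python
import asyncio
from typing import Callable, Iterable


async def convert_batch(
    pages: Iterable[tuple[str, str]],
    convert: Callable[..., object],
    output_format: str = "mm_md",
    max_concurrency: int = 8,  # hypothetical knob, not a MinerU-Webkit option
) -> list[object]:
    """Convert (html, url) pairs concurrently, preserving input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def convert_one(html: str, url: str) -> object:
        async with semaphore:
            # Run the synchronous converter in a worker thread so the
            # event loop stays free to schedule the other pages.
            return await asyncio.to_thread(
                convert, main_html=html, url=url, output_format=output_format
            )

    # return_exceptions=True keeps one failing page from aborting the batch;
    # the failure is returned in that page's slot instead of being raised.
    return await asyncio.gather(
        *(convert_one(html, url) for html, url in pages),
        return_exceptions=True,
    )
```

A call would look like `asyncio.run(convert_batch(pages, convert_html_to_structured_data))`, where `pages` is a list of `(html, url)` tuples. Each entry in the result is either the converter's return value or the exception raised for that page.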

Installation

Prerequisites

Python >= 3.13
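
A quick way to confirm the interpreter meets this requirement before installing is sketched below. The helper name `python_is_supported` is hypothetical and is not part of MinerU-Webkit.

```python
import sys

# Minimum version taken from the Prerequisites section above.
MINIMUM_PYTHON = (3, 13)


def python_is_supported(version_info=sys.version_info, minimum=MINIMUM_PYTHON) -> bool:
    """Return True when the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)
```

Calling `python_is_supported()` with no arguments checks the running interpreter.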

Basic Installation (Core Functionality)

For basic usage of MinerU-Webkit, install with core dependencies only:

# Clone the repository
git clone https://github.com/ccprocessor/MinerU-Webkit.git
cd MinerU-Webkit

# Dependencies from pyproject.toml are automatically installed
uv sync --package webpage_converter

Quick Start

1. Basic Usage

from webpage_converter.convert import convert_html_to_structured_data

# Extract main content from HTML
html_content = """
<html>
  <body>
    <div>
    <h1>This is a title</h1>
    <p>This is a paragraph</p>
    <p>This is another paragraph</p>
    </div>
    <div>
    <p>Related content</p>
    <p>Advertising content</p>
    </div>
  </body>
</html>
"""
result = convert_html_to_structured_data(main_html=html_content, url="http://www.example.com", output_format='mm_md')
print(result)

Configuration

Configuration Options

Parameter	Type	Default	Description
`main_html`	str	Required	The HTML to convert
`url`	str	https://example.com	URL of the source page; required in mm_md mode
`output_format`	str	mm_md	Output format; one of mm_md (Markdown), md (Markdown with images), json, txt
`use_raw_image_url`	bool	True	Whether to use the original image URL (only valid for the mm_md format)

Optional values for `output_format`

mm_md: Markdown output
md: Markdown output with images
json: JSON output
txt: Plain-text output
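
The options above can be checked before a call is made. The following is a minimal sketch: `build_convert_kwargs` and `SUPPORTED_FORMATS` are hypothetical names that are not part of the package, and the defaults are copied from the Configuration Options table.

```python
from typing import Any

# Values listed under "Optional values for output_format".
SUPPORTED_FORMATS = ("mm_md", "md", "json", "txt")


def build_convert_kwargs(
    main_html: str,
    url: str = "https://example.com",
    output_format: str = "mm_md",
    use_raw_image_url: bool = True,
) -> dict[str, Any]:
    """Validate the documented options and return them as keyword arguments."""
    if not main_html:
        raise ValueError("main_html is required")
    if output_format not in SUPPORTED_FORMATS:
        raise ValueError(
            f"output_format must be one of {SUPPORTED_FORMATS}, got {output_format!r}"
        )
    kwargs: dict[str, Any] = {
        "main_html": main_html,
        "url": url,
        "output_format": output_format,
    }
    # The table marks use_raw_image_url as valid only for mm_md,
    # so it is passed through only in that mode.
    if output_format == "mm_md":
        kwargs["use_raw_image_url"] = use_raw_image_url
    return kwargs
```

Assuming `convert_html_to_structured_data` accepts these parameters as keywords, as the table suggests, the result could be passed as `convert_html_to_structured_data(**build_convert_kwargs(html_content, url="http://www.example.com", output_format="json"))`.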
