A Rust library for extracting main content from web pages using text density analysis. This is an implementation of the Content Extraction via Text Density (CETD) algorithm described in the paper by Fei Sun, Dandan Song and Lejian Liao: Content Extraction via Text Density.
See specs/overview.md for detailed architecture and internals.
Web pages often contain a lot of peripheral content like navigation menus, advertisements, footers, and sidebars. This makes it challenging to extract just the main content programmatically. This library helps solve this problem by:
- Analyzing the text density patterns in HTML documents
- Identifying content-rich sections versus navigational/peripheral elements
- Extracting the main content while filtering out noise
- Handling various HTML layouts and structures
- Build a density tree representing text distribution in the HTML document
- Calculate composite text density using multiple metrics
- Extract main content blocks based on density patterns
- Unicode Support
- Support for nested HTML structures
- Efficient processing of large documents
- Error handling for malformed HTML
- Markdown output (optional feature) - Extract content as structured markdown
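To make the idea of text density concrete, here is a minimal sketch using only the standard library. It computes the basic measure from the CETD paper, characters per tag in a subtree, with a tag count of zero treated as one. The `Node` type and the `leaf`, `parent`, `counts`, and `text_density` helpers are hypothetical names for illustration, not this crate's API, and the sketch leaves out the composite density that accounts for links.

```rust
/// Hypothetical, simplified DOM node: an element with its own text and child elements.
struct Node {
    tag: &'static str,
    text: String,
    children: Vec<Node>,
}

/// Build an element that holds text and has no children.
fn leaf(tag: &'static str, text: &str) -> Node {
    Node { tag, text: text.to_string(), children: Vec::new() }
}

/// Build an element that holds only child elements.
fn parent(tag: &'static str, children: Vec<Node>) -> Node {
    Node { tag, text: String::new(), children }
}

/// Returns (character count, descendant tag count) for the subtree at `node`.
fn counts(node: &Node) -> (usize, usize) {
    let mut chars = node.text.chars().count();
    let mut tags = 0;
    for child in &node.children {
        let (c, t) = counts(child);
        chars += c;
        tags += t + 1; // the child element itself counts as one tag
    }
    (chars, tags)
}

/// Basic text density: characters per tag, treating a tag count of 0 as 1.
fn text_density(node: &Node) -> f64 {
    let (chars, tags) = counts(node);
    chars as f64 / tags.max(1) as f64
}

fn main() {
    // A link-heavy navigation block: little text spread over several tags.
    let nav = parent("nav", vec![leaf("a", "Home"), leaf("a", "News"), leaf("a", "Blog")]);
    // An article: long paragraphs, few tags.
    let article = parent("article", vec![
        leaf("p", &"word ".repeat(12)),
        leaf("p", &"text ".repeat(12)),
    ]);
    for node in [&nav, &article] {
        println!("<{}> density = {:.1}", node.tag, text_density(node));
    }
}
```

In this toy input the navigation block scores 4.0 and the article scores 60.0, which is the kind of gap the algorithm relies on to separate content from peripheral markup.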
DOM Content Extraction includes Unicode support for handling multilingual content:
- Proper character counting using Unicode grapheme clusters
- Unicode normalization (NFC) for consistent text representation
- Support for various writing systems including Latin, Cyrillic, and CJK scripts
- Accurate text density calculations across different languages
This ensures accurate content extraction from web pages in any language, with proper handling of:
- Combining characters (like accents in European languages)
- Bidirectional text
- Complex script rendering
- Multi-code-point graphemes (like emojis)
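The standard-library sketch below illustrates why the counting unit matters for density. It shows only how byte counts and code point counts diverge for text a reader sees as a single character. Counting grapheme clusters and applying NFC normalization need external crates (for example `unicode-segmentation` and `unicode-normalization`), which this sketch deliberately leaves out. The helper name `byte_and_char_counts` is hypothetical.

```rust
/// Returns (UTF-8 byte length, number of Unicode scalar values) for `s`.
fn byte_and_char_counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    let samples = [
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
        ("decomposed e-acute", "e\u{301}"),
        // The precomposed form U+00E9, which is what NFC normalization produces.
        ("precomposed e-acute", "\u{e9}"),
        // Thumbs-up plus a skin tone modifier: one visible emoji, two code points.
        ("emoji with modifier", "\u{1F44D}\u{1F3FD}"),
    ];
    for (label, text) in samples {
        let (bytes, chars) = byte_and_char_counts(text);
        println!("{label}: {bytes} bytes, {chars} code points");
    }
}
```

Each sample is a single grapheme cluster, yet neither `len()` nor `chars().count()` reports 1 for all of them, so a density metric based on either would overweight such text.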
The minimum supported Rust version (MSRV) is 1.85 because the crate uses the 2024 edition. Living on the edge!
Basic usage example:
use scraper::Html;
use dom_content_extraction::get_content;
fn main() {
let html = r#"<!DOCTYPE html><html><body>
<nav>Home | About</nav>
<main>
<article>
<h1>Main Article</h1>
<p>This is the primary content that contains enough text to maintain proper density metrics. The paragraph needs sufficient length to establish text-to-link ratio.</p>
<p>Second paragraph adds more textual density to ensure the content extraction algorithm works correctly.</p>
<a href="\#">Related link</a>
</article>
</main>
<footer>Copyright 2024</footer>
</body></html>"#;
let document = Html::parse_document(html);
let content = get_content(&document).unwrap();
println!("{}", content);
}
Add it with:
cargo add dom-content-extraction
or add it to your Cargo.toml:
dom-content-extraction = "0.4"
To enable markdown output support:
dom-content-extraction = { version = "0.4", features = ["markdown"] }
Read the docs!
dom-content-extraction documentation
use dom_content_extraction::{DensityTree, extract_content_as_markdown, scraper::Html};
let html = "<html><body><article><h1>Title</h1><p>Content</p></article></body></html>";
let document = Html::parse_document(html);
let mut dtree = DensityTree::from_document(&document)?;
dtree.calculate_density_sum()?;
// Extract as markdown
let markdown = extract_content_as_markdown(&dtree, &document)?;
println!("{}", markdown);
# Ok::<(), dom_content_extraction::DomExtractionError>(())
Check examples.
This one extracts content from a generated "lorem ipsum" page:
cargo run --example check -- lorem-ipsum
This one prints the node with the highest density:
cargo run --example check -- test4
Extract content as markdown from lorem ipsum (requires the markdown feature):
cargo run --example check -- lorem-ipsum-markdown
There is also a scoring example, where I'm trying to implement scoring. You will need to download the GoldenStandard and finalrun-input datasets from:
https://sigwac.org.uk/cleaneval/
and unpack the archives into the data/ directory.
cargo run --example ce_score
As far as I can see, there is a problem opening some files:
Error processing file 730: Failed to read file: "data/finalrun-input/730.html"
Caused by:
stream did not contain valid UTF-8
But overall, extraction works pretty well:
Overall Performance:
Files processed: 653
Average Precision: 0.88
Average Recall: 0.83
Average F1 Score: 0.78
Average Sorensen-Dice: 0.79
Total processing time: 11.32s
Average time per file: 17.34ms
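The average F1 score above is lower than the harmonic mean of the average precision and recall. That is expected if each metric is averaged per file, which the output format suggests but which is an assumption here. The sketch below uses made-up match counts and hypothetical helper names (`score`, `harmonic_mean`, `mean`) to show the effect. It is not the scoring code from the `ce_score` example.

```rust
/// Per-file scores. All inputs in this sketch are made-up illustration data.
#[derive(Debug, Clone, Copy)]
struct Score {
    precision: f64,
    recall: f64,
    f1: f64,
}

/// Harmonic mean of precision and recall, defined as 0 when both are 0.
fn harmonic_mean(p: f64, r: f64) -> f64 {
    if p + r == 0.0 { 0.0 } else { 2.0 * p * r / (p + r) }
}

/// Score one file from token counts: tokens in both outputs, tokens the
/// extractor produced, and tokens in the reference text.
fn score(matched: usize, extracted: usize, expected: usize) -> Score {
    let precision = if extracted == 0 { 0.0 } else { matched as f64 / extracted as f64 };
    let recall = if expected == 0 { 0.0 } else { matched as f64 / expected as f64 };
    Score { precision, recall, f1: harmonic_mean(precision, recall) }
}

/// Arithmetic mean, defined as 0 for an empty input.
fn mean(values: impl Iterator<Item = f64>) -> f64 {
    let (sum, n) = values.fold((0.0, 0usize), |(s, n), v| (s + v, n + 1));
    if n == 0 { 0.0 } else { sum / n as f64 }
}

fn main() {
    // One well-extracted file and one where most of the content was missed.
    let files = [score(9, 10, 10), score(1, 1, 10)];
    let avg_p = mean(files.iter().map(|s| s.precision));
    let avg_r = mean(files.iter().map(|s| s.recall));
    let avg_f1 = mean(files.iter().map(|s| s.f1));
    println!("average precision: {avg_p:.2}");
    println!("average recall:    {avg_r:.2}");
    println!("average F1:        {avg_f1:.2}");
    println!("F1 of the averages: {:.2}", harmonic_mean(avg_p, avg_r));
}
```

A single file with high precision but poor recall pulls its own F1 far down, so the per-file average lands below the value computed from the averaged precision and recall.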
For command-line usage (URL fetching, file processing, encoding detection), see pageinfo-rs.
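The "stream did not contain valid UTF-8" failure shown earlier is the error the standard library returns when a file is read directly into a `String`. If proper encoding detection is not available, one possible stopgap is to read raw bytes and decode them lossily. This is a sketch under that assumption, not what the scoring example does. The helper names `decode_lossy` and `read_html_lossy` are hypothetical, and the actual encoding of the failing files is not known.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Decode bytes as UTF-8, replacing each invalid sequence with U+FFFD.
fn decode_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

/// Read a file as text without failing on invalid UTF-8.
/// I/O errors such as a missing file are still returned to the caller.
fn read_html_lossy(path: &Path) -> io::Result<String> {
    Ok(decode_lossy(&fs::read(path)?))
}

fn main() {
    // 0xE9 is "é" in Latin-1 but is not a valid UTF-8 sequence on its own.
    let bytes = b"<p>caf\xe9</p>";
    println!("{}", decode_lossy(bytes));

    // The path comes from the error message above and may not exist locally.
    match read_html_lossy(Path::new("data/finalrun-input/730.html")) {
        Ok(html) => println!("read {} characters", html.chars().count()),
        Err(err) => eprintln!("could not read file: {err}"),
    }
}
```

Lossy decoding keeps the file in the evaluation at the cost of replacement characters in the text, which can slightly skew character counts. Detecting the real encoding remains the better fix.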