Module for automatic summarization of text documents and HTML pages.
Updated Aug 11, 2025 - Python
Reworked version of the https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the actively maintained alternative)
Domain-specific language for extracting structured data from HTML documents
Xtract-html is a tool for extracting the HTML display code from a website, which you can also use for your own website.
Xtract-htmlV2 is a tool for getting the HTML code from any website you choose; it is the successor to the previous version.
Article extraction benchmark: dataset and evaluation scripts
Extract embedded metadata from HTML markup
Script for extracting units from http://vocab.nerc.ac.uk/collection/P06/current/ to easily add units to the database (this should only be temporary, to demonstrate how units can work)
Fast Python port of arc90's readability tool, updated to match the latest readability.js!
Heuristic-based boilerplate removal tool
Parse numbers written in natural language
Extracts and saves HTML, CSS, and JavaScript files from a specified URL.
Extract price amount and currency symbol from a raw text string
A Java-based server leveraging Apache Tika to extract content and metadata from files (PDF, DOCX, TXT, etc.) in a local files-to-extract directory. Supports HTML (with CSS styling) and text extraction, file listing, and metadata retrieval via MCP-compliant tools and REST APIs. Built with Spring Boot, Jetty, and MCP SDK.