tssujt

Stars

6 repositories

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

Python · 4,128 stars · 288 forks · Updated Mar 17, 2025

A simple HTML content extractor in Python. Can be run as a wrapper for Mozilla's Readability.js package or in pure-python mode.

HTML · 297 stars · 40 forks · Updated Dec 2, 2024

A Python 3 compatible version of goose. Documentation: http://goose3.readthedocs.io/en/latest/index.html

HTML · 862 stars · 106 forks · Updated Dec 22, 2024

Heuristic-based boilerplate removal tool

Python · 765 stars · 83 forks · Updated Feb 25, 2025

A standalone version of the readability lib

JavaScript · 9,768 stars · 637 forks · Updated Mar 25, 2025

Python version of the Playwright testing and automation library.

Python · 12,841 stars · 982 forks · Updated Apr 14, 2025