Langscrape: LLM-Powered HTML Extraction

Langscrape is a lightweight agentic pipeline for extracting structured data from raw HTML using Large Language Models (LLMs) and helper tools.
It combines robust scraping (via Patchright), LLM reasoning, and XPath-based extraction into a simple, extensible framework.

📦 Project Overview

The system is built from the following components:

html_fetcher
A robust scraping node that uses Patchright (a stealth-oriented Playwright fork) to fetch HTML from websites, including those behind anti-bot protection.
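As a rough illustration of what such a node might look like, here is a minimal sketch. It assumes Patchright's Python package mirrors Playwright's sync API (`patchright.sync_api.sync_playwright`). The `html_fetcher` signature, the `state` dict, and its `url`/`html` keys are hypothetical and are not taken from the Langscrape source.

```python
from typing import Callable


def fetch_with_patchright(url: str, timeout_ms: int = 30_000) -> str:
    """Fetch rendered HTML with Patchright (requires `pip install patchright`)."""
    # Imported lazily so this module still loads where Patchright is not installed.
    from patchright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
            return page.content()
        finally:
            # Always release the browser, even if navigation fails.
            browser.close()


def html_fetcher(state: dict, fetch: Callable[[str], str] = fetch_with_patchright) -> dict:
    """Hypothetical node: read state["url"] and return a new state with "html" added."""
    html = fetch(state["url"])
    return {**state, "html": html}
```

Passing `fetch` as a parameter keeps the node easy to exercise offline with a stub, which is a design choice of this sketch rather than a documented feature.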
llm
The central reasoning component. The LLM analyzes the HTML and decides how to extract the required information. It can:
- Call tools (currently: XPath extractor).
- Produce structured outputs for downstream use.
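The two capabilities above can be pictured as a small dispatch step. The sketch below is an assumption about how such a step could be written, not Langscrape's actual code: the action dictionaries (`type`, `name`, `args`, `data`) and the `run_step` function are hypothetical.

```python
from typing import Any, Callable, Dict


def run_step(action: Dict[str, Any], tools: Dict[str, Callable[..., Any]]) -> Dict[str, Any]:
    """Handle one LLM decision: either run a tool or return the final structured output."""
    if action["type"] == "tool_call":
        # Look up the requested tool and call it with the LLM-supplied arguments.
        tool = tools[action["name"]]
        return {"type": "tool_result", "name": action["name"], "content": tool(**action["args"])}
    if action["type"] == "final":
        # The LLM has finished; pass its structured data downstream.
        return {"type": "output", "content": action["data"]}
    raise ValueError(f"unknown action type: {action['type']!r}")


# Stand-in tool used only for this example.
def fake_xpath_extractor(xpath: str) -> list:
    return ["Hello"] if xpath == "//h1" else []


tool_result = run_step(
    {"type": "tool_call", "name": "xpath", "args": {"xpath": "//h1"}},
    {"xpath": fake_xpath_extractor},
)
final_result = run_step({"type": "final", "data": {"title": "Hello"}}, {})
```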
tools (XPath Extractor)
Following the approach of the paper "XPath Agent: An Efficient XPath Programming Agent Based on LLM for Web Crawler", the LLM proposes candidate XPath selectors, the pipeline evaluates them against the DOM, and the loop repeats until the desired data is extracted.
This enables:
- Minimal LLM token usage → fast and cost-efficient
- Reduced hallucinations
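The propose, evaluate, refine loop described above can be sketched as follows. This is a simplified illustration under stated assumptions. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed markup, where a real pipeline would more likely use a full HTML parser. The `propose` callback stands in for the LLM, and `evaluate_xpath` and `refine_xpath` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional, Tuple


def evaluate_xpath(root: ET.Element, xpath: str) -> List[str]:
    """Return the non-empty text of every node matched by `xpath` (empty list on bad syntax)."""
    try:
        nodes = root.findall(xpath)
    except (SyntaxError, KeyError):
        # ElementTree rejects expressions outside its supported XPath subset.
        return []
    texts = [(node.text or "").strip() for node in nodes]
    return [text for text in texts if text]


def refine_xpath(
    root: ET.Element,
    propose: Callable[[Optional[str]], str],
    max_rounds: int = 3,
) -> Tuple[Optional[str], List[str]]:
    """Ask `propose` for selectors until one matches, feeding back a short failure note."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        xpath = propose(feedback)
        values = evaluate_xpath(root, xpath)
        if values:
            return xpath, values
        feedback = f"{xpath!r} matched nothing"
    return None, []


# Example: the first candidate misses, the second one matches.
doc = ET.fromstring(
    '<html><body><div class="post"><h1 class="title">Hello</h1></div></body></html>'
)
candidates = iter([".//h2", ".//h1[@class='title']"])
best_xpath, extracted = refine_xpath(doc, lambda feedback: next(candidates))
```

In this sketch only short selector strings and one-line feedback travel between the model and the evaluator, and the returned values are read from the DOM rather than generated, which is one plausible reading of why the approach saves tokens and limits hallucination.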
output_formatter
Converts extracted data into clean, structured outputs (e.g., JSON or Python dicts; extensible to CSV or databases).
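A formatter of this kind could look roughly like the sketch below, which serializes a list of extracted records to JSON or CSV using only the standard library. The function names `to_json` and `to_csv` and the record shape are assumptions made for illustration, not Langscrape's API.

```python
import csv
import io
import json
from typing import Dict, List


def to_json(records: List[Dict[str, str]]) -> str:
    """Serialize records as pretty-printed JSON, keeping non-ASCII characters readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)


def to_csv(records: List[Dict[str, str]]) -> str:
    """Serialize records as CSV; columns follow first-seen key order, missing values are blank."""
    if not records:
        return ""
    fieldnames = list(dict.fromkeys(key for record in records for key in record))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = [{"title": "Hello", "author": "Ann"}, {"title": "World"}]
json_text = to_json(records)
csv_text = to_csv(records)
```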

🚀 Quickstart

git clone https://github.com/DelmedigoA/langscrape
cd langscrape
python -m venv .env
source .env/bin/activate
pip install -e .
bash setup.sh
python3 test.py

