data-extraction

Here are 3,331 public repositories matching this topic...

firecrawl / firecrawl

The API to search, scrape, and interact with the web at scale. 🔥

markdown crawler scraper ai html-to-markdown web-crawler scraping web-scraper web-scraping data-extraction webscraping web-data-extraction ai-agents web-search ai-search web-data llm ai-crawler ai-scraping

Updated Jul 24, 2026
TypeScript

D4Vinci / Scrapling

🕷️ An adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl!

Updated Jul 23, 2026
Python

ScrapeGraphAI / Scrapegraph-ai

Python scraper based on AI

Updated Jul 20, 2026
Python

getmaxun / maxun

🔥 The open-source no-code platform for web scraping, crawling, search and AI data extraction • Turn websites into structured APIs in minutes 🔥

api crawler scraper automation crawling web-scraper self-hosted web-scraping data-extraction webscraping agents browser-automation no-code web-search rpa robotic-process-automation nocode playwright

Updated Jul 18, 2026
TypeScript

MontFerret / ferret

Declarative data automation language and Go runtime for structured extraction workflows.

go html golang library runtime dsl web-scraping data-extraction golang-library query-language web-crawling browser-automation chrome-devtools-protocol data-automation

Updated Jul 24, 2026
Go

vi3k6i5 / flashtext

Extract keywords from a sentence or replace keywords in sentences.

nlp word2vec search-in-text data-extraction keyword-extraction

Updated Apr 13, 2025
Python

browser-act / skills

Browser automation CLI built for AI agents. Break through anti-bot walls, hand off to humans across platforms when stuck. Parallel multi-task execution, independent multi-session operation, isolated multi-account browsing.

Updated Jul 21, 2026
Python

brightdata / brightdata-mcp

A powerful Model Context Protocol (MCP) server that provides an all-in-one solution for public web access.

mcp scraping web-scraping data-extraction data-collection structured-data web-crawling browser-automation ai-agents web-data scraping-tools anti-bot-detection llm ai-integrations mcp-server modelcontextprotocol

Updated Jun 21, 2026
JavaScript

hhursev / recipe-scrapers

Python package for scraping recipes data

python food library schema-org json-ld microdata data-extraction opengraph html-parsing meal recipe-scraper recipe-schema recipe-parser recipe-extraction

Updated Jul 4, 2026
Python

saifyxpro / HeadlessX

The undetected self-hosted browser automation platform. Powered by Camoufox (Firefox) for 0% detection rates. Built for speed, privacy, and scalability.

automation headless web-scraping chromedriver data-extraction automation-api chrome-headless browser-automation headless-chrome browser-testing web-automation puppeteer browserless automation-platform playwright headless-service scraping-service playwright-automation container-automation

Updated Jun 25, 2026
TypeScript

0xMassi / webclaw

Fast, local-first web content extraction for LLMs. Scrape, crawl, extract structured data — all from Rust. CLI, REST API, and MCP server.

Updated Jul 23, 2026
Rust

shcherbak-ai / contextgem

ContextGem: Effortless LLM extraction from documents

nlp ai text-analysis docx data-extraction contract-analysis legaltech docx2txt unstructured-data document-intelligence llm docx2md prompt-engineering llms generative-ai llm-framework llm-pipeline llm-extraction

Updated Jun 6, 2026
Python

JonathanLink / PDFLayoutTextStripper

Converts a pdf file into a text file while keeping the layout of the original pdf. Useful to extract the content from a table in a pdf file for instance. This is a subclass of PDFTextStripper class (from the Apache PDFBox library).

java pdf text layout extract pdfbox data-extraction

Updated Dec 17, 2023
Java

hi-primus / optimus

🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark

data-science machine-learning spark bigdata data-transformation pyspark data-extraction data-analysis data-wrangling dask data-exploration data-preparation data-cleaning data-profiling data-cleansing big-data-cleaning data-cleaner cudf dask-cudf

Updated Dec 2, 2024
Python

thinh-vu / vnstock

A beginner-friendly yet powerful Python toolkit for financial analysis and automation — built to make modern investing accessible to everyone

stock-market data-extraction quantitative-finance quantitative-trading quantitative-analysis stock-screener

Updated Jul 23, 2026
Python

raznem / parsera

Lightweight library for scraping web-sites with LLMs

python opensource ai scraping data-extraction webscraping playwright llm ai-scraping

Updated Dec 17, 2025
Python

soxoj / socid-extractor

⛏️ The extraction engine behind Maigret: turn any profile URL into a structured OSINT record across 150+ sites

osint scraping data-extraction html-parsing osint-framework socmint osint-python osint-tool socid account-extraction

Updated Jul 21, 2026
Python

yfedoseev / pdf_oxide

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

python markdown rust fast pdf text-extraction data-extraction pdf-generation pdf-to-text pdf-library pdf-parser document-processing rag pyo3 pdf-editor image-extraction llm pdf-to-markdown

Updated Jul 24, 2026
Rust

eclaire-labs / eclaire

Local-first, open-source AI assistant for your data. Unify tasks, notes, docs, photos, and bookmarks. Private, self-hosted, and extensible via APIs.

open-source automation privacy ocr ai rest-api bookmarks self-hosted data-extraction note-taking web-archiving bookmark-manager task-management document-processing on-device-ai local-first personal-knowledge-management ai-assistant llm

Updated May 14, 2026
TypeScript

polyrabbit / hacker-news-digest

📰 Let ChatGPT Summarize Hacker News for You

python rss crawler machine-learning spider hacker-news news-aggregator openai hacker-news-reader data-extraction extract-summaries hacker-news-digest openai-api chatgpt chatgpt-api

Updated Jul 10, 2026
Python
