Python scraper based on AI
Updated Dec 8, 2025 - Python
Crawlee: a web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP, in both headful and headless modes, with proxy rotation.
Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks
The Ultimate Information Gathering Toolkit
A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560
Scalable Python web scraping scripts for 40+ popular domains
Official repository for "Craw4LLM: Efficient Web Crawling for LLM Pretraining"
The only open-source toolkit that can download SEC EDGAR financial reports and extract textual data from specific item sections into nice & clean structured JSON files. Presented at WWW 2025 @ Sydney, Australia (https://dl.acm.org/doi/10.1145/3701716.3715289)
Open-source Korean chatbot framework
A simple, easy-to-use command-line web crawler.
Undetected web-scraping & seamless HTML parsing in Python!
Free desktop SEO crawler - open source alternative to Screaming Frog and similar tools. Crawl websites, analyze links, extract SEO data, and export results without subscription fees. Fully customizable and extensible!
Data Analysis & Mining for lagou.com
Aims to bring music platforms such as NetEase Cloud Music, KuGou, QQ Music, and Kuwo together in one place
A simple distributed crawler for Zhihu, with data analysis
A Python script that lets people with no programming background generate robust leads at scale. This repo collects various versatile techniques used in lead generation.
Ignareo the Carillon, a web crawler/spider template of ultimate high concurrency built for leprechauns. Carillons as the best web spiders; Long live the golden years of leprechauns! (ISML=international saimoe; 2022 ISML is last ISML)
An easy way to brute-force web directories.