GPT-Scraper

GPT-Scraper is an autonomous, LLM-based agent that generates code to extract structured information from web pages. Built on advanced language models such as GPT-4, it converts user-defined requirements into Python code that carries out the desired scraping tasks, simplifying the process of turning a web page into structured data.

Features

  • Dynamic Code Generation: Generates Python parsing code based on user requirements and webpage content.
  • Flexible Data Structures: Supports Pydantic models for defining the structure of the scraped data (see the sketch after this list).
  • Webpage Source Handling: Capable of extracting HTML content from both static and dynamic web pages using Selenium.
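
As a minimal sketch of the Pydantic-based approach, a plain BaseModel subclass is enough to describe one record of scraped data; the Article model here is hypothetical, and the Example section further down passes such a model as data_structure:

from pydantic import BaseModel

# Hypothetical model: each field becomes a key in every scraped record.
class Article(BaseModel):
    title: str
    link: str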

Installation

Prerequisites

  • Python 3.6 or higher: Ensure you have Python installed. You can download it from the official Python website (python.org).
  • ChromeDriver: Selenium requires ChromeDriver to interact with the Chrome browser. Download it from the official ChromeDriver site and make sure the binary is on your system's PATH.
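
You can verify both prerequisites from a terminal; if the second command is not found, the ChromeDriver binary is not on your PATH:

$ python --version
$ chromedriver --version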

Install from git

$ pip install git+https://github.com/bes-dev/gpt-scraper.git

Install from pip

$ pip install gpt-scraper
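
The tool calls the OpenAI API (visible in the sample session log below), so an API key must be available to the process. Assuming it uses the standard OpenAI client, the key is read from the OPENAI_API_KEY environment variable:

$ export OPENAI_API_KEY="sk-..."   # replace with your own key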

CLI Tool Usage

Commands

$ gpt-scraper --help
usage: gpt-scraper [-h] (--requirements REQUIREMENTS | --scraper-file SCRAPER_FILE) --url URL [--output OUTPUT] [--wait-by {id,xpath,css_selector}] [--wait-value WAIT_VALUE]
                   [--save-file SAVE_FILE] [--model-name MODEL_NAME] [--simplify-html] [--use-sandbox]

GPT-Scraper CLI

options:
  -h, --help            show this help message and exit
  --requirements REQUIREMENTS
                        Scraping requirements
  --scraper-file SCRAPER_FILE
                        Path to the scraper file to load
  --url URL             URL of the webpage to scrape
  --output OUTPUT       Output file path to save scraped data as JSON
  --wait-by {id,xpath,css_selector}
                        Type of locator to wait for
  --wait-value WAIT_VALUE
                        Value of the locator to wait for
  --save-file SAVE_FILE
                        Path to save the created GPTScraper to file
  --model-name MODEL_NAME
                        Name of the model to use for scraping
  --simplify-html       Simplify the HTML content before parsing
  --use-sandbox         Use the sandboxed environment for parsing
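
For pages that render content with JavaScript, the --wait-by/--wait-value pair tells the Selenium fetcher to wait for a specific element before capturing the page source. The URL and selector below are placeholders:

$ gpt-scraper --url https://example.com/news \
    --requirements 'extract article titles and links' \
    --wait-by css_selector --wait-value '.article' \
    --output articles.json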

Sample session

$ gpt-scraper --url https://news.ycombinator.com/ --requirements 'extract threads list from the web page (extract link and title)' --save-file hn.py --model-name o1-mini
2024-10-29 05:23:25,989 [INFO] Fetching page content from URL: https://news.ycombinator.com/
2024-10-29 05:23:25,989 [INFO] Attempt 1 to fetch URL: https://news.ycombinator.com/
2024-10-29 05:23:27,915 [INFO] Successfully fetched page source for URL: https://news.ycombinator.com/
2024-10-29 05:23:27,977 [INFO] Generating parser using GPTScraper.
2024-10-29 05:23:34,517 [INFO] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-10-29 05:23:34,605 [INFO] Saving scraper to file: hn.json
2024-10-29 05:23:34,605 [INFO] Scraper saved successfully.
2024-10-29 05:23:34,605 [INFO] Parsing HTML content.
2024-10-29 05:23:34,638 [INFO] Printing scraped data:
[
    {
        "title": "Excel Turing Machine (2013)",
        "link": "https://www.felienne.com/archives/2974"
    },
    {
        "title": "High-resolution postmortem human brain MRI at 7 tesla",
        "link": "https://pulkit-khandelwal.github.io/exvivo-brain-upenn/"
    },
    {
        "title": "How Gothic architecture became spooky",
        "link": "https://www.architecturaldigest.com/story/how-gothic-architecture-became-spooky"
    },
    {
        "title": "Using reinforcement learning and $4.80 of GPU time to find the best HN post",
        "link": "https://openpipe.ai/blog/hacker-news-rlhf-part-1"
    }
]
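
The saved scraper (hn.json, per the log above) can then be reloaded with --scraper-file, reusing the generated parser instead of calling the model again; hn_data.json is an arbitrary output path:

$ gpt-scraper --scraper-file hn.json --url https://news.ycombinator.com/ --output hn_data.json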

Example

from gpt_scraper import GPTScraper
from gpt_scraper.selenium_utils import fetch_dynamic_page
from pydantic import BaseModel

# Define the structure of each scraped record.
class Data(BaseModel):
    title: str
    url: str

# Fetch the rendered page source with Selenium.
page_source = fetch_dynamic_page("https://news.ycombinator.com/")

# Generate a parser for this page from the natural-language requirements.
scraper = GPTScraper.from_html(
    page_source,
    "extract threads list from the web page (extract link and title)",
    data_structure=Data,
    model_name="o1-mini"
)

# Run the generated parser in the sandboxed environment.
data = scraper.parse_html(page_source, use_sandbox=True)
print(data)
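
Because GPTScraper.from_html only needs an HTML string, a static page can be fetched with any HTTP client instead of Selenium. A minimal sketch using the requests library; the target URL and the Quote model are illustrative, not part of gpt-scraper:

import requests
from pydantic import BaseModel
from gpt_scraper import GPTScraper

class Quote(BaseModel):
    text: str
    author: str

# A plain HTTP fetch is enough when the page does not rely on JavaScript.
html = requests.get("https://quotes.toscrape.com/").text
scraper = GPTScraper.from_html(
    html,
    "extract every quote together with its author",
    data_structure=Quote,
    model_name="o1-mini"
)
print(scraper.parse_html(html, use_sandbox=True))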

Disclaimer

This application assists users in generating code with AI. While a sandbox environment with limited system access is provided for added security, we cannot guarantee complete protection. We strongly recommend executing all generated code within the provided sandbox environment to help minimize potential risks. However, users should not rely on the sandbox as an absolute security measure.

The development team is not liable for any consequences resulting from the generated code, including system damage, data loss, or other losses incurred. By using this application, you acknowledge and accept all risks associated with the generated code and assume full responsibility for any potential impact on your system.
