GPT Browser Scraper is a powerful automation tool that loads webpages, converts their content into clean markdown, and applies intelligent GPT instructions to transform, summarize, or analyze the extracted text. It streamlines the process of browsing, processing, and interpreting web content using AI. This tool is ideal for anyone who needs fast, repeatable, and scalable page analysis powered by GPT.
Created by Bitbash, built to showcase our approach to Scraping and Automation!
If you are looking for GPT Browser, you've just found your team. Let's chat.
GPT Browser Scraper automates webpage loading, extracts readable content, and processes it using OpenAI's GPT models. It solves the challenge of manually collecting and interpreting large amounts of web data by enabling automated, prompt-driven content transformation. It is designed for developers, analysts, content researchers, and anyone who needs structured insights from websites at scale.
- Loads webpages using Playwright and waits for visible content.
- Removes hidden HTML, unnecessary attributes, and redundant markup.
- Converts final cleaned content into markdown for efficient GPT processing.
- Applies user-defined prompt instructions to produce targeted outputs.
- Supports fast mode for high-speed extraction when screenshots or full render aren't required.
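The cleaning and conversion steps in the list above can be illustrated with a toy sketch. This is not the project's `markdown-cleaner.js`, whose implementation is not shown here; the function names `stripHidden` and `htmlToMarkdown` are hypothetical, and the regex-based approach is an assumption chosen to keep the example dependency-free. A real cleaner would more likely work on the DOM, since regular expressions cannot handle nested or malformed HTML reliably.

```javascript
// Toy cleaner for illustration only: regex-based, no nesting support,
// no HTML entity decoding. Names are hypothetical, not the project's API.
function stripHidden(html) {
  return html
    // Drop script/style/noscript blocks entirely.
    .replace(/<(script|style|noscript)\b[^>]*>[\s\S]*?<\/\1>/gi, "")
    // Drop non-nested elements carrying a `hidden` attribute or display:none.
    .replace(/<(\w+)\b[^>]*(?:\shidden\b|display\s*:\s*none)[^>]*>[\s\S]*?<\/\1>/gi, "");
}

function htmlToMarkdown(html) {
  return stripHidden(html)
    // <h1>..<h6> become ATX headings surrounded by blank lines.
    .replace(/<h([1-6])\b[^>]*>([\s\S]*?)<\/h\1>/gi,
      (_, level, text) => `\n\n${"#".repeat(Number(level))} ${text.trim()}\n\n`)
    // List items become "- " bullets.
    .replace(/<li\b[^>]*>([\s\S]*?)<\/li>/gi, (_, text) => `\n- ${text.trim()}`)
    // Paragraph ends become blank lines.
    .replace(/<\/p>/gi, "\n\n")
    // Remove every remaining tag (and with it, all attributes).
    .replace(/<[^>]+>/g, "")
    // Collapse runs of spaces/tabs and excess blank lines.
    .replace(/[ \t]+/g, " ")
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}
```

Inline formatting such as bold is dropped here and only the text is kept, which is one way to reduce token usage at the cost of some fidelity.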
| Feature | Description |
|---|---|
| Markdown Conversion | Converts webpage content into clean markdown for optimal GPT input. |
| Prompt-Driven Output | Allows users to define GPT instructions for customized analysis or transformation. |
| Hidden Content Filtering | Removes unnecessary HTML to reduce token usage and cost. |
| Two Speed Modes | Choose between accurate (with screenshots) and extremely fast (no rendering) modes. |
| Batch URL Processing | Processes long lists of URLs or files at scale. |
| Cost Control | Uses user-provided API keys to keep usage transparent and predictable. |
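As an illustration of the batch URL processing row, here is a minimal sketch of preparing a URL list before it is handed to the scraper. The helper name `parseUrlList` is hypothetical, and the one-URL-per-line input format is an assumption; the actual layout of `urls.sample.csv` is not described here.

```javascript
// Hypothetical helper: turn raw text (one URL per line, assumed) into a
// deduplicated list of absolute http(s) URLs, in input order.
function parseUrlList(text) {
  const seen = new Set();
  const urls = [];
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "") continue; // skip blank lines
    let parsed;
    try {
      parsed = new URL(line); // throws on anything that is not an absolute URL
    } catch {
      continue; // for example, a CSV header row such as "url"
    }
    if (parsed.protocol !== "http:" && parsed.protocol !== "https:") continue;
    if (seen.has(parsed.href)) continue; // href is the normalized form
    seen.add(parsed.href);
    urls.push(parsed.href);
  }
  return urls;
}
```

Deduplicating on the normalized `href` means `https://example.com` and `https://example.com/` count as the same page, which avoids paying for the same GPT call twice.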
| Field Name | Field Description |
|---|---|
| url | The webpage URL being processed. |
| markdownContent | Cleaned and converted markdown extracted from the page. |
| gptResponse | The GPT-generated output based on the user prompt. |
| screenshotPath | File path to the screenshot if full render mode is enabled. |
| metadata | Information such as status, load time, and truncation status. |
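To make the table above concrete, the following sketch assembles one output record. The five top-level field names come from the table; the helper name `buildRecord`, the keys inside `metadata` (`status`, `loadTimeMs`, `truncated`), and the use of `null` for a missing screenshot are assumptions for illustration, since the exact shape of `sample-output.json` is not shown here.

```javascript
// Hypothetical helper: assemble one output record using the documented
// top-level fields. The keys inside `metadata` are assumed names.
function buildRecord({
  url,
  markdownContent,
  gptResponse,
  screenshotPath = null, // assumed to be null when full render mode is off
  status,
  loadTimeMs,
  truncated = false,
}) {
  return {
    url,
    markdownContent,
    gptResponse,
    screenshotPath,
    metadata: { status, loadTimeMs, truncated },
  };
}
```

A fast-mode record built this way would carry `screenshotPath: null`, while a full-render record would carry the path of the saved screenshot.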
GPT Browser/
├── src/
│   ├── main.js
│   ├── browser/
│   │   ├── playwright-loader.js
│   │   └── markdown-cleaner.js
│   ├── gpt/
│   │   └── gpt-runner.js
│   ├── utils/
│   │   ├── logger.js
│   │   └── truncation-handler.js
│   └── config/
│       └── settings.example.json
├── data/
│   ├── urls.sample.csv
│   └── sample-output.json
├── package.json
└── README.md
- Researchers use it to summarize long articles so they can understand content quickly without manual reading.
- SEO analysts use it to extract keywords and competitor insights, enabling data-backed optimization strategies.
- Developers use it to analyze code snippets on documentation sites and automatically detect errors or improvements.
- Marketing teams use it to scan landing pages and collect key messaging insights for competitive analysis.
- QA engineers use it to detect typos, inconsistencies, or broken UI elements across multiple pages automatically.
Does the browser truncate content if the page is too long? Yes. If the extracted markdown exceeds model limits, the scraper trims the content while retaining the most relevant sections.
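One possible way to trim markdown is sketched below. How the scraper decides which sections are "most relevant" is not documented here, so this example substitutes a simpler rule: keep whole heading-delimited sections from the top of the page until a character budget is reached. The function names and the character budget (used as a rough stand-in for a token limit) are assumptions, not the behavior of `truncation-handler.js`.

```javascript
// Split markdown into sections, each starting at a heading line.
function splitSections(markdown) {
  const sections = [];
  let current = [];
  for (const line of markdown.split("\n")) {
    if (/^#{1,6}\s/.test(line) && current.length > 0) {
      sections.push(current.join("\n"));
      current = [];
    }
    current.push(line);
  }
  if (current.length > 0) sections.push(current.join("\n"));
  return sections;
}

// Hypothetical truncation rule: keep leading sections that fit in maxChars.
function truncateMarkdown(markdown, maxChars) {
  if (markdown.length <= maxChars) return { text: markdown, truncated: false };
  const kept = [];
  let used = 0;
  for (const section of splitSections(markdown)) {
    const cost = section.length + (kept.length > 0 ? 1 : 0); // +1 for the joining newline
    if (used + cost > maxChars) break;
    kept.push(section);
    used += cost;
  }
  // If even the first section is too long, fall back to a hard cut.
  const text = kept.length > 0 ? kept.join("\n") : markdown.slice(0, maxChars);
  return { text, truncated: true };
}
```

Returning a `truncated` flag alongside the text lets the caller record the truncation status in the output metadata.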
Can I use my own GPT prompt for each page? Absolutely. You can provide any instruction: summaries, extraction requests, analysis, transformations, or custom logic.
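As a rough sketch of how a per-page instruction might be combined with the extracted markdown, the example below builds a request body in the `model` plus `messages` shape used by OpenAI's Chat Completions API. The helper names `promptFor` and `buildChatRequest`, the per-URL lookup table, and the default model name are assumptions; how `gpt-runner.js` actually builds and sends its requests is not shown here.

```javascript
// Hypothetical helper: pick the instruction for a URL, else use the default.
function promptFor(url, promptsByUrl, defaultPrompt) {
  return Object.prototype.hasOwnProperty.call(promptsByUrl, url)
    ? promptsByUrl[url]
    : defaultPrompt;
}

// Hypothetical helper: build a chat request body. The instruction goes in
// the system message and the page markdown in the user message.
// The model name is a placeholder; use whichever model your API key allows.
function buildChatRequest(instruction, markdownContent, model = "gpt-4o-mini") {
  return {
    model,
    messages: [
      { role: "system", content: instruction },
      { role: "user", content: markdownContent },
    ],
  };
}
```

Keeping the instruction separate from the page content makes it easy to reuse one prompt across a batch while overriding it for specific URLs.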
What happens if a page has popups or requires loading time? In standard mode, the browser waits for visible content and interactions, ensuring that popups and dynamic elements are loaded before extraction.
How fast is the scraper? Two speed modes exist: a full-render mode for accuracy and a lightweight high-speed mode for rapid URL processing.
- Primary Metric: Processes ~250 pages/hour in render mode and up to ~10,000 pages/min in fast mode.
- Reliability Metric: Maintains a 97% successful page load and extraction rate in typical conditions.
- Efficiency Metric: Reduces GPT token usage by up to 40% due to markdown cleaning and HTML removal.
- Quality Metric: Achieves high content fidelity, with >90% of relevant text preserved after cleaning and formatting.
