Tools to scrape and extract the "Thinking" process data from ChatGPT threads and projects.
- Python 3.13+
- Chrome/Chromium browser (for remote debugging)
- Clone the repository:

  ```shell
  git clone <repository_url>
  cd gpt-thinking-extractor
  ```

- Create and activate a virtual environment (optional but recommended):

  ```shell
  python -m venv .venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  ```

- Install the package:

  ```shell
  pip install .
  ```

- Install Playwright browsers:

  ```shell
  playwright install chromium
  ```
The scraper connects to an existing Chrome instance via the Chrome DevTools Protocol (CDP). You must launch Chrome with remote debugging enabled and ensure you are logged in to ChatGPT.
Windows:

```shell
& "C:\Program Files\Google\Chrome\Application\chrome.exe" --remote-debugging-port=9222
```

Linux / macOS:

```shell
google-chrome --remote-debugging-port=9222
```

Windows Subsystem for Linux (WSL): If you are running the scraper inside WSL, launch Chrome on the Windows host (using the command above). The scraper in WSL must then connect to the Windows host's IP:

- Identify your Windows host IP:

  ```shell
  grep nameserver /etc/resolv.conf
  ```

- In the GUI configuration, update the CDP URL to `http://<WINDOWS_IP>:9222`.
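As a rough Python sketch of what attaching over CDP looks like with Playwright's sync API (this is not the project's actual code; it assumes Chrome is already running with the debugging flag above, and the `active_page_title` helper is purely illustrative):

```python
def active_page_title(cdp_url: str = "http://localhost:9222") -> str:
    """Attach to an already-running Chrome over CDP and read a tab title.

    From WSL, pass http://<WINDOWS_IP>:9222 instead of localhost.
    """
    # Imported lazily so the sketch can be loaded even without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp(cdp_url)
        context = browser.contexts[0]  # reuse the existing profile, so the ChatGPT login carries over
        page = context.pages[0]        # first open tab
        return page.title()

if __name__ == "__main__":
    print(active_page_title())
```

Because the scraper attaches to your own running browser rather than launching a fresh one, it sees the same cookies and session as your logged-in ChatGPT tab.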
After installation, two commands are available in your shell:

GUI Version (Recommended):

```shell
gpt-scrape-gui
```

Provides a graphical interface to monitor progress and stop the scraper easily.

CLI Version:

```shell
gpt-scrape
```

Runs the scraping process in the terminal.
- Selectors: The scraper uses a `selectors.json` file (located in the package) to find elements on the page. You can modify it if ChatGPT's UI changes.
- Persistence: Scraped URLs are tracked in a local SQLite database (`scraped_urls.db`) to prevent re-scraping the same threads after a restart.
- Output: By default, data is saved to the `data/` folder in your current working directory.
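The persistence layer can be imagined roughly like this (a minimal sketch only; the actual schema and helper names in `scraper_engine.py` may differ and are assumptions here):

```python
import sqlite3

def open_db(path: str = "scraped_urls.db") -> sqlite3.Connection:
    """Open (or create) the tracking database. Table name is illustrative."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS scraped_urls (url TEXT PRIMARY KEY)")
    return conn

def already_scraped(conn: sqlite3.Connection, url: str) -> bool:
    """Check whether a thread URL has been processed before."""
    row = conn.execute("SELECT 1 FROM scraped_urls WHERE url = ?", (url,)).fetchone()
    return row is not None

def mark_scraped(conn: sqlite3.Connection, url: str) -> None:
    # INSERT OR IGNORE makes re-marking the same URL a harmless no-op
    conn.execute("INSERT OR IGNORE INTO scraped_urls (url) VALUES (?)", (url,))
    conn.commit()
```

A `PRIMARY KEY` on the URL column is what makes the dedup check cheap: lookups and inserts both hit the index rather than scanning the table.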
If you want to modify the code or run it without installing:

- Install dev dependencies:

  ```shell
  pip install -e ".[dev]"
  ```

- Run tests:

  ```shell
  pytest tests/
  ```

- Run scripts directly:

  ```shell
  # Make sure PYTHONPATH includes src
  export PYTHONPATH=$PYTHONPATH:$(pwd)/src
  python src/gpt_thinking_extractor/scraper_gui.py
  ```
- GUI Support: To use `gpt-scrape-gui` inside WSL, ensure you have an X server (such as GWSL or VcXsrv) installed on Windows, or use WSLg (Windows 11).
- Networking: By default, `localhost:9222` inside WSL refers to the WSL instance itself. Use the Windows host IP if Chrome is running on Windows.
- Permissions: Ensure the output directory has write permissions.
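Before starting the scraper from WSL, you can verify that Chrome's debugging port is reachable. The snippet below is a stdlib-only check against CDP's `/json/version` endpoint (a standard part of the DevTools protocol); the host value is a placeholder you substitute yourself:

```python
import json
import urllib.request

def cdp_version_url(host: str, port: int = 9222) -> str:
    """Build the CDP metadata URL for a given host and port."""
    return f"http://{host}:{port}/json/version"

def check_cdp(host: str, port: int = 9222) -> str:
    """Return the browser version string reported by CDP, or raise on failure."""
    with urllib.request.urlopen(cdp_version_url(host, port), timeout=5) as resp:
        return json.load(resp)["Browser"]

if __name__ == "__main__":
    # Substitute the nameserver IP from /etc/resolv.conf when running inside WSL
    print(check_cdp("localhost"))
```

If this raises a connection error, Chrome is either not running with `--remote-debugging-port=9222` or you are pointing at the wrong host IP.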
- `src/gpt_thinking_extractor/`: Source code package.
  - `scraper_engine.py`: Core logic for scraping, persistence, and file I/O.
  - `scraper_gui.py`: Tkinter-based graphical interface.
  - `scrape_thoughts_final.py`: Standalone CLI entry point.
  - `selectors.json`: Externalized CSS selectors configuration.
- `tests/`: Unit tests.
- `data/`: Default output directory for extracted thoughts.
- `scraped_urls.db`: SQLite database tracking processed URLs.