An open-source web scraping tool that converts websites into markdown format. Featuring customizable options, LLM-based filtering, and an easy-to-use API for efficient content extraction and analysis.
This project is a web scraping tool inspired by markdowner by @dhravya. While the original project served as a foundation, this version has been rebuilt to address deployment issues I was unable to resolve in the original.
Special thanks to @dhravya for open-sourcing the original markdowner, which provided valuable insights and inspiration for this tool.
SnapScrape offers a versatile set of capabilities for web content extraction and processing:
- Website to Markdown Conversion
  - Transform any web page into clean, formatted Markdown
  - Preserve essential content structure and formatting
- LLM-Powered Content Filtering
  - Utilize Large Language Models to refine and curate scraped content
  - Enhance the relevance and quality of extracted information
- Flexible Output Formats
  - Text: Receive scraped content as plain text for easy integration
  - JSON: Get structured data output for programmatic use
  - Markdown: Obtain content in Markdown format for documentation or content management systems
These features make SnapScrape suitable for a wide range of applications, from content aggregation to data analysis and documentation automation.
- Rate Limiting: Prevents overwhelming the target server by limiting the number of requests per interval.
- Fingerprinting Rotation: Randomizes browser fingerprints to mimic different users and avoid detection.
- Random Delays: Adds random delays between requests to simulate human behavior and avoid detection.
- User Agent Rotation: Randomizes browser user agents to mimic different users and avoid detection.
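To make these anti-detection measures concrete, here is a minimal TypeScript sketch of what user agent rotation and random delays can look like. It is illustrative only and is not SnapScrape's actual implementation; the helper names are hypothetical.

```typescript
// Illustrative sketch only - not SnapScrape's actual implementation.
// Shows the general idea behind user agent rotation and random delays.
const USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
];

// Pick a different browser user agent for each request.
function randomUserAgent(): string {
  return USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
}

// Wait a random 500-2000 ms between requests to simulate human behavior.
function randomDelay(minMs = 500, maxMs = 2000): Promise<void> {
  const ms = minMs + Math.random() * (maxMs - minMs);
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// A polite fetch wrapper combining both techniques.
async function politeFetch(url: string): Promise<Response> {
  await randomDelay();
  return fetch(url, { headers: { "User-Agent": randomUserAgent() } });
}
```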
Make a GET request to https://snapscrape.jyothepro.com.
To convert a website to Markdown, use the following curl command:
curl 'https://snapscrape.jyothepro.com/?url=https://example.com' --output result.md
Query parameters:
- `url` (string): The website URL to convert into Markdown.
- `llmFilter` (boolean, default: `false`): When set to `true`, uses an LLM to filter out unnecessary information.
SnapScrape supports multiple response formats. Specify the desired format using the `Accept` header:
- Plain Text:
  curl 'https://snapscrape.jyothepro.com/?url=https://example.com' \
    -H 'Accept: text/plain' \
    --output result.txt
- JSON:
  curl 'https://snapscrape.jyothepro.com/?url=https://example.com' \
    -H 'Accept: application/json' \
    --output result.json
- Convert a website to Markdown with LLM filtering:
  curl 'https://snapscrape.jyothepro.com/?url=https://example.com&llmFilter=true' \
    --output filtered_result.md
- Get a JSON response:
  curl 'https://snapscrape.jyothepro.com/?url=https://example.com' \
    -H 'Accept: application/json' \
    --output result.json
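If you prefer calling the API from code rather than curl, the same requests can be made with fetch. The following TypeScript sketch assumes only the endpoint, query parameters, and Accept headers documented above:

```typescript
// Minimal sketch of calling SnapScrape from TypeScript (Node 18+ or a browser).
// Assumes only the endpoint, query parameters, and Accept headers documented above.
async function scrapeToMarkdown(target: string, llmFilter = false): Promise<string> {
  const endpoint = new URL("https://snapscrape.jyothepro.com/");
  endpoint.searchParams.set("url", target);
  if (llmFilter) endpoint.searchParams.set("llmFilter", "true");

  // No Accept header: the default response is Markdown (see the curl examples above);
  // set Accept to text/plain or application/json for the other formats.
  const res = await fetch(endpoint);
  if (!res.ok) throw new Error(`SnapScrape request failed with status ${res.status}`);
  return res.text();
}

scrapeToMarkdown("https://example.com", true).then(console.log);
```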
SnapScrape leverages cutting-edge Cloudflare technologies to provide efficient and scalable web scraping capabilities:
- Cloudflare Browser Rendering: This service allows us to spin up headless browser instances in the cloud. It enables SnapScrape to render JavaScript-heavy pages and capture dynamic content that traditional scraping methods might miss.
- Cloudflare Durable Objects: Durable Objects provide a consistent and low-latency environment for our scraping operations. They allow us to maintain stateful interactions with web pages and manage concurrent scraping tasks efficiently.
- Turndown: After capturing the rendered HTML content, we use Turndown to convert it into clean, readable Markdown format. This step ensures that the scraped content is easily consumable and can be integrated into various documentation systems or content management platforms.
This combination of technologies enables SnapScrape to handle complex web pages, maintain high performance, and produce high-quality Markdown output suitable for a wide range of applications.
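For readers who want to see how these pieces fit together, here is a simplified Worker sketch combining Browser Rendering (via @cloudflare/puppeteer) with Turndown. It is a rough illustration under stated assumptions, not SnapScrape's actual source: the MYBROWSER binding name and handler shape are assumed, and caching, rate limiting, and LLM filtering are omitted.

```typescript
// Simplified illustration, not SnapScrape's actual source.
// Assumes a Browser Rendering binding named MYBROWSER in wrangler.toml.
import puppeteer from "@cloudflare/puppeteer";
import TurndownService from "turndown";

interface Env {
  MYBROWSER: Fetcher; // Browser Rendering binding (assumed name)
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const target = new URL(request.url).searchParams.get("url");
    if (!target) return new Response("Missing ?url parameter", { status: 400 });

    // Render the page in a headless browser so JavaScript-driven content is captured.
    const browser = await puppeteer.launch(env.MYBROWSER);
    const page = await browser.newPage();
    await page.goto(target, { waitUntil: "networkidle0" });
    const html = await page.content();
    await browser.close();

    // Convert the rendered HTML into clean Markdown.
    const markdown = new TurndownService().turndown(html);
    return new Response(markdown, { headers: { "Content-Type": "text/markdown" } });
  },
};
```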
- You cannot deploy SnapScrape on your local machine, as it needs Browser Rendering and Durable Objects from Cloudflare.
You can easily host this project on Cloudflare. To use Browser Rendering and Durable Objects, you need the Workers Paid plan.
- Clone the repo, move into it, and install dependencies:
git clone https://github.com/jyothepro/snapscrape
cd snapscrape
npm install
- Run this command:
npx wrangler kv namespace create md_cache
- Open wrangler.toml and update the kv_namespaces section with the output from the previous command
- Run
npm run deploy
SnapScrape is a web scraping tool that requires Cloudflare's Browser Rendering and Durable Objects features. This guide walks you through deploying SnapScrape to Cloudflare's servers.
Before you begin, ensure you have the following:
- Cloudflare Account: Sign up at Cloudflare.com if you don't have an account.
- Workers Paid Plan: SnapScrape requires Cloudflare's Workers Paid Plan for Browser Rendering and Durable Objects.
- Node.js and npm: Install from nodejs.org.
- Git: Install from git-scm.com.
- Clone the repository: git clone https://github.com/jyothepro/snapscrape
- Move into the project directory: cd snapscrape
- Run the following command to install the necessary packages: npm install
Wrangler is Cloudflare's command-line tool for managing and deploying Workers projects.
- Install Wrangler globally: npm install -g wrangler
- Verify the installation: wrangler --version
- Authenticate Wrangler with your Cloudflare account: wrangler login
  Follow the prompts to log in via your web browser.
- Create a KV namespace for caching: npx wrangler kv:namespace create md_cache
  Copy the output, which looks like:
  [[kv_namespaces]]
  binding = "md_cache"
  id = "6895f05a78334c86abe1a91a86133642"
- Open the wrangler.toml file in a text editor.
- Find the kv_namespaces section and replace it with the text you copied in step 1.
Run the deployment command: npm run deploy
- After successful deployment, Cloudflare will provide a URL where your app is live.
- Visit this URL in your web browser to ensure it's working correctly.
If you encounter any issues during deployment:
- Double-check that you've signed up for the Cloudflare Workers Paid Plan.
- Ensure you've correctly copied the KV namespace information into wrangler.toml.
- Verify that you're logged in to Wrangler with the correct Cloudflare account.
- Check that all prerequisites are installed correctly.
- For Wrangler-specific issues:
  - Ensure npm's global bin directory is in your system's PATH.
  - On Windows, try running the command prompt as an administrator.
  - For permission issues on Unix-based systems, try using sudo with the installation command.
For additional assistance, consult the Cloudflare Workers documentation.
SnapScrape cannot be deployed on a local machine as it requires Cloudflare's Browser Rendering and Durable Objects features. Always deploy directly to Cloudflare using the steps outlined above.