Generate LLM-optimized documentation files automatically by crawling any website
This demo shows you how to use Scrapfly's Crawler API to automatically convert website documentation into the llms.txt format - a markdown-based standard designed to help Large Language Models (LLMs) better understand and answer questions about your content.
llms.txt is a markdown file format specifically designed for providing website content to AI language models like ChatGPT, Claude, and others. It was created to give LLMs structured, clean access to documentation and content.
- ✅ LLM-optimized: Structured format that LLMs can easily parse and understand
- ✅ Markdown-based: Clean, readable format without HTML clutter
- ✅ Standardized: Following the specification at llmstxt.org
- ✅ Comprehensive: Can include full documentation in a single file
- 📖 Official Specification: https://llmstxt.org
- 🔧 llms-txt/llms.txt on GitHub: https://github.com/llms-txt/llms.txt
- 📝 Examples: See real-world llms.txt files from major projects
Before you begin, make sure you have:
- Python 3.7+ installed
- Scrapfly API key - Get one free at scrapfly.io/dashboard
- Scrapfly Python SDK installed:
```shell
pip install scrapfly-sdk
```

- Clone this repository or download the example files:
```shell
git clone https://github.com/scrapfly/python-scrapfly.git
cd python-scrapfly/examples/demos/llm-txt-generator
```

- Set your API key as an environment variable:
```shell
export SCRAPFLY_API_KEY='your-api-key-here'
```

💡 Tip: On Windows, use `set` instead of `export`.
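Before running the demo, you can confirm the variable is visible to Python with a quick stdlib-only check (illustrative; not part of the demo script):

```python
import os

# The demo expects the key in this environment variable
key = os.environ.get("SCRAPFLY_API_KEY", "")
if key:
    print(f"Key found: {key[:8]}...")  # print only a prefix, never the full key
else:
    print("SCRAPFLY_API_KEY is not set")
```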
Simply run the script:
```shell
python generate_llm_txt.py
```

This will:
- Crawl the Scrapfly documentation at https://scrapfly.io/docs
- Extract markdown content from all pages
- Generate a `scrapfly_docs_llms-full.txt` file
Expected output:
```
======================================================================
LLM.txt Generator - Scrapfly Crawler API Demo
======================================================================

🔧 Initializing Scrapfly client...

📋 Crawler Configuration:
   • Target URL: https://scrapfly.io/docs
   • Path filter: /docs/*
   • Page limit: 50
   • Max depth: 5
   • Using sitemaps: Yes
   • Respecting robots.txt: Yes

🚀 Starting crawler...
📊 Crawling in progress...
[Progress updates...]

✅ Crawl completed!
   Pages crawled: 50
   Pages failed: 0
   Total discovered: 893

💾 Successfully saved to: scrapfly_docs_llms-full.txt
   File size: 330.4 KB
   Pages included: 50
   Total content: 340,315 characters

======================================================================
✅ LLM.txt generation complete!
======================================================================
```
The crawler is configured to:
- Start at a specific URL (e.g., `https://scrapfly.io/docs`)
- Restrict crawling to certain paths (e.g., `/docs/*` only)
- Respect robots.txt and use sitemaps
- Extract markdown content from each page
```python
crawler_config = CrawlerConfig(
    url="https://scrapfly.io/docs",
    include_only_paths=["/docs/*"],  # Only crawl /docs pages
    page_limit=50,                   # Limit for demo
    max_depth=5,                     # How deep to crawl
    use_sitemaps=True,               # Use sitemap.xml
    respect_robots_txt=True,         # Follow robots.txt
    content_formats=['markdown'],    # Extract as markdown
)
```

The crawler runs asynchronously on Scrapfly's infrastructure:
```python
crawl = Crawl(client, crawler_config).crawl()
crawl.wait(poll_interval=5, verbose=True)  # Wait for completion
```

Instead of making separate API calls for each page, we use the batch content API, which retrieves up to 100 URLs per request:
```python
# Get URLs from WARC artifact
warc_artifact = crawl.warc()
urls = [record.url for record in warc_artifact.iter_responses()]

# Fetch content in batches of 100
all_contents = {}
for i in range(0, len(urls), 100):
    batch = urls[i:i + 100]
    contents = crawl.read_batch(batch, formats=['markdown'])
    all_contents.update(contents)
```

Why batch retrieval?
- ⚡ Much faster: 1 API call for 100 URLs vs 100 separate calls
- 💰 More efficient: Reduced API overhead
- 🎯 Optimized: Multipart/related response format
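The batching loop above can be factored into a small helper; a stdlib-only sketch (names are illustrative, not part of the SDK):

```python
def chunked(items, size=100):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# 250 URLs become 3 batches: 100 + 100 + 50
urls = [f"https://example.com/page{n}" for n in range(250)]
batches = list(chunked(urls))
print(len(batches), len(batches[-1]))  # 3 50
```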
The file follows the llmstxt.org specification:
```python
# Required: H1 heading with project name
llm_lines.append("# Scrapfly Documentation")
llm_lines.append("")

# Optional: Blockquote summary
llm_lines.append("> Comprehensive guides and API references...")
llm_lines.append("")

# Optional: About section
llm_lines.append("## About")
llm_lines.append("This document contains...")

# Main content: All pages
llm_lines.append("## Content")
for url, content in all_contents.items():
    llm_lines.append(f"### {url}")
    llm_lines.append(content['markdown'])
```

Convert your entire documentation into an LLM-friendly format:
```python
generate_llm_txt(
    base_url="https://docs.yourproject.com",
    site_name="YourProject Documentation",
    description="Complete API and usage documentation",
    path_filter="/docs/*",
)
```

Create an LLM-optimized archive of your blog:
```python
generate_llm_txt(
    base_url="https://yourblog.com/posts",
    site_name="YourBlog Articles",
    description="Collection of technical blog posts and tutorials",
    path_filter="/posts/*",
    page_limit=100,
)
```

Archive support articles or knowledge base content:
```python
generate_llm_txt(
    base_url="https://support.yourcompany.com",
    site_name="Support Knowledge Base",
    description="Help articles and troubleshooting guides",
    path_filter="/kb/*",
)
```

Crawl more/fewer pages:
```python
page_limit=100,  # Crawl up to 100 pages (None = unlimited)
```

Change crawl depth:

```python
max_depth=3,  # Stay within 3 clicks of the starting URL
```

Include external links:

```python
follow_external_links=True,  # Follow links to other domains
```

Exclude certain paths:

```python
exclude_paths=['/api/*', '/admin/*'],  # Don't crawl these
```

Modify the `generate_llm_txt()` function to customize:
- File structure: Add more sections, change headings
- Content filtering: Skip certain pages or content
- Metadata: Add timestamps, versions, etc.
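For instance, a hypothetical tweak that stamps the file with generation metadata (the section name and `llm_lines` usage here are illustrative):

```python
from datetime import datetime, timezone

llm_lines = []  # in the real script this list already holds the document body
llm_lines.append("## Metadata")
llm_lines.append(f"- Generated: {datetime.now(timezone.utc).strftime('%Y-%m-%d %H:%M UTC')}")
llm_lines.append("- Generator: generate_llm_txt.py")
print("\n".join(llm_lines))
```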
Example - add a table of contents:
```python
# Add after the About section
llm_lines.append("## Table of Contents")
llm_lines.append("")
for url in all_contents.keys():
    title = extract_title(url)  # Your custom function
    llm_lines.append(f"- [{title}]({url})")
llm_lines.append("")
```

The generated `llms-full.txt` file follows this structure:
```markdown
# Project Name

> Short description of the content

## About
Context and information about this document

## Content

---
### https://example.com/page1
[Markdown content of page 1]

---
### https://example.com/page2
[Markdown content of page 2]

---

## Metadata
- Total pages: 50
- Source: https://example.com
- Format: llms.txt (https://llmstxt.org)
```

Once generated, you can use the file with:
ChatGPT:
- Upload the file to ChatGPT
- Ask questions like: "Based on this documentation, how do I...?"
Claude:
- Paste the content or upload the file
- Query: "Using this documentation, explain..."
API-based:
```python
import openai

# Use as context in API calls
with open('llms-full.txt') as f:
    context = f.read()

response = openai.ChatCompletion.create(
    model="gpt-4o",  # any chat model
    messages=[
        {"role": "system", "content": f"Documentation:\n{context}"},
        {"role": "user", "content": "How do I configure the API?"}
    ]
)
```

- Path Filtering: `include_only_paths` restricts crawling to specific URL patterns
- Sitemap Integration: `use_sitemaps=True` discovers pages from sitemap.xml
- Robots.txt Compliance: `respect_robots_txt=True` follows site crawling guidelines
- Markdown Extraction: `content_formats=['markdown']` extracts clean markdown
- Batch Content API: `read_batch()` retrieves multiple URLs efficiently
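Since every page in the generated file begins with an `### <url>` heading, downstream tools can split it back into per-page chunks. A stdlib-only sketch, run on a miniature stand-in for the real file:

```python
import re

# Miniature stand-in for a generated llms-full.txt body
llms_text = """## Content
---
### https://example.com/page1
Markdown content of page 1
---
### https://example.com/page2
Markdown content of page 2
"""

# Each page starts with an H3 heading holding its source URL
pages = re.findall(r"^### (\S+)\n(.*?)(?=^### |\Z)", llms_text, re.M | re.S)
for url, body in pages:
    print(url, "->", len(body.split()), "words")
```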
The crawler stores results in WARC format (Web ARChive):
- Industry-standard format for web crawling
- Contains all HTTP responses and metadata
- Automatically compressed (gzip)
```python
# Access WARC artifact
warc = crawl.warc()

# Iterate through responses
for record in warc.iter_responses():
    print(f"{record.url}: {record.status_code}")
```

Efficient content retrieval via multipart/related responses:
```python
# Single API call for up to 100 URLs
contents = crawl.read_batch(
    urls=['https://example.com/page1', ...],
    formats=['markdown']  # Can also request 'html', 'text'
)
# Returns: {'https://example.com/page1': {'markdown': '...'}, ...}
```

Solution: Set your API key before running:

```shell
export SCRAPFLY_API_KEY='scp-live-xxxxxxxx'
```

Possible causes:
- Page limit too low (increase `page_limit`)
- Path filter too restrictive (check `include_only_paths`)
- Site blocks crawlers (try adding `asp=True` for anti-bot bypass)
Solution: Adjust crawler config:
```python
CrawlerConfig(
    page_limit=None,          # Remove limit
    include_only_paths=None,  # Crawl everything
)
```

Solution: Reduce scope:
```python
page_limit=20,  # Fewer pages
max_depth=2,    # Shallower crawl
```

Solution: Split into multiple files or reduce content:
```python
# Option 1: Lower page limit
page_limit=30,

# Option 2: Generate separate files per section
# (Modify the script to create multiple smaller files)
```

- Specification: https://llmstxt.org
- GitHub Repository: https://github.com/llms-txt/llms.txt
- llms-txt/llms_txt2ctx: CLI tool for parsing llms.txt files
- Examples: See llms.txt files from Next.js, Supabase, and other projects
- Documentation: https://scrapfly.io/docs
- API Reference: https://scrapfly.io/docs/crawler-api
- Python SDK: https://github.com/scrapfly/python-scrapfly
- Dashboard: https://scrapfly.io/dashboard (monitor crawls, view usage)
- Support: support@scrapfly.io
- Crawler Webhook Demo: Real-time notifications when crawls complete
- WARC Parser Example: Advanced WARC file processing
- Content Extraction: Using extraction rules with crawler
Test with a small `page_limit` first:

```python
page_limit=10,  # Test with 10 pages
```

Once confirmed working, increase or remove the limit.
Restrict crawling to relevant content:
```python
include_only_paths=['/docs/*', '/guides/*'],
```

This saves API credits and reduces irrelevant content.
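To sanity-check a pattern list locally before spending credits, you can approximate the matching with `fnmatch` (illustrative only; Scrapfly's server-side pattern semantics may differ):

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

include_only_paths = ["/docs/*", "/guides/*"]

def is_included(url: str) -> bool:
    # Compare only the URL path against the glob patterns
    path = urlparse(url).path
    return any(fnmatch(path, pattern) for pattern in include_only_paths)

print(is_included("https://scrapfly.io/docs/crawler-api"))  # True
print(is_included("https://scrapfly.io/blog/post"))         # False
```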
Check the dashboard at https://scrapfly.io/dashboard to:
- Monitor crawl progress in real-time
- View discovered vs crawled URLs
- Check for errors
- Review API credit usage
For large sites, be mindful of:
- API rate limits (check your plan)
- Site server load
- Crawl politeness (delay between requests)
Re-run the generator periodically to keep your llms.txt file updated:
```shell
# Weekly cron job example
0 0 * * 0 cd /path/to/script && python generate_llm_txt.py
```

Found a bug or want to improve this example? Contributions welcome!
- Fork the repository
- Create a feature branch
- Submit a pull request
This example is part of the Scrapfly Python SDK and follows the same license.
Need Help?
- 📧 Email: support@scrapfly.io
- 📖 Docs: https://scrapfly.io/docs
- 🐛 Issues: https://github.com/scrapfly/python-scrapfly/issues