A CLI tool and a Rust library for crawling GitBook sites, downloading their pages, and converting them to Markdown and plain text.
- 🕷️ Automatic Crawling: Automatically discovers all pages of a GitBook
- ✅ GitBook Verification: Detects whether a site is actually a GitBook before crawling
- All-in-One Mode: Crawl and download in a single command
- Improved CLI Interface: Clear subcommands built with clap
```bash
cargo install gitbook2text
```

Add this to your Cargo.toml:
```toml
[dependencies]
gitbook2text = "0.3"
```

Crawls and downloads all pages in a single command:
```bash
gitbook2text all https://docs.example.com
```

Generates the links.txt file with all found links:
```bash
gitbook2text crawl https://docs.example.com

# With a custom output file
gitbook2text crawl https://docs.example.com -o my-links.txt
```

Downloads pages from an existing links file:
```bash
gitbook2text download

# With a custom file
gitbook2text download -i my-links.txt
```

Without a subcommand, downloads from links.txt:
```bash
gitbook2text
```

Files are saved in:

- `data/md/` - Original Markdown files
- `data/txt/` - Cleaned text files
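As one way to consume this output, the cleaned text files can be merged into a single corpus file. This is only a sketch: the two `printf` lines create stand-in files, whereas after a real run `data/txt/` is already populated by gitbook2text.

```shell
# Sketch: concatenate all cleaned text files into one corpus file.
# The printf lines below are stand-ins for real downloaded pages.
mkdir -p data/txt
printf 'page one\n' > data/txt/intro.txt
printf 'page two\n' > data/txt/usage.txt

# Merge every cleaned page into a single file
cat data/txt/*.txt > corpus.txt
wc -c corpus.txt
```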
```rust
use gitbook2text::{is_gitbook, extract_gitbook_links, crawl_and_save};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let url = "https://docs.example.com";

    // Check if it's a GitBook
    if is_gitbook(url).await? {
        println!("It's a GitBook!");

        // Extract all links
        let links = extract_gitbook_links(url).await?;
        println!("Found {} pages", links.len());

        // Or save them directly to a file
        crawl_and_save(url, "links.txt").await?;
    }
    Ok(())
}
```

Downloading and converting a single page:

```rust
use gitbook2text::{download_page, markdown_to_text, txt_sanitize};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let url = "https://docs.example.com/page.md";

    // Download the page
    let content = download_page(url).await?;

    // Convert to text
    let text = markdown_to_text(&content);

    // Clean the text
    let cleaned = txt_sanitize(&text);
    println!("{}", cleaned);
    Ok(())
}
```

- ✅ Smart crawling: Automatically discovers all pages of a documentation site
- ✅ GitBook verification: Detects GitBook sites via their specific markers
- ✅ Concurrent downloading: Processes multiple pages simultaneously
- ✅ Markdown-to-text conversion: Clean content extraction
- ✅ Advanced cleaning: Removes GitBook-specific tags
- ✅ Code block support: Preserves titles and content
- ✅ Normalization: Uniform spaces and characters
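As an illustration of the kind of normalization involved, the sketch below collapses whitespace runs and straightens typographic characters. This is not the crate's actual `txt_sanitize` implementation, just a minimal example of the technique.

```rust
// Illustrative sketch only: replaces typographic characters and
// collapses runs of whitespace, similar in spirit to what a text
// sanitizer for crawled pages might do.
fn normalize(input: &str) -> String {
    let replaced: String = input
        .chars()
        .map(|c| match c {
            '\u{00A0}' => ' ',               // non-breaking space -> regular space
            '\u{2018}' | '\u{2019}' => '\'', // curly single quotes -> apostrophe
            '\u{201C}' | '\u{201D}' => '"',  // curly double quotes -> straight quotes
            other => other,
        })
        .collect();
    // Collapse any run of whitespace into a single space
    replaced.split_whitespace().collect::<Vec<_>>().join(" ")
}

fn main() {
    let raw = "Hello\u{00A0}\u{00A0}“world”,   it’s\n\n fine";
    assert_eq!(normalize(raw), r#"Hello "world", it's fine"#);
    println!("{}", normalize(raw));
}
```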
- Archive a complete documentation site
- Index content for a search engine
- Prepare data for model training
- Analyze the structure of a documentation site
- Create documentation backups
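For the search-engine use case, a toy inverted index over the cleaned text could look like the sketch below. The page names and contents are made-up stand-ins for files produced under `data/txt/`; this is not part of the crate's API.

```rust
use std::collections::{HashMap, HashSet};

// Toy inverted index: maps each lowercased word to the set of pages
// that contain it. Pages are (name, cleaned text) pairs.
fn build_index<'a>(pages: &[(&'a str, &'a str)]) -> HashMap<String, HashSet<&'a str>> {
    let mut index: HashMap<String, HashSet<&str>> = HashMap::new();
    for (name, text) in pages {
        for word in text.split_whitespace() {
            index.entry(word.to_lowercase()).or_default().insert(*name);
        }
    }
    index
}

fn main() {
    // Stand-in pages; in practice these would be read from data/txt/
    let pages = [
        ("intro.txt", "Welcome to the docs"),
        ("api.txt", "The API docs live here"),
    ];
    let index = build_index(&pages);
    // "docs" appears on both pages
    assert_eq!(index["docs"].len(), 2);
    println!("pages containing 'docs': {:?}", index["docs"]);
}
```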
```bash
# All in one
gitbook2text all https://docs.mydomain.com

# Or step by step
gitbook2text crawl https://docs.mydomain.com
gitbook2text download
```

```bash
#!/bin/bash
# backup-docs.sh
GITBOOK_URL="https://docs.example.com"
BACKUP_DIR="backups/$(date +%Y-%m-%d)"

mkdir -p "$BACKUP_DIR"
cd "$BACKUP_DIR" || exit 1

gitbook2text all "$GITBOOK_URL"
echo "Backup completed in $BACKUP_DIR"
```

For the full API documentation, visit [docs.rs/gitbook2text](https://docs.rs/gitbook2text).
Contributions are welcome! Feel free to open an issue or a pull request.
See CHANGELOG.md for the version history.
This project is dual-licensed under either the MIT or Apache-2.0 license, at your option.
- MIT License (LICENSE-MIT or http://opensource.org/licenses/MIT)
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)