A comprehensive system for processing any text content - including EPUB books, plain text, markdown, and other formats - into manageable chunks with structured summaries.
This system processes text content by:
- Extracting sections from various sources (EPUB, TXT, MD, etc.)
- Dividing text into 2000-word chunks with intelligent merging
- Facilitating creation of 140-160 word summaries per chunk
- Generating formatted output in HTML, Markdown, and plain text
- Universal text processing: Works with any text format - EPUB, TXT, MD, or pasted content
- Content-based extraction: Uses text markers for reliable section extraction
- Smart chunking: Automatically merges small final chunks to maintain consistency
- Flexible output formats: HTML with navigation, Markdown, and plain text
- Large file handling: Extracts individual chunks when files exceed tool limits
- EPUB support: Includes specialized script with proper numerical file sorting
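The smart-chunking behavior above can be sketched as follows. This is an illustrative sketch, not the actual `create_chunks.py` implementation; in particular, the 50% merge threshold is an assumption.

```python
def chunk_words(text, chunk_size=2000, min_final_ratio=0.5):
    """Split text into chunks of roughly chunk_size words.

    If the final chunk falls below min_final_ratio * chunk_size words,
    it is merged into the previous chunk so no tiny tail chunk remains.
    (The 50% threshold is an assumption, not necessarily what
    create_chunks.py uses.)
    """
    words = text.split()
    chunks = [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]
    if len(chunks) > 1 and len(chunks[-1]) < chunk_size * min_final_ratio:
        chunks[-2].extend(chunks.pop())  # merge small tail into previous chunk
    return [" ".join(c) for c in chunks]
```

For example, a 4,500-word input yields chunks of 2,000 and 2,500 words rather than 2,000 / 2,000 / 500.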
- Python 3.x
- BeautifulSoup4 for EPUB parsing
- Claude Code environment (recommended for optimal workflow)
Extracts specific sections from EPUB files with proper numerical file sorting.
python scripts/extract_book_section_fixed.py "book.epub" \
--start-markers "Chapter 1" "Beginning text" \
--end-markers "Chapter 2" "Ending text" \
-o output.txt

Note: This is the only EPUB-specific script. All other scripts work with any text format.
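The "proper numerical file sorting" refers to the classic pitfall where lexicographic sorting places `ch10.xhtml` before `ch2.xhtml`, scrambling chapter order. A standard natural-sort key fixes this; the sketch below shows the general technique, not necessarily the exact code in the script.

```python
import re

def natural_key(filename):
    """Sort key that treats digit runs as numbers, so 'ch10.xhtml'
    sorts after 'ch2.xhtml' instead of before it."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", filename)]

files = ["ch10.xhtml", "ch2.xhtml", "ch1.xhtml"]
sorted(files)                    # lexicographic: ch1, ch10, ch2 (wrong)
sorted(files, key=natural_key)   # natural: ch1, ch2, ch10 (correct)
```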
Splits any text file into chunks of specified word count.
python create_chunks.py input.txt -o output_chunks.txt -s 2000 -f standard
# Use -f numbered to create individual chunk files
# Works with .txt, .md, or any plain text file

Extracts specific chunks as individual files for easier processing.
python extract_chunks_batch.py all_chunks.txt 1 10
# Creates individual files: chunk1.txt through chunk10.txt

Generates formatted output from summaries.
python format_output.py -s summaries.txt -t "Book Title" \
-a "Author Name" -o output_name -f all

The system is designed to work optimally with Claude Code, which provides:
- Parallel processing capabilities for reading multiple chunks
- Natural language interface for summary creation
- Built-in file management and validation
- Integrated todo tracking for complex multi-step processes
- Extract book section:
  "Extract Book One from the EPUB using the fixed extraction script"
- Create chunks:
  "Create 2000-word chunks from the extracted text"
- Generate summaries:
  "Read chunks 1-10 and create 140-160 word summaries for each"
- Format output:
  "Generate HTML, Markdown, and text output from the summaries"
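The scripted steps above can also be chained in a small driver. This is a hypothetical sketch: the script names and flags come from this README, but the file names (`book.epub`, `section.txt`, `summaries.txt`) are placeholders, and the summary-writing step in between is done manually or via Claude Code, not by a script.

```python
import subprocess

# Commands mirror the usage shown earlier in this README; file names
# are example placeholders.
steps = [
    ["python", "scripts/extract_book_section_fixed.py", "book.epub",
     "--start-markers", "Chapter 1", "--end-markers", "Chapter 2",
     "-o", "section.txt"],
    ["python", "create_chunks.py", "section.txt",
     "-o", "chunks.txt", "-s", "2000", "-f", "standard"],
    # (write 140-160 word summaries per chunk into summaries.txt here)
    ["python", "format_output.py", "-s", "summaries.txt",
     "-t", "Book Title", "-a", "Author Name", "-o", "book_summary", "-f", "all"],
]

def run_pipeline(dry_run=True):
    """Print (dry run) or execute each pipeline command in order."""
    for cmd in steps:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```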
text_summarizer/
├── scripts/
│ └── extract_book_section_fixed.py
├── create_chunks.py
├── extract_chunks_batch.py
├── format_output.py
├── process_book.py
├── output/ # Generated content (git-ignored)
├── books/ # Source EPUB files (git-ignored)
├── CLAUDE.md # Project-specific configuration
└── README.md # This file
Summaries must follow this exact format:
=== SUMMARY [number]: Words [start]-[end] ===
Word count: [140-160]
[Summary text capturing key events, character developments, and themes]
(Note the blank line after each summary)
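A small checker can enforce this format before handing summaries to `format_output.py`. This validator is an illustrative sketch, not part of the project's scripts:

```python
import re

# Matches the header line: === SUMMARY [number]: Words [start]-[end] ===
HEADER = re.compile(r"^=== SUMMARY (\d+): Words (\d+)-(\d+) ===$")

def validate_summary(block):
    """Check one summary block: header line, 'Word count:' line,
    and a body of 140-160 words. Returns a list of problems
    (an empty list means the block is valid)."""
    problems = []
    lines = block.strip().splitlines()
    if not lines or not HEADER.match(lines[0]):
        problems.append("bad or missing header line")
        return problems
    if len(lines) < 3 or not lines[1].startswith("Word count:"):
        problems.append("missing 'Word count:' line")
    n = len(" ".join(lines[2:]).split())
    if not 140 <= n <= 160:
        problems.append(f"summary is {n} words, expected 140-160")
    return problems
```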
- v1.0 (2025): Initial release with core functionality
- Fixed numerical sorting bug in EPUB extraction
- Validated workflow for processing long texts
- Integrated Claude Code optimization
This project is designed for educational and personal use.
When contributing, please:
- Focus on core system functionality, not specific book processing
- Maintain the clean separation between system scripts and output
- Update documentation for any new features
- Test with various EPUB formats