Text Chunking & Summarization System v1.0

A system for processing text content (EPUB books, plain text, Markdown, and other formats) into manageable chunks with structured summaries.

Overview

This system processes text content by:

- Extracting sections from various sources (EPUB, TXT, MD, etc.)
- Dividing text into 2000-word chunks, merging a small final chunk so chunk sizes stay consistent
- Facilitating creation of a 140-160 word summary for each chunk
- Generating formatted output in HTML, Markdown, and plain text

Key Features

- Universal text processing: Works with any text format - EPUB, TXT, MD, or pasted content
- Content-based extraction: Uses text markers for reliable section extraction
- Smart chunking: Automatically merges small final chunks to maintain consistency
- Flexible output formats: HTML with navigation, Markdown, and plain text
- Large file handling: Extracts individual chunks when files exceed tool limits
- EPUB support: Includes specialized script with proper numerical file sorting
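
The content-based extraction idea can be illustrated with a minimal sketch. It is not the code in `extract_book_section_fixed.py`: the function name `extract_between_markers`, the rule that markers are tried in the order given, and the fallback to the end of the text are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def extract_between_markers(
    text: str,
    start_markers: Sequence[str],
    end_markers: Sequence[str],
) -> Optional[str]:
    """Return the text from the first start marker that is found up to the
    first end marker found after it, or None if no start marker matches.

    Hypothetical helper: markers are tried in the order given.
    """
    start = -1
    search_from = 0
    for marker in start_markers:
        start = text.find(marker)
        if start != -1:
            # Look for end markers only after the matched start marker.
            search_from = start + len(marker)
            break
    if start == -1:
        return None

    end = len(text)  # assumed fallback: run to the end of the text
    for marker in end_markers:
        pos = text.find(marker, search_from)
        if pos != -1:
            end = pos
            break
    return text[start:end].strip()
```

Passing several markers gives a fallback when the first one does not appear verbatim in the source, which is one plausible reason the command-line interface accepts more than one start and end marker.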

System Requirements

- Python 3.x
- BeautifulSoup4 for EPUB parsing
- Claude Code environment (recommended for optimal workflow)

Core Scripts

1. `extract_book_section_fixed.py` (EPUB-specific)

Extracts specific sections from EPUB files with proper numerical file sorting.

python scripts/extract_book_section_fixed.py "book.epub" \
    --start-markers "Chapter 1" "Beginning text" \
    --end-markers "Chapter 2" "Ending text" \
    -o output.txt

Note: This is the only EPUB-specific script. All other scripts work with any text format.
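
Numerical file sorting matters because a plain string sort places `chapter10.xhtml` before `chapter2.xhtml`. The sketch below shows one common way to get a numeric-aware order. It is an assumption about what "proper numerical file sorting" means here, and `natural_sort_key` is a hypothetical name, not a function taken from the script.

```python
import re
from typing import List, Union


def natural_sort_key(name: str) -> List[Union[int, str]]:
    """Split a file name into alternating text and integer parts so that
    embedded numbers compare by value rather than character by character."""
    # re.split with a capturing group returns text at even indexes
    # and the captured digit runs at odd indexes.
    parts = re.split(r"(\d+)", name)
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


files = ["chapter10.xhtml", "chapter2.xhtml", "chapter1.xhtml"]
print(sorted(files))                        # plain string sort
print(sorted(files, key=natural_sort_key))  # numeric-aware sort
```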

2. `create_chunks.py` (Universal)

Splits any text file into chunks of specified word count.

python create_chunks.py input.txt -o output_chunks.txt -s 2000 -f standard
# Use -f numbered to create individual chunk files
# Works with .txt, .md, or any plain text file
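
The chunking step with final-chunk merging might look roughly like the sketch below. The README only says that small final chunks are merged, so the choice to fold the leftover into the preceding chunk, the `min_final` threshold, and the name `chunk_words` are assumptions rather than documented behavior of `create_chunks.py`.

```python
from typing import List


def chunk_words(text: str, size: int = 2000, min_final: int = 500) -> List[str]:
    """Split text into chunks of `size` words. If the last chunk has fewer
    than `min_final` words, fold it into the previous chunk.

    `min_final` is an assumed threshold, not a documented option.
    """
    words = text.split()
    chunks = [words[i:i + size] for i in range(0, len(words), size)]
    if len(chunks) > 1 and len(chunks[-1]) < min_final:
        tail = chunks.pop()
        chunks[-1].extend(tail)
    return [" ".join(chunk) for chunk in chunks]
```

Splitting on whitespace keeps the sketch short but discards paragraph breaks; a real implementation would likely preserve them.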

3. `extract_chunks_batch.py`

Extracts specific chunks as individual files for easier processing.

python extract_chunks_batch.py all_chunks.txt 1 10
# Creates individual files: chunk1.txt through chunk10.txt
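
A rough sketch of how a range of chunks could be pulled out of a combined file follows. The layout of the combined chunks file is not documented here, so the header pattern `=== CHUNK n ===` is purely an assumption, as are the names `split_chunks` and `write_range`.

```python
import re
from pathlib import Path
from typing import Dict

# Assumed header line, e.g. "=== CHUNK 3 ===". The real file may differ.
CHUNK_HEADER = re.compile(r"^=== CHUNK (\d+)\b.*===[ \t]*$", re.MULTILINE)


def split_chunks(all_chunks: str) -> Dict[int, str]:
    """Map chunk number to chunk body for every header found."""
    matches = list(CHUNK_HEADER.finditer(all_chunks))
    chunks: Dict[int, str] = {}
    for i, m in enumerate(matches):
        # A chunk body runs from its header to the next header (or the end).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(all_chunks)
        chunks[int(m.group(1))] = all_chunks[m.end():end].strip()
    return chunks


def write_range(chunks: Dict[int, str], first: int, last: int, out_dir: Path) -> None:
    """Write chunk<first>.txt through chunk<last>.txt, skipping missing numbers."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for n in range(first, last + 1):
        if n in chunks:
            (out_dir / f"chunk{n}.txt").write_text(chunks[n], encoding="utf-8")
```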

4. `format_output.py`

Generates formatted output from summaries.

python format_output.py -s summaries.txt -t "Book Title" \
    -a "Author Name" -o output_name -f all

Recommended Workflow with Claude Code

The system is designed to work optimally with Claude Code, which provides:

- Parallel processing capabilities for reading multiple chunks
- Natural language interface for summary creation
- Built-in file management and validation
- Integrated todo tracking for complex multi-step processes

Example Claude Code Workflow:

Extract book section:

"Extract Book One from the EPUB using the fixed extraction script"

Create chunks:

"Create 2000-word chunks from the extracted text"

Generate summaries:

"Read chunks 1-10 and create 140-160 word summaries for each"

Format output:

"Generate HTML, Markdown, and text output from the summaries"

File Structure

text_summarizer/
├── scripts/
│   └── extract_book_section_fixed.py
├── create_chunks.py
├── extract_chunks_batch.py
├── format_output.py
├── process_book.py
├── output/              # Generated content (git-ignored)
├── books/               # Source EPUB files (git-ignored)
├── CLAUDE.md            # Project-specific configuration
└── README.md            # This file

Summary Format

Summaries must follow this exact format:

=== SUMMARY [number]: Words [start]-[end] ===
Word count: [140-160]
[Summary text capturing key events, character developments, and themes]

(Note the blank line after each summary)
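
Because the format is strict, it can be checked mechanically before formatting. The following is a minimal validation sketch based only on the format shown above; the names `Summary`, `parse_summaries`, and `out_of_range` are hypothetical and are not part of `format_output.py`.

```python
import re
from typing import List, NamedTuple


class Summary(NamedTuple):
    number: int
    start: int
    end: int
    text: str


HEADER = re.compile(
    r"^=== SUMMARY (\d+): Words (\d+)-(\d+) ===[ \t]*$", re.MULTILINE
)


def parse_summaries(raw: str) -> List[Summary]:
    """Parse summaries written in the header / word count / text layout."""
    matches = list(HEADER.finditer(raw))
    result = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(raw)
        lines = raw[m.end():end].strip().splitlines()
        # Drop the declared "Word count:" line; the count is recomputed later.
        if lines and lines[0].lower().startswith("word count:"):
            lines = lines[1:]
        text = " ".join(line.strip() for line in lines if line.strip())
        result.append(
            Summary(int(m.group(1)), int(m.group(2)), int(m.group(3)), text)
        )
    return result


def out_of_range(summaries: List[Summary], low: int = 140, high: int = 160) -> List[int]:
    """Return the numbers of summaries whose word count is outside [low, high]."""
    return [s.number for s in summaries if not low <= len(s.text.split()) <= high]
```

Recounting the words instead of trusting the declared `Word count:` line catches summaries that were edited after the count was written.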

Version History

v1.0 (2025): Initial release with core functionality
- Fixed numerical sorting bug in EPUB extraction
- Validated workflow for processing long texts
- Integrated Claude Code optimization

License

This project is designed for educational and personal use.

Contributing

When contributing, please:

- Focus on core system functionality, not specific book processing
- Maintain the clean separation between system scripts and output
- Update documentation for any new features
- Test with various EPUB formats

registered2nd/Smart-text-summarizer
