Extract clean plain text from subtitle files with intelligent deduplication and multi-format support.
Subtexty is a lightweight, open-source CLI tool and TypeScript library that extracts clean, deduplicated plain text from subtitle files. It strips styling tags and timing metadata and removes redundant content while preserving the original text flow.
- 🎯 Smart Text Extraction: Removes timing, positioning, and style tags while preserving content
- 🔄 Intelligent Deduplication: Eliminates redundant lines and prefix duplicates
- 🌐 Multi-Format Support: WebVTT (.vtt), SRT (.srt), TTML (.ttml/.xml), SBV (.sbv), JSON3 (.json/.json3)
- 🔤 Encoding Handling: UTF-8 by default with fallback encoding detection and manual override support
- 📝 Dual Interface: Both CLI tool and programmatic library
- ⚡ Performance: Stream processing for memory efficiency
- 🧪 Well Tested: 80%+ test coverage with comprehensive test suite
# Install globally to use the CLI
npm install -g subtexty
# Install locally to use as a library
npm install subtexty
# Extract text to stdout
subtexty input.vtt
# Save to file
subtexty input.srt -o clean-text.txt
# Specify encoding
subtexty input.vtt --encoding utf-8
import { extractText } from 'subtexty';
// Basic extraction
const cleanText = await extractText('subtitles.vtt');
console.log(cleanText);
// With options
const textWithOptions = await extractText('subtitles.srt', {
  encoding: 'utf-8'
});
subtexty [options] <input-file>
input-file
- Subtitle file to process (required)
-v, --version
- Display version number
-o, --output <file>
- Output file (default: stdout)
--encoding <encoding>
- File encoding (default: utf-8)
-h, --help
- Display help for command
# Basic text extraction
subtexty movie-subtitles.vtt
# Multiple file processing with output
subtexty episode1.srt -o episode1-text.txt
subtexty episode2.srt -o episode2-text.txt
# Handle different encodings
subtexty foreign-film.srt --encoding latin1
# Pipe to other tools
subtexty subtitles.vtt | wc -w # Word count
subtexty subtitles.vtt | grep "keyword" # Search
0
- Success
1
- File error (not found, permissions, etc.)
2
- Parsing error (invalid format, corrupted data)
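For scripts that wrap the CLI, the exit codes listed above can be turned into readable messages. The following is a minimal sketch; `describeExitCode` and `EXIT_CODE_MEANINGS` are hypothetical names used for illustration and are not part of subtexty.

```typescript
// Sketch only: describeExitCode is a hypothetical helper, not part of subtexty.
// The code-to-meaning pairs are copied from the exit code list above.
const EXIT_CODE_MEANINGS = new Map<number, string>([
  [0, 'Success'],
  [1, 'File error (not found, permissions, etc.)'],
  [2, 'Parsing error (invalid format, corrupted data)'],
]);

function describeExitCode(code: number): string {
  // Fall back to a generic message for codes the list does not document.
  return EXIT_CODE_MEANINGS.get(code) ?? `Unknown exit code: ${code}`;
}

console.log(describeExitCode(2));
```

A wrapper script could pass the status of a finished `subtexty` process to this helper before deciding whether to retry or abort.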
Extracts clean text from a subtitle file.
Parameters:
filePath
(string) - Path to the subtitle file
options
(object, optional) - Extraction options
encoding
(string) - File encoding (default: utf-8)
Returns:
Promise<string>
- Clean extracted text
Example:
import { extractText } from 'subtexty';
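try {
  const text = await extractText('./subtitles.vtt');
  console.log(text);
} catch (error) {
  // A caught value is not guaranteed to be an Error instance
  const message = error instanceof Error ? error.message : String(error);
  console.error('Extraction failed:', message);
}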
import { extractText, isSubtextyError } from 'subtexty';
try {
  const text = await extractText('file.vtt', { encoding: 'utf-8' });
  // Process text...
} catch (error) {
  if (isSubtextyError(error)) {
    // Handle specific subtexty errors
    switch (error.code) {
      case 'FILE_NOT_FOUND':
        console.error('Subtitle file does not exist');
        break;
      case 'UNSUPPORTED_FORMAT':
        console.error('File format not supported');
        break;
      case 'FILE_NOT_READABLE':
        console.error('Cannot read the file');
        break;
      default:
        console.error('Extraction error:', error.message);
    }
  } else {
    console.error('Unexpected error:', error instanceof Error ? error.message : error);
  }
}
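The error-handling example above relies on a type guard to narrow the caught value before reading `error.code`. The sketch below shows one way such a guard could work. `CodedError`, `isCodedError`, and `makeCodedError` are hypothetical names, and subtexty's own error classes and `isSubtextyError` may be implemented differently.

```typescript
// Sketch only: CodedError, isCodedError, and makeCodedError are hypothetical.
// subtexty's real error classes and isSubtextyError may work differently.
interface CodedError extends Error {
  code: string;
}

// Type guard: narrows an unknown caught value to an Error carrying a string code.
function isCodedError(value: unknown): value is CodedError {
  return (
    value instanceof Error &&
    'code' in value &&
    typeof (value as Error & { code: unknown }).code === 'string'
  );
}

// Attach a code to a plain Error for demonstration purposes.
function makeCodedError(code: string, message: string): CodedError {
  return Object.assign(new Error(message), { code });
}

try {
  throw makeCodedError('FILE_NOT_FOUND', 'Input file not found: subtitle.vtt');
} catch (error) {
  if (isCodedError(error)) {
    console.error(`[${error.code}] ${error.message}`);
  } else {
    console.error('Unexpected error:', error);
  }
}
```

Checking for the `code` property structurally, rather than with `instanceof` on a custom class, keeps the guard working even if two copies of a package end up in the dependency tree.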
Format | Extensions | Description |
---|---|---|
WebVTT | .vtt | Web Video Text Tracks |
SRT | .srt | SubRip Subtitle |
TTML | .ttml, .xml | Timed Text Markup Language |
SBV | .sbv | YouTube SBV format |
JSON3 | .json, .json3 | JSON-based subtitle format |
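The extension column of the table above can be read as a lookup from file extension to format. The following sketch illustrates that lookup. `detectFormat` and `SubtitleFormat` are hypothetical names, and the real tool may also inspect file contents rather than relying on the extension alone.

```typescript
// Sketch only: detectFormat and SubtitleFormat are hypothetical names;
// subtexty's real detection logic may also inspect file contents.
type SubtitleFormat = 'WebVTT' | 'SRT' | 'TTML' | 'SBV' | 'JSON3';

// Extension-to-format pairs copied from the supported formats table.
const FORMAT_BY_EXTENSION = new Map<string, SubtitleFormat>([
  ['.vtt', 'WebVTT'],
  ['.srt', 'SRT'],
  ['.ttml', 'TTML'],
  ['.xml', 'TTML'],
  ['.sbv', 'SBV'],
  ['.json', 'JSON3'],
  ['.json3', 'JSON3'],
]);

function detectFormat(filePath: string): SubtitleFormat | undefined {
  // Look at the file name only, so dots in directory names are ignored.
  const fileName = filePath.split(/[\\/]/).pop() ?? '';
  const dot = fileName.lastIndexOf('.');
  if (dot <= 0) return undefined; // no extension, or a dotfile such as ".vtt"
  return FORMAT_BY_EXTENSION.get(fileName.slice(dot).toLowerCase());
}

console.log(detectFormat('movie.en.VTT'));
console.log(detectFormat('notes.txt'));
```

Returning `undefined` for unknown extensions leaves the caller free to decide how to report an unsupported format.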
Removes HTML, XML, and styling tags:
Input: <b>Bold text</b> and <i>italic</i>
Output: Bold text and italic
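As a rough illustration of tag removal, a single regular expression handles the simple cases. This is a minimal sketch with a hypothetical `stripTags` helper, not subtexty's actual implementation, which may treat format-specific markup more carefully.

```typescript
// Sketch only: stripTags is a hypothetical helper, not subtexty's implementation.
function stripTags(text: string): string {
  // Drop anything shaped like <tag>, </tag>, or a WebVTT timestamp tag
  // such as <00:00:01.000>.
  return text.replace(/<[^<>]*>/g, '');
}

console.log(stripTags('<b>Bold text</b> and <i>italic</i>'));
```

A pattern this simple also removes literal angle-bracket pairs that appear in dialogue, so a production cleaner would likely need format-aware parsing.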
Converts HTML entities:
Input: Tom &amp; Jerry say &quot;Hello&quot;
Output: Tom & Jerry say "Hello"
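Entity conversion can be sketched as a single-pass replacement over named and numeric references. `decodeEntities` and `NAMED_ENTITIES` are hypothetical names, and the short entity table is an assumption chosen for illustration; the library may support a much larger set.

```typescript
// Sketch only: decodeEntities is a hypothetical helper covering a few entities.
const NAMED_ENTITIES = new Map<string, string>([
  ['amp', '&'],
  ['lt', '<'],
  ['gt', '>'],
  ['quot', '"'],
  ['apos', "'"],
  ['nbsp', '\u00a0'],
]);

function decodeEntities(text: string): string {
  // A single pass avoids double-decoding input such as "&amp;lt;".
  return text.replace(
    /&(#x[0-9a-f]+|#\d+|[a-z]+);/gi,
    (match: string, body: string): string => {
      if (body[0] === '#') {
        const isHex = body[1] === 'x' || body[1] === 'X';
        const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
        // Leave out-of-range references untouched instead of throwing.
        return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
      }
      // Unknown names are left as they are.
      return NAMED_ENTITIES.get(body) ?? match;
    },
  );
}

console.log(decodeEntities('Tom &amp; Jerry say &quot;Hello&quot;'));
```

Using a `Map` rather than a plain object keeps inherited property names such as `constructor` from being mistaken for entity names.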
Removes redundant content intelligently:
Exact Duplicates:
Input:  Same line
        Same line
        Different line
Output: Same line
        Different line
Prefix Removal:
Input:  I love coding
        I love coding with TypeScript
        Amazing results
Output: I love coding with TypeScript
        Amazing results
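Both deduplication rules can be expressed as one comparison against the previously kept line. The sketch below assumes redundancy is judged between adjacent lines only, which the examples suggest but the source does not state. `dedupeLines` is a hypothetical helper, not the library's code.

```typescript
// Sketch only: dedupeLines is a hypothetical helper. It assumes redundancy is
// judged between adjacent lines, which may differ from subtexty's actual rules.
function dedupeLines(lines: string[]): string[] {
  const result: string[] = [];
  for (const raw of lines) {
    const line = raw.trim();
    if (line === '') continue; // skip blank lines
    const previous = result[result.length - 1];
    if (previous !== undefined && line.startsWith(previous)) {
      // Covers both cases: an exact repeat, and a longer line that extends
      // the previous one (as in progressively revealed captions).
      result[result.length - 1] = line;
    } else {
      result.push(line);
    }
  }
  return result;
}

console.log(dedupeLines(['I love coding', 'I love coding with TypeScript', 'Amazing results']));
```

Because an exact repeat is also a prefix of itself, the single `startsWith` check handles exact duplicates and prefix duplicates together.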
Cleans up spacing issues:
Input: Multiple    spaces   and	tabs
Output: Multiple spaces and tabs
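Whitespace cleanup of this kind is commonly done by collapsing every run of whitespace into one space. A minimal sketch, with `normalizeWhitespace` as a hypothetical helper name:

```typescript
// Sketch only: normalizeWhitespace is a hypothetical helper, not subtexty's code.
function normalizeWhitespace(text: string): string {
  // \s covers spaces, tabs, and line breaks; each run collapses to one space.
  return text.replace(/\s+/g, ' ').trim();
}

console.log(normalizeWhitespace('Multiple    spaces   and\ttabs'));
```

Since `\s` also matches line breaks, a helper like this is best applied to one line at a time when line structure needs to be preserved.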
- Node.js ≥14.0.0
- pnpm (recommended) or npm
git clone https://github.com/bytesnack114/subtexty.git
cd subtexty
pnpm install
# Development
pnpm dev input.vtt # Run CLI in development mode
pnpm build # Build TypeScript
pnpm clean # Clean build artifacts
# Testing
pnpm test # Run test suite
pnpm test:watch # Watch mode testing
pnpm test:coverage # Coverage report
# Code Quality
pnpm lint # Run ESLint
pnpm lint:fix # Fix linting issues
subtexty/
├── src/
│ ├── cli.ts # CLI interface
│ ├── constants.ts # Application constants
│ ├── errors.ts # Custom error classes
│ ├── index.ts # Library entry point
│ ├── validation.ts # Input validation
│ ├── cli/ # CLI-specific modules
│ ├── parsers/ # Format-specific parsers
│ ├── types/ # TypeScript definitions
│ ├── utils/ # Text cleaning utilities
│ └── __tests__/ # Test suite
├── coverage/ # Coverage report (generated by `pnpm test:coverage`)
├── dist/ # Built files (generated by `pnpm build`)
└── example/ # Example input files
- Fork the repository
- Create a feature branch:
git checkout -b feature/amazing-feature
- Make changes and add tests
- Run tests with coverage:
pnpm test:coverage
- Commit changes:
git commit -m 'Add amazing feature'
- Push to branch:
git push origin feature/amazing-feature
- Open a Pull Request
Subtexty has comprehensive test coverage:
# Run all tests
pnpm test
# Generate coverage report
pnpm test:coverage
# View coverage report (macOS; use xdg-open on Linux)
open coverage/lcov-report/index.html
- Unit Tests: Individual component testing
- Integration Tests: End-to-end workflow testing
- Parser Tests: Format-specific parsing validation
- CLI Tests: Command-line interface testing
- Memory Efficient: Stream processing for large files
- Fast Processing: Optimized text cleaning pipeline
- Minimal Dependencies: Only essential packages included
File Not Found Error
Error: Input file not found: subtitle.vtt
Solution: Check file path and permissions
Unsupported Format
Error: Unsupported file format: .txt
Solution: Use a supported subtitle format (.vtt, .srt, .ttml, .xml, .sbv, .json, .json3)
Encoding Issues
# Specify encoding manually
subtexty file.srt --encoding latin1
Permission Errors
# Check file permissions
ls -la subtitle-file.vtt
chmod +r subtitle-file.vtt
MIT License - see LICENSE.md file for details.
- 🐛 Bug Reports: GitHub Issues
- 📧 Email: bytesnack114@gmail.com