Subtexty

Extract clean plain text from subtitle files, with intelligent deduplication and multi-format support.

License: MIT

Overview

Subtexty is a lightweight, open-source CLI tool and TypeScript library that extracts clean, deduplicated plain text from subtitle files. It strips styling tags and timing metadata and removes redundant content while preserving the original text flow.

Features

  • 🎯 Smart Text Extraction: Removes timing, positioning, and style tags while preserving content
  • 🔄 Intelligent Deduplication: Eliminates redundant lines and prefix duplicates
  • 🌐 Multi-Format Support: WebVTT (.vtt), SRT (.srt), TTML (.ttml/.xml), SBV (.sbv), JSON3 (.json/.json3)
  • 🔤 Encoding Handling: UTF-8 by default with fallback encoding detection and manual override support
  • 📝 Dual Interface: Both CLI tool and programmatic library
  • ⚡ Performance: Stream processing for memory efficiency
  • 🧪 Well Tested: 80%+ test coverage with comprehensive test suite

Installation

NPM (Global CLI)

npm install -g subtexty

NPM (Project Dependency)

npm install subtexty

Quick Start

CLI Usage

# Extract text to stdout
subtexty input.vtt

# Save to file
subtexty input.srt -o clean-text.txt

# Specify encoding
subtexty input.vtt --encoding utf-8

Library Usage

import { extractText } from 'subtexty';

// Basic extraction
const cleanText = await extractText('subtitles.vtt');
console.log(cleanText);

// With options (a separate name avoids redeclaring `cleanText`)
const cleanSrtText = await extractText('subtitles.srt', {
  encoding: 'utf-8'
});

CLI Reference

Basic Usage

subtexty [options] <input-file>

Arguments

  • input-file - Subtitle file to process (required)

Options

  • -v, --version - Display version number
  • -o, --output <file> - Output file (default: stdout)
  • --encoding <encoding> - File encoding (default: utf-8)
  • -h, --help - Display help for command

Examples

# Basic text extraction
subtexty movie-subtitles.vtt

# Multiple file processing with output
subtexty episode1.srt -o episode1-text.txt
subtexty episode2.srt -o episode2-text.txt

# Handle different encodings
subtexty foreign-film.srt --encoding latin1

# Pipe to other tools
subtexty subtitles.vtt | wc -w  # Word count
subtexty subtitles.vtt | grep "keyword"  # Search

Exit Codes

  • 0 - Success
  • 1 - File error (not found, permissions, etc.)
  • 2 - Parsing error (invalid format, corrupted data)
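The exit codes above make the CLI easy to drive from a script. The following is a minimal sketch, assuming `subtexty` is installed globally and available on the `PATH`; `describeExitCode` and `runSubtexty` are hypothetical helper names and are not part of the package.

```typescript
import { spawnSync } from 'child_process';

// Exit codes as documented in the list above.
const EXIT_MEANINGS: Record<number, string> = {
  0: 'success',
  1: 'file error',
  2: 'parsing error',
};

// Translate an exit status into a short label. A `null` status means the
// process could not be started or was terminated by a signal.
function describeExitCode(status: number | null): string {
  if (status === null) return 'not run';
  return EXIT_MEANINGS[status] ?? `unknown exit code ${status}`;
}

// Hypothetical helper: run the CLI on one file and report what happened.
// Assumes the `subtexty` binary is on the PATH.
function runSubtexty(inputFile: string): { text: string; outcome: string } {
  const result = spawnSync('subtexty', [inputFile], { encoding: 'utf-8' });
  return {
    text: result.stdout ?? '',
    outcome: describeExitCode(result.status),
  };
}
```

Calling `runSubtexty('input.vtt')` returns the extracted text together with a label such as `success` or `parsing error`, so a calling script can branch on the outcome without parsing stderr.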

Library API

extractText(filePath, options?)

Extracts clean text from a subtitle file.

Parameters:

  • filePath (string) - Path to the subtitle file
  • options (object, optional) - Extraction options
    • encoding (string) - File encoding (default: utf-8)

Returns:

  • Promise<string> - Clean extracted text

Example:

import { extractText } from 'subtexty';

try {
  const text = await extractText('./subtitles.vtt');
  console.log(text);
} catch (error) {
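  // `error` is typed `unknown` in strict TypeScript, so narrow it first
  console.error('Extraction failed:', error instanceof Error ? error.message : error);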
}

Error Handling

import { extractText, isSubtextyError } from 'subtexty';

try {
  const text = await extractText('file.vtt', { encoding: 'utf-8' });
  // Process text...
} catch (error) {
  if (isSubtextyError(error)) {
    // Handle specific subtexty errors
    switch (error.code) {
      case 'FILE_NOT_FOUND':
        console.error('Subtitle file does not exist');
        break;
      case 'UNSUPPORTED_FORMAT':
        console.error('File format not supported');
        break;
      case 'FILE_NOT_READABLE':
        console.error('Cannot read the file');
        break;
      default:
        console.error('Extraction error:', error.message);
    }
  } else {
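    // `error` is typed `unknown` in strict TypeScript, so narrow it first
    console.error('Unexpected error:', error instanceof Error ? error.message : error);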
  }
}
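When the same handling is needed in several places, the switch above can be folded into one function. This is a sketch only: `hasErrorCode` is a hypothetical structural check standing in for `isSubtextyError` (whose implementation is not shown here), and `toUserMessage` is an illustrative name.

```typescript
// Assumed error shape: an Error carrying a string `code`, as used in the
// example above. The real Subtexty error class may differ.
interface CodedError extends Error {
  code: string;
}

// Hypothetical stand-in for `isSubtextyError`. Note that it also matches
// Node.js system errors, which carry codes such as 'ENOENT'.
function hasErrorCode(error: unknown): error is CodedError {
  if (!(error instanceof Error)) return false;
  const candidate = error as Error & { code?: unknown };
  return typeof candidate.code === 'string';
}

// Map an arbitrary thrown value to the messages used in the example above.
function toUserMessage(error: unknown): string {
  if (hasErrorCode(error)) {
    switch (error.code) {
      case 'FILE_NOT_FOUND':
        return 'Subtitle file does not exist';
      case 'UNSUPPORTED_FORMAT':
        return 'File format not supported';
      case 'FILE_NOT_READABLE':
        return 'Cannot read the file';
      default:
        return `Extraction error: ${error.message}`;
    }
  }
  const detail = error instanceof Error ? error.message : String(error);
  return `Unexpected error: ${detail}`;
}
```

A `catch` block can then be reduced to `console.error(toUserMessage(error))`.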

Supported Formats

Format   Extensions      Description
WebVTT   .vtt            Web Video Text Tracks
SRT      .srt            SubRip Subtitle
TTML     .ttml, .xml     Timed Text Markup Language
SBV      .sbv            YouTube SBV format
JSON3    .json, .json3   JSON-based subtitle format
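To check a file before handing it to Subtexty, the table above can be expressed as a lookup by extension. This is an illustrative sketch; `detectFormat` is a hypothetical helper, and Subtexty's own detection may also inspect file contents.

```typescript
// Extension-to-format table mirroring the supported formats listed above.
const FORMAT_BY_EXTENSION: Record<string, string> = {
  '.vtt': 'WebVTT',
  '.srt': 'SRT',
  '.ttml': 'TTML',
  '.xml': 'TTML',
  '.sbv': 'SBV',
  '.json': 'JSON3',
  '.json3': 'JSON3',
};

// Return the format name for a path, or undefined when the extension is
// missing or not in the table. Matching is case-insensitive.
function detectFormat(filePath: string): string | undefined {
  const dot = filePath.lastIndexOf('.');
  if (dot === -1) return undefined;
  return FORMAT_BY_EXTENSION[filePath.slice(dot).toLowerCase()];
}
```

For example, `detectFormat('notes.txt')` yields `undefined`, which a caller could treat as the "Unsupported Format" case.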

Text Processing Features

Tag Removal

Removes HTML, XML, and styling tags:

Input:  <b>Bold text</b> and <i>italic</i>
Output: Bold text and italic
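A rough idea of how tag removal can work, as a regex-based sketch rather than Subtexty's actual parser. `stripTags` is a hypothetical name; the pattern only treats `<` as a tag opener when a letter or digit (optionally after `/`) follows it, so WebVTT timestamp tags are removed while a bare `<` in dialogue is kept.

```typescript
// Remove anything that looks like a tag: <b>, </i>, <c.yellow>, <v Speaker>,
// or an inline WebVTT timestamp such as <00:00:01.000>.
function stripTags(line: string): string {
  return line.replace(/<\/?[A-Za-z0-9][^<>]*>/g, '');
}
```

Applied to the input above, `stripTags('<b>Bold text</b> and <i>italic</i>')` gives `Bold text and italic`.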

Entity Conversion

Converts HTML entities:

Input:  Tom &amp; Jerry say &quot;Hello&quot;
Output: Tom & Jerry say "Hello"
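Entity conversion can be sketched as a single pass over the text, which avoids decoding the same sequence twice (`&amp;lt;` becomes `&lt;`, not `<`). `decodeEntities` is a hypothetical helper that covers only a handful of named entities plus numeric references; the set Subtexty actually supports is not listed here.

```typescript
// A small, assumed subset of named entities.
const NAMED_ENTITIES = new Map<string, string>([
  ['amp', '&'],
  ['lt', '<'],
  ['gt', '>'],
  ['quot', '"'],
  ['apos', "'"],
  ['nbsp', '\u00a0'],
]);

// Decode named, decimal (&#65;) and hexadecimal (&#x41;) references.
// Unknown names and out-of-range code points are left untouched.
function decodeEntities(text: string): string {
  return text.replace(
    /&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);/g,
    (match: string, body: string) => {
      if (body[0] === '#') {
        const isHex = body[1] === 'x' || body[1] === 'X';
        const code = isHex
          ? parseInt(body.slice(2), 16)
          : parseInt(body.slice(1), 10);
        return code <= 0x10ffff ? String.fromCodePoint(code) : match;
      }
      return NAMED_ENTITIES.get(body) ?? match;
    },
  );
}
```

Applied to the input above, `decodeEntities('Tom &amp; Jerry say &quot;Hello&quot;')` gives `Tom & Jerry say "Hello"`.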

Smart Deduplication

Removes redundant content intelligently:

Exact Duplicates:

Input:  Same line
        Same line
        Different line
Output: Same line
        Different line

Prefix Removal:

Input:  I love coding
        I love coding with TypeScript
        Amazing results
Output: I love coding with TypeScript
        Amazing results
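Both rules can be sketched in one pass over the lines. This version is an illustration under a stated assumption: it compares each line only with the most recently kept line, which suits rolling captions that grow word by word. Whether Subtexty also removes non-adjacent duplicates is not specified here, and `dedupeLines` is a hypothetical name.

```typescript
// Drop a line that repeats the previous kept line, and replace the previous
// kept line when the new line extends it (the previous line is a prefix).
function dedupeLines(lines: string[]): string[] {
  const kept: string[] = [];
  for (const line of lines) {
    const last = kept.length - 1;
    const previous = last >= 0 ? kept[last] : undefined;
    if (previous !== undefined && line === previous) {
      continue; // exact duplicate
    }
    if (previous !== undefined && previous.length > 0 && line.startsWith(previous)) {
      kept[last] = line; // keep the longer version
      continue;
    }
    kept.push(line);
  }
  return kept;
}
```

With the prefix example above, `dedupeLines(['I love coding', 'I love coding with TypeScript', 'Amazing results'])` keeps only the longer first sentence and `Amazing results`.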

Whitespace Normalization

Cleans up spacing issues:

Input:  Multiple   spaces    and	tabs
Output: Multiple spaces and tabs
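A per-line sketch of this step, assuming that runs of spaces and tabs collapse to a single space and that leading and trailing whitespace is trimmed (the trimming is an assumption; the example above only shows the collapsing). `normalizeWhitespace` is a hypothetical name.

```typescript
// Collapse runs of spaces and tabs into one space, then trim the ends.
function normalizeWhitespace(line: string): string {
  return line.replace(/[ \t]+/g, ' ').trim();
}
```

Applied to the input above, `normalizeWhitespace('Multiple   spaces    and\ttabs')` gives `Multiple spaces and tabs`.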

Development

Prerequisites

  • Node.js ≥14.0.0
  • pnpm (recommended) or npm

Installation

git clone https://github.com/bytesnack114/subtexty.git
cd subtexty
pnpm install

Development Scripts

# Development
pnpm dev input.vtt              # Run CLI in development mode
pnpm build                      # Build TypeScript
pnpm clean                      # Clean build artifacts

# Testing
pnpm test                       # Run test suite
pnpm test:watch                 # Watch mode testing
pnpm test:coverage              # Coverage report

# Code Quality
pnpm lint                       # Run ESLint
pnpm lint:fix                   # Fix linting issues

Project Structure

subtexty/
├── src/
│   ├── cli.ts              # CLI interface
│   ├── constants.ts        # Application constants
│   ├── errors.ts           # Custom error classes
│   ├── index.ts            # Library entry point
│   ├── validation.ts       # Input validation
│   ├── cli/                # CLI-specific modules
│   ├── parsers/            # Format-specific parsers
│   ├── types/              # TypeScript definitions
│   ├── utils/              # Text cleaning utilities
│   └── __tests__/          # Test suite
├── coverage/               # Coverage report (generated by `pnpm test:coverage`)
├── dist/                   # Built files (generated by `pnpm build`)
└── example/                # Example input files

Contributing

Quick Contribution Steps

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Make changes and add tests
  4. Run tests with coverage: pnpm test:coverage
  5. Commit changes: git commit -m 'Add amazing feature'
  6. Push to branch: git push origin feature/amazing-feature
  7. Open a Pull Request

Testing

Subtexty has comprehensive test coverage:

# Run all tests
pnpm test

# Generate coverage report
pnpm test:coverage

# View coverage report (macOS; use xdg-open on Linux)
open coverage/lcov-report/index.html

Test Categories

  • Unit Tests: Individual component testing
  • Integration Tests: End-to-end workflow testing
  • Parser Tests: Format-specific parsing validation
  • CLI Tests: Command-line interface testing

Performance

  • Memory Efficient: Stream processing for large files
  • Fast Processing: Optimized text cleaning pipeline
  • Minimal Dependencies: Only essential packages included

Troubleshooting

Common Issues

File Not Found Error

Error: Input file not found: subtitle.vtt

Solution: Check file path and permissions

Unsupported Format

Error: Unsupported file format: .txt

Solution: Use a supported subtitle format (.vtt, .srt, .ttml, .xml, .sbv, .json, .json3)

Encoding Issues

# Specify encoding manually
subtexty file.srt --encoding latin1
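For library users who read the bytes themselves, a "try UTF-8, then fall back" strategy can be sketched with the standard `TextDecoder`. This is not Subtexty's detection logic; `decodeWithFallback` is a hypothetical helper, and the `latin1` label assumes a Node.js build with full ICU data (the default for official builds).

```typescript
// Decode as strict UTF-8 first; if the bytes are not valid UTF-8, decode
// them with the fallback encoding instead.
function decodeWithFallback(bytes: Uint8Array, fallback = 'latin1'): string {
  try {
    // `fatal: true` makes the decoder throw on invalid byte sequences
    // instead of silently inserting replacement characters.
    return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
  } catch {
    return new TextDecoder(fallback).decode(bytes);
  }
}
```

If the result still looks garbled, passing the encoding explicitly with `--encoding` (or the `encoding` option) remains the more reliable route.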

Permission Errors

# Check file permissions
ls -la subtitle-file.vtt
chmod +r subtitle-file.vtt

License

MIT License - see LICENSE.md file for details.
