Movie Script Scraper

A high-performance scraper to retrieve movie scripts by genre from IMSDB.

Features

  • 🎬 Genre-based scraping - Download multiple scripts from any movie genre
  • 🌐 All-scripts scraping - Download from IMSDB's complete script database
  • ⚡ High performance - Parallel processing with optimized concurrency control
  • 🔄 Smart retry logic - Robust error handling with exponential backoff
  • 📁 Organized output - Scripts saved in structured directories
  • 🧪 Comprehensive testing - Unit, integration, and e2e test coverage
  • 🔧 Developer-friendly - Modern tooling with ESLint, Prettier, and Jest
  • 📊 Quality assurance - 80%+ test coverage with automated quality checks
  • 🚀 CLI support - Run directly from command line or as a Node.js module

Installation

npm install -S movie-script-scraper

Usage

Example Usage

Movie Script Scraper exports a single function. Pass it an options object (see Options below) and it returns a promise that resolves to an array of file paths for the scripts it saved.

const mss = require('movie-script-scraper');

const options = {
 genre: 'Action',
 total: 10,
};

mss(options)
 .then(filePaths => {
  console.log(filePaths);
 })
 .catch(err => {
  console.error('There was a problem:', err);
 });

Options

  • genre [string] - Any valid film genre from the following list:
    • Action | Adventure | Animation | Comedy | Crime | Drama
    • Family | Fantasy | Film-Noir | Horror | Musical | Mystery
    • Romance | Sci-Fi | Short | Thriller | War | Western
    • Defaults to "Action".
  • all [boolean] - Download from IMSDB's complete script database (all-scripts.html).
    • Defaults to false.
  • total [number] - the total number of scripts you want to download.
    • Defaults to 10 for genre-based, 50 for all-scripts.
  • dest [string] - Location that you want to save your scripts.
    • Defaults to ./scripts in the root directory.

Note: Title-based scraping has been disabled due to IMSDB URL structure limitations. Genre-based scraping and the --all option are more reliable and efficient.
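
The defaults listed above can be summarised in code. This is a minimal sketch only: `resolveOptions` is a hypothetical helper written for illustration, not part of the package's API. It assumes that setting `all` switches the default `total` from 10 to 50, as the list describes. How `genre` interacts with `all` is not specified here, so the sketch leaves `genre` untouched.

```javascript
// Hypothetical helper (not part of movie-script-scraper): applies the
// defaults documented in the Options list to a user-supplied object.
function resolveOptions(userOptions = {}) {
  const all = userOptions.all === true; // "all" defaults to false
  return {
    genre: userOptions.genre ?? 'Action',
    all,
    // Documented default: 10 for genre-based runs, 50 for all-scripts runs.
    total: userOptions.total ?? (all ? 50 : 10),
    dest: userOptions.dest ?? './scripts',
  };
}

console.log(resolveOptions({ genre: 'Comedy' }));
console.log(resolveOptions({ all: true, dest: './all-scripts' }));
```

An explicit `total` always wins over either default, so `resolveOptions({ all: true, total: 5 })` keeps `total` at 5.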

Available Genres

The scraper supports downloading scripts from the following movie genres:

Action Adventure Animation Comedy Crime Drama
Family Fantasy Film-Noir Horror Musical Mystery
Romance Sci-Fi Short Thriller War Western
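
If you want to check a genre before calling the scraper, something like the following may help. It is a sketch, not the package's own validation: `normalizeGenre` is a hypothetical helper, and matching case-insensitively is an assumption about what would be convenient, not documented behaviour.

```javascript
// Genres copied from the list above.
const GENRES = [
  'Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Drama',
  'Family', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery',
  'Romance', 'Sci-Fi', 'Short', 'Thriller', 'War', 'Western',
];

// Hypothetical helper: returns the canonical genre name, or null if the
// input is not in the list. Case-insensitive matching is an assumption.
function normalizeGenre(input) {
  const wanted = String(input).trim().toLowerCase();
  return GENRES.find((genre) => genre.toLowerCase() === wanted) ?? null;
}

console.log(normalizeGenre('sci-fi')); // prints: Sci-Fi
console.log(normalizeGenre('Romcom')); // prints: null
```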

Running from command line

You can run Movie Script Scraper directly from the CLI with a variety of options, provided it is available on your PATH (for example, after npm install -g movie-script-scraper).

# Download Comedy scripts
movie-script-scraper --total 10 --genre Comedy

# Download Action scripts to custom directory
movie-script-scraper --genre Action --total 5 --dest ./my-scripts

# Download Horror scripts
movie-script-scraper --genre Horror --total 8

# Download from all available scripts
movie-script-scraper --all --total 20 --dest ./all-scripts

# Use defaults (Action genre, 10 scripts)
movie-script-scraper

How it Works

Movie Script Scraper combines IMSDB's RSS feeds with HTML scraping:

Technical Process

  1. RSS Feed Parsing: IMSDB provides RSS feeds based on movie genre (e.g., http://www.imsdb.com/feeds/genre.php?genre=Comedy)
  2. URL Extraction: node-fetch retrieves the RSS feed, and regex patterns extract the movie script URLs
  3. Script Scraping: Each script URL is fetched with node-fetch and parsed with cheerio to extract the script content
  4. File Management: Scripts are saved to organized directories with proper file naming conventions
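
The URL extraction step can be sketched as follows. This is an illustration under stated assumptions, not the project's code: `extractItemLinks` and its regex are hypothetical, and `sampleFeed` is invented example data, so the real IMSDB feed markup may differ. The network fetch is left out to keep the example self-contained.

```javascript
// Invented example data standing in for an RSS response body.
const sampleFeed = `
<rss><channel>
  <title>Example feed</title>
  <link>https://example.com/</link>
  <item><title>First Script</title><link>https://example.com/scripts/first.html</link></item>
  <item><title>Second Script</title><link>https://example.com/scripts/second.html</link></item>
</channel></rss>`;

// Hypothetical helper: returns the first <link> inside each <item>,
// ignoring the channel-level <link> that appears before any item.
function extractItemLinks(xml) {
  const pattern = /<item>[\s\S]*?<link>([^<]+)<\/link>/g;
  return Array.from(xml.matchAll(pattern), (match) => match[1].trim());
}

console.log(extractItemLinks(sampleFeed));
```

A regex is enough for a feed with a simple, predictable shape. A dedicated XML parser would be the safer choice if the markup varies.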

Architecture

The scraper supports two high-performance scraping modes:

Genre-Based Scraping

  • Fetches multiple scripts from a specified genre
  • Uses RSS feeds to discover available scripts
  • Parallel processing with configurable concurrency (default: 5 concurrent downloads)
  • Smart selection - pre-filters URLs to avoid unnecessary downloads
  • Retry logic with exponential backoff for robust error handling
  • Memory optimization - streaming for large responses
  • Supports configurable download limits
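
To make the concurrency limit concrete, here is one common way to cap the number of downloads in flight. It is a sketch, not the scraper's implementation: `mapWithConcurrency` and `fakeDownload` are hypothetical names, and the timer stands in for a real HTTP request.

```javascript
// Hypothetical helper: runs `worker` over `items` with at most `limit`
// tasks in flight, preserving input order in the results.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  async function runner() {
    while (next < items.length) {
      const index = next++; // claim the next item synchronously
      results[index] = await worker(items[index], index);
    }
  }

  // Start up to `limit` runners; each pulls items until none are left.
  const runners = Array.from({ length: Math.min(limit, items.length) }, runner);
  await Promise.all(runners);
  return results;
}

// Stand-in for a real download; resolves after a short delay.
const fakeDownload = (url) =>
  new Promise((resolve) => setTimeout(() => resolve(`saved ${url}`), 10));

mapWithConcurrency(['a', 'b', 'c', 'd', 'e', 'f', 'g'], 5, fakeDownload)
  .then((saved) => console.log(saved));
```

With a limit of 5 and seven items, five downloads start immediately and the remaining two start as earlier ones finish.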

All-Scripts Scraping

  • Fetches scripts from IMSDB's complete database (all-scripts.html)
  • Uses comprehensive alphabetical script list
  • Same performance optimizations as genre-based scraping
  • Larger default limit (50 scripts) due to extensive available content
  • Organized by "All" genre for easy file management
  • Perfect for discovering diverse script content

Performance Features

  • Concurrent Downloads: Multiple scripts download simultaneously
  • Smart URL Selection: Only downloads the exact number requested
  • Robust Error Handling: Automatic retries with exponential backoff
  • Memory Efficient: Streaming for large HTML responses
  • Progress Reporting: Real-time feedback during downloads
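
Retrying with exponential backoff generally looks like the sketch below. The helper name `withRetry`, the retry count, and the delay values are assumptions chosen for illustration. The scraper's actual settings are not documented here.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: retries `task` up to `retries` extra times, waiting
// baseDelayMs, then 2x, then 4x, ... between attempts.
async function withRetry(task, { retries = 3, baseDelayMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      if (attempt >= retries) throw err; // out of attempts: surface the error
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
}

// Example: a task that fails twice before succeeding.
let calls = 0;
const flaky = async () => {
  calls += 1;
  if (calls < 3) throw new Error('temporary failure');
  return 'ok';
};

withRetry(flaky, { baseDelayMs: 10 })
  .then((result) => console.log(result, calls)); // prints: ok 3
```

Doubling the delay after each failure gives a struggling server progressively more time to recover instead of hitting it again at a fixed rate.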

Dependencies

  • cheerio: Server-side jQuery implementation for HTML parsing
  • node-fetch: Lightweight HTTP client for web requests
  • lodash: Utility library for data manipulation
  • minimist: Command-line argument parsing
  • mkdirp: Directory creation utility

Running Locally

  1. Install dependencies with:

    npm install
  2. Run the tests with:

    npm test

Available Scripts

The project includes several npm scripts for development and testing:

Testing Scripts

  • npm test - Run all tests with coverage
  • npm run test:unit - Run only unit tests
  • npm run test:integration - Run only integration tests
  • npm run test:e2e - Run only end-to-end tests
  • npm run test:watch - Run tests in watch mode
  • npm run test:coverage - Run tests with detailed coverage report

Development Scripts

  • npm start - Run the application with default settings
  • npm run start:title - Run with a specific title (e.g., 'frozen'); note that title-based scraping is currently disabled (see the note under Options)
  • npm run dev - Start development mode with test watching

Code Quality Scripts

  • npm run lint - Run ESLint on source code
  • npm run lint:fix - Run ESLint and fix auto-fixable issues
  • npm run format - Format code with Prettier
  • npm run format:check - Check if code is properly formatted

Maintenance Scripts

  • npm run healthcheck - Run linting and full test suite
  • npm run clean - Clean coverage and cache directories
  • npm run audit:fix - Fix security vulnerabilities
  • npm run update:deps - Update dependencies

Testing

This project uses Jest for testing with comprehensive coverage across different test types:

Test Structure

  • Unit Tests (tests/unit/) - Test individual functions and modules in isolation
  • Integration Tests (tests/integration/) - Test component interactions and API calls
  • End-to-End Tests (tests/e2e/) - Test complete application workflows

Coverage Requirements

The project maintains high code quality with coverage thresholds:

  • Branches: 80%
  • Functions: 80%
  • Lines: 80%
  • Statements: 80%
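
In Jest, thresholds like these are usually expressed with the `coverageThreshold` option. The fragment below is a sketch of that setting only. The project's real config/jest.config.js is not reproduced here and may contain different or additional options.

```javascript
// Sketch of a Jest config enforcing the thresholds above. The project's
// real config/jest.config.js is not reproduced here and may differ.
const jestConfig = {
  collectCoverage: true,
  coverageThreshold: {
    // Jest fails the run if global coverage drops below any of these.
    global: { branches: 80, functions: 80, lines: 80, statements: 80 },
  },
};

module.exports = jestConfig;
```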

Running Tests

# Run all tests with coverage
npm test

# Run specific test suites
npm run test:unit
npm run test:integration
npm run test:e2e

# Run tests in watch mode for development
npm run test:watch

# Generate detailed coverage report
npm run test:coverage

Test Configuration

Tests are configured in config/jest.config.js with:

  • Babel transformation for modern JavaScript
  • Module path mapping for cleaner imports
  • Global test utilities and mocks
  • Extended timeouts for integration tests

Development Workflow

Prerequisites

  • Node.js: Version 16.0.0 or higher
  • npm: Latest version recommended

Code Quality Tools

  • ESLint: Code linting with Airbnb base configuration
  • Prettier: Code formatting for consistent style
  • Husky: Git hooks for pre-commit and pre-push validation

Development Process

  1. Fork and clone the repository
  2. Install dependencies: npm install
  3. Create feature branch: git checkout -b feature-name
  4. Make changes following the coding standards
  5. Run tests: npm run healthcheck
  6. Commit changes: Git hooks will run linting and tests
  7. Push and create PR: Submit pull request for review

Project Structure

src/
├── app.js                 # Main application entry point
├── mss.js                 # Core scraper logic
├── genre/                 # Genre-based scraping
├── title/                 # Title-based scraping (currently disabled)
├── getScript/             # Script retrieval logic
└── helper/                # Utility functions

tests/
├── unit/                  # Unit tests
├── integration/           # Integration tests
├── e2e/                   # End-to-end tests
└── fixtures/              # Test data and mocks

config/
└── jest.config.js         # Jest configuration

Contributing

We welcome contributions! Please read our Contributing Guidelines for detailed information about our development process.

Quick Start for Contributors

  1. Fork the repository and clone your fork
  2. Install dependencies: npm install
  3. Create a feature branch: git checkout -b feature/your-feature-name
  4. Make your changes following our coding standards
  5. Run the health check: npm run healthcheck
  6. Commit your changes: git commit -m "Add your feature"
  7. Push to your fork: git push origin feature/your-feature-name
  8. Create a Pull Request with a clear description

Code Standards

  • Follow ESLint configuration (Airbnb base)
  • Use Prettier for code formatting
  • Write tests for new features
  • Maintain test coverage above 80%
  • Update documentation as needed

Development Setup

# Clone and setup
git clone https://github.com/your-username/movie-script-scraper.git
cd movie-script-scraper
npm install

# Run development mode
npm run dev

Troubleshooting

Common Issues

Installation Problems

  • Node.js version: Ensure you're using Node.js 16.0.0 or higher
  • Permission errors: For global installations, use nvm to manage Node versions so that npm does not need elevated permissions, or fall back to sudo

Test Failures

  • Network timeouts: Integration tests may fail due to network issues. Try running npm run test:unit first
  • Coverage issues: Ensure all new code is properly tested to meet the 80% coverage threshold

Script Download Issues

  • Invalid genre: Check that the genre matches one of the entries under Available Genres
  • Network errors: The scraper depends on IMSDB availability. Check if the site is accessible
  • File permission errors: Ensure the destination directory is writable

Development Issues

  • Linting errors: Run npm run lint:fix to auto-fix common issues
  • Formatting issues: Run npm run format to format your code
  • Test watching not working: Try npm run clean and restart the watch mode

Getting Help

  • Check existing issues
  • Create a new issue with detailed information about your problem
  • Include error messages, Node.js version, and steps to reproduce

Maintainers


Joe Karlsson

License
