Movie Script Scraper

A high-performance scraper to retrieve movie scripts by genre from IMSDB.

Features

🎬 Genre-based scraping - Download multiple scripts from any movie genre
🌐 All-scripts scraping - Download from IMSDB's complete script database
⚡ High performance - Parallel processing with optimized concurrency control
🔄 Smart retry logic - Robust error handling with exponential backoff
📁 Organized output - Scripts saved in structured directories
🧪 Comprehensive testing - Unit, integration, and e2e test coverage
🔧 Developer-friendly - Modern tooling with ESLint, Prettier, and Jest
📊 Quality assurance - 80%+ test coverage with automated quality checks
🚀 CLI support - Run directly from command line or as a Node.js module

Installation

npm install -S movie-script-scraper

Usage

Example Usage

Movie Script Scraper exports a single function. Pass it an options object (see Options below), and it returns a promise that resolves to an array of file paths for the scripts it saved.

const mss = require('movie-script-scraper');

const options = {
 genre: 'Action',
 total: 10,
};

mss(options)
 .then(filePaths => {
  console.log(filePaths);
 })
 .catch(err => {
  // Log the underlying error so failures can be diagnosed
  console.error('There was a problem:', err);
 });

Options

genre [string] - Any valid film genre from the following list:
- Action | Adventure | Animation | Comedy | Crime | Drama
- Family | Fantasy | Film-Noir | Horror | Musical | Mystery
- Romance | Sci-Fi | Short | Thriller | War | Western
- Defaults to "Action".
all [boolean] - Download from IMSDB's complete script database (all-scripts.html).
- Defaults to false.
total [number] - The total number of scripts to download.
- Defaults to 10 for genre-based, 50 for all-scripts.
dest [string] - Directory where the scripts are saved.
- Defaults to ./scripts in the root directory.
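
The defaults listed above can be summarized in code. This is a minimal sketch for illustration, not the package's own implementation; `resolveOptions` is a hypothetical helper name that does not appear in the project.

```javascript
// Hypothetical helper: applies the documented defaults to a partial options object.
// It only illustrates the defaults listed above and is not the package's own code.
function resolveOptions(options = {}) {
  const all = options.all === true;
  return {
    // How `genre` interacts with `all` is not specified above,
    // so it is simply passed through (or defaulted) here.
    genre: options.genre || 'Action',
    all,
    // Documented default: 10 for genre-based scraping, 50 for all-scripts.
    total: options.total !== undefined ? options.total : (all ? 50 : 10),
    dest: options.dest || './scripts',
  };
}

console.log(resolveOptions());
console.log(resolveOptions({ all: true, dest: './all-scripts' }));
```

Calling `resolveOptions()` with no arguments yields the Action genre, 10 scripts, and the `./scripts` destination, matching the defaults above.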

Note: Title-based scraping has been disabled due to IMSDB URL structure limitations. Genre-based scraping and the --all option are more reliable and efficient.

Available Genres

The scraper supports downloading scripts from the following movie genres:

Action	Adventure	Animation	Comedy	Crime	Drama
Family	Fantasy	Film-Noir	Horror	Musical	Mystery
Romance	Sci-Fi	Short	Thriller	War	Western

Running from command line

You can run the Movie Script Scraper directly from the CLI with a variety of options, provided it is available in your PATH (e.g., after npm install -g movie-script-scraper).

# Download Comedy scripts
movie-script-scraper --total 10 --genre Comedy

# Download Action scripts to custom directory
movie-script-scraper --genre Action --total 5 --dest ./my-scripts

# Download Horror scripts
movie-script-scraper --genre Horror --total 8

# Download from all available scripts
movie-script-scraper --all --total 20 --dest ./all-scripts

# Use defaults (Action genre, 10 scripts)
movie-script-scraper
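
The flags above map onto the same options object used by the module API. The sketch below shows one way such a mapping could look. It is an assumption for illustration only: the project lists minimist for argument parsing, while this example uses Node's built-in `util.parseArgs` (available in Node.js 18.3+ and 16.17+) to stay dependency-free, and `parseCliOptions` is a hypothetical name.

```javascript
const { parseArgs } = require('node:util');

// Hypothetical helper: turns CLI arguments into the options object described above.
// The project lists minimist for this job; util.parseArgs is used here only to
// keep the sketch free of third-party dependencies.
function parseCliOptions(argv) {
  const { values } = parseArgs({
    args: argv,
    options: {
      genre: { type: 'string' },
      total: { type: 'string' }, // parseArgs has no number type; converted below
      dest: { type: 'string' },
      all: { type: 'boolean' },
    },
  });

  // Copy into a plain object and convert --total to a number when present.
  const options = { ...values };
  if (values.total !== undefined) {
    options.total = Number(values.total);
  }
  return options;
}

console.log(parseCliOptions(['--all', '--total', '20', '--dest', './all-scripts']));
```

Flags that are not passed are simply absent from the result, so the documented defaults can still be applied afterwards.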

How it Works

The Movie Script Scraper combines IMSDB's RSS feeds with web scraping of the individual script pages:

Technical Process

RSS Feed Parsing: IMSDB provides an RSS feed for each movie genre (e.g., http://www.imsdb.com/feeds/genre.php?genre=Comedy)
URL Extraction: node-fetch retrieves the RSS feed, and regex patterns extract the movie script URLs from it
Script Scraping: Each script URL is fetched with node-fetch and parsed with cheerio to extract the script content
File Management: Scripts are saved to organized directories with consistent file names
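
The URL extraction step can be sketched as follows. The feed contents, the regex, and the `extractScriptUrls` name are all assumptions made for illustration; the real feed's structure and the project's actual patterns may differ. No network request is made here, so the example runs against an inline sample string.

```javascript
// Hypothetical sample feed; the real IMSDB feed's structure may differ.
const sampleFeed = `
<rss><channel>
  <title>Example feed</title>
  <link>http://www.imsdb.com</link>
  <item><title>Example One</title><link>http://www.imsdb.com/example-one.html</link></item>
  <item><title>Example Two</title><link>http://www.imsdb.com/example-two.html</link></item>
</channel></rss>`;

// Hypothetical helper: collects the first <link> inside each <item> element.
// The channel-level <link> is skipped because it is not inside an <item>.
function extractScriptUrls(feedXml) {
  const itemLink = /<item>[\s\S]*?<link>([^<]+)<\/link>/g;
  return Array.from(feedXml.matchAll(itemLink), match => match[1].trim());
}

console.log(extractScriptUrls(sampleFeed));
```

A regex is enough for a feed this regular, but a dedicated XML parser would be more robust if the feed contained CDATA sections or nested elements.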

Architecture

The scraper supports two high-performance scraping modes:

Genre-Based Scraping

Fetches multiple scripts from a specified genre
Uses RSS feeds to discover available scripts
Parallel processing with configurable concurrency (default: 5 concurrent downloads)
Smart selection - pre-filters URLs to avoid unnecessary downloads
Retry logic with exponential backoff for robust error handling
Memory optimization - streaming for large responses
Supports configurable download limits
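
Parallel processing with a concurrency cap, as described in the list above, can be sketched with a small worker pool. This is one common way to do it and not necessarily how the project implements it; `mapWithConcurrency` and `fakeDownload` are hypothetical names, and the download is simulated with a timer.

```javascript
// Hypothetical helper: runs `worker` over `items` with at most `limit` tasks in flight.
// Results are returned in the same order as the input items.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let nextIndex = 0;

  // Each lane repeatedly claims the next unprocessed index until none remain.
  async function runLane() {
    while (nextIndex < items.length) {
      const index = nextIndex;
      nextIndex += 1;
      results[index] = await worker(items[index], index);
    }
  }

  const laneCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: laneCount }, runLane));
  return results;
}

// Simulated download: resolves after a short delay instead of hitting the network.
const fakeDownload = url =>
  new Promise(resolve => setTimeout(() => resolve(`saved:${url}`), 10));

mapWithConcurrency(['a', 'b', 'c'], 5, fakeDownload).then(paths => console.log(paths));
```

With this pattern a new download starts as soon as any lane becomes free, so the cap is respected without waiting for whole batches to finish. If a worker rejects, the returned promise rejects with that error.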

All-Scripts Scraping

Fetches scripts from IMSDB's complete database (all-scripts.html)
Uses comprehensive alphabetical script list
Same performance optimizations as genre-based scraping
Larger default limit (50 scripts) due to extensive available content
Organized by "All" genre for easy file management
Perfect for discovering diverse script content

Performance Features

Concurrent Downloads: Multiple scripts download simultaneously
Smart URL Selection: Only downloads the exact number requested
Robust Error Handling: Automatic retries with exponential backoff
Memory Efficient: Streaming for large HTML responses
Progress Reporting: Real-time feedback during downloads
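
Retrying with exponential backoff, mentioned in the list above, means waiting longer after each failed attempt before trying again. The sketch below illustrates the idea; the retry count, base delay, and the `withRetry` name are assumptions and not the project's actual settings.

```javascript
// Hypothetical helper: retries an async task, doubling the wait after each failure.
// `retries` is the number of retries after the first attempt.
async function withRetry(task, { retries = 3, baseDelayMs = 500 } = {}) {
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await task(attempt);
    } catch (err) {
      // Give up once the retry budget is spent and surface the last error.
      if (attempt >= retries) throw err;
      const delayMs = baseDelayMs * 2 ** attempt; // 500, 1000, 2000, ... with the defaults
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
}

// Simulated flaky task: fails twice, then succeeds.
let calls = 0;
const flaky = async () => {
  calls += 1;
  if (calls < 3) throw new Error('temporary failure');
  return 'ok';
};

withRetry(flaky, { retries: 3, baseDelayMs: 10 }).then(result => console.log(result, calls));
```

In a scraper, the task would typically wrap a single HTTP request, so that a transient network error costs only one script's retry rather than the whole run.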

Dependencies

cheerio: Server-side jQuery implementation for HTML parsing
node-fetch: Lightweight HTTP client for web requests
lodash: Utility library for data manipulation
minimist: Command-line argument parsing
mkdirp: Directory creation utility

Running Locally

Install dependencies with:
```
npm install
```
Run the tests with:
```
npm test
```

Available Scripts

The project includes several npm scripts for development and testing:

Testing Scripts

npm test - Run all tests with coverage
npm run test:unit - Run only unit tests
npm run test:integration - Run only integration tests
npm run test:e2e - Run only end-to-end tests
npm run test:watch - Run tests in watch mode
npm run test:coverage - Run tests with detailed coverage report

Development Scripts

npm start - Run the application with default settings
npm run start:title - Run with a specific title (e.g., 'frozen'); note that title-based scraping is currently disabled
npm run dev - Start development mode with test watching

Code Quality Scripts

npm run lint - Run ESLint on source code
npm run lint:fix - Run ESLint and fix auto-fixable issues
npm run format - Format code with Prettier
npm run format:check - Check if code is properly formatted

Maintenance Scripts

npm run healthcheck - Run linting and full test suite
npm run clean - Clean coverage and cache directories
npm run audit:fix - Fix security vulnerabilities
npm run update:deps - Update dependencies

Testing

This project uses Jest for testing with comprehensive coverage across different test types:

Test Structure

Unit Tests (tests/unit/) - Test individual functions and modules in isolation
Integration Tests (tests/integration/) - Test component interactions and API calls
End-to-End Tests (tests/e2e/) - Test complete application workflows

Coverage Requirements

The project maintains high code quality with coverage thresholds:

Branches: 80%
Functions: 80%
Lines: 80%
Statements: 80%
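
In Jest, thresholds like these are normally expressed through the `coverageThreshold` option. The fragment below is a sketch of what the relevant part of a Jest config could look like; it is not copied from the project's config/jest.config.js, which may contain other settings.

```javascript
// Sketch of the coverage threshold section of a Jest config file.
// Jest fails the test run if global coverage drops below any of these percentages.
const config = {
  coverageThreshold: {
    global: {
      branches: 80,
      functions: 80,
      lines: 80,
      statements: 80,
    },
  },
};

module.exports = config;
```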

Running Tests

# Run all tests with coverage
npm test

# Run specific test suites
npm run test:unit
npm run test:integration
npm run test:e2e

# Run tests in watch mode for development
npm run test:watch

# Generate detailed coverage report
npm run test:coverage

Test Configuration

Tests are configured in config/jest.config.js with:

Babel transformation for modern JavaScript
Module path mapping for cleaner imports
Global test utilities and mocks
Extended timeouts for integration tests

Development Workflow

Prerequisites

Node.js: Version 16.0.0 or higher
npm: Latest version recommended

Code Quality Tools

ESLint: Code linting with Airbnb base configuration
Prettier: Code formatting for consistent style
Husky: Git hooks for pre-commit and pre-push validation

Development Process

Fork and clone the repository
Install dependencies: npm install
Create feature branch: git checkout -b feature-name
Make changes following the coding standards
Run tests: npm run healthcheck
Commit changes: Git hooks will run linting and tests
Push and create PR: Submit pull request for review

Project Structure

src/
├── app.js                 # Main application entry point
├── mss.js                 # Core scraper logic
├── genre/                 # Genre-based scraping
├── title/                 # Title-based scraping (currently disabled)
├── getScript/             # Script retrieval logic
└── helper/                # Utility functions

tests/
├── unit/                  # Unit tests
├── integration/           # Integration tests
├── e2e/                   # End-to-end tests
└── fixtures/              # Test data and mocks

config/
└── jest.config.js         # Jest configuration

Contributing

We welcome contributions! Please read our Contributing Guidelines for detailed information about our development process.

Quick Start for Contributors

Fork the repository and clone your fork
Install dependencies: npm install
Create a feature branch: git checkout -b feature/your-feature-name
Make your changes following our coding standards
Run the health check: npm run healthcheck
Commit your changes: git commit -m "Add your feature"
Push to your fork: git push origin feature/your-feature-name
Create a Pull Request with a clear description

Code Standards

Follow ESLint configuration (Airbnb base)
Use Prettier for code formatting
Write tests for new features
Maintain test coverage above 80%
Update documentation as needed

Development Setup

# Clone and setup
git clone https://github.com/your-username/movie-script-scraper.git
cd movie-script-scraper
npm install

# Run development mode
npm run dev

Troubleshooting

Common Issues

Installation Problems

Node.js version: Ensure you're using Node.js 16.0.0 or higher
Permission errors: Try using sudo for global installations or use nvm to manage Node versions

Test Failures

Network timeouts: Integration tests may fail due to network issues. Try running npm run test:unit first
Coverage issues: Ensure all new code is properly tested to meet the 80% coverage threshold

Script Download Issues

Invalid genre: Make sure the genre matches one of the entries under Available Genres
Network errors: The scraper depends on IMSDB availability. Check if the site is accessible
File permission errors: Ensure the destination directory is writable

Development Issues

Linting errors: Run npm run lint:fix to auto-fix common issues
Formatting issues: Run npm run format to format your code
Test watching not working: Try npm run clean and restart the watch mode

Getting Help

Check existing issues
Create a new issue with detailed information about your problem
Include error messages, Node.js version, and steps to reproduce

Maintainers

Joe Karlsson

License

MIT
