A high-performance scraper to retrieve movie scripts by genre from IMSDB.
- 🎬 Genre-based scraping - Download multiple scripts from any movie genre
- 🌐 All-scripts scraping - Download from IMSDB's complete script database
- ⚡ High performance - Parallel processing with optimized concurrency control
- 🔄 Smart retry logic - Robust error handling with exponential backoff
- 📁 Organized output - Scripts saved in structured directories
- 🧪 Comprehensive testing - Unit, integration, and e2e test coverage
- 🔧 Developer-friendly - Modern tooling with ESLint, Prettier, and Jest
- 📊 Quality assurance - 80%+ test coverage with automated quality checks
- 🚀 CLI support - Run directly from command line or as a Node.js module
npm install -S movie-script-scraper

Movie Script Scraper exposes a function; simply pass this function the options (see below), and it will return a promise with an array of the file paths of the scripts it saved.
const mss = require('movie-script-scraper');

const options = {
  genre: 'Action',
  total: 10,
};

mss(options)
  .then(filePaths => {
    console.log(filePaths);
  })
  .catch(err => {
    // Log the underlying error so failures can be diagnosed
    console.error('There was a problem', err);
  });

- genre [string] - Any valid film genre from the following list:
  - Action | Adventure | Animation | Comedy | Crime | Drama
  - Family | Fantasy | Film-Noir | Horror | Musical | Mystery
  - Romance | Sci-Fi | Short | Thriller | War | Western
  - Defaults to "Action".
- all [boolean] - Download from IMSDB's complete script database (all-scripts.html).
  - Defaults to false.
- total [number] - The total number of scripts you want to download.
  - Defaults to 10 for genre-based scraping, 50 for all-scripts scraping.
- dest [string] - Location where you want to save your scripts.
  - Defaults to ./scripts in the root directory.
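The defaults listed above can be illustrated with a small sketch. `resolveOptions` and `VALID_GENRES` are hypothetical names used only for illustration; the package may normalize its options differently.

```javascript
// Hypothetical helper, not part of the package's documented API.
const VALID_GENRES = [
  'Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Drama',
  'Family', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery',
  'Romance', 'Sci-Fi', 'Short', 'Thriller', 'War', 'Western',
];

function resolveOptions(options = {}) {
  const all = options.all === true;
  const genre = options.genre || 'Action';

  // The genre only matters for genre-based scraping.
  if (!all && !VALID_GENRES.includes(genre)) {
    throw new Error(`Invalid genre: ${genre}`);
  }

  return {
    all,
    genre,
    // Documented defaults: 10 for genre-based, 50 for all-scripts.
    total: options.total || (all ? 50 : 10),
    dest: options.dest || './scripts',
  };
}

console.log(resolveOptions({ genre: 'Comedy' }));
console.log(resolveOptions({ all: true }));
```

Validating the genre up front, before any network request, would make a typo fail fast instead of producing an empty download.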
Note: Title-based scraping has been disabled due to IMSDB URL structure limitations. Genre-based scraping and the --all option are more reliable and efficient.
The scraper supports downloading scripts from the following movie genres:
| Action | Adventure | Animation | Comedy | Crime | Drama |
|---|---|---|---|---|---|
| Family | Fantasy | Film-Noir | Horror | Musical | Mystery |
| Romance | Sci-Fi | Short | Thriller | War | Western |
You can run the Movie Script Scraper directly from the CLI (if it's globally available in your PATH, e.g. after npm install -g movie-script-scraper) with a variety of useful options.
# Download Comedy scripts
movie-script-scraper --total 10 --genre Comedy
# Download Action scripts to custom directory
movie-script-scraper --genre Action --total 5 --dest ./my-scripts
# Download Horror scripts
movie-script-scraper --genre Horror --total 8
# Download from all available scripts
movie-script-scraper --all --total 20 --dest ./all-scripts
# Use defaults (Action genre, 10 scripts)
movie-script-scraper

The Movie Script Scraper works by leveraging IMSDB's RSS feeds and web scraping capabilities:
- RSS Feed Parsing: IMSDB provides RSS feeds based on movie genre (e.g., http://www.imsdb.com/feeds/genre.php?genre=Comedy)
- URL Extraction: Using node-fetch to retrieve the RSS feed and regex patterns to extract movie script URLs
- Script Scraping: Each script URL is visited using node-fetch and parsed with cheerio to extract the actual script content
- File Management: Scripts are saved to organized directories with proper file naming conventions
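The URL extraction step can be sketched roughly as follows. The sample feed is invented for illustration, and `extractScriptUrls` is a hypothetical helper; the real IMSDB feed layout and the package's actual regex may differ.

```javascript
// Invented sample feed; the real IMSDB feed may be structured differently.
const sampleFeed = `
<rss><channel>
  <title>Comedy scripts</title>
  <link>http://www.imsdb.com</link>
  <item><title>Example One</title><link>http://www.imsdb.com/scripts/Example-One.html</link></item>
  <item><title>Example Two</title><link>http://www.imsdb.com/scripts/Example-Two.html</link></item>
  <item><title>Example Three</title><link>http://www.imsdb.com/scripts/Example-Three.html</link></item>
</channel></rss>`;

// Hypothetical helper: collect the <link> of each <item>, capped at `total`.
function extractScriptUrls(feedXml, total) {
  const itemLink = /<item>[\s\S]*?<link>([\s\S]*?)<\/link>[\s\S]*?<\/item>/g;
  const urls = [];
  for (const match of feedXml.matchAll(itemLink)) {
    if (urls.length >= total) break; // enough URLs selected, stop early
    urls.push(match[1].trim());
  }
  return urls;
}

console.log(extractScriptUrls(sampleFeed, 2));
```

Anchoring the pattern on `<item>` skips the channel-level `<link>`, and capping the result at `total` means only the requested number of script pages is fetched afterwards.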
The scraper supports two high-performance scraping modes:
Genre-based scraping:
- Fetches multiple scripts from a specified genre
- Uses RSS feeds to discover available scripts
- Parallel processing with configurable concurrency (default: 5 concurrent downloads)
- Smart selection - pre-filters URLs to avoid unnecessary downloads
- Retry logic with exponential backoff for robust error handling
- Memory optimization - streaming for large responses
- Supports configurable download limits
All-scripts scraping:
- Fetches scripts from IMSDB's complete database (all-scripts.html)
- Uses comprehensive alphabetical script list
- Same performance optimizations as genre-based scraping
- Larger default limit (50 scripts) due to extensive available content
- Organized by "All" genre for easy file management
- Perfect for discovering diverse script content
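Both modes download in parallel under a concurrency cap (5 by default, per the list above). One way such a cap could work is a small worker pool, sketched here. `mapWithConcurrency` and `fakeDownload` are hypothetical names, and the package's own implementation may differ.

```javascript
// Hypothetical helper: run `worker` over `items`, at most `limit` at a time.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each runner repeatedly claims the next unprocessed index until none remain.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runners = Array.from({ length: Math.min(limit, items.length) }, runner);
  await Promise.all(runners);
  return results; // same order as `items`
}

// Stand-in for a download: resolves after a short delay instead of using the network.
const fakeDownload = url =>
  new Promise(resolve => setTimeout(() => resolve(`saved:${url}`), 10));

mapWithConcurrency(['a', 'b', 'c', 'd', 'e', 'f', 'g'], 5, fakeDownload)
  .then(paths => console.log(paths));
```

Because results are written by index, the returned array keeps the input order even though downloads finish at different times.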
- Concurrent Downloads: Multiple scripts download simultaneously
- Smart URL Selection: Only downloads the exact number requested
- Robust Error Handling: Automatic retries with exponential backoff
- Memory Efficient: Streaming for large HTML responses
- Progress Reporting: Real-time feedback during downloads
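Retrying with exponential backoff, as mentioned above, generally means waiting longer after each failed attempt. A minimal sketch, assuming a doubling delay; `withRetry`, its option names, and its default values are hypothetical rather than the package's actual settings.

```javascript
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Hypothetical helper: retry `task` up to `retries` extra times,
// doubling the wait after each failure (baseDelayMs, 2x, 4x, ...).
async function withRetry(task, { retries = 3, baseDelayMs = 100 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      if (attempt >= retries) throw err; // out of attempts: surface the error
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
}

// Stand-in for a flaky request: fails twice, then succeeds.
let calls = 0;
const flaky = async () => {
  calls += 1;
  if (calls < 3) throw new Error(`temporary failure #${calls}`);
  return 'script contents';
};

withRetry(flaky, { baseDelayMs: 10 }).then(result => console.log(result, calls));
```

In a scraper, `task` would wrap a single HTTP request, so one slow or failing page is retried without restarting the whole batch.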
- cheerio: Server-side jQuery implementation for HTML parsing
- node-fetch: Lightweight HTTP client for web requests
- lodash: Utility library for data manipulation
- minimist: Command-line argument parsing
- mkdirp: Directory creation utility
- Install dependencies with:
  npm install
- Run tests:
  npm test
The project includes several npm scripts for development and testing:
- npm test - Run all tests with coverage
- npm run test:unit - Run only unit tests
- npm run test:integration - Run only integration tests
- npm run test:e2e - Run only end-to-end tests
- npm run test:watch - Run tests in watch mode
- npm run test:coverage - Run tests with detailed coverage report

- npm start - Run the application with default settings
- npm run start:title - Run with a specific title (e.g., 'frozen')
- npm run dev - Start development mode with test watching

- npm run lint - Run ESLint on source code
- npm run lint:fix - Run ESLint and fix auto-fixable issues
- npm run format - Format code with Prettier
- npm run format:check - Check if code is properly formatted

- npm run healthcheck - Run linting and full test suite
- npm run clean - Clean coverage and cache directories
- npm run audit:fix - Fix security vulnerabilities
- npm run update:deps - Update dependencies
This project uses Jest for testing with comprehensive coverage across different test types:
- Unit Tests (tests/unit/) - Test individual functions and modules in isolation
- Integration Tests (tests/integration/) - Test component interactions and API calls
- End-to-End Tests (tests/e2e/) - Test complete application workflows
The project maintains high code quality with coverage thresholds:
- Branches: 80%
- Functions: 80%
- Lines: 80%
- Statements: 80%
# Run all tests with coverage
npm test
# Run specific test suites
npm run test:unit
npm run test:integration
npm run test:e2e
# Run tests in watch mode for development
npm run test:watch
# Generate detailed coverage report
npm run test:coverage

Tests are configured in config/jest.config.js with:
- Babel transformation for modern JavaScript
- Module path mapping for cleaner imports
- Global test utilities and mocks
- Extended timeouts for integration tests
- Node.js: Version 16.0.0 or higher
- npm: Latest version recommended
- ESLint: Code linting with Airbnb base configuration
- Prettier: Code formatting for consistent style
- Husky: Git hooks for pre-commit and pre-push validation
- Fork and clone the repository
- Install dependencies: npm install
- Create feature branch: git checkout -b feature-name
- Make changes following the coding standards
- Run tests: npm run healthcheck
- Commit changes: Git hooks will run linting and tests
- Push and create PR: Submit pull request for review
src/
├── app.js # Main application entry point
├── mss.js # Core scraper logic
├── genre/ # Genre-based scraping
├── title/ # Title-based scraping
├── getScript/ # Script retrieval logic
└── helper/ # Utility functions
tests/
├── unit/ # Unit tests
├── integration/ # Integration tests
├── e2e/ # End-to-end tests
└── fixtures/ # Test data and mocks
config/
└── jest.config.js # Jest configuration
We welcome contributions! Please read our Contributing Guidelines for detailed information about our development process.
- Fork the repository and clone your fork
- Install dependencies: npm install
- Create a feature branch: git checkout -b feature/your-feature-name
- Make your changes following our coding standards
- Run the health check: npm run healthcheck
- Commit your changes: git commit -m "Add your feature"
- Push to your fork: git push origin feature/your-feature-name
- Create a Pull Request with a clear description
- Follow ESLint configuration (Airbnb base)
- Use Prettier for code formatting
- Write tests for new features
- Maintain test coverage above 80%
- Update documentation as needed
# Clone and setup
git clone https://github.com/your-username/movie-script-scraper.git
cd movie-script-scraper
npm install
# Run development mode
npm run dev

- Node.js version: Ensure you're using Node.js 16.0.0 or higher
- Permission errors: Try using sudo for global installations or use nvm to manage Node versions
- Network timeouts: Integration tests may fail due to network issues. Try running npm run test:unit first
- Coverage issues: Ensure all new code is properly tested to meet the 80% coverage threshold
- Invalid genre: Check the valid genres list for supported genres
- Network errors: The scraper depends on IMSDB availability. Check if the site is accessible
- File permission errors: Ensure the destination directory is writable
- Linting errors: Run npm run lint:fix to auto-fix common issues
- Formatting issues: Run npm run format to format your code
- Test watching not working: Try npm run clean and restart the watch mode
- Check existing issues
- Create a new issue with detailed information about your problem
- Include error messages, Node.js version, and steps to reproduce
Joe Karlsson

