
Simple Web Scraper


Live Documentation

A lightweight and efficient web scraping package for JavaScript/TypeScript applications. This package helps developers fetch HTML content, parse web pages, and extract data effortlessly.


✨ Features

  • ✅ Fetch Web Content – Retrieve HTML from any URL with ease.
  • ✅ Parse and Extract Data – Utilize integrated parsing tools to extract information.
  • ✅ Configurable Options – Customize scraping behavior using CSS selectors.
  • ✅ Headless Browser Support – Optionally use Puppeteer for JavaScript-rendered pages.
  • ✅ Lightweight & Fast – Uses Cheerio for static HTML scraping.
  • ✅ TypeScript Support – Fully typed for robust development.
  • ✅ Data Export Support – Export scraped data in JSON or CSV formats.
  • ✅ CSV Import Support – Read CSV files and convert them to JSON.

📚 Installation

Install via npm:

npm install simple-web-scraper

or using Yarn:

yarn add simple-web-scraper

🚀 Why Use Cheerio and Puppeteer?

This package leverages Cheerio and Puppeteer for powerful web scraping capabilities:

🔹 Cheerio (Fast and Lightweight)

  • Ideal for static HTML parsing (like jQuery for the backend).
  • Extremely fast and lightweight – perfect for pages without JavaScript rendering.
  • Provides easy CSS selector querying for extracting structured data.
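
For context, here is a minimal sketch of the Cheerio approach on its own, using cheerio directly rather than this package (assumes Node 18+ for the global fetch):

import * as cheerio from 'cheerio';

(async () => {
  const res = await fetch('https://example.com');
  const $ = cheerio.load(await res.text()); // jQuery-like API over the fetched HTML
  console.log($('h1').first().text()); // extract data with a CSS selector
})();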

🔹 Puppeteer (Headless Browser Automation)

  • Handles JavaScript-rendered pages – essential for scraping dynamic content.
  • Can interact with pages, click buttons, and fill out forms.
  • Allows screenshot capturing, PDF generation, and full-page automation.
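
And a minimal sketch of the Puppeteer approach on its own, again using the library directly rather than this package's internals:

import puppeteer from 'puppeteer';

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com', { waitUntil: 'networkidle0' }); // wait for JS-rendered content to settle
  const heading = await page.$eval('h1', (el) => el.textContent); // run a selector inside the live page
  console.log(heading);
  await browser.close();
})();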

✅ Best of Both Worlds

  • Use Cheerio for speed when scraping static pages.
  • Switch to Puppeteer for JavaScript-heavy sites requiring full rendering.
  • Provides flexibility to choose the best approach for your project.

✅ API Reference

WebScraper Class

new WebScraper(options?: ScraperOptions)

📊 Props

Parameter    | Type                   | Description
usePuppeteer | boolean (optional)     | Whether to use Puppeteer (default: true)
throttle     | number (optional)      | Delay in milliseconds between requests (default: 1000)
rules        | Record<string, string> | CSS selectors defining data extraction rules
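
All three options can be combined. A short sketch with illustrative values, using the option names from the table above:

import { WebScraper } from 'simple-web-scraper';

const scraper = new WebScraper({
  usePuppeteer: false, // static pages: fast Cheerio path
  throttle: 2000, // wait 2 seconds between requests
  rules: { title: 'h1' }, // CSS selector rules keyed by output field name
});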

Methods

scrape(url: string): Promise<Record<string, any>>

  • Scrapes the given URL based on the configured options.

exportToJSON(data: any, filePath: string): void

  • Exports the given data to a JSON file.

exportToCSV(data: any | any[], filePath: string): void

  • Exports the given data to a CSV file.

πŸ› οΈ Basic Usage

1. Scraping Web Pages

You can scrape web pages using either Puppeteer (for JavaScript-heavy pages) or Cheerio (for static HTML pages).

import { WebScraper } from 'simple-web-scraper';

const scraper = new WebScraper({
  usePuppeteer: false, // Set to true for dynamic pages
  rules: {
    title: 'h1',
    description: 'meta[name="description"]',
  },
});

(async () => {
  const data = await scraper.scrape('https://example.com');
  console.log(data);
})();

2. Using Puppeteer for JavaScript-heavy Pages

To scrape pages that require JavaScript execution:

const scraper = new WebScraper({
  usePuppeteer: true, // Enable Puppeteer for JavaScript-rendered content
  rules: {
    heading: 'h1',
    price: '.product-price',
  },
});

(async () => {
  const data = await scraper.scrape('https://example.com/product');
  console.log(data);
})();

3. Exporting Data

  • Scraped data can be exported to JSON or CSV files using utility functions.

Export to JSON

import { exportToJSON } from 'simple-web-scraper';

const data = { name: 'Example', value: 42 };
exportToJSON(data, 'output.json');

Export to CSV

import { exportToCSV } from 'simple-web-scraper';

const data = [
  { name: 'Example 1', value: 42 },
  { name: 'Example 2', value: 99 },
];
exportToCSV(data, 'output.csv');

// Preserve null and undefined values as null
exportToCSV(data, 'output.csv', { preserveNulls: true });
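
Import from CSV

The features list also mentions reading CSV files back into JSON. This sketch assumes an importer named importFromCSV – a hypothetical name, so check the package's API reference for the actual export:

import { importFromCSV } from 'simple-web-scraper'; // hypothetical name – verify against the real API

const rows = importFromCSV('output.csv'); // read CSV rows back as JSON objects
console.log(rows);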

🖥 Backend Example - Module (import)

This example demonstrates how to use simple-web-scraper in a Node.js backend:

import express from 'express';
import { WebScraper, exportToJSON, exportToCSV } from 'simple-web-scraper';

const app = express();
const scraper = new WebScraper({
  usePuppeteer: true,
  rules: { title: 'h1', content: 'p' },
});

app.get('/scrape-example', async (req, res) => {
  try {
    const url = 'https://github.com/The-Node-Forge';
    const data = await scraper.scrape(url);

    exportToJSON(data, 'output.json'); // export JSON
    exportToCSV(data, 'output.csv', { preserveNulls: true }); // export CSV

    res.status(200).json({ success: true, data });
  } catch (error) {
    res.status(500).json({ success: false, error: error.message });
  }
});

app.listen(3000); // start the server so the route above is reachable

🖥 Backend Example - CommonJS (require)

This example demonstrates how to use simple-web-scraper in a Node.js backend:

const express = require('express');
const {
  WebScraper,
  exportToJSON,
  exportToCSV,
} = require('@the-node-forge/simple-web-scraper/dist');

const app = express();

const scraper = new WebScraper({
  usePuppeteer: true,
  rules: {
    fullHTML: 'html', // Entire page HTML
    title: 'head > title', // Page title
    description: 'meta[name="description"]', // Meta description
    keywords: 'meta[name="keywords"]', // Meta keywords
    favicon: 'link[rel="icon"]', // Favicon URL
    mainHeading: 'h1', // First H1 heading
    allHeadings: 'h1, h2, h3, h4, h5, h6', // All headings on the page
    firstParagraph: 'p', // First paragraph
    allParagraphs: 'p', // All paragraphs on the page
    links: 'a', // All links on the page
    images: 'img', // All image URLs
    imageAlts: 'img', // Alternative text for images
    videos: 'video, iframe[src*="youtube.com"], iframe[src*="vimeo.com"]', // Video sources
    tables: 'table', // Capture table elements
    tableData: 'td', // Capture table cells
    lists: 'ul, ol', // Capture all lists
    listItems: 'li', // Capture all list items
    scripts: 'script', // JavaScript file sources
    stylesheets: 'link[rel="stylesheet"]', // External CSS files
    structuredData: 'script[type="application/ld+json"]', // JSON-LD structured data
    socialLinks:
      'a[href*="facebook.com"], a[href*="twitter.com"], a[href*="linkedin.com"], a[href*="instagram.com"]', // Social media links
    author: 'meta[name="author"]', // Author meta tag
    publishDate: 'meta[property="article:published_time"], time', // Publish date
    modifiedDate: 'meta[property="article:modified_time"]', // Last modified date
    canonicalURL: 'link[rel="canonical"]', // Canonical URL
    openGraphTitle: 'meta[property="og:title"]', // OpenGraph title
    openGraphDescription: 'meta[property="og:description"]', // OpenGraph description
    openGraphImage: 'meta[property="og:image"]', // OpenGraph image
    twitterCard: 'meta[name="twitter:card"]', // Twitter card type
    twitterTitle: 'meta[name="twitter:title"]', // Twitter title
    twitterDescription: 'meta[name="twitter:description"]', // Twitter description
    twitterImage: 'meta[name="twitter:image"]', // Twitter image
  },
});

app.get('/test-scraper', async (req, res) => {
  try {
    const url = 'https://github.com/The-Node-Forge';
    const data = await scraper.scrape(url);

    exportToJSON(data, 'output.json'); // export JSON
    exportToCSV(data, 'output.csv'); // export CSV

    res.status(200).json({ success: true, data });
  } catch (error) {
    res.status(500).json({ success: false, error: error.message });
  }
});

app.listen(3000); // start the server so the route above is reachable

πŸ› οΈ Full Usage Example

import { WebScraper } from 'simple-web-scraper';

const scraper = new WebScraper({
  usePuppeteer: true, // Set to false if scraping static pages
  rules: {
    fullHTML: 'html', // Entire page HTML
    title: 'head > title', // Page title
    description: 'meta[name="description"]', // Meta description
    keywords: 'meta[name="keywords"]', // Meta keywords
    favicon: 'link[rel="icon"]', // Favicon URL
    mainHeading: 'h1', // First H1 heading
    allHeadings: 'h1, h2, h3, h4, h5, h6', // All headings on the page
    firstParagraph: 'p', // First paragraph
    allParagraphs: 'p', // All paragraphs on the page
    links: 'a', // All links on the page
    images: 'img', // All image URLs
    imageAlts: 'img', // Alternative text for images
    videos: 'video, iframe[src*="youtube.com"], iframe[src*="vimeo.com"]', // Video sources
    tables: 'table', // Capture table elements
    tableData: 'td', // Capture table cells
    lists: 'ul, ol', // Capture all lists
    listItems: 'li', // Capture all list items
    scripts: 'script', // JavaScript file sources
    stylesheets: 'link[rel="stylesheet"]', // External CSS files
    structuredData: 'script[type="application/ld+json"]', // JSON-LD structured data
    socialLinks:
      'a[href*="facebook.com"], a[href*="twitter.com"], a[href*="linkedin.com"], a[href*="instagram.com"]', // Social media links
    author: 'meta[name="author"]', // Author meta tag
    publishDate: 'meta[property="article:published_time"], time', // Publish date
    modifiedDate: 'meta[property="article:modified_time"]', // Last modified date
    canonicalURL: 'link[rel="canonical"]', // Canonical URL
    openGraphTitle: 'meta[property="og:title"]', // OpenGraph title
    openGraphDescription: 'meta[property="og:description"]', // OpenGraph description
    openGraphImage: 'meta[property="og:image"]', // OpenGraph image
    twitterCard: 'meta[name="twitter:card"]', // Twitter card type
    twitterTitle: 'meta[name="twitter:title"]', // Twitter title
    twitterDescription: 'meta[name="twitter:description"]', // Twitter description
    twitterImage: 'meta[name="twitter:image"]', // Twitter image
  },
});

(async () => {
  const data = await scraper.scrape('https://example.com');
  console.log(data);
})();

📊 Rule Set Table

Rule | CSS Selector | Target Data
fullHTML | html | The entire HTML of the page
title | head > title | The <title> of the page
description | meta[name="description"] | Meta description for SEO
keywords | meta[name="keywords"] | Meta keywords
favicon | link[rel="icon"] | Website icon
mainHeading | h1 | The first <h1> heading
allHeadings | h1, h2, h3, h4, h5, h6 | All headings (h1-h6)
firstParagraph | p | The first paragraph (<p>)
allParagraphs | p | All paragraphs on the page
links | a | All anchor <a> links
images | img | All image <img> sources
imageAlts | img | All image alt texts
videos | video, iframe[src*="youtube.com"], iframe[src*="vimeo.com"] | Video sources (<video>, YouTube, Vimeo)
tables | table | All <table> elements
tableData | td | Individual <td> elements
lists | ul, ol | All ordered <ol> and unordered <ul> lists
listItems | li | All list <li> items
scripts | script | JavaScript files included (<script src="...">)
stylesheets | link[rel="stylesheet"] | Stylesheets (<link rel="stylesheet">)
structuredData | script[type="application/ld+json"] | JSON-LD structured data for SEO
socialLinks | a[href*="facebook.com"], a[href*="twitter.com"], a[href*="linkedin.com"], a[href*="instagram.com"] | Facebook, Twitter, LinkedIn, Instagram links
author | meta[name="author"] | Page author
publishDate | meta[property="article:published_time"], time | Date the article was published
modifiedDate | meta[property="article:modified_time"] | Last modified date
canonicalURL | link[rel="canonical"] | Canonical URL (avoids duplicate content)
openGraphTitle | meta[property="og:title"] | OpenGraph title for social sharing
openGraphDescription | meta[property="og:description"] | OpenGraph description
openGraphImage | meta[property="og:image"] | OpenGraph image URL
twitterCard | meta[name="twitter:card"] | Twitter card type (summary, summary_large_image)
twitterTitle | meta[name="twitter:title"] | Twitter title metadata
twitterDescription | meta[name="twitter:description"] | Twitter description metadata
twitterImage | meta[name="twitter:image"] | Twitter image metadata
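
Each rule name becomes a key in the object returned by scrape(). As a hedged sketch, reusing the scraper from the full example above (the exact value shapes – single string vs. array – depend on the package's extraction logic):

(async () => {
  const data = await scraper.scrape('https://example.com');
  console.log(data.title); // value extracted for the 'title' rule
  console.log(data.allHeadings); // value extracted for the 'allHeadings' rule
})();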

💡 Contributing

Contributions are welcome! Please submit issues or pull requests.

