A lightweight and efficient web scraping package for JavaScript/TypeScript applications. This package helps developers fetch HTML content, parse web pages, and extract data effortlessly.
- Fetch Web Content: retrieve HTML from any URL with ease.
- Parse and Extract Data: use the integrated parsing tools to extract information.
- Configurable Options: customize scraping behavior using CSS selectors.
- Headless Browser Support: optionally use Puppeteer for JavaScript-rendered pages.
- Lightweight & Fast: uses Cheerio for static HTML scraping.
- TypeScript Support: fully typed for robust development.
- Data Export Support: export scraped data in JSON or CSV format.
- CSV Import Support: read CSV files and convert them to JSON.
Install via npm:

```bash
npm install simple-web-scraper
```

or using Yarn:

```bash
yarn add simple-web-scraper
```

This package leverages Cheerio and Puppeteer for powerful web scraping capabilities:
Cheerio (static HTML):

- Ideal for static HTML parsing (like jQuery for the backend).
- Extremely fast and lightweight, perfect for pages without JavaScript rendering.
- Provides easy CSS selector querying for extracting structured data.

Puppeteer (headless browser):

- Handles JavaScript-rendered pages, essential for scraping dynamic content.
- Can interact with pages, click buttons, and fill out forms.
- Allows screenshot capturing, PDF generation, and full-page automation.

Choosing between them:

- Use Cheerio for speed when scraping static pages.
- Switch to Puppeteer for JavaScript-heavy sites that require full rendering.
- The `usePuppeteer` option gives you the flexibility to choose the best approach for your project (see the sketch below).
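As a rough sketch of that choice, the same rules object can drive either engine by flipping `usePuppeteer`; the `pickScraper` helper and its argument are purely illustrative and not part of the package API:

```ts
import { WebScraper } from 'simple-web-scraper';

// Shared extraction rules: plain CSS selectors keyed by output field name.
const rules = {
  title: 'h1',
  description: 'meta[name="description"]',
};

// Illustrative helper (not part of the package API): pick the engine per page.
function pickScraper(pageNeedsJavaScript: boolean) {
  return new WebScraper({
    usePuppeteer: pageNeedsJavaScript, // Puppeteer for dynamic pages, Cheerio otherwise
    rules,
  });
}

const staticScraper = pickScraper(false); // fast Cheerio parsing
const dynamicScraper = pickScraper(true); // full Puppeteer rendering
```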
`new WebScraper(options?: ScraperOptions)`

| Parameter | Type | Description |
|---|---|---|
| `usePuppeteer` | `boolean` (optional) | Whether to use Puppeteer (default: `true`) |
| `throttle` | `number` (optional) | Delay in milliseconds between requests (default: `1000`) |
| `rules` | `Record<string, string>` | CSS selectors defining data extraction rules |
Methods and utilities:

- `scrape(url)` – scrapes the given URL based on the configured options.
- `exportToJSON(data, filePath)` – exports the given data to a JSON file.
- `exportToCSV(data, filePath, options?)` – exports the given data to a CSV file (see the combined sketch below).
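A minimal end-to-end sketch combining the constructor options and the three functions above; the `throttle` value, selectors, and output file names are arbitrary choices for illustration:

```ts
import { WebScraper, exportToJSON, exportToCSV } from 'simple-web-scraper';

const scraper = new WebScraper({
  usePuppeteer: false, // static HTML is enough here
  throttle: 2000,      // wait 2 s between requests (package default: 1000 ms)
  rules: { title: 'head > title', links: 'a' },
});

(async () => {
  const data = await scraper.scrape('https://example.com');
  exportToJSON(data, 'example.json');                        // write the result as JSON
  exportToCSV(data, 'example.csv', { preserveNulls: true }); // and as CSV, keeping nulls
})();
```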
You can scrape web pages using either Puppeteer (for JavaScript-heavy pages) or Cheerio (for static HTML pages).
```js
import { WebScraper } from 'simple-web-scraper';

const scraper = new WebScraper({
  usePuppeteer: false, // Set to true for dynamic pages
  rules: {
    title: 'h1',
    description: 'meta[name="description"]',
  },
});

(async () => {
  const data = await scraper.scrape('https://example.com');
  console.log(data);
})();
```

To scrape pages that require JavaScript execution:
```js
const scraper = new WebScraper({
  usePuppeteer: true, // Enable Puppeteer for JavaScript-rendered content
  rules: {
    heading: 'h1',
    price: '.product-price',
  },
});

(async () => {
  const data = await scraper.scrape('https://example.com/product');
  console.log(data);
})();
```

Scraped data can be exported to JSON or CSV files using utility functions.
```js
import { exportToJSON } from 'simple-web-scraper';

const data = { name: 'Example', value: 42 };
exportToJSON(data, 'output.json');
```

```js
import { exportToCSV } from 'simple-web-scraper';

const data = [
  { name: 'Example 1', value: 42 },
  { name: 'Example 2', value: 99 },
];

exportToCSV(data, 'output.csv');

// Preserve null and undefined values as null
exportToCSV(data, 'output.csv', { preserveNulls: true });
```

This example demonstrates how to use simple-web-scraper in an Express (Node.js) backend with ES module imports:
```js
import express from 'express';
import { WebScraper, exportToJSON, exportToCSV } from 'simple-web-scraper';

const app = express();

const scraper = new WebScraper({
  usePuppeteer: true,
  rules: { title: 'h1', content: 'p' },
});

app.get('/scrape-example', async (req, res) => {
  try {
    const url = 'https://github.com/The-Node-Forge';
    const data = await scraper.scrape(url);

    exportToJSON(data, 'output.json'); // export JSON
    exportToCSV(data, 'output.csv', { preserveNulls: true }); // export CSV

    res.status(200).json({ success: true, data });
  } catch (error) {
    res.status(500).json({ success: false, error: error.message });
  }
});

app.listen(3000); // start the server so the route can be hit
```

The same backend pattern works with CommonJS `require`; the example below also uses a much fuller set of extraction rules:
```js
const express = require('express');
const {
  WebScraper,
  exportToJSON,
  exportToCSV,
} = require('@the-node-forge/simple-web-scraper/dist');

const app = express();

const scraper = new WebScraper({
  usePuppeteer: true,
  rules: {
    fullHTML: 'html', // Entire page HTML
    title: 'head > title', // Page title
    description: 'meta[name="description"]', // Meta description
    keywords: 'meta[name="keywords"]', // Meta keywords
    favicon: 'link[rel="icon"]', // Favicon URL
    mainHeading: 'h1', // First H1 heading
    allHeadings: 'h1, h2, h3, h4, h5, h6', // All headings on the page
    firstParagraph: 'p', // First paragraph
    allParagraphs: 'p', // All paragraphs on the page
    links: 'a', // All links on the page
    images: 'img', // All image URLs
    imageAlts: 'img', // Alternative text for images
    videos: 'video, iframe[src*="youtube.com"], iframe[src*="vimeo.com"]', // Video sources
    tables: 'table', // Capture table elements
    tableData: 'td', // Capture table cells
    lists: 'ul, ol', // Capture all lists
    listItems: 'li', // Capture all list items
    scripts: 'script', // JavaScript file sources
    stylesheets: 'link[rel="stylesheet"]', // External CSS files
    structuredData: 'script[type="application/ld+json"]', // JSON-LD structured data
    socialLinks:
      'a[href*="facebook.com"], a[href*="twitter.com"], a[href*="linkedin.com"], a[href*="instagram.com"]', // Social media links
    author: 'meta[name="author"]', // Author meta tag
    publishDate: 'meta[property="article:published_time"], time', // Publish date
    modifiedDate: 'meta[property="article:modified_time"]', // Last modified date
    canonicalURL: 'link[rel="canonical"]', // Canonical URL
    openGraphTitle: 'meta[property="og:title"]', // OpenGraph title
    openGraphDescription: 'meta[property="og:description"]', // OpenGraph description
    openGraphImage: 'meta[property="og:image"]', // OpenGraph image
    twitterCard: 'meta[name="twitter:card"]', // Twitter card type
    twitterTitle: 'meta[name="twitter:title"]', // Twitter title
    twitterDescription: 'meta[name="twitter:description"]', // Twitter description
    twitterImage: 'meta[name="twitter:image"]', // Twitter image
  },
});

app.get('/test-scraper', async (req, res) => {
  try {
    const url = 'https://github.com/The-Node-Forge';
    const data = await scraper.scrape(url);

    exportToJSON(data, 'output.json'); // export JSON
    exportToCSV(data, 'output.csv'); // export CSV

    res.status(200).json({ success: true, data });
  } catch (error) {
    res.status(500).json({ success: false, error: error.message });
  }
});

app.listen(3000); // start the server so the route can be hit
```

The same rule set can also be used in a standalone script outside Express:
```js
import { WebScraper } from 'simple-web-scraper';

const scraper = new WebScraper({
  usePuppeteer: true, // Set to false if scraping static pages
  rules: {
    fullHTML: 'html', // Entire page HTML
    title: 'head > title', // Page title
    description: 'meta[name="description"]', // Meta description
    keywords: 'meta[name="keywords"]', // Meta keywords
    favicon: 'link[rel="icon"]', // Favicon URL
    mainHeading: 'h1', // First H1 heading
    allHeadings: 'h1, h2, h3, h4, h5, h6', // All headings on the page
    firstParagraph: 'p', // First paragraph
    allParagraphs: 'p', // All paragraphs on the page
    links: 'a', // All links on the page
    images: 'img', // All image URLs
    imageAlts: 'img', // Alternative text for images
    videos: 'video, iframe[src*="youtube.com"], iframe[src*="vimeo.com"]', // Video sources
    tables: 'table', // Capture table elements
    tableData: 'td', // Capture table cells
    lists: 'ul, ol', // Capture all lists
    listItems: 'li', // Capture all list items
    scripts: 'script', // JavaScript file sources
    stylesheets: 'link[rel="stylesheet"]', // External CSS files
    structuredData: 'script[type="application/ld+json"]', // JSON-LD structured data
    socialLinks:
      'a[href*="facebook.com"], a[href*="twitter.com"], a[href*="linkedin.com"], a[href*="instagram.com"]', // Social media links
    author: 'meta[name="author"]', // Author meta tag
    publishDate: 'meta[property="article:published_time"], time', // Publish date
    modifiedDate: 'meta[property="article:modified_time"]', // Last modified date
    canonicalURL: 'link[rel="canonical"]', // Canonical URL
    openGraphTitle: 'meta[property="og:title"]', // OpenGraph title
    openGraphDescription: 'meta[property="og:description"]', // OpenGraph description
    openGraphImage: 'meta[property="og:image"]', // OpenGraph image
    twitterCard: 'meta[name="twitter:card"]', // Twitter card type
    twitterTitle: 'meta[name="twitter:title"]', // Twitter title
    twitterDescription: 'meta[name="twitter:description"]', // Twitter description
    twitterImage: 'meta[name="twitter:image"]', // Twitter image
  },
});

(async () => {
  const data = await scraper.scrape('https://example.com');
  console.log(data);
})();
```
| Rule | CSS Selector | Target Data |
|---|---|---|
| `fullHTML` | `html` | The entire HTML of the page |
| `title` | `head > title` | The `<title>` of the page |
| `description` | `meta[name="description"]` | Meta description for SEO |
| `keywords` | `meta[name="keywords"]` | Meta keywords |
| `favicon` | `link[rel="icon"]` | Website icon |
| `mainHeading` | `h1` | The first `<h1>` heading |
| `allHeadings` | `h1, h2, h3, h4, h5, h6` | All headings (h1-h6) |
| `firstParagraph` | `p` | The first paragraph (`<p>`) |
| `allParagraphs` | `p` | All paragraphs on the page |
| `links` | `a` | All anchor `<a>` links |
| `images` | `img` | All image `<img>` sources |
| `imageAlts` | `img` | All image alt texts |
| `videos` | `video, iframe[src*="youtube.com"], iframe[src*="vimeo.com"]` | Video sources (`<video>`, YouTube, Vimeo) |
| `tables` | `table` | All `<table>` elements |
| `tableData` | `td` | Individual `<td>` cells |
| `lists` | `ul, ol` | All ordered `<ol>` and unordered `<ul>` lists |
| `listItems` | `li` | All list `<li>` items |
| `scripts` | `script` | Included JavaScript files (`<script src="...">`) |
| `stylesheets` | `link[rel="stylesheet"]` | Stylesheets (`<link rel="stylesheet">`) |
| `structuredData` | `script[type="application/ld+json"]` | JSON-LD structured data for SEO |
| `socialLinks` | `a[href*="facebook.com"], a[href*="twitter.com"], a[href*="linkedin.com"], a[href*="instagram.com"]` | Facebook, Twitter, LinkedIn, Instagram links |
| `author` | `meta[name="author"]` | Page author |
| `publishDate` | `meta[property="article:published_time"], time` | Date the article was published |
| `modifiedDate` | `meta[property="article:modified_time"]` | Last modified date |
| `canonicalURL` | `link[rel="canonical"]` | Canonical URL (avoids duplicate content) |
| `openGraphTitle` | `meta[property="og:title"]` | OpenGraph title for social sharing |
| `openGraphDescription` | `meta[property="og:description"]` | OpenGraph description |
| `openGraphImage` | `meta[property="og:image"]` | OpenGraph image URL |
| `twitterCard` | `meta[name="twitter:card"]` | Twitter card type (summary, summary_large_image) |
| `twitterTitle` | `meta[name="twitter:title"]` | Twitter title metadata |
| `twitterDescription` | `meta[name="twitter:description"]` | Twitter description metadata |
| `twitterImage` | `meta[name="twitter:image"]` | Twitter image metadata |
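Since `rules` is just a `Record<string, string>` (see the API table above), any subset of these selectors can be kept as a reusable constant; the `seoRules` name and selection below are only a sketch, not a recommended configuration:

```ts
import { WebScraper } from 'simple-web-scraper';

// A reusable subset of the selectors documented in the table above.
const seoRules: Record<string, string> = {
  title: 'head > title',
  description: 'meta[name="description"]',
  canonicalURL: 'link[rel="canonical"]',
  openGraphImage: 'meta[property="og:image"]',
};

// The same rule set can back a static (Cheerio) or dynamic (Puppeteer) scraper.
const seoScraper = new WebScraper({ usePuppeteer: false, rules: seoRules });
```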
Contributions are welcome! Please submit issues or pull requests.