This project helps you to read the content of URLs, and return the title, length, html, text, markdown, excerpt.
"node": ">=20.11.0"
yarn add url-reader
# or npm install url-reader
import URLReader from 'url-reader';
const reader = new URLReader();
await reader.init();
const results = await reader.read({
urls: ['https://www.google.com'],
timeout: 10000, // ms, default: 60000
enableMarkdown: false, // default: true
runScripts: 'dangerously', // run the scripts included in the HTML and fetch remote resources, default is closed.
});
Parsed Result:
interface IReaderResult {
title: string;
length: number;
html: string;
text: string;
markdown?: string;
excerpt: string;
}
- start server
git clone https://github.com/yokingma/url-reader.git
cd url-reader
# default listen on port 3030
yarn install & yarn run start
- api
GET /reader?url=https://www.google.com
POST /reader
Body:
{
urls: ['https://www.google.com', 'https://www.bing.com']
}
docker build -t urlreader . # urlreader is your image's tag name
The service will listen on port 3030
.
- puppeteer When you install Puppeteer, it will automatically downloads a recent version of Chrome for Testing (~170MB macOS, ~282MB Linux, ~280MB Windows) and a chrome-headless-shell binary.
- install error with puppeteer
Error [ERR_TLS_CERT_ALTNAME_INVALID]: Hostname/IP does not match certificate's altnames...
remove .npmrc file and re-install.