Dead simple website crawler

Objectives and Scope

This is a dead simple crawler created to make data scraping as easy as possible. Its API is, and always will be, somewhat limiting if you plan to clone websites, build an indexer, or do anything like that.

Examples

Get package names from npm search result

const c = new Crawler<string>();
await c.start("https://www.npmjs.com/search?q=tank").find("main h3").textContent().result();
const results = await c.results(); // ["tank", "tanker", "tankify", ...]

Get package name + readme content from npm search result

const c = new Crawler<string>();
await c.start("https://www.npmjs.com/search?q=tank").find("main a[target=_self][href^='/package']").click()
    .each(job => job.find("h2:first,#readme").textContent().result());
const results = await c.results(); // ["tank", "tank is a package ...", "tanker", "tanker is an awesome ...", ...]

Get package name + readme content from npm search result, wrap in nice object

const c = new Crawler<{ name: string; description: string }>();
await c.start("https://www.npmjs.com/search?q=tank").find("main a[target=_self][href^='/package']").click()
    .each(job => job.find("h2:first,#readme").textContent().replace((s) => ({
        name: s[0],
        description: s[1],
    })).result());

const results = await c.results(); // [{ name: "tank", description: "tank is a package ..."}, { name: "tanker", description: "tanker is an awesome ..."}, ...]
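The mapping step in the example above can be written and tested separately from the crawler. The following is a minimal sketch, not part of this library's API: `PackageInfo` and `toPackageInfo` are hypothetical names, and it assumes the callback receives the matched text contents as an array in selector order (heading first, readme second), as the example suggests. The trimming is an extra that the original example does not do.

```typescript
// Hypothetical shape for one result; mirrors the generic type used in the example above.
interface PackageInfo {
    name: string;
    description: string;
}

// Hypothetical helper: turns [headingText, readmeText] into an object.
// Assumption: the texts arrive in this order. Missing entries fall back to "".
function toPackageInfo(texts: string[]): PackageInfo {
    const [name = "", description = ""] = texts;
    return {
        name: name.trim(),
        description: description.trim(),
    };
}

// Example: toPackageInfo(["tank", "tank is a package ..."])
// gives { name: "tank", description: "tank is a package ..." }
```

A function like this could then be passed where the inline arrow function is used, which keeps the chain short and makes the mapping easy to unit test.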

Get all image URLs from an old Reddit page

const c = new Crawler<string>();
await c.start("https://old.reddit.com/r/aww/").find("img").attr("src").resolve().result();
const results = await c.results(); // ["https://a.thumbs.redditmedia.com/...", "https://b.thumbs.redditmedia.com/...", ...]
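The `resolve()` step in the example above presumably turns relative or protocol-relative `src` values into absolute URLs. How the library does this is not documented here, so the following is only a sketch of the general technique using the standard WHATWG `URL` constructor. `resolveUrl` is a hypothetical helper name, not a function of this library.

```typescript
// Hypothetical helper: resolves a possibly relative URL against the page it was found on.
// Uses the standard URL constructor, available globally in browsers and Node.js.
function resolveUrl(src: string, pageUrl: string): string {
    return new URL(src, pageUrl).href;
}

const page = "https://old.reddit.com/r/aww/";

// Path-absolute URL: keeps the page's scheme and host.
const a = resolveUrl("/static/logo.png", page);
// a === "https://old.reddit.com/static/logo.png"

// Protocol-relative URL: keeps only the page's scheme.
const b = resolveUrl("//a.thumbs.redditmedia.com/x.jpg", page);
// b === "https://a.thumbs.redditmedia.com/x.jpg"

// Already absolute URL: returned unchanged.
const c = resolveUrl("https://b.thumbs.redditmedia.com/y.jpg", page);
// c === "https://b.thumbs.redditmedia.com/y.jpg"
```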

Documentation

Proper documentation is not yet available.

License

MIT
