article-parser

Extract the main article, main image, and metadata from a URL.

Demo

Install & Usage

Node.js

npm i article-parser

# pnpm
pnpm i article-parser

# yarn
yarn add article-parser

import { extract } from 'article-parser'

// with CommonJS environments
// const { extract } = require('article-parser/dist/cjs/article-parser.js')

const url = 'https://www.freethink.com/technology/virtual-world'

extract(url).then((article) => {
  console.log(article)
}).catch((err) => {
  console.trace(err)
})

Deno

import { extract } from 'https://esm.sh/article-parser'

(async () => {
  const data = await extract('https://www.freethink.com/technology/virtual-world')
  console.log(data)
})();

View more examples.

APIs

extract(String url | String html)
Transformations
Configuration methods

extract(String url | String html)

Loads and extracts article data from a URL or an HTML string. Returns a Promise.

Example:

import { extract } from 'article-parser'

const getArticle = async (url) => {
  try {
    const article = await extract(url)
    return article
  } catch (err) {
    console.trace(err)
    return null
  }
}

getArticle('https://domain.com/path/to/article')

If the extraction succeeds, you get an article object with the following structure:

{
  "url": URI String,
  "title": String,
  "description": String,
  "image": URI String,
  "author": String,
  "content": HTML String,
  "published": Date String,
  "source": String, // original publisher
  "links": Array, // list of alternative links
  "ttr": Number, // time to read in seconds, 0 = unknown
}

Click here to see an actual result.
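As a small illustration of consuming the article object, the sketch below turns the `ttr` field (seconds, with 0 meaning unknown) into a human-readable label. The helper name `formatReadingTime` and the sample object are hypothetical and not part of article-parser.

```javascript
// Hypothetical helper, not provided by article-parser.
// Converts `ttr` (time to read, in seconds) into a short label.
const formatReadingTime = (ttr) => {
  // 0 means "unknown" according to the structure above
  if (!Number.isFinite(ttr) || ttr <= 0) {
    return 'unknown'
  }
  // round to whole minutes, but never show less than 1 minute
  const minutes = Math.max(1, Math.round(ttr / 60))
  return `${minutes} min read`
}

// Made-up sample standing in for a real extraction result
const sampleArticle = { title: 'Sample title', ttr: 264 }

console.log(formatReadingTime(sampleArticle.ttr)) // prints "4 min read"
```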

Transformations

Sometimes the default extraction algorithm does not work well. That is when transformations help.

By running custom functions before and after the main extraction step, we can improve the result.

Transformations are available since article-parser@7.0.0, as an improvement on queryRule from older versions.

To work with transformations, article-parser provides two public methods:

addTransformations(Object transformation | Array transformations)
removeTransformations(Array patterns)

First, let's look at the transformation object.

`transformation` object

In article-parser, a transformation is an object with the following properties:

patterns: required, a list of regexps to match the URLs
pre: optional, a function to process the raw HTML
post: optional, a function to process the extracted article

Basically, a transformation can be read like this:

for URLs that match these patterns
run the pre function to normalize the HTML content
then extract the main article content from the normalized HTML, and if that succeeds
run the post function to normalize the extracted article content
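The first step, deciding whether a transformation applies to a URL, can be sketched as below. This assumes matching amounts to testing each regexp in `patterns` against the URL, which the text above implies but does not spell out. The helper `isApplicable` is hypothetical, and the dots in the patterns are escaped so that they match literally.

```javascript
// Sketch only: `isApplicable` is a hypothetical helper, not an article-parser API.
const transformation = {
  patterns: [
    /([\w]+\.)?domain\.tld\/*/,
    /domain\.tld\/articles\/*/
  ]
}

// A transformation applies when at least one of its patterns matches the URL
const isApplicable = (url, t) => {
  return t.patterns.some((pattern) => pattern.test(url))
}

console.log(isApplicable('https://news.domain.tld/articles/42', transformation)) // prints true
console.log(isApplicable('https://example.org/post/1', transformation)) // prints false
```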

Here is an example transformation:

{
  patterns: [
    /([\w]+.)?domain.tld\/*/,
    /domain.tld\/articles\/*/
  ],
  pre: (document) => {
    // for each div.advertise-area, remove it and every sibling that follows it
    document.querySelectorAll('.advertise-area').forEach((element) => {
      if (element.nodeName === 'DIV') {
        while (element.nextSibling) {
          element.parentNode.removeChild(element.nextSibling)
        }
        element.parentNode.removeChild(element)
      }
    })
    return document
  },
  post: (document) => {
    // with extracted article, replace all h4 tags with h2
    document.querySelectorAll('h4').forEach((element) => {
      const h2Element = document.createElement('h2')
      h2Element.innerHTML = element.innerHTML
      element.parentNode.replaceChild(h2Element, element)
    })
    // change small sized images to original version
    document.querySelectorAll('img').forEach((element) => {
      const src = element.getAttribute('src')
      if (src && src.includes('domain.tld/pics/150x120/')) {
        const fullSrc = src.replace('/pics/150x120/', '/pics/original/')
        element.setAttribute('src', fullSrc)
      }
    })
    return document
  }
}

To write better transformation logic, please refer to the linkedom and Document object documentation.

`addTransformations(Object transformation | Array transformations)`

Add a single transformation or a list of transformations. For example:

import { addTransformations } from 'article-parser'

addTransformations({
  patterns: [
    /([\w]+.)?abc.tld\/*/
  ],
  pre: (document) => {
    // do something with document
    return document
  },
  post: (document) => {
    // do something with document
    return document
  }
})

addTransformations([
  {
    patterns: [
      /([\w]+.)?def.tld\/*/
    ],
    pre: (document) => {
      // do something with document
      return document
    },
    post: (document) => {
      // do something with document
      return document
    }
  },
  {
    patterns: [
      /([\w]+.)?xyz.tld\/*/
    ],
    pre: (document) => {
      // do something with document
      return document
    },
    post: (document) => {
      // do something with document
      return document
    }
  }
])

Transformations without patterns are ignored.
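To make the "object or array" input and the "ignored without patterns" rule concrete, here is a minimal sketch of how such input could be normalized. It is not the library's implementation. The helper `normalizeTransformations` is hypothetical, and treating an empty `patterns` array as "without patterns" is an assumption.

```javascript
// Sketch only: illustrates the described behavior, not article-parser's actual code.
const normalizeTransformations = (input) => {
  // accept either a single transformation object or an array of them
  const list = Array.isArray(input) ? input : [input]
  // keep only entries that carry at least one pattern (assumption: empty array = no patterns)
  return list.filter((t) => {
    return t != null && Array.isArray(t.patterns) && t.patterns.length > 0
  })
}

const kept = normalizeTransformations([
  { patterns: [/abc\.tld\/*/], pre: (document) => document },
  { post: (document) => document } // no patterns, so it is dropped
])

console.log(kept.length) // prints 1
```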

`removeTransformations(Array patterns)`

Removes transformations that match the given patterns.

For example, we can remove all added transformations above:

import { removeTransformations } from 'article-parser'

removeTransformations([
  /([\w]+.)?abc.tld\/*/,
  /([\w]+.)?def.tld\/*/,
  /([\w]+.)?xyz.tld\/*/
])

Calling removeTransformations() without arguments removes all current transformations.

Priority order

While processing an article, more than one transformation can be applied.

Suppose that we have the following transformations:

[
  {
    patterns: [
      /http(s?):\/\/google.com\/*/,
      /http(s?):\/\/goo.gl\/*/
    ],
    pre: function_one,
    post: function_two
  },
  {
    patterns: [
      /http(s?):\/\/goo.gl\/*/,
      /http(s?):\/\/google.inc\/*/
    ],
    pre: function_three,
    post: function_four
  }
]

As you can see, an article from goo.gl matches both of them.

In this scenario, article-parser will execute both transformations, one by one:

function_one -> function_three -> extraction -> function_two -> function_four
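The ordering above can be reproduced with a small stand-in pipeline. This is a sketch under stated assumptions, not article-parser's code: `runPipeline` and `step` are hypothetical, the extraction step is a stub that only records its call, and the patterns are written with escaped dots.

```javascript
// Sketch only: a stand-in pipeline that records the order of calls.
const calls = []

// Builds a function that logs its name and passes the value through unchanged
const step = (name) => (value) => {
  calls.push(name)
  return value
}

const transformations = [
  {
    patterns: [/https?:\/\/google\.com\/*/, /https?:\/\/goo\.gl\/*/],
    pre: step('function_one'),
    post: step('function_two')
  },
  {
    patterns: [/https?:\/\/goo\.gl\/*/, /https?:\/\/google\.inc\/*/],
    pre: step('function_three'),
    post: step('function_four')
  }
]

const runPipeline = (url, input, list, extractStep) => {
  // collect every transformation with at least one matching pattern, in list order
  const matched = list.filter((t) => t.patterns.some((p) => p.test(url)))
  let value = input
  // all pre functions run first, in order
  for (const t of matched) {
    if (typeof t.pre === 'function') value = t.pre(value)
  }
  value = extractStep(value)
  // then all post functions, in the same order
  for (const t of matched) {
    if (typeof t.post === 'function') value = t.post(value)
  }
  return value
}

runPipeline('https://goo.gl/abc', '<html></html>', transformations, step('extraction'))
console.log(calls.join(' -> '))
// prints "function_one -> function_three -> extraction -> function_two -> function_four"
```

A URL that matches only the first transformation, such as one on google.com, would trigger only function_one and function_two around the extraction step.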

Configuration methods

In addition, this library provides some methods to customize the default settings. Don't change them unless you have a good reason to.

getParserOptions()
setParserOptions(Object parserOptions)
getSanitizeHtmlOptions()
setSanitizeHtmlOptions(Object sanitizeHtmlOptions)

Here are the default properties/values:

Object `parserOptions`:

View default options

Object `sanitizeHtmlOptions`:

View default options

Read sanitize-html docs for more info.

Quick evaluation

git clone https://github.com/ndaidong/article-parser.git
cd article-parser
pnpm i

npm run eval {URL_TO_PARSE_ARTICLE}

License

The MIT License (MIT)
