Extract the main article, main image and metadata from a URL.
npm i article-parser
# pnpm
pnpm i article-parser
# yarn
yarn add article-parser
import { extract } from 'article-parser'
// with CommonJS environments
// const { extract } = require('article-parser/dist/cjs/article-parser.js')
const url = 'https://www.freethink.com/technology/virtual-world'
extract(url).then((article) => {
console.log(article)
}).catch((err) => {
console.trace(err)
})
import { extract } from 'https://esm.sh/article-parser'
(async () => {
const data = await extract('https://www.freethink.com/technology/virtual-world')
console.log(data)
})();
View more examples.
Load and extract article data. Returns a Promise.
Example:
import { extract } from 'article-parser'
const getArticle = async (url) => {
try {
const article = await extract(url)
return article
} catch (err) {
console.trace(err)
return null
}
}
getArticle('https://domain.com/path/to/article')
If the extraction succeeds, you get an article object with the following structure:
{
"url": URI String,
"title": String,
"description": String,
"image": URI String,
"author": String,
"content": HTML String,
"published": Date String,
"source": String, // original publisher
"links": Array, // list of alternative links
"ttr": Number, // time to read in seconds, 0 = unknown
}
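As a small illustration of how these fields might be consumed, the sketch below builds a one-line summary from an article object. The helper summarizeArticle and the sample object are hypothetical; the sample values are invented and are not a real extraction result. Only the field names come from the structure above.

```javascript
// Hypothetical helper: builds a one-line summary from an extracted article.
// Only the field names (title, author, source, ttr) come from the structure above.
const summarizeArticle = (article) => {
  const { title, author, source, ttr } = article
  // ttr is given in seconds; 0 means the reading time is unknown
  const readingTime = ttr > 0 ? `${Math.ceil(ttr / 60)} min read` : 'reading time unknown'
  const byline = author ? ` by ${author}` : ''
  return `${title}${byline} (${source}, ${readingTime})`
}

// Sample data invented for illustration, not a real extraction result
const sample = {
  url: 'https://domain.com/path/to/article',
  title: 'An example title',
  description: 'An example description',
  image: 'https://domain.com/path/to/image.jpg',
  author: 'Jane Doe',
  content: '<p>Example content</p>',
  published: '2022-01-01T00:00:00.000Z',
  source: 'domain.com',
  links: ['https://domain.com/path/to/article'],
  ttr: 150
}

console.log(summarizeArticle(sample))
// → An example title by Jane Doe (domain.com, 3 min read)
```

Rounding up with Math.ceil is a design choice of this sketch, so that a 150-second article is reported as 3 minutes rather than 2.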
Click here to see an actual result.
Sometimes the default extraction algorithm does not work well. That is when transformations are needed.
By running some functions before and after the main extraction step, we can improve the result as much as possible.
transformation is available since article-parser@7.0.0, as the improvement of queryRule in the older versions.
To work with transformations, article-parser provides 2 public methods as below:
addTransformations(Object transformation | Array transformations)
removeTransformations(Array patterns)
First, let's talk about the transformation object.
In article-parser, a transformation is an object with the following properties:
- patterns: required, a list of regexps to match the URLs
- pre: optional, a function to process raw HTML
- post: optional, a function to process the extracted article
Basically, the meaning of a transformation can be interpreted like this:
- with the URLs which match these patterns
- let's run the pre function to normalize the HTML content
- then extract the main article content from the normalized HTML, and if that succeeds
- let's run the post function to normalize the extracted article content
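The steps above can be sketched as follows. This is a simplified illustration, not article-parser's implementation: plain strings stand in for document objects, fakeExtract is a stand-in for the real extraction step, and runWith and the sample pattern are hypothetical names chosen for this example. It also assumes a transformation applies when at least one of its patterns matches the URL.

```javascript
// A stand-in for the real extraction step (assumption): it only pulls the text
// between <article> tags and returns null when there is none.
const fakeExtract = (html) => {
  const found = html.match(/<article>([\s\S]*?)<\/article>/)
  return found ? found[1] : null
}

// Hypothetical runner for one transformation; strings stand in for documents.
const runWith = (transformation, url, html) => {
  const { patterns = [], pre, post } = transformation
  const matched = patterns.some((pattern) => pattern.test(url))
  // 1. normalize raw HTML with pre when the URL matches
  const normalized = matched && pre ? pre(html) : html
  // 2. extract the main content from the normalized HTML
  const extracted = fakeExtract(normalized)
  if (extracted === null) {
    return null // extraction failed, so post is skipped
  }
  // 3. normalize the extracted content with post
  return matched && post ? post(extracted) : extracted
}

const transformation = {
  patterns: [/domain\.tld\/articles\//],
  pre: (html) => html.replace(/<aside>[\s\S]*?<\/aside>/g, ''),
  post: (content) => content.replace(/h4>/g, 'h2>')
}

const html = '<article><h4>Title</h4><aside>ad</aside><p>Body</p></article>'
console.log(runWith(transformation, 'https://domain.tld/articles/1', html))
// → <h2>Title</h2><p>Body</p>
console.log(runWith(transformation, 'https://other.tld/articles/1', html))
// → <h4>Title</h4><aside>ad</aside><p>Body</p>
```

The second call shows that a URL matching none of the patterns goes through extraction untouched by pre and post.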
Here is an example transformation:
{
patterns: [
/([\w]+.)?domain.tld\/*/,
/domain.tld\/articles\/*/
],
pre: (document) => {
// remove all .advertise-area and its siblings from raw HTML content
document.querySelectorAll('.advertise-area').forEach((element) => {
if (element.nodeName === 'DIV') {
while (element.nextSibling) {
element.parentNode.removeChild(element.nextSibling)
}
element.parentNode.removeChild(element)
}
})
return document
},
post: (document) => {
// with extracted article, replace all h4 tags with h2
document.querySelectorAll('h4').forEach((element) => {
const h2Element = document.createElement('h2')
h2Element.innerHTML = element.innerHTML
element.parentNode.replaceChild(h2Element, element)
})
// change small sized images to original version
document.querySelectorAll('img').forEach((element) => {
const src = element.getAttribute('src')
if (src && src.includes('domain.tld/pics/150x120/')) {
const fullSrc = src.replace('/pics/150x120/', '/pics/original/')
element.setAttribute('src', fullSrc)
}
})
return document
}
}
- To write better transformation logic, please refer to linkedom and the Document object.
Add a single transformation or a list of transformations. For example:
import { addTransformations } from 'article-parser'
addTransformations({
patterns: [
/([\w]+.)?abc.tld\/*/
],
pre: (document) => {
// do something with document
return document
},
post: (document) => {
// do something with document
return document
}
})
addTransformations([
{
patterns: [
/([\w]+.)?def.tld\/*/
],
pre: (document) => {
// do something with document
return document
},
post: (document) => {
// do something with document
return document
}
},
{
patterns: [
/([\w]+.)?xyz.tld\/*/
],
pre: (document) => {
// do something with document
return document
},
post: (document) => {
// do something with document
return document
}
}
])
Transformations without patterns will be ignored.
Remove transformations that match the specific patterns.
For example, we can remove all added transformations above:
import { removeTransformations } from 'article-parser'
removeTransformations([
/([\w]+.)?abc.tld\/*/,
/([\w]+.)?def.tld\/*/,
/([\w]+.)?xyz.tld\/*/
])
Calling removeTransformations() without parameter will remove all current transformations.
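To make the add and remove rules concrete, here is a toy registry that mirrors the documented behavior. It is not the library's internal code: the names registry, add and remove are hypothetical, the numeric return values are a choice made for this sketch, and comparing patterns by their string form is an assumption.

```javascript
// Toy registry (not the library's internals) mirroring the documented behavior.
const registry = []

const add = (input) => {
  const list = Array.isArray(input) ? input : [input]
  // transformations without patterns are ignored
  const valid = list.filter((tfm) => Array.isArray(tfm.patterns) && tfm.patterns.length > 0)
  registry.push(...valid)
  return valid.length
}

const remove = (patterns) => {
  const before = registry.length
  if (!patterns) {
    // no parameter: remove everything
    registry.length = 0
    return before
  }
  // Assumption: patterns are compared by their string form
  const targets = new Set(patterns.map(String))
  const kept = registry.filter((tfm) => !tfm.patterns.some((p) => targets.has(String(p))))
  registry.length = 0
  registry.push(...kept)
  return before - kept.length
}

console.log(add({ patterns: [/([\w]+.)?abc.tld\/*/] })) // → 1
console.log(add([{ patterns: [/([\w]+.)?def.tld\/*/] }, { pre: (doc) => doc }])) // → 1, the second has no patterns
console.log(remove([/([\w]+.)?abc.tld\/*/])) // → 1
console.log(remove()) // → 1, clears what is left
console.log(registry.length) // → 0
```

Under this assumption, a pattern passed for removal has to be written exactly like the one that was added.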
While processing an article, more than one transformation can be applied.
Suppose that we have the following transformations:
[
{
patterns: [
/http(s?):\/\/google.com\/*/,
/http(s?):\/\/goo.gl\/*/
],
pre: function_one,
post: function_two
},
{
patterns: [
/http(s?):\/\/goo.gl\/*/,
/http(s?):\/\/google.inc\/*/
],
pre: function_three,
post: function_four
}
]
As you can see, an article from goo.gl matches both of them.
In this scenario, article-parser will execute both transformations, one by one:
function_one -> function_three -> extraction -> function_two -> function_four
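The order above can be reproduced with a small simulation. This is a sketch under stated assumptions rather than the library's code: step, run and trace are hypothetical names, each stand-in function only records its name, and the extraction step is replaced by a pass-through.

```javascript
const trace = []
// Each stand-in function records its name and passes the document through
const step = (name) => (doc) => {
  trace.push(name)
  return doc
}

const transformations = [
  {
    patterns: [/http(s?):\/\/google.com\/*/, /http(s?):\/\/goo.gl\/*/],
    pre: step('function_one'),
    post: step('function_two')
  },
  {
    patterns: [/http(s?):\/\/goo.gl\/*/, /http(s?):\/\/google.inc\/*/],
    pre: step('function_three'),
    post: step('function_four')
  }
]

// Hypothetical pipeline: all matching pre functions, extraction, then all matching post functions
const run = (url, doc) => {
  const matched = transformations.filter(({ patterns }) => patterns.some((p) => p.test(url)))
  const prepared = matched.reduce((acc, { pre }) => (pre ? pre(acc) : acc), doc)
  const extracted = step('extraction')(prepared)
  return matched.reduce((acc, { post }) => (post ? post(acc) : acc), extracted)
}

run('https://goo.gl/abc', 'raw html')
console.log(trace.join(' -> '))
// → function_one -> function_three -> extraction -> function_two -> function_four
```

A URL on google.com would match only the first transformation in this simulation, so only function_one and function_two would run around the extraction.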
In addition, this library provides some methods to customize the default settings. Don't change them unless you have a good reason to.
- getParserOptions()
- setParserOptions(Object parserOptions)
- getSanitizeHtmlOptions()
- setSanitizeHtmlOptions(Object sanitizeHtmlOptions)
Here are default properties/values:
View default options
Read sanitize-html docs for more info.
git clone https://github.com/ndaidong/article-parser.git
cd article-parser
pnpm i
npm run eval {URL_TO_PARSE_ARTICLE}
The MIT License (MIT)