
Commit c09880f

Reorganize and expand README (#673)
* Reorganize and expand README
* Document options & add example for isProbablyReaderable()
1 parent 3d8baff commit c09880f

1 file changed

+61
-44
lines changed

README.md

Lines changed: 61 additions & 44 deletions
@@ -2,86 +2,103 @@
22

33
A standalone version of the readability library used for Firefox Reader View.
44

5-
## Usage on the web
5+
## Basic usage
66

7-
To parse a document, you must create a new `Readability` object from a DOM document object, and then call `parse()`. Here's an example:
7+
To parse a document, you must create a new `Readability` object from a DOM document object, and then call the [`parse()`](#parse) method. Here's an example:
88

99
```javascript
1010
var article = new Readability(document).parse();
1111
```
1212

13-
This `article` object will contain the following properties:
13+
If you use Readability in a web browser, you will likely be able to use a `document` reference from elsewhere (e.g. fetched via XMLHttpRequest, in a same-origin `<iframe>` you have access to, etc.). In Node.js, you can [use an external DOM library](#nodejs-usage).
1414

15-
* `title`: article title
16-
* `content`: HTML string of processed article content
17-
* `textContent`: text content of the article (all HTML removed)
18-
* `length`: length of an article, in characters
19-
* `excerpt`: article description, or short excerpt from the content
20-
* `byline`: author metadata
21-
* `dir`: content direction
15+
## API Reference
2216

23-
If you're using Readability on the web, you will likely be able to use a `document` reference
24-
from elsewhere (e.g. fetched via XMLHttpRequest, in a same-origin `<iframe>` you have access to, etc.).
17+
### `new Readability(document, options)`
2518

26-
### Optional
19+
The `options` object accepts a number of properties, all optional:
2720

28-
Readability's `parse()` works by modifying the DOM. This removes some elements in the web page.
29-
You could avoid this by passing the clone of the `document` object while creating a `Readability` object.
21+
* `debug` (boolean, default `false`): whether to enable logging.
22+
* `maxElemsToParse` (number, default `0` i.e. no limit): the maximum number of elements to parse.
23+
* `nbTopCandidates` (number, default `5`): the number of top candidates to consider when analysing how tight the competition is among candidates.
24+
* `charThreshold` (number, default `500`): the number of characters an article must have in order to return a result.
25+
* `classesToPreserve` (array): a set of classes to preserve on HTML elements when the `keepClasses` option is set to `false`.
26+
* `keepClasses` (boolean, default `false`): whether to preserve all classes on HTML elements. When set to `false` only classes specified in the `classesToPreserve` array are kept.
27+
* `disableJSONLD` (boolean, default `false`): when extracting page metadata, Readability gives precedence to Schema.org fields specified in the JSON-LD format. Set this option to `true` to skip JSON-LD parsing.
28+
* `serializer` (function, default `el => el.innerHTML`): controls how the `content` property returned by the `parse()` method is produced from the root DOM element. It may be useful to specify the `serializer` as the identity function (`el => el`) to obtain a DOM element instead of a string for `content` if you plan to process it further.
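
As a rough sketch of how these options might be assembled, the snippet below merges caller overrides over the defaults documented above. The helper name `buildReaderOptions` and the class name `caption` are made up for illustration and are not part of Readability.

```js
// Hypothetical helper (not part of Readability): merges caller overrides
// over the defaults listed in the options documentation.
function buildReaderOptions(overrides = {}) {
  const defaults = {
    debug: false,
    maxElemsToParse: 0, // 0 means no limit
    nbTopCandidates: 5,
    charThreshold: 500,
    keepClasses: false,
    disableJSONLD: false,
  };
  return { ...defaults, ...overrides };
}

// Example: accept shorter articles and keep one (made-up) class name.
const options = buildReaderOptions({
  charThreshold: 200,
  classesToPreserve: ["caption"],
});

// With a DOM `document` and the `Readability` constructor in scope:
//   var article = new Readability(document, options).parse();
```

Since every option is optional, passing only the properties you want to change directly to the constructor should work just as well; the helper merely makes the defaults visible in one place.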
3029

31-
```
30+
### `parse()`
31+
32+
Returns an object containing the following properties:
33+
34+
* `title`: article title;
35+
* `content`: HTML string of processed article content;
36+
* `textContent`: text content of the article, with all the HTML tags removed;
37+
* `length`: length of an article, in characters;
38+
* `excerpt`: article description, or short excerpt from the content;
39+
* `byline`: author metadata;
40+
* `dir`: content direction;
41+
* `siteName`: name of the site.
42+
43+
The `parse()` method works by modifying the DOM. This removes some elements from the web page, which may be undesirable. You can avoid this by passing a clone of the `document` object to the `Readability` constructor:
44+
45+
```js
3246
var documentClone = document.cloneNode(true);
3347
var article = new Readability(documentClone).parse();
3448
```
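
The returned properties can then be read like any plain object. The sketch below is one possible way to summarise a result; `describeArticle` is a hypothetical helper, and it assumes `parse()` may come back empty (for example `null`) when no article is found.

```js
// Hypothetical helper (not part of Readability): builds a one-line summary
// from the properties documented for the object returned by `parse()`.
function describeArticle(article) {
  // Assumption: a failed parse yields a falsy value such as null.
  if (!article) {
    return "No article could be extracted.";
  }
  const byline = article.byline ? ` by ${article.byline}` : "";
  return `${article.title}${byline} (${article.length} characters)`;
}

// With `article` obtained as in the example above:
//   console.log(describeArticle(article));
```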
3549

36-
## Usage from Node.js
50+
### `isProbablyReaderable(document, options)`
51+
52+
A quick-and-dirty way of figuring out if it's plausible that the contents of a given document are suitable for processing with Readability. It is likely to produce both false positives and false negatives. The reason it exists is to avoid bogging down a time-sensitive process (like loading and showing the user a webpage) with the complex logic in the core of Readability. Improvements to its logic (while not deteriorating its performance) are very welcome.
53+
54+
The `options` object accepts a number of properties, all optional:
55+
56+
* `minContentLength` (number, default `140`): the minimum node content length used to decide if the document is readerable;
57+
* `minScore` (number, default `20`): the minimum cumulated 'score' used to determine if the document is readerable;
58+
* `visibilityChecker` (function, default `isNodeVisible`): the function used to determine if a node is visible.
59+
60+
The function returns a boolean corresponding to whether or not we suspect `Readability.parse()` will succeed at returning an article object. Here's an example:
61+
62+
```js
63+
/*
64+
Only instantiate Readability if we suspect
65+
the `parse()` method will produce a meaningful result.
66+
*/
67+
if (isProbablyReaderable(document)) {
68+
let article = new Readability(document).parse();
69+
}
70+
```
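
The check can also be combined with custom thresholds and a cloned document. The following is only a sketch: `parseIfReaderable` is a hypothetical wrapper, and it takes the two library functions as a `lib` argument so that it makes no assumption about how you load them.

```js
// Hypothetical wrapper (not part of Readability). `lib` is expected to
// provide `Readability` and `isProbablyReaderable`; `options` is passed
// to `isProbablyReaderable` (e.g. `minContentLength`, `minScore`).
function parseIfReaderable(doc, lib, options = {}) {
  if (!lib.isProbablyReaderable(doc, options)) {
    return null; // the cheap check failed, skip the expensive parse
  }
  // parse() modifies the DOM, so work on a clone of the document.
  return new lib.Readability(doc.cloneNode(true)).parse();
}

// Possible usage, with both functions already in scope:
//   var article = parseIfReaderable(
//     document,
//     { Readability, isProbablyReaderable },
//     { minContentLength: 200, minScore: 30 }
//   );
```

Raising `minContentLength` or `minScore` makes the check stricter; the values above are arbitrary examples rather than recommendations.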
71+
72+
## Node.js usage
3773

3874
Readability is available on npm:
3975

4076
```bash
4177
npm install @mozilla/readability
4278
```
4379

44-
In Node.js, you won't generally have a DOM document object. To obtain one, you can use external
45-
libraries like [jsdom](https://github.com/jsdom/jsdom). While this repository contains a parser of
46-
its own (`JSDOMParser`), that is restricted to reading XML-compatible markup and therefore we do
47-
not recommend it for general use.
48-
49-
If you're using `jsdom` to create a DOM object, you should ensure that the page doesn't run (page)
50-
scripts (avoid fetching remote resources etc.) as well as passing it the page's URI as the `url`
51-
property of the `options` object you pass the `JSDOM` constructor.
52-
53-
### Example:
80+
Since Node.js does not come with its own DOM implementation, we rely on external libraries like [jsdom](https://github.com/jsdom/jsdom). Here's an example using `jsdom` to obtain a DOM document object:
5481

5582
```js
5683
var { Readability } = require('@mozilla/readability');
57-
var JSDOM = require('jsdom').JSDOM;
58-
var doc = new JSDOM("<body>Here's a bunch of text</body>", {
84+
var { JSDOM } = require('jsdom');
85+
var doc = new JSDOM("<body>Look at this cat: <img src='./cat.jpg'></body>", {
5986
url: "https://www.example.com/the-page-i-got-the-source-from"
6087
});
6188
let reader = new Readability(doc.window.document);
6289
let article = reader.parse();
6390
```
6491

65-
## What's Readability-readerable?
92+
Remember to pass the page's URI as the `url` option in the `JSDOM` constructor (as shown in the example above), so that Readability can convert relative URLs for images, hyperlinks etc. to their absolute counterparts.
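
To see why the page URL matters, here is a small illustration of how a relative reference resolves against a base URL using the standard `URL` class. It demonstrates URL resolution in general and is not Readability's internal code; the `/about` link is a made-up example.

```js
// The page URL from the jsdom example above.
const pageUrl = "https://www.example.com/the-page-i-got-the-source-from";

// Resolve relative references against the page URL.
const imageUrl = new URL("./cat.jpg", pageUrl).href;
const linkUrl = new URL("/about", pageUrl).href;

console.log(imageUrl); // https://www.example.com/cat.jpg
console.log(linkUrl); // https://www.example.com/about
```

Without a base URL there is nothing to resolve `./cat.jpg` against, which is why the `url` option should be supplied.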
6693

67-
It's a quick-and-dirty way of figuring out if it's plausible that the contents of a given
68-
document are suitable for processing with Readability. It is likely to produce both false
69-
positives and false negatives. The reason it exists is to avoid bogging down a time-sensitive
70-
process (like loading and showing the user a webpage) with the complex logic in the core of
71-
Readability. Improvements to its logic (while not deteriorating its performance) are very
72-
welcome.
94+
`jsdom` has the ability to run the scripts included in the HTML and fetch remote resources. For security reasons these are [disabled by default](https://github.com/jsdom/jsdom#executing-scripts), and we **strongly** recommend you keep them that way.
7395

7496
## Security
7597

76-
If you're going to use Readability with untrusted input (whether in HTML or DOM form), we
77-
**strongly** recommend you use a sanitizer library like
78-
[DOMPurify](https://github.com/cure53/DOMPurify) to avoid script injection when you use
79-
the output of Readability. We would also recommend using
80-
[CSP](https://developer.mozilla.org/en-US/docs/Web/HTTP/CSP) to add further defense-in-depth
98+
If you're going to use Readability with untrusted input (whether in HTML or DOM form), we **strongly** recommend you use a sanitizer library like [DOMPurify](https://github.com/cure53/DOMPurify) to avoid script injection when you use
99+
the output of Readability. We would also recommend using [CSP](https://developer.mozilla.org/en-US/docs/Web/HTTP/CSP) to add further defense-in-depth
81100
restrictions to what you allow the resulting content to do. The Firefox integration of
82-
reader mode uses both of these techniques itself. Sanitizing unsafe content out of the input
83-
is explicitly not something we aim to do as part of Readability itself - there are other
84-
good sanitizer libraries out there, use them!
101+
reader mode uses both of these techniques itself. Sanitizing unsafe content out of the input is explicitly not something we aim to do as part of Readability itself - there are other good sanitizer libraries out there, use them!
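
A minimal sketch of sanitizing the output before use might look like the following. `sanitizeArticle` is a hypothetical helper; it assumes `content` is an HTML string (the default serializer) and that `purify` exposes a `sanitize(html)` method, as DOMPurify does.

```js
// Hypothetical helper (not part of Readability or DOMPurify).
// `purify` is any object with a `sanitize(html)` method, e.g. DOMPurify.
function sanitizeArticle(article, purify) {
  if (!article) {
    return null; // nothing to sanitize
  }
  // Return a copy so the original article object is left untouched.
  return { ...article, content: purify.sanitize(article.content) };
}

// In a browser with DOMPurify loaded:
//   var safeArticle = sanitizeArticle(article, DOMPurify);
//   container.innerHTML = safeArticle.content;
```

Sanitizing right before the content is inserted into a page keeps the step close to the point of risk. Setting up DOMPurify in Node.js requires a DOM window; see its documentation for details.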
85102

86103
## Contributing
87104
