|
2 | 2 |
|
3 | 3 | A standalone version of the readability library used for Firefox Reader View.
|
4 | 4 |
|
5 |
| -## Usage on the web |
| 5 | +## Basic usage |
6 | 6 |
|
7 |
| -To parse a document, you must create a new `Readability` object from a DOM document object, and then call `parse()`. Here's an example: |
| 7 | +To parse a document, you must create a new `Readability` object from a DOM document object, and then call the [`parse()`](#parse) method. Here's an example: |
8 | 8 |
|
9 | 9 | ```javascript
|
10 | 10 | var article = new Readability(document).parse();
|
11 | 11 | ```
|
12 | 12 |
|
13 |
| -This `article` object will contain the following properties: |
| 13 | +If you use Readability in a web browser, you will likely be able to use a `document` reference from elsewhere (e.g. fetched via XMLHttpRequest, in a same-origin `<iframe>` you have access to, etc.). In Node.js, you can [use an external DOM library](#nodejs-usage). |
14 | 14 |
|
15 |
| -* `title`: article title |
16 |
| -* `content`: HTML string of processed article content |
17 |
| -* `textContent`: text content of the article (all HTML removed) |
18 |
| -* `length`: length of an article, in characters |
19 |
| -* `excerpt`: article description, or short excerpt from the content |
20 |
| -* `byline`: author metadata |
21 |
| -* `dir`: content direction |
| 15 | +## API Reference |
22 | 16 |
|
23 |
| -If you're using Readability on the web, you will likely be able to use a `document` reference |
24 |
| -from elsewhere (e.g. fetched via XMLHttpRequest, in a same-origin `<iframe>` you have access to, etc.). |
| 17 | +### `new Readability(document, options)` |
25 | 18 |
|
26 |
| -### Optional |
| 19 | +The `options` object accepts a number of properties, all optional: |
27 | 20 |
|
28 |
| -Readability's `parse()` works by modifying the DOM. This removes some elements in the web page. |
29 |
| -You could avoid this by passing the clone of the `document` object while creating a `Readability` object. |
| 21 | +* `debug` (boolean, default `false`): whether to enable logging. |
| 22 | +* `maxElemsToParse` (number, default `0` i.e. no limit): the maximum number of elements to parse. |
| 23 | +* `nbTopCandidates` (number, default `5`): the number of top candidates to consider when analysing how tight the competition is among candidates. |
| 24 | +* `charThreshold` (number, default `500`): the number of characters an article must have in order to return a result. |
| 25 | +* `classesToPreserve` (array): a set of classes to preserve on HTML elements when the `keepClasses` option is set to `false`. |
| 26 | +* `keepClasses` (boolean, default `false`): whether to preserve all classes on HTML elements. When set to `false`, only classes specified in the `classesToPreserve` array are kept. |
| 27 | +* `disableJSONLD` (boolean, default `false`): when extracting page metadata, Readability gives precedence to Schema.org fields specified in the JSON-LD format. Set this option to `true` to skip JSON-LD parsing. |
| 28 | +* `serializer` (function, default `el => el.innerHTML`): controls how the `content` property returned by the `parse()` method is produced from the root DOM element. It may be useful to specify the `serializer` as the identity function (`el => el`) to obtain a DOM element instead of a string for `content` if you plan to process it further. |
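
As a rough sketch of how these options fit together, the snippet below builds an `options` object from the properties listed above and shows what the two serializers mentioned there return. `fakeRoot` is a hypothetical stand-in for the root DOM element (a plain object, so the snippet runs without a DOM), and `"caption"` is an assumed class name:

```js
// Stand-in for the article's root DOM element; only `innerHTML` is modelled.
const fakeRoot = { innerHTML: "<p>Hello</p>" };

// The documented default serializer returns an HTML string ...
const defaultSerializer = el => el.innerHTML;
// ... while the identity function hands back the element itself.
const identitySerializer = el => el;

// Option names and defaults are taken from the list above;
// "caption" is an assumed class name used only for illustration.
const options = {
  charThreshold: 500,
  keepClasses: false,
  classesToPreserve: ["caption"],
  serializer: identitySerializer
};

const asString = defaultSerializer(fakeRoot);
const asElement = options.serializer(fakeRoot);
```

With a real document, the object would be passed as the second constructor argument, as in `new Readability(document, options)`.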
30 | 29 |
|
31 |
| -``` |
| 30 | +### `parse()` |
| 31 | + |
| 32 | +Returns an object containing the following properties: |
| 33 | + |
| 34 | +* `title`: article title; |
| 35 | +* `content`: HTML string of processed article content; |
| 36 | +* `textContent`: text content of the article, with all the HTML tags removed; |
| 37 | +* `length`: length of an article, in characters; |
| 38 | +* `excerpt`: article description, or short excerpt from the content; |
| 39 | +* `byline`: author metadata; |
| 40 | +* `dir`: content direction; |
| 41 | +* `siteName`: name of the site. |
| 42 | + |
| 43 | +The `parse()` method works by modifying the DOM. This removes some elements in the web page, which may be undesirable. You can avoid this by passing the clone of the `document` object to the `Readability` constructor: |
| 44 | + |
| 45 | +```js |
32 | 46 | var documentClone = document.cloneNode(true);
|
33 | 47 | var article = new Readability(documentClone).parse();
|
34 | 48 | ```
|
35 | 49 |
|
36 |
| -## Usage from Node.js |
| 50 | +### `isProbablyReaderable(document, options)` |
| 51 | + |
| 52 | +A quick-and-dirty way of figuring out if it's plausible that the contents of a given document are suitable for processing with Readability. It is likely to produce both false positives and false negatives. The reason it exists is to avoid bogging down a time-sensitive process (like loading and showing the user a webpage) with the complex logic in the core of Readability. Improvements to its logic (while not deteriorating its performance) are very welcome. |
| 53 | + |
| 54 | +The `options` object accepts a number of properties, all optional: |
| 55 | + |
| 56 | +* `minContentLength` (number, default `140`): the minimum node content length used to decide if the document is readerable; |
| 57 | +* `minScore` (number, default `20`): the minimum cumulated 'score' used to determine if the document is readerable; |
| 58 | +* `visibilityChecker` (function, default `isNodeVisible`): the function used to determine if a node is visible. |
| 59 | + |
| 60 | +The function returns a boolean corresponding to whether or not we suspect `Readability.parse()` will succeed at returning an article object. Here's an example: |
| 61 | + |
| 62 | +```js |
| 63 | +/* |
| 64 | + Only instantiate Readability if we suspect |
| 65 | + the `parse()` method will produce a meaningful result. |
| 66 | +*/ |
| 67 | +if (isProbablyReaderable(document)) { |
| 68 | + let article = new Readability(document).parse(); |
| 69 | +} |
| 70 | +``` |
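
The check and the parse can also be wrapped in one function so the result is usable outside the `if` block. The following is a minimal sketch, not part of the library: `extractArticle` is a hypothetical helper, the two library entry points are passed in as arguments so the snippet stays self-contained, and it assumes `parse()` returns a falsy value when no article is found.

```js
// Hypothetical helper (not provided by Readability).
// `lib` is expected to carry the two entry points described above:
// the `Readability` constructor and the `isProbablyReaderable` function.
function extractArticle(doc, lib, options = {}) {
  // Cheap heuristic first: skip the expensive parse when it looks pointless.
  if (!lib.isProbablyReaderable(doc)) {
    return null;
  }
  // parse() modifies the DOM, so work on a clone of the document.
  const clone = doc.cloneNode(true);
  const article = new lib.Readability(clone, options).parse();
  // Assumption: a failed parse yields a falsy value; normalise it to null.
  return article || null;
}
```

In Node.js, a call might look like `extractArticle(doc.window.document, require('@mozilla/readability'))`, assuming the package exports both names.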
| 71 | + |
| 72 | +## Node.js usage |
37 | 73 |
|
38 | 74 | Readability is available on npm:
|
39 | 75 |
|
40 | 76 | ```bash
|
41 | 77 | npm install @mozilla/readability
|
42 | 78 | ```
|
43 | 79 |
|
44 |
| -In Node.js, you won't generally have a DOM document object. To obtain one, you can use external |
45 |
| -libraries like [jsdom](https://github.com/jsdom/jsdom). While this repository contains a parser of |
46 |
| -its own (`JSDOMParser`), that is restricted to reading XML-compatible markup and therefore we do |
47 |
| -not recommend it for general use. |
48 |
| - |
49 |
| -If you're using `jsdom` to create a DOM object, you should ensure that the page doesn't run (page) |
50 |
| -scripts (avoid fetching remote resources etc.) as well as passing it the page's URI as the `url` |
51 |
| -property of the `options` object you pass the `JSDOM` constructor. |
52 |
| - |
53 |
| -### Example: |
| 80 | +Since Node.js does not come with its own DOM implementation, we rely on external libraries like [jsdom](https://github.com/jsdom/jsdom). Here's an example using `jsdom` to obtain a DOM document object: |
54 | 81 |
|
55 | 82 | ```js
|
56 | 83 | var { Readability } = require('@mozilla/readability');
|
57 |
| -var JSDOM = require('jsdom').JSDOM; |
58 |
| -var doc = new JSDOM("<body>Here's a bunch of text</body>", { |
| 84 | +var { JSDOM } = require('jsdom'); |
| 85 | +var doc = new JSDOM("<body>Look at this cat: <img src='./cat.jpg'></body>", { |
59 | 86 | url: "https://www.example.com/the-page-i-got-the-source-from"
|
60 | 87 | });
|
61 | 88 | let reader = new Readability(doc.window.document);
|
62 | 89 | let article = reader.parse();
|
63 | 90 | ```
|
64 | 91 |
|
65 |
| -## What's Readability-readerable? |
| 92 | +Remember to pass the page's URI as the `url` option in the `JSDOM` constructor (as shown in the example above), so that Readability can convert relative URLs for images, hyperlinks etc. to their absolute counterparts. |
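
To see what that conversion amounts to, the standard `URL` constructor (a global in Node.js and browsers) resolves a relative reference against a base URL. This only illustrates the resolution rule using the values from the example above; it is not a claim about how Readability implements it internally.

```js
// Page URL passed as the `url` option in the jsdom example above.
const pageUrl = "https://www.example.com/the-page-i-got-the-source-from";

// Resolve the image's relative path against the page URL.
const absoluteSrc = new URL("./cat.jpg", pageUrl).href;

console.log(absoluteSrc); // https://www.example.com/cat.jpg
```

Without a base URL there is nothing to resolve `./cat.jpg` against, which is why the `url` option matters.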
66 | 93 |
|
67 |
| -It's a quick-and-dirty way of figuring out if it's plausible that the contents of a given |
68 |
| -document are suitable for processing with Readability. It is likely to produce both false |
69 |
| -positives and false negatives. The reason it exists is to avoid bogging down a time-sensitive |
70 |
| -process (like loading and showing the user a webpage) with the complex logic in the core of |
71 |
| -Readability. Improvements to its logic (while not deteriorating its performance) are very |
72 |
| -welcome. |
| 94 | +`jsdom` has the ability to run the scripts included in the HTML and fetch remote resources. For security reasons these are [disabled by default](https://github.com/jsdom/jsdom#executing-scripts), and we **strongly** recommend you keep them that way. |
73 | 95 |
|
74 | 96 | ## Security
|
75 | 97 |
|
76 |
| -If you're going to use Readability with untrusted input (whether in HTML or DOM form), we |
77 |
| -**strongly** recommend you use a sanitizer library like |
78 |
| -[DOMPurify](https://github.com/cure53/DOMPurify) to avoid script injection when you use |
79 |
| -the output of Readability. We would also recommend using |
80 |
| -[CSP](https://developer.mozilla.org/en-US/docs/Web/HTTP/CSP) to add further defense-in-depth |
| 98 | +If you're going to use Readability with untrusted input (whether in HTML or DOM form), we **strongly** recommend you use a sanitizer library like [DOMPurify](https://github.com/cure53/DOMPurify) to avoid script injection when you use |
| 99 | +the output of Readability. We would also recommend using [CSP](https://developer.mozilla.org/en-US/docs/Web/HTTP/CSP) to add further defense-in-depth |
81 | 100 | restrictions to what you allow the resulting content to do. The Firefox integration of
|
82 |
| -reader mode uses both of these techniques itself. Sanitizing unsafe content out of the input |
83 |
| -is explicitly not something we aim to do as part of Readability itself - there are other |
84 |
| -good sanitizer libraries out there, use them! |
| 101 | +reader mode uses both of these techniques itself. Sanitizing unsafe content out of the input is explicitly not something we aim to do as part of Readability itself - there are other good sanitizer libraries out there, use them! |
85 | 102 |
|
86 | 103 | ## Contributing
|
87 | 104 |
|
|