22 commits
800da13
Use kotlin for build.gradle to use only one programming language
lucamolteni Nov 25, 2024
bc2583d
refactor in gradle(added toml file for gradle), readme, and some chan…
NotDroidUser Mar 25, 2025
28163b6
initial 2.0-beta commit
NotDroidUser Sep 20, 2025
12a3179
Update README.md
NotDroidUser Sep 20, 2025
52c9e65
Updated README.md and blocked the Readability4JExtended as it will no…
NotDroidUser Sep 23, 2025
ae03df9
Update settings.gradle for gitpack
NotDroidUser Oct 13, 2025
8b3f8d5
Update settings.gradle for Jitpack
NotDroidUser Sep 23, 2025
24a013f
Merge remote-tracking branch 'origin/master'
NotDroidUser Oct 13, 2025
d80d4c5
Update README.md
NotDroidUser Oct 13, 2025
967cb40
Update README.md
NotDroidUser Oct 22, 2025
e08f2c7
Update build.gradle for Jitpack
NotDroidUser Oct 13, 2025
7e0cf4a
minor fixes
NotDroidUser Jan 11, 2026
df7942d
extended is here but no actual logic (for now)
NotDroidUser Jan 11, 2026
6e1e0c8
fix: now keep-tabular-data pass
NotDroidUser Jan 11, 2026
695f620
updated to d7949dc4
NotDroidUser Jan 11, 2026
73c1ef7
this fixes wikipedia tests (lang tags are removed by html-unit someti…
NotDroidUser Jan 12, 2026
8af7d99
fix: fixes nytimes loading text
NotDroidUser Jan 12, 2026
5e54929
fix: fixes title not being get on likes remove-aria-hidden (and anyon…
NotDroidUser Jan 12, 2026
3a42f5f
fix: fixes little articles (<500 chars) not being readable
NotDroidUser Jan 12, 2026
2647fa1
fix: fixes on empty json+ld autor tag, not searching for byline (fixe…
NotDroidUser Jan 12, 2026
a1c7897
version bump
NotDroidUser Jan 12, 2026
45cc06b
fixed submodule
NotDroidUser Jan 12, 2026
14 changes: 14 additions & 0 deletions .editorconfig
@@ -0,0 +1,14 @@
# Editorconfig on (https://editorconfig.org/)

root = true

[*]
end_of_line = lf
charset = utf-8
insert_final_newline = true

[*.java]
indent_size = 4

[*.kt]
indent_size = 4
3 changes: 3 additions & 0 deletions .gitmodules
@@ -0,0 +1,3 @@
[submodule "src/test/resources/readability"]
path = src/test/resources/readability
url = https://github.com/mozilla/readability/
174 changes: 123 additions & 51 deletions README.md
@@ -1,38 +1,43 @@
# Readability4J
[![Maven Central](https://maven-badges.herokuapp.com/maven-central/net.dankito.readability4j/readability4j/badge.svg)](https://maven-badges.herokuapp.com/maven-central/net.dankito.readability4j/readability4j)
[![JitPack](https://jitpack.io/v/NotDroidUser/Readability4J.svg)](https://jitpack.io/v/NotDroidUser/Readability4J.svg)

Readability4J is a Kotlin port of Mozilla's Readability.js, which is used for Firefox's reader view: https://github.com/mozilla/readability.

It tries to detect the relevant content of a website and removes all clutter from it such as advertisements, navigation bars, social media buttons, etc.

The extracted text then can be used for indexing web pages, to provide the user a pleasant reading experience and similar.

As it‘s compatible with Mozilla‘s Readability.js it produces exact the same output as you would see in Firefox‘s Reader View (just some white spaces differ due to Jsoup‘s different formatting, but you can‘t see them anyway).
As it's compatible with Mozilla's Readability.js, it produces almost exactly the same output as you would see in Firefox's Reader View (some details differ because Jsoup behaves slightly differently in some cases, but most of these differences are not visible anyway).

## Setup

Gradle:
```
dependencies {
compile 'net.dankito.readability4j:readability4j:1.0.8'
}
```
Step 1. Add the JitPack repository to your root settings.gradle, at the end of repositories:

Maven:
```
<dependency>
<groupId>net.dankito.readability4j</groupId>
<artifactId>readability4j</artifactId>
<version>1.0.8</version>
</dependency>
```groovy
dependencyResolutionManagement {
repositoriesMode.set(RepositoriesMode.FAIL_ON_PROJECT_REPOS)
repositories {
mavenCentral()
maven { url 'https://jitpack.io' }
}
}
```

Step 2. Add the dependency

```groovy
dependencies {
implementation 'com.github.NotDroidUser:Readability4J:2.0.0-jitpack-beta'
}
```

## Usage

From Java:

```java
String url = ...;
String html = ...;
String url = "some-page.com";
String html = "Some Bloated Article html source";

Readability4J readability4J = new Readability4J(url, html); // url is just needed to resolve relative urls
Article article = readability4J.parse();
@@ -46,38 +51,80 @@ String title = article.getTitle();
String byline = article.getByline();
String excerpt = article.getExcerpt();
```
From Kotlin:

## Readability4J and Readability4JExtended
```kotlin

With Readability4J class I wanted to stick close to Mozilla's Readability to keep compatibility.
val url = "somepage.com"
val html = "Some Bloated Article html source"

But during development I found some handy features not supported by Readability, e. g. copying url from data-src
attribute to &lt;img src="" /> to display lazy loading images, using &lt;head>&lt;base>'s href value for resolving
relative urls and a
better
detection of
which
images to keep in output.
val readability4J = Readability4J(url, html) // url is just needed to resolve relative urls
val article = readability4J.parse()

These features I implemented in Readability4JExtended.
// returns extracted content in a <div> element
val extractedContentHtml = article.getContent()
// to get content wrapped in <html> tags and encoding set to UTF-8, see chapter 'Output encoding'
val extractedContentHtmlWithUtf8Encoding = article.getContentWithUtf8Encoding()
val extractedContentPlainText = article.getTextContent()
val title = article.getTitle()
val byline = article.getByline()
val excerpt = article.getExcerpt()

If you want to use it, simply instantiate with (the rest of the code stays the same):
```
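The `url` argument is only used to resolve relative URLs found in the page. As a rough illustration of what that resolution means, here is a minimal sketch that uses only `java.net.URI` from the standard library. It is not Readability4J's own code, and the class and method names (`RelativeUrlSketch`, `resolve`) are hypothetical. It assumes the page URL is absolute, that is, it includes a scheme such as `https://`.

```java
import java.net.URI;

class RelativeUrlSketch {
    // Resolves a link found in the article against the URL of the page (hypothetical helper).
    static String resolve(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String pageUrl = "https://example.com/articles/2024/post.html";
        // Path-relative link: resolved against the directory of the page
        System.out.println(resolve(pageUrl, "../images/cover.png"));
        // Host-relative link: resolved against the site root
        System.out.println(resolve(pageUrl, "/about"));
        // Absolute link: returned unchanged
        System.out.println(resolve(pageUrl, "https://cdn.example.org/logo.png"));
    }
}
```

With a base URL that has no scheme, `java.net.URI` cannot produce an absolute result, so passing the full page address is the safer choice.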

<pre>
Readability4J readability4J = new <b>Readability4JExtended</b>(url, html);
Article article = readability4J.parse();
</pre>
# Why can't I use Readability4JExtended now?

<!--As you can see in the code, it is divided into 4 classes: Preprocessor, MetadataParser, ArticleGrabber and PostProcessor.
Preprocessor works on the HTML: it removes tags such as script and style as well as successive br tags, changes font tags into span tags and unwraps images inside noscript tags.
MetadataParser parses the meta tags and the ld+json data for article info before the scripts are removed.
ArticleGrabber is where the magic happens.
PostProcessor is where the URLs of a tags are converted from relative to absolute.
-->
Because the Readability.js code changed a lot since the last ported commit (2018-2025), I first updated the Readability4J code base itself, to make the update process less stressful. In the meantime you can get similar behavior with classes like:

From Java:

```java
String url = "some-specific-page.com";
String html = "Some Bloated Article html source that needs extra steps";

Readability4J readability4J = new Readability4J(url, html);
ArticleGrabber extended = new ArticleGrabber(readability4J.getOptions(), new BaseRegexUtilExtended());
readability4J.setArticleGrabber(extended);
```

From Kotlin:

```kotlin
val url = "some-specific-page.com"
val html = "Some Bloated Article html source that needs extra steps"

val readability4J = Readability4J(url, html)
readability4J.articleGrabber = ArticleGrabber(readability4J.options, BaseRegexUtilExtended())
```

Some features of the original Readability4JExtended, such as the data-src handling, are now implemented in Readability4J itself (the srcset regex, for example).
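To picture what the data-src handling does for lazy-loaded images, here is a small, stdlib-only Java sketch. It is not the library's code: Readability4J works on a Jsoup document, while this sketch uses regular expressions only to stay self-contained. The class and method names (`LazyImageSketch`, `copyDataSrcToSrc`) are hypothetical, and the sketch assumes double-quoted attribute values and no `>` character inside attributes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyImageSketch {
    // Matches an <img ...> tag and captures its attribute text
    private static final Pattern IMG_TAG =
            Pattern.compile("<img\\b([^>]*)>", Pattern.CASE_INSENSITIVE);
    // Captures the value of a double-quoted data-src attribute
    private static final Pattern DATA_SRC =
            Pattern.compile("\\bdata-src\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    // Matches a plain src attribute, but not the "src" inside "data-src"
    private static final Pattern SRC =
            Pattern.compile("(?<![\\w-])src\\s*=", Pattern.CASE_INSENSITIVE);

    // Adds src="..." to every <img> that has a data-src but no src of its own.
    static String copyDataSrcToSrc(String html) {
        Matcher img = IMG_TAG.matcher(html);
        StringBuilder out = new StringBuilder();
        while (img.find()) {
            String attrs = img.group(1);
            String tag = img.group();
            Matcher dataSrc = DATA_SRC.matcher(attrs);
            if (dataSrc.find() && !SRC.matcher(attrs).find()) {
                tag = "<img src=\"" + dataSrc.group(1) + "\"" + attrs + ">";
            }
            img.appendReplacement(out, Matcher.quoteReplacement(tag));
        }
        img.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(copyDataSrcToSrc("<p><img data-src=\"a.png\" alt=\"x\"></p>"));
    }
}
```

Images that already carry a `src` attribute are left untouched, so the sketch never overrides a URL the page set explicitly.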

<!--
## Readability4J and Readability4JExtended (not yet updated)

With Readability4J class I wanted to stick close to Mozilla's Readability to keep compatibility.
But during development I found some handy features not supported by Readability, e. g. copying url from data-src attribute to `<img src="" />` to display lazy loading images, using `<head> <base>`'s href value for resolving relative urls and a better detection of which images to keep in output. These features I implemented in Readability4JExtended. If you want to use it, simply instantiate with (the rest of the code stays the same):
```java
Readability4J readability4J = new Readability4JExtended(url, html);
Article article = readability4J.parse();
```
-->

## Output encoding

As users noted (see Issue [#1](https://github.com/dankito/Readability4J/issues/1) and [#2](https://github.com/dankito/Readability4J/issues/2))
by default no encoding is applied to Readability4J's output resulting in incorrect display of non-ASCII characters.
As users noted (see Issues [#1](https://github.com/dankito/Readability4J/issues/1) and [#2](https://github.com/dankito/Readability4J/issues/2)), by default no encoding is applied to Readability4J's output, which results in incorrect display of non-ASCII characters.

The reason is like Readability.js Readability4J returns its output in a &lt;div> element, and the only way to set the
encoding in HTML is in a &lt;head>&lt;meta charset=""> tag.
The reason is that, like Readability.js, Readability4J returns its output in a `<div>` element, and the only way to set the encoding in HTML is in a `<head> <meta charset="">` tag.
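The "incorrect display" can be reproduced without the library. The following stdlib-only sketch (the class and method names are hypothetical) encodes a fragment as UTF-8 and then decodes the bytes with a different charset, which is roughly what happens when a viewer has to guess the encoding of a bare `<div>`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSketch {
    // Encodes the text as UTF-8 bytes, then decodes those bytes with the assumed charset.
    static String decodeUtf8BytesAs(String text, Charset assumed) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, assumed);
    }

    public static void main(String[] args) {
        // The word "cafe" with an acute accent on the e, escaped to keep this source ASCII-only
        String fragment = "<div>caf\u00e9</div>";
        // Wrong guess: the two UTF-8 bytes of the accented letter become two separate characters
        System.out.println(decodeUtf8BytesAs(fragment, StandardCharsets.ISO_8859_1));
        // Correct charset: the fragment round-trips unchanged
        System.out.println(decodeUtf8BytesAs(fragment, StandardCharsets.UTF_8));
    }
}
```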

So I added these convenience methods to Article class
So I added these convenience methods to the Article class:

From Java:
```java
String contentHtmlWithUtf8Encoding = article.getContentWithUtf8Encoding();
// or (tries to apply the site's charset if set, otherwise uses UTF-8 as fallback)
@@ -86,12 +133,22 @@ String contentWithDocumentsCharsetOrUtf8 = article.getContentWithDocumentsCharse
String contentHtmlWithCustomEncoding = article.getContentWithEncoding("ISO-8859-1");
```

which wrap the content in
From Kotlin:

```kotlin
val contentHtmlWithUtf8Encoding = article.contentWithUtf8Encoding
// or (tries to apply the site's charset if set, otherwise uses UTF-8 as fallback)
val contentWithDocumentsCharsetOrUtf8 = article.contentWithDocumentsCharsetOrUtf8
// or
val contentHtmlWithCustomEncoding = article.getContentWithEncoding("ISO-8859-1")
```

These methods wrap the content in:

```
<html>
<head>
<meta charset="utf-8" />
<meta charset="$encoding" />
</head>
<body>
<!-- content -->
@@ -101,16 +158,16 @@ which wrap the content in

## Compatibility with Mozilla‘s Readability.js

As mentioned before, this is almost an exact copy of Mozilla's Readability.js. But since I didn't find the original code very readable itself, I extracted some parts from the 2000 lines of code into a new classes:
As mentioned before, this is almost an exact copy of Mozilla's Readability.js. But since code kept in a single file can be almost unreadable, I extracted some parts of the 2000+ lines of code into new classes:

<table>
<tr>
<th>Readability.js function</td>
<th>Readability4J location</td>
<td>Readability.js function</td>
<td>Readability4J location</td>
</tr>
<tr>
<td>_removeScripts() and _prepDocument()</td>
<td>Preprocessor.prepareDocument()</td>
<td>_unwrapNoscriptImages(), _removeScripts() and _prepDocument()</td>
<td>Preprocessor.unwrapNoscriptImages(), Preprocessor.removeScripts() and Preprocessor.prepDocument()</td>
</tr>
<tr>
<td>_grabArticle()</td>
@@ -121,19 +178,20 @@ As mentioned before, this is almost an exact copy of Mozilla's Readability.js. B
<td>Postprocessor.postProcessContent()</td>
</tr>
<tr>
<td>_getArticleMetadata()</td>
<td>MetadataParser.getArticleMetadata()</td>
<td>_getJSONLD(), _getArticleMetadata()</td>
<td>MetadataParser.getJSONLD(), MetadataParser.getArticleMetadata()</td>
</tr>
</table>

I added some logging functions to Util.kt so that nodes are logged the same way as in JavaScript, which makes them easy to compare in test cases. I also rolled Jackson back to the latest version compatible with Android API 19-25.

Overview of which Mozilla‘s Readability.js commit a Readability4J version matches:

<table>
<tr>
<th>Version</td>
<th>Commit</td>
<th>Date</td>
<td>Version</td>
<td>Commit</td>
<td>Date</td>
</tr>
<tr>
<td>1.0</td>
@@ -145,11 +203,25 @@ Overview of which Mozilla‘s Readability.js commit a Readability4J version matc
<td>834672e</td>
<td>02/27/18</td>
</tr>
<tr>
<td>2.0.0-beta</td>
<td>[v0.6.0](https://github.com/mozilla/readability/commit/04fd32f72b448c12b02ba6c40928b67e510bac49) (almost all tests pass)</td>
<td>10/13/25</td>
</tr>
<tr>
<td>2.1.0-rc</td>
<td>[d7949dc4](https://github.com/mozilla/readability/commit/d7949dc4) (only 4 tests fail, with minor differences)</td>
<td>01/12/26</td>
</tr>
</table>

## Testing

I added readability.js as a Git submodule so that the tests stay up to date with upstream. I also don't take their expected results for granted: the tests call readability.js itself inside HtmlUnit, after some regex-based changes to the script. Some of these changes are syntactic ([see rhino compat](https://mozilla.github.io/rhino/compat/engines.html#ES2015-syntax-spread-syntax-for-iterable-objects)) and some are non-syntactic, so that it can run as a function rather than as a class.

## Extensibility

I tried to create the library as extensible as possible. All above mentioned classes can be overwritten and passed to Readability4J's constructor.
I tried to keep the library as extensible as possible. All of the above-mentioned classes can be overridden and then assigned to the corresponding property of the Readability4J instance.

## Logging

@@ -159,7 +231,7 @@ So you can use any logger that supports slf4j, like Logback and log4j, to config

# License

Copyright 2017 dankito
Copyright 2017 dankito, 2025 NotDroidUser

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.