forked from mozilla/readability
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Improve metadata extraction (mozilla#478)
* Improve metadata extraction * Recognize meta[property] as a space-separated list * Recognize Dulin Core (dc|dcterm): metadata. * Prefer Dublin Core, Open Graph, Twitter, and HTML in that order. * _getArticleTitle() is now only used as fallback if document doesn't provide good metadata.
- Loading branch information
Showing
33 changed files
with
211 additions
and
73 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
7 changes: 7 additions & 0 deletions
7
test/test-pages/003-metadata-preferred/expected-metadata.json
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
{ | ||
"title": "Dublin Core property title", | ||
"byline": "Dublin Core property author", | ||
"dir": null, | ||
"excerpt": "Dublin Core property description", | ||
"readerable": true | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
<div id="readability-page-1" class="page"> | ||
<article> | ||
<p> | ||
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod | ||
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, | ||
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo | ||
consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse | ||
cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non | ||
proident, sunt in culpa qui officia deserunt mollit anim id est laborum. | ||
</p> | ||
<p> | ||
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod | ||
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, | ||
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo | ||
consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse | ||
cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non | ||
proident, sunt in culpa qui officia deserunt mollit anim id est laborum. | ||
</p> | ||
</article> | ||
</div> |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,45 @@ | ||
<!DOCTYPE html> | ||
<html> | ||
<head> | ||
<meta charset="utf-8"/> | ||
<title>Title Element</title> | ||
<meta name="title" content="Meta name title"/> | ||
<meta name="og:title" content="Open Graph name title"/> | ||
<meta name="twitter:title" content="Twitter name title"/> | ||
<meta name="DC.title" content="Dublin Core name title"/> | ||
<meta property="dc:title" content="Dublin Core property title"/> | ||
<meta property="twitter:title" content="Twitter property title"/> | ||
<meta property="og:title" content="Open Graph property title"/> | ||
<meta name="author" content="Meta name author"/> | ||
<meta name="DC.creator" content="Dublin Core name author"/> | ||
<meta property="dc:creator" content="Dublin Core property author"/> | ||
<meta name="description" content="Meta name description"/> | ||
<meta name="og:description" content="Open Graph name description"/> | ||
<meta name="twitter:description" content="Twitter name description"/> | ||
<meta name="DC.description" content="Dublin Core name description"/> | ||
<meta property="dc:description" content="Dublin Core property description"/> | ||
<meta property="twitter:description" content="Twitter property description"/> | ||
<meta property="og:description" content="Open Graph property description"/> | ||
</head> | ||
<body> | ||
<article> | ||
<h1>Test document title</h1> | ||
<p> | ||
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod | ||
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, | ||
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo | ||
consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse | ||
cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non | ||
proident, sunt in culpa qui officia deserunt mollit anim id est laborum. | ||
</p> | ||
<p> | ||
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod | ||
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, | ||
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo | ||
consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse | ||
cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non | ||
proident, sunt in culpa qui officia deserunt mollit anim id est laborum. | ||
</p> | ||
</article> | ||
</body> | ||
</html> |
7 changes: 7 additions & 0 deletions
7
test/test-pages/004-metadata-space-separated-properties/expected-metadata.json
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
{ | ||
"title": "Preferred title", | ||
"byline": "Creator Name", | ||
"dir": null, | ||
"excerpt": "Preferred description", | ||
"readerable": true | ||
} |
20 changes: 20 additions & 0 deletions
20
test/test-pages/004-metadata-space-separated-properties/expected.html
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
<div id="readability-page-1" class="page"> | ||
<article> | ||
<p> | ||
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod | ||
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, | ||
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo | ||
consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse | ||
cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non | ||
proident, sunt in culpa qui officia deserunt mollit anim id est laborum. | ||
</p> | ||
<p> | ||
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod | ||
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, | ||
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo | ||
consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse | ||
cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non | ||
proident, sunt in culpa qui officia deserunt mollit anim id est laborum. | ||
</p> | ||
</article> | ||
</div> |
35 changes: 35 additions & 0 deletions
35
test/test-pages/004-metadata-space-separated-properties/source.html
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,35 @@ | ||
<!DOCTYPE html> | ||
<html> | ||
<head> | ||
<meta charset="utf-8"/> | ||
<title>Title Element</title> | ||
<meta property="x:title dc:title" content="Preferred title"/> | ||
<meta property="og:title twitter:title" content="A title"/> | ||
<meta property="dc:creator twitter:site_name" content="Creator Name"/> | ||
<meta name="author" content="FAIL"/> | ||
<meta property="og:description x:description twitter:description" content="A description"/> | ||
<meta property="dc:description og:description" content="Preferred description"/> | ||
<meta name="description" content="FAIL"/> | ||
</head> | ||
<body> | ||
<article> | ||
<h1>Test document title</h1> | ||
<p> | ||
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod | ||
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, | ||
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo | ||
consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse | ||
cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non | ||
proident, sunt in culpa qui officia deserunt mollit anim id est laborum. | ||
</p> | ||
<p> | ||
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod | ||
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, | ||
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo | ||
consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse | ||
cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non | ||
proident, sunt in culpa qui officia deserunt mollit anim id est laborum. | ||
</p> | ||
</article> | ||
</body> | ||
</html> |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,6 @@ | ||
{ | ||
"title": "Obama admits US gun laws are his 'biggest frustration'", | ||
"title": "Obama admits US gun laws are his 'biggest frustration' - BBC News", | ||
"byline": null, | ||
"excerpt": "President Barack Obama tells the BBC his failure to pass", | ||
"excerpt": "President Barack Obama tells the BBC his failure to pass \"common sense gun safety laws\" is the greatest frustration of his presidency.", | ||
"readerable": true | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,6 @@ | ||
{ | ||
"title": "Student Dies After Diet Pills She Bought Online \"Burned Her Up From Within\"", | ||
"byline": "Mark Di Stefano", | ||
"excerpt": "An inquest into Eloise Parry's death has been adjourned until July...", | ||
"byline": null, | ||
"excerpt": "An inquest into Eloise Parry's death has been adjourned until July.", | ||
"readerable": true | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,7 +1,7 @@ | ||
{ | ||
"title": "How to Build a Terrarium (with Pictures)", | ||
"title": "How to Build a Terrarium | eHow", | ||
"byline": "Lucy Akins", | ||
"dir": null, | ||
"excerpt": "How to Build a Terrarium. Glass cloche terrariums are not only appealing to the eye, but they also preserve a bit of nature in your home and serve as a simple, yet beautiful, piece of art. Closed terrariums are easy to care for, as they retain much of their own moisture and provide a warm environment with a consistent level of humidity. You...", | ||
"excerpt": "Glass cloche terrariums are not only appealing to the eye, but they also preserve a bit of nature in your home and serve as a simple, yet beautiful, piece of art. Closed terrariums are easy to care for, as they retain much of their own moisture and provide a warm environment with a consistent level of humidity. You won’t have to water the...", | ||
"readerable": true | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,7 +1,7 @@ | ||
{ | ||
"title": "How to Throw a Graduation Party on a Budget (with Pictures)", | ||
"title": "How to Throw a Graduation Party on a Budget | eHow", | ||
"byline": "Gina Roberts-Grey", | ||
"dir": null, | ||
"excerpt": "How to Throw a Graduation Party on a Budget. Graduation parties are a great way to commemorate the years of hard work teens and college co-eds devote to education. They’re also costly for mom and dad.The average cost of a graduation party in 2013 was a whopping $1,200, according to Graduationparty.com; $700 of that was allocated for food....", | ||
"excerpt": "Graduation parties are a great way to commemorate the years of hard work teens and college co-eds devote to education. They’re also costly for mom and dad.The average cost of a graduation party in 2013 was a whopping $1,200, according to Graduationparty.com; $700 of that was allocated for food. However that budget was based on Midwestern...", | ||
"readerable": true | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,7 +1,7 @@ | ||
{ | ||
"title": "Xbox One X review: A console that keeps up with gaming PCs", | ||
"title": "Xbox One X review: A console that keeps up with gaming PCs", | ||
"byline": null, | ||
"dir": null, | ||
"excerpt": "The Xbox One X is the ultimate video game system. It sports more horsepower than any system ever. And it plays more titles in native 4K than Sony's PlayStation...", | ||
"excerpt": "The Xbox One X is the most powerful gaming console ever, but it's not for everyone yet.", | ||
"readerable": true | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,6 @@ | ||
{ | ||
"title": "1Password für Mac generiert Einmal-Passwörter", | ||
"byline": null, | ||
"byline": "Mac & i", | ||
"excerpt": "Das in der iOS-Version bereits enthaltene TOTP-Feature ist nun auch für OS X 10.10 verfügbar. Zudem gibt es neue Zusatzfelder in der Datenbank und weitere Verbesserungen.", | ||
"readerable": true | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,5 @@ | ||
{ | ||
"title": "draft-dejong-remotestorage-04 - remoteStorage", | ||
"byline": "AUTHORING", | ||
"title": "remoteStorage", | ||
"byline": "Jong, Michiel de", | ||
"readerable": true | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,6 @@ | ||
{ | ||
"title": "Una solución no violenta para la cuestión mapuche - 07.12.2017", | ||
"title": "Una solución no violenta para la cuestión mapuche", | ||
"byline": null, | ||
"excerpt": "Una solución no violenta para la cuestión mapuche | Los pueblos indígenas reclaman por derechos que permanecen incumplidos, por eso es más eficiente canalizar la protesta que reprimirla - LA NACION", | ||
"excerpt": "Los pueblos indígenas reclaman por derechos que permanecen incumplidos, por eso es más eficiente canalizar la protesta que reprimirla", | ||
"readerable": true | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,7 +1,7 @@ | ||
{ | ||
"title": "Samantha and The Great Big Lie – John C. Welch – Medium", | ||
"title": "Samantha and The Great Big Lie", | ||
"byline": "John C. Welch", | ||
"dir": null, | ||
"excerpt": "(EDIT: removed the link to Samantha’s post, because the arments and the grubers and the rest of The Deck Clique got what they wanted: a non-proper person driven off the internet lightly capped with a…", | ||
"excerpt": "How to get shanked doing what people say they want", | ||
"readerable": true | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.