Skip to content
This repository has been archived by the owner on Jan 12, 2021. It is now read-only.
/ readembedability Public archive

Turn unstructured webpages into structured content. Readability + oembed

License

Notifications You must be signed in to change notification settings

bmuller/readembedability

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

83 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Readembedability

Build Status

Installation

Readembedability is an open source version of Readability or Diffbot that also handles creating a unified view of oEmbed pages (for instance, twitter or youtube pages).

pip install readembedability

Usage

Assume you want to extract all of the meaningful data from an article on the NY Times:

from readembedability.page import get_readembedable
import asyncio

loop = asyncio.get_event_loop()
url = "www.nytimes.com/politics/first-draft/2015/03/13/judge-orders-state-dept-to-release-records-from-clinton-trips/"
result = loop.run_until_complete(get_readembedable(url))
print(result)

The result of calling get_readembedable will give you a dictionary with the following keys:

  • primary_image: The full URL to the image that is most likely the primary one for the page.
  • secondary_images: A list of all other images that appear and seem related to the content.
  • authors: The author names, if it can be pulled out.
  • url: The original URL passed as a parameter.
  • canonical_url: The URL for the page that had the content (for instance, after following all redirects)
  • title: Page title
  • summary: Few sentence summary of the content
  • content: Meaningful/relevant text content from the page.
  • published_at: Date of publishing.
  • keywords: Keywords pulled from the content
  • embed: Whether the content is HTML suitable for embedding (for instance, via oEmbed)

How It Works

Readembedability utilizes a number of libraries that all try to extract meaningful information from poorly structured web pages. It runs content through all of them, extracting the best guess at, say, the author after each pass. Some libraries are good at extracting text, others at images, etc. Readembedability uses each library for the task it seems best able to perform.

Running Tests

To run tests:

python -m unittest

About

Turn unstructured webpages into structured content. Readability + oembed

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published