forked from timbertson/python-readability
Adblock Plus element hiding rules use CSS selectors to identify the elements to hide. Applying them is easy to implement with lxml, if somewhat slow.
I'm using this in my own code to automatically remove social media share links from pages. You may want to consider including something similar in python-readability.
EasyList is dual-licensed under Creative Commons Attribution-ShareAlike 3.0 Unported and the GNU General Public License version 3. CC-BY-SA looks compatible with Apache-licensed projects.
Example
First download the rules:
    $ wget https://easylist-downloads.adblockplus.org/fanboy-annoyance.txt
Then extract the CSS selectors and match them against a document tree:
    from lxml import html
    from lxml.cssselect import CSSSelector

    RULES_PATH = 'fanboy-annoyance.txt'

    with open(RULES_PATH, 'r', encoding='utf-8') as f:
        lines = f.read().splitlines()

    # get elemhide rules (prefixed by ##) and create a CSSSelector for each of them
    rules = [CSSSelector(line[2:]) for line in lines if line[:2] == '##']

    def remove_ads(tree):
        for rule in rules:
            for matched in rule(tree):
                # skip the root element, which has no parent to be removed from
                if matched.getparent() is not None:
                    # drop_tree() keeps the text that follows the element;
                    # getparent().remove(matched) would discard that tail text
                    matched.drop_tree()

    doc = html.document_fromstring("<html>...</html>")
    remove_ads(doc)
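
The example above keeps only lines that start with `##`, which are the rules that apply on every site. Filter lists also contain domain-scoped rules (`example.com##selector`) and exception rules (`example.com#@#selector`). As a rough sketch of how those could be told apart before compiling any selectors, something like the following might work. The names `ElemHideRule`, `parse_elemhide` and `generic_selectors` are made up for illustration and are not part of lxml or python-readability, and extended syntaxes such as `#?#` are simply ignored here.

```python
import re
from typing import Iterable, List, NamedTuple, Optional, Tuple

# Optional domain list (no '#' in it), then '##' or '#@#', then the selector.
_ELEMHIDE_RE = re.compile(r"^([^#]*)(#@#|##)(.+)$")


class ElemHideRule(NamedTuple):
    domains: Tuple[str, ...]  # empty tuple: the rule applies on every site
    selector: str             # CSS selector text, not yet compiled
    is_exception: bool        # True for '#@#' rules


def parse_elemhide(line: str) -> Optional[ElemHideRule]:
    """Parse one filter-list line; return None if it is not an elemhide rule."""
    line = line.strip()
    # Skip blank lines, '!' comments and the '[Adblock Plus x.y]' header.
    if not line or line.startswith(("!", "[")):
        return None
    match = _ELEMHIDE_RE.match(line)
    if match is None:
        # Network filters and extended rule types end up here.
        return None
    domains, separator, selector = match.groups()
    return ElemHideRule(
        domains=tuple(d for d in domains.split(",") if d),
        selector=selector,
        is_exception=(separator == "#@#"),
    )


def generic_selectors(lines: Iterable[str]) -> List[str]:
    """Return the selectors of hiding rules that are not tied to a domain."""
    parsed = (parse_elemhide(line) for line in lines)
    return [
        rule.selector
        for rule in parsed
        if rule is not None and not rule.domains and not rule.is_exception
    ]
```

The list returned by `generic_selectors(lines)` could then be passed to `CSSSelector` in place of the `line[2:]` slices. Keeping the parsing separate from the lxml step also leaves room to skip selectors that lxml cannot compile.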