extractlinks

Extract and repair links from Requests response objects, including redirects and the final landing page.

Installation

pip install extractlinks
python3 -m pip install extractlinks

Usage

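import requests
from extractlinks import ExtractLinks

URL = "http://cnn.com/"
# follow redirects so the final landing page is reached
r = requests.get(URL, allow_redirects=True)
# pass the Response object itself to ExtractLinks
e = ExtractLinks(content=r)
# print the extraction results as a JSON string
print(e.json)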
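
The description mentions redirects and the final landing page. As a rough sketch of where that chain lives in Requests itself, `Response.history` holds the intermediate redirect responses and `Response.url` is the final URL. The helper name `redirect_chain` is hypothetical and not part of extractlinks, and the stand-in object below only mimics a `requests.Response` so the sketch runs without network access.

```python
from types import SimpleNamespace

def redirect_chain(response):
    """Return the URLs visited, in order, ending with the final landing page."""
    # Response.history lists the intermediate redirect responses, oldest first
    hops = [hop.url for hop in response.history]
    hops.append(response.url)
    return hops

# Stand-in for a requests.Response (hypothetical data, no network needed)
fake = SimpleNamespace(
    url="https://www.cnn.com/",
    history=[SimpleNamespace(url="http://cnn.com/")],
)
chain = redirect_chain(fake)
print(chain)
```

With a real response, `redirect_chain(r)` could be called on the `r` from the usage snippet above.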

Example Output

[
	{
		"@timestamp": "2021-06-26T16:33:20.384Z",
		"url": {
			"full": "https://www.cnn.com/",
			"original": "https://www.cnn.com/",
			"scheme": "https",
			"domain": "www.cnn.com",
			"path": "/"
		},
		"http": {
			"response": {
				"status_code": 200,
				"status_code_reason": "OK",
				"body_bytes": 1110460
			}
		},
		"chainitem": 2,
		"pguid": "1ff26fce-21a0-401a-9d53-1f863c6e3e31",
		"guid": "59dcfa56-b6d2-4924-bae1-70dbcd9d8309",
		"count": 324,
		"types": [
			"a-href",
			"form-action",
			"link-href",
			"meta-content",
			"script-src"
		],
		"tags": [
			"script",
			"meta",
			"a",
			"form",
			"link"
		],
		"attributes": [
			"action",
			"content",
			"src",
			"href"
		],
		"links": [
			"https://www.cnn.com/specials/cnn-investigates",
			"https://www.cnn.com/specials/tech/innovate",
			"https://www.cnn.com/travel/news",
			"https://www.i.cdn.cnn.com/.a/fonts/cnn/3.9.0/cnnsans-italic.woff2"
		...
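
Since `json` is described as a JSON string, it can be loaded with the standard `json` module and summarized. The following is a minimal sketch, assuming each record carries `url`, `http`, and `links` keys at the positions suggested by the sample output; the sample is truncated, so the exact nesting is an assumption, and `sample_json` is a trimmed stand-in rather than real library output.

```python
import json

# Trimmed stand-in for e.json; field placement is assumed from the sample output
sample_json = """
[
  {
    "url": {"full": "https://www.cnn.com/", "domain": "www.cnn.com"},
    "http": {"response": {"status_code": 200}},
    "count": 2,
    "links": [
      "https://www.cnn.com/travel/news",
      "https://www.cnn.com/specials/tech/innovate"
    ]
  }
]
"""

records = json.loads(sample_json)

# One (URL, status code, number of links) tuple per record in the list
summary = [
    (rec["url"]["full"], rec["http"]["response"]["status_code"], len(rec["links"]))
    for rec in records
]
for full, status, n_links in summary:
    print(f"{full} -> HTTP {status}, {n_links} links")
```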

Objects

# primary output: a list of dictionaries and its JSON dump
# both contain the full link extractions, including items not recognized as URLs or mobile links
output # list of dictionaries
json # JSON string

# lists
links_all # full links only, plus relative links "repaired" back to full-link format (e.g. /images becomes https://www.cnn.com/images)
types_all # e.g. "a-href", "img-src"
tags_all # e.g. "a", "img"
attributes_all # e.g. "href", "src"

# generators, available if the urlbreakdown module is installed; each runs URLBreakdown on every link in links_all
urlbreakdown_generator_dict()
urlbreakdown_generator_json()
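
The "repair" of relative links described for `links_all` can be illustrated with the standard library. This is only a sketch of the general technique using `urllib.parse.urljoin`; it is not necessarily how extractlinks implements it, and `repair_link` is a hypothetical helper name.

```python
from urllib.parse import urljoin

def repair_link(base_url, href):
    """Resolve href against base_url; absolute links pass through unchanged."""
    return urljoin(base_url, href)

base = "https://www.cnn.com/"
candidates = ("/images", "travel/news", "https://example.com/a")
repaired = [repair_link(base, href) for href in candidates]
print(repaired)
```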

Notes

Select URL and HTTP output fields align with the Elastic Common Schema.
links_count is not a unique count; it includes every object identified, including non-URLs found in otherwise link-related tag attributes.
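
Because the count is not unique and may include non-URLs, a caller who wants a unique count of real URLs has to filter and de-duplicate. A minimal sketch follows, assuming a plain list of extracted strings; `unique_urls` and the sample entries are hypothetical and not part of extractlinks.

```python
from urllib.parse import urlsplit

def unique_urls(links):
    """Keep first occurrences of entries that parse as http(s) URLs with a host."""
    seen = set()
    result = []
    for link in links:
        parts = urlsplit(link)
        is_url = parts.scheme in ("http", "https") and bool(parts.netloc)
        if is_url and link not in seen:
            seen.add(link)
            result.append(link)
    return result

# Hypothetical extraction results: one duplicate and two non-URL attribute values
raw = [
    "https://www.cnn.com/travel/news",
    "https://www.cnn.com/travel/news",
    "javascript:void(0)",
    "width=device-width",
]
deduped = unique_urls(raw)
print(len(raw), len(deduped))
```

Order is preserved here so the result stays comparable with the original list; a plain `set` would be enough if order does not matter.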

