Python package docs
You are currently reading the waybackpy documentation for using it as a Python package. If you want to use waybackpy as a CLI tool, visit our CLI docs.
- Archiving/Saving a webpage
- Retrieve archive of webpage
- Count total number of archives for a webpage
- List of URLs that Wayback Machine knows and has archived for a domain name
- CDX Server API
>>> import waybackpy
>>>
>>> url = "https://en.wikipedia.org/wiki/Multivariable_calculus"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>>
>>> save_api = waybackpy.WaybackMachineSaveAPI(url, user_agent=user_agent)
>>> save_api.save()
'https://web.archive.org/web/20220122102014/https://en.wikipedia.org/wiki/Multivariable_calculus'
>>> save_api.cached_save
False
>>> save_api.headers
{'Server': 'nginx/1.19.5', 'Date': 'Sat, 22 Jan 2022 10:20:19 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'x-archive-orig-date': 'Fri, 21 Jan 2022 23:32:39 GMT', 'x-archive-orig-server': 'mw1407.eqiad.wmnet', 'x-archive-orig-x-content-type-options': 'nosniff', 'x-archive-orig-p3p': 'CP="See https://en.wikipedia.org/wiki/Special:CentralAutoLogin/P3P for more info."', 'x-archive-orig-content-language': 'en', 'x-archive-orig-vary': 'Accept-Encoding,Cookie,Authorization', 'x-archive-orig-last-modified': 'Fri, 21 Jan 2022 23:16:22 GMT', 'x-archive-orig-content-encoding': 'gzip', 'x-archive-orig-age': '38855', 'x-archive-orig-x-cache': 'cp4027 miss, cp4030 hit/2', 'x-archive-orig-x-cache-status': 'hit-front', 'x-archive-orig-server-timing': 'cache;desc="hit-front", host;desc="cp4030"', 'x-archive-orig-strict-transport-security': 'max-age=106384710; includeSubDomains; preload', 'x-archive-orig-report-to': '{ "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }', 'x-archive-orig-nel': '{ "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}', 'x-archive-orig-permissions-policy': 'interest-cohort=()', 'x-archive-orig-set-cookie': 'WMF-Last-Access=22-Jan-2022;Path=/;HttpOnly;secure;Expires=Wed, 23 Feb 2022 00:00:00 GMT, WMF-Last-Access-Global=22-Jan-2022;Path=/;Domain=.wikipedia.org;HttpOnly;secure;Expires=Wed, 23 Feb 2022 00:00:00 GMT, GeoIP=US:CA:San_Francisco:37.78:-122.47:v4; Path=/; secure; Domain=.wikipedia.org', 'x-archive-orig-x-client-ip': '207.241.227.105', 'x-archive-orig-cache-control': 'private, s-maxage=0, max-age=0, must-revalidate', 'x-archive-orig-accept-ranges': 'bytes', 'x-archive-orig-content-length': '28504', 'x-archive-orig-connection': 'keep-alive', 'x-archive-guessed-content-type': 'text/html', 
'x-archive-guessed-charset': 'utf-8', 'memento-datetime': 'Sat, 22 Jan 2022 10:20:14 GMT', 'link': '<https://en.wikipedia.org/wiki/Multivariable_calculus>; rel="original", <https://web.archive.org/web/timemap/link/https://en.wikipedia.org/wiki/Multivariable_calculus>; rel="timemap"; type="application/link-format", <https://web.archive.org/web/https://en.wikipedia.org/wiki/Multivariable_calculus>; rel="timegate", <https://web.archive.org/web/20050422130129/http://en.wikipedia.org:80/wiki/Multivariable_calculus>; rel="first memento"; datetime="Fri, 22 Apr 2005 13:01:29 GMT", <https://web.archive.org/web/20220118154923/https://en.wikipedia.org/wiki/Multivariable_calculus>; rel="prev memento"; datetime="Tue, 18 Jan 2022 15:49:23 GMT", <https://web.archive.org/web/20220122102014/https://en.wikipedia.org/wiki/Multivariable_calculus>; rel="memento"; datetime="Sat, 22 Jan 2022 10:20:14 GMT", <https://web.archive.org/web/20220122102014/https://en.wikipedia.org/wiki/Multivariable_calculus>; rel="last memento"; datetime="Sat, 22 Jan 2022 10:20:14 GMT"', 'content-security-policy': "default-src 'self' 'unsafe-eval' 'unsafe-inline' data: blob: archive.org web.archive.org analytics.archive.org pragma.archivelab.org", 'x-archive-src': 'spn2-20220122093153-wwwb-spn23.us.archive.org-8000.warc.gz', 'server-timing': 'captures_list;dur=138.943024, exclusion.robots;dur=0.124457, exclusion.robots.policy;dur=0.114278, cdx.remote;dur=0.091306, esindex;dur=0.011012, LoadShardBlock;dur=101.247564, PetaboxLoader3.datanode;dur=44.420167, CDXLines.iter;dur=25.235685, PetaboxLoader3.resolve;dur=25.677021, load_resource;dur=4.737038', 'x-app-server': 'wwwb-app201', 'x-ts': '200', 'x-tr': '315', 'X-location': 'All', 'X-Cache-Key': 'httpsweb.archive.org/web/20220122102014/https://en.wikipedia.org/wiki/Multivariable_calculusIN', 'X-RL': '0', 'X-NA': '0', 'X-Page-Cache': 'MISS', 'X-NID': '-', 'Referrer-Policy': 'no-referrer-when-downgrade', 'Permissions-Policy': 'interest-cohort=()', 
'Content-Encoding': 'gzip'}
>>> save_api.timestamp()
datetime.datetime(2022, 1, 22, 10, 20, 14)
>>> save_api.archive_url
'https://web.archive.org/web/20220122102014/https://en.wikipedia.org/wiki/Multivariable_calculus'
Try this out in your browser @ https://repl.it/@akamhy/WaybackPySaveExample
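Under the hood, the 14-digit prefix in the archive URL is a `YYYYMMDDhhmmss` timestamp, which is what `save_api.timestamp()` parses into a `datetime`. A minimal stdlib sketch of that conversion (the helper name is ours for illustration, not part of waybackpy's API):

```python
import re
from datetime import datetime

def timestamp_from_archive_url(archive_url):
    """Extract the 14-digit Wayback timestamp from an archive URL and
    parse it into a datetime (illustrative helper, not waybackpy API)."""
    match = re.search(r"/web/(\d{14})/", archive_url)
    if match is None:
        raise ValueError("no 14-digit timestamp in %r" % archive_url)
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")

url = "https://web.archive.org/web/20220122102014/https://en.wikipedia.org/wiki/Multivariable_calculus"
print(timestamp_from_archive_url(url))  # 2022-01-22 10:20:14
```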
- Sometimes the Wayback Machine may deny your archiving request and not save the webpage. waybackpy raises WaybackError if the request fails.
>>>
>>> url = "https://github.com/akamhy/waybackpy/this-page-doesn't-exit" # this webpage doesn't exist (404), so it can't be archived
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>> wayback = waybackpy.Url(url, user_agent)
>>> archive = wayback.save()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/waybackpy/wrapper.py", line 141, in save
    self._archive_url = "https://" + _archive_url_parser(response.headers, self.url)
  File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/waybackpy/utils.py", line 223, in _archive_url_parser
    "of waybackpy.\nHeader:\n%s" % (url, __version__, str(header))
waybackpy.exceptions.WaybackError: No archive URL found in the API response. If 'https://github.com/akamhy/waybackpy/this-page-doesn't-exit' can be accessed via your web browser then either this version of waybackpy (2.4.1) is out of date or WayBack Machine is malfunctioning. Visit 'https://github.com/akamhy/waybackpy' for the latest version of waybackpy.
Header:
{'X-NA': '0', 'Content-Type': 'text/html; charset=utf-8', 'Date': 'Tue, 12 Jan 2021 12:53:21 GMT', 'X-NID': '-', 'Transfer-Encoding': 'chunked', 'X-ts': '523', 'Connection': 'keep-alive', 'Server': 'nginx/1.15.8', 'Cache-Control': 'no-cache', 'X-Tr': '6049', 'X-RL': '0', 'X-App-Server': 'wwwb-app52', 'X-Page-Cache': 'MISS'}
>>>
- You can handle it (WaybackError) using a try-except block.
>>>
>>> import waybackpy
>>> from waybackpy.exceptions import WaybackError
>>> url = "https://github.com/akamhy/waybackpy/this-page-doesn't-exit"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>>
>>> wayback = waybackpy.Url(url, user_agent)
>>>
>>> try:
...     archive = wayback.save()
... except WaybackError as e:
...     pass  # handle as you like!
...
>>>
>>> from waybackpy import WaybackMachineAvailabilityAPI
>>> url = "https://www.google.com"
>>> user_agent = "Any-user-agent-you-want"
>>> availability_api = WaybackMachineAvailabilityAPI(url, user_agent)
>>> availability_api.oldest()
https://web.archive.org/web/19981111184551/http://google.com:80/
>>> availability_api.archive_url
'https://web.archive.org/web/19981111184551/http://google.com:80/'
>>> availability_api.JSON
{'url': 'https://www.google.com', 'archived_snapshots': {'closest': {'status': '200', 'available': True, 'url': 'http://web.archive.org/web/19981111184551/http://google.com:80/', 'timestamp': '19981111184551'}}, 'timestamp': '199401221029'}
>>> availability_api.timestamp()
datetime.datetime(1998, 11, 11, 18, 45, 51)
Try this out in your browser @ https://repl.it/@akamhy/WaybackPyOldestExample
>>> import waybackpy
>>> url = "https://www.eff.org"
>>> availability_api = waybackpy.WaybackMachineAvailabilityAPI(url)
>>> availability_api.newest()
https://web.archive.org/web/20220122070041/https://www.eff.org/
>>> availability_api.archive_url
'https://web.archive.org/web/20220122070041/https://www.eff.org/'
>>> availability_api.timestamp()
datetime.datetime(2022, 1, 22, 7, 0, 41)
>>> availability_api.JSON
{'url': 'https://www.eff.org', 'archived_snapshots': {'closest': {'status': '200', 'available': True, 'url': 'http://web.archive.org/web/20220122070041/https://www.eff.org/', 'timestamp': '20220122070041'}}, 'timestamp': '20220122104234'}
Try this out in your browser @ https://repl.it/@akamhy/WaybackPyNewestExample
Retrieving an archive close to a specified year, month, day, hour, and minute, or to a UNIX timestamp, using near()
>>> from waybackpy import WaybackMachineAvailabilityAPI
>>> url = "https://www.facebook.com/zuck"
>>> user_agent = "YOUR USER AGENT"
>>> availability_api = WaybackMachineAvailabilityAPI(url, user_agent=user_agent)
>>> availability_api.near(year=2012, month=10, day=29, hour=12, minute=16)
https://web.archive.org/web/20121029122242/https://www.facebook.com/zuck
>>> availability_api.JSON
{'url': 'https://www.facebook.com/zuck', 'archived_snapshots': {'closest': {'status': '200', 'available': True, 'url': 'http://web.archive.org/web/20121029122242/https://www.facebook.com/zuck', 'timestamp': '20121029122242'}}, 'timestamp': '201210291216'}
>>> availability_api.timestamp()
datetime.datetime(2012, 10, 29, 12, 22, 42)
>>> import waybackpy
>>> url = "https://www.google.com"
>>> unix_time = 1200144258 # you can pass str, int or float.
>>> availability_api = waybackpy.WaybackMachineAvailabilityAPI(url)
>>> availability_api.near(unix_timestamp=unix_time)
https://web.archive.org/web/20080114115458/http://www.google.com/
>>> availability_api.archive_url
'https://web.archive.org/web/20080114115458/http://www.google.com/'
>>> availability_api.timestamp()
datetime.datetime(2008, 1, 14, 11, 54, 58)
>>> availability_api.JSON
{'url': 'https://www.google.com', 'archived_snapshots': {'closest': {'status': '200', 'available': True, 'url': 'http://web.archive.org/web/20080114115458/http://www.google.com/', 'timestamp': '20080114115458'}}, 'timestamp': '20080112132418'}
Try this out in your browser @ https://repl.it/@akamhy/WaybackPyNearExample
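The unix_timestamp accepted by near() maps one-to-one onto the 14-digit format; note how the request timestamp '20080112132418' in the JSON above is just 1200144258 rendered as a UTC calendar time. A stdlib sketch of that conversion (our own helper, assuming UTC):

```python
from datetime import datetime, timezone

def unix_to_wayback_timestamp(unix_time):
    """Convert a UNIX epoch time (str, int or float) to the 14-digit
    Wayback timestamp format (illustrative helper, not waybackpy API)."""
    dt = datetime.fromtimestamp(float(unix_time), tz=timezone.utc)
    return dt.strftime("%Y%m%d%H%M%S")

print(unix_to_wayback_timestamp(1200144258))  # 20080112132418
```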
>>>
>>> import waybackpy
>>>
>>> URL = "https://en.wikipedia.org/wiki/Python (programming language)"
>>> UA = "Mozilla/5.0 (iPad; CPU OS 8_1_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B435 Safari/600.1.4"
>>>
>>> wayback = waybackpy.Url(url=URL, user_agent=UA)
>>>
>>> total_archives = wayback.total_archives() # <class 'int'>
>>> total_archives
2550
>>>
Try this out in your browser @ https://repl.it/@akamhy/WaybackPyTotalArchivesExample
- To include URLs from subdomains, set subdomain=True
import waybackpy
URL = "akamhy.github.io"
UA = "Mozilla/5.0 (iPad; CPU OS 8_1_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B435 Safari/600.1.4"
wayback = waybackpy.Url(url=URL, user_agent=UA)
known_urls = wayback.known_urls(subdomain=False) # <class 'list'>
print(known_urls)
['http://akamhy.github.io',
'https://akamhy.github.io/waybackpy/',
'https://akamhy.github.io/waybackpy/assets/css/style.css?v=a418a4e4641a1dbaad8f3bfbf293fad21a75ff11',
'https://akamhy.github.io/waybackpy/assets/css/style.css?v=f881705d00bf47b5bf0c58808efe29eecba2226c']
Try this out in your browser @ https://repl.it/@akamhy/WaybackPyKnownURLsToWayBackMachineExample#main.py
This CDX Server API doc is derived from https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md.
from waybackpy import Cdx
url = "https://github.com/akamhy/*"
user_agent = "Your-apps-user-agent"
cdx = Cdx(url=url, user_agent=user_agent)
snapshots = cdx.snapshots()
def snapshot_printer(i, snapshot):
    """
    This function is not necessary, but we use it to print the output nicely.
    """
    urlkey = snapshot.urlkey
    timestamp = snapshot.timestamp
    original = snapshot.original
    mimetype = snapshot.mimetype
    statuscode = snapshot.statuscode
    digest = snapshot.digest
    length = snapshot.length
    archive_url = snapshot.archive_url
    datetime_timestamp = snapshot.datetime_timestamp
    text = (
        "\n\n"
        "%d\n"
        "  urlkey : %s\n"
        "  timestamp : %s\n"
        "  original : %s\n"
        "  mimetype : %s\n"
        "  statuscode : %s\n"
        "  digest : %s\n"
        "  length : %s\n"
        "  archive_url : %s\n"
        "  datetime_timestamp : %s\n"
    ) % (
        i,
        urlkey,
        timestamp,
        original,
        mimetype,
        statuscode,
        digest,
        length,
        archive_url,
        datetime_timestamp,
    )
    print(text)

for i, snapshot in enumerate(snapshots, start=1):
    snapshot_printer(i, snapshot)
Try this out in your browser @ https://repl.it/@akamhy/CDX-Basic-usage#main.py
The default behavior is to return matches for an exact URL. However, the CDX server can also return results matching a certain prefix, a certain host, or all subdomains by using the match_type= param.
- match_type=exact (default if omitted) will return results matching exactly archive.org/about/
- match_type=prefix will return results for all URLs under the path archive.org/about/
- match_type=host will return results from the host archive.org
- match_type=domain will return results from the host archive.org and all sub-hosts *.archive.org
from waybackpy import Cdx
url = "archive.org/about/"
user_agent = "Your-apps-user-agent"
cdx = Cdx(url=url, user_agent=user_agent, match_type="prefix")
snapshots = cdx.snapshots()
for snapshot in snapshots:
    print(snapshot.archive_url)
Try this out in your browser @ https://repl.it/@akamhy/CDX-UrlMatchScope#main.py
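Whatever match_type you choose, the wrapper ultimately queries the public CDX endpoint, passing it through as the matchType query parameter. A sketch of that query-URL construction (the helper is ours; only the endpoint and the url/matchType parameter names come from the CDX server API):

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def build_cdx_query(url, match_type="exact", **params):
    """Build a CDX server query URL (illustrative helper, not
    waybackpy's internals)."""
    query = {"url": url}
    if match_type != "exact":  # exact is the server default
        query["matchType"] = match_type
    query.update(params)  # e.g. from=..., to=..., filter=...
    return CDX_ENDPOINT + "?" + urlencode(query)

print(build_cdx_query("archive.org/about/", match_type="prefix"))
```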
Date Range: Results may be filtered by timestamp using start_timestamp= and end_timestamp= params. The ranges are inclusive and are specified in the same 1 to 14 digit format used for wayback captures: yyyyMMddhhmmss
from waybackpy import Cdx
url = "google.com"
user_agent = "Your-apps-user-agent"
cdx = Cdx(url=url, user_agent=user_agent, start_timestamp=1998, end_timestamp=2000)
snapshots = cdx.snapshots()
for snapshot in snapshots:
    print(snapshot.archive_url)
Try this out in your browser @ https://repl.it/@akamhy/CDX-Filtering-Date-Range#main.py
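One way to picture the inclusive 1-to-14-digit range: a short timestamp is padded to the earliest instant of its period on the start side and the latest instant on the end side. A simplified sketch of that interpretation (our own helpers; the server's exact padding rules may differ):

```python
def pad_start(ts):
    """Pad a 1-14 digit timestamp to the earliest instant of its period."""
    s = str(ts)
    return s + "00000101000000"[len(s):]

def pad_end(ts):
    """Pad a 1-14 digit timestamp to the latest instant of its period.
    The result is only used as a string upper bound, so calendar
    validity (e.g. day 31 in a 30-day month) does not matter."""
    s = str(ts)
    return s + "99991231235959"[len(s):]

print(pad_start(1998), pad_end(2000))  # 19980101000000 20001231235959
```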
It is possible to filter on a specific field or on the entire CDX line (which is space-delimited). Filtering by a specific field is often simpler. Any number of filter params of the form filters=["[!]field:regex"] may be specified.
- field is one of the named CDX fields (listed in the JSON query) or an index of the field. It is often useful to filter by mimetype or statuscode.
- Optional: ! before the query inverts the match, that is, returns results that do NOT match the regex.
- regex is any standard Java regex pattern (http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html).
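The '[!]field:regex' semantics can be mimicked in pure Python. This sketch (our own helper) assumes the server full-matches the regex against the field value and that a leading ! inverts the result; simple patterns behave the same under Java and Python regex syntax:

```python
import re

def matches_filter(fields, expr):
    """Evaluate one '[!]field:regex' filter against a dict of CDX fields
    (illustrative helper, not waybackpy's internals)."""
    invert = expr.startswith("!")
    field, _, pattern = (expr[1:] if invert else expr).partition(":")
    hit = re.fullmatch(pattern, fields[field]) is not None
    return hit != invert  # XOR: invert flips the match result

row = {"statuscode": "301", "mimetype": "text/html"}
print(matches_filter(row, "!statuscode:200"))  # True
```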
Ex: Query for 2 capture results with a non-200 status code:
from waybackpy import Cdx
url = "archive.org"
user_agent = "Your-apps-user-agent"
cdx = Cdx(url=url, user_agent=user_agent, filters=["!statuscode:200"])
snapshots = cdx.snapshots()
i = 0
for snapshot in snapshots:
    print(snapshot.statuscode, snapshot.archive_url)
    i += 1
    if i == 2:
        break
Try this out in your browser @ https://repl.it/@akamhy/filtering1#main.py
- Ex: Query for 10 capture results with a non-200 status code and a non-text/html mime type, matching a specific digest:
from waybackpy import Cdx
url = "archive.org"
user_agent = "Your-apps-user-agent"
cdx = Cdx(url=url, user_agent=user_agent, filters=["!statuscode:200", "!mimetype:text/html", "digest:2WAXX5NUWNNCS2BDKCO5OVDQBJVNKIVV"])
snapshots = cdx.snapshots()
i = 0
for snapshot in snapshots:
    print(snapshot.digest, snapshot.statuscode, snapshot.archive_url)
    i += 1
    if i == 10:
        break
Try this out in your browser @ https://repl.it/@akamhy/filtering2#main.py
A newer form of filtering is the option to 'collapse' results based on a field, or a substring of a field. Collapsing is applied to adjacent CDX lines: after the first line of a run, every line whose collapsed field value duplicates the previous one is filtered out. This is useful for thinning out captures that are 'too dense' or for finding unique captures.
To use collapsing, add one or more field or field:N entries to collapses=[], where field is one of urlkey, timestamp, original, mimetype, statuscode, digest, or length, and N restricts the comparison to the first N characters of the field.
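Since collapsing only compares each line with the one immediately before it, the whole rule fits in a few lines of Python. A sketch of that adjacency logic (our own helper, not waybackpy's internals):

```python
def collapse(rows, spec):
    """Keep only the first of each run of adjacent rows whose collapsed
    field (optionally truncated to N chars via 'field:N') is equal."""
    field, _, n = spec.partition(":")
    n = int(n) if n else None
    previous = object()  # sentinel that never equals a real value
    for row in rows:
        key = row[field][:n] if n else row[field]
        if key != previous:
            yield row
        previous = key

captures = [
    {"timestamp": "20130226010000"},
    {"timestamp": "20130226010800"},  # same first 10 digits -> collapsed
    {"timestamp": "20130226020000"},
]
print([c["timestamp"] for c in collapse(captures, "timestamp:10")])
# ['20130226010000', '20130226020000']
```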
- Ex: Only show at most 1 capture per hour (compare the first 10 digits of the timestamp field). Given 2 captures 20130226010000 and 20130226010800, since the first 10 digits 2013022601 matches, the 2nd capture will be filtered out.
from waybackpy import Cdx
url = "google.com"
user_agent = "Your-apps-user-agent"
cdx = Cdx(url=url, user_agent=user_agent, collapses=["timestamp:10"])
snapshots = cdx.snapshots()
for snapshot in snapshots:
    print(snapshot.archive_url)
Try this out in your browser @ https://repl.it/@akamhy/Cdx-collapsing-first#main.py
- Ex: Only show unique captures by digest (note that only adjacent digests are collapsed; duplicates elsewhere in the CDX are not affected):
from waybackpy import Cdx
url = "google.com"
user_agent = "Your-apps-user-agent"
cdx = Cdx(url=url, user_agent=user_agent, collapses=["digest"])
snapshots = cdx.snapshots()
for snapshot in snapshots:
    print(snapshot.archive_url)
Try this out in your browser @ https://repl.it/@akamhy/Cdx-collapsing-second#main.py
- Ex: Only show unique URLs in a prefix query (filtering out captures except for the first capture of a given URL). This is similar to the old prefix query in wayback (note: this query may be slow at the moment):
from waybackpy import Cdx
url = "archive.org"
user_agent = "Your-apps-user-agent"
cdx = Cdx(url=url, user_agent=user_agent, collapses=["urlkey"], match_type="prefix")
snapshots = cdx.snapshots()
for snapshot in snapshots:
    print(snapshot.archive_url)
Try this out in your browser @ https://repl.it/@akamhy/Cdx-collapsing-last#main.py