
Python package docs

Akash Mahanty edited this page Jan 22, 2022 · 86 revisions

You are currently reading the waybackpy documentation for using it as a Python package. If you want to use waybackpy as a CLI tool, visit our CLI docs.



Archiving or Saving a webpage

>>> import waybackpy
>>> 
>>> url = "https://en.wikipedia.org/wiki/Multivariable_calculus"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>> 
>>> save_api = waybackpy.WaybackMachineSaveAPI(url, user_agent=user_agent)
>>> save_api.save()
'https://web.archive.org/web/20220122102014/https://en.wikipedia.org/wiki/Multivariable_calculus'
>>> save_api.cached_save
False
>>> save_api.headers
{'Server': 'nginx/1.19.5', 'Date': 'Sat, 22 Jan 2022 10:20:19 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'x-archive-orig-date': 'Fri, 21 Jan 2022 23:32:39 GMT', 'x-archive-orig-server': 'mw1407.eqiad.wmnet', 'x-archive-orig-x-content-type-options': 'nosniff', 'x-archive-orig-p3p': 'CP="See https://en.wikipedia.org/wiki/Special:CentralAutoLogin/P3P for more info."', 'x-archive-orig-content-language': 'en', 'x-archive-orig-vary': 'Accept-Encoding,Cookie,Authorization', 'x-archive-orig-last-modified': 'Fri, 21 Jan 2022 23:16:22 GMT', 'x-archive-orig-content-encoding': 'gzip', 'x-archive-orig-age': '38855', 'x-archive-orig-x-cache': 'cp4027 miss, cp4030 hit/2', 'x-archive-orig-x-cache-status': 'hit-front', 'x-archive-orig-server-timing': 'cache;desc="hit-front", host;desc="cp4030"', 'x-archive-orig-strict-transport-security': 'max-age=106384710; includeSubDomains; preload', 'x-archive-orig-report-to': '{ "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }', 'x-archive-orig-nel': '{ "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}', 'x-archive-orig-permissions-policy': 'interest-cohort=()', 'x-archive-orig-set-cookie': 'WMF-Last-Access=22-Jan-2022;Path=/;HttpOnly;secure;Expires=Wed, 23 Feb 2022 00:00:00 GMT, WMF-Last-Access-Global=22-Jan-2022;Path=/;Domain=.wikipedia.org;HttpOnly;secure;Expires=Wed, 23 Feb 2022 00:00:00 GMT, GeoIP=US:CA:San_Francisco:37.78:-122.47:v4; Path=/; secure; Domain=.wikipedia.org', 'x-archive-orig-x-client-ip': '207.241.227.105', 'x-archive-orig-cache-control': 'private, s-maxage=0, max-age=0, must-revalidate', 'x-archive-orig-accept-ranges': 'bytes', 'x-archive-orig-content-length': '28504', 'x-archive-orig-connection': 'keep-alive', 'x-archive-guessed-content-type': 'text/html', 
'x-archive-guessed-charset': 'utf-8', 'memento-datetime': 'Sat, 22 Jan 2022 10:20:14 GMT', 'link': '<https://en.wikipedia.org/wiki/Multivariable_calculus>; rel="original", <https://web.archive.org/web/timemap/link/https://en.wikipedia.org/wiki/Multivariable_calculus>; rel="timemap"; type="application/link-format", <https://web.archive.org/web/https://en.wikipedia.org/wiki/Multivariable_calculus>; rel="timegate", <https://web.archive.org/web/20050422130129/http://en.wikipedia.org:80/wiki/Multivariable_calculus>; rel="first memento"; datetime="Fri, 22 Apr 2005 13:01:29 GMT", <https://web.archive.org/web/20220118154923/https://en.wikipedia.org/wiki/Multivariable_calculus>; rel="prev memento"; datetime="Tue, 18 Jan 2022 15:49:23 GMT", <https://web.archive.org/web/20220122102014/https://en.wikipedia.org/wiki/Multivariable_calculus>; rel="memento"; datetime="Sat, 22 Jan 2022 10:20:14 GMT", <https://web.archive.org/web/20220122102014/https://en.wikipedia.org/wiki/Multivariable_calculus>; rel="last memento"; datetime="Sat, 22 Jan 2022 10:20:14 GMT"', 'content-security-policy': "default-src 'self' 'unsafe-eval' 'unsafe-inline' data: blob: archive.org web.archive.org analytics.archive.org pragma.archivelab.org", 'x-archive-src': 'spn2-20220122093153-wwwb-spn23.us.archive.org-8000.warc.gz', 'server-timing': 'captures_list;dur=138.943024, exclusion.robots;dur=0.124457, exclusion.robots.policy;dur=0.114278, cdx.remote;dur=0.091306, esindex;dur=0.011012, LoadShardBlock;dur=101.247564, PetaboxLoader3.datanode;dur=44.420167, CDXLines.iter;dur=25.235685, PetaboxLoader3.resolve;dur=25.677021, load_resource;dur=4.737038', 'x-app-server': 'wwwb-app201', 'x-ts': '200', 'x-tr': '315', 'X-location': 'All', 'X-Cache-Key': 'httpsweb.archive.org/web/20220122102014/https://en.wikipedia.org/wiki/Multivariable_calculusIN', 'X-RL': '0', 'X-NA': '0', 'X-Page-Cache': 'MISS', 'X-NID': '-', 'Referrer-Policy': 'no-referrer-when-downgrade', 'Permissions-Policy': 'interest-cohort=()', 
'Content-Encoding': 'gzip'}
>>> save_api.timestamp()
datetime.datetime(2022, 1, 22, 10, 20, 14)
>>> save_api.archive_url
'https://web.archive.org/web/20220122102014/https://en.wikipedia.org/wiki/Multivariable_calculus'

Try this out in your browser @ https://repl.it/@akamhy/WaybackPySaveExample

Exception/Error Handling
  • Sometimes the Wayback Machine may deny your archiving request and not save the webpage. waybackpy raises WaybackError if the request fails.
>>> 
>>> url = "https://github.com/akamhy/waybackpy/this-page-doesn't-exit" # This webpage doesn't exist (404), so it can't be archived.
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>> wayback = waybackpy.Url(url, user_agent)
>>> archive = wayback.save()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/waybackpy/wrapper.py", line 141, in save
    self._archive_url = "https://" + _archive_url_parser(response.headers, self.url)
  File "/home/akamhy/.pyenv/versions/3.5.3/lib/python3.5/site-packages/waybackpy/utils.py", line 223, in _archive_url_parser
    "of waybackpy.\nHeader:\n%s" % (url, __version__, str(header))
waybackpy.exceptions.WaybackError: No archive URL found in the API response. If 'https://github.com/akamhy/waybackpy/this-page-doesn't-exit' can be accessed via your web browser then either this version of waybackpy (2.4.1) is out of date or WayBack Machine is malfunctioning. Visit 'https://github.com/akamhy/waybackpy' for the latest version of waybackpy.
Header:
{'X-NA': '0', 'Content-Type': 'text/html; charset=utf-8', 'Date': 'Tue, 12 Jan 2021 12:53:21 GMT', 'X-NID': '-', 'Transfer-Encoding': 'chunked', 'X-ts': '523', 'Connection': 'keep-alive', 'Server': 'nginx/1.15.8', 'Cache-Control': 'no-cache', 'X-Tr': '6049', 'X-RL': '0', 'X-App-Server': 'wwwb-app52', 'X-Page-Cache': 'MISS'}
>>> 
  • You can handle WaybackError using a try-except block.
>>> 
>>> import waybackpy
>>> from waybackpy.exceptions import WaybackError
>>> url = "https://github.com/akamhy/waybackpy/this-page-doesn't-exit"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>> 
>>> wayback = waybackpy.Url(url, user_agent)
>>> 
>>> try:
...     archive = wayback.save()
... except WaybackError as e:
...     pass # handle as you like!
... 
>>> 

Retrieve archive of webpage

Retrieving the oldest archive for a URL using oldest()
>>> from waybackpy import WaybackMachineAvailabilityAPI
>>> url = "https://www.google.com"
>>> user_agent = "Any-user-agent-you-want"
>>> availability_api = WaybackMachineAvailabilityAPI(url, user_agent)
>>> availability_api.oldest()
https://web.archive.org/web/19981111184551/http://google.com:80/
>>> availability_api.archive_url
'https://web.archive.org/web/19981111184551/http://google.com:80/'
>>> availability_api.JSON
{'url': 'https://www.google.com', 'archived_snapshots': {'closest': {'status': '200', 'available': True, 'url': 'http://web.archive.org/web/19981111184551/http://google.com:80/', 'timestamp': '19981111184551'}}, 'timestamp': '199401221029'}
>>> availability_api.timestamp()
datetime.datetime(1998, 11, 11, 18, 45, 51)

Try this out in your browser @ https://repl.it/@akamhy/WaybackPyOldestExample

Retrieving the newest archive for a URL using newest()
>>> import waybackpy
>>> url = "https://www.eff.org"
>>> availability_api = waybackpy.WaybackMachineAvailabilityAPI(url)
>>> availability_api.newest()
https://web.archive.org/web/20220122070041/https://www.eff.org/
>>> availability_api.archive_url
'https://web.archive.org/web/20220122070041/https://www.eff.org/'
>>> availability_api.timestamp()
datetime.datetime(2022, 1, 22, 7, 0, 41)
>>> availability_api.JSON
{'url': 'https://www.eff.org', 'archived_snapshots': {'closest': {'status': '200', 'available': True, 'url': 'http://web.archive.org/web/20220122070041/https://www.eff.org/', 'timestamp': '20220122070041'}}, 'timestamp': '20220122104234'}

Try this out in your browser @ https://repl.it/@akamhy/WaybackPyNewestExample

Retrieving the archive closest to a specified year, month, day, hour, and minute, or to a UNIX timestamp, using near()
>>> from waybackpy import WaybackMachineAvailabilityAPI
>>> url = "https://www.facebook.com/zuck"
>>> user_agent = "YOUR USER AGENT"
>>> availability_api = WaybackMachineAvailabilityAPI(url, user_agent=user_agent)
>>> availability_api.near(year=2012, month=10, day=29, hour=12, minute=16)
https://web.archive.org/web/20121029122242/https://www.facebook.com/zuck
>>> availability_api.JSON
{'url': 'https://www.facebook.com/zuck', 'archived_snapshots': {'closest': {'status': '200', 'available': True, 'url': 'http://web.archive.org/web/20121029122242/https://www.facebook.com/zuck', 'timestamp': '20121029122242'}}, 'timestamp': '201210291216'}
>>> availability_api.timestamp()
datetime.datetime(2012, 10, 29, 12, 22, 42)
>>> import waybackpy
>>> url = "https://www.google.com" 
>>> unix_time = 1200144258 # you can pass str, int or float.
>>> availability_api = waybackpy.WaybackMachineAvailabilityAPI(url)
>>> availability_api.near(unix_timestamp=unix_time)
https://web.archive.org/web/20080114115458/http://www.google.com/
>>> availability_api.archive_url
'https://web.archive.org/web/20080114115458/http://www.google.com/'
>>> availability_api.timestamp()
datetime.datetime(2008, 1, 14, 11, 54, 58)
>>> availability_api.JSON
{'url': 'https://www.google.com', 'archived_snapshots': {'closest': {'status': '200', 'available': True, 'url': 'http://web.archive.org/web/20080114115458/http://www.google.com/', 'timestamp': '20080114115458'}}, 'timestamp': '20080112132418'}

Try this out in your browser @ https://repl.it/@akamhy/WaybackPyNearExample


Count total archives for a URL using total_archives()

>>> 
>>> import waybackpy
>>> 
>>> URL = "https://en.wikipedia.org/wiki/Python_(programming_language)"
>>> UA = "Mozilla/5.0 (iPad; CPU OS 8_1_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B435 Safari/600.1.4"
>>> 
>>> wayback = waybackpy.Url(url=URL, user_agent=UA)
>>> 
>>> total_archives = wayback.total_archives() # <class 'int'>
>>> total_archives
2550
>>> 

Try this out in your browser @ https://repl.it/@akamhy/WaybackPyTotalArchivesExample


List of URLs that the Wayback Machine knows and has archived for a domain name

  • To include URLs from subdomains, set subdomain=True
import waybackpy

URL = "akamhy.github.io"
UA = "Mozilla/5.0 (iPad; CPU OS 8_1_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B435 Safari/600.1.4"

wayback = waybackpy.Url(url=URL, user_agent=UA)

known_urls = wayback.known_urls(subdomain=False) # <class 'list'>

print(known_urls)
['http://akamhy.github.io',
'https://akamhy.github.io/waybackpy/',
'https://akamhy.github.io/waybackpy/assets/css/style.css?v=a418a4e4641a1dbaad8f3bfbf293fad21a75ff11',
'https://akamhy.github.io/waybackpy/assets/css/style.css?v=f881705d00bf47b5bf0c58808efe29eecba2226c']

Try this out in your browser @ https://repl.it/@akamhy/WaybackPyKnownURLsToWayBackMachineExample#main.py

CDX Server API

This CDX server API doc is derived from https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md.

Basic usage
from waybackpy import Cdx
url = "https://github.com/akamhy/*"
user_agent = "Your-apps-user-agent"

cdx = Cdx(url=url, user_agent=user_agent)
snapshots = cdx.snapshots()

def snapshot_printer(i, snapshot):
    """
    This function is not necessary but we are using it to print the output nicely.
    """
    urlkey = snapshot.urlkey
    timestamp = snapshot.timestamp
    original = snapshot.original
    mimetype = snapshot.mimetype
    statuscode = snapshot.statuscode
    digest = snapshot.digest
    length = snapshot.length
    archive_url = snapshot.archive_url
    datetime_timestamp = snapshot.datetime_timestamp
    text = (
        "\n\n"
        "%d\n"
        " urlkey : %s\n"
        " timestamp : %s\n"
        " original : %s\n"
        " mimetype : %s\n"
        " statuscode : %s\n"
        " digest : %s\n"
        " length : %s\n"
        " archive_url : %s\n"
        " datetime_timestamp : %s\n"
    ) % (
        i, 
        urlkey, 
        timestamp, 
        original, 
        mimetype, 
        statuscode,
        digest,
        length,
        archive_url,
        datetime_timestamp,
        )
    print(text)

for i, snapshot in enumerate(snapshots, start=1):
    snapshot_printer(i, snapshot)

Try this out in your browser @ https://repl.it/@akamhy/CDX-Basic-usage#main.py

Url Match Scope

The default behavior is to return matches for an exact URL. However, the CDX server can also return results matching a certain prefix, a certain host, or all subdomains by using the match_type= param.

  • match_type=exact (default if omitted) will return results matching exactly archive.org/about/
  • match_type=prefix will return results for all results under the path archive.org/about/
  • match_type=host will return results from host archive.org
  • match_type=domain will return results from host archive.org and all sub-hosts *.archive.org
from waybackpy import Cdx
url = "archive.org/about/"
user_agent = "Your-apps-user-agent"

cdx = Cdx(url=url, user_agent=user_agent, match_type="prefix")
snapshots = cdx.snapshots()

for snapshot in snapshots:
    print(snapshot.archive_url)

Try this out in your browser @ https://repl.it/@akamhy/CDX-UrlMatchScope#main.py

Filtering
Date Range

Results may be filtered by timestamp using the start_timestamp= and end_timestamp= params. The ranges are inclusive and are specified in the same 1- to 14-digit format used for Wayback captures: yyyyMMddhhmmss

from waybackpy import Cdx
url = "google.com"
user_agent = "Your-apps-user-agent"

cdx = Cdx(url=url, user_agent=user_agent, start_timestamp=1998, end_timestamp=2000)
snapshots = cdx.snapshots()

for snapshot in snapshots:
    print(snapshot.archive_url)

Try this out in your browser @ https://repl.it/@akamhy/CDX-Filtering-Date-Range#main.py

Regex filtering
  • It is possible to filter on a specific field or the entire CDX line (which is space-delimited). Filtering by a specific field is often simpler. Any number of filter params of the form filters=["[!]field:regex"] may be specified.

    • field is one of the named cdx fields (listed in the JSON query) or an index of the field. It is often useful to filter by mimetype or statuscode

    • Optional: a ! before the query inverts the match, that is, it will return results that do NOT match the regex.

    • regex is any standard Java regex pattern (http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html)

  • Ex: Query for 2 capture results with a non-200 status code:

from waybackpy import Cdx
url = "archive.org"
user_agent = "Your-apps-user-agent"

cdx = Cdx(url=url, user_agent=user_agent, filters=["!statuscode:200"])
snapshots = cdx.snapshots()

i = 0
for snapshot in snapshots:
    print(snapshot.statuscode, snapshot.archive_url)
    i += 1
    if i == 2:
        break

Try this out in your browser @ https://repl.it/@akamhy/filtering1#main.py

  • Ex: Query for 10 capture results with a non-200 status code and a non-text/html MIME type, matching a specific digest:
from waybackpy import Cdx
url = "archive.org"
user_agent = "Your-apps-user-agent"

cdx = Cdx(url=url, user_agent=user_agent, filters=["!statuscode:200", "!mimetype:text/html", "digest:2WAXX5NUWNNCS2BDKCO5OVDQBJVNKIVV"])
snapshots = cdx.snapshots()

i = 0
for snapshot in snapshots:
    print(snapshot.digest, snapshot.statuscode, snapshot.archive_url)
    i += 1
    if i == 10:
        break

Try this out in your browser @ https://repl.it/@akamhy/filtering2#main.py

Collapsing

A new form of filtering is the option to 'collapse' results based on a field, or a substring of a field. Collapsing is done on adjacent CDX lines: all captures after the first one that are duplicates are filtered out. This is useful for filtering out captures that are 'too dense' or when looking for unique captures.

To use collapsing, add one or more field or field:N entries to collapses=[], where field is one of urlkey, timestamp, original, mimetype, statuscode, digest, or length, and N limits the comparison to the first N characters of the field.

  • Ex: Only show at most 1 capture per hour (compare the first 10 digits of the timestamp field). Given 2 captures 20130226010000 and 20130226010800, since the first 10 digits 2013022601 match, the 2nd capture will be filtered out.
from waybackpy import Cdx
url = "google.com"
user_agent = "Your-apps-user-agent"

cdx = Cdx(url=url, user_agent=user_agent, collapses=["timestamp:10"])
snapshots = cdx.snapshots()

for snapshot in snapshots:
    print(snapshot.archive_url)

Try this out in your browser @ https://repl.it/@akamhy/Cdx-collapsing-first#main.py

  • Ex: Only show unique captures by digest (note that only adjacent digests are collapsed; duplicates elsewhere in the CDX are not affected)
from waybackpy import Cdx
url = "google.com"
user_agent = "Your-apps-user-agent"

cdx = Cdx(url=url, user_agent=user_agent, collapses=["digest"])
snapshots = cdx.snapshots()

for snapshot in snapshots:
    print(snapshot.archive_url)

Try this out in your browser @ https://repl.it/@akamhy/Cdx-collapsing-second#main.py

  • Ex: Only show unique URLs in a prefix query (filtering out captures except for the first capture of a given URL). This is similar to the old prefix query in wayback (note: this query may be slow at the moment):
from waybackpy import Cdx
url = "archive.org"
user_agent = "Your-apps-user-agent"

cdx = Cdx(url=url, user_agent=user_agent, collapses=["urlkey"], match_type="prefix")
snapshots = cdx.snapshots()

for snapshot in snapshots:
    print(snapshot.archive_url)

Try this out in your browser @ https://repl.it/@akamhy/Cdx-collapsing-last#main.py
