Simple Python set of scripts that obtain a list of cached URLs for a website. This is useful when a website suddenly goes offline for good and leaves behind only cached copies of its pages. It supports scraping from the following web caches (as of 2015-10-09):
- Google Cache - Google Cache is the most famous web cache system around. It is populated by the GoogleBot crawler, which takes snapshots of pages while indexing them for search.
  - Listing - The first step is to obtain all of the cached URLs.
  - Archival - Because Google Cache has strong restrictions against robot scrapers, it is very difficult to grab more than 15 pages programmatically. However, there is a workaround: use Archive.is to archive the pages. Google has given that site exclusive rights to scrape the Google Cache without restriction, since they apparently understand the importance of saving certain cached pages. A rough sketch of this listing-and-archival flow appears after this list.
- Bing Cache - Bing, another major search engine, also makes cached URLs available to users. Bing has significantly fewer restrictions on robots and does not prune dead caches as quickly, so it is the ideal web cache to work with.
  - Listing - Bing uses a very odd cache URL scheme, which forces us to scrape the search result HTML for cache links. Thankfully, Bing is not prejudiced against robots, so this is quite an easy task (see the sketch after this list).
  - Archival - Just use good ol' wget to grab the pages, with a 1-second delay between requests (a small Python equivalent is sketched after this list).
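
Below is a minimal sketch of the Google Cache listing-and-archival flow described above. It assumes the long-standing webcache.googleusercontent.com cache URL scheme and a plain POST of the target URL to Archive.is's /submit/ endpoint; both are assumptions about how those services behave and may have changed. The list of page URLs is an input you supply (from a sitemap, an old crawl, or the Bing listing below), not something this snippet discovers on its own.

```python
# Sketch only: build Google Cache URLs for known pages and ask Archive.is
# to snapshot each one.  The cache URL prefix and the Archive.is submit
# endpoint are assumptions based on how both services worked circa 2015.
import time

import requests

CACHE_PREFIX = "http://webcache.googleusercontent.com/search?q=cache:"
ARCHIVE_SUBMIT = "https://archive.is/submit/"  # assumed submission endpoint


def google_cache_url(page_url):
    """Return the Google Cache URL for an ordinary page URL."""
    return CACHE_PREFIX + page_url


def archive_cached_pages(page_urls):
    """Ask Archive.is to snapshot the Google Cache copy of each page."""
    for page_url in page_urls:
        cached = google_cache_url(page_url)
        resp = requests.post(ARCHIVE_SUBMIT, data={"url": cached}, timeout=30)
        print(cached, "->", resp.status_code)
        time.sleep(5)  # Archive.is rate-limits heavy submitters; stay polite


if __name__ == "__main__":
    # Hypothetical input: page URLs gathered elsewhere (sitemap, old crawl).
    archive_cached_pages([
        "example.com/page1",
        "example.com/page2",
    ])
```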
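
For the Bing listing step, a rough sketch of the HTML scrape is below. It assumes cached copies are still linked from Bing result pages as anchors pointing at cc.bingj.com/cache.aspx (the odd scheme in use when this was written) and that Bing paginates with the first parameter; the markup changes over time, so treat the regex as a placeholder.

```python
# Rough sketch: page through Bing results for site:<domain> and collect
# the cached-copy links.  The cc.bingj.com/cache.aspx pattern and the
# result-page markup are assumptions from when this README was written.
import html
import re
import time

import requests

BING_SEARCH = "https://www.bing.com/search"
CACHE_LINK = re.compile(r'href="(https?://cc\.bingj\.com/cache\.aspx[^"]+)"')


def list_bing_cached_urls(domain, pages=5, results_per_page=10):
    """Yield cached-copy URLs found in Bing results for site:<domain>."""
    seen = set()
    for page in range(pages):
        params = {"q": "site:" + domain, "first": page * results_per_page + 1}
        page_html = requests.get(BING_SEARCH, params=params, timeout=30).text
        for raw in CACHE_LINK.findall(page_html):
            url = html.unescape(raw)  # hrefs are entity-encoded in the HTML
            if url not in seen:
                seen.add(url)
                yield url
        time.sleep(1)  # Bing tolerates robots, but stay polite anyway


if __name__ == "__main__":
    for cached_url in list_bing_cached_urls("example.com"):
        print(cached_url)
```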
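
For the archival step, the listed URLs can simply be fed to wget; the snippet below is a small Python equivalent that keeps the 1-second delay between requests (the file-naming scheme is just a placeholder).

```python
# Small downloader for the archival step: fetch each cached URL with a
# 1-second pause between requests and write the HTML to numbered files.
import time

import requests


def download_cached_pages(cached_urls, prefix="bing_cache"):
    """Save each cached page to <prefix>_NNNN.html, one second apart."""
    for i, url in enumerate(cached_urls):
        resp = requests.get(url, timeout=30)
        filename = "{}_{:04d}.html".format(prefix, i)
        with open(filename, "w", encoding="utf-8") as f:
            f.write(resp.text)
        print("saved", url, "->", filename)
        time.sleep(1)  # 1-second delay between requests, as recommended above
```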
Unfortunately, Bing no longer provides cached URL IDs in its API. On the flip side, Bing is desperate for page views and allows any sort of traffic into its search engine, unlike Google, which is prejudiced against robots.
For background on Bing's XML and JSON search APIs, see http://www.newmediacampaigns.com/page/search-microsoft-bing-with-xml-and-json-apis (sign up for free for 5,000 requests).