On 2017/01/18, the Central Intelligence Agency (CIA) released its CIA Records Search Tool (CREST) database online, comprising about 930,000 declassified documents.
Out of interest, I wrote a web crawler for the public CIA CREST website ( https://www.cia.gov/library/readingroom/collection/crest-25-year-program-archive ), which makes it convenient to quickly browse the results of a query and to automatically download documents to your own machine.
Queries are run against the CIA Freedom of Information Act (FOIA) Electronic Reading Room (ERR), and the crawler is written in R.
This R script provides 3 functions:

- `basic.info.query.CIA_CREST(query)`: gets the basic information for a given query.
- `parsing.pages.CIA_CREST(query, pages)`: returns a `parse.table` for the given query and the range of result pages you want to search; this table should be provided to the next function.
- `download.doc.CIA_CREST(parse.table)`: automatically downloads documents based on the `parse.table`, and returns a `reference.table` that helps match document titles to the downloaded files (.pdf).
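For the curious, here is a minimal sketch of what the query step could look like with the `xml2` package (which the full write-up linked at the end is based on). The search URL pattern and the XPath selector are assumptions for illustration; the actual script may differ:

```r
library(xml2)

# A hypothetical sketch of the query step (not the actual implementation).
# Assumes the reading-room search URL pattern and that the results page
# reports "Search found N items" in a heading we can locate by XPath.
basic.info.sketch = function(query) {
  url  = paste0("https://www.cia.gov/library/readingroom/search/site/",
                URLencode(query))
  page = read_html(url)                                  # fetch + parse HTML
  hits = xml_find_first(page, "//h2[contains(., 'Search found')]")
  cat("Your query is :", query, "\n")
  cat(xml_text(hits), "\n")
}
```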
I provide several examples in the script; here I will walk through just a few of them.
For example, suppose you are interested in "secret letter" and want to search for related documents:
```r
basic.info.query.CIA_CREST(query = "secret letter")
```

```
# Response
The search query is for CIA Freedom of Information Act (FOIA) Electronic Reading Room (ERR)
URL: https://www.cia.gov/library/readingroom/collection/crest-25-year-program-archive
Your query is : secret letter
Search found 388350 items
The results contain 0 ~ 19417 pages
```
You will get a response reporting 388350 search items, with result pages ranging from 0 to 19417.
(Note that page 0 corresponds to the first page on the website.)
The next step is to decide which pages you want to search.
For example, suppose you want to check documents about "secret letter" in the top 10 pages:
```r
your.query = 'secret letter'
page.nums  = c(0:9)  # the top 10 pages

parse.table = parsing.pages.CIA_CREST(query = your.query,
                                      pages  = page.nums)
```
The returned `parse.table` includes 4 columns:

- `title`: titles of the documents.
- `download.url`: the URLs from which to download the documents.
- `page`: the page number on which the document appears.
- `correspond.page`: the URL of the result page on which the document appears.
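To make the structure concrete, `parse.table` can be pictured as an ordinary data frame like the hypothetical one below (all values are invented for illustration):

```r
# Hypothetical illustration of the parse.table layout; the values are made up.
parse.table = data.frame(
  title            = "MEMORANDUM FOR THE RECORD",
  download.url     = "https://www.cia.gov/library/readingroom/docs/EXAMPLE.pdf",
  page             = 0,
  correspond.page  = "https://www.cia.gov/library/readingroom/search/site/secret%20letter?page=0",
  stringsAsFactors = FALSE
)
```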
This `parse.table` should then be supplied to `download.doc.CIA_CREST()`, the function that automatically downloads every document listed in the `parse.table` to a folder relative to the working directory.
That is, suppose we want to download the documents (.pdf) about "secret letter" found in the top 10 pages:
```r
your.query = 'secret letter'
page.nums  = c(0:9)  # the top 10 pages

parse.table = parsing.pages.CIA_CREST(query = your.query,
                                      pages  = page.nums)

reference.table = download.doc.CIA_CREST(parse.table)
```
Or suppose we only want to download the top 10 documents (.pdf) about "UFO" from the first page:
```r
your.query = 'UFO'
page.nums  = c(0)  # the first page

parse.table = parsing.pages.CIA_CREST(query = your.query,
                                      pages  = page.nums)

reference.table = download.doc.CIA_CREST(parse.table[1:10, ])  # only the top 10 documents
```
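Since `parse.table` is an ordinary data frame, you can also subset it in other ways before downloading; for instance, keeping only documents whose title mentions a keyword (a hypothetical refinement, not part of the original script):

```r
# Hypothetical: download only the documents whose title contains "MEMO"
memo.rows       = grepl("MEMO", parse.table$title, ignore.case = TRUE)
reference.table = download.doc.CIA_CREST(parse.table[memo.rows, ])
```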
Note that the returned `reference.table` includes 2 columns:

- `title`: titles of the documents.
- `pdf.name`: file names of the downloaded documents (.pdf).

The downloaded documents are named in the CIA's own encoded style, so the `reference.table` is necessary for matching titles to files.
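For reference, here is a minimal sketch of what the download step might look like under the hood, assuming each `download.url` points directly at a .pdf and using the URL's basename as the saved file name (the actual script may name and place files differently):

```r
# A hypothetical sketch of the download step (not the actual implementation).
download.doc.sketch = function(parse.table, dir = "CIA_CREST") {
  if (!dir.exists(dir)) dir.create(dir)          # create the output folder
  pdf.name = basename(parse.table$download.url)  # CIA-encoded file names
  for (i in seq_len(nrow(parse.table))) {
    download.file(parse.table$download.url[i],
                  destfile = file.path(dir, pdf.name[i]),
                  mode = "wb")                   # binary mode keeps PDFs intact
  }
  # return the title-to-file mapping, like reference.table
  data.frame(title = parse.table$title, pdf.name = pdf.name,
             stringsAsFactors = FALSE)
}
```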
I wrote an article in Chinese with more details about how I implemented this crawler (sorry, there is no English version):
http://rpubs.com/skydome20/R-Note13-Web-Crawler-on-CIA-CREST-by-xml2