Scoop 🍨

High-fidelity, browser-based, single-page web archiving library and CLI.

Use it in the terminal...

scoop "https://lil.law.harvard.edu"

... or in your Node.js project

import { Scoop } from '@harvard-lil/scoop'

const capture = await Scoop.capture('https://lil.law.harvard.edu')
const wacz = await capture.toWACZ()

About

Scoop is a high-fidelity, browser-based web archiving capture engine for witnessing the web, from the Harvard Library Innovation Lab.

Fine-tune this custom web capture software to create robust single-page captures of the internet with accurate and complete provenance information.

With extensive options for asset formats and inclusions, Scoop will create .warc, .warc.gz or .wacz files to be stored by users and replayed using the web archive replay software of their choosing.

Scoop also comes with built-in support for the WACZ Signing and Verification specification, allowing users to cryptographically sign their captures.


Main Features

  • High-fidelity, browser-based capture of single web pages with no alterations
  • Highly configurable
  • Optional attachments:
    • Provenance summary
    • Screenshot
    • Extracted videos with associated subtitles and metadata
    • PDF snapshot
    • DOM snapshot
    • SSL certificates
  • Support for .warc, .warc.gz and .wacz output formats

Examples and screenshots


Getting started

Dependencies and requirements

Scoop requires Node.js 18+.

Other recommended system-level dependencies: curl, python3 (for --capture-video-as-attachment option).

The amount of resources Scoop needs depends entirely on what is being captured, but a minimum of 4GB of RAM is recommended for complex captures.
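The requirements above can be checked before launching a capture. The following is a minimal sketch only: `checkRequirements` is a hypothetical helper that is not part of Scoop, and the 4GB threshold is treated as a soft recommendation rather than a hard limit.

```javascript
// Minimal preflight sketch. `checkRequirements` is a hypothetical helper,
// it is not part of Scoop.
const MIN_NODE_MAJOR = 18 // "Scoop requires Node.js 18+"
const RECOMMENDED_RAM_BYTES = 4 * 1024 ** 3 // 4GB, suggested for complex captures

function checkRequirements (nodeVersion, totalMemBytes) {
  // Accepts both "18.17.0" and "v18.17.0"
  const major = Number.parseInt(String(nodeVersion).replace(/^v/, '').split('.')[0], 10)
  const problems = []

  if (!(major >= MIN_NODE_MAJOR)) {
    problems.push(`Node.js ${MIN_NODE_MAJOR}+ is required, found ${nodeVersion}`)
  }

  if (totalMemBytes < RECOMMENDED_RAM_BYTES) {
    problems.push('Less than 4GB of RAM: complex captures may struggle')
  }

  return problems // An empty list means no problem was detected
}
```

In a real script, the arguments could be `process.versions.node` and the value returned by `totalmem()` from Node's `node:os` module.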

Compatibility

This program has been written for UNIX-like systems and is expected to work on Linux, macOS, and Windows Subsystem for Linux.

Installation

Scoop is available on npmjs.org and can be installed as follows:

# As a CLI
npm install -g @harvard-lil/scoop

# As a library
npm install @harvard-lil/scoop --save

# In both cases, you may need to install Playwright's dependencies: 
sudo npx playwright install-deps chromium
Trouble installing the CLI?
  • Make sure you are running Node.js 18+ (node -v).
  • Permission issues are common when installing npm packages globally for the first time. See npm's documentation for solutions.
  • On certain systems, using install-deps without the chromium argument might be necessary:
sudo npx playwright install-deps
  • If a global install is not possible, Scoop can be installed locally in a new folder and run with npx:
# In a new folder
npm init
npm install @harvard-lil/scoop
npx scoop "https://example.com"


Using Scoop on the command line

Here are a few examples of how the scoop command can be used to make a customized capture of a web page.

# This will capture a given url using the default settings.
scoop "https://lil.law.harvard.edu" 

# Unless specified otherwise, scoop will save the output of the capture as "./archive.wacz".
# We can change this with the `--output` / `-o` option
scoop "https://lil.law.harvard.edu" -o my-collection/lil.wacz

# But what if I want to change the output format itself?
scoop "https://lil.law.harvard.edu" -f warc -o my-collection/lil.warc

# By default, Scoop runs in headless mode. 
# I can turn the "headless" flag off to see what happens in Chromium during capture.
scoop "https://lil.law.harvard.edu" --headless false

# Although it comes with "good defaults", scoop is highly configurable ...
scoop "https://lil.law.harvard.edu" --capture-video-as-attachment false --screenshot false --capture-window-x 320 --capture-window-y 480 --capture-timeout 30000 --max-capture-size 100000 --signing-url "https://example.com/sign"

# ... use --help to list the available options, and see what the defaults are.
scoop --help

# Timeout-related options are good dials to turn first when trying to customize "how much" of a page to capture.
scoop "https://lil.law.harvard.edu" --capture-timeout 90000 --load-timeout 60000 --network-idle-timeout 30000
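When turning the timeout dials, it can help to compare the per-step timeouts against the hard cut-off. The sketch below does that with the default values listed by `scoop --help`. It rests on assumptions: `timeoutBudget` is a hypothetical helper, the camelCase names `networkIdleTimeout` and `behaviorsTimeout` are assumed to mirror the CLI flags, and the steps are assumed to run one after another, which this README does not state.

```javascript
// Defaults as listed by `scoop --help`, in milliseconds.
const TIMEOUT_DEFAULTS = {
  captureTimeout: 60000,
  loadTimeout: 20000,
  networkIdleTimeout: 20000,
  behaviorsTimeout: 20000
}

// Hypothetical helper: compares the sum of the per-step timeouts to the hard cut-off.
// ASSUMPTION: the steps run sequentially; this README does not confirm that.
function timeoutBudget (overrides = {}) {
  const t = { ...TIMEOUT_DEFAULTS, ...overrides }
  const stepsTotal = t.loadTimeout + t.networkIdleTimeout + t.behaviorsTimeout

  return {
    stepsTotal,
    captureTimeout: t.captureTimeout,
    stepsFitInBudget: stepsTotal <= t.captureTimeout
  }
}
```

With the values of the last command above (90000, 60000 and 30000, plus the default 20000 for behaviors), the steps add up to 110000 ms. Under the sequential assumption, the hard cut-off could therefore trigger before every step has used its full allowance.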
See: Output of scoop --help 🔍
Usage: scoop [options] <url>

🍨 High-fidelity, browser-based, single-page web archiving library and CLI.
More info: https://github.com/harvard-lil/scoop

Options:
  -v, --version                                          Display Scoop and Scoop CLI version.
  -o, --output <string>                                  Output path. (default: "./archive.wacz")
  -f, --format <string>                                  Output format. (choices: "warc", "warc-gzipped", "wacz", "wacz-with-raw", default: "wacz")
  --json-summary-output <string>                         If set, allows for saving a capture summary as JSON. Must be a path to .json file.
  --export-attachments-output <string>                   If set, allows for exporting attachments (screenshot, certs, ...). Must be a path to an existing directory.
  --signing-url <string>                                 Authsign-compatible endpoint for signing WACZ file.
  --signing-token <string>                               Authentication token to --signing-url, if needed.
  --screenshot <bool>                                    Add screenshot step to capture? (choices: "true", "false", default: "true")
  --pdf-snapshot <bool>                                  Add PDF snapshot step to capture? (choices: "true", "false", default: "false")
  --dom-snapshot <bool>                                  Add DOM snapshot step to capture? (choices: "true", "false", default: "false")
  --capture-video-as-attachment <bool>                   Add capture video(s) as attachment(s) step to capture? (choices: "true", "false", default: "true")
  --capture-certificates-as-attachment <bool>            Add capture certificate(s) as attachment(s) step to capture? (choices: "true", "false", default: "true")
  --provenance-summary <bool>                            Add provenance summary to capture? (choices: "true", "false", default: "true")
  --attachments-bypass-limits <bool>                     If active, attachments will not count towards time and size constraints imposed on capture (--capture-timeout, --max-capture-size). (choices: "true", "false", default: "true")
  --capture-timeout <number>                             Maximum time allocated to capture process before hard cut-off, in ms. (default: 60000)
  --load-timeout <number>                                Max time Scoop will wait for the page to load, in ms. (default: 20000)
  --network-idle-timeout <number>                        Max time Scoop will wait for the in-browser networking tasks to complete, in ms. (default: 20000)
  --behaviors-timeout <number>                           Max time Scoop will wait for the browser behaviors to complete, in ms. (default: 20000)
  --capture-video-as-attachment-timeout <number>         Max time Scoop will wait for the video capture process to complete, in ms. (default: 30000)
  --capture-certificates-as-attachment-timeout <number>  Max time Scoop will wait for the certificates capture process to complete, in ms. (default: 10000)
  --capture-window-x <number>                            Width of the browser window Scoop will open to capture, in pixels. (default: 1600)
  --capture-window-y <number>                            Height of the browser window Scoop will open to capture, in pixels. (default: 900)
  --max-capture-size <number>                            Size limit for the capture's exchanges list, in bytes. (default: 209715200)
  --auto-scroll <bool>                                   Should Scoop try to scroll through the page? (choices: "true", "false", default: "true")
  --auto-play-media <bool>                               Should Scoop try to autoplay `<audio>` and `<video>` tags? (choices: "true", "false", default: "true")
  --grab-secondary-resources <bool>                      Should Scoop try to download img srcsets and secondary stylesheets? (choices: "true", "false", default: "true")
  --run-site-specific-behaviors <bool>                   Should Scoop run site-specific capture behaviors? (via: browsertrix-behaviors) (choices: "true", "false", default: "true")
  --headless <bool>                                      Should Chrome run in headless mode? (choices: "true", "false", default: "true")
  --user-agent-suffix <string>                           If provided, will be appended to Chrome's user agent. (default: "")
  --blocklist <string>                                   If set, replaces Scoop's default list of url patterns and IP ranges Scoop should not capture. Comma-separated. Example: "/https?://localhost/,0.0.0.0/8,10.0.0.0".
  --intercepter <string>                                 ScoopIntercepter class to be used to intercept network exchanges. (default: "ScoopProxy")
  --proxy-host <string>                                  Hostname to be used by Scoop's HTTP proxy. (default: "localhost")
  --proxy-port <string>                                  Port to be used by Scoop's HTTP proxy. (default: 9000)
  --proxy-verbose <bool>                                 Should Scoop's HTTP proxy output logs to the console? (choices: "true", "false", default: "false")
  --public-ip-resolver-endpoint <string>                 API endpoint to be used to resolve the client's IP address. Used in the context of the provenance summary. (default: "https://icanhazip.com")
  --yt-dlp-path <string>                                 Path to the yt-dlp executable. Used for capturing videos. (default: "[library]/executables/yt-dlp")
  --crip-path <string>                                   Path to the crip executable. Used for capturing SSL/TLS certificates. (default: "[library]/executables/crip")
  --log-level <string>                                   Controls Scoop CLI's verbosity. (choices: "silent", "trace", "debug", "info", "warn", "error", default: "info")
  -h, --help                                             Show options list.
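The options above can also be assembled programmatically when the CLI is driven from a Node.js script. The following is a sketch under assumptions: `toCliArgs` and `runScoop` are hypothetical helpers, not part of Scoop, and `runScoop` expects the `scoop` command to be installed globally and available on the PATH.

```javascript
// Hypothetical helper: turns { captureTimeout: 90000 } into
// ['--capture-timeout', '90000'], with the URL as the first argument.
function toCliArgs (url, options = {}) {
  const args = [url]

  for (const [key, value] of Object.entries(options)) {
    // camelCase -> kebab-case, e.g. "pdfSnapshot" -> "--pdf-snapshot"
    const flag = '--' + key.replace(/[A-Z]/g, (c) => '-' + c.toLowerCase())
    args.push(flag, String(value))
  }

  return args
}

// Runs the `scoop` command and resolves with its exit code.
async function runScoop (url, options = {}) {
  const { spawn } = await import('node:child_process')

  return new Promise((resolve, reject) => {
    const child = spawn('scoop', toCliArgs(url, options), { stdio: 'inherit' })
    child.on('error', reject) // e.g. the scoop command was not found
    child.on('close', resolve) // First argument is the exit code
  })
}
```

Passing arguments as an array to `spawn` avoids shell quoting issues with URLs that contain characters such as `&` or `?`.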


Using Scoop as a JavaScript library

Scoop can be used as a library in a Node.js project. Here are a few examples of how to programmatically capture web pages using the Scoop.capture() method, which returns an instance of the Scoop class.

const capture = await Scoop.capture(url, options)

Example: Capture with default settings

import fs from 'fs/promises'
import { Scoop } from '@harvard-lil/scoop'

try {
  const capture = await Scoop.capture('https://lil.law.harvard.edu')
  const wacz = await capture.toWACZ()
  await fs.writeFile('archive.wacz', Buffer.from(wacz))
} catch(err) {
  // ...
}

Example: Capture with custom settings

import fs from 'fs/promises'
import { Scoop } from '@harvard-lil/scoop'

try {
  const capture = await Scoop.capture('https://lil.law.harvard.edu', {
    screenshot: true,
    pdfSnapshot: true,
    captureVideoAsAttachment: false,
    captureTimeout: 120 * 1000,
    loadTimeout: 60 * 1000,
    captureWindowX: 320,
    captureWindowY: 480
  })

  const warc = await capture.toWARC()
  await fs.writeFile('archive.warc', Buffer.from(warc))
} catch(err) {
  // ...
}
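The two examples above differ only in the export method and file name. A small wrapper can pick the method from the output path. This is a sketch: `exportMethodFor` and `saveCapture` are hypothetical helpers, and only `toWACZ()` and `toWARC()`, as used in the examples above, are relied upon. Gzipped WARC output is deliberately left out.

```javascript
// Hypothetical helper: picks an export method from the output file's extension.
// Only `toWACZ()` and `toWARC()` are taken from the examples above.
function exportMethodFor (outputPath) {
  if (outputPath.endsWith('.wacz')) return 'toWACZ'
  if (outputPath.endsWith('.warc')) return 'toWARC'
  throw new Error(`Unsupported extension: ${outputPath}`)
}

// `capture` is expected to be the object returned by `Scoop.capture()`.
async function saveCapture (capture, outputPath) {
  const fs = await import('node:fs/promises')
  const method = exportMethodFor(outputPath)
  const data = await capture[method]()

  await fs.writeFile(outputPath, Buffer.from(data))
  return method
}
```

Checking the extension before calling the export method means an unsupported path fails early, before any export work is done.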

Example: Working with a copy of default settings

import { Scoop } from '@harvard-lil/scoop'

try {
  // "options" will be a copy of Scoop's default settings
  const options = Scoop.defaults

  // It therefore becomes easier to inspect said defaults ...
  console.log(options)

  // ... and edit existing values
  options.pdfSnapshot = true
  options.blocklist.push('/https?:\/\/foo/')

  const capture = await Scoop.capture('https://lil.law.harvard.edu', options)

  // ...
} catch(err) {
  // ...
}
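When several captures share a base configuration, it may be convenient to layer per-capture overrides on top of a snapshot of the defaults without mutating it. The following sketch illustrates one way to do so. `buildOptions` is a hypothetical helper, and the only assumption it makes about the defaults object is that `blocklist`, when present, is an array, as suggested by the example above.

```javascript
// Hypothetical helper: layers overrides on top of a defaults object
// without mutating it. Entries of `extraBlocklist` are appended to the
// blocklist instead of replacing it.
function buildOptions (defaults, overrides = {}, extraBlocklist = []) {
  const options = structuredClone(defaults) // Deep copy, available in Node.js 17+

  Object.assign(options, overrides)
  options.blocklist = [...(options.blocklist ?? []), ...extraBlocklist]

  return options
}
```

It could be called as `buildOptions(Scoop.defaults, { pdfSnapshot: true }, ['/https?:\/\/foo/'])`, and the result passed as the second argument of `Scoop.capture()`.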

Example: Using a signing server

import fs from 'fs/promises'
import { Scoop } from '@harvard-lil/scoop'

try {
  const capture = await Scoop.capture('https://lil.law.harvard.edu')

  const signedWacz = await capture.toWACZ(true, {
    url: 'https://example.com/sign',
    token: 'some-very-secret-token'
  })

  await fs.writeFile('archive.wacz', Buffer.from(signedWacz))
} catch(err) {
  // ...
}
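Hard-coding a signing token, as in the example above, is fine for a demonstration but rarely desirable in practice. The sketch below reads the signing configuration from environment variables instead. `signingServerFromEnv` is a hypothetical helper, and `SCOOP_SIGNING_URL` and `SCOOP_SIGNING_TOKEN` are variable names made up for this illustration: Scoop itself does not read them.

```javascript
// Hypothetical helper. SCOOP_SIGNING_URL and SCOOP_SIGNING_TOKEN are made-up
// variable names for this sketch; Scoop itself does not read them.
function signingServerFromEnv (env) {
  const url = env.SCOOP_SIGNING_URL
  if (!url) return null // No signing requested

  const parsed = new URL(url) // Throws a TypeError if the value is not a valid URL
  if (!['http:', 'https:'].includes(parsed.protocol)) {
    throw new Error(`Unsupported protocol for signing URL: ${parsed.protocol}`)
  }

  const signingServer = { url }
  if (env.SCOOP_SIGNING_TOKEN) signingServer.token = env.SCOOP_SIGNING_TOKEN

  return signingServer
}
```

When the helper returns an object, that object has the same `{ url, token }` shape as the second argument of `capture.toWACZ()` in the example above.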


FAQ

🚧 Under construction

What does "browser-based" capture mean? Is it using my browser?

Browser-based capture means that Scoop uses a browser - Chromium - to visit the web page to capture and collect resources.

Specifically, it uses an HTTP proxy to "intercept" network exchanges as early as possible and preserve them "as is".

flowchart LR
    A[Scoop]
    B[Playwright]
    C[Chromium]
    D[Website]
    E[HTTP Proxy]
    A <--> |Controls| B
    B <--> C
    C <--> D
    A <-.-> |Capture| E <-.-> C

The browser Scoop controls is installed by Playwright, the underlying tool Scoop uses to drive it, specifically for programmatic access. It is separate from the default browser of the machine Scoop is running on. Additionally, Scoop creates a single-use, isolated browsing context for every capture it makes.

Can I capture content behind login / password with Scoop?

Not yet - for security reasons - but we're working on it.

Although Playwright supports loading browser profiles, doing so:

  • Breaks context isolation
  • May lead to the presence of credentials / tokens in the captured exchanges

Help us design this feature: harvard-lil#118

Does Scoop capture everything through a browser?

Yes, unless specified otherwise.

The exceptions are:

  • If the main URL to capture is not a web page (for example: a PDF file), it will be captured using curl.
  • Videos captured as attachments are captured outside of the browser using yt-dlp.
  • Same goes for certificates, captured as attachments via crip.
  • Favicons may be captured out-of-band using curl, if not intercepted during capture.

Exchanges captured in that context still go through Scoop's HTTP proxy, with the exception of crip.
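The capture paths described above can be summarized as data. This is only an illustrative restatement of the list: the `kind` labels and the `capturePathFor` helper are invented for this sketch and do not correspond to anything in Scoop's API.

```javascript
// Summary of the capture paths described above.
// The keys are labels invented for this sketch.
const CAPTURE_PATHS = {
  'web-page': { tool: 'chromium', viaProxy: true },
  'non-web-page-url': { tool: 'curl', viaProxy: true }, // e.g. a PDF file
  'video-attachment': { tool: 'yt-dlp', viaProxy: true },
  'certificate-attachment': { tool: 'crip', viaProxy: false }, // The one exception
  'missing-favicon': { tool: 'curl', viaProxy: true }
}

function capturePathFor (kind) {
  const path = CAPTURE_PATHS[kind]
  if (!path) throw new Error(`Unknown resource kind: ${kind}`)
  return path
}
```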

flowchart LR
    A[Scoop]
    B[curl]
    C[Resource]
    D[HTTP Proxy]
    A <--> |Controls| B
    B <--> C
    A <-.-> |Capture| D <-.-> B

What is "WACZ with RAW exchanges"?

The includeRaw option of Scoop.toWACZ() allows for adding a folder named "raw" in the WACZ file, which contains a copy of unprocessed HTTP exchanges coming directly from Scoop's HTTP proxy.

This feature may be used to preserve finer elements that would otherwise be lost, such as ill-formed HTTP headers, and could be relevant in certain contexts such as forensic analysis.

In order to prevent unnecessary use of storage, Scoop only keeps in "/raw" the contents of exchanges it assesses are presented differently in WARCs. In practice, this most often means the bodies of HTTP exchanges are not included in the "/raw" files because the WARCs already contain the same data.

Experimental: WACZ files stored with the includeRaw option can be ingested by Scoop for analysis and processing via the Scoop.fromWACZ() method.

Should I run Scoop in headful mode?

In certain cases, running Scoop in "headful" mode might yield better results.

Passing --headless false to the CLI or { headless: false } to the library will instruct Scoop to run Chromium in headful mode.

Simulating a graphical output is necessary when running Scoop in headful mode on a server. The following command can be used for that purpose:

xvfb-run --auto-servernum -- scoop "https://lil.law.harvard.edu" --headless false
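A script that runs both on desktops and on servers may want to add `xvfb-run` only when it is needed. The sketch below builds the command accordingly. `buildScoopCommand` is a hypothetical helper, and it assumes that an unset `DISPLAY` environment variable means no graphical output is available, which holds for typical Linux/X11 servers but is not a general rule.

```javascript
// Hypothetical helper: builds the command line for a capture, adding xvfb-run
// when headful mode is requested and no display appears to be available.
// ASSUMPTION: an unset DISPLAY variable means "no graphical output" (Linux/X11).
function buildScoopCommand (url, { headless = true } = {}, env = {}) {
  const command = ['scoop', url, '--headless', String(headless)]
  const needsVirtualDisplay = !headless && !env.DISPLAY

  return needsVirtualDisplay
    ? ['xvfb-run', '--auto-servernum', '--', ...command]
    : command
}
```

Calling it with `{ headless: false }` and an environment without `DISPLAY` yields the same command as the one shown above.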


Development

Standard JS

This codebase uses the Standard JS coding style.

  • npm run lint can be used to check formatting.
  • npm run lint-autofix can be used to check formatting and automatically edit files accordingly when possible.
  • Most IDEs can be configured to automatically check and enforce this coding style.

JSDoc

JSDoc is used for both documentation and loose type checking purposes on this project.

Testing

This project uses Node.js' built-in test runner.

npm run test

Tests-specific environment variables

The following environment variables allow for testing features requiring access to a third-party server.

These are optional, and can be added to a local .env file which will be automatically interpreted by the test runner.

| Name | Description |
| --- | --- |
| TEST_WACZ_SIGNING_URL | URL of an authsign-compatible endpoint for signing WACZ files. To run such an endpoint locally, use npm run dev-signer, which will overwrite .env and set this variable to http://localhost:5000/sign; see .services/signer. |
| TEST_WACZ_SIGNING_TOKEN | If required by the server at TEST_WACZ_SIGNING_URL, an authentication token. |

Available CLI

# Runs test suite
npm run test

# Runs linter
npm run lint

# Runs linter and attempts to automatically fix issues
npm run lint-autofix

# Runs a local instance of wacz-signer for test purposes (see "Testing" section)
npm run dev-signer

# Step-by-step NPM publishing helper
npm run publish-util
