
Restrict expressiveness of site adapters #17

@geoffreylitt


I just had a nice chat with David Karger about Wildcard at the HCI research feedback lunch. He made a bunch of useful points about things to expand on in the Onward paper, but I found one of them particularly salient: restricting the expressiveness of site adapters. This topic has come up before, but as we move towards a beta release and start soliciting contributed adapters, it seems increasingly important to discuss.

Why Javascript is problematic

Currently, site adapters are written in Javascript (typed with Typescript). You write a single scrapePage function that returns all the data, and inside that function you can do whatever you want. In earlier versions of Wildcard I explored more complicated APIs, but landed on this one for simplicity's sake.

In previous discussions we've briefly touched on the security concerns of having a community-sourced repository of scrapers that can execute arbitrary code. The tentative plan up to now was the following: have people contribute adapters back to the main Github repository, have the core developers do centralized code review, and then distribute adapters in the code along with the extension itself. That plan somewhat solves the security issue, but still leaves at least 3 remaining problems, in order of priority:

  1. Burden on users: contributing back site adapters has a high barrier to entry -- you need to install the development build system locally, write Javascript, submit a Github PR, etc. As I was writing the site adapter creation guide docs, I started to get nervous about this.
  2. Too many footguns: people have lots of room to mess up, especially if inexperienced programmers are writing adapters. It's harder for us to enforce patterns of building adapters that are robust.
  3. Mediocre distribution mechanism: Centralized code review is a bottleneck and still doesn't provide airtight security. Only shipping new adapters with new versions of the extension code will require frequent releases and getting all users to upgrade. It would be much preferable to be able to distribute adapters dynamically, independent of extension code releases.

The obvious solution here is to move away from Javascript as the scraper language to a more restrictive, declarative DSL / "configuration language". This solves all the problems:

  1. Easier to write an adapter -- you can open an adapter editor inside of Wildcard, save new adapters in some serialized format, and upload them to a website that collects people's adapters. No need to write JS.
  2. You no longer have enough expressiveness to write certain kinds of bugs, and you can't write malware (assuming the DSL is well designed).
  3. Distribution becomes way simpler: have some online collection of adapters; Wildcard can either download all the latest ones or you can download specific adapters.

Some drawbacks might be: a) it's harder to use for people who already know JS, and b) providing a good editing experience might be more work, since we can't lean on Typescript types anymore and would need to do our own static verification of the adapter code.

DSL design

OK, sounds great, but the tricky part is designing a DSL that can still usefully scrape sites with reasonable programmer ergonomics. Looking across the site adapters we have now, it seems clear that the basic building blocks of an HTML scraper are:

  1. CSS selectors, for locating relevant DOM elements.
  2. Attribute access, for reading attributes off of DOM elements.
  3. Regex, for extracting substrings. We could also consider end-user-friendlier languages with regex-equivalent power, but regex is a universal standard, and there are some nice regex-generation tools. The framework could provide helpers for standard regex operations (e.g. extracting a number from a string). A rough sketch of how these blocks might combine follows this list.
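
To make that concrete, here's a hypothetical sketch (in TypeScript, with invented names -- not a settled design) of how a single scraped field might combine the three building blocks:

// Hypothetical shape of one scraped field, combining the three building
// blocks above. All names here are illustrative.
type ExtractRule =
  | "number"             // built-in helper, e.g. "extract number from string"
  | { regex: string };   // custom regex; the first capture group is returned

interface FieldSpec {
  css: string;           // (1) CSS selector, scoped to the row element
  attribute?: string;    // (2) DOM attribute to read; defaults to textContent
  extract?: ExtractRule; // (3) optional substring extraction
}

Each field would then be a small, statically checkable data structure rather than arbitrary code.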

What else is currently used in adapters? A quick audit:

  • Math: The Amazon scraper does some math to sum up delivery costs. I think this is better done as a formula outside the scraper, to give the end user more control. More generally, I'm optimistic that we can push computation out of the scrapers and into formulas in the table: a site adapter could produce a set of "raw columns" extracted from the page, and then provide a "derived view" using formulas in the table, giving the end user maximum flexibility.
  • Conditionals: used in several adapters for various purposes. I think if we had a feature for fallbacks (try to scrape X; if it's not there, scrape Y, ...) that would eliminate many of them, but not all.
  • Iteration: the Flux adapter uses a for loop because parts of the CSS selectors depend on the row and the column; we could probably design a way to directly interpolate row and column numbers into CSS selectors without requiring iteration (see the sketch just below this list).
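
For example, here's a purely illustrative sketch of that interpolation idea (the {row}/{col} template syntax is made up):

// Hypothetical: a selector template with {row}/{col} placeholders that
// the framework expands, replacing the hand-written for loop.
function fillSelector(template: string, row: number, col: number): string {
  return template
    .replace("{row}", String(row))
    .replace("{col}", String(col));
}

fillSelector("table tr:nth-child({row}) td:nth-child({col})", 3, 2);
// => "table tr:nth-child(3) td:nth-child(2)"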

A few other thoughts:

XPath: I'm not super familiar with it, but it seems incredibly powerful -- potentially the perfect existing language for providing most or all of these features in one package. I thought it was similar to CSS in power, but it seems quite a bit more expressive.
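
As a quick illustration of that extra expressiveness: XPath can filter on text content and walk back up to an ancestor, neither of which a CSS selector can express. A sketch using the browser's built-in document.evaluate API (the expression itself is illustrative):

// Find a span whose text contains "$", then step up to its enclosing div.
const result = document.evaluate(
  "//span[contains(text(), '$')]/ancestor::div[1]",
  document,
  null,
  XPathResult.FIRST_ORDERED_NODE_TYPE,
  null
);
const priceContainer = result.singleNodeValue;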

AJAX responses: We're also starting to explore AJAX adapters that scrape from an AJAX JSON response. So there'd need to be another way besides CSS selectors to index into a data tree -- maybe XPath.
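
A minimal sketch of what indexing into a JSON response could look like, assuming a simple dotted-path scheme (JSONPath is an existing, richer option; the response shape below is invented):

// Hypothetical: index into a parsed AJAX response with a dotted path.
function getPath(obj: unknown, path: string): unknown {
  return path.split(".").reduce((node: any, key) => node?.[key], obj);
}

const response = { results: [{ listing: { price: 120 } }] };
getPath(response, "results.0.listing.price"); // => 120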

Other adapter attributes besides scraping: adapters also define a name, a set of columns, when to activate based on URL, DOM events that trigger data reloads... but most of those are pretty declarative already.

Syntax: I'm reluctant to design a syntax from scratch; embedding this in JSON seems most straightforward. Usually I prefer DSLs embedded in a Turing-complete language to provide the TC escape hatch if needed, but here that's precisely what we don't want.

Example

Here's a concrete example of how such a DSL might look in a simple case, Airbnb's search page.

First, the existing Javascript adapter:

'use strict';

import { urlContains, extractNumber } from "../utils"
import { createDomScrapingAdapter } from "./domScrapingBase"

// Obfuscated CSS class names from Airbnb's generated markup
const rowContainerClass = "_fhph4u"
const rowClass = "_8ssblpx"
const titleClass = "_1c2n35az"
const priceClass = "_1p7iugi"
const ratingClass = "_10fy1f8"
const listingLinkClass = "_i24ijs"

const AirbnbAdapter = createDomScrapingAdapter({
  name: "Airbnb",
  enabled: () => urlContains("airbnb.com/s"),
  attributes: [
  { name: "id", type: "text" },
  { name: "name", type: "text" },
  { name: "price", type: "numeric" },
  { name: "rating", type: "numeric" }
  ],
  scrapePage: () => {
    // Return one object per search result row; each listing's numeric ID
    // is pulled out of its /rooms/<id> link.
    return Array.from(document.getElementsByClassName(rowClass)).map(el => {
      let path = el.querySelector("." + listingLinkClass).getAttribute('href')
      let id = path.match(/\/rooms\/([0-9]*)\?/)[1]

      return {
        id: id,
        rowElements: [el],
        dataValues: {
          name: el.querySelector(`.${titleClass}`),
          price: el.querySelector(`.${priceClass}`).textContent.match(/\$([\d]*)/)[1],
          rating: extractNumber(el.querySelector(`.${ratingClass}`))
        }
      }
    })
  }
});

export default AirbnbAdapter;

Then, the new adapter in our imagined DSL:

{
  "name": "Airbnb",
  "enabled": {
    "urlContains": "airbnb.com/s"
  },
  "attributes": [
    { "name": "id", "type": "text" },
    { "name": "name", "type": "text" },
    { "name": "price", "type": "numeric" },
    { "name": "rating", "type": "numeric" }
  ],
  // CSS selector identifying each row.
  // (todo: consider cases like Hacker News where each row
  // is spread across multiple DOM elements)
  "rows": "._8ssblpx",
  "id": {
     // from within the row, get the element matching this selector...
     "css": "._i24ijs",
     // extract this attribute from that element...
     "attribute": "href",
     // then run this regex and take its first capture group.
     // (taking the first capture group is just the default behavior)
     "extract": { "regex": "/rooms/([0-9]*)\\?" }
   },
  "values": {
    "name": {
      "css": "._1c2n35az"
    },
    "price": {
      "css": "._1p7iugi",
      "extract": { "regex": "\\$([\\d]*)" }
    },
    "rating": {
      "css": "._10fy1f8",
      "extract": "number"
    }
  }
}
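
One sanity check on this design is that it stays easy to interpret. Here's a rough sketch of how the extension might evaluate one field spec against a row element (repeating the hypothetical FieldSpec shape from earlier; error handling omitted):

// Rough interpreter for a single field spec from the DSL above.
type ExtractRule = "number" | { regex: string };

interface FieldSpec {
  css: string;
  attribute?: string;
  extract?: ExtractRule;
}

function evalField(row: Element, spec: FieldSpec): string | null {
  const el = row.querySelector(spec.css);
  if (!el) return null;
  // read an attribute if one is named, otherwise the element's text
  const raw = spec.attribute ? el.getAttribute(spec.attribute) : el.textContent;
  if (raw == null || !spec.extract) return raw;
  if (spec.extract === "number") {
    const match = raw.match(/[\d.]+/);
    return match ? match[0] : null;
  }
  // default behavior: return the regex's first capture group
  const match = raw.match(new RegExp(spec.extract.regex));
  return match ? match[1] : null;
}

The interpreter only ever sees selectors, attribute names, and regexes, so there's no place for arbitrary code to hide.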

Visual editing

Eventually it would be good to have a visual environment where end users can generate scrapers via direct manipulation, and there are some existing tools for doing that. One nice thing about this DSL approach is that it should be an easier code generation target for such a tool.

Initially, to limit scope, I'm imagining that users would directly edit this DSL in text. (Although -- if there's an existing end user scraper creation tool that's really good, maybe we could bypass text editing entirely and just use that tool instead...)

Prior art

People have designed many DSLs and visual scraping products before. If one of them fits our purposes (and ideally, is popular) then that would be great.

One source of language design inspiration is the Huginn web scraping agent's scraping configuration:

          "extract": {
            "url": { "css": "#comic img", "value": "@src" },
            "title": { "css": "#comic img", "value": "@title" },
            "body_text": { "css": "div.main", "value": "string(.)" },
            "page_title": { "css": "title", "value": "string(.)", "repeat": true }
          }
      or
          "extract": {
            "url": { "xpath": "//*[@class='blog-item']/a/@href", "value": ".",
            "title": { "xpath": "//*[@class='blog-item']/a", "value": "normalize-space(.)" },
            "description": { "xpath": "//*[@class='blog-item']/div[0]", "value": "string(.)" }
          }

Next steps

Unfortunately, removing expressiveness from existing programs is hard. If people were to start contributing Javascript adapters, it wouldn't always be easy to convert them to this less expressive form.

I'm tempted to say that we should think through this issue before doing the planned work of writing up a site adapter creation guide + soliciting scraper contributions. It would be ideal to have something about this in the Onward paper as well. Unfortunately this may not be a quick thing to resolve; DSL design is hard.

One helpful technique would be to approach this incrementally, by supporting both the new DSL and Javascript adapters side by side (a sketch follows). Start with a tiny DSL that can handle the simplest cases, migrate some existing adapters over, and then encourage the DSL for new adapters. Some of our current adapters may need to stay in JS for now, and there may still be new JS adapters in the future, but as long as most adapters are in the simple format, we still get many of the benefits outlined above.
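
Concretely, the extension could treat the two kinds of adapters as one union type and dispatch on it. All names in this sketch (DslAdapter, interpretDslSpec, ScrapedRow) are invented for illustration, not Wildcard's real API:

// Sketch of supporting both adapter kinds side by side.
type ScrapedRow = { id: string; dataValues: Record<string, string | null> };

interface JsAdapter {
  kind: "js";
  scrapePage: () => ScrapedRow[];
}

interface DslAdapter {
  kind: "dsl";
  spec: object; // a JSON document like the Airbnb example above
}

type Adapter = JsAdapter | DslAdapter;

// Stand-in for an interpreter like the one sketched earlier.
declare function interpretDslSpec(spec: object): ScrapedRow[];

function scrapePage(adapter: Adapter): ScrapedRow[] {
  switch (adapter.kind) {
    case "js":
      return adapter.scrapePage(); // full JS escape hatch, as today
    case "dsl":
      return interpretDslSpec(adapter.spec); // safe, interpretable subset
  }
}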
