Tasks Reference

Sean McLellan edited this page May 1, 2017 · 11 revisions

Every action in Skrapr is performed by a Task in a Skrapr definition.

All tasks have the following properties:

  • name - The name, or type, of the task (the possible values are described below).
  • description - A free-text description of what the task does.
  • disabled - A boolean that indicates whether the task should be skipped.
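
As a rough illustration of the disabled flag, here is a minimal JavaScript sketch that filters out the tasks marked as disabled. The runnableTasks helper and the sample task list are hypothetical and are not part of Skrapr; how Skrapr itself skips tasks may differ.

```javascript
// Hypothetical helper (not part of Skrapr): returns the tasks that would run,
// leaving out any task whose "disabled" property is true.
function runnableTasks(tasks) {
  return tasks.filter(function (task) {
    return task.disabled !== true;
  });
}

// Sample definition fragment, made up for illustration.
const sampleTasks = [
  {
    name: "InjectStyleElement",
    description: "Highlight the grid while debugging",
    disabled: true,
    styles: ".thumb-grid { border: 2px solid red; }"
  },
  {
    name: "Scrape",
    description: "Collect the page URL",
    gather: { url: "window.location.toString()" }
  }
];

// Only the Scrape task remains, because the style task is disabled.
console.log(runnableTasks(sampleTasks).map(function (task) { return task.name; }));
```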

AddAnchorsAsTargets

This task adds the URLs of a number of anchors as pages that the main task flow will navigate to.

Properties:

  • selector - The CSS selector for the anchor tags whose href="..." attribute values will be added to the main task flow. Non-anchors are ignored.

Example:

{
  "name": "AddAnchorsAsTargets",
  "selector": ".my-class a"
}
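
To make the selection rule concrete, here is a minimal JavaScript sketch of how the href values could be collected while ignoring non-anchors. It is an assumption about the behavior described above, not Skrapr's actual implementation, and collectAnchorTargets is a hypothetical name.

```javascript
// Hypothetical sketch (not Skrapr's implementation): collect the href values
// of the anchors matched by a selector, ignoring any non-anchor elements.
// "root" is anything that offers querySelectorAll, such as the page's document.
function collectAnchorTargets(root, selector) {
  return Array.from(root.querySelectorAll(selector))
    // Keep only anchor elements.
    .filter(function (el) { return el.tagName.toLowerCase() === "a"; })
    // Read the raw href attribute value.
    .map(function (el) { return el.getAttribute("href"); })
    // Drop anchors that have no usable href.
    .filter(function (href) { return href !== null && href !== ""; });
}
```

In a browser console, collectAnchorTargets(document, ".products a") would list the link targets that a selector like this picks up.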

InjectScriptElement

This task injects the specified script onto the current page. Useful to ensure that a library is present on the page to provide behavior to other tasks.

Properties:

  • condition - An optional JavaScript expression that is evaluated to determine whether the script should be injected.
  • async - Sets the value of the 'async' attribute of the injected script tag.
  • contents - Sets the contents of the script tag. This is the actual script when src is not defined.
  • type - Sets the value of the 'type' attribute of the injected script tag. Defaults to 'text/javascript'.
  • scriptUrl - Sets the value of the 'src' attribute of the injected script tag. Note: Browsers will disregard the script contents if the src is specified.

Example:

Injects URI.js onto the page if it is not already present.

{
  "name": "InjectScriptElement",
  "condition": "window['URI'] === undefined;",
  "scriptUrl": "https://cdnjs.cloudflare.com/ajax/libs/URI.js/1.18.10/URI.min.js"
}
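
For readers who want to see how the properties map onto a script tag, the following JavaScript sketch builds the element from a task object. It is only an assumption about the mapping described above; buildScriptElement is a hypothetical name and Skrapr's own injection code may work differently.

```javascript
// Hypothetical sketch (not Skrapr's implementation): build a script element
// from the task's properties. "doc" is a document-like object.
function buildScriptElement(doc, task) {
  const script = doc.createElement("script");
  // 'type' falls back to the documented default.
  script.type = task.type || "text/javascript";
  if (task.async !== undefined) {
    script.async = task.async;
  }
  if (task.scriptUrl) {
    script.src = task.scriptUrl;
  }
  if (task.contents) {
    // Browsers ignore inline contents when src is also set.
    script.textContent = task.contents;
  }
  return script;
}
```

In the page, the result could then be attached with document.head.appendChild(buildScriptElement(document, task)).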

InjectStyleElement

This task injects the specified stylesheet onto the current page. Useful to highlight elements for debugging purposes, or perhaps to illustrate certain content prior to a PrintToPDF or Screenshot task.

Properties:

  • condition - An optional JavaScript expression that is evaluated to determine whether the style should be injected.
  • styles - The contents of the style tag that is injected. Put your CSS rules here.

Example:

{
  "name": "InjectStyleElement",
  "description": "Highlight the shirts just because we can",
  "styles": ".thumb-grid { background-color: rgba(0,0,255, 0.5); border: 2px solid red;}"
}
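
The optional condition can be pictured as a gate in front of the injection. The JavaScript sketch below assumes that a falsy condition result means nothing is injected; injectStyleIf and evaluateExpression are hypothetical names, and this is not Skrapr's actual code.

```javascript
// Hypothetical evaluator: runs an expression string in the global scope,
// which stands in for "the page context" in this sketch.
function evaluateExpression(expression) {
  return (0, eval)(expression);
}

// Hypothetical sketch (not Skrapr's implementation): inject a style element
// unless the task has a condition that evaluates to a falsy value.
function injectStyleIf(doc, task, evaluate) {
  if (task.condition !== undefined && !evaluate(task.condition)) {
    return null; // condition was falsy, so nothing is injected
  }
  const style = doc.createElement("style");
  style.textContent = task.styles;
  doc.head.appendChild(style);
  return style;
}
```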

Scrape

This task is designed to facilitate extraction of data from the current page.

Properties:

  • gather - A JSON object in which each property holds a JavaScript expression that is evaluated in the page context. Once all the values are collected, the resulting object is saved to the datastore.

Example:

{
  "name": "Scrape",
  "gather": {
    "_id": "(function() { var fn = URI().filename(); return fn.endsWith('.html') ? URI().segment().slice(-2)[0] : fn; })();",
    "url": "window.location.toString()",
    "sku": "$('div.product-title .sku span').text()",
    "title": "$('div.product-title span:eq(0)').text()",
    "description": "$('div.short_description').text().trim()",
    "price": "$('div.js-prices-box span#store_price').text()",
    "details": "$('div.details ul li').map(function() { return $(this).text(); }).get();",
    "colors": "$('li.color-swatch label').map(function() { return { title: $(this).attr('title'), swatchUrl: $(this).find('img').attr('src'), imageUrl: $('img.zoomImg').attr('src') }; }).get()",
    "sizes": "$('li.size-swatch label').map(function() { return { size: $(this).attr('title'), inStock: !$(this).parent().hasClass('size-disabled') }; }).get()"
  }
}
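
Conceptually, gather turns an object of expressions into an object of results. The JavaScript sketch below shows that idea under the assumption that each expression is evaluated independently; gatherRecord and evaluateInPage are hypothetical names, and saving to the datastore is left out.

```javascript
// Hypothetical evaluator: runs an expression string in the global scope,
// which stands in for "the page context" in this sketch.
function evaluateInPage(expression) {
  return (0, eval)(expression);
}

// Hypothetical sketch (not Skrapr's implementation): evaluate every gather
// expression and collect the results under the same property names.
function gatherRecord(gather, evaluate) {
  const record = {};
  Object.keys(gather).forEach(function (key) {
    record[key] = evaluate(gather[key]);
  });
  return record;
}

// Made-up expressions that need no page or jQuery to run.
const sampleRecord = gatherRecord(
  { count: "[1, 2, 3].length;", label: "['a', 'b'].join('-')" },
  evaluateInPage
);
console.log(sampleRecord);
```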

TemplatedSubFlow

This task contains one or more handlebars-based task templates that are populated with data from attributes defined on elements matching the selector.

This is one of the more complex tasks, but one of the most powerful -- once you get comfortable with it, you'll be reaching for this task often.

Say you have a page that has a number of products on it. For each product anchor tag, you'd like to navigate to the product's url and scrape content from it. With the TemplatedSubFlow you can do this easily:

{
   "name": "TemplatedSubFlow",
   "selector": ".products a",
   "taskTemplates": [
     {
        "name": "Navigate",
        "url": "{{href}}"
     },
     {
        "name": "Scrape",
        "script": "..."
     }
   ]
}

The {{href}} token will be replaced by the value of the href="..." attribute of each anchor matched by the selector. Any attribute of the matched element can be used as a token.
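
The substitution can be pictured with a minimal JavaScript sketch. It uses plain string replacement rather than the Handlebars library, so it is only an approximation of what Skrapr does; fillTokens is a hypothetical name, and leaving unknown tokens untouched is a choice made for this sketch.

```javascript
// Hypothetical sketch (plain string substitution, not Handlebars itself):
// replace each {{token}} with the matching value, leaving unknown tokens as-is.
function fillTokens(template, values) {
  return template.replace(/\{\{\s*([^{}\s]+)\s*\}\}/g, function (match, token) {
    return Object.prototype.hasOwnProperty.call(values, token)
      ? String(values[token])
      : match;
  });
}

// Attribute values as they might be read from one matched anchor (made up).
const anchorAttributes = { href: "/shirts/42.html", title: "Blue shirt" };

console.log(fillTokens("{{href}}", anchorAttributes));
console.log(fillTokens("Expected current url to be {{href}}", anchorAttributes));
```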

Failed Assertions and Navigation errors will cause the subflow to be added to the bottom of the queue.
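
The re-queueing behavior can be sketched in JavaScript as follows. This is an illustration only: runQueue is a hypothetical name, and the maxAttempts limit is an assumption added so the sketch always terminates, since this page does not say whether Skrapr limits retries.

```javascript
// Hypothetical sketch (not Skrapr's implementation): run subflows in order and
// move a failed subflow to the bottom of the queue so that it is retried later.
// maxAttempts is an assumption of this sketch, not a documented Skrapr option.
function runQueue(subflows, runSubflow, maxAttempts) {
  const queue = subflows.map(function (subflow) {
    return { subflow: subflow, attempts: 0 };
  });
  const completed = [];
  const abandoned = [];
  while (queue.length > 0) {
    const entry = queue.shift();
    entry.attempts += 1;
    try {
      runSubflow(entry.subflow); // throws on a failed assertion or navigation error
      completed.push(entry.subflow);
    } catch (err) {
      if (entry.attempts < maxAttempts) {
        queue.push(entry); // back to the bottom of the queue
      } else {
        abandoned.push(entry.subflow);
      }
    }
  }
  return { completed: completed, abandoned: abandoned };
}
```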

Properties:

  • selector - The CSS selector for the elements whose attributes populate the task templates.

  • shuffle - A boolean that indicates whether the created flows are shuffled prior to execution. Shuffling makes it so that sub-flows aren't executed sequentially.

  • taskTemplates - An array of tasks; any available task can be used. Tokens enclosed in double curly braces are replaced with the corresponding attribute values.

    The following built-in tokens are also available:

    • $index - The 0-based index of the selected element
    • $oneBasedIndex - The 1-based index of the selected element (great for nth-of-type or nth-child selectors)
    • $title - The title of the current page
    • $url - The url of the current page
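
To show how the built-in tokens relate to each matched element, and what shuffling amounts to, here is a minimal JavaScript sketch. Both helpers are hypothetical names; the Fisher-Yates shuffle is one common way to randomize order and is an assumption here, not necessarily the algorithm Skrapr uses.

```javascript
// Hypothetical sketch: combine an element's attributes with the built-in
// tokens described above. "page" carries the current page's title and url.
function buildTokenValues(attributes, index, page) {
  return Object.assign({}, attributes, {
    "$index": index,               // 0-based position of the element
    "$oneBasedIndex": index + 1,   // 1-based position, handy for nth-child
    "$title": page.title,
    "$url": page.url
  });
}

// Fisher-Yates shuffle (an assumed technique for this sketch). "random" can be
// supplied for repeatable results and defaults to Math.random.
function shuffleInPlace(items, random) {
  const next = random || Math.random;
  for (let i = items.length - 1; i > 0; i--) {
    const j = Math.floor(next() * (i + 1));
    const tmp = items[i];
    items[i] = items[j];
    items[j] = tmp;
  }
  return items;
}
```

With shuffle left off, the sub-flows would simply keep the order in which the elements were matched.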

Example:

{
  "name": "TemplatedSubFlow",
  "selector": ".thumb-grid a.name",
  "shuffle": true,
  "taskTemplates": [
    {
      "name": "Navigate",
      "url": "{{href}}",
      "referrer": "{{$url}}"
    },
    {
      "name": "Assert",
      "assertion": "document.location.toString() === '{{href}}'",
      "message": "Expected current url to be {{href}}"
    }
  ]
}

See also: TemplatedScriptedSubFlowTask