
Adding Data to the ODN Backend

Lane Aasen edited this page Sep 15, 2016 · 11 revisions

After you have formatted your dataset and uploaded it to Socrata, you can add it to the ODN Backend.

Update ODN Entities

If your dataset contains entities that are not currently in the ODN, you must add the new entities.

Create a Source Declaration

The first step of adding a dataset to the ODN backend is creating a source declaration. Source declarations tell the ODN how to categorize, name, and locate each dataset. They also tell the ODN which variables are in the dataset.

All source declarations are stored in /data/sources.json.

Topics

Topics are at the first level of the source declaration tree. These are broad groupings of datasets like demographics, education, and crime. Each topic may optionally specify a human readable name attribute. If not provided, one will be generated automatically from the topic's ID.

Try to find a topic that will fit your dataset instead of creating a new one if possible. If you do create a new topic, ensure that it has the ID that you want, because you will never be able to change it!

Datasets

Each dataset represents a Socrata dataset. Datasets have the following fields:

  • name (optional): Human readable name of the dataset. Will be generated from the dataset's ID if not provided.
  • fxf (required): NBE ID of the dataset. The backend will not function with OBE datasets.
  • domain (optional): Defaults to odn.data.socrata.com.
  • constraints (required): List of constraints in the dataset. Each constraint is a column of the dataset used for slicing the data. For example, the jobs.occupations dataset has two constraints: occupation and year. When listing constraints, you can either specify just the column name (e.g. "year"), or include an ordering (e.g. ["year", "desc"]). Constraints are ordered in ascending order by default. If a constraint column is not listed here, it will be completely ignored by the backend.
  • sources (required): List of sources of the data. A source can be a string referring to an attribution, or it can be a tuple in the form [sourceID, sourceURL] where sourceID is a reference to an attribution and sourceURL is a link to the specific page where the data was downloaded.
  • searchTerms (required): List of terms to use when searching for datasets related to this one. These search terms will be OR'd together and passed along to Cetera.
  • variables (required): List of variables in the dataset. See the next section for details.

Ensure that your dataset has the proper ID (lowercase, no spaces, readable), because you will never be able to change it!
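The field rules above can be checked mechanically before you commit a declaration. The following is a minimal sketch, not part of the ODN backend: validateDataset, REQUIRED_FIELDS, and ID_PATTERN are hypothetical names, the ID pattern is only one reading of "lowercase, no spaces", and "asc" is assumed to be the counterpart of the documented "desc" ordering.

```javascript
// Hypothetical pre-commit check for one dataset declaration.
// Required fields are taken from the list above.
const REQUIRED_FIELDS = ['fxf', 'constraints', 'sources', 'searchTerms', 'variables'];

// Assumption: IDs are lowercase letters, digits, and underscores.
const ID_PATTERN = /^[a-z][a-z0-9_]*$/;

function validateDataset(id, dataset) {
    const problems = [];

    if (!ID_PATTERN.test(id)) problems.push(`invalid dataset ID: ${id}`);

    for (const field of REQUIRED_FIELDS) {
        if (!(field in dataset)) problems.push(`missing required field: ${field}`);
    }

    // A constraint is either a column name or a [column, ordering] pair.
    // Assumption: the only orderings are "asc" and "desc".
    for (const constraint of dataset.constraints || []) {
        const valid = typeof constraint === 'string' ||
            (Array.isArray(constraint) &&
             constraint.length === 2 &&
             ['asc', 'desc'].includes(constraint[1]));
        if (!valid) problems.push(`malformed constraint: ${JSON.stringify(constraint)}`);
    }

    return problems; // an empty array means no problems were found
}
```

Calling validateDataset('population', declaration) on a complete declaration should return an empty array; each string in a non-empty result names one problem to fix.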

Variables

Each variable corresponds to a value of the variable column in the source dataset. You can use a SoQL $group query to get all of the variables in a dataset. For example, such a query can list every variable in the ODN Population dataset.
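As a rough illustration of such a query, the sketch below builds a URL that groups on the variable column. It assumes the column is literally named variable and that the dataset is reachable through the standard Socrata /resource/{fxf}.json endpoint; variablesQueryURL is a hypothetical helper, not an ODN function.

```javascript
// Build a SoQL URL that returns one row per distinct value of a column.
// Assumption: the variable column is named "variable".
function variablesQueryURL(domain, fxf, column = 'variable') {
    const url = new URL(`https://${domain}/resource/${fxf}.json`);
    url.searchParams.set('$select', column);
    url.searchParams.set('$group', column);
    return url.toString();
}

// Domain and ID of the ODN Population dataset, as used elsewhere on this page.
console.log(variablesQueryURL('odn.data.socrata.com', '9jg8-ki9x'));
```

The printed URL percent-encodes the $ in the parameter names; it can be opened in a browser or requested with any HTTP client.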

The ordering of variables in a dataset will be respected by the ODN backend.

Each variable has the following fields:

  • name (optional): If not provided, the ODN Backend will infer a name for the variable.
  • description (optional): Descriptive name of the variable.
  • type (optional): Type of formatter to use. See the table below for options.
  • stoplight (optional): This should be set to true if the variable is clearly good or bad. For example, a higher high school graduation rate is clearly good, while a higher population is not. If stoplight=true, higher values will show up as green and lower values will show up as red. To switch the scale, use the reverse option.
  • reverse (optional): Flips the map scale when used in conjunction with stoplight=true.
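The interaction between stoplight and reverse can be summarized in a few lines. This is only a sketch of the rule described above, with a hypothetical function name and return shape; it is not how the ODN actually renders its map scales.

```javascript
// Decide which end of the scale is "good" for a variable declaration.
// Returns null when the variable is neither clearly good nor bad.
function stoplightColors(variable) {
    if (!variable.stoplight) return null;

    // reverse only has an effect together with stoplight=true.
    return variable.reverse
        ? {high: 'red', low: 'green'}
        : {high: 'green', low: 'red'};
}
```

For example, a graduation rate would use {"stoplight": true}, while a rate where higher is worse would add "reverse": true.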

Formatters

This table shows how 1.23 is formatted by different formatters:

Format      Value
number      1.23
dollar      $1.23
percent     123%
ratio       1.23%
precision1  1.2
precision2  1.23

For a full list of formatters, see the source code at /app/data/values/format.js. This includes some deprecated formatters for special cases.
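To make the table concrete, here is a small sketch that reproduces its rows for the input 1.23. These functions are illustrative stand-ins written for this page, not the implementations in format.js, which may round or group digits differently for other inputs.

```javascript
// Illustrative formatters that reproduce the table above for 1.23.
// Assumption: these are NOT the real formatters from format.js.
const exampleFormatters = {
    number: value => String(value),
    dollar: value => `$${value.toFixed(2)}`,
    percent: value => `${Math.round(value * 100)}%`, // treats the value as a fraction
    ratio: value => `${value}%`,                      // treats the value as already scaled
    precision1: value => value.toFixed(1),
    precision2: value => value.toFixed(2)
};

for (const [type, format] of Object.entries(exampleFormatters)) {
    console.log(type, format(1.23));
}
```

The distinction worth noting is between percent and ratio: the first multiplies by 100 before appending the percent sign, while the second only appends it.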

Examples

This is the source declaration for the demographics.population dataset:

{
    "demographics": {
        "datasets": {
            "population": {
                "fxf": "9jg8-ki9x",
                "constraints": [["year", "desc"]],
                "variables": {
                    "count": {"name": "Population Count"},
                    "change": {
                        "name": "Population Rate of Change",
                        "description": "Percent change from the previous year",
                        "type": "percent",
                        "stoplight": true
                    }
                },
                "searchTerms": ["population", "household", "demographics", "ethnicity", "minority"],
                "sources": ["acs"]
            }
        }
    }
}
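The declaration tree maps directly onto dotted dataset IDs such as demographics.population. The sketch below walks a declaration shaped like the example above and lists each dataset with its domain, NBE ID, and variable IDs. listDatasets is a hypothetical helper, and the declaration is inlined (with only the fields the walk needs) so the snippet is self-contained; in the project you would load /data/sources.json instead.

```javascript
// Trimmed copy of the example declaration above.
const declaration = {
    demographics: {
        datasets: {
            population: {
                fxf: '9jg8-ki9x',
                constraints: [['year', 'desc']],
                variables: {
                    count: {name: 'Population Count'},
                    change: {name: 'Population Rate of Change', type: 'percent', stoplight: true}
                }
            }
        }
    }
};

// Flatten topics and datasets into a list of {id, domain, fxf, variables}.
function listDatasets(tree, defaultDomain = 'odn.data.socrata.com') {
    const result = [];
    for (const [topicID, topic] of Object.entries(tree)) {
        for (const [datasetID, dataset] of Object.entries(topic.datasets || {})) {
            result.push({
                id: `${topicID}.${datasetID}`,
                domain: dataset.domain || defaultDomain, // domain is optional
                fxf: dataset.fxf,
                variables: Object.keys(dataset.variables || {}) // keeps declaration order
            });
        }
    }
    return result;
}

console.log(listDatasets(declaration));
```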

The ODN backend uses the variables dataset to find out what data is available for a given set of entities. It must be updated whenever a new dataset is added, or when the ID of an existing dataset is changed.

After adding the source declaration, use the variables.js script to extract all of the variables from the dataset.

Make sure you run the script from the base of the project.

% node data/process/variables.js
Usage: variables.js {datasetID} {outputFile}

For example, if we want to get the variables for the demographics.population dataset and output them to population-variables.csv:

% node data/process/variables demographics.population population-variables.csv
found dataset: odn.data.socrata.com:9jg8-ki9x
loading variables: count, change
processed 10000 rows...
processed 20000 rows...
processed 30000 rows...
processed 40000 rows...
processed 50000 rows...
processed 60000 rows...
processed 70000 rows...
processed 80000 rows...
processed 90000 rows...
processed 100000 rows...
processed 110000 rows...
processed 120000 rows...
processed 130000 rows...
done

Regenerating ODN Variables

To regenerate the entire variables dataset, use the all-variables.js script. This will download variables for all datasets simultaneously and dump them to a directory, which may take a while. Be patient if the script seems to have stalled. It is probably just waiting for the Socrata backend to process a request.

% node data/process/all-variables
Usage: all-variables.js {directory}

Note that this script has a pretty CLI that will repeatedly clear the terminal as it refreshes.

Renaming a Dataset

Renaming a dataset (changing its ID) is cumbersome because it requires partially regenerating the variables dataset. Try to get the ID right the first time, and if you really need to change it, either regenerate the variables dataset or delete the old rows and append new ones. Each row in the variables dataset has a row_id which is the row's entity ID and variable ID joined with a -. This can be used to update only the rows that have changed using DataSync.
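The row_id scheme makes it possible to compute exactly which rows a rename affects. The following is a sketch under assumptions: rowID and renamedRows are hypothetical helpers, and the exact form of the entity and variable IDs passed in (for example, whether the variable ID includes the topic and dataset prefix) should be checked against the existing variables dataset.

```javascript
// A row_id is the entity ID and the variable ID joined with "-".
function rowID(entityID, variableID) {
    return `${entityID}-${variableID}`;
}

// For a renamed variable ID, pair each old row_id with its replacement.
function renamedRows(entityIDs, oldVariableID, newVariableID) {
    return entityIDs.map(entityID => ({
        remove: rowID(entityID, oldVariableID),
        append: rowID(entityID, newVariableID)
    }));
}
```

The remove values identify the stale rows to delete and the append values identify the rows to upload in their place.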

Note that a dataset ID cannot be changed once it has been deployed to production since this would cause old links and code to fail.

Out of Memory

If you get an out-of-memory error, try passing --max-old-space-size=8192 to node, for example: node --max-old-space-size=8192 data/process/all-variables.js {directory}

Submitting Changes

Testing Changes Locally

Please verify your changes locally before submitting a pull request. To do this, first start the ODN backend locally using the instructions here.

Most importantly, start the server using npm run server and then run the unit tests in another terminal with npm test. This will ensure that your changes didn't completely break the backend, but it will not ensure that your data is available.

To check that your data is available, find an entity from your dataset and get its ID. Then, call data availability for the entity using the following URL:

http://localhost:3001/data/v1/availability?app_token=CqcTvF7wVsI8IYAq7CdZszLbU&entity_id={entityID}

Just switch out entityID for your entity, send the request, and ensure that your new dataset appears in the response. Tools like JSONView for Chrome can really help when deciphering minified backend responses.
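The same check can be scripted. The sketch below builds the availability URL and looks for a string in the raw response. It assumes Node 18 or later for the global fetch, and because the response structure is not described here, it does a plain text search; availabilityURL and responseMentions are hypothetical names, and needle is whatever string you expect to see, such as your dataset's name or ID.

```javascript
// Build the local data availability URL for one entity.
function availabilityURL(entityID, appToken, base = 'http://localhost:3001') {
    const url = new URL('/data/v1/availability', base);
    url.searchParams.set('app_token', appToken);
    url.searchParams.set('entity_id', entityID);
    return url.toString();
}

// Crude check: does the raw response text mention the given string?
// Assumption: a text search is good enough; adjust to the real response shape.
async function responseMentions(entityID, appToken, needle) {
    const response = await fetch(availabilityURL(entityID, appToken));
    if (!response.ok) throw new Error(`request failed: ${response.status}`);
    const text = await response.text();
    return text.includes(needle);
}
```

With the server running, responseMentions(entityID, appToken, 'population') resolves to true if the response mentions the population dataset.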

Submitting Changes on GitHub

Create a new branch with your changes and push it to GitHub. Create a pull request to merge your new branch into staging and notify a member of the ODN team to review your changes.

Now, you're ready to move on to the next step: Adding Data to the ODN Frontend