# Adding Data to the ODN Backend
After you have formatted your dataset and uploaded it to Socrata, you can add it to the ODN Backend.
If your dataset contains entities that are not currently in the ODN, you must add the new entities.
The first step of adding a dataset to the ODN backend is creating a source declaration. Source declarations tell the ODN how to categorize, name, and locate each dataset. They also tell the ODN which variables are in the dataset.
All source declarations are stored in `/data/sources.json`.
Topics are at the first level of the source declaration tree. These are broad groupings of datasets like `demographics`, `education`, and `crime`. Each topic may optionally specify a human-readable `name` attribute. If one is not provided, it will be generated automatically from the topic's ID.
If possible, find an existing topic that fits your dataset instead of creating a new one. If you do create a new topic, make sure it has the ID that you want, because you will never be able to change it!
Each dataset entry represents a Socrata dataset. Datasets have the following fields:

- `name` (optional): Human-readable name of the dataset. Will be generated from the dataset's ID if not provided.
- `fxf` (required): NBE ID of the dataset. The backend will not function with OBE datasets.
- `domain` (optional): Defaults to `odn.data.socrata.com`.
- `constraints` (required): List of constraints in the dataset. Each constraint is a column of the dataset used for slicing the data. For example, the `jobs.occupations` dataset has two constraints: `occupation` and `year`. When listing constraints, you can either specify just the column name (e.g. `"year"`) or include an ordering (e.g. `["year", "desc"]`). Constraints are ordered in ascending order by default. If a constraint column is not listed here, it will be completely ignored by the backend.
- `sources` (required): List of sources of the data. A source can be a string referring to an attribution, or it can be a tuple in the form `[sourceID, sourceURL]`, where `sourceID` is a reference to an attribution and `sourceURL` is a link to the specific page where the data was downloaded.
- `searchTerms` (required): List of terms to use when searching for datasets related to this one. These search terms will be OR'd together and passed along to Cetera.
- `variables` (required): List of variables in the dataset. See the next section for details.
Ensure that your dataset has the proper ID (lowercase, no spaces, readable), because you will never be able to change it!
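Since constraints accept two forms (a bare column name, or a `[column, order]` pair), a small helper can normalize them. This is an illustrative sketch, not code from the ODN backend; `normalizeConstraint` is a hypothetical name:

```javascript
// Sketch: normalize the two accepted constraint forms.
// A constraint is either a column name ("year") or a [column, order]
// pair (["year", "desc"]); bare names default to ascending order.
function normalizeConstraint(constraint) {
    if (Array.isArray(constraint)) {
        const [column, order] = constraint;
        return [column, order];
    }
    return [constraint, 'asc'];
}

console.log(normalizeConstraint('occupation'));     // → [ 'occupation', 'asc' ]
console.log(normalizeConstraint(['year', 'desc'])); // → [ 'year', 'desc' ]
```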
Each variable corresponds to a value of the `variable` column in the source dataset. You can use a SOQL `$group` query to get all of the variables in a dataset (for example, to get all of the variables in the ODN Population dataset). The ordering of variables in a dataset will be respected by the ODN backend.
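One way to run that `$group` query is through the SODA API. Below is a sketch that builds the request URL, using the population dataset's `9jg8-ki9x` FXF from the example later in this page; `variableQueryURL` is a hypothetical helper:

```javascript
// Sketch: build a SOQL URL listing the distinct values of the
// "variable" column — i.e. the set of variables in the dataset.
function variableQueryURL(domain, fxf) {
    const params = new URLSearchParams({
        '$select': 'variable',
        '$group': 'variable'
    });
    return `https://${domain}/resource/${fxf}.json?${params}`;
}

console.log(variableQueryURL('odn.data.socrata.com', '9jg8-ki9x'));
// → https://odn.data.socrata.com/resource/9jg8-ki9x.json?%24select=variable&%24group=variable
```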
Each variable has the following fields:
- `name` (optional): If not provided, the ODN Backend will infer a name for the variable.
- `description` (optional): Description of the variable.
- `type` (optional): Type of formatter to use. See the table below for options.
- `stoplight` (optional): Set this to `true` if the variable is clearly good or bad. For example, a higher high school graduation rate is clearly good, while a higher population is neither. If `stoplight=true`, higher values will show up as green and lower values will show up as red. To switch the scale, use the `reverse` option.
- `reverse` (optional): Flips the map scale when used in conjunction with `stoplight=true`.
This table shows how 1.23 is formatted by different formatters:
Format | Value |
---|---|
number | 1.23 |
dollar | $1.23 |
percent | 123% |
ratio | 1.23% |
precision1 | 1.2 |
precision2 | 1.23 |
For a full list of formatters, see the source code at `/app/data/values/format.js`. This includes some deprecated formatters for special cases.
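The behavior in the table can be approximated in a few lines of JavaScript. This is an illustrative sketch, not the actual `format.js` implementation (for instance, the real percent formatter may handle fractional percentages differently):

```javascript
// Sketch of the formatters in the table above (not the real
// /app/data/values/format.js). Percent values are rounded to the
// nearest whole percent in this sketch.
const formatters = {
    number: n => String(n),                    // 1.23 → "1.23"
    dollar: n => `$${n.toFixed(2)}`,           // 1.23 → "$1.23"
    percent: n => `${Math.round(n * 100)}%`,   // 1.23 → "123%"
    ratio: n => `${n}%`,                       // 1.23 → "1.23%"
    precision1: n => n.toFixed(1),             // 1.23 → "1.2"
    precision2: n => n.toFixed(2)              // 1.23 → "1.23"
};

console.log(formatters.percent(1.23)); // → 123%
```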
This is the source declaration for the `demographics.population` dataset:
```json
{
    "demographics": {
        "datasets": {
            "population": {
                "fxf": "9jg8-ki9x",
                "constraints": [["year", "desc"]],
                "variables": {
                    "count": {"name": "Population Count"},
                    "change": {
                        "name": "Population Rate of Change",
                        "description": "Percent change from the previous year",
                        "type": "percent",
                        "stoplight": true
                    }
                },
                "searchTerms": ["population", "household", "demographics", "ethnicity", "minority"],
                "sources": ["acs"]
            }
        }
    }
}
```
## Update ODN Variables
The ODN backend uses the variables dataset to find out what data is available for a given set of entities. It must be updated whenever a new dataset is added, or when the ID of an existing dataset is changed.
After adding the source declaration, use the `variables.js` script to extract all of the variables from the dataset. Make sure you run the script from the base of the project.
```
% node data/process/variables.js
Usage: variables.js {datasetID} {outputFile}
```
For example, to get the variables for the `demographics.population` dataset and output them to `population-variables.csv`:
```
% node data/process/variables demographics.population population-variables.csv
found dataset: odn.data.socrata.com:9jg8-ki9x
loading variables: count, change
processed 10000 rows...
processed 20000 rows...
processed 30000 rows...
processed 40000 rows...
processed 50000 rows...
processed 60000 rows...
processed 70000 rows...
processed 80000 rows...
processed 90000 rows...
processed 100000 rows...
processed 110000 rows...
processed 120000 rows...
processed 130000 rows...
done
```
## Regenerating ODN Variables
To regenerate the entire variables dataset, use the `all-variables.js` script. This will download variables for all datasets simultaneously and dump them to a directory, which may take a while. Be patient if the script seems to have stalled; it is probably just waiting for the Socrata backend to process a request.
```
% node data/process/all-variables
Usage: all-variables.js {directory}
```
Note that this script has a pretty CLI that will repeatedly clear the terminal as it refreshes.
Renaming a dataset (changing its ID) is cumbersome because it requires partially regenerating the variables dataset. Try to get the ID right the first time. If you really need to change it, either regenerate the variables dataset or delete the old rows and append new ones. Each row in the variables dataset has a `row_id`, which is the row's entity ID and variable ID joined with a `-`. This can be used to update only the rows that have changed using DataSync. Note that a dataset ID cannot be changed once it has been deployed to production, since this would cause old links and code to fail.
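The `row_id` construction can be sketched as follows. `rowId` is a hypothetical helper, and the example entity ID is illustrative:

```javascript
// Sketch: a variables-dataset row_id is the entity ID and the fully
// qualified variable ID joined with a "-". The IDs below are examples,
// not taken from the actual variables dataset.
function rowId(entityId, variableId) {
    return `${entityId}-${variableId}`;
}

console.log(rowId('0100000US', 'demographics.population.count'));
// → 0100000US-demographics.population.count
```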
If you get an out-of-memory error, try passing `--max-old-space-size=8192` to `node`.
Please verify your changes locally before submitting a pull request. To do this, first start the ODN backend locally using the instructions here.
Most importantly, start the server using `npm run server` and then run the unit tests in another terminal with `npm test`. This will ensure that your changes didn't completely break the backend, but it will not ensure that your data is available.
To check that your data is available, find an entity from your dataset and get its ID. Then, request data availability for the entity using the following URL:
```
http://localhost:3001/data/v1/availability?app_token=CqcTvF7wVsI8IYAq7CdZszLbU&entity_id={entityID}
```
Just switch out `{entityID}` for your entity's ID, send the request, and ensure that your new dataset appears in the response.
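That check can also be scripted. The sketch below only builds the request URL and crudely searches the serialized response for the dataset name, so it makes no assumptions about the availability response's exact shape; both helpers are hypothetical:

```javascript
// Sketch: build the local availability URL and check whether a dataset
// name appears anywhere in a parsed response. Searching the serialized
// JSON is crude, but works as a smoke test regardless of response shape.
const AVAILABILITY_URL = 'http://localhost:3001/data/v1/availability';
const APP_TOKEN = 'CqcTvF7wVsI8IYAq7CdZszLbU';

function availabilityURL(entityId) {
    const params = new URLSearchParams({
        app_token: APP_TOKEN,
        entity_id: entityId
    });
    return `${AVAILABILITY_URL}?${params}`;
}

function containsDataset(response, datasetId) {
    return JSON.stringify(response).includes(`"${datasetId}"`);
}
```

Fetch `availabilityURL(yourEntityId)` with your HTTP client of choice, parse the JSON, and pass it to `containsDataset` along with your dataset's ID.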
Tools like JSONView for Chrome can really help when deciphering minified backend responses.
Create a new branch with your changes and push it to GitHub. Then, create a pull request to merge your branch into `staging` and notify a member of the ODN team to review your changes.
Now you're ready to move on to the next step: Adding Data to the ODN Frontend.