In this lesson, you'll continue investigating new formats for data. Specifically, you'll investigate one of the most popular data formats for the web: JSON files.
You will be able to:
- Describe features of the JSON format and the Python
json
module - Use Python to load and parse JSON documents
JSON stands for JavaScript Object Notation. Similar to CSV, JSON is a plain text data format. However the structure of JSON — based on the syntax of JavaScript — is more complex.
Here's a brief preview of a JSON file:
As you can see, JSON is not a tabular format with one set of rows and one set of columns. JSON files are often nested in a hierarchical structure and will have data structures analogous to Python dictionaries and lists. Here's all of the built-in supported data types in JSON and their counterparts in Python:
In theory we could write our own custom code to split that string on {
, "
, :
etc. and parse the contents of the file into the appropriate Python data structures.
Instead, we'll go ahead and use a pre-built Python module designed for this purpose. It will give us a powerful starting point for accessing and manipulating the data in JSON files. This module is called json
.
You can find full documentation for this module here.
To use the json
module, start by importing it:
import json
To load data from a JSON file, you first open the file using Python's built-in open
function. Then you pass the file object to the json.load
function, which returns a Python object representing the contents of the file.
In the cell below, we open the campaign finance JSON file previewed above:
with open('nyc_2001_campaign_finance.json') as f:
data = json.load(f)
print(type(data))
<class 'dict'>
As you can see, this loaded the data as a dictionary. You can begin to investigate the contents of a JSON file by using our traditional Python methods.
Since we have a dictionary, check its keys:
data.keys()
dict_keys(['meta', 'data'])
Investigate what data types are stored within the values associated with those keys:
for v in data.values():
print(type(v))
<class 'dict'>
<class 'list'>
Then we can dig a level deeper. What are the keys of the nested dictionary?
data['meta'].keys()
dict_keys(['view'])
And what is the type of the value associated with that key?
type(data['meta']['view'])
dict
Again, what are the keys of that twice-nested dictionary?
data['meta']['view'].keys()
dict_keys(['id', 'name', 'attribution', 'averageRating', 'category', 'createdAt', 'description', 'displayType', 'downloadCount', 'hideFromCatalog', 'hideFromDataJson', 'indexUpdatedAt', 'newBackend', 'numberOfComments', 'oid', 'provenance', 'publicationAppendEnabled', 'publicationDate', 'publicationGroup', 'publicationStage', 'rowClass', 'rowsUpdatedAt', 'rowsUpdatedBy', 'tableId', 'totalTimesRated', 'viewCount', 'viewLastModified', 'viewType', 'columns', 'grants', 'metadata', 'owner', 'query', 'rights', 'tableAuthor', 'tags', 'flags'])
That is a lot of keys. One way we might try to view all of that information is using the pandas
package to make a table.
import pandas as pd
pd.set_option("max_colwidth", 120)
pd.DataFrame(
data=data['meta']['view'].values(),
index=data['meta']['view'].keys(),
columns=["value"]
)
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
value | |
---|---|
id | 8dhd-zvi6 |
name | 2001 Campaign Payments |
attribution | Campaign Finance Board (CFB) |
averageRating | 0 |
category | City Government |
createdAt | 1315950830 |
description | A listing of public funds payments for candidates for City office during the 2001 election cycle |
displayType | table |
downloadCount | 1470 |
hideFromCatalog | False |
hideFromDataJson | False |
indexUpdatedAt | 1536596254 |
newBackend | False |
numberOfComments | 0 |
oid | 4140996 |
provenance | official |
publicationAppendEnabled | False |
publicationDate | 1371845179 |
publicationGroup | 240370 |
publicationStage | published |
rowClass | |
rowsUpdatedAt | 1371845177 |
rowsUpdatedBy | 5fuc-pqz2 |
tableId | 932968 |
totalTimesRated | 0 |
viewCount | 233 |
viewLastModified | 1536605717 |
viewType | tabular |
columns | [{'id': -1, 'name': 'sid', 'dataTypeName': 'meta_data', 'fieldName': ':sid', 'position': 0, 'renderTypeName': 'meta_... |
grants | [{'inherited': False, 'type': 'viewer', 'flags': ['public']}] |
metadata | {'rdfSubject': '0', 'rdfClass': '', 'attachments': [{'filename': 'Data_Dictionary_Public_Funds_Payments_FINAL.xlsx',... |
owner | {'id': '5fuc-pqz2', 'displayName': 'NYC OpenData', 'profileImageUrlLarge': '/api/users/5fuc-pqz2/profile_images/LARG... |
query | {} |
rights | [read] |
tableAuthor | {'id': '5fuc-pqz2', 'displayName': 'NYC OpenData', 'profileImageUrlLarge': '/api/users/5fuc-pqz2/profile_images/LARG... |
tags | [finance, campaign finance board, cfb, nyccfb, campaign finance, elections, contributions, politics, campaign, funding] |
flags | [default, restorable, restorePossibleForType] |
So, it looks like the information under the meta
key is essentially all of the metadata about the dataset, including the category, attribution, tags, etc.
Now let's look at the main data.
This time, let's look at the value associated with the data
key. Recall that we previously identified that this had a list
data type, so let's look at the length:
len(data['data'])
285
Now let's look at a couple different values:
data['data'][0]
[1,
'E3E9CC9F-7443-43F6-94AF-B5A0F802DBA1',
1,
1315925633,
'392904',
1315925633,
'392904',
'{\n "invalidCells" : {\n "1519001" : "TOTALPAY",\n "1518998" : "PRIMARYPAY",\n "1519000" : "RUNOFFPAY",\n "1518999" : "GENERALPAY",\n "1518994" : "OFFICECD",\n "1518996" : "OFFICEDIST",\n "1518991" : "ELECTION"\n }\n}',
None,
'CANDID',
'CANDNAME',
None,
'OFFICEBORO',
None,
'CANCLASS',
None,
None,
None,
None]
data['data'][1]
[2,
'9D257416-581A-4C42-85CC-B6EAD9DED97F',
2,
1315925633,
'392904',
1315925633,
'392904',
'{\n}',
'2001',
'B4',
'Aboulafia, Sandy',
'5',
None,
'44',
'P',
'45410.00',
'0',
'0',
'45410.00']
data['data'][2]
[3,
'B80D7891-93CF-49E8-86E8-182B618E68F2',
3,
1315925633,
'392904',
1315925633,
'392904',
'{\n}',
'2001',
'445',
'Adams, Jackie R',
'5',
None,
'7',
'P',
'11073.00',
'0',
'0',
'11073.00']
This looks more like some kind of tabular data, where the first (0
-th) row is some kind of header. Again, let's use pandas to make this into a more-readable table format:
pd.DataFrame(data['data'])
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | E3E9CC9F-7443-43F6-94AF-B5A0F802DBA1 | 1 | 1315925633 | 392904 | 1315925633 | 392904 | {\n "invalidCells" : {\n "1519001" : "TOTALPAY",\n "1518998" : "PRIMARYPAY",\n "1519000" : "RUNOFFPAY",\n ... | None | CANDID | CANDNAME | None | OFFICEBORO | None | CANCLASS | None | None | None | None |
1 | 2 | 9D257416-581A-4C42-85CC-B6EAD9DED97F | 2 | 1315925633 | 392904 | 1315925633 | 392904 | {\n} | 2001 | B4 | Aboulafia, Sandy | 5 | None | 44 | P | 45410.00 | 0 | 0 | 45410.00 |
2 | 3 | B80D7891-93CF-49E8-86E8-182B618E68F2 | 3 | 1315925633 | 392904 | 1315925633 | 392904 | {\n} | 2001 | 445 | Adams, Jackie R | 5 | None | 7 | P | 11073.00 | 0 | 0 | 11073.00 |
3 | 4 | BB012003-78F5-406D-8A87-7FF8A425EE3F | 4 | 1315925633 | 392904 | 1315925633 | 392904 | {\n} | 2001 | HF | Addabbo, Joseph P | 5 | None | 32 | P | 75350.00 | 73970.00 | 0 | 149320.00 |
4 | 5 | 945825F9-2F5D-47C2-A16B-75B93E61E1AD | 5 | 1315925633 | 392904 | 1315925633 | 392904 | {\n} | 2001 | IR | Alamo-Estrada, Agustin | 5 | None | 14 | P | 25000.00 | 2400.00 | 0 | 27400.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
280 | 281 | C50E6A4C-BDE9-4F12-97F4-95D467013540 | 281 | 1315925633 | 392904 | 1315925633 | 392904 | {\n} | 2001 | 537 | Wilson, John H | 5 | None | 13 | P | 0 | 0 | 0 | 0 |
281 | 282 | 04C6D19F-FF63-47B0-B26D-3B8F98B4C16B | 282 | 1315925633 | 392904 | 1315925633 | 392904 | {\n} | 2001 | 559 | Wooten, Donald T | 5 | None | 42 | P | 0 | 0 | 0 | 0 |
282 | 283 | A451E0E9-D382-4A97-AAD8-D7D382055F8D | 283 | 1315925633 | 392904 | 1315925633 | 392904 | {\n} | 2001 | 280 | Yassky, David | 5 | None | 33 | P | 75350.00 | 75350.00 | 0 | 150700.00 |
283 | 284 | E84BCD0C-D6F4-450F-B55B-3199A265C781 | 284 | 1315925633 | 392904 | 1315925633 | 392904 | {\n} | 2001 | 274 | Zapiti, Mike | 5 | None | 22 | P | 12172.00 | 0 | 0 | 12172.00 |
284 | 285 | 5BBC9676-2119-4FB5-9DAB-DE3F71B7681A | 285 | 1315925633 | 392904 | 1315925633 | 392904 | {\n} | 2001 | 442 | Zett, Lori M | 5 | None | 24 | P | 0 | 0 | 0 | 0 |
285 rows × 19 columns
We still have some work to do to understand what all of the columns are supposed to mean, but now we have a general sense of what the data looks like.
Now, let's say that our task is:
Extract the description of the dataset
We know from our initial exploration that this JSON file contains meta
and data
, and that meta
has this kind of high-level information whereas data
has the actual records relating to campaign finance.
Let's look at the keys of meta
again:
data['meta']['view'].keys()
dict_keys(['id', 'name', 'attribution', 'averageRating', 'category', 'createdAt', 'description', 'displayType', 'downloadCount', 'hideFromCatalog', 'hideFromDataJson', 'indexUpdatedAt', 'newBackend', 'numberOfComments', 'oid', 'provenance', 'publicationAppendEnabled', 'publicationDate', 'publicationGroup', 'publicationStage', 'rowClass', 'rowsUpdatedAt', 'rowsUpdatedBy', 'tableId', 'totalTimesRated', 'viewCount', 'viewLastModified', 'viewType', 'columns', 'grants', 'metadata', 'owner', 'query', 'rights', 'tableAuthor', 'tags', 'flags'])
Ok, description
is the 7th one. Let's pull the value associated with the description
key:
data['meta']['view']['description']
'A listing of public funds payments for candidates for City office during the 2001 election cycle'
This is the general process you will use when extracting information from a JSON file.
As you can see, there's a lot going on here with the deeply nested structure of JSON data files.