Skip to content
This repository has been archived by the owner on Apr 27, 2023. It is now read-only.

tim-patterson/jsonsql

Repository files navigation

JsonSQL

License: MIT

Build Status

A standalone app that allows querying of newline delimited json using simple sql expressions

Getting Started

The easiest way of getting started if you're on OSX is to just brew install the code.

brew install tim-patterson/tap/jsonsql

This will install the latest version that has been compiled natively so as to not need java installed.

Running from source

To build and run from source instead simply run the following

git clone git@github.com:tim-patterson/jsonsql.git
cd jsonsql
./run

Basics

The basic syntax of JsonSQL is very similar to basic sql with a couple of major points of difference

  • Instead of tables we select from files or directories, ie json '/some/location'
  • JsonSQL is dynamically/weakly typed, the actual structure of the data being queried isn't known at query planning/compilation time

Describing json structure

describe json 'test_data/nested.json';

describe output

Often we're working with tools such as hive, presto, athena, spark etc where something shaped more like a create table statement is more useful to us, by appending as table on to the describe statement the output is rendered more like a create table statement

describe json 'test_data/nested.json' as table;

describe table_output

Basic selects

select rownum, arrayval, structval from json 'test_data/nested.json';

select output

One thing to note here is that because traditionally the * in select * is an expansion that it done at query planning time based on the known structure of a table, it simply wont work in JsonSQL, in some cases it might be handy to use the __all__ virtual column that returns the table whole row as a single column, ie

select __all__ from json 'test_data/nested.json';

select output

Dealing with nested values - structs/maps

structs are pretty simple to deal with, just use dot notation to drill down into nested structures

select rownum, structval.inner_key from json 'test_data/nested.json';

select output

As an alternative an indexing style syntax may be used, this can be useful if the key is to be computed

select rownum, structval["inner_key"] from json 'test_data/nested.json';

select output

Note internally both syntaxes actually gets converted to use the idx function

ie. select rownum, idx(structval, 'inner_key') from json 'test_data/nested.json'; will return the same output

Dealing with nested values - arrays

For arrays we use a similar lateral view type syntax as is used in hive.

With this syntax we "explode" out the array type to produce a row for each entry

select rownum, arrayval from json 'test_data/nested.json'
lateral view arrayval limit 5;

select output

Note in the above the exploded arrayval is shadowing the underlying table's arrayval. the lateral view can be aliased if needed, ie lateral view arrayval as exploded

If we instead just want to return a single element out of an array we can use the same index style syntax as used by structs

select rownum, arrayval[0] from json 'test_data/nested.json';

select output

Querying from AWS S3

By using an S3 url in our table JsonSQL can query data stored in s3. It uses the aws java sdk to do this. Standard aws environment vars like AWS_PROFILE, AWS_REGION etc will be picked up and used. select output

Querying from http sources

Just as we can query from s3 urls we can also query from http(s) urls.

Try this one to see how it works

select
  d.title,
  d.score,
  d.permalink,
  d.num_comments
from (
  select children.data as d
  from json 'https://www.reddit.com/r/all.json?limit=100'
  lateral view data.children
)
order by num_comments desc;

Project Goals

The main goals for this project are to be a lightweight standalone simple to use tool for adhoc querying of unstructured data using a syntax that's as close as possible to standard sql, while remaining simple and easily hackable.

Alternatives

The major alternative that provides querying of json data using sql without having to create table definitions etc is Apache Drill