A standalone app that allows querying of newline delimited json using simple sql expressions
The easiest way of getting started if you're on OSX is to just brew install the code.
brew install tim-patterson/tap/jsonsql
This will install the latest version that has been compiled natively so as to not need java installed.
To build and run from source instead simply run the following
git clone git@github.com:tim-patterson/jsonsql.git
cd jsonsql
./run
The basic syntax of JsonSQL is very similar to basic sql with a couple of major points of difference
- Instead of tables we select from files or directories, ie
json '/some/location'
- JsonSQL is dynamically/weakly typed, the actual structure of the data being queried isn't known at query planning/compilation time
describe json 'test_data/nested.json';
Often we're working with tools such as hive, presto, athena, spark etc where something shaped more like a create table statement is more useful to us,
by appending as table
on to the describe statement the output is rendered more like a create table statement
describe json 'test_data/nested.json' as table;
select rownum, arrayval, structval from json 'test_data/nested.json';
One thing to note here is that because traditionally the *
in select *
is an expansion that it done at query
planning time based on the known structure of a table, it simply wont work in JsonSQL, in
some cases it might be handy to use the __all__
virtual column that returns the table whole row as a single column, ie
select __all__ from json 'test_data/nested.json';
structs are pretty simple to deal with, just use dot notation to drill down into nested structures
select rownum, structval.inner_key from json 'test_data/nested.json';
As an alternative an indexing style syntax may be used, this can be useful if the key is to be computed
select rownum, structval["inner_key"] from json 'test_data/nested.json';
Note internally both syntaxes actually gets converted to use the
idx
function
ie.
select rownum, idx(structval, 'inner_key') from json 'test_data/nested.json';
will return the same output
For arrays we use a similar lateral view
type syntax as is used in hive.
With this syntax we "explode" out the array type to produce a row for each entry
select rownum, arrayval from json 'test_data/nested.json'
lateral view arrayval limit 5;
Note in the above the exploded
arrayval
is shadowing the underlying table'sarrayval
. the lateral view can be aliased if needed, ielateral view arrayval as exploded
If we instead just want to return a single element out of an array we can use the same index style syntax as used by structs
select rownum, arrayval[0] from json 'test_data/nested.json';
By using an S3 url in our table JsonSQL can query data stored in s3. It uses the aws java sdk to do this.
Standard aws environment vars like AWS_PROFILE
, AWS_REGION
etc will be picked up and used.
Just as we can query from s3 urls we can also query from http(s) urls.
Try this one to see how it works
select
d.title,
d.score,
d.permalink,
d.num_comments
from (
select children.data as d
from json 'https://www.reddit.com/r/all.json?limit=100'
lateral view data.children
)
order by num_comments desc;
The main goals for this project are to be a lightweight standalone simple to use tool for adhoc querying of unstructured data using a syntax that's as close as possible to standard sql, while remaining simple and easily hackable.
The major alternative that provides querying of json data using sql without having to create table definitions etc is Apache Drill