
discussion: auto-detect/unify schema when batch querying a set of json files #18189

Open
lmatz opened this issue Aug 22, 2024 · 0 comments

lmatz commented Aug 22, 2024

From https://www.reddit.com/r/dataengineering/comments/16wysjd/what_is_the_best_way_to_query_json_and_parquet/:

We have a bunch of files on S3 (MinIO), primarily in JSON and Parquet format. We are searching for a simple solution to run some ad-hoc SQL queries against those files to verify data and see the structure, mainly for the engineers building the data pipelines in dbt/Dagster. We don't want a complex solution that involves many components, and there's no need to scale computing beyond a single node.

Our preferred way would be something we can point at an S3 bucket so that it shows all the files as, e.g., views in the database. It shouldn't require manually specifying the schema/structure of the files; it should just look up the structure on read. The files are usually only around 1-20 MB, so really small scale.

Before building a materialized view (MV) in RisingWave, we typically want to "explore" the data a bit to see what columns there are. If the data is not normalized, we probably need to build an ETL pipeline first and then build the MV on top of it.

Direct batch querying without specifying a schema can speed up this exploration, as there could be many upstream sources.

Right now, we need to declare the schema first: #18174 (comment)
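For reference, a minimal sketch of what the declared-schema path looks like today, assuming a RisingWave S3 JSON source (the bucket name, credentials, and some parameter spellings here are illustrative; see the linked comment and the RisingWave docs for the exact syntax):

-- Today every column has to be spelled out up front, even for
-- quick ad-hoc exploration (bucket/credentials are placeholders).
CREATE TABLE json_explore (
    "ID" VARCHAR,
    "Name" VARCHAR,
    "Age" BIGINT,
    "Salary" BIGINT
) WITH (
    connector = 's3',
    s3.bucket_name = 'my-bucket',
    match_pattern = '*.json',
    s3.region_name = 'us-east-1',
    s3.credentials.access = 'xxx',
    s3.credentials.secret = 'xxx'
) FORMAT PLAIN ENCODE JSON;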

DuckDB already supports this; see https://duckdb.org/docs/data/json/overview.html:

An example with two JSON files:

martin@Martins-MacBook-Pro Code % cat test_json.json 
[
    {
        "ID": "abc",
        "Name": "Alice",
        "Age": 25,
        "useless": "nothing right"
    },
    {
        "ID": 2,
        "Name": "Bob",
        "Salary": 60000
    },
    {
        "ID": 3,
        "Age": 35,
        "Salary": 70000
    }
]
martin@Martins-MacBook-Pro Code % cat test_json2.json
[
    {
        "ID": "abc",
        "Name": "Alice",
        "Age": 25,
        "Region": "Earth"
    },
    {
        "ID": 2,
        "Name": "Bob",
        "Salary": 60000,
        "Gender": "NA"
    },
    {
        "ID": 3,
        "Name": "Charlie",
        "Age": 35,
        "Salary": 70000
    }
]
martin@Martins-MacBook-Pro Code % duckdb
v1.0.0 1f98600c2c
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D select * from read_json("*.json")
  ;
┌───────┬─────────┬───────┬───────────────┬────────┬─────────┬─────────┐
│  ID   │  Name   │  Age  │    useless    │ Salary │ Region  │ Gender  │
│ json  │ varchar │ int64 │    varchar    │ int64  │ varchar │ varchar │
├───────┼─────────┼───────┼───────────────┼────────┼─────────┼─────────┤
│ "abc" │ Alice   │    25 │ nothing right │        │         │         │
│ 2     │ Bob     │       │               │  60000 │         │         │
│ 3     │         │    35 │               │  70000 │         │         │
│ "abc" │ Alice   │    25 │               │        │ Earth   │         │
│ 2     │ Bob     │       │               │  60000 │         │ NA      │
│ 3     │ Charlie │    35 │               │  70000 │         │         │
└───────┴─────────┴───────┴───────────────┴────────┴─────────┴─────────┘
D 
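The same auto-detection works when DuckDB is pointed straight at an S3-compatible store such as MinIO, which is exactly the workflow asked for above. A sketch, assuming the httpfs extension and placeholder endpoint/credentials:

D INSTALL httpfs;
D LOAD httpfs;
D SET s3_endpoint = 'localhost:9000';        -- hypothetical MinIO endpoint
D SET s3_url_style = 'path';
D SET s3_use_ssl = false;
D SET s3_access_key_id = 'minioadmin';       -- placeholder credentials
D SET s3_secret_access_key = 'minioadmin';
D SELECT * FROM read_json('s3://my-bucket/*.json');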

Besides querying directly, we can ingest the data into a table:

D CREATE TABLE todos AS
      SELECT *
      FROM read_json('*.json');
D select * from todos;
┌───────┬─────────┬───────┬───────────────┬────────┬─────────┬─────────┐
│  ID   │  Name   │  Age  │    useless    │ Salary │ Region  │ Gender  │
│ json  │ varchar │ int64 │    varchar    │ int64  │ varchar │ varchar │
├───────┼─────────┼───────┼───────────────┼────────┼─────────┼─────────┤
│ "abc" │ Alice   │    25 │ nothing right │        │         │         │
│ 2     │ Bob     │       │               │  60000 │         │         │
│ 3     │         │    35 │               │  70000 │         │         │
│ "abc" │ Alice   │    25 │               │        │ Earth   │         │
│ 2     │ Bob     │       │               │  60000 │         │ NA      │
│ 3     │ Charlie │    35 │               │  70000 │         │         │
└───────┴─────────┴───────┴───────────────┴────────┴─────────┴─────────┘
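If the inferred types are not what we want (e.g., we would rather force ID to VARCHAR than fall back to json), read_json also accepts an explicit columns map that replaces auto-detection; a sketch, assuming every value can be cast to the requested type:

D SELECT *
  FROM read_json('*.json',
                 columns = {ID: 'VARCHAR', Name: 'VARCHAR', Age: 'BIGINT',
                            Salary: 'BIGINT', Region: 'VARCHAR', Gender: 'VARCHAR'});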

We can also create an empty table from it, capturing only the inferred schema:

D CREATE TABLE todos AS
      SELECT *
      FROM read_json('*.json') limit 0;
D select * from todos;
┌──────┬─────────┬───────┬─────────┬────────┬─────────┬─────────┐
│  ID  │  Name   │  Age  │ useless │ Salary │ Region  │ Gender  │
│ json │ varchar │ int64 │ varchar │ int64  │ varchar │ varchar │
├──────┴─────────┴───────┴─────────┴────────┴─────────┴─────────┤
│                            0 rows                             │
└───────────────────────────────────────────────────────────────┘
D describe todos;
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │  null   │   key   │ default │  extra  │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ ID          │ JSON        │ YES     │         │         │         │
│ Name        │ VARCHAR     │ YES     │         │         │         │
│ Age         │ BIGINT      │ YES     │         │         │         │
│ useless     │ VARCHAR     │ YES     │         │         │         │
│ Salary      │ BIGINT      │ YES     │         │         │         │
│ Region      │ VARCHAR     │ YES     │         │         │         │
│ Gender      │ VARCHAR     │ YES     │         │         │         │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
D 

After some data exploration, we can select only the columns we actually want, excluding the useless/bad ones:

D CREATE TABLE real_table AS
  SELECT * EXCLUDE (useless) FROM todos;
D select * from real_table;
┌───────┬─────────┬───────┬────────┬─────────┬─────────┐
│  ID   │  Name   │  Age  │ Salary │ Region  │ Gender  │
│ json  │ varchar │ int64 │ int64  │ varchar │ varchar │
├───────┼─────────┼───────┼────────┼─────────┼─────────┤
│ "abc" │ Alice   │    25 │        │         │         │
│ 2     │ Bob     │       │  60000 │         │         │
│ 3     │         │    35 │  70000 │         │         │
│ "abc" │ Alice   │    25 │        │ Earth   │         │
│ 2     │ Bob     │       │  60000 │         │ NA      │
│ 3     │ Charlie │    35 │  70000 │         │         │
└───────┴─────────┴───────┴────────┴─────────┴─────────┘

Note that "ID" has different data types across the records (a string for Alice, integers for the others), so DuckDB unifies the column into the json type.
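To get a usable scalar back out of the unified json column, the value can be extracted as text. A plain ID::VARCHAR cast would keep the JSON quoting (returning "abc" with the quotes), so json_extract_string (a.k.a. the ->> operator) is the safer choice; a sketch:

D SELECT json_extract_string(ID, '$') AS id, Name, Age, Salary
  FROM real_table;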

github-actions bot added this to the release-2.1 milestone Aug 22, 2024
lmatz removed this from the release-2.1 milestone Oct 17, 2024