
discussion: auto-detect/unify schema when batch querying a set of json files #18189

Open
lmatz opened this issue Aug 22, 2024 · 0 comments

lmatz commented Aug 22, 2024

From https://www.reddit.com/r/dataengineering/comments/16wysjd/what_is_the_best_way_to_query_json_and_parquet/:

We have a bunch of files on S3 (MinIO), primarily in JSON and Parquet format. We are searching for a simple solution to run some ad-hoc SQL queries against those files to verify data and see the structure, mainly for the engineers building the data pipelines in dbt/Dagster. We don't want a complex solution that involves many components, and there's no need to scale computing beyond a single node.

Our preferred way would be something we can point at an S3 bucket so that it shows all the files as, e.g., views in the database. It shouldn't require manually specifying the schema/structure of the files; it should just look up the structure on read. The files are usually only around 1-20 MB, so really small scale.

Before building a materialized view (MV) in RisingWave, we typically want to "explore" the data a bit to see what columns there are. If the data is not normalized, we probably need to build an ETL pipeline first and then build the MV on top of it.

Direct batch querying without specifying a schema can speed up this exploration, as there could be many upstream sources.

Right now, we need to declare the schema first: #18174 (comment)
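For reference, a minimal sketch of what the declared-schema path looks like today, assuming a RisingWave S3 JSON source (the bucket name, credentials, and some parameter spellings here are illustrative; see the linked comment and the RisingWave docs for the exact syntax):

-- Today every column has to be spelled out up front, even for
-- quick ad-hoc exploration (bucket/credentials are placeholders).
CREATE TABLE json_explore (
    "ID" VARCHAR,
    "Name" VARCHAR,
    "Age" BIGINT,
    "Salary" BIGINT
) WITH (
    connector = 's3',
    s3.bucket_name = 'my-bucket',
    match_pattern = '*.json',
    s3.region_name = 'us-east-1',
    s3.credentials.access = 'xxx',
    s3.credentials.secret = 'xxx'
) FORMAT PLAIN ENCODE JSON;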

DuckDB already supports this; see https://duckdb.org/docs/data/json/overview.html:

An example with two JSON files:

martin@Martins-MacBook-Pro Code % cat test_json.json 
[
    {
        "ID": "abc",
        "Name": "Alice",
        "Age": 25,
        "useless": "nothing right"
    },
    {
        "ID": 2,
        "Name": "Bob",
        "Salary": 60000
    },
    {
        "ID": 3,
        "Age": 35,
        "Salary": 70000
    }
]
martin@Martins-MacBook-Pro Code % cat test_json2.json
[
    {
        "ID": "abc",
        "Name": "Alice",
        "Age": 25,
        "Region": "Earth"
    },
    {
        "ID": 2,
        "Name": "Bob",
        "Salary": 60000,
        "Gender": "NA"
    },
    {
        "ID": 3,
        "Name": "Charlie",
        "Age": 35,
        "Salary": 70000
    }
]
martin@Martins-MacBook-Pro Code % duckdb
v1.0.0 1f98600c2c
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D select * from read_json("*.json")
  ;
┌───────┬─────────┬───────┬───────────────┬────────┬─────────┬─────────┐
│  ID   │  Name   │  Age  │    useless    │ Salary │ Region  │ Gender  │
│ json  │ varchar │ int64 │    varchar    │ int64  │ varchar │ varchar │
├───────┼─────────┼───────┼───────────────┼────────┼─────────┼─────────┤
│ "abc" │ Alice   │    25 │ nothing right │        │         │         │
│ 2     │ Bob     │       │               │  60000 │         │         │
│ 3     │         │    35 │               │  70000 │         │         │
│ "abc" │ Alice   │    25 │               │        │ Earth   │         │
│ 2     │ Bob     │       │               │  60000 │         │ NA      │
│ 3     │ Charlie │    35 │               │  70000 │         │         │
└───────┴─────────┴───────┴───────────────┴────────┴─────────┴─────────┘
D 
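The same auto-detection works when DuckDB is pointed straight at an S3-compatible store such as MinIO, which is exactly the workflow asked for above. A sketch, assuming the httpfs extension and placeholder endpoint/credentials:

D INSTALL httpfs;
D LOAD httpfs;
D SET s3_endpoint = 'localhost:9000';        -- hypothetical MinIO endpoint
D SET s3_url_style = 'path';
D SET s3_use_ssl = false;
D SET s3_access_key_id = 'minioadmin';       -- placeholder credentials
D SET s3_secret_access_key = 'minioadmin';
D SELECT * FROM read_json('s3://my-bucket/*.json');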

Besides querying directly, we can ingest the data into a table:

D CREATE TABLE todos AS
      SELECT *
      FROM read_json('*.json');
D select * from todos;
┌───────┬─────────┬───────┬───────────────┬────────┬─────────┬─────────┐
│  ID   │  Name   │  Age  │    useless    │ Salary │ Region  │ Gender  │
│ json  │ varchar │ int64 │    varchar    │ int64  │ varchar │ varchar │
├───────┼─────────┼───────┼───────────────┼────────┼─────────┼─────────┤
│ "abc" │ Alice   │    25 │ nothing right │        │         │         │
│ 2     │ Bob     │       │               │  60000 │         │         │
│ 3     │         │    35 │               │  70000 │         │         │
│ "abc" │ Alice   │    25 │               │        │ Earth   │         │
│ 2     │ Bob     │       │               │  60000 │         │ NA      │
│ 3     │ Charlie │    35 │               │  70000 │         │         │
└───────┴─────────┴───────┴───────────────┴────────┴─────────┴─────────┘
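If the inferred types are not what we want (e.g., we would rather force ID to VARCHAR than fall back to json), read_json also accepts an explicit columns map that replaces auto-detection; a sketch, assuming every value can be cast to the requested type:

D SELECT *
  FROM read_json('*.json',
                 columns = {ID: 'VARCHAR', Name: 'VARCHAR', Age: 'BIGINT',
                            Salary: 'BIGINT', Region: 'VARCHAR', Gender: 'VARCHAR'});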

We can also create an empty table from it, capturing only the inferred schema:

D CREATE TABLE todos AS
      SELECT *
      FROM read_json('*.json') limit 0;
D select * from todos;
┌──────┬─────────┬───────┬─────────┬────────┬─────────┬─────────┐
│  ID  │  Name   │  Age  │ useless │ Salary │ Region  │ Gender  │
│ json │ varchar │ int64 │ varchar │ int64  │ varchar │ varchar │
├──────┴─────────┴───────┴─────────┴────────┴─────────┴─────────┤
│                            0 rows                             │
└───────────────────────────────────────────────────────────────┘
D describe todos;
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │  null   │   key   │ default │  extra  │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ ID          │ JSON        │ YES     │         │         │         │
│ Name        │ VARCHAR     │ YES     │         │         │         │
│ Age         │ BIGINT      │ YES     │         │         │         │
│ useless     │ VARCHAR     │ YES     │         │         │         │
│ Salary      │ BIGINT      │ YES     │         │         │         │
│ Region      │ VARCHAR     │ YES     │         │         │         │
│ Gender      │ VARCHAR     │ YES     │         │         │         │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
D 

After some data exploration, we can select only the columns we actually want, excluding the useless/bad ones:

D CREATE TABLE real_table AS
  SELECT * EXCLUDE (useless) FROM todos;
D select * from real_table;
┌───────┬─────────┬───────┬────────┬─────────┬─────────┐
│  ID   │  Name   │  Age  │ Salary │ Region  │ Gender  │
│ json  │ varchar │ int64 │ int64  │ varchar │ varchar │
├───────┼─────────┼───────┼────────┼─────────┼─────────┤
│ "abc" │ Alice   │    25 │        │         │         │
│ 2     │ Bob     │       │  60000 │         │         │
│ 3     │         │    35 │  70000 │         │         │
│ "abc" │ Alice   │    25 │        │ Earth   │         │
│ 2     │ Bob     │       │  60000 │         │ NA      │
│ 3     │ Charlie │    35 │  70000 │         │         │
└───────┴─────────┴───────┴────────┴─────────┴─────────┘

Note that "ID" has different data types across the records (a string for Alice, integers for the others), so DuckDB unifies the column into the json type.
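To get a usable scalar back out of the unified json column, the value can be extracted as text. A plain ID::VARCHAR cast would keep the JSON quoting (returning "abc" with the quotes), so json_extract_string (a.k.a. the ->> operator) is the safer choice; a sketch:

D SELECT json_extract_string(ID, '$') AS id, Name, Age, Salary
  FROM real_table;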

github-actions bot added this to the release-2.1 milestone Aug 22, 2024
lmatz removed this from the release-2.1 milestone Oct 17, 2024