We have a bunch of files on S3 (MinIO), primarily in JSON and Parquet format. We are looking for a simple way to run ad hoc SQL queries against those files to verify the data and see its structure, mainly for the engineers building the data pipelines in dbt/Dagster. We don't want a complex solution that involves many components, and there is no need to scale compute beyond a single node.
Our preferred approach would be something we can point at an S3 bucket and have it expose all the files as, e.g., views in the database. It should not require manually specifying the schema/structure of the files; the structure should just be looked up on read. The files are usually only around 1-20 MB, so this is really small scale.
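For reference, DuckDB already gets close to this workflow for S3-compatible storage. A minimal sketch, assuming a recent DuckDB with the `httpfs` extension; the bucket name, endpoint, and credentials are placeholders:

```sql
INSTALL httpfs;
LOAD httpfs;

-- MinIO endpoint and credentials are placeholders.
CREATE SECRET minio (
    TYPE S3,
    KEY_ID 'minio_access_key',
    SECRET 'minio_secret_key',
    ENDPOINT 'minio.local:9000',
    URL_STYLE 'path',
    USE_SSL false
);

-- Schema is inferred on read; no DDL required.
SELECT * FROM read_parquet('s3://my-bucket/events/*.parquet') LIMIT 10;
SELECT * FROM read_json_auto('s3://my-bucket/raw/*.json') LIMIT 10;

-- Files can also be exposed as views, matching the "files as views" idea.
CREATE VIEW events AS
SELECT * FROM read_parquet('s3://my-bucket/events/*.parquet');
```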
Before building an MV in RW, we typically want to "explore" the data a bit to see what columns there are. If the data is not normalized, we probably need to build an ETL pipeline first and then build the MV on top of it.
Direct batch querying without specifying a schema can speed up this "exploration", since there can be many upstream sources.
Right now, we need to declare the schema first: #18174 (comment)
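For context, a sketch of what the current schema-first declaration looks like. The connector options below are written from memory of the S3 source docs, and the MinIO settings are placeholders, so treat this as illustrative rather than exact:

```sql
-- Today the columns must be declared up front, before the data can be inspected.
CREATE SOURCE s3_json_source (
    id   INT,
    name VARCHAR
)
WITH (
    connector = 's3',
    s3.bucket_name = 'my-bucket',
    s3.region_name = 'us-east-1',
    s3.endpoint_url = 'http://minio.local:9000',
    s3.credentials.access = 'minio_access_key',
    s3.credentials.secret = 'minio_secret_key',
    match_pattern = '*.json'
) FORMAT PLAIN ENCODE JSON;
```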
(The request quoted above is from https://www.reddit.com/r/dataengineering/comments/16wysjd/what_is_the_best_way_to_query_json_and_parquet/)
For comparison, this is how DuckDB handles it (https://duckdb.org/docs/data/json/overview.html):
An example with two JSON files; the contents below are hypothetical, chosen so that "ID" has a different type in each file:
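A sketch using DuckDB's `read_json_auto`; the file names and contents are made up for illustration:

```sql
-- Hypothetical file contents (newline-delimited JSON):
--   file1.json:  {"ID": 1,     "Name": "duck"}
--                {"ID": 2,     "Name": "goose"}
--   file2.json:  {"ID": "a-3", "Name": "swan", "comment": "bad row?"}

-- Query both files directly; the schema is inferred on read,
-- and union_by_name unifies the two files' schemas.
SELECT * FROM read_json_auto('file*.json', union_by_name = true);
```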
Besides directly querying, we can ingest the data into a table:
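A sketch, reusing the hypothetical files above:

```sql
-- Materialize the inferred result into a DuckDB table.
CREATE TABLE raw_src AS
SELECT * FROM read_json_auto('file*.json', union_by_name = true);
```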
We can create an empty table out of it:
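For example, by keeping only the inferred schema and no rows:

```sql
-- Same inferred schema, zero rows; DESCRIBE shows the detected column types.
CREATE TABLE raw_src_schema AS
SELECT * FROM read_json_auto('file*.json', union_by_name = true)
LIMIT 0;

DESCRIBE raw_src_schema;
```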
After doing some data exploration, we can proceed to select out the columns we truly want by filtering out useless/bad data:
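A sketch of such a cleanup query; the column names and the filter are illustrative:

```sql
-- Keep only the columns we care about and drop rows that fail a basic sanity check.
CREATE TABLE cleaned_src AS
SELECT
    "ID"   AS id,    -- still loosely typed here; cast once a target type is decided
    "Name" AS name
FROM raw_src
WHERE "Name" IS NOT NULL;
```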
We note that "ID" has a different data type in the two files, so the values are unified into the `JSON` type.