```bash
pip install pydantic-glue
```
Converts pydantic schemas to JSON Schema and then to an AWS Glue schema, so in theory anything that can be converted to JSON Schema could also work.
When using AWS Kinesis Firehose in a configuration that receives JSON and writes Parquet files to S3, one needs to define an AWS Glue table so Firehose knows what schema to use when creating the Parquet files.
AWS Glue lets you define a schema using Avro or JSON Schema and then create a table from that schema, but as of May 2022 there is a limitation on AWS: tables created that way can't be used with Kinesis Firehose (see https://stackoverflow.com/questions/68125501/invalid-schema-error-in-aws-glue-created-via-terraform). This is also confirmed by AWS support.
What one could do is create a table and set the columns manually, but this means you now have two sources of truth to maintain.
This tool allows you to define a table in pydantic and generate a JSON file with column types that can be used with Terraform to create a Glue table.
Take the following pydantic class:

```python
from pydantic import BaseModel
from typing import List


class Bar(BaseModel):
    name: str
    age: int


class Foo(BaseModel):
    nums: List[int]
    bars: List[Bar]
    other: str
```
Running pydantic-glue:

```bash
pydantic-glue -f example.py -c Foo
```

you get this JSON in the terminal:
```json
{
  "//": "Generated by pydantic-glue at 2022-05-25 12:35:55.333570. DO NOT EDIT",
  "columns": {
    "nums": "array<int>",
    "bars": "array<struct<name:string,age:int>>",
    "other": "string"
  }
}
```
and can be used in Terraform like this:
```hcl
locals {
  columns = jsondecode(file("${path.module}/glue_schema.json")).columns
}

resource "aws_glue_catalog_table" "table" {
  name          = "table_name"
  database_name = "db_name"

  storage_descriptor {
    dynamic "columns" {
      for_each = local.columns
      content {
        name = columns.key
        type = columns.value
      }
    }
  }
}
```
Alternatively, you can run the CLI with the `-o` flag to set the output file location:

```bash
pydantic-glue -f example.py -c Foo -o example.json -l
```
Wherever there is a `type` key in the input JSON Schema, an additional key `glue_type` may be defined to override the type that is used in the AWS Glue schema. This is useful, for example, for a pydantic model that has a field of type `int` holding a Unix epoch time, while the column type you would like in Glue is `timestamp`.
Additional JSON Schema keys can be added to a pydantic model by using the `Field` function with the argument `json_schema_extra`, like so:
```python
from pydantic import BaseModel, Field


class A(BaseModel):
    epoch_time: int = Field(
        ...,
        json_schema_extra={
            "glue_type": "timestamp",
        },
    )
```
The resulting JSON Schema will be:
```json
{
  "properties": {
    "epoch_time": {
      "glue_type": "timestamp",
      "title": "Epoch Time",
      "type": "integer"
    }
  },
  "required": [
    "epoch_time"
  ],
  "title": "A",
  "type": "object"
}
```
And the result after processing with pydantic-glue:
```json
{
  "//": "Generated by pydantic-glue at 2022-05-25 12:35:55.333570. DO NOT EDIT",
  "columns": {
    "epoch_time": "timestamp"
  }
}
```
Recursion through object properties terminates when you supply a `glue_type`. If the type is complex, you must supply the full complex type yourself.
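For instance, if a field's natural mapping would be a complex type, the whole Glue type string has to be written out in the override. The model and the `map<string,int>` column below are illustrative and not taken from the project's examples:

```python
from typing import Dict

from pydantic import BaseModel, Field


class Event(BaseModel):
    # Illustrative field: because glue_type stops the recursion here,
    # the complete complex Glue type must be spelled out by hand.
    counters: Dict[str, int] = Field(
        ...,
        json_schema_extra={
            "glue_type": "map<string,int>",
        },
    )
```

With this override, the generated column for `counters` would simply be the supplied string `map<string,int>`.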
- pydantic gets converted to JSON Schema
- the JSON Schema types get mapped to Glue types recursively (a rough sketch of this mapping is shown below)
- not all types are supported; I just add types as I need them, but adding types is very easy, so feel free to open an issue or send a PR if you stumble upon an unsupported use case
- the tool could easily be extended to work with JSON Schema directly
- thus, anything that can be converted to JSON Schema should also work
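To make the recursive mapping concrete, here is a minimal sketch of how JSON Schema types could be turned into Glue column types. This is not pydantic-glue's actual implementation: the function name, the primitive-type table, and the assumption that any `$ref` definitions are already inlined are all illustrative.

```python
# Illustrative sketch only; not pydantic-glue's real code.
from typing import Any, Dict

# Assumed primitive mapping, inferred from the examples above.
PRIMITIVES = {
    "integer": "int",
    "string": "string",
    "number": "double",
    "boolean": "boolean",
}


def to_glue_type(schema: Dict[str, Any]) -> str:
    """Recursively map a JSON Schema node to a Glue column type string."""
    # An explicit glue_type override wins and stops the recursion.
    if "glue_type" in schema:
        return schema["glue_type"]

    json_type = schema.get("type")
    if json_type == "array":
        # Arrays become array<...> of the mapped item type.
        return f"array<{to_glue_type(schema['items'])}>"
    if json_type == "object":
        # Objects become struct<name:type,...> over their properties.
        fields = ",".join(
            f"{name}:{to_glue_type(prop)}"
            for name, prop in schema["properties"].items()
        )
        return f"struct<{fields}>"
    return PRIMITIVES[json_type]
```

Under these assumptions, the `Foo` schema from the example above would map to the same column types shown in the generated JSON: `array<int>`, `array<struct<name:string,age:int>>`, and `string`.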