Conversation
| Logical Type | Description | Parameters |
| :---- | :---- | :---- |
| STRUCT\<name: type, ...\> | An ordered collection of named fields. | Field list |
In the SQL standard this is the record type; should we follow that?
Looks like it is called the ROW type. I don't think exactly matching one standard has to be a goal, since users will be interacting with the translated types specific to the query systems they are using.
| Logical Type | Description | Parameters |
| :---- | :---- | :---- |
| STRUCT\<name: type, ...\> | An ordered collection of named fields. | Field list |
| LIST\<element\_type\> | An ordered list of elements of the same type. | Element type |
| MAP\<key\_type, value\_type\> | A collection of key-value pairs. Keys must be unique. | Key, Value types |
| UNION\<type1, type2, ...\> | A value that can be one of several specified types. | Type list |
Is UNION in the SQL standard?
I don't think so. Please see the comment above; exact conformance to one standard is not an objective here.
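To make the nested types above concrete, here is a minimal, hypothetical Java sketch of how they could be modeled as a small type hierarchy. The names (`LogicalType`, `StructType`, `Field`, etc.) are placeholders for illustration, not the RFC's actual API:

```java
import java.util.List;

// Hypothetical sketch of the nested logical types; not the RFC's actual API.
interface LogicalType {
    String render(); // human-readable form, e.g. STRUCT<id: INT>
}

record PrimitiveType(String name) implements LogicalType {
    public String render() { return name; }
}

record Field(String name, LogicalType type) {}

record StructType(List<Field> fields) implements LogicalType {
    public String render() {
        StringBuilder sb = new StringBuilder("STRUCT<");
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) sb.append(", ");
            Field f = fields.get(i);
            sb.append(f.name()).append(": ").append(f.type().render());
        }
        return sb.append(">").toString();
    }
}

record ListType(LogicalType elementType) implements LogicalType {
    public String render() { return "LIST<" + elementType.render() + ">"; }
}

record MapType(LogicalType keyType, LogicalType valueType) implements LogicalType {
    public String render() {
        return "MAP<" + keyType.render() + ", " + valueType.render() + ">";
    }
}
```

Under this sketch, a `STRUCT<id: INT, tags: LIST<STRING>>` is built by composing a `StructType` of two `Field`s, which renders back to the same string form used in the table.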
| Logical Type | Description | Parameters |
| :---- | :---- | :---- |
| DICTIONARY\<K, V\> | A dictionary-encoded type for low-cardinality columns to improve performance and reduce storage. K is an integer index type, V is the value type. | K: Index Type, V: Value Type |
Isn't DICTIONARY more of a physical encoding than a logical type?
It is actually a logical type as defined in systems like Apache Arrow, so that users can interact with it directly.
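As context for why Arrow exposes dictionary encoding to users: a low-cardinality column is stored as an integer index array plus a deduplicated values array. This hypothetical sketch (plain Java, not Arrow or Hudi code) illustrates the idea for a `DICTIONARY<INT, STRING>` column:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of DICTIONARY<K, V>: row values are replaced by
// small integer indices into a deduplicated dictionary of distinct values.
class DictionaryEncoded {
    final List<String> dictionary = new ArrayList<>();  // distinct V values
    final List<Integer> indices = new ArrayList<>();    // K index per row
    private final Map<String, Integer> lookup = new HashMap<>();

    void append(String value) {
        Integer idx = lookup.get(value);
        if (idx == null) {
            idx = dictionary.size();
            dictionary.add(value);
            lookup.put(value, idx);
        }
        indices.add(idx);
    }

    String decode(int row) {
        return dictionary.get(indices.get(row));
    }
}
```

For a column like country codes, millions of rows collapse into a handful of dictionary entries plus narrow integer indices, which is the storage and performance win the row above describes.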
| Logical Type | Apache Arrow Type | Apache Parquet Type (Physical \+ Logical) | Apache Avro Type | Apache Spark Type | Apache Flink Type |
| :---- | :---- | :---- | :---- | :---- | :---- |
| VARIANT | DenseUnion or LargeBinary | BYTE\_ARRAY \+ JSON | string or union | VariantType | JSON |
## Implementation
Are expression evaluation and pushdown based on the new type system included in this RFC?
The main goal of this RFC is to lay out the types and the rationale for them. As a first step, we want to use this type system as an internal format replacing Avro. Direct integration with query engine types would then cover push-down evaluation and other optimizations.
A specific Hudi core module, "hudi-core-type", will define the above types. The translation layers to and from other type systems such as Avro, Spark, Flink, and Parquet will reside in their own separate modules to keep the dependencies clean.
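One common shape for such per-system translation layers is a small translator interface implemented once per engine module. This is a hypothetical sketch (the interface and class names are illustrative, not the actual module API); the Spark names follow the mapping table quoted elsewhere in this thread:

```java
// Hypothetical sketch of a per-system translation layer that would live in
// its own module (e.g. a hudi-spark-type module); names are illustrative.
interface TypeTranslator<T> {
    T translatePrimitive(String logicalName);
}

// Example target: the Spark SQL type name for a primitive logical type.
class SparkTypeTranslator implements TypeTranslator<String> {
    public String translatePrimitive(String logicalName) {
        switch (logicalName) {
            case "INT":    return "IntegerType";
            case "FLOAT":  return "FloatType";
            case "DOUBLE": return "DoubleType";
            case "STRING": return "StringType";
            default:
                throw new IllegalArgumentException("unmapped type: " + logicalName);
        }
    }
}
```

Keeping each `TypeTranslator` in its engine's own module means hudi-core-type itself never depends on Spark, Flink, or Avro classes, which is the dependency cleanliness the paragraph above describes.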
The table schema itself will need to be tracked in the metadata table.
This needs more elaboration: specify how schemas are persisted (format, versioning) in the metadata table.
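For discussion, one possible shape for a persisted, versioned schema entry is sketched below. This is purely illustrative; the RFC has not specified the actual persistence format, and `SchemaEntry`, `SchemaHistory`, and the field names are hypothetical:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical shape for a schema record in the metadata table; the RFC
// has not yet specified the actual persistence format or field names.
record SchemaEntry(
        long version,           // monotonically increasing schema version
        String instantTime,     // commit that introduced this schema
        String serializedSchema // the schema payload, e.g. serialized as JSON
) {}

class SchemaHistory {
    private final List<SchemaEntry> entries;
    SchemaHistory(List<SchemaEntry> entries) { this.entries = entries; }

    // Latest schema whose introducing commit is at or before the queried instant.
    Optional<SchemaEntry> schemaAt(String instantTime) {
        return entries.stream()
                .filter(e -> e.instantTime().compareTo(instantTime) <= 0)
                .max(Comparator.comparingLong(SchemaEntry::version));
    }
}
```

A versioned history like this would let time-travel reads resolve the schema that was active at a given instant, which is one of the use cases a schema-in-metadata-table design would need to cover.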
## Design
The canonical in-memory representation for all types will be based on the Apache Arrow specification. The main reasons for this are:
Generally, a type system should primarily be a logical abstraction. Can we keep it orthogonal to physical/implementation choices, e.g., zero-copy, multi-modal, etc.?
The zero-copy and multi-modal points are there to motivate aligning the type system with the Arrow specification. The initial implementation will focus on implementing the specification, not on those optimizations.
yihua left a comment:
cc @jonvex, we'll also review this RFC.
#13711 for Hudi 1.1 introduced an interim class ValueType to support logical types in the column stats index before this new type system is ready. The new type system should also consider that use case. See #13711 (comment) for more details.
| Logical Type | Apache Arrow Type | Apache Parquet Type (Physical \+ Logical) | Apache Avro Type | Apache Spark Type | Apache Flink Type |
| :---- | :---- | :---- | :---- | :---- | :---- |
| FLOAT16 | Float16 | FLOAT (promoted) | float (promoted) | FloatType (promoted) | FLOAT (promoted) |
| FLOAT | Float32 | FLOAT | float | FloatType | FLOAT |
| DOUBLE | Float64 | DOUBLE | double | DoubleType | DOUBLE |
| DECIMAL(p,s) | Decimal128(p,s) or Decimal256(p,s) | FIXED\_LEN\_BYTE\_ARRAY \+ DECIMAL | bytes \+ decimal | DecimalType(p,s) | DECIMAL(p,s) |
In Parquet and Avro, a decimal can be either a fixed-length byte array or variable-length bytes. How can we prevent the loss of this information? Maybe one of those cases has no use, but do we want to try to support it?
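For reference on how the logical-to-physical decimal choice is typically derived: Arrow's Decimal128 covers precision up to 38 digits (Decimal256 beyond that), and the Parquet spec defines the minimum FIXED_LEN_BYTE_ARRAY width needed for a given precision. This hypothetical sketch (plain Java, not Hudi or Arrow code) shows both derivations:

```java
// Hypothetical sketch: deriving physical widths from DECIMAL(p, s).
class DecimalWidths {
    // Arrow: Decimal128 holds up to 38 decimal digits; use Decimal256 beyond.
    static int arrowBitWidth(int precision) {
        return precision <= 38 ? 128 : 256;
    }

    // Parquet: minimum FIXED_LEN_BYTE_ARRAY length for a signed decimal of
    // the given precision, i.e. the smallest n with 2^(8n - 1) > 10^precision.
    static int parquetFixedLenBytes(int precision) {
        double bits = precision * (Math.log(10) / Math.log(2)) + 1;
        return (int) Math.ceil(bits / 8);
    }
}
```

Since the byte width is a pure function of the declared precision, a fixed-length physical layout loses no logical information; the open question in the comment above is whether the variable-length `bytes`/BYTE_ARRAY variants carry anything worth round-tripping.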
The canonical in-memory representation for all types will be based on the Apache Arrow specification. The main reasons for this are:
- Apache Arrow provides a standard in-memory format that eliminates the costly serialization and deserialization of data moving across system boundaries. This enables "zero-copy" data exchange, which radically reduces computational overhead and query latency.
- This helps us more easily achieve seamless data exchange with the ecosystem of Arrow-native tools.
@balaji-varadarajan-ai Do we need to consider which popular OSS data engines support these Arrow-native types, such as Spark, Flink, Trino, etc.?
Sorry about that, I think you cover Spark and Flink further down in the Interoperability Mapping section.
@balaji-varadarajan-ai I was recently investigating the new schema and schema evolution, which is the part of this RFC that is missing detail. I'm willing to take this part and raise a discussion/issue to make it clear.
The following table defines the canonical mapping from the proposed logical types to the types of key external systems.
| Logical Type | Apache Arrow Type | Apache Parquet Type (Physical \+ Logical) | Apache Avro Type | Apache Spark Type | Apache Flink Type |
| :---- | :---- | :---- | :---- | :---- | :---- |
Should we also consider the Hive Metastore type system here?
Change Logs

[HUDI-9730] RFC for new Hudi Type System

Impact

New Type System for Hudi

Risk level (write none, low medium or high below)

None

Documentation Update
Contributor's checklist