
[HUDI-9730] RFC-99 Hudi Type System #13743

Open
bvaradar wants to merge 1 commit into apache:master from bvaradar:rfc_type_system

Conversation

@bvaradar
Contributor

Change Logs

[HUDI-9730] RFC for new Hudi Type System

Impact

New Type System for Hudi

Risk level (write none, low, medium or high below)

None

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instructions to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@bvaradar bvaradar changed the title [HUDI-9730] RFC for new Hudi Type System [HUDI-9730] RFC-99 Hudi Type System Aug 20, 2025
@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Aug 20, 2025

| Logical Type | Description | Parameters |
| :---- | :---- | :---- |
| STRUCT\<name: type, ...\> | An ordered collection of named fields. | Field list |
Contributor

In the SQL standard this is the record type — should we follow that?

Contributor Author

Looks like it is called the ROW type. I don't think exactly matching one standard has to be a goal, since users will be interacting with the translated types specific to the query systems they are using.

| STRUCT\<name: type, ...\> | An ordered collection of named fields. | Field list |
| LIST\<element\_type\> | An ordered list of elements of the same type. | Element type |
| MAP\<key\_type, value\_type\> | A collection of key-value pairs. Keys must be unique. | Key, Value types |
| UNION\<type1, type2, ...\> | A value that can be one of several specified types. | Type list |
Contributor

Is UNION in the SQL standard?

Contributor Author

I don't think so. Kindly see the comment above — I am not taking standard conformance as an objective.


| Logical Type | Description | Parameters |
| :---- | :---- | :---- |
| DICTIONARY\<K, V\> | A dictionary-encoded type for low-cardinality columns to improve performance and reduce storage. K is an integer index type, V is the value type. | K: Index Type, V: Value Type |
Collaborator

Isn't dictionary more of a physical encoding than a logical type?

Contributor Author

It is actually a logical type, as defined in systems like Apache Arrow, so that users can interact with it directly.

| VARIANT | DenseUnion or LargeBinary | BYTE\_ARRAY \+ JSON | string or union | VariantType | JSON |


## Implementation
Collaborator

Is Expression or pushdown stuff based on new type system included in this RFC?

Contributor Author

The main goal of this RFC is to lay out the types and the rationale for them. As a first step, we want to use this type system as the internal format replacing Avro. Direct integration with query engine types would include push-down evaluation and other optimizations.


A dedicated Hudi core module, "hudi-core-type", will define the above types. The translation layers to and from other type systems such as Avro, Spark, Flink, and Parquet will reside in their own separate modules to keep dependencies clean.

The table schema itself will need to be tracked in the metadata table.
Collaborator

This needs more elaboration: specify how schemas are persisted (format, versioning) in the metadata table.


## Design

The canonical in-memory representation for all types will be based on the Apache Arrow specification. The main reasons for this are:
Collaborator

Generally, a type system should primarily be a logical abstraction. Can we make it orthogonal to physical/implementation choices, e.g., zero-copy, multi-modal, etc.?

Contributor Author

The information about zero-copy and multi-modal use is there to motivate aligning the type-system specification with Arrow. The initial implementation will focus on implementing the specification, not on the optimizations.

@hudi-bot
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@vinothchandar vinothchandar added the rfc Request for comments label Aug 28, 2025
@vinothchandar vinothchandar self-assigned this Sep 4, 2025
Contributor

@yihua yihua left a comment

cc @jonvex we'll also review this RFC.

#13711 for Hudi 1.1 introduced an interim class ValueType for supporting logical types in the column stats index before this new type system is ready. The new type system should also consider this use case. See #13711 (comment) for more details.

| FLOAT16 | Float16 | FLOAT (promoted) | float (promoted) | FloatType (promoted) | FLOAT (promoted) |
| FLOAT | Float32 | FLOAT | float | FloatType | FLOAT |
| DOUBLE | Float64 | DOUBLE | double | DoubleType | DOUBLE |
| DECIMAL(p,s) | Decimal128(p,s) or Decimal256(p,s) | FIXED\_LEN\_BYTE\_ARRAY \+ DECIMAL | bytes \+ decimal | DecimalType(p,s) | DECIMAL(p,s) |
Contributor

In Parquet and Avro, a decimal can be either a fixed-length byte array or variable-length bytes. How can we prevent the loss of this information? Maybe one of those cases has no use, but do we want to try to support it?

The canonical in-memory representation for all types will be based on the Apache Arrow specification. The main reasons for this are:

- Apache Arrow provides a standard in-memory format that eliminates the costly process of data serialization and deserialization when moving data across system boundaries. This enables "zero-copy" data exchange, which radically reduces computational overhead and query latency.
- This helps us more easily achieve seamless data exchange with the ecosystem of Arrow-native tools.
Collaborator

@rahil-c rahil-c Oct 7, 2025

@balaji-varadarajan-ai Do we need to consider which popular oss data engines support these arrow native types, such as spark, flink, trino etc?

Collaborator

Sorry about that; I think you cover Spark and Flink further in the Interoperability Mapping section.

@cshuo
Collaborator

cshuo commented Nov 13, 2025

@balaji-varadarajan-ai I was investigating the new schema and schema evolution recently, which is the missing detail in this RFC, including:

  1. The API needed for the new schema and schema evolution, e.g., schema-id
  2. Where to store the new schema
    • MDT: key-value format details, a single commit or all historical schemas
    • MDT disabled: store in a separate meta directory
  3. When the schema should be stored
    • on each commit, or only on schema changes

I'm willing to take this part and raise a discussion/issue to make it clear.


The following table defines the canonical mapping from the proposed logical types to the types of key external systems.

| Logical Type | Apache Arrow Type | Apache Parquet Type (Physical \+ Logical) | Apache Avro Type | Apache Spark Type | Apache Flink Type |
Contributor

Should we also consider the Hive Metastore type system here?

@hudi-bot hudi-bot mentioned this pull request Dec 9, 2025

Labels

rfc Request for comments size:M PR with lines of changes in (100, 300]


9 participants