
[HUDI-9730] RFC-99 Hudi Type System #13743

Open
bvaradar wants to merge 1 commit into apache:master from bvaradar:rfc_type_system

Conversation

@bvaradar
Contributor

Change Logs

[HUDI-9730] RFC for new Hudi Type System

Impact

New Type System for Hudi

Risk level (write none, low, medium or high below)

None

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instructions to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@bvaradar bvaradar changed the title [HUDI-9730] RFC for new Hudi Type System [HUDI-9730] RFC-99 Hudi Type System Aug 20, 2025
@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Aug 20, 2025

| Logical Type | Description | Parameters |
| :---- | :---- | :---- |
| STRUCT\<name: type, ...\> | An ordered collection of named fields. | Field list |
Contributor

In the SQL standard this is the record type — should we follow that?

Contributor Author

Looks like it is called the ROW type. I don't think exactly matching one standard has to be a goal, since users will be interacting with the translated types specific to the query systems they are using.

| STRUCT\<name: type, ...\> | An ordered collection of named fields. | Field list |
| LIST\<element\_type\> | An ordered list of elements of the same type. | Element type |
| MAP\<key\_type, value\_type\> | A collection of key-value pairs. Keys must be unique. | Key, Value types |
| UNION\<type1, type2, ...\> | A value that can be one of several specified types. | Type list |
Contributor

Is UNION in the SQL standard?

Contributor Author

I don't think so. Kindly see the comment above — I am not taking standard conformance as an objective.


| Logical Type | Description | Parameters |
| :---- | :---- | :---- |
| DICTIONARY\<K, V\> | A dictionary-encoded type for low-cardinality columns to improve performance and reduce storage. K is an integer index type, V is the value type. | K: Index Type, V: Value Type |
Collaborator

Isn't dictionary more of a physical encoding than a logical type?

Contributor Author

It is actually a logical type, as defined in systems like Apache Arrow, so that users can interact with it directly.

| VARIANT | DenseUnion or LargeBinary | BYTE\_ARRAY \+ JSON | string or union | VariantType | JSON |


## Implementation
Collaborator

Is Expression or pushdown stuff based on new type system included in this RFC?

Contributor Author

The main goal of this RFC is to lay out the types and the rationale for them. As a first step, we want to use this type system as the internal format replacing Avro. Direct integration with query engine types would include push-down evaluation and other optimizations.


A dedicated Hudi core module, "hudi-core-type", will define the above types. The translation layers to and from other type systems such as Avro, Spark, Flink, and Parquet will reside in their own separate modules to keep dependencies clean.

The table schema itself will need to be tracked in the metadata table.
Collaborator

This needs more elaboration: specify how schemas are persisted (format, versioning) in the metadata table.


## Design

The canonical in-memory representation for all types will be based on the Apache Arrow specification. The main reasons for this are:
Collaborator

Generally, a type system should primarily be a logical abstraction. Can we make it orthogonal to physical/implementation choices, e.g., zero-copy, multi-modal, etc.?

Contributor Author

The information about zero-copy and multi-modal use is there to motivate aligning the type-system specification with Arrow. The initial implementation will focus on implementing the specification, not on the optimizations.

@hudi-bot
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@vinothchandar vinothchandar added the rfc Request for comments label Aug 28, 2025
@vinothchandar vinothchandar self-assigned this Sep 4, 2025
Contributor

@yihua yihua left a comment

cc @jonvex we'll also review this RFC.

#13711 for Hudi 1.1 introduced an interim class ValueType for supporting logical types in the column stats index before this new type system is ready. The new type system should also consider this use case. See #13711 (comment) for more details.

| FLOAT16 | Float16 | FLOAT (promoted) | float (promoted) | FloatType (promoted) | FLOAT (promoted) |
| FLOAT | Float32 | FLOAT | float | FloatType | FLOAT |
| DOUBLE | Float64 | DOUBLE | double | DoubleType | DOUBLE |
| DECIMAL(p,s) | Decimal128(p,s) or Decimal256(p,s) | FIXED\_LEN\_BYTE\_ARRAY \+ DECIMAL | bytes \+ decimal | DecimalType(p,s) | DECIMAL(p,s) |
Contributor

In Parquet and Avro, a decimal can be either a fixed-length byte array or variable-length bytes. How can we prevent the loss of this information? Maybe one of those cases has no use, but do we want to try to support it?

The canonical in-memory representation for all types will be based on the Apache Arrow specification. The main reasons for this are:

- Apache Arrow provides a standard in-memory format that eliminates the costly process of data serialization and deserialization when moving data across system boundaries. This enables "zero-copy" data exchange, which radically reduces computational overhead and query latency.
- This helps us more easily achieve seamless data exchange with the ecosystem of Arrow-native tools.
Collaborator

@rahil-c rahil-c Oct 7, 2025

@balaji-varadarajan-ai Do we need to consider which popular oss data engines support these arrow native types, such as spark, flink, trino etc?

Collaborator

Sorry about that; I think you cover Spark and Flink further in the Interoperability Mapping section.

@cshuo
Collaborator

cshuo commented Nov 13, 2025

@balaji-varadarajan-ai I was investigating the new schema and schema evolution recently, which is the missing detail in this RFC, including:

  1. The API needed for the new schema and schema evolution, e.g., schema-id
  2. Where to store the new schema
    • MDT: key-value format details, a single commit or all historical schemas
    • MDT disabled: store in a separate meta directory
  3. When the schema should be stored
    • on each commit, or only on schema changes

I'm willing to take this part and raise a discussion/issue to make it clear.


The following table defines the canonical mapping from the proposed logical types to the types of key external systems.

| Logical Type | Apache Arrow Type | Apache Parquet Type (Physical \+ Logical) | Apache Avro Type | Apache Spark Type | Apache Flink Type |
Contributor

Should we also consider the Hive Metastore type system here?

@hudi-bot hudi-bot mentioned this pull request Dec 9, 2025

Labels

rfc Request for comments size:M PR with lines of changes in (100, 300]


9 participants