RFC-100: Lance File Format support in Hudi #14127

@rahil-c

Feature Description

What the feature achieves:
This feature enables Apache Hudi to use Lance as a storage file format, alongside the formats it already supports, such as Parquet and ORC. Hudi tables can then store and query multi-modal data, including structured, semi-structured, and unstructured data (e.g., embeddings, images, video), while still benefiting from Hudi's transactional, incremental, and metadata management layers.

Why this feature is needed:
Existing file formats like Parquet and ORC are optimized for tabular analytics, not for AI/ML workloads involving embeddings, tensors, or unstructured content.
Modern data platforms increasingly need to manage hybrid data, where text, image, and vector data coexist with traditional tabular features. Newer AI/ML-oriented formats such as Lance, Vortex, and Nimble have recently emerged to cater to these workloads.

User Experience

How users will use this feature:
For more details please reference the RFC: https://github.com/apache/hudi/pull/13924/files#diff-f05ae69c4f41edc32aabfbfc016a12ad1af72917314844f8ae52671234508c56R37
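
The linked RFC is the authoritative description of the user-facing interface. Purely as an illustration of what "Lance as another base file format" could look like from a writer's point of view, here is a minimal sketch that builds a set of Hudi write options. The option keys are existing Hudi configs; the value `LANCE`, the helper `hudi_write_options`, and the table and column names are assumptions made for this example and are not confirmed by this issue.

```python
# Base file formats named in this issue as already supported by Hudi.
EXISTING_FORMATS = {"PARQUET", "ORC"}
# Assumption: the value the proposed feature might accept. Not confirmed here.
PROPOSED_FORMATS = {"LANCE"}


def hudi_write_options(table_name, record_key, base_file_format="PARQUET"):
    """Build a dict of Hudi write options (hypothetical helper for illustration)."""
    fmt = base_file_format.upper()
    if fmt not in EXISTING_FORMATS | PROPOSED_FORMATS:
        raise ValueError(f"unsupported base file format: {base_file_format}")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        # Existing table config key; "LANCE" as a value is the assumption.
        "hoodie.table.base.file.format": fmt,
    }


# Hypothetical table holding embeddings keyed by an "id" column.
options = hudi_write_options("image_embeddings", "id", base_file_format="lance")
```

With PySpark, a dict like this would typically be passed along the lines of `df.write.format("hudi").options(**options).mode("append").save(path)`; whether the final design selects Lance through this config or another mechanism is for the RFC to define.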

Hudi RFC Requirements

RFC PR link: (if applicable)
https://github.com/apache/hudi/pull/13924/files#diff-f05ae69c4f41edc32aabfbfc016a12ad1af72917314844f8ae52671234508c56R37

Why RFC is/isn't needed:

  • Does this change public interfaces/APIs? (Yes/No) Yes
  • Does this change storage format? (Yes/No) Yes
  • Justification:

We will change the storage format incrementally. This work depends on two prerequisite RFCs: RFC-99, which introduces a new type system, and RFC-80, which introduces the notion of a Column Group in Hudi.

GitHub discussion: #14128 (comment)

Labels

type:feature (New features and enhancements)
