Feature Description
What the feature achieves:
This feature enables Apache Hudi to use Lance as a storage file format, alongside the formats it already supports, such as Parquet and ORC. Hudi tables can then store and query multi-modal data, including structured, semi-structured, and unstructured data (e.g., embeddings, images, video), while still benefiting from Hudi's transactional, incremental, and metadata management layers.
Why this feature is needed:
Existing file formats like Parquet and ORC are optimized for tabular analytics, not AI/ML workloads involving embeddings, tensors, or unstructured content.
Modern data platforms increasingly need to manage hybrid data, where text, image, and vector data coexist with traditional tabular features. Newer AI/ML-oriented formats such as Lance, Vortex, and Nimble have recently emerged to serve these workloads.
User Experience
How users will use this feature:
For more details please reference the RFC: https://github.com/apache/hudi/pull/13924/files#diff-f05ae69c4f41edc32aabfbfc016a12ad1af72917314844f8ae52671234508c56R37
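As a rough illustration only (the RFC is the source of truth): one plausible shape for the user-facing side is selecting the base file format through Hudi's existing `hoodie.table.base.file.format` table config. The `LANCE` value, the helper name `hudi_write_options`, and the table and field names below are assumptions made for this sketch, not confirmed behavior.

```python
from typing import Dict


def hudi_write_options(
    table_name: str,
    record_key: str,
    ordering_field: str,
    base_file_format: str = "LANCE",  # assumption: not a confirmed Hudi value
) -> Dict[str, str]:
    """Build a Hudi datasource write-option map for a given base file format."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": ordering_field,
        "hoodie.datasource.write.operation": "upsert",
        # Existing table config; today it accepts values such as PARQUET or ORC.
        "hoodie.table.base.file.format": base_file_format.upper(),
    }


# Hypothetical table holding image embeddings next to tabular columns.
options = hudi_write_options("image_embeddings", "image_id", "updated_at")

# With a Spark DataFrame `df`, the options would be passed along the lines of:
#   df.write.format("hudi").options(**options).mode("append").save(base_path)
```

If the final design follows this shape, switching an ingestion job between Parquet and Lance would be a one-line config change; the actual option name and accepted values are defined by the RFC.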
Hudi RFC Requirements
RFC PR link: (if applicable)
https://github.com/apache/hudi/pull/13924/files#diff-f05ae69c4f41edc32aabfbfc016a12ad1af72917314844f8ae52671234508c56R37
Why RFC is/isn't needed:
- Does this change public interfaces/APIs? (Yes/No) Yes
- Does this change storage format? (Yes/No) Yes
- Justification:
We will change the storage format incrementally. There are two other prerequisite RFCs: RFC-99, which introduces a new type system, and RFC-80, which introduces the notion of a Column Group in Hudi.
GitHub discussion: #14128 (comment)