New LSM file layout #14310

@danny0405

Feature Description

What the feature achieves:

  • fast streaming ingestion with minimum write amplification
  • more efficient queries and compaction (5x)
  • more efficient for point queries with fast data skipping (10x)
  • the leveled layout supports more flexible compaction strategies, e.g. minor compaction, which is friendlier to streaming
  • the sorted columnar files have a better compression ratio than Avro (10x)
  • more efficient integration with popular OLAP engines, most of which also have a native LSM-style storage backend, such as StarRocks and Doris
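
To make the leveled layout, minor compaction, and data skipping mentioned above concrete, here is a minimal in-memory sketch. It is not Hudi's implementation: `SortedRun`, `minor_compact`, and `point_lookup` are hypothetical names, records are assumed to be `(key, seq, value)` tuples where a higher `seq` means a newer write, and a real layout would store the runs as sorted columnar files rather than Python lists.

```python
import heapq
from bisect import bisect_left


class SortedRun:
    """Hypothetical stand-in for one sorted file at a given LSM level."""

    def __init__(self, level, records):
        # Keep only the newest (highest seq) record per key, sorted by key.
        latest = {}
        for key, seq, value in records:
            if key not in latest or seq > latest[key][0]:
                latest[key] = (seq, value)
        self.level = level
        self.keys = sorted(latest)
        self.rows = [(k, *latest[k]) for k in self.keys]

    def get(self, key):
        # Data skipping: the min/max key range rules out the whole run.
        if not self.keys or key < self.keys[0] or key > self.keys[-1]:
            return None
        # The run is sorted, so a binary search finds the key.
        i = bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.rows[i]
        return None


def minor_compact(runs, target_level):
    """Merge a few sorted runs into one run at target_level.

    A k-way merge reads each input sequentially; for equal keys the
    record with the highest seq comes first and the rest are dropped.
    """
    merged = heapq.merge(*(r.rows for r in runs),
                         key=lambda row: (row[0], -row[1]))
    out = []
    for row in merged:
        if not out or out[-1][0] != row[0]:
            out.append(row)
    return SortedRun(target_level, out)


def point_lookup(runs, key):
    """Return the newest value for key across runs, or None."""
    best = None
    for run in runs:
        hit = run.get(key)
        if hit is not None and (best is None or hit[1] > best[1]):
            best = hit
    return None if best is None else best[2]
```

In this sketch a minor compaction only touches the small runs passed to it, which is why it suits streaming ingestion: recent writes can be merged often without rewriting the larger, older runs.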

Why this feature is needed:
In 1.1, we put a lot of effort into improving streaming write and read performance with Flink. In analytic scenarios, however, many queries still need better efficiency to reach shorter end-to-end response times (SLA), even with enough CPU and memory configured, and the current base + delta merging does not scale well in this case. In industry, StarRocks and Doris both use LSM-style storage backends to serve OLAP queries. Even for OLTP there are practices such as MyRocks at Meta, which reported cost savings of almost half after migrating from B-tree to LSM.

At JD, there are also production deployments with strong performance numbers (TODO: add the benchmark and real production numbers for LSM with Hudi).

User Experience

How users will use this feature:
A new option can be specified to declare the layout type, either the current layout or lsm. There is no explicit API change for users.
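
As a rough illustration of how such an option might be resolved, here is a minimal sketch. The issue does not name the option, so the key `hoodie.table.layout.type` and the value `default` for the current layout are assumptions made purely for illustration; only the value `lsm` comes from the description above.

```python
# HYPOTHETICAL: the option key and the "default" value are assumptions,
# not names defined by Hudi or by this issue.
LAYOUT_OPTION = "hoodie.table.layout.type"
VALID_LAYOUTS = {"default", "lsm"}


def resolve_layout(options):
    """Return the layout type, falling back to the current layout."""
    layout = str(options.get(LAYOUT_OPTION, "default")).lower()
    if layout not in VALID_LAYOUTS:
        raise ValueError(
            f"{LAYOUT_OPTION} must be one of {sorted(VALID_LAYOUTS)}, got {layout!r}"
        )
    return layout


# Existing jobs that do not set the option keep the current layout.
existing_job = {"hoodie.table.name": "orders"}
lsm_job = {"hoodie.table.name": "orders", LAYOUT_OPTION: "lsm"}
```

Defaulting to the current layout when the option is absent matches the intent that users see no explicit API change.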

Hudi RFC Requirements

RFC PR link: (if applicable)

Todo
