Going beyond millions of fragments #5947

jackye1995 · 2026-02-12T20:02:59Z

jackye1995
Feb 12, 2026
Maintainer

Last year, I posted a discussion thread about scaling the Lance manifest beyond a flat list of fragments: #4000. In that discussion my main points were:

1. A flat list of fragments should be sufficient. Delta has a very similar setup and can scale to petabyte-size tables.
2. We need a benchmark to know exactly where our breaking point is, i.e. at how many fragments read/write performance becomes really bad.
3. If we really want to move to a more complex structure, we can probably use a B-tree structure with a write buffer, like a B-epsilon tree, to fully solve this problem.

I think we definitely still want to learn more about 2, but for 1 and 3, things have changed a bit. Most notably, with Lance getting more and more popular in foundation AI labs, we are seeing very large Lance table users trying to build tables with over 100 trillion rows (EB-level size) to train LLMs with basically as much data as possible.

At this moment, we have 1M rows per fragment by default, and even if we raise that default, the manifest would end up with millions of fragments, which would make both read and write performance suffer.

My initial thought was that we should just implement point 3, evolving Lance to a multi-level manifest structure similar to Iceberg (but better). We had a discussion about this in the community sync, and @westonpace brought up an invaluable point: at this scale, would it be better to just use multiple Lance tables?

I personally really like this proposal, because:

it keeps the table format simple
with a table at that size, the vector index also reaches a mathematical limit for recall performance. Using multiple tables makes the vector indexes local to each table and greatly improves recall
we are already working on the partitioned namespace feature and it's making a lot of progress (see the latest spec here: https://lance.org/format/namespace/partitioning-spec/)

The key question becomes: is there any specific feature that users would gain by extending the table format to a multi-level manifest, versus the partitioned namespace approach? Curious what people think!

I will try to produce some benchmarks for 2 over the weekend so we can discuss things more concretely.

jackye1995 · 2026-02-16T23:11:00Z

jackye1995
Feb 16, 2026
Maintainer Author

Sharing some details of my benchmark investigation. I ran it with S3 Standard and S3 Express.

I ran 5000 sequential commits. For every commit, I:

write the fragment (10 rows)
measure the time to commit the fragment (I set auto cleanup to false to avoid measuring cleanup time)
measure the time to do a fresh dataset load without any inherited cache
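The per-commit procedure above can be sketched as a small timing harness. This is only an illustration, not the script used for the benchmark: `write_fragment`, `commit_fragment`, and `fresh_load` are hypothetical callables that the caller would wire to the actual Lance operations (writing 10 rows, committing with auto cleanup disabled, and opening the dataset with no inherited cache).

```python
import time
from typing import Callable, Dict, List


def timed(fn: Callable[[], object]) -> float:
    """Return the wall-clock seconds taken by one call to fn."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def run_benchmark(
    num_commits: int,
    write_fragment: Callable[[int], object],
    commit_fragment: Callable[[object], None],
    fresh_load: Callable[[], object],
) -> Dict[str, List[float]]:
    """Run sequential commits, timing the commit and the fresh load separately."""
    results: Dict[str, List[float]] = {"commit_s": [], "load_s": []}
    for i in range(num_commits):
        # Step 1: write the fragment data. This step is not timed.
        fragment = write_fragment(i)
        # Step 2: time only the commit of that fragment.
        results["commit_s"].append(timed(lambda: commit_fragment(fragment)))
        # Step 3: time a dataset load that shares no state with earlier loads.
        results["load_s"].append(timed(fresh_load))
    return results
```

Keeping the write outside the timed region matches the description above, where only the commit and the fresh load are measured.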

The result is quite interesting:

Overall, S3 Express performance is much more stable than S3 Standard, but it is not really better than S3 Standard. For commit, S3 Express started out much faster but ended up slower than S3 Standard. For load, S3 Standard is always faster than S3 Express.

I dug further into the actual S3 latency by replaying only the S3 read and write operations. For each file in the _versions directory, from oldest to newest, I:

read the manifest file to measure GET latency
write the content to a new _validate folder to measure PUT latency
do a listing to get all objects in _validate to measure LIST latency
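A replay along these lines could look like the sketch below. It assumes `client` behaves like a boto3 S3 client (`get_object`, `put_object`, and the `list_objects_v2` paginator); the bucket name, the dataset `prefix`, and placing `_validate` directly under that prefix are assumptions made for illustration.

```python
import time
from typing import Dict, List


def replay_manifests(client, bucket: str, prefix: str, keys: List[str]) -> Dict[str, List[float]]:
    """Replay GET, PUT and LIST once per manifest key, in the order given.

    `keys` should be the objects under `<prefix>/_versions/`, oldest first.
    """
    latencies: Dict[str, List[float]] = {"get_s": [], "put_s": [], "list_s": []}
    validate_prefix = f"{prefix}/_validate/"  # assumed location of the scratch folder
    for key in keys:
        # GET: read the manifest file.
        start = time.perf_counter()
        body = client.get_object(Bucket=bucket, Key=key)["Body"].read()
        latencies["get_s"].append(time.perf_counter() - start)

        # PUT: write the same bytes into the scratch folder.
        name = key.rsplit("/", 1)[-1]
        start = time.perf_counter()
        client.put_object(Bucket=bucket, Key=validate_prefix + name, Body=body)
        latencies["put_s"].append(time.perf_counter() - start)

        # LIST: enumerate every object written so far, across all pages.
        start = time.perf_counter()
        paginator = client.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=validate_prefix):
            page.get("Contents", [])
        latencies["list_s"].append(time.perf_counter() - start)
    return latencies
```

Because the LIST step walks every page, its cost grows with the number of objects already in the scratch folder, which is the effect being measured.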

This is what I got:

So S3 Express is much better for GET and PUT, but not significantly better for LIST. Because S3 Standard returns keys in lexicographical order but S3 Express does not, our LIST latency stays around 60-90ms on S3 Standard, where we only need to list a single page, while on S3 Express it grows linearly because we have to list every item in the directory. (The graph is a bit misleading because I also measured S3 Standard listing all items, not just the first page.) That explains the latency we observed initially.
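To illustrate why ordering matters, here is a minimal sketch. It assumes the manifest naming scheme V2 (mentioned later in this thread) names each manifest as the zero-padded value of `u64::MAX - version`, so that newer versions sort first; that layout is stated here as an assumption, and the helper names are hypothetical.

```python
from typing import List

U64_MAX = 2**64 - 1


def v2_manifest_name(version: int) -> str:
    """Name a manifest so that newer versions sort first (assumed V2 layout)."""
    return f"{U64_MAX - version:020d}.manifest"


def version_from_name(name: str) -> int:
    """Invert v2_manifest_name."""
    return U64_MAX - int(name.split(".", 1)[0])


def latest_from_sorted_listing(first_page: List[str]) -> int:
    """With lexicographic ordering, the first key of the first page is the latest."""
    return version_from_name(first_page[0])


def latest_from_unsorted_listing(all_names: List[str]) -> int:
    """Without an ordering guarantee, every key must be listed and compared."""
    return max(version_from_name(name) for name in all_names)
```

With a sorted listing the work is constant in the number of versions, while the unsorted case has to touch every name, which matches the linear growth described above.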

In terms of size, the manifest grew from 593 B to 391 KB by the end, consistently adding about 80 bytes per fragment.

To give a more concrete sense:

13K fragments would result in a manifest of around 1MB; this can hold 13 billion rows by default.
1M fragments would result in a manifest of around 76MB; this would be a table with 1 trillion rows.
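The arithmetic behind these numbers can be reproduced with a few lines. This sketch takes the 80 bytes per fragment and the 1M rows per fragment default from this thread, ignores the small fixed overhead of an empty manifest, and treats "MB" as MiB, which is what makes 1M fragments come out near 76.

```python
BYTES_PER_FRAGMENT = 80        # growth per fragment observed in the benchmark
ROWS_PER_FRAGMENT = 1_000_000  # default rows per fragment


def manifest_size_bytes(num_fragments: int) -> int:
    """Approximate manifest size, ignoring the fixed base overhead."""
    return num_fragments * BYTES_PER_FRAGMENT


def rows_covered(num_fragments: int) -> int:
    """Rows a table can hold with this many default-sized fragments."""
    return num_fragments * ROWS_PER_FRAGMENT


def fragments_for_rows(num_rows: int) -> int:
    """Fragments needed for num_rows, rounding up."""
    return -(-num_rows // ROWS_PER_FRAGMENT)


def to_mib(num_bytes: int) -> float:
    return num_bytes / 2**20


if __name__ == "__main__":
    for fragments in (13_000, 1_000_000, fragments_for_rows(100 * 10**12)):
        size = to_mib(manifest_size_bytes(fragments))
        print(f"{fragments:>11,} fragments -> {size:,.1f} MiB, {rows_covered(fragments):,} rows")
```

Under the same assumptions, the 100 trillion row case mentioned at the top of the thread needs 100 million fragments, which puts the flat manifest in the multi-gigabyte range.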

My Takeaways

The manifest version hint + HEAD + parallel listing approach is definitely worth doing on S3 Express; I need to finish that work 😆
I think this validates that for petabyte/billions-of-rows scale, the Lance manifest with a flat list of fragments is good enough. We could also get some perf numbers on similar systems like Delta/Iceberg, but I think Lance would perform better in most cases with the same setup.
There is an interesting jump in latency for the manifest load operation on S3 Standard. I think that could be due to the increased size causing additional data read requests in S3. If we want to keep manifest read performance consistently below that threshold, a multi-level manifest approach might be attractive. Otherwise, using multiple tables for larger tables seems like a simpler way to go.

2 replies

majin1102 Feb 24, 2026
Collaborator

Great job on this. A small question.

> measure the time to do a fresh dataset load without any inherited cache

Are these tests based on ManifestNamingScheme::V2? Do we anticipate a performance improvement from using hints in place of lexicographical listing?

jackye1995 Feb 24, 2026
Maintainer Author

Yes, all using manifest naming scheme V2. See my next comment thread, which shows the perf improvement.

jackye1995 · 2026-02-24T07:46:58Z

jackye1995
Feb 24, 2026
Maintainer Author

I have implemented the version hint + HEAD + parallel listing approach: #5997. Here are more detailed results (this time I ran it to 20000 commits, and the trend is even clearer).

Some terms:

JSON sync: use a JSON format hint file + synchronously write the hint file after committing the manifest file
JSON async: use a JSON format hint file + fire-and-forget writing the hint file
file_size (async): instead of writing a JSON hint file, we write a file with size equal to the version, so that we only need to do a HEAD to get the version hint instead of reading the file.
no hint: baseline of existing behavior
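The difference between the JSON hint and the file_size hint can be sketched against a toy in-memory object store. This is not the code from #5997: the store class, the hint key names, and the JSON field name are all hypothetical, and only the idea (read the version with a GET of a JSON body versus a HEAD that returns the object size) comes from the definitions above.

```python
import json
from typing import Dict


class InMemoryStore:
    """Toy stand-in for an object store that counts GET and HEAD requests."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}
        self.gets = 0
        self.heads = 0

    def put(self, key: str, body: bytes) -> None:
        self.objects[key] = body

    def get(self, key: str) -> bytes:
        self.gets += 1
        return self.objects[key]

    def head(self, key: str) -> int:
        """Return only the object size, as a HEAD request would."""
        self.heads += 1
        return len(self.objects[key])


JSON_HINT_KEY = "_versions/hint.json"  # hypothetical key name
SIZE_HINT_KEY = "_versions/hint.size"  # hypothetical key name


def write_json_hint(store: InMemoryStore, version: int) -> None:
    store.put(JSON_HINT_KEY, json.dumps({"version": version}).encode())


def read_json_hint(store: InMemoryStore) -> int:
    # Needs a GET because the version lives in the body.
    return json.loads(store.get(JSON_HINT_KEY))["version"]


def write_size_hint(store: InMemoryStore, version: int) -> None:
    # The body content is irrelevant; only its length encodes the version.
    store.put(SIZE_HINT_KEY, b"\0" * version)


def read_size_hint(store: InMemoryStore) -> int:
    # Needs only a HEAD because the version is the object size.
    return store.head(SIZE_HINT_KEY)
```

One trade-off visible in the sketch is that the file_size hint grows by one byte per version, while the JSON hint stays a few dozen bytes but requires reading the body.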

Summary

These are averages over 20000 runs.

S3 Express All Configurations

We can clearly see that S3 Express requires the version hint to work efficiently.

S3 Standard All Configurations

Write is not too different, but load performance is improved.

S3 Express vs S3 Standard (JSON hint async)

JSON vs file_size format

S3 Express:

S3 Standard:

The file_size format shows an improvement of 1ms/2ms and 8ms/6ms. I am actually not sure there is a benefit in using this format: the improvement is there, but it is small. Not sure if this is worth the complexity, curious what others think!

Async vs Sync Commit

S3 Express:

S3 Standard:

Async commit is definitely faster (10ms on average), but it could result in worse load performance because the hint may be out of date.
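The cost of an out-of-date hint can be illustrated with a small sketch. It assumes a reader recovers by probing forward from the hinted version until the next version does not exist; this is one plausible recovery strategy and not necessarily what #5997 does. The `committed` set stands in for existence checks (for example HEAD requests) against the object store.

```python
from typing import Set, Tuple


def resolve_latest(hint_version: int, committed: Set[int]) -> Tuple[int, int]:
    """Walk forward from a possibly stale hint.

    Returns (latest_version, number_of_existence_probes).
    """
    probes = 0
    version = hint_version
    while True:
        probes += 1  # one existence check for version + 1
        if (version + 1) not in committed:
            return version, probes
        version += 1


if __name__ == "__main__":
    committed = set(range(1, 11))  # versions 1..10 exist
    print(resolve_latest(10, committed))  # fresh hint
    print(resolve_latest(7, committed))   # hint lags by three versions
```

In this model a fresh hint costs a single probe, while each version the hint lags behind adds one more, which is the load-time penalty a fire-and-forget hint write can introduce.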

0 replies
