perf!: improve conflict resolution performance by wjones127 · Pull Request #3882 · lance-format/lance

wjones127 · 2025-05-26T21:46:40Z

BREAKING CHANGE: signature of CommitHandler::list_manifest_locations() has a new parameter sorted_descending. It is handled in the default implementation, so no changes are needed unless that method is overridden. Also, ObjectStore::read_dir_all() is now synchronous and returns errors as part of the stream.

This PR optimizes the conflict resolution IO pattern.

Eliminated at least 3 IO hops, down to 5 when there are conflicts and no cache.
Reduced total IO requests by more than O(num_transactions)

wjones127 · 2025-05-27T21:25:44Z

java/core/src/test/java/com/lancedb/lance/DatasetTest.java

-  @Test
-  void testCommitConflict() {
-    String testMethodName = new Object() {}.getClass().getEnclosingMethod().getName();
-    String datasetPath = tempDir.resolve(testMethodName).toString();
-    try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE)) {
-      TestUtils.SimpleTestDataset testDataset =
-          new TestUtils.SimpleTestDataset(allocator, datasetPath);
-      try (Dataset dataset = testDataset.createEmptyDataset()) {
-        assertEquals(1, dataset.version());
-        assertEquals(1, dataset.latestVersion());
-        assertThrows(
-            IllegalArgumentException.class,
-            () -> {
-              testDataset.write(0, 5);
-            });
-      }
-    }
-  }
-


This test didn't make any sense. Append shouldn't generate a commit conflict, almost ever. We might want to add tests back again later with a proper commit conflict.

codecov-commenter · 2025-05-28T01:39:38Z

Codecov Report

Attention: Patch coverage is 91.82879% with 21 lines in your changes missing coverage. Please review.

Project coverage is 78.41%. Comparing base (6ac436e) to head (a7c12d4).
Report is 3 commits behind head on main.

Files with missing lines	Patch %	Lines
rust/lance/src/dataset.rs	88.88%	7 Missing and 6 partials ⚠️
rust/lance-io/src/object_store.rs	82.60%	3 Missing and 1 partial ⚠️
rust/lance-table/src/io/commit.rs	95.00%	2 Missing and 1 partial ⚠️
rust/lance/src/io/commit.rs	96.15%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3882      +/-   ##
==========================================
+ Coverage   78.34%   78.41%   +0.07%     
==========================================
  Files         280      280              
  Lines      107283   107680     +397     
  Branches   107283   107680     +397     
==========================================
+ Hits        84047    84436     +389     
- Misses      19912    19917       +5     
- Partials     3324     3327       +3

Flag	Coverage Δ
unittests	`78.41% <91.82%> (+0.07%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

jackye1995 · 2025-05-29T01:58:05Z

rust/lance-io/src/object_store/providers/memory.rs

            use_constant_size_upload_parts: false,
            list_is_lexically_ordered: true,
-            io_parallelism: get_num_compute_intensive_cpus(),
+            io_parallelism: DEFAULT_CLOUD_IO_PARALLELISM,


Why do we want to change this behavior? The original one seems more reasonable for memory store

This one was surprising to me as well. But because CI runners only have like 4 CPUs, this was limiting the number of parallel requests that could be in flight. It was making tests about num_hops fail. This is really only a problem when you are combining memory store with ThrottledStore.

jackye1995 · 2025-05-29T02:21:59Z

rust/lance/src/dataset.rs

+        .commit_handler
+        .list_manifest_locations(&dataset.base, dataset.object_store(), true)
+        .try_take_while(move |location| {
+            futures::future::ready(Ok(location.version > latest_version))


maybe in future iterations, technically we can also push this down so that the list operation can just list manifests that are after the latest_version and avoid listing all manifests in the version directory.

The return value of list_manifest_locations is a stream, so it should be efficient when it can be (list operation has lexical order and we are using V2 manifest paths). But often it will be unavoidable to list all manifests.

Though thinking about it, it might be worthwhile doing a fast path at least for local filesystem. The existing implementation finds the latest version (which has a fast local implementation) and then opens all the versions in between. That's probably what we should do for local.

Would it be possible to leverage the start-after header in S3 when possible so we don't need to list
anything before the latest version?
https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html#API_ListObjectsV2_RequestSyntax

I'm not worried about S3. If you are using V2 manifest paths, then the lexical order is descending (or latest first). So it's just a matter of stopping early to not have to list anything before the current version.

Wrote a follow up ticket here: #3904

jackye1995

looks good to me, thanks for the fix!

#3882 introduced the improvement to list transactions in descending order, and we use the output list to rebase an ongoing transaction. This is logically incorrect because we should rebase in ascending order from oldest transaction to latest transaction. I think currently there is no correctness issue because for merge-insert we do not handle upserts of the same row, and for FRI related operations the handling always sort the related fragment reuse versions internally. But it's still better to fix this logical error which might impact conflict resolution cases we might handle in the future.

github-actions bot added the performance label May 26, 2025

wjones127 changed the title ~~perf: improve conflict resolution performance~~ perf!: improve conflict resolution performance May 27, 2025

github-actions bot added breaking-change python java labels May 27, 2025

wjones127 added 6 commits May 27, 2025 14:21

set backoff unit

d11e006

wip

c786d4c

wip

3f004ef

get tests passing

6e7cfa1

cleanup

117498f

breaking change

a608dd3

wjones127 force-pushed the perf/commit-loop branch from 3c055a4 to a608dd3 Compare May 27, 2025 21:24

wjones127 commented May 27, 2025

View reviewed changes

wjones127 added 2 commits May 27, 2025 15:23

try this

9fd76a3

fix io parallelism of memory store

f5619c9

wjones127 marked this pull request as ready for review May 28, 2025 01:27

try to fix windows

a7c12d4

jackye1995 reviewed May 29, 2025

View reviewed changes

jackye1995 approved these changes May 29, 2025

View reviewed changes

jackye1995 mentioned this pull request May 29, 2025

feat!: introduce fragment reuse index to defer compaction index remap #3847

Merged

wjones127 mentioned this pull request May 29, 2025

Optimize local fs listing of transactions #3904

Open

wjones127 merged commit bf2f72e into lance-format:main May 29, 2025
27 checks passed

wjones127 deleted the perf/commit-loop branch May 29, 2025 19:08

jackye1995 mentioned this pull request Jun 8, 2025

fix: rebase transactions from oldest to latest #3955

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf!: improve conflict resolution performance#3882

perf!: improve conflict resolution performance#3882
wjones127 merged 9 commits intolance-format:mainfrom
wjones127:perf/commit-loop

wjones127 commented May 26, 2025 •

edited

Loading

Uh oh!

wjones127 May 27, 2025

Uh oh!

codecov-commenter commented May 28, 2025 •

edited

Loading

Uh oh!

jackye1995 May 29, 2025

Uh oh!

wjones127 May 29, 2025

Uh oh!

jackye1995 May 29, 2025

Uh oh!

wjones127 May 29, 2025

Uh oh!

jackye1995 May 29, 2025

Uh oh!

wjones127 May 29, 2025

Uh oh!

wjones127 May 29, 2025

Uh oh!

jackye1995 left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

wjones127 commented May 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

codecov-commenter commented May 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jackye1995 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

wjones127 commented May 26, 2025 •

edited

Loading

codecov-commenter commented May 28, 2025 •

edited

Loading