
Conversation

@zhangyue19921010 (Contributor)

Background

Currently, when Spark writes data into a Lance table using the MERGE INTO syntax, it shuffles the data with segment_id as the shuffle key and writes concurrently.
During the join between the source data and the Lance target table, the source rows are split into three categories: insert, update, and delete rows. For insert rows, the segment_id field in the join's intermediate result is null, so the shuffle hashes on a null key and sends every insert row to a single write task, causing data skew.
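
A minimal sketch of the skew mechanism (plain Java, not Lance or Spark code; the partitioner below is hypothetical): because every null key produces the same hash value, all null-keyed rows land in the same partition. Spark's actual hash function differs, but the effect on insert rows is the same.

```java
import java.util.*;

public class NullKeySkewDemo {

    // Hypothetical hash partitioner over a nullable segment id.
    static int partitionFor(Integer segmentId, int numPartitions) {
        // Objects.hashCode(null) == 0, so every null key maps to the same partition.
        return Math.floorMod(Objects.hashCode(segmentId), numPartitions);
    }

    public static void main(String[] args) {
        // null stands in for the segment_id of insert rows after the MERGE join.
        List<Integer> keys = Arrays.asList(1, 2, null, null, null, 3, null);
        Map<Integer, Integer> counts = new TreeMap<>();
        for (Integer k : keys) {
            counts.merge(partitionFor(k, 4), 1, Integer::sum);
        }
        // Prints {0=4, 1=1, 2=1, 3=1}: all null-keyed (insert) rows pile into partition 0.
        System.out.println(counts);
    }
}
```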

Solution

Attempt to modify the requiredDistribution() method.

Then reconstruct the Distributions.clustered(new NamedReference[] {segmentId}) expression so that a random value is used as the clustering key when segment_id is null, spreading insert rows across write tasks (a sketch of the idea is shown below).
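
A hedged sketch of the idea, not the actual Lance implementation: a DataSource V2 Write declares its required clustering via Spark's RequiresDistributionAndOrdering interface. The class name LanceMergeWrite, the column name segment_id, and the use of "coalesce"/"rand" as generic connector transform names are assumptions for illustration; whether Spark resolves such generic transforms depends on the Spark version, and the actual PR may build the null fallback differently.

```java
import org.apache.spark.sql.connector.distributions.Distribution;
import org.apache.spark.sql.connector.distributions.Distributions;
import org.apache.spark.sql.connector.expressions.Expression;
import org.apache.spark.sql.connector.expressions.Expressions;
import org.apache.spark.sql.connector.expressions.SortOrder;
import org.apache.spark.sql.connector.write.RequiresDistributionAndOrdering;
import org.apache.spark.sql.connector.write.Write;

// Hypothetical write class; Lance's real Write implementation may differ.
public class LanceMergeWrite implements Write, RequiresDistributionAndOrdering {

  @Override
  public Distribution requiredDistribution() {
    // Before: Distributions.clustered(new NamedReference[] {segmentId}),
    // which hashes all insert rows (segment_id == null) into one task.
    Expression segmentId = Expressions.column("segment_id");

    // After: fall back to a random value when segment_id is null, so insert
    // rows are spread across write tasks while update/delete rows still
    // cluster by their real segment_id. "rand" and "coalesce" are used here
    // as generic transform names purely for illustration.
    Expression randomFallback = Expressions.apply("rand");
    Expression clusterKey = Expressions.apply("coalesce", segmentId, randomFallback);

    return Distributions.clustered(new Expression[] {clusterKey});
  }

  @Override
  public SortOrder[] requiredOrdering() {
    // No ordering requirement in this sketch.
    return new SortOrder[0];
  }
}
```

The design point is that only null keys get randomized: rows that update or delete existing fragments keep clustering by segment_id, so each fragment is still rewritten by a single task.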

Before
[Screenshot 2026-01-12 15:02:13]

After
[Screenshot 2026-01-12 15:02:29]

@github-actions bot added the enhancement (New feature or request) label on Jan 12, 2026
@github-actions (bot)

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error, please inspect the "PR Title Check" action.

@fangbo (Contributor) commented on Jan 12, 2026

Great optimization!

@zhangyue19921010 changed the title from "feat: Support and Optimize spark merge into" to "feat: support and optimize Spark MERGE INTO" on Jan 12, 2026