Skip to content

Preserve non-null comparison column values during segment commit#11027

Closed
KKcorps wants to merge 1 commit intoapache:masterfrom
KKcorps:partial_upsert_multicol_null_patch
Closed

Preserve non-null comparison column values during segment commit#11027
KKcorps wants to merge 1 commit intoapache:masterfrom
KKcorps:partial_upsert_multicol_null_patch

Conversation

@KKcorps
Copy link
Contributor

@KKcorps KKcorps commented Jul 4, 2023

This PR completes the bug fix introduced with #10704

With the fix introduced previously, although we merge the comparison column values, we still mark the original columns as null in the bitmap so that it can help during restart.

However, during the segment build, we simply filter out all the columns which have null set in the bitmap and thus we lose all these values.

Reproducing the bug

  • create a partital upsert table with two comparison columns mtime and mtime_2

  • publish the following two records to the table

{ "rsvp_count": 23, "venue_name": "Venue A", "event_id": "E12345", "event_time": 1688372645422, "group_city": "San Francisco", "group_country": "USA", "group_id": 1234567890, "group_name": "OpenAI enthusiasts", "group_lat": 37.7749, "group_lon": -122.4194, "mtime_2": 1687341314322 }


{ "rsvp_count": 23, "venue_name": "Venue A", "event_id": "E12345", "event_time": 1688372645422, "group_city": "San Francisco", "group_country": "USA", "group_id": 1234567890, "group_name": "OpenAI enthusiasts", "group_lat": 37.7749, "group_lon": -122.4194, "mtime": 1687341314100 }
  • verify the records are showing up correctly in the table. Then do either reload or force commit.

  • You will see that no records show up in the table now. If you use skipUpsert(true) though, you will be able to see everything

@KKcorps KKcorps requested a review from Jackie-Jiang July 4, 2023 08:29
@codecov-commenter
Copy link

codecov-commenter commented Jul 4, 2023

Codecov Report

Merging #11027 (8a5299f) into master (0821e3b) will decrease coverage by 0.01%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master   #11027      +/-   ##
==========================================
- Coverage    0.11%    0.11%   -0.01%     
==========================================
  Files        2197     2199       +2     
  Lines      118596   118744     +148     
  Branches    17980    18017      +37     
==========================================
  Hits          137      137              
- Misses     118439   118587     +148     
  Partials       20       20              
Flag Coverage Δ
integration1temurin11 ?
integration1temurin17 0.00% <0.00%> (ø)
integration1temurin20 0.00% <0.00%> (ø)
integration2temurin11 ?
integration2temurin17 0.00% <0.00%> (ø)
integration2temurin20 0.00% <0.00%> (ø)
unittests1temurin11 ?
unittests1temurin17 ?
unittests1temurin20 ?
unittests2temurin11 0.11% <0.00%> (-0.01%) ⬇️
unittests2temurin17 0.11% <0.00%> (-0.01%) ⬇️
unittests2temurin20 0.11% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...local/indexsegment/mutable/MutableSegmentImpl.java 0.00% <0.00%> (ø)
...ocal/segment/readers/PinotSegmentRecordReader.java 0.00% <0.00%> (ø)
...ava/org/apache/pinot/segment/spi/IndexSegment.java 0.00% <0.00%> (ø)

... and 5 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@Jackie-Jiang
Copy link
Contributor

I don't follow the special handling of comparison column (still treating them as null value) in #10704, and I feel we should just remove that special handling. Handling it in record reader is a very hacky solution.
cc @egalpin for more context on that special handling

@egalpin
Copy link
Member

egalpin commented Jul 5, 2023

The motivation behind wanting to keep nullness encoded is being able to easily discern between a defaultValue representation of null Vs. an actual null so that comparing 2 defaultValues would not result in a comparison result of 0 therefore triggering upsert.

I don't feel that this is a required "guard" anymore though; it may have been needed at one point but the various algorithms for multiple comparison column upsert have changed a lot throughout implementation. We guard against any newly ingested record having all null comparison columns[1], so by the time we reach compareTo the values being compared would "at worst" be <defaultValue> of the previously ingested record and the non-null value (i.e. non-defaultValue) of the newly ingested record. Such a comparison would have the same result as the current implementation (current implementation checks if the previous record's value for the same comparison index is null and if so accepts the new record as a valid upsert).

All that said though, if we don't encode nullness then we lose the ability to perform valuable queries like "show me all records that have null comparisonColumnX" (i.e. "show me all records which have never had data written by a producer that uses comparisonColumnX).

[1] https://github.com/egalpin/pinot/blob/68fdfa4a12926c7ce45dadc6a86d973ce5ff3669/pinot-segment-local/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java#L597 ( -> this method has move, I think as of #10703)

@Jackie-Jiang
Copy link
Contributor

Picked the solution in #11044

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants