Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve JSON_MATCH performance. #15049

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

bziobrowski
Copy link
Contributor

@bziobrowski bziobrowski commented Feb 12, 2025

PR improves performance and reduces temporary object generation in JSON_MATCH() function calls by:

  • reducing and/or delaying immutable to mutable conversion as much as possible
  • using immutable objects instead of allocating
  • for != and NOT IN expressions - switching from OR-ing all bitmaps except those for unwanted values to reading bitmap with all values for the key and then removing values present in bitmaps of unwanted values
  • comparing dict ids instead of string values
  • reusing byte[] buffer when reading from string dictionary

Running the following query on githubComplexTypeEvents table with 2.5mil records (same base data, but multiplied) and JSON index on payload column shows (in async profiler/JFR) :

  • response time reduction from ~ 500ms to about ~250ms
  • allocation reduction from 400MiB to about 20MiB .
SELECT
  COUNT(*) AS event_count,
  SUM(payload.size) AS event_size,
  SUM( CASE WHEN created_at_timestamp >= ago('PT3H') THEN 1 ELSE 0 END ) AS transaction_count_3h,
  SUM( CASE WHEN created_at_timestamp >= ago('PT3H') THEN payload.distinct_size END ) AS transaction_sum_3h,
  SUM( CASE WHEN created_at_timestamp >= ago('PT15M') THEN 1 END ) AS transaction_count_15min,
  SUM( CASE WHEN created_at_timestamp >= ago('PT15M') THEN payload.distinct_size END ) AS transaction_sum_15min,
  type
FROM githubComplexTypeEvents
WHERE created_at_timestamp between ago('P10000D') AND now()
  AND payload.size > 0
  AND JSON_MATCH(
    payload,	
    '("$.ref" = ''refs/heads/master'' OR ("$.ref" = ''refs/heads/gh-pages'' AND ("$.commits[0].author.name" != ''chao1995'' OR "$.commits[0].author.name" IS NOT NULL)))'
  )
GROUP BY type

Synthetic benchmark shows speedup of about 4 times :

branch:

Benchmark                   (_numRows)  (_numSegments)                                                                                                                                                                                                                                                                                                                              (_query)  (_scenario)  Mode  Cnt   Score   Error  Units
BenchmarkQueriesSSQE.query     2500000              10  SELECT
  COUNT(*) AS count,
  SUM(INT_COL) AS size,
  LOW_CARDINALITY_STRING_COL as type
FROM MyTable
WHERE JSON_MATCH(
    JSON_COL,	
    '("$.type" = ''type0'' OR ("$.type" = ''type1'' AND ("$.changes[0].author.name" != ''author10''     OR "$.changes[1].author.name" IS NOT NULL)))'
  )
GROUP BY LOW_CARDINALITY_STRING_COL   EXP(0.001)  avgt    5  23.332 ± 1.553  ms/op

master:

Benchmark                   (_numRows)  (_numSegments)                                                                                                                                                                                                                                                                                                                              (_query)  (_scenario)  Mode  Cnt   Score   Error  Units
BenchmarkQueriesSSQE.query     2500000              10  SELECT
  COUNT(*) AS count,
  SUM(INT_COL) AS size,
  LOW_CARDINALITY_STRING_COL as type
FROM MyTable
WHERE JSON_MATCH(
    JSON_COL,
    '("$.type" = ''type0'' OR ("$.type" = ''type1'' AND ("$.changes[0].author.name" != ''author10''     OR "$.changes[1].author.name" IS NOT NULL)))'
  )
GROUP BY LOW_CARDINALITY_STRING_COL   EXP(0.001)  avgt    5  93.082 ± 1.948  ms/op

Note: I didn't change how flattened bitmaps are mapped to actual doc ids (simple forEach mapping values and adding to MutableRoaringBitmap) because other methods (tested in BenchmarkRoaringBitmapMapping) weren't any faster, and some were much slower:

Benchmark                                             Mode  Cnt         Score       Error   Units
mapSimple                                             avgt    5       634.450 ±    22.122   ms/op
mapSimple:gc.alloc.rate                               avgt    5        63.587 ±     2.168  MB/sec
mapSimple:gc.alloc.rate.norm                          avgt    5  42325777.600 ± 90594.149    B/op
mapSimple:gc.count                                    avgt    5         6.000              counts
mapSimple:gc.time                                     avgt    5        14.000                  ms
mapWithDefaults                                       avgt    5       617.683 ±    50.692   ms/op
mapWithDefaults:gc.alloc.rate                         avgt    5        65.271 ±     5.294  MB/sec
mapWithDefaults:gc.alloc.rate.norm                    avgt    5  42282179.200 ± 90573.488    B/op
mapWithDefaults:gc.count                              avgt    5         5.000              counts
mapWithDefaults:gc.time                               avgt    5        12.000                  ms
mapWithInitCapacity                                   avgt    5       630.927 ±    45.332   ms/op
mapWithInitCapacity:gc.alloc.rate                     avgt    5        64.439 ±     4.636  MB/sec
mapWithInitCapacity:gc.alloc.rate.norm                avgt    5  42641448.800 ± 90578.670    B/op
mapWithInitCapacity:gc.count                          avgt    5         6.000              counts
mapWithInitCapacity:gc.time                           avgt    5        11.000                  ms
mapWithMaxInitCapacity                                avgt    5       633.701 ±    18.636   ms/op
mapWithMaxInitCapacity:gc.alloc.rate                  avgt    5        61.892 ±     1.812  MB/sec
mapWithMaxInitCapacity:gc.alloc.rate.norm             avgt    5  41146993.600 ± 90594.149    B/op
mapWithMaxInitCapacity:gc.count                       avgt    5         5.000              counts
mapWithMaxInitCapacity:gc.time                        avgt    5        12.000                  ms
mapWithOptimisedForArraysAppender                     avgt    5       635.250 ±    52.749   ms/op
mapWithOptimisedForArraysAppender:gc.alloc.rate       avgt    5        63.465 ±     5.259  MB/sec
mapWithOptimisedForArraysAppender:gc.alloc.rate.norm  avgt    5  42282185.600 ± 90594.149    B/op
mapWithOptimisedForArraysAppender:gc.count            avgt    5         6.000              counts
mapWithOptimisedForArraysAppender:gc.time             avgt    5        12.000                  ms
mapWithOptimisedForRunsAppender                       avgt    5       628.546 ±    39.954   ms/op
mapWithOptimisedForRunsAppender:gc.alloc.rate         avgt    5        64.306 ±     4.051  MB/sec
mapWithOptimisedForRunsAppender:gc.alloc.rate.norm    avgt    5  42397132.000 ± 90503.570    B/op
mapWithOptimisedForRunsAppender:gc.count              avgt    5         6.000              counts
mapWithOptimisedForRunsAppender:gc.time               avgt    5        12.000                  ms
mapWithPartialRadixSort                               avgt    5       706.383 ±    24.308   ms/op
mapWithPartialRadixSort:gc.alloc.rate                 avgt    5       133.071 ±     4.476  MB/sec
mapWithPartialRadixSort:gc.alloc.rate.norm            avgt    5  98613905.600 ± 90594.149    B/op
mapWithPartialRadixSort:gc.count                      avgt    5        12.000              counts
mapWithPartialRadixSort:gc.time                       avgt    5        25.000                  ms
mapWithPartialRadixSortPrealloc                       avgt    5       660.032 ±    49.422   ms/op
mapWithPartialRadixSortPrealloc:gc.alloc.rate         avgt    5       114.218 ±     8.313  MB/sec
mapWithPartialRadixSortPrealloc:gc.alloc.rate.norm    avgt    5  79067249.600 ± 90594.149    B/op
mapWithPartialRadixSortPrealloc:gc.count              avgt    5         9.000              counts
mapWithPartialRadixSortPrealloc:gc.time               avgt    5        18.000                  ms
mapWithRunCompressDisabled                            avgt    5       613.402 ±    22.244   ms/op
mapWithRunCompressDisabled:gc.alloc.rate              avgt    5        65.703 ±     2.368  MB/sec
mapWithRunCompressDisabled:gc.alloc.rate.norm         avgt    5  42282169.600 ± 90594.149    B/op
mapWithRunCompressDisabled:gc.count                   avgt    5         5.000              counts
mapWithRunCompressDisabled:gc.time                    avgt    5        11.000                  ms

@codecov-commenter
Copy link

codecov-commenter commented Feb 12, 2025

Codecov Report

Attention: Patch coverage is 81.57895% with 63 lines in your changes missing coverage. Please review.

Project coverage is 63.36%. Comparing base (59551e4) to head (e2a68fa).
Report is 1721 commits behind head on master.

Files with missing lines Patch % Lines
...local/realtime/impl/json/MutableJsonIndexImpl.java 78.72% 23 Missing and 17 partials ⚠️
...t/index/readers/json/ImmutableJsonIndexReader.java 84.86% 11 Missing and 12 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #15049      +/-   ##
============================================
+ Coverage     61.75%   63.36%   +1.61%     
- Complexity      207     1481    +1274     
============================================
  Files          2436     2733     +297     
  Lines        133233   153871   +20638     
  Branches      20636    23755    +3119     
============================================
+ Hits          82274    97499   +15225     
- Misses        44911    49002    +4091     
- Partials       6048     7370    +1322     
Flag Coverage Δ
custom-integration1 100.00% <ø> (+99.99%) ⬆️
integration 100.00% <ø> (+99.99%) ⬆️
integration1 100.00% <ø> (+99.99%) ⬆️
integration2 0.00% <ø> (ø)
java-11 63.34% <81.57%> (+1.63%) ⬆️
java-21 63.23% <81.57%> (+1.60%) ⬆️
skip-bytebuffers-false 63.35% <81.57%> (+1.60%) ⬆️
skip-bytebuffers-true 63.21% <81.57%> (+35.49%) ⬆️
temurin 63.36% <81.57%> (+1.61%) ⬆️
unittests 63.36% <81.57%> (+1.61%) ⬆️
unittests1 56.10% <37.42%> (+9.21%) ⬆️
unittests2 33.81% <80.40%> (+6.08%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@richardstartin
Copy link
Member

Just a note on whether to use the RoaringBitmapWriter, the benchmark results, and the necessity for the benchmark itself which might have saved a bit of time, and to avoid the result of this benchmark being generalised inappropriately. This isn't really the use case it was designed for, and just adding to the bitmap makes perfect sense. It's intended for consuming ascending or nearly ascending values, which the flattened doc id mapping won't preserve in general. The rationale was outlined here. If the added doc id is outside of the current window of 2^16 values, it falls back to simply adding to the underlying bitmap so can only add overhead. On the slow partial radix sort option, the intended use case is for adding unsorted values in one go as in RoaringBitmap.bitmapOfUnordered - batching as done here was never considered and wouldn't be worthwhile because the cost the approach saves can only be amortised in one go. It was benchmarked for intended use here. I can't think of a use case in Pinot for using the partial radix sort option.

The rest of the changes look good.

@bziobrowski
Copy link
Contributor Author

Thanks for the advice, Richard!
When I was trying to speed up the mapping part, the only useful page google proposed was https://richardstartin.github.io/posts/roaringbitmap-performance-tricks .
I guess I've to limit search to your blog :)

@gortiz
Copy link
Contributor

gortiz commented Feb 21, 2025

@richardstartin, I have to say I really miss your posts!

* It stores either a bitmap from posting list that must be cloned before mutating (readOnly=true)
* or an already cloned bitmap.
*/
static class LazyBitmap {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Being a private class it may sound weird, but wouldn't it be beneficial to define an Interface instead of using the LazyBitmap implementation directly. I see some excessive coupling here in case we need to try different alternatives in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LazyBitmap shouldn't leave the internals of parent class so I doubt we'll ever have to check other implementations.
Changes here are similar to those in ImmutableJsonIndexReader but a bit more complex due to:

  • working with RoaringBitmap (there's no simple way to differentiate between instance taken from index that shouldn't be changed and a copy)
  • null RoaringBitmap having meaning in some contexts (e.g. filter() method )
  • some operations mixing LazyBitmap with RoaringBitmap. Wrapping the latter with an instance of the former would be pointless and slow things down.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants