Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce IdSet and add IdSetAggregationFunction #5926

Merged
merged 1 commit into from
Aug 31, 2020

Conversation

Jackie-Jiang
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang commented Aug 26, 2020

Description

For issue #5925

Introduce IdSet which represents a collection of ids and can be used to replace the large IN clause to optimize the query.
4 types of IdSet are introduced:

  • EmptyIdSet: Used as a place holder for empty id set
  • RoaringBitmapIdSet: Based on RoaringBitmap and can be used to store INT ids
  • Roaring64NavigableMapIdSet: Based on Roaring64NavigableMap and can be used to store LONG ids
  • BloomFilterIdSet: Based on BloomFilter and can be used to store any type of ids (contains won't be 100% accurate because of the nature of BloomFilter)

Add IdSetAggregationFunction to create IdSet for a column.
3 configurable parameters for the function:

  • sizeThresholdInBytes: Once the size of the IdSet reaches this threshold, convert it to BloomFilterIdSet to save space
  • expectedInsertions: Number of expected insertions for the BloomFilter, must be positive
  • fpp: Desired false positive probability for the BloomFilter, must be positive and less than 1.0

The parameters can be passed to the function as the second argument, e.g.:
SELECT IDSET(intColumn, 'sizeThresholdInBytes=1000;expectedInsertions=1000;fpp=0.03') FROM testTable

Bump up the version of RoaringBitmap to 0.9.0 to include some bug fixes for Roaring64NavigableMap.

private final Type _type;
private final RoaringBitmap _bitmap;
private final Roaring64NavigableMap _longBitmap;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think this constructor is the only thing we need? If not, We should make this private and add static builders

/**
* The {@code IdSet} represents a collection of ids. It can be used to optimize the query with huge IN clause.
*/
public class IdSet implements Comparable<IdSet> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let’s make this an interface with multiple implementations?

@Jackie-Jiang Jackie-Jiang force-pushed the id_set branch 2 times, most recently from 456de34 to fab464f Compare August 28, 2020 02:40
@Jackie-Jiang
Copy link
Contributor Author

@kishoreg Refactored the PR and introduced the BloomFilterIdSet

@codecov-commenter
Copy link

codecov-commenter commented Aug 28, 2020

Codecov Report

Merging #5926 into master will increase coverage by 0.90%.
The diff coverage is 68.99%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5926      +/-   ##
==========================================
+ Coverage   66.44%   67.35%   +0.90%     
==========================================
  Files        1075     1191     +116     
  Lines       54773    62438    +7665     
  Branches     8168     9533    +1365     
==========================================
+ Hits        36396    42054    +5658     
- Misses      15700    17286    +1586     
- Partials     2677     3098     +421     
Flag Coverage Δ
#integration 43.75% <48.87%> (?)
#unittests 58.79% <56.33%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ot/broker/broker/AllowAllAccessControlFactory.java 100.00% <ø> (ø)
.../helix/BrokerUserDefinedMessageHandlerFactory.java 52.83% <0.00%> (-13.84%) ⬇️
...ava/org/apache/pinot/client/AbstractResultSet.java 53.33% <0.00%> (-3.81%) ⬇️
.../main/java/org/apache/pinot/client/Connection.java 44.44% <0.00%> (-4.40%) ⬇️
.../org/apache/pinot/client/ResultTableResultSet.java 24.00% <0.00%> (-10.29%) ⬇️
...ot/common/messages/RoutingTableRebuildMessage.java 54.54% <ø> (ø)
...ache/pinot/common/metadata/ZKMetadataProvider.java 70.55% <ø> (+3.70%) ⬆️
...metadata/segment/LLCRealtimeSegmentZKMetadata.java 47.61% <ø> (+3.35%) ⬆️
...g/apache/pinot/common/metrics/AbstractMetrics.java 82.57% <ø> (+7.90%) ⬆️
...a/org/apache/pinot/common/metrics/BrokerGauge.java 91.66% <ø> (+1.66%) ⬆️
... and 860 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e4ea9fc...fab464f. Read the comment docs.

Copy link
Member

@kishoreg kishoreg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants