Skip to content

Multi-stage: Perf issue when IN expression has a lot of entries #13617

@gortiz

Description

@gortiz

Since we updated Calcite dependency from 1.31 to 1.37, multi stage engine has issues executing queries that use IN expressions with a lot (lets say >50) elements. Specifically, we spend a lot (in the order of seconds) time optimizing the query.

It can be easily tested by running MultistageEngineQuickStart and executing a query like:

--explain plan for
SELECT DestCityName
FROM (
         SELECT DestCityName
         FROM airlineStats
         WHERE DestCityName IN (
 			    'a0', 'a1','a2','a3','a4','a5','a6','a7','a8','a9',
                            'a10', 'a11', 'a12', 'a13', 'a14', 'a15', 'a16', 'a17', 'a18', 'a19',
                            'a20', 'a21', 'a22', 'a23', 'a24', 'a25', 'a26', 'a27', 'a28', 'a29',
                            'a30', 'a31', 'a32', 'a33', 'a34', 'a35', 'a36', 'a37', 'a38', 'a39',
                            'a40', 'a41', 'a42', 'a43', 'a44', 'a45', 'a46', 'a47', 'a48', 'a49',
                            'a50', 'a51', 'a52', 'a53', 'a54', 'a55', 'a56', 'a57', 'a58', 'a59',
                            'a60', 'a61', 'a62', 'a63', 'a64', 'a65', 'a66', 'a67', 'a68', 'a69',
                            'a70', 'a71', 'a72', 'a73', 'a74', 'a75', 'a76', 'a77', 'a78', 'a79',
                            'a80', 'a81', 'a82', 'a83', 'a84', 'a85', 'a86', 'a87', 'a88', 'a89',
                            'a90', 'a91', 'a92', 'a93', 'a94', 'a95', 'a96', 'a97', 'a98', 'a99',
                            'a100', 'a101', 'a102', 'a103', 'a104', 'a105', 'a106', 'a107', 'a108', 'a109',
                            'a110', 'a111', 'a112', 'a113', 'a114', 'a115', 'a116', 'a117', 'a118', 'a119',
                            'a120', 'a121', 'a122', 'a123', 'a124', 'a125', 'a126', 'a127', 'a128', 'a129',
                            'a130', 'a131', 'a132', 'a133', 'a134', 'a135', 'a136', 'a137', 'a138', 'a139',
                            'a140', 'a141', 'a142', 'a143', 'a144', 'a145', 'a146', 'a147', 'a148', 'a149',
                            'a150', 'a151', 'a152', 'a153', 'a154', 'a155', 'a156', 'a157', 'a158', 'a159', 
                            'a160', 'a161', 'a162', 'a163', 'a164', 'a165', 'a166', 'a167', 'a168', 'a169',
                            'a170', 'a171', 'a172', 'a173', 'a174', 'a175', 'a176', 'a177', 'a178', 'a179',
                            'a180', 'a181', 'a182', 'a183', 'a184', 'a185', 'a186', 'a187', 'a188', 'a189',
                            'a190', 'a191', 'a192', 'a193', 'a194', 'a195', 'a196', 'a197', 'a198', 'a199'
             )
         GROUP BY DestCityName
             LIMIT 2147483647
     ) as a

After studying the issue for a while, it looks like the issue comes from a newly introduced Calcite optimization when dealing with ORs. In order to calculate that optimization, Calcite applies an algorithm whose cost is at least quadratic in terms of sub-predicates in the OR.

These OR expressions are generated either by Calcite or by Pinot in different parts of the code. Right now I found that:

  • IN can be transformed into OR while SqlNodes are converted into RelNodes. Calcite knows it may be a problem and by default does not convert INs that have more than 20 elements. But we explicitly set that limit to Integer.MAX_VALUE in PinotRuleUtils.PINOT_SQL_TO_REL_CONFIG.
  • PinotFilterExpandSearchRule, where we transform any remanent SEARCH node into ORS.

We have explored a couple of alternatives to fix the problem in #13605 and #13614

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions