[Postgres] Optimise IN and NOT_IN Queries for Primitive and ARRAY Fields#251
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #251 +/- ##
============================================
+ Coverage 80.49% 80.50% +0.01%
- Complexity 1162 1210 +48
============================================
Files 217 224 +7
Lines 5551 5626 +75
Branches 490 487 -3
============================================
+ Hits 4468 4529 +61
Misses 753 753
- Partials 330 344 +14
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
|
||
| // Extract array type if available | ||
| String arrayTypeCast = null; | ||
| if (expression.getLhs() instanceof ArrayIdentifierExpression) { |
There was a problem hiding this comment.
should we try getting rid of instanceof here. Should we use visitor?
There was a problem hiding this comment.
Yes, will refactor.
There was a problem hiding this comment.
@puneet-traceable Refactored to use visitors
| .collect(Collectors.joining(", ")); | ||
|
|
||
| // Use array overlap operator for array fields | ||
| if (arrayTypeCast != null) { |
There was a problem hiding this comment.
nit: arrayType is probably better
|
LGTM |
|
|
||
| @Override | ||
| public String visit(JsonIdentifierExpression expression) { | ||
| return null; // JSON fields don't have array type |
There was a problem hiding this comment.
Should we be throwing exceptions from here?
Description
This PR optimises
INandNOT_INqueries both for primitives/array fields in PG.Current State and Scope of Optimisation
Currently Generated SQL Queries
Primitive Fields
_id)SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE ARRAY["_id"]::text[] && ARRAY[('1'::int4), ('3'::int4), ('5'::int4)]::text[]) p(countWithParser)item)SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE ARRAY["item"]::text[] && ARRAY[('Soap'), ('Shampoo')]::text[]) p(countWithParser)price)SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE ARRAY["price"]::text[] && ARRAY[('5'::int4), ('10'::int4)]::text[]) p(countWithParser)_id)SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE "_id" IS NULL OR NOT (ARRAY["_id"]::text[] && ARRAY[('1'::int4), ('3'::int4), ('5'::int4)]::text[])) p(countWithParser)item)SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE "item" IS NULL OR NOT (ARRAY["item"]::text[] && ARRAY[('Soap')]::text[])) p(countWithParser)Array Fields
tags)SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE ARRAY["tags"]::text[] && ARRAY[('hygiene'), ('grooming')]::text[]) p(countWithParser)numbers)SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE ARRAY["numbers"]::text[] && ARRAY[('1'::int4), ('10'::int4)]::text[]) p(countWithParser)tags)SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE "tags" IS NULL OR NOT (ARRAY["tags"]::text[] && ARRAY[('hygiene')]::text[])) p(countWithParser)Observations
For primitives, it casts to
::text[]arrays and then uses the overlap operator to evaluate the predicate. This is very efficient because PG cannot use indexes on the LHS col anymore + casting overhead. Instead, we should start generatingINqueries for primitives.For arrays, it casts both LHS and RHS to
::text[]. This again is efficient because PG cannot use index on the casted LHS col.New Queries (after this change)
This PR has the following changes to optimise the queries above:
INoperator with no casting on the LHS col.2.1: Users continue using
IdentifierExpressionfor array columns - This is not supported anymore and will break any existing queries. Users must start usingArrayIdentifierExpressionfor arrays. This is safe because flat collections are not being used by any customers right now.2.2: Users start using
ArrayIdentifierExpressionwithout theArrayType(sodocument-storeknow that this is an array col but cannot tell the type of its objects).2.3: Users start using the
ArrayIdentifierExpressionwith the correspondingArrayType- This is the most optimal case.Primitive Fields (Using Scalar Parser - Optimized)
_id)SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE "_id" IN (('1'::int4), ('3'::int4), ('5'::int4))) p(countWithParser)item)SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE "item" IN (('Soap'), ('Shampoo'))) p(countWithParser)price)SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE "price" IN (('5'::int4), ('10'::int4))) p(countWithParser)_id)SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE "_id" IS NULL OR NOT ("_id" IN (('1'::int4), ('3'::int4), ('5'::int4)))) p(countWithParser)item)SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE "item" IS NULL OR NOT ("item" IN (('Soap')))) p(countWithParser)Observation: We keep using the older logic of casting both LHS and RHS to
::text[], resulting in the current poor perf.Array Fields (Using [ArrayIdentifierExpression] without [ArrayType])
tags)SELECT * FROM "myTestFlat" WHERE "tags"::text[] && ARRAY[?, ?]::text[]numbers)SELECT * FROM "myTestFlat" WHERE "numbers"::text[] && ARRAY[?, ?]::text[]tags)SELECT * FROM "myTestFlat" WHERE "tags" IS NULL OR NOT ("tags"::text[] && ARRAY[?]::text[])Observation: We keep using the older logic of casting both LHS and RHS to
::text[], resulting in the current poor perf.Array Fields (Using [ArrayIdentifierExpression] with [ArrayType])
tags)SELECT * FROM "myTestFlat" WHERE "tags" && ARRAY[?, ?]::text[]numbers)SELECT * FROM "myTestFlat" WHERE "numbers" && ARRAY[?, ?]tags)SELECT * FROM "myTestFlat" WHERE "tags" IS NULL OR NOT ("tags" && ARRAY[?]::text[])Observation: With the type info in hand, we cast only RHS for
text[]arrays. For arrays of primitive types, we don't cast at all, resulting in the best performance. Note that even with"tags" && ARRAY[?, ?]::text[], PG would be able to use indices for this query. This casting is required because o/w, JDBC binds these params as character varying[] which results in a casting error.Testing
Checklist: