[Postgres] Optimise IN and NOT_IN Queries for Primitive and ARRAY Fields by suddendust · Pull Request #251 · hypertrace/document-store

suddendust · 2025-11-20T08:38:07Z

Description

This PR optimises IN and NOT_IN queries for both primitive and array fields in PG.

Current State and Scope of Optimisation

Currently Generated SQL Queries

Primitive Fields

Operation	Field Type	Generated SQL
IN	INT (`_id`)	`SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE ARRAY["_id"]::text[] && ARRAY[('1'::int4), ('3'::int4), ('5'::int4)]::text[]) p(countWithParser)`
IN	STRING (`item`)	`SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE ARRAY["item"]::text[] && ARRAY[('Soap'), ('Shampoo')]::text[]) p(countWithParser)`
IN	NUMERIC (`price`)	`SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE ARRAY["price"]::text[] && ARRAY[('5'::int4), ('10'::int4)]::text[]) p(countWithParser)`
NOT_IN	INT (`_id`)	`SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE "_id" IS NULL OR NOT (ARRAY["_id"]::text[] && ARRAY[('1'::int4), ('3'::int4), ('5'::int4)]::text[])) p(countWithParser)`
NOT_IN	STRING (`item`)	`SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE "item" IS NULL OR NOT (ARRAY["item"]::text[] && ARRAY[('Soap')]::text[])) p(countWithParser)`

Array Fields

Operation	Field Type	Generated SQL
IN	TEXT[] (`tags`)	`SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE ARRAY["tags"]::text[] && ARRAY[('hygiene'), ('grooming')]::text[]) p(countWithParser)`
IN	INTEGER[] (`numbers`)	`SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE ARRAY["numbers"]::text[] && ARRAY[('1'::int4), ('10'::int4)]::text[]) p(countWithParser)`
NOT_IN	TEXT[] (`tags`)	`SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE "tags" IS NULL OR NOT (ARRAY["tags"]::text[] && ARRAY[('hygiene')]::text[])) p(countWithParser)`

Observations

For primitives, it casts both sides to ::text[] arrays and then uses the overlap operator (&&) to evaluate the predicate. This is very inefficient: PG can no longer use indexes on the LHS col, and the casts add overhead. Instead, we should start generating IN queries for primitives.
For arrays, it casts both LHS and RHS to ::text[]. This again is inefficient because PG cannot use an index on the cast LHS col.

New Queries (after this change)

This PR has the following changes to optimise the queries above:

1. For primitives, it uses the IN operator with no casting on the LHS col.
2. For arrays, we have three cases:
2.1: Users continue using IdentifierExpression for array columns. This is no longer supported and will break existing queries; users must switch to ArrayIdentifierExpression for arrays. This is safe because flat collections are not being used by any customers right now.
2.2: Users use ArrayIdentifierExpression without the ArrayType, so document-store knows that this is an array col but cannot tell the type of its elements.
2.3: Users use ArrayIdentifierExpression with the corresponding ArrayType. This is the optimal case.
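
The predicate shapes for case 1 and case 2.2 can be sketched as a small string builder. This is a minimal illustration, not the document-store implementation: the class `InPredicateSketch` and all of its methods are hypothetical names, and it emits `?` bind placeholders rather than inlined literals.

```java
import java.util.Collections;

final class InPredicateSketch {

  // Hypothetical helper: builds "?, ?, ?" for n bind parameters.
  static String placeholders(int n) {
    return String.join(", ", Collections.nCopies(n, "?"));
  }

  // Case 1: primitive column, plain IN with no cast on the LHS column.
  static String scalarIn(String column, int n) {
    return "\"" + column + "\" IN (" + placeholders(n) + ")";
  }

  // Case 2.2: array column whose element type is unknown, so both sides are cast to text[].
  static String untypedArrayIn(String column, int n) {
    return "\"" + column + "\"::text[] && ARRAY[" + placeholders(n) + "]::text[]";
  }

  // NOT_IN wraps the IN predicate and also keeps rows where the column is NULL.
  static String notIn(String column, String inPredicate) {
    return "\"" + column + "\" IS NULL OR NOT (" + inPredicate + ")";
  }

  public static void main(String[] args) {
    System.out.println(scalarIn("item", 2));
    System.out.println(notIn("item", scalarIn("item", 1)));
    System.out.println(untypedArrayIn("tags", 2));
  }
}
```

The `notIn` wrapper mirrors the `IS NULL OR NOT (...)` shape used for NOT_IN throughout this PR, so NULL columns still count as "not in" the given values.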

Primitive Fields (Using Scalar Parser - Optimized)

Operation	Field Type	Generated SQL
IN	INT (`_id`)	`SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE "_id" IN (('1'::int4), ('3'::int4), ('5'::int4))) p(countWithParser)`
IN	STRING (`item`)	`SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE "item" IN (('Soap'), ('Shampoo'))) p(countWithParser)`
IN	NUMERIC (`price`)	`SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE "price" IN (('5'::int4), ('10'::int4))) p(countWithParser)`
NOT_IN	INT (`_id`)	`SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE "_id" IS NULL OR NOT ("_id" IN (('1'::int4), ('3'::int4), ('5'::int4)))) p(countWithParser)`
NOT_IN	STRING (`item`)	`SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE "item" IS NULL OR NOT ("item" IN (('Soap')))) p(countWithParser)`

Observation: The LHS col is no longer cast and the predicate is a plain `IN`, which removes the casting overhead and lets PG use an index on the col.

Array Fields (Using [ArrayIdentifierExpression] without [ArrayType])

Operation	Field Type	Generated SQL
IN	TEXT[] (`tags`)	`SELECT * FROM "myTestFlat" WHERE "tags"::text[] && ARRAY[?, ?]::text[]`
IN	INTEGER[] (`numbers`)	`SELECT * FROM "myTestFlat" WHERE "numbers"::text[] && ARRAY[?, ?]::text[]`
NOT_IN	TEXT[] (`tags`)	`SELECT * FROM "myTestFlat" WHERE "tags" IS NULL OR NOT ("tags"::text[] && ARRAY[?]::text[])`

Observation: We keep using the older logic of casting both LHS and RHS to `::text[]`, resulting in the current poor perf.

Array Fields (Using [ArrayIdentifierExpression] with [ArrayType])

Operation	Field Type	Generated SQL
IN	TEXT[] (`tags`)	`SELECT * FROM "myTestFlat" WHERE "tags" && ARRAY[?, ?]::text[]`
IN	INTEGER[] (`numbers`)	`SELECT * FROM "myTestFlat" WHERE "numbers" && ARRAY[?, ?]`
NOT_IN	TEXT[] (`tags`)	`SELECT * FROM "myTestFlat" WHERE "tags" IS NULL OR NOT ("tags" && ARRAY[?]::text[])`

Observation: With the type info in hand, we cast only the RHS for `text[]` arrays. For arrays of primitive types, we don't cast at all, resulting in the best performance. Note that even with `"tags" && ARRAY[?, ?]::text[]`, PG is still able to use indexes for this query, since the col itself is not cast. The RHS cast is required because otherwise JDBC binds these params as `character varying[]`, which results in a casting error.
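
As a rough sketch of the typed case, the RHS cast can be chosen from the declared element type while the column is left untouched. The enum `ElementType` and the methods below are hypothetical stand-ins for illustration; the actual `ArrayType` values and parser classes in document-store may differ.

```java
import java.util.Collections;

final class TypedArrayPredicateSketch {

  // Hypothetical stand-in for the element type carried by the array expression.
  enum ElementType { TEXT, INTEGER }

  // Only text[] gets an explicit RHS cast; other element types are left uncast.
  static String rhsCast(ElementType type) {
    return type == ElementType.TEXT ? "::text[]" : "";
  }

  // Builds the overlap predicate with n bind parameters and no cast on the column.
  static String typedArrayIn(String column, ElementType type, int n) {
    String params = String.join(", ", Collections.nCopies(n, "?"));
    return "\"" + column + "\" && ARRAY[" + params + "]" + rhsCast(type);
  }

  public static void main(String[] args) {
    System.out.println(typedArrayIn("tags", ElementType.TEXT, 2));
    System.out.println(typedArrayIn("numbers", ElementType.INTEGER, 2));
  }
}
```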

Testing

Added integration tests for all 3 cases.
Tested them in a live environment.

Checklist:

My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
Any dependent changes have been merged and published in downstream modules

codecov · 2025-11-20T08:39:39Z

Codecov Report

❌ Patch coverage is 83.69565% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.50%. Comparing base (61e844b) to head (468bd36).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
.../parser/filter/PostgresContainsParserSelector.java	61.53%	0 Missing and 5 partials ⚠️
...ery/v1/parser/filter/PostgresInParserSelector.java	64.28%	0 Missing and 5 partials ⚠️
.../v1/parser/filter/PostgresNotInParserSelector.java	64.28%	0 Missing and 5 partials ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main     #251      +/-   ##
============================================
+ Coverage     80.49%   80.50%   +0.01%     
- Complexity     1162     1210      +48     
============================================
  Files           217      224       +7     
  Lines          5551     5626      +75     
  Branches        490      487       -3     
============================================
+ Hits           4468     4529      +61     
  Misses          753      753              
- Partials        330      344      +14

Flag	Coverage Δ
integration	`80.50% <83.69%> (+0.01%)`	⬆️
unit	`57.83% <61.95%> (+0.15%)`	⬆️

puneet-traceable · 2025-11-20T08:56:36Z

...ostgres/query/v1/parser/filter/nonjson/field/PostgresInRelationalFilterParserArrayField.java

+
+    // Extract array type if available
+    String arrayTypeCast = null;
+    if (expression.getLhs() instanceof ArrayIdentifierExpression) {


should we try getting rid of instanceof here. Should we use visitor?

Yes, will refactor.

@puneet-traceable Refactored to use visitors

puneet-traceable · 2025-11-20T10:36:37Z

...ostgres/query/v1/parser/filter/nonjson/field/PostgresInRelationalFilterParserArrayField.java

+            .collect(Collectors.joining(", "));
+
+    // Use array overlap operator for array fields
+    if (arrayTypeCast != null) {


nit: arrayType is probably better

puneet-traceable · 2025-11-20T10:40:41Z

LGTM

suresh-prakash · 2025-11-21T05:53:27Z

.../documentstore/postgres/query/v1/parser/filter/nonjson/field/PostgresArrayTypeExtractor.java

+
+  @Override
+  public String visit(JsonIdentifierExpression expression) {
+    return null; // JSON fields don't have array type


Should we be throwing exceptions from here?

Agreed, done.
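
For readers following the two review threads (visitor instead of `instanceof`, and throwing instead of returning null), the idea can be sketched roughly as below. Every type here (`Expr`, `Visitor`, `ArrayExpr`, `JsonExpr`, `ArrayTypeExtractor`) is a simplified, hypothetical stand-in and not the actual document-store class; the real `PostgresArrayTypeExtractor` may handle more expression types and behave differently.

```java
final class ArrayTypeVisitorSketch {

  // Hypothetical expression interface: each expression dispatches to the visitor itself,
  // so callers never need instanceof checks.
  interface Expr {
    <T> T accept(Visitor<T> visitor);
  }

  interface Visitor<T> {
    T visitArray(ArrayExpr expression);
    T visitJson(JsonExpr expression);
  }

  // Stand-in for an array column reference; arrayType is e.g. "text[]".
  static final class ArrayExpr implements Expr {
    final String name;
    final String arrayType;

    ArrayExpr(String name, String arrayType) {
      this.name = name;
      this.arrayType = arrayType;
    }

    @Override
    public <T> T accept(Visitor<T> visitor) {
      return visitor.visitArray(this);
    }
  }

  // Stand-in for a JSON field reference, which has no array type.
  static final class JsonExpr implements Expr {
    final String name;

    JsonExpr(String name) {
      this.name = name;
    }

    @Override
    public <T> T accept(Visitor<T> visitor) {
      return visitor.visitJson(this);
    }
  }

  // Extracts the array type, throwing for expressions where the question makes no sense.
  static final class ArrayTypeExtractor implements Visitor<String> {
    @Override
    public String visitArray(ArrayExpr expression) {
      return expression.arrayType;
    }

    @Override
    public String visitJson(JsonExpr expression) {
      throw new UnsupportedOperationException(
          "JSON fields do not carry an array type: " + expression.name);
    }
  }

  public static void main(String[] args) {
    Expr tags = new ArrayExpr("tags", "text[]");
    System.out.println(tags.accept(new ArrayTypeExtractor()));
  }
}
```

Throwing from the JSON branch surfaces a wrongly routed expression immediately, whereas a null return would silently fall through to the untyped path.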

suddendust added 4 commits November 20, 2025 11:49

Initial commit

9ed8cba

Add array types

aa60382

WIP

21e409d

Add backward compatibility tests

193d023

suddendust requested review from avinashkolluru, kotharironak, skjindal93 and suresh-prakash as code owners November 20, 2025 08:38

suddendust added 2 commits November 20, 2025 14:10

Spotless

34ee89e

Refactored test case

d0193a5

suddendust changed the title ~~[Draft] [Postgres] Optimise IN and NOT_IN Queries for Primitive and ARRAY Fields~~ [Postgres] Optimise IN and NOT_IN Queries for Primitive and ARRAY Fields Nov 20, 2025

puneet-traceable reviewed Nov 20, 2025

suddendust added 3 commits November 20, 2025 15:16

Remove instanceof checks using visitors

bb811e6

Added UTs for coverage

ecdaea2

Spotless

e67f181

puneet-traceable reviewed Nov 20, 2025

suddendust added 3 commits November 20, 2025 19:38

Remove backward compat tests

a9a038e

Fixed failing test cases

a98ecb2

Renamed variable

447ace1

suresh-prakash previously approved these changes Nov 21, 2025

Throw exceptions instead of returning nulls in PostgresArrayTypeExtractor

468bd36

suddendust dismissed suresh-prakash’s stale review via 468bd36 November 21, 2025 06:06

suresh-prakash approved these changes Nov 21, 2025

suresh-prakash merged commit d411cbc into hypertrace:main Nov 21, 2025
6 checks passed

suddendust mentioned this pull request Nov 21, 2025

Optimise IN queries for json fields in flat collections #252

Merged

suresh-prakash merged 13 commits into hypertrace:main from suddendust:primitive_in_handling
