
Conversation

@mkleczek
Contributor

@mkleczek mkleczek commented Nov 17, 2025

Calculating the hint message when a requested relation is not present in the schema cache requires creating a FuzzySet (used for fuzzy search to find candidate tables). For schemas with many tables this is costly. This patch introduces dbTablesFuzzyIndex in SchemaCache to memoize the FuzzySet creation.

Fixes #4462
Fixes #4463
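For illustration, the memoization idea can be sketched in Python (the actual patch is Haskell; all names here are invented, and difflib stands in for the FuzzySet matcher):

```python
import difflib

class SchemaCache:
    """Sketch of the optimization: build one fuzzy index per schema when
    the cache is constructed, instead of on every failed table lookup."""

    def __init__(self, tables_by_schema):
        # Memoized per-schema index: computed once, reused by every hint.
        self.fuzzy_index = {
            schema: sorted(tables)
            for schema, tables in tables_by_schema.items()
        }

    def hint(self, schema, name):
        # Cheap lookup against the prebuilt index; no per-request rebuild.
        matches = difflib.get_close_matches(
            name, self.fuzzy_index.get(schema, []), n=1, cutoff=0.6
        )
        return f"Perhaps you meant '{matches[0]}'" if matches else None

cache = SchemaCache({"public": ["projects", "clients"]})
cache.hint("public", "xprojects")  # suggests 'projects'
```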

@steve-chavez
Member

Definitely looks related to #4463. Have you tested the impact of this PR on a schema with lots of tables?

Looks like we'll need the same test setup for #4462.

@steve-chavez
Member

I think I've managed to repro, with 125K tables:

17/Nov/2025:17:56:40 -0500: Schema cache queried in 5230.2 milliseconds
17/Nov/2025:17:56:40 -0500: Schema cache loaded 125271 Relations, 232 Relationships, 149 Functions, 15 Domain Representations, 45 Media Type Handlers, 1196 Timezones
17/Nov/2025:17:56:40 -0500: Schema cache loaded in 237.2 milliseconds

The server dies when trying to compute the hint:

$ curl localhost:3000/xprojects
curl: (52) Empty reply from server

# correct table names do work
$ curl localhost:3000/projects
[{"id":1,"name":"Windows 7","client_id":1}, 
 {"id":2,"name":"Windows 10","client_id":1}, 
 {"id":3,"name":"IOS","client_id":2}, 
 {"id":4,"name":"OSX","client_id":2}, 
 {"id":5,"name":"Orphan","client_id":null}]$ 

@steve-chavez
Member

The server dies when trying to compute the hint:
$ curl localhost:3000/xprojects
curl: (52) Empty reply from server

Same result with this PR. I'll add a test case.

@mkleczek
Contributor Author

The server dies when trying to compute the hint:
$ curl localhost:3000/xprojects
curl: (52) Empty reply from server

Same result with this PR. I'll add a test case.

Yeah... this PR won't help with the first hint computation (the FuzzySet is created anyway). It would only help with subsequent hints (without this PR, the FuzzySet is created on each request).

Seems like the FuzzySet implementation is memory-hungry.

Also, it depends on the number of schemas, as with this PR a separate FuzzySet is created per schema (it doesn't help if all tables are in a single schema).

@mkleczek
Contributor Author

@steve-chavez
I've spent a little bit of time researching the problem and it seems we do not have any good options for databases with a large number of relations.

My suggestion would be to either:

  1. Get rid of fuzzy relation search altogether.
  2. Only search when the number of relations is small enough (we need to come up with the right number).
  3. Have a configuration property to turn on/off fuzzy search.
  4. Hit the database with dummy SELECT 1 FROM non_existing_table LIMIT 0 or similar and retrieve information from the reported error (if there is no error that means the schema cache is stale).
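Option 2 amounts to a size gate. A minimal sketch in Python (the cutoff constant is invented, and difflib stands in for the fuzzy matcher):

```python
import difflib

# Hypothetical cutoff; the "right number" would need to be measured.
MAX_FUZZY_TABLES = 500

def table_hint(requested, tables):
    """Return a did-you-mean candidate, but skip fuzzy search entirely
    when the schema is too large for the index to be worth building."""
    if len(tables) > MAX_FUZZY_TABLES:
        return None  # prefer no hint over a costly (or OOM-ing) search
    matches = difflib.get_close_matches(requested, tables, n=1, cutoff=0.6)
    return matches[0] if matches else None
```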

Aside:
To be honest I have a gut feeling we are doing something wrong here by trying to re-implement catalog search in PostgREST: this is the role of the DBMS. I am not even sure #3869 is the right direction, as ideally we should be doing just syntactical transformations of requests to queries (and, if possible, not have the schema cache at all). I can even imagine an implementation that delegates query construction to the database itself (i.e. SQL generating SQL) and only caches the queries for subsequent executions.

OTOH I understand that if we already have the schema cache then let's use it - trade-offs everywhere :)

@wolfgangwalther
Member

  1. Get rid of fuzzy relation search altogether.

+1 for this also from a dependency perspective: We have to pin an older version of the fuzzy-thing dependency, because a newer version causes even more regression. I'd like to get rid of that dependency, ideally.

@steve-chavez
Member

steve-chavez commented Nov 18, 2025

Get rid of fuzzy relation search altogether.

From my experience maintaining a PostgREST-as-a-service, before the fuzzy search hint there were lots of support tickets that came from typos, but users were quick to blame PostgREST and suspect a malfunction. It was a huge time sink. After the fuzzy search, those complaints just stopped.

Nowadays, with the TypeScript clients and other typed clients we have in the ecosystem, maybe this is not such a big problem, but I'd still argue having the fuzzy search is a net win for 99% of use cases.

Only search when the number of relations is small enough (we need to come up with the right number).

This looks like the viable short-term solution. Most use cases don't have that many tables either.

I am not even sure if #3869 is the right direction as ideally we should be doing just syntactical transformations of requests to queries (and if possible not have the schema cache a all)

IIRC, #3869 was mostly for better error messages, but we have other features that might also reuse it. To not use the schema cache we'd have to implement resource embedding and other features in PostgreSQL itself (possible, but it would take a lot of time).

+1 for this also from a dependency perspective: We have to pin an older version of the fuzzy-thing dependency, because a newer version causes even more regression. I'd like to get rid of that dependency, ideally.

Could we instead vendor the dependency? 🤔 (see laserpants/fuzzyset-haskell#9)

@steve-chavez
Member

$ curl localhost:3000/xprojects
curl: (52) Empty reply from server

Besides the above, I'm also getting an "empty reply from server" on / (OpenAPI). I guess the empty replies happen because of the space leak protection #2122?

+1 for this also from a dependency perspective: We have to pin an older version of the fuzzy-thing dependency, because a newer version causes even more regression. I'd like to get rid of that dependency, ideally.

Looking at the algorithms used as inspiration on laserpants/fuzzyset-haskell#2 (comment), I wonder if it's optimized for our use case. Maybe we can come up with a better library? (cc @taimoorzaeem)

@mkleczek
Contributor Author

Get rid of fuzzy relation search altogether.

From my experience maintaining a PostgREST-as-a-service, before the fuzzy search hint there were lots of support tickets that came from typos, but users were quick to blame PostgREST and suspect a malfunction. It was a huge time sink. After the fuzzy search, those complaints just stopped.

Makes sense.

[...]

+1 for this also from a dependency perspective: We have to pin an older version of the fuzzy-thing dependency, because a newer version causes even more regression. I'd like to get rid of that dependency, ideally.

Could we instead vendor the dependency? 🤔 (see laserpants/fuzzyset-haskell#9)

I think it makes sense to merge this PR regardless of these decisions as it is a quite obvious optimization (even though it does not fix the OOM issue).
@steve-chavez @wolfgangwalther WDYT?

@steve-chavez
Member

I think it makes sense to merge this PR regardless of these decisions as it is a quite obvious optimization (even though it does not fix the OOM issue).
@steve-chavez @wolfgangwalther WDYT?

I don't think we should merge without a test that proves what's being improved (since it doesn't solve #4463).

@taimoorzaeem
Collaborator

Looking at the algorithms used as inspiration on laserpants/fuzzyset-haskell#2 (comment), I wonder if it's optimized for our use case. Maybe we can come up with a better library? (cc @taimoorzaeem)

I will be happy to write a library 👍 . I would just need to do a feasibility check; maybe a better library already exists on Hackage? Needs some investigation.

@mkleczek mkleczek force-pushed the dbtables-fuzzyset-cache branch from a1b92a6 to 20b391f Compare November 18, 2025 18:39
@mkleczek
Contributor Author

mkleczek commented Nov 18, 2025

I think it makes sense to merge this PR regardless of these decisions as it is a quite obvious optimization (even though it does not fix the OOM issue).
@steve-chavez @wolfgangwalther WDYT?

I don't think we should merge without a test that proves what's being improved (since it doesn't solve #4463).

Done.
Updated mixed load test to include requests for non-existing relation and the fixture to create 1000 tables.

See: https://github.com/PostgREST/postgrest/actions/runs/19477317041/attempts/1#summary-55740304508

@mkleczek
Contributor Author

mkleczek commented Nov 18, 2025

Looking at the algorithms used as inspiration on laserpants/fuzzyset-haskell#2 (comment), I wonder if it's optimized for our use case. Maybe we can come up with a better library? (cc @taimoorzaeem)

I will be happy to write a library 👍 . I would just need to do a feasibility check; maybe a better library already exists on Hackage? Needs some investigation.

As I mentioned earlier, I've done some research and I couldn't find any interesting alternatives. Fuzzy search algorithms fall into two groups: online and offline.

Online algorithms do not require additional memory, but they require scanning the whole list (so they are at least O(n)).

Offline algorithms require building an index. The best results are achieved with indexes based on n-grams, which is exactly what the FuzzySet we use is. The problem with an n-gram-based index is that it is big (for each word we keep multiple entries: one per n-gram found in the word).

I am skeptical we can come up with a solution that scales to 125k tables...
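A minimal trigram index sketch (Python, invented names) shows how the offline approach works and why it is big (each word is stored once per distinct gram):

```python
from collections import defaultdict

def ngrams(word, n=3):
    padded = f"-{word}-"  # pad so prefixes/suffixes contribute grams too
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def build_index(words, n=3):
    # Offline index: gram -> set of words containing that gram. A word of
    # length k lands in roughly k buckets, which is the memory cost above.
    index = defaultdict(set)
    for w in words:
        for g in ngrams(w, n):
            index[g].add(w)
    return index

def fuzzy_lookup(index, query, n=3):
    # Score candidates by shared grams: cost depends on the query's grams
    # and their bucket sizes, not on the total number of words.
    scores = defaultdict(int)
    for g in ngrams(query, n):
        for w in index.get(g, ()):
            scores[w] += 1
    return max(scores, key=scores.get) if scores else None

idx = build_index(["projects", "clients", "tasks"])
fuzzy_lookup(idx, "xprojects")  # finds 'projects'
```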

@mkleczek mkleczek force-pushed the dbtables-fuzzyset-cache branch from 20b391f to 2057eaa Compare November 19, 2025 05:20
@steve-chavez
Member

steve-chavez commented Nov 19, 2025

Done.
Updated mixed load test to include requests for non-existing relation and the fixture to create 1000 tables.
See: https://github.com/PostgREST/postgrest/actions/runs/19477317041/attempts/1#summary-55740304508

@mkleczek We should not modify the mixed loadtest for this; it should stay the same across versions so we can see regressions. Additionally, that doesn't prove there's an enhancement here.

Test should be like:

  • A new dedicated fixture (like we have for test/io/replica.sql).
  • A test that triggers a not found error
  • The test should measure there's a speed up via Server-Timing or other means (for this it might help to increase the number of tables to a big number)

@mkleczek
Contributor Author

mkleczek commented Nov 19, 2025

Done.
Updated mixed load test to include requests for non-existing relation and the fixture to create 1000 tables.
See: https://github.com/PostgREST/postgrest/actions/runs/19477317041/attempts/1#summary-55740304508

@mkleczek We should not modify the mixed loadtest for this; it should stay the same across versions so we can see regressions.

I am not sure I understand this.

The old mixed loadtest does not account for this scenario, and without it, it is not a good regression test (the proof is that we didn't catch the performance regression introduced by #3869).

So I would say adding the "table not found" scenario to it is an important addition: it makes it more realistic and covers a wider set of scenarios.

Additionally, that doesn't prove there's an enhancement here.

https://github.com/PostgREST/postgrest/actions/runs/19477317041/attempts/1#summary-55740304508
shows a more than 20% increase in throughput (108 vs 82 req/s) and around 10% less CPU usage compared to main and v14.

Am I missing something here?

Test should be like:

  • A new dedicated fixture (like we have for test/io/replica.sql).
  • A test that triggers a not found error
  • The test should measure there's a speed up via Server-Timing or other means (for this it might help to increase the number of tables to a big number)

Hmm...
Do we run io tests against different branches (similar to load tests)?
If not then I don't really understand how we could show improvement vs main or 14.

I thought load tests (not io tests) are supposed to test performance?

@steve-chavez could you advise on the way you would like it to be tested?

@steve-chavez
Member

steve-chavez commented Nov 19, 2025

I am not sure I understand this.
The old mixed loadtest does not account for this scenario, and without it it is not a good regression test (the proof is that we didn't catch performance regression introduced by #3869 )

The problem is that "mixed" is already conflating too many scenarios (see #4123); with this addition it's even worse, as we also conflate "successful traffic" with "error traffic" (note that we now have errors.0: 404 Not Found in the loadtest results).

Another problem is that loadtests are slow, so we shouldn't add too many of them (IMO a dedicated one for errors doesn't seem worth it) if we can have an io test instead.

Hmm...
Do we run io tests against different branches (similar to load tests)?
If not then I don't really understand how we could show improvement vs main or 14.
I thought load tests (not io tests) are supposed to test performance?

We do have perf related tests on:

def test_schema_cache_load_max_duration(defaultenv):
    "schema cache load should not surpass a max_duration of elapsed milliseconds"
    max_duration = 500.0
    env = {
        **defaultenv,
        "PGRST_DB_SCHEMAS": "apflora",
        "PGRST_DB_POOL": "2",
        "PGRST_DB_ANON_ROLE": "postgrest_test_anonymous",
    }
    with run(env=env, wait_max_seconds=30, no_startup_stdout=False) as postgrest:
        log_lines = postgrest.read_stdout(nlines=50)
        schema_cache_lines = [
            line for line in log_lines if "Schema cache loaded in" in line
        ]
        match = re.search(
            r"Schema cache loaded in ([0-9]+(?:\.[0-9])?) milliseconds",
            schema_cache_lines[-1],
        )
        assert match, f"unexpected log format: {schema_cache_lines[-1]}"
        duration_ms = float(match.group(1))
        assert duration_ms < max_duration

It looks enough to have an upper bound on the expected time for computing the hint to avoid regressions.

@steve-chavez could you advise on the way you would like it to be tested?

I'd suggest adding the fixtures on test/io/test_big_schema.py too; it seems a good fit. Roughly following the steps I mentioned above, I'm thinking measuring Server-Timing should be enough. You could also add this test on another PR to see the current Server-Timing numbers, and then update the numbers on this PR to confirm the improvement.

@mkleczek mkleczek force-pushed the dbtables-fuzzyset-cache branch 2 times, most recently from ff6d3eb to c2e1d94 Compare November 20, 2025 16:22
@mkleczek
Contributor Author

@steve-chavez could you advise on the way you would like it to be tested?

I'd suggest adding the fixtures on test/io/test_big_schema.py too; it seems a good fit. Roughly following the steps I mentioned above, I'm thinking measuring Server-Timing should be enough. You could also add this test on another PR to see the current Server-Timing numbers, and then update the numbers on this PR to confirm the improvement.

@steve-chavez
Done. (see test_second_request_for_non_existent_table_should_be_quick in test_big_schema.py)



def test_second_request_for_non_existent_table_should_be_quick(defaultenv):
"requesting a non-existent relationship the second time should be quick"
Member


Q: Why is the second time quicker? Does the fuzzy index get populated after hitting an error?

From the code, it looks like the fuzzy index is only populated when the schema cache is built.

Contributor Author

@mkleczek mkleczek Nov 21, 2025


Q: Why is the second time quicker? Does the fuzzy index get populated after hitting an error?

From the code, it looks like the fuzzy index is only populated when the schema cache is built.

That's because of Haskell laziness.

Member


Ok, then to avoid confusion I suggest:

Suggested change
"requesting a non-existent relationship the second time should be quick"
"requesting a non-existent relationship should be quick after the schema cache is loaded (2nd request)"

Member


On second thought, not sure if I understand this.

When we do a request with resource embedding the schema cache is already loaded and works on the first request. Why does this change for the fuzzy index and we need to make 2 requests?

Member


Maybe what we need is some comments about this optimization, I suggest adding those on the type definition #4472 (comment)

Contributor Author

@mkleczek mkleczek Nov 21, 2025


On second thought, not sure if I understand this.

When we do a request with resource embedding the schema cache is already loaded and works on the first request. Why does this change for the fuzzy index and we need to make 2 requests?

This is indeed tricky: the schema cache is loaded lazily as well. But there are these two lines in AppState.retryingSchemaCacheLoad:

(t, _) <- timeItT $ observer $ SchemaCacheSummaryObs $ showSummary sCache
observer $ SchemaCacheLoadedObs t

which cause evaluation of SchemaCache fields.

I've decided not to add dbTablesFuzzyIndex to the schema cache summary and to leave its evaluation until first use.
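The laziness point can be mimicked in Python, with cached_property standing in for a lazy Haskell field (an analogy only, not PostgREST code):

```python
from functools import cached_property

index_builds = 0  # counts how many times the expensive work actually runs

class SchemaCache:
    def __init__(self, tables):
        self.tables = tables

    @cached_property
    def fuzzy_index(self):
        # Like an unevaluated thunk: nothing happens at construction time;
        # the cost is paid on first access and the result is then reused.
        global index_builds
        index_builds += 1
        return {t: {t[i:i + 3] for i in range(len(t) - 2)} for t in self.tables}

cache = SchemaCache(["projects", "clients"])
# index_builds is still 0 here: mirrors leaving dbTablesFuzzyIndex out of
# the summary so it is not forced at load time.
cache.fuzzy_index  # first use: builds the index
cache.fuzzy_index  # second use: plain attribute read, no rebuild
```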

Contributor Author


Introduced the type alias, added comments to SchemaCache field and updated the test description.

@mkleczek mkleczek force-pushed the dbtables-fuzzyset-cache branch from c2e1d94 to ee52b3f Compare November 21, 2025 05:28
@mkleczek mkleczek force-pushed the dbtables-fuzzyset-cache branch 2 times, most recently from f8d5cf5 to 6226416 Compare November 21, 2025 19:43
@steve-chavez
Member

@mkleczek Sorry for the late reply here.

Does this fix both #4462 and #4463? If so, maybe we should change the commit prefix to fix: and also add an entry to the CHANGELOG?

As discussed in https://github.com/PostgREST/postgrest/pull/4523/changes#r2598153417 - leaving things as they are implemented right now not only affects stability for large schemas but is a DoS vulnerability (an attacker can request random table names to overload the system).
Re-creating the FuzzySet upon each request opens up yet another DoS vulnerability.

It does look like the outcome should fix + patch release. Could we add a test or loadtest for the above?

@mkleczek
Contributor Author

It does look like the outcome should fix + patch release. Could we add a test or loadtest for the above?

@steve-chavez
We have the test case in big schema tests suite. Updates to mixed load tests were reverted upon your request. Not sure what test you have in mind.

@taimoorzaeem
Collaborator

How about a loadtest similar to what I mentioned here: #4523 (comment)?

According to my analysis, a maximum of 500 tables works best for fuzzy search, even with 24-character hashes as table names.

So, maybe the Loadtest (fuzzy-search) should be something like:

  1. Create at most 500 tables with random table names (30-character hashes).
  2. Then send some number of requests (I am not sure what's usual for other load tests), where the requested table names are random (30-character hashes).
  3. Then measure the performance?

There is no reason for it to be 30 characters long, but the problem that occurred in #4463 had names 30 characters long, so maybe we should match that?

"GET /api/resources/c13d05e5-9e55-4d03-bf7e-042a2ade7e49/data/?page_size=1&DATE_DEBUT__sort=desc&P_SOUSCRITE__exact=6 HTTP/1.1"

@steve-chavez @mkleczek WDYT?

@mkleczek
Contributor Author

mkleczek commented Dec 22, 2025

How about a loadtest similar to what I mentioned here: #4523 (comment)?

@steve-chavez @taimoorzaeem

  • The problem with fuzzy search is obviously affecting users and has been well analyzed in several discussions.
  • It is apparent that it is not possible to fix the issue in general, so we don't have any choice other than to provide mitigations.
  • Performance issues in the parts of the code responsible for "not found" error handling are very problematic because they are DoS vulnerabilities; closing these vulnerabilities should be a top priority (TBH they are good candidates for CVEs, which would be a shame because PostgREST has a clean record in CVE databases).
  • This PR is straightforward and obviously improves the situation (while not providing a general solution to fuzzy search, it addresses deficiencies in the current code); at least it closes a DoS vulnerability.

So while I am open to adding some specialized load test for non-existent database objects, I think it is not a necessary condition for merging this PR.

I would say it is the other way around - it should be merged urgently and - what's more - it should be backported to 14.x.

@steve-chavez @taimoorzaeem my 2 cents

@steve-chavez
Member

steve-chavez commented Dec 23, 2025

Added a loadtest in #4576, that should help prove #4462 is solved here.

@mkleczek Could you rebase?

@mkleczek mkleczek force-pushed the dbtables-fuzzyset-cache branch from b2e4c42 to 069064f Compare December 28, 2025 15:55
@mkleczek
Contributor Author

Added a loadtest in #4576, that should help prove #4462 is solved here.

@mkleczek Could you rebase?

Done. See https://github.com/PostgREST/postgrest/actions/runs/20556133512/attempts/1#summary-59040541838

@taimoorzaeem
Collaborator

Done. See https://github.com/PostgREST/postgrest/actions/runs/20556133512/attempts/1#summary-59040541838

From results, this looks like a massive improvement. 🚀

@mkleczek
Contributor Author

Done. See https://github.com/PostgREST/postgrest/actions/runs/20556133512/attempts/1#summary-59040541838

From results, this looks like a massive improvement. 🚀

I haven't yet dug into the test implemented by @steve-chavez, but such a huge speedup might be misleading. We have a 500 limit on the number of tables that we index in the FuzzySet, so possibly we just don't calculate any hint in the test, hence the difference.

@mkleczek
Contributor Author

Done. See https://github.com/PostgREST/postgrest/actions/runs/20556133512/attempts/1#summary-59040541838

From results, this looks like a massive improvement. 🚀

I haven't yet dug into the test implemented by @steve-chavez, but such a huge speedup might be misleading. We have a 500 limit on the number of tables that we index in the FuzzySet, so possibly we just don't calculate any hint in the test, hence the difference.

@steve-chavez @taimoorzaeem

I've run postgrest-loadtest-against -k errors main with the number of tables in errors.sql set to 100 (to make sure we are actually computing the hint), and the speedup is massive as well.

What do you think about changing the number of tables in errors.sql to 490 (instead of current 20000)?

@steve-chavez
Member

From results, this looks like a massive improvement. 🚀

🚀 Although the CPU usage is almost 3x on the errors load test (unlike the mixed one), I wonder why that is?

What do you think about changing the number of tables in errors.sql to 490 (instead of current 20000)?

The problem is that just using 490 doesn't prove the high memory usage reported on #4462. I think we should merge as is and call #4462 solved.

@steve-chavez
Member

The problem is that just using 490 doesn't prove the high memory usage reported on #4462

See the load test result which shows ~356 MB usage on non-head branches vs the head branch which reaches ~77 MB.

@taimoorzaeem
Collaborator

taimoorzaeem commented Jan 1, 2026

Although the CPU usage is almost 3x on the errors load test (unlike the mixed one), I wonder why that is?

Yeah, that's strange. I examined this and I think it's because in the mixed one we also have OPTIONS and a GET on the root endpoint, which make the overall CPU usage look lower. After removing those two cases and re-running the mixed one, I noticed that the CPU usage is actually quite similar to what we have in the errors test.

@steve-chavez
Member

we also have OPTIONS and a GET on the root endpoint, which make the overall CPU usage look lower. After removing those two cases and re-running the mixed one, I noticed that the CPU usage is actually quite similar to what we have in the errors test.

Great find. Noted this for later #4123 (comment).

it should be backported to 14.x.

@mkleczek Agree. Let's turn this into a fix: commit then?

@mkleczek mkleczek force-pushed the dbtables-fuzzyset-cache branch from 069064f to 05375b8 Compare January 2, 2026 09:20
@mkleczek mkleczek changed the title perf: Cache dbTables FuzzySet per schema fix: Performance and high memory usage of relation hint calculation Jan 2, 2026
@mkleczek
Contributor Author

mkleczek commented Jan 2, 2026

it should be backported to 14.x.

@mkleczek Agree. Let's turn this into a fix: commit then?

Done.

Collaborator

@taimoorzaeem taimoorzaeem left a comment


Looks good! We should also add an entry in the CHANGELOG.md.

* Calculation of the hint message when a requested relation is not present in the schema cache requires creation of a FuzzySet (used for fuzzy search to find candidate tables). For schemas with many tables this is costly.
This patch introduces dbTablesFuzzyIndex in SchemaCache to memoize the FuzzySet creation.

* Additionally, because of the FuzzySet's large memory requirements, this patch introduces a limit of 500 relations per schema, above which the FuzzySet is not created and hint calculation is disabled.
@mkleczek mkleczek force-pushed the dbtables-fuzzyset-cache branch from 05375b8 to 82f6db1 Compare January 2, 2026 10:04
@mkleczek
Contributor Author

mkleczek commented Jan 2, 2026

Looks good! We should also add an entry in the CHANGELOG.md.

Done.

@steve-chavez steve-chavez merged commit e592d56 into PostgREST:main Jan 2, 2026
32 checks passed
@postgrest-ci

postgrest-ci bot commented Jan 2, 2026

Successfully created backport PR for v14:

- Fix not returning `Content-Length` on empty HTTP `201` responses by @laurenceisla in #4518
- Fix inaccurate Server-Timing header durations by @steve-chavez in #4522
- Fix inaccurate "Schema cache queried" logs by @steve-chavez in #4522
- Fix performance and high memory usage of relation hint calculation by @mkleczek in #4462 #4463
Member


Was just going to release v14 then realized this changelog entry was put in the wrong spot, messing up our postgrest-release command.

This was already fixed on #4589, but please be mindful of the CHANGELOG placement @mkleczek 🙏


Development

Successfully merging this pull request may close these issues.

ServerDisconnectedError when querying tables created after PostgREST startup since v13+
Continuous memory growth and OOM kills since PostgREST v13+

4 participants