
Cache last written LSN for last updated relations to reduce wait LSN time for queries to other relations #167


Closed
wants to merge 161 commits into from

Conversation

knizhnik

lubennikovaav and others added 30 commits February 9, 2022 19:34
Make smgr API pluggable. Add smgr_hook that can be used to define custom smgrs.
Remove smgrsw[] array and smgr_sw selector. Instead, smgropen() loads
f_smgr implementation using smgr_hook.

Also add smgr_init_hook and smgr_shutdown_hook.
And a lot of mechanical changes in smgr.c functions.

This patch is proposed to community: https://commitfest.postgresql.org/33/3216/

Author: anastasia <lubennikovaav@gmail.com>
Add contrib/zenith that handles interaction with remote pagestore.
To use it add 'shared_preload_library = zenith' to postgresql.conf.

It adds a protocol for network communications - see libpagestore.c;
and implements smgr API.

Also it adds several custom GUC variables:
- zenith.page_server_connstring
- zenith.callmemaybe_connstring
- zenith.zenith_timeline
- zenith.wal_redo

Authors:
Stas Kelvich <stanconn@gmail.com>
Konstantin Knizhnik <knizhnik@garret.ru>
Heikki Linnakangas <heikki.linnakangas@iki.fi>
Add WAL redo helper for zenith - alternative postgres operation mode to replay wal by pageserver request.

To start postgres in wal-redo mode, run postgres with --wal-redo option
It requires zenith shared library and zenith.wal_redo

Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Save lastWrittenPageLSN in XLogCtlData to know what pages to request from remote pageserver.

Authors:
Konstantin Knizhnik <knizhnik@garret.ru>
Heikki Linnakangas <heikki.linnakangas@iki.fi>
In the test_createdb test, we created a new database, and created a new
branch after that. I was seeing the test fail with:

    PANIC:  could not open critical system index 2662

The WAL contained records like this:

    rmgr: XLOG        len (rec/tot):     49/  8241, tx:          0, lsn: 0/0163E8F0, prev 0/0163C8A0, desc: FPI , blkref #0: rel 1663/12985/1249 fork fsm blk 1 FPW
    rmgr: XLOG        len (rec/tot):     49/  8241, tx:          0, lsn: 0/01640940, prev 0/0163E8F0, desc: FPI , blkref #0: rel 1663/12985/1249 fork fsm blk 2 FPW
    rmgr: Standby     len (rec/tot):     54/    54, tx:          0, lsn: 0/01642990, prev 0/01640940, desc: RUNNING_XACTS nextXid 541 latestCompletedXid 539 oldestRunningXid 540; 1 xacts: 540
    rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 0/016429C8, prev 0/01642990, desc: CHECKPOINT_ONLINE redo 0/163C8A0; tli 1; prev tli 1; fpw true; xid 0:541; oid 24576; multi 1; offset 0; oldest xid 532 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 540; online
    rmgr: Database    len (rec/tot):     42/    42, tx:        540, lsn: 0/01642A40, prev 0/016429C8, desc: CREATE copy dir 1663/1 to 1663/16390
    rmgr: Standby     len (rec/tot):     54/    54, tx:          0, lsn: 0/01642A70, prev 0/01642A40, desc: RUNNING_XACTS nextXid 541 latestCompletedXid 539 oldestRunningXid 540; 1 xacts: 540
    rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 0/01642AA8, prev 0/01642A70, desc: CHECKPOINT_ONLINE redo 0/1642A70; tli 1; prev tli 1; fpw true; xid 0:541; oid 24576; multi 1; offset 0; oldest xid 532 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 540; online
    rmgr: Transaction len (rec/tot):     66/    66, tx:        540, lsn: 0/01642B20, prev 0/01642AA8, desc: COMMIT 2021-05-21 15:55:46.363728 EEST; inval msgs: catcache 21; sync
    rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 0/01642B68, prev 0/01642B20, desc: CHECKPOINT_SHUTDOWN redo 0/1642B68; tli 1; prev tli 1; fpw true; xid 0:541; oid 24576; multi 1; offset 0; oldest xid 532 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown

The compute node had correctly replayed all the WAL up to the last
record, and opened up. But when you tried to connect to the new
database, the very first requests for the critical relations, like
pg_class, were made with request LSN 0/01642990. That's the last
record that's applicable to a particular block. Because the database
CREATE record didn't bump up the "last written LSN", the getpage
requests were made with too old LSN.

I fixed this by adding a SetLastWrittenLSN() call to the redo of
database CREATE record. It probably wouldn't hurt to also throw in a
call at the end of WAL replay, but let's see if we bump into more
cases like this first.
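The fix can be sketched roughly as follows. The state layout and function names here are simplified stand-ins for illustration (the real code lives in XLogCtlData and dbase_redo), not the actual patch:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Hypothetical stand-in for the shared "last written LSN" state. */
static XLogRecPtr lastWrittenLSN = InvalidXLogRecPtr;

/* Advance the last written LSN monotonically. */
static void
SetLastWrittenLSN(XLogRecPtr lsn)
{
    if (lsn > lastWrittenLSN)
        lastWrittenLSN = lsn;
}

/* Sketch of the fix: after replaying a database-CREATE record, bump the
 * last written LSN to the record's end so later getpage requests for the
 * new database do not use an LSN older than the copied directory. */
static void
dbase_redo_create(XLogRecPtr record_end_lsn)
{
    /* ... replay of the directory copy happens here ... */
    SetLastWrittenLSN(record_end_lsn);
}
```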

This doesn't seem to be happening with page server as of 'main'; I was
testing with a version where I had temporarily reverted all the recent
changes to reconstruct control file, checkpoints, relmapper files
etc. from the WAL records in the page server, so that the compute node
was redoing all the WAL. I'm pretty sure we need this fix even with
'main', even though this test case wasn't failing there right now.
Some operations in PostgreSQL are not WAL-logged at all (e.g. hint bits)
or delay WAL-logging until the end of the operation (e.g. index build).
So if such a page is evicted before being logged, we will lose the update.

To fix this, we introduce a PD_WAL_LOGGED bit to track whether the page has been WAL-logged.
If the page is evicted before it has been WAL-logged, the zenith smgr creates an FPI for it.

Authors:
Konstantin Knizhnik <knizhnik@garret.ru>
anastasia <lubennikovaav@gmail.com>
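As a rough illustration of the eviction path described in that commit message (the flag value, struct, and function names are hypothetical simplifications, not the actual smgr code):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical page-header flag tracking whether a page was WAL-logged. */
#define PD_WAL_LOGGED 0x0008

typedef struct Page { uint16_t pd_flags; } Page;

static int fpi_emitted = 0;

/* Stand-in for emitting a full-page image (FPI) to the WAL. */
static void
log_full_page_image(Page *page)
{
    fpi_emitted++;
    page->pd_flags |= PD_WAL_LOGGED;
}

/* On eviction, a page that was never WAL-logged (only hint-bit changes,
 * or an index build that defers logging) would otherwise be lost, so the
 * smgr writes an FPI first. */
static void
evict_page(Page *page)
{
    if ((page->pd_flags & PD_WAL_LOGGED) == 0)
        log_full_page_image(page);
    /* ... write the page out / drop it from shared buffers ... */
}
```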
Add WalProposer background worker to broadcast WAL stream to Zenith WAL acceptors

Author: Konstantin Knizhnik <knizhnik@garret.ru>
Ignore unlogged table qualifier. Add respective changes to regression test outputs.

Author: Konstantin Knizhnik <knizhnik@garret.ru>
Request relation size via smgr function, not just stat(filepath).
Author: Konstantin Knizhnik <knizhnik@garret.ru>
Author: Konstantin Knizhnik <knizhnik@garret.ru>
…mmon error. TODO: add a comment, why this is fine for zenith.
…d of WAL page header, then return it back to the page origin
…of WAL at compute node

+ Check for presence of replication slot
…t inside.

WAL proposer (as bgw without BGWORKER_BACKEND_DATABASE_CONNECTION) previously
ignored SetLatch, so once caught up it stuck inside WalProposerPoll infinitely.

Further, WaitEventSetWait didn't have a timeout, so we also didn't try to
reconnect when all connections were dead. Fix that.

Also move break on latch set to the end of the loop to attempt
ReconnectWalKeepers even if latch is constantly set.

Per test_race_conditions (Python version now).
…kpoint from WAL

+ Check for presence of zenith.signal file to allow skip reading checkpoint record from WAL

+ Pass prev_record_ptr through zenith.signal file to postgres
This patch aims to make our bespoke WAL redo machinery more robust
in the presence of untrusted (in other words, possibly malicious) inputs.

Pageserver delegates complex WAL decoding duties to postgres,
which means that the latter might fall victim to carefully designed
malicious WAL records and start doing harmful things to the system.
To prevent this, it has been decided to limit possible interactions
with the outside world using the Secure Computing BPF mode.

We use this mode to disable all syscalls not in the allowlist.
Please refer to src/backend/postmaster/seccomp.c to learn more
about the pros & cons of the current approach.

+ Fix some bugs in seccomp bpf wrapper

* Use SCMP_ACT_TRAP instead of SCMP_ACT_KILL_PROCESS to receive signals.
* Add a missing variant of select() syscall (thx to @knizhnik).
* Write error messages to an fd stderr's currently pointing to.
…ause it cause memory leak in wal-redo-postgres

2. Add check for local relations to make it possible to use DEBUG_COMPARE_LOCAL mode in SMGR

+ Call smgr_init_standard from smgr_init_zenith
This patch adds support for the zenith_tenant variable. It has a similar
format to zenith_timeline. It is used in the callmemaybe query to pass the
tenant to the pageserver, and in the ServerInfo structure passed to the WAL acceptor.
…recovery.

Rust's postgres_backend is currently too simplistic to handle it properly: reading
happens in a separate thread which just ignores CopyDone. Instead, the writer thread
must become aware of termination and send CommandComplete. Also, the reading socket must
be transferred back to postgres_backend (or the connection terminated completely
after COPY). Let's do that after more basic safekeeper refactoring, and for now
cover this up to make tests pass.

ref #388
…ion position in wal_proppser to segment boundary
…ugging.

Now it contains only one function test_consume_xids() for xid wraparound testing.
@lubennikovaav

lubennikovaav commented May 31, 2022

pgbench -i -s 1 fails with assertion on this branch:

TRAP: FailedAssertion("lsn != InvalidXLogRecPtr", File: "/home/anastasia/zenith/tmp_install/build/../../vendor/postgres/contrib/neon/pagestore_smgr.c", Line: 635, PID: 219368)

I think we must also use forknum as a part of the key.

UPD: pushed commit with fix

@lubennikovaav lubennikovaav force-pushed the last_written_lsn_cache branch from 28a64eb to c9d8ec7 Compare May 31, 2022 11:35
@@ -607,6 +610,11 @@ typedef struct XLogCtlInsert
WALInsertLockPadded *WALInsertLocks;
} XLogCtlInsert;

typedef struct RnodeForkKey {
Oid rnode;


let's add a comment about why this is safe to use only rnode+forknum as a key

@@ -1345,7 +1345,7 @@ zenith_dbsize(Oid dbNode)
XLogRecPtr request_lsn;
bool latest;

request_lsn = zenith_get_request_lsn(&latest);
request_lsn = zenith_get_request_lsn(&latest, NULL, InvalidForkNumber);


zenith_get_request_lsn is not ready to accept NULL argument:

GetLastWrittenPageLSN(rnode->relNode

@@ -8828,11 +8872,41 @@ GetLastWrittenPageLSN(void)
* SetLastWrittenPageLSN -- Set maximal LSN of written page


please expand this comment to mention caching

@knizhnik
Author

I think we must also use forknum as a part of the key.

Sorry, this fix is not correct.
This cache does not need to be precise, because it stores the maximum LSN.
This is why it is not possible to take forknum, dbnode, and spcnode into account:
even if several relations match a cache entry, we still maintain max(lsn) for them,
and there is no difference from the single lastWrittenPageLsn we had before.

Let me investigate why pgbench -i -s 1 fails. Maybe we should check for a valid LSN in SetLastWrittenPageLsn().
But in any case your fix is not compatible with my last optimization.
Either the key uniquely identifies a relation, and so contains all four components (spcnode, dbnode, relnode, forknum),
or it is a partial key, and then forknum is not needed either.

@lubennikovaav

But in any case you fix is not compatible with my last optimization.

I'm not sure I understand the problem. Does my "fix" just add an unnecessary part to the key, or is it incompatible in some other way?

@knizhnik
Author

My "fix" just adds an unnecessary part to the key or it is incompatible in some other way?

A relation is identified by four components: spcnode, dbnode, relnode, forknum.
Originally I used only the first three (RelFileNode) as the key, so there was no one-to-one mapping
between keys and relations.
Then I decided to use only relnode as the key: the probability of a collision is assumed to be very small and non-critical.
Then you added forknum. But (relnode, forknum) is not a unique key! That is why I said your fix makes no sense.

The cause of the problem is that we sometimes pass lsn=0 to SetLastWrittenLsn.
Before this PR that didn't cause any problem, because we always assigned the maximal value.
But now we can associate InvalidXLogRecPtr with a relation, and GetLastValidLsn may then return it, which causes the assertion failure in zenith_get_request_lsn. It is enough to just add a check for InvalidXLogRecPtr in SetLastWrittenPageLsn.
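The suggested guard could look roughly like this; the toy single-entry cache and the exact names are illustrative assumptions, not the real patch:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Toy cache entry: last written LSN for one relation key. */
static XLogRecPtr cached_lsn = InvalidXLogRecPtr;

/* Sketch of the proposed guard: ignore invalid LSNs so the cache never
 * hands InvalidXLogRecPtr back to zenith_get_request_lsn. */
static void
SetLastWrittenPageLsn(XLogRecPtr lsn)
{
    if (lsn == InvalidXLogRecPtr)
        return;                 /* callers sometimes pass 0; skip it */
    if (lsn > cached_lsn)
        cached_lsn = lsn;
}
```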

@aome510

aome510 commented Jun 17, 2022

One similar idea I had is to have e.g. 100 last-written-LSN values in a hash table. When you evict a page, you calculate a hash of the BufferTag of the page you're evicting, and store the LSN at that slot in the hash table. When reading, similarly calculate the hash of the BufferTag, and use the LSN at that slot. There would be some "false sharing", but that's OK. However, it wouldn't be so efficient for the specific problem you're addressing. If you do a large COPY, it would very quickly update all the LSNs in the hash table, as it modifies a lot of pages.

@hlinnaka idea is similar to what Socrates does to estimate the latest LSN for a page. Quote from section 4.4:

Ideally, X-LSN would be the most recent page LSN for Page X. However, the Primary cannot remember the LSNs of all pages it has evicted (essentially the whole database). Instead, the Primary builds a hash map (on pageId) which stores in each bucket the highest LSN for every page evicted from the Primary keyed by pageId. Given that Page X was evicted at some point from the Primary, this mechanism will guarantee to give an X-LSN value that is at least as large as the largest LSN for Page X and is, thus, safe.

It's true that this approach may not work well with bulk loading if a small bucket count is used for the hash map. We can probably tweak the hash map to key by relation_id instead of page_id, so it is more suitable for the bulk-loading case.

IMHO, for the bulk-loading case, using page_id may not be that bad, because the hash map only stores LSNs of evicted pages. If the shared buffer is large enough, the difference between the maximum LSN of an evicted page and the last record LSN processed by the page server should be small, resulting in a short wait time.
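A minimal sketch of the evicted-page LSN hash map discussed above (the slot count, hash function, and names are illustrative assumptions). False sharing between pages that hash to the same slot only ever over-estimates the LSN, which is safe:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

#define NUM_SLOTS 100

/* One LSN slot per hash bucket. */
static XLogRecPtr lsn_slots[NUM_SLOTS];

/* Hypothetical stand-in for hashing a BufferTag (relation + block). */
static unsigned
buffer_tag_hash(uint32_t relnode, uint32_t blkno)
{
    return (relnode * 2654435761u ^ blkno) % NUM_SLOTS;
}

/* On eviction: remember the highest LSN seen for this slot. */
static void
remember_evicted_lsn(uint32_t relnode, uint32_t blkno, XLogRecPtr lsn)
{
    unsigned slot = buffer_tag_hash(relnode, blkno);
    if (lsn > lsn_slots[slot])
        lsn_slots[slot] = lsn;
}

/* On read: the slot's LSN is at least as large as this page's true LSN. */
static XLogRecPtr
lookup_request_lsn(uint32_t relnode, uint32_t blkno)
{
    return lsn_slots[buffer_tag_hash(relnode, blkno)];
}
```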

@knizhnik
Author

@hlinnaka idea is similar to what Socrates does to estimate the latest LSN for a page.

My main concern is the synchronization needed to access this shared cache.
Certainly, we can use the standard postgres approach with hash partitioning (where each partition is synchronized by its own RW-lock). But I am not sure that it will completely eliminate the problem.

Also, I think that using PageId as the key is not such a good idea. In case of random updates (like in pgbench), there will be a lot of cache collisions (i.e. all buckets will have recent LSNs). Maybe it is better to split the relation into larger chunks (say 1 MB or even more) and maintain the latest LSN for each chunk? It also will not help much in case of random updates, but it will be more efficient when appending data to multiple relations.

@hlinnaka - what do you think? Should I try to implement such a more sophisticated cache of recent LSNs?
Or is the current solution, with a very small cache keyed by relation_id, enough for now?
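The chunked keying idea could be sketched like this, assuming 1 MB chunks and the standard 8 KB block size (the struct and names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

#define BLCKSZ 8192
#define CHUNK_BYTES (1024 * 1024)              /* 1 MB chunks, as suggested */
#define BLOCKS_PER_CHUNK (CHUNK_BYTES / BLCKSZ)

/* Key for the LSN cache: relation plus coarse chunk number. Appends to one
 * relation then only touch a handful of entries, while other relations keep
 * their older (and therefore cheaper to wait for) LSNs. */
typedef struct ChunkKey
{
    uint32_t relnode;
    uint32_t chunkno;
} ChunkKey;

static ChunkKey
make_chunk_key(uint32_t relnode, uint32_t blkno)
{
    ChunkKey key = { relnode, blkno / BLOCKS_PER_CHUNK };
    return key;
}
```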

@hlinnaka
Contributor

@hlinnaka idea is similar to what Socrates does to estimate the latest LSN for a page.

My main concern is the synchronization needed to access this shared cache. Certainly, we can use the standard postgres approach with hash partitioning (where each partition is synchronized by its own RW-lock). But I am not sure that it will completely eliminate the problem.

I think that's good enough. The synchronization overhead is tiny compared to sending a request over the network.

Also, I think that using PageId as the key is not such a good idea. In case of random updates (like in pgbench), there will be a lot of cache collisions (i.e. all buckets will have recent LSNs). Maybe it is better to split the relation into larger chunks (say 1 MB or even more) and maintain the latest LSN for each chunk? It also will not help much in case of random updates, but it will be more efficient when appending data to multiple relations.

Yeah, I think that's worth exploring.

If we have a more fine-grained cache with page-ids or chunks, can we use it for smgrnblocks() and smgrexists() calls too? We also have a relation size cache, so maybe

@hlinnaka - what do you think? Should I try to implement such a more sophisticated cache of recent LSNs? Or is the current solution, with a very small cache keyed by relation_id, enough for now?

The small per-relation cache certainly helps, but it's very easy to overflow it...

@aome510

aome510 commented Jun 20, 2022

So I tried to run this patch with the new WAL backpressure tests

Results

latest main (https://app.circleci.com/pipelines/github/neondatabase/neon/7285/workflows/e18911a4-4cf7-43ee-863b-2d6ec11540db/jobs/74648):

test_heavy_write_workload[vanilla-10-5-5].run_duration: 37.701 s
test_heavy_write_workload[vanilla-10-5-5].read_latency_max: 0.021 s
test_heavy_write_workload[vanilla-10-5-5].read_latency_avg: 0.001 s
test_heavy_write_workload[vanilla-10-5-5].read_latency_stdev: 0.004 s
test_heavy_write_workload[neon_off-10-5-5].run_duration: 227.946 s
test_heavy_write_workload[neon_off-10-5-5].lsn_write_lag_max: 4 MB
test_heavy_write_workload[neon_off-10-5-5].lsn_write_lag_avg: 1 MB
test_heavy_write_workload[neon_off-10-5-5].lsn_write_lag_stdev: 1 MB
test_heavy_write_workload[neon_off-10-5-5].read_latency_max: 0.206 s
test_heavy_write_workload[neon_off-10-5-5].read_latency_avg: 0.037 s
test_heavy_write_workload[neon_off-10-5-5].read_latency_stdev: 0.030 s
test_heavy_write_workload[neon_on-10-5-5].run_duration: 250.031 s
test_heavy_write_workload[neon_on-10-5-5].lsn_write_lag_max: 4 MB
test_heavy_write_workload[neon_on-10-5-5].lsn_write_lag_avg: 1 MB
test_heavy_write_workload[neon_on-10-5-5].lsn_write_lag_stdev: 1 MB
test_heavy_write_workload[neon_on-10-5-5].read_latency_max: 0.500 s
test_heavy_write_workload[neon_on-10-5-5].read_latency_avg: 0.048 s
test_heavy_write_workload[neon_on-10-5-5].read_latency_stdev: 0.063 s
test_pgbench_simple_update_workload[vanilla-45-100].run_duration: 45.447 s
test_pgbench_simple_update_workload[vanilla-45-100].read_latency_max: 0.057 s
test_pgbench_simple_update_workload[vanilla-45-100].read_latency_avg: 0.002 s
test_pgbench_simple_update_workload[vanilla-45-100].read_latency_stdev: 0.010 s
test_pgbench_simple_update_workload[neon_off-45-100].run_duration: 46.292 s
test_pgbench_simple_update_workload[neon_off-45-100].lsn_write_lag_max: 0 MB
test_pgbench_simple_update_workload[neon_off-45-100].lsn_write_lag_avg: 0 MB
test_pgbench_simple_update_workload[neon_off-45-100].lsn_write_lag_stdev: 0 MB
test_pgbench_simple_update_workload[neon_off-45-100].read_latency_max: 0.040 s
test_pgbench_simple_update_workload[neon_off-45-100].read_latency_avg: 0.004 s
test_pgbench_simple_update_workload[neon_off-45-100].read_latency_stdev: 0.007 s
test_pgbench_simple_update_workload[neon_on-45-100].run_duration: 53.027 s
test_pgbench_simple_update_workload[neon_on-45-100].lsn_write_lag_max: 1 MB
test_pgbench_simple_update_workload[neon_on-45-100].lsn_write_lag_avg: 0 MB
test_pgbench_simple_update_workload[neon_on-45-100].lsn_write_lag_stdev: 0 MB
test_pgbench_simple_update_workload[neon_on-45-100].read_latency_max: 0.059 s
test_pgbench_simple_update_workload[neon_on-45-100].read_latency_avg: 0.005 s
test_pgbench_simple_update_workload[neon_on-45-100].read_latency_stdev: 0.009 s
test_pgbench_intensive_init_workload[vanilla-1000].run_duration: 128.655 s
test_pgbench_intensive_init_workload[vanilla-1000].read_latency_max: 0.017 s
test_pgbench_intensive_init_workload[vanilla-1000].read_latency_avg: 0.008 s
test_pgbench_intensive_init_workload[vanilla-1000].read_latency_stdev: 0.003 s
test_pgbench_intensive_init_workload[neon_off-1000].run_duration: 59.479 s
test_pgbench_intensive_init_workload[neon_off-1000].lsn_write_lag_max: 9 MB
test_pgbench_intensive_init_workload[neon_off-1000].lsn_write_lag_avg: 2 MB
test_pgbench_intensive_init_workload[neon_off-1000].lsn_write_lag_stdev: 2 MB
test_pgbench_intensive_init_workload[neon_off-1000].read_latency_max: 2.030 s
test_pgbench_intensive_init_workload[neon_off-1000].read_latency_avg: 0.775 s
test_pgbench_intensive_init_workload[neon_off-1000].read_latency_stdev: 0.477 s
test_pgbench_intensive_init_workload[neon_on-1000].run_duration: 141.701 s
test_pgbench_intensive_init_workload[neon_on-1000].lsn_write_lag_max: 15 MB
test_pgbench_intensive_init_workload[neon_on-1000].lsn_write_lag_avg: 13 MB
test_pgbench_intensive_init_workload[neon_on-1000].lsn_write_lag_stdev: 4 MB
test_pgbench_intensive_init_workload[neon_on-1000].read_latency_max: 27.029 s
test_pgbench_intensive_init_workload[neon_on-1000].read_latency_avg: 19.359 s
test_pgbench_intensive_init_workload[neon_on-1000].read_latency_stdev: 9.079 s

this patch + @hlinnaka's #175 (https://app.circleci.com/pipelines/github/neondatabase/neon/7286/workflows/9cba7367-a9eb-4a26-82bc-844e0bb7e9e5/jobs/74656):

test_heavy_write_workload[vanilla-10-5-5].run_duration: 39.358 s
test_heavy_write_workload[vanilla-10-5-5].read_latency_max: 0.006 s
test_heavy_write_workload[vanilla-10-5-5].read_latency_avg: 0.001 s
test_heavy_write_workload[vanilla-10-5-5].read_latency_stdev: 0.001 s
test_heavy_write_workload[neon_off-10-5-5].run_duration: 173.292 s
test_heavy_write_workload[neon_off-10-5-5].read_latency_max: 1.172 s
test_heavy_write_workload[neon_off-10-5-5].read_latency_avg: 0.401 s
test_heavy_write_workload[neon_off-10-5-5].read_latency_stdev: 0.341 s
test_heavy_write_workload[neon_off-10-5-5].lsn_write_lag_max: 17 MB
test_heavy_write_workload[neon_off-10-5-5].lsn_write_lag_avg: 15 MB
test_heavy_write_workload[neon_off-10-5-5].lsn_write_lag_stdev: 3 MB
test_heavy_write_workload[neon_on-10-5-5].run_duration: 188.862 s
test_heavy_write_workload[neon_on-10-5-5].lsn_write_lag_max: 17 MB
test_heavy_write_workload[neon_on-10-5-5].lsn_write_lag_avg: 14 MB
test_heavy_write_workload[neon_on-10-5-5].lsn_write_lag_stdev: 3 MB
test_heavy_write_workload[neon_on-10-5-5].read_latency_max: 1.304 s
test_heavy_write_workload[neon_on-10-5-5].read_latency_avg: 0.540 s
test_heavy_write_workload[neon_on-10-5-5].read_latency_stdev: 0.348 s
test_pgbench_simple_update_workload[vanilla-45-100].run_duration: 45.135 s
test_pgbench_simple_update_workload[vanilla-45-100].read_latency_max: 0.050 s
test_pgbench_simple_update_workload[vanilla-45-100].read_latency_avg: 0.002 s
test_pgbench_simple_update_workload[vanilla-45-100].read_latency_stdev: 0.009 s
test_pgbench_simple_update_workload[neon_off-45-100].run_duration: 46.109 s
test_pgbench_simple_update_workload[neon_off-45-100].read_latency_max: 0.027 s
test_pgbench_simple_update_workload[neon_off-45-100].read_latency_avg: 0.003 s
test_pgbench_simple_update_workload[neon_off-45-100].read_latency_stdev: 0.005 s
test_pgbench_simple_update_workload[neon_off-45-100].lsn_write_lag_max: 1 MB
test_pgbench_simple_update_workload[neon_off-45-100].lsn_write_lag_avg: 0 MB
test_pgbench_simple_update_workload[neon_off-45-100].lsn_write_lag_stdev: 0 MB
test_pgbench_simple_update_workload[neon_on-45-100].run_duration: 45.501 s
test_pgbench_simple_update_workload[neon_on-45-100].lsn_write_lag_max: 1 MB
test_pgbench_simple_update_workload[neon_on-45-100].lsn_write_lag_avg: 0 MB
test_pgbench_simple_update_workload[neon_on-45-100].lsn_write_lag_stdev: 0 MB
test_pgbench_simple_update_workload[neon_on-45-100].read_latency_max: 0.065 s
test_pgbench_simple_update_workload[neon_on-45-100].read_latency_avg: 0.005 s
test_pgbench_simple_update_workload[neon_on-45-100].read_latency_stdev: 0.011 s
test_pgbench_intensive_init_workload[vanilla-1000].run_duration: 99.138 s
test_pgbench_intensive_init_workload[vanilla-1000].read_latency_max: 0.013 s
test_pgbench_intensive_init_workload[vanilla-1000].read_latency_avg: 0.007 s
test_pgbench_intensive_init_workload[vanilla-1000].read_latency_stdev: 0.001 s
test_pgbench_intensive_init_workload[neon_off-1000].run_duration: 56.939 s
test_pgbench_intensive_init_workload[neon_off-1000].read_latency_max: 0.763 s
test_pgbench_intensive_init_workload[neon_off-1000].read_latency_avg: 0.149 s
test_pgbench_intensive_init_workload[neon_off-1000].read_latency_stdev: 0.136 s
test_pgbench_intensive_init_workload[neon_off-1000].lsn_write_lag_max: 6 MB
test_pgbench_intensive_init_workload[neon_off-1000].lsn_write_lag_avg: 2 MB
test_pgbench_intensive_init_workload[neon_off-1000].lsn_write_lag_stdev: 1 MB
test_pgbench_intensive_init_workload[neon_on-1000].run_duration: 87.980 s
test_pgbench_intensive_init_workload[neon_on-1000].read_latency_max: 1.153 s
test_pgbench_intensive_init_workload[neon_on-1000].read_latency_avg: 0.210 s
test_pgbench_intensive_init_workload[neon_on-1000].read_latency_stdev: 0.282 s
test_pgbench_intensive_init_workload[neon_on-1000].lsn_write_lag_max: 15 MB
test_pgbench_intensive_init_workload[neon_on-1000].lsn_write_lag_avg: 10 MB
test_pgbench_intensive_init_workload[neon_on-1000].lsn_write_lag_stdev: 4 MB

Differences (baseline is this patch + Heikki's patch):

key: test_heavy_write_workload[neon_off-10-5-5].lsn_write_lag_avg, diff: 14.00, diff_percentage: 1400.00%
key: test_heavy_write_workload[neon_off-10-5-5].lsn_write_lag_max, diff: 13.00, diff_percentage: 325.00%
key: test_heavy_write_workload[neon_off-10-5-5].lsn_write_lag_stdev, diff: 2.00, diff_percentage: 200.00%
key: test_heavy_write_workload[neon_off-10-5-5].read_latency_avg, diff: 0.36, diff_percentage: 983.78%
key: test_heavy_write_workload[neon_off-10-5-5].read_latency_max, diff: 0.97, diff_percentage: 468.93%
key: test_heavy_write_workload[neon_off-10-5-5].read_latency_stdev, diff: 0.31, diff_percentage: 1036.67%
key: test_heavy_write_workload[neon_off-10-5-5].run_duration, diff: -54.65, diff_percentage: -23.98%
key: test_heavy_write_workload[neon_on-10-5-5].lsn_write_lag_avg, diff: 13.00, diff_percentage: 1300.00%
key: test_heavy_write_workload[neon_on-10-5-5].lsn_write_lag_max, diff: 13.00, diff_percentage: 325.00%
key: test_heavy_write_workload[neon_on-10-5-5].lsn_write_lag_stdev, diff: 2.00, diff_percentage: 200.00%
key: test_heavy_write_workload[neon_on-10-5-5].read_latency_avg, diff: 0.49, diff_percentage: 1025.00%
key: test_heavy_write_workload[neon_on-10-5-5].read_latency_max, diff: 0.80, diff_percentage: 160.80%
key: test_heavy_write_workload[neon_on-10-5-5].read_latency_stdev, diff: 0.28, diff_percentage: 452.38%
key: test_heavy_write_workload[neon_on-10-5-5].run_duration, diff: -61.17, diff_percentage: -24.46%
key: test_heavy_write_workload[vanilla-10-5-5].read_latency_avg, diff: 0.00, diff_percentage: 0.00%
key: test_heavy_write_workload[vanilla-10-5-5].read_latency_max, diff: -0.02, diff_percentage: -71.43%
key: test_heavy_write_workload[vanilla-10-5-5].read_latency_stdev, diff: -0.00, diff_percentage: -75.00%
key: test_heavy_write_workload[vanilla-10-5-5].run_duration, diff: 1.66, diff_percentage: 4.40%
key: test_pgbench_simple_update_workload[neon_off-45-100].read_latency_avg, diff: -0.00, diff_percentage: -25.00%
key: test_pgbench_simple_update_workload[neon_off-45-100].read_latency_max, diff: -0.01, diff_percentage: -32.50%
key: test_pgbench_simple_update_workload[neon_off-45-100].read_latency_stdev, diff: -0.00, diff_percentage: -28.57%
key: test_pgbench_simple_update_workload[neon_off-45-100].run_duration, diff: -0.18, diff_percentage: -0.40%
key: test_pgbench_simple_update_workload[neon_on-45-100].lsn_write_lag_max, diff: 0.00, diff_percentage: 0.00%
key: test_pgbench_simple_update_workload[neon_on-45-100].read_latency_avg, diff: 0.00, diff_percentage: 0.00%
key: test_pgbench_simple_update_workload[neon_on-45-100].read_latency_max, diff: 0.01, diff_percentage: 10.17%
key: test_pgbench_simple_update_workload[neon_on-45-100].read_latency_stdev, diff: 0.00, diff_percentage: 22.22%
key: test_pgbench_simple_update_workload[neon_on-45-100].run_duration, diff: -7.53, diff_percentage: -14.19%
key: test_pgbench_simple_update_workload[vanilla-45-100].read_latency_avg, diff: 0.00, diff_percentage: 0.00%
key: test_pgbench_simple_update_workload[vanilla-45-100].read_latency_max, diff: -0.01, diff_percentage: -12.28%
key: test_pgbench_simple_update_workload[vanilla-45-100].read_latency_stdev, diff: -0.00, diff_percentage: -10.00%
key: test_pgbench_simple_update_workload[vanilla-45-100].run_duration, diff: -0.31, diff_percentage: -0.69%
key: test_pgbench_intensive_init_workload[neon_off-1000].lsn_write_lag_avg, diff: 0.00, diff_percentage: 0.00%
key: test_pgbench_intensive_init_workload[neon_off-1000].lsn_write_lag_max, diff: -3.00, diff_percentage: -33.33%
key: test_pgbench_intensive_init_workload[neon_off-1000].lsn_write_lag_stdev, diff: -1.00, diff_percentage: -50.00%
key: test_pgbench_intensive_init_workload[neon_off-1000].read_latency_avg, diff: -0.63, diff_percentage: -80.77%
key: test_pgbench_intensive_init_workload[neon_off-1000].read_latency_max, diff: -1.27, diff_percentage: -62.41%
key: test_pgbench_intensive_init_workload[neon_off-1000].read_latency_stdev, diff: -0.34, diff_percentage: -71.49%
key: test_pgbench_intensive_init_workload[neon_off-1000].run_duration, diff: -2.54, diff_percentage: -4.27%
key: test_pgbench_intensive_init_workload[neon_on-1000].lsn_write_lag_avg, diff: -3.00, diff_percentage: -23.08%
key: test_pgbench_intensive_init_workload[neon_on-1000].lsn_write_lag_max, diff: 0.00, diff_percentage: 0.00%
key: test_pgbench_intensive_init_workload[neon_on-1000].lsn_write_lag_stdev, diff: 0.00, diff_percentage: 0.00%
key: test_pgbench_intensive_init_workload[neon_on-1000].read_latency_avg, diff: -19.15, diff_percentage: -98.92%
key: test_pgbench_intensive_init_workload[neon_on-1000].read_latency_max, diff: -25.88, diff_percentage: -95.73%
key: test_pgbench_intensive_init_workload[neon_on-1000].read_latency_stdev, diff: -8.80, diff_percentage: -96.89%
key: test_pgbench_intensive_init_workload[neon_on-1000].run_duration, diff: -53.72, diff_percentage: -37.91%
key: test_pgbench_intensive_init_workload[vanilla-1000].read_latency_avg, diff: -0.00, diff_percentage: -12.50%
key: test_pgbench_intensive_init_workload[vanilla-1000].read_latency_max, diff: -0.00, diff_percentage: -23.53%
key: test_pgbench_intensive_init_workload[vanilla-1000].read_latency_stdev, diff: -0.00, diff_percentage: -66.67%
key: test_pgbench_intensive_init_workload[vanilla-1000].run_duration, diff: -29.52, diff_percentage: -22.94%

Notes

  • in general, at least based on the new tests, this patch significantly improves the read latency.
  • notice that there is a big difference in the run duration of test_pgbench_intensive_init_workload[neon_on-1000], test_heavy_write_workload[neon-*] tests between two runs, which is probably related to Heikki's comment

With this workload, postgres will call smgrexists() on the FSM fork repeatedly, whenever it needs to extend the index. That again causes the backend to wait for the pageserver to catch up, draining any lag. (I noticed this by looking at pg_stat_activity, saw the backends blocked on the "Extension" wait event, and then I connected to the backend process with 'gdb' to see what exactly it was doing)

  • in test_pgbench_intensive_init_workload, even though the read relation is never in the cache, the query LSN still lags behind the latest written LSN as LSNs of hot relations are stored in the cache, so the overall read latency is not that big.

Co-authored-by: Thang Pham <phamducthang1234@gmail.com>
@knizhnik
Author

in test_pgbench_intensive_init_workload, even though the read relation is never in the cache, the query LSN still lags behind the latest written LSN as LSNs of hot relations are stored in the cache, so the overall read latency is not that big.

Please notice that this LSN cache works not only when the relation is present in the cache (cache hit).
Almost the opposite: this cache speeds up read-only queries that access other relations (whose LSNs are not cached). The cache allows us to postpone updates of lastWrittenPageNonCacheLsn, and so use an older LSN when retrieving pages of those relations from the server. Certainly, it can only work if the number of updated relations is relatively small, but there can be an arbitrary number of read-only relations.
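A minimal sketch of this lookup behavior, with a toy direct-mapped cache standing in for the real one (all names and sizes are illustrative assumptions):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

#define CACHE_SIZE 4

/* Tiny direct-mapped stand-in for the per-relation LSN cache. */
static uint32_t   cache_rel[CACHE_SIZE];
static XLogRecPtr cache_lsn[CACHE_SIZE];

/* LSN to use for relations that are NOT in the cache; it lags behind the
 * hot relations' LSNs, which is what keeps read-only queries cheap. */
static XLogRecPtr lastWrittenPageNonCacheLsn = 10;

static XLogRecPtr
GetLastWrittenPageLSN(uint32_t relnode)
{
    unsigned slot = relnode % CACHE_SIZE;
    if (cache_rel[slot] == relnode)
        return cache_lsn[slot];               /* hot relation: recent LSN */
    return lastWrittenPageNonCacheLsn;        /* cold relation: older LSN */
}
```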

@knizhnik
Author

I think that's good enough. The synchronization overhead is tiny compared to sending a request over the network.

This LSN cache is used not only in get_page_at_lsn, when a page is requested from the page server,
but also when a dirty page is written to disk. That also should not happen frequently, so I have to agree: the synchronization overhead should not be noticeable here.

Yeah, I think that's worth exploring. The small per-relation cache certainly helps, but it's very easy to overflow it...

I will try.

@knizhnik
Author

If we have a more fine-grained cache with page-ids or chunks, can we use it for smgrnblocks() and smgrexists() calls too? We also have a relation size cache, so maybe

Not sure that the same cache can be used both for caching relation sizes and keeping the last written LSN.
First of all, how can we locate the last page? A plain hash table cannot do it.
Also, what happens if the cache size is limited and the last page was evicted from the cache, but some other page was not?
It seems more useful and realistic to try to combine the last-written-LSN cache with the relation size cache.

@knizhnik
Author

I have created another PR #177 with the last written LSN maintained for each relation chunk (8 MB).

@knizhnik
Author

Replaced with #191

@knizhnik knizhnik closed this Aug 11, 2022